{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "ur8xi4C7S06n"
},
"outputs": [],
"source": [
"# Copyright 2025 Google LLC\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TirJ-SGQseby"
},
"source": [
"# Vertex AI Model Garden TFVision With Image Classification\n",
"\n",
"<table><tbody><tr>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fvertex-ai-samples%2Fmain%2Fnotebooks%2Fcommunity%2Fmodel_garden%2Fmodel_garden_tfvision_image_classification.ipynb\">\n",
" <img alt=\"Google Cloud Colab Enterprise logo\" src=\"https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN\" width=\"32px\"><br> Run in Colab Enterprise\n",
" </a>\n",
" </td>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/model_garden/model_garden_tfvision_image_classification.ipynb\">\n",
" <img alt=\"GitHub logo\" src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" width=\"32px\"><br> View on GitHub\n",
" </a>\n",
" </td>\n",
"</tr></tbody></table>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "tvgnzT1CKxrO"
},
"source": [
"## Overview\n",
"\n",
"This notebook demonstrates how to use [TFVision](https://github.com/tensorflow/models/blob/master/official/vision/MODEL_GARDEN.md) in Vertex AI Model Garden.\n",
"\n",
"### Objective\n",
"\n",
"* Train new models\n",
" * Convert input data to training formats\n",
" * Create [hyperparameter tuning jobs](https://cloud.google.com/vertex-ai/docs/training/hyperparameter-tuning-overview) to train new models\n",
" * Find and export best models\n",
"\n",
"* Test trained models\n",
" * Upload models to model registry\n",
" * Deploy uploaded models\n",
" * Run predictions\n",
"\n",
"* Cleanup resources\n",
"\n",
"### Costs\n",
"\n",
"This tutorial uses billable components of Google Cloud:\n",
"\n",
"* Vertex AI\n",
"* Cloud Storage\n",
"\n",
"Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing), [Cloud Storage pricing](https://cloud.google.com/storage/pricing), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "KEukV6uRk_S3"
},
"source": [
"## Before you begin"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "Jvqs-ehKlaYh"
},
"outputs": [],
"source": [
"# @title Setup Google Cloud project\n",
"\n",
"# @markdown 1. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).\n",
"\n",
"# @markdown 2. **[Optional]** [Create a Cloud Storage bucket](https://cloud.google.com/storage/docs/creating-buckets) for storing experiment outputs. Set the BUCKET_URI for the experiment environment. The specified Cloud Storage bucket (`BUCKET_URI`) should be located in the same region as where the notebook was launched. Note that a multi-region bucket (eg. \"us\") is not considered a match for a single region covered by the multi-region range (eg. \"us-central1\"). If not set, a unique GCS bucket will be created instead.\n",
"\n",
"BUCKET_URI = \"gs://\" # @param {type:\"string\"}\n",
"\n",
"# @markdown 3. **[Optional]** Set region. If not set, the region will be set automatically according to Colab Enterprise environment.\n",
"\n",
"REGION = \"\" # @param {type:\"string\"}\n",
"\n",
"# @markdown 4. If you want to run predictions with A100 80GB or H100 GPUs, we recommend using the regions listed below. **NOTE:** Make sure you have associated quota in selected regions. Click the links to see your current quota for each GPU type: [Nvidia A100 80GB](https://console.cloud.google.com/iam-admin/quotas?metric=aiplatform.googleapis.com%2Fcustom_model_serving_nvidia_a100_80gb_gpus), [Nvidia H100 80GB](https://console.cloud.google.com/iam-admin/quotas?metric=aiplatform.googleapis.com%2Fcustom_model_serving_nvidia_h100_gpus). You can request for quota following the instructions at [\"Request a higher quota\"](https://cloud.google.com/docs/quota/view-manage#requesting_higher_quota).\n",
"\n",
"# @markdown > | Machine Type | Accelerator Type | Recommended Regions |\n",
"# @markdown | ----------- | ----------- | ----------- |\n",
"# @markdown | a2-ultragpu-1g | 1 NVIDIA_A100_80GB | us-central1, us-east4, europe-west4, asia-southeast1, us-east4 |\n",
"# @markdown | a3-highgpu-2g | 2 NVIDIA_H100_80GB | us-west1, asia-southeast1, europe-west4 |\n",
"# @markdown | a3-highgpu-4g | 4 NVIDIA_H100_80GB | us-west1, asia-southeast1, europe-west4 |\n",
"# @markdown | a3-highgpu-8g | 8 NVIDIA_H100_80GB | us-central1, europe-west4, us-west1, asia-southeast1 |\n",
"\n",
"! git clone https://github.com/GoogleCloudPlatform/vertex-ai-samples.git\n",
"\n",
"import base64\n",
"import datetime\n",
"import importlib\n",
"import io\n",
"import json\n",
"import os\n",
"import subprocess\n",
"import uuid\n",
"from typing import Any, Dict, List, Union\n",
"\n",
"import yaml\n",
"from google.cloud import aiplatform\n",
"from google.protobuf import json_format\n",
"from google.protobuf.struct_pb2 import Value\n",
"\n",
"common_util = importlib.import_module(\n",
" \"vertex-ai-samples.community-content.vertex_model_garden.model_oss.notebook_util.common_util\"\n",
")\n",
"\n",
"models, endpoints = {}, {}\n",
"\n",
"\n",
"# Get the default cloud project id.\n",
"PROJECT_ID = os.environ[\"GOOGLE_CLOUD_PROJECT\"]\n",
"\n",
"# Get the default region for launching jobs.\n",
"if not REGION:\n",
" if not os.environ.get(\"GOOGLE_CLOUD_REGION\"):\n",
" raise ValueError(\n",
" \"REGION must be set. See\"\n",
" \" https://cloud.google.com/vertex-ai/docs/general/locations for\"\n",
" \" available cloud locations.\"\n",
" )\n",
" REGION = os.environ[\"GOOGLE_CLOUD_REGION\"]\n",
"\n",
"# Enable the Vertex AI API and Compute Engine API, if not already.\n",
"print(\"Enabling Vertex AI API and Compute Engine API.\")\n",
"! gcloud services enable aiplatform.googleapis.com compute.googleapis.com\n",
"\n",
"# Cloud Storage bucket for storing the experiment artifacts.\n",
"# A unique GCS bucket will be created for the purpose of this notebook. If you\n",
"# prefer using your own GCS bucket, change the value yourself below.\n",
"now = datetime.datetime.now().strftime(\"%Y%m%d%H%M%S\")\n",
"BUCKET_NAME = \"/\".join(BUCKET_URI.split(\"/\")[:3])\n",
"\n",
"if BUCKET_URI is None or BUCKET_URI.strip() == \"\" or BUCKET_URI == \"gs://\":\n",
" BUCKET_URI = f\"gs://{PROJECT_ID}-tmp-{now}-{str(uuid.uuid4())[:4]}\"\n",
" BUCKET_NAME = \"/\".join(BUCKET_URI.split(\"/\")[:3])\n",
" ! gsutil mb -l {REGION} {BUCKET_URI}\n",
"else:\n",
" assert BUCKET_URI.startswith(\"gs://\"), \"BUCKET_URI must start with `gs://`.\"\n",
" shell_output = ! gsutil ls -Lb {BUCKET_NAME} | grep \"Location constraint:\" | sed \"s/Location constraint://\"\n",
" bucket_region = shell_output[0].strip().lower()\n",
" if bucket_region != REGION:\n",
" raise ValueError(\n",
" \"Bucket region %s is different from notebook region %s\"\n",
" % (bucket_region, REGION)\n",
" )\n",
"print(f\"Using this GCS Bucket: {BUCKET_URI}\")\n",
"\n",
"STAGING_BUCKET = os.path.join(BUCKET_URI, \"temporal\")\n",
"MODEL_BUCKET = os.path.join(BUCKET_URI, \"tfvision_image_classification\")\n",
"\n",
"\n",
"# Initialize Vertex AI API.\n",
"print(\"Initializing Vertex AI API.\")\n",
"aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=STAGING_BUCKET)\n",
"\n",
"# Gets the default SERVICE_ACCOUNT.\n",
"shell_output = ! gcloud projects describe $PROJECT_ID\n",
"project_number = shell_output[-1].split(\":\")[1].strip().replace(\"'\", \"\")\n",
"SERVICE_ACCOUNT = f\"{project_number}-compute@developer.gserviceaccount.com\"\n",
"print(\"Using this default Service Account:\", SERVICE_ACCOUNT)\n",
"\n",
"\n",
"# Provision permissions to the SERVICE_ACCOUNT with the GCS bucket\n",
"! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.admin $BUCKET_NAME\n",
"\n",
"! gcloud config set project $PROJECT_ID\n",
"! gcloud projects add-iam-policy-binding --no-user-output-enabled {PROJECT_ID} --member=serviceAccount:{SERVICE_ACCOUNT} --role=\"roles/storage.admin\"\n",
"! gcloud projects add-iam-policy-binding --no-user-output-enabled {PROJECT_ID} --member=serviceAccount:{SERVICE_ACCOUNT} --role=\"roles/aiplatform.user\"\n",
"\n",
"CONFIG_DIR = os.path.join(BUCKET_URI, \"config\")\n",
"CHECKPOINT_BUCKET = os.path.join(BUCKET_URI, \"ckpt\")\n",
"\n",
"# Only regions prefixed by \"us\", \"asia\", or \"europe\" are supported.\n",
"REGION_PREFIX = REGION.split(\"-\")[0]\n",
"assert REGION_PREFIX in (\n",
" \"us\",\n",
" \"europe\",\n",
" \"asia\",\n",
"), f'{REGION} is not supported. It must be prefixed by \"us\", \"asia\", or \"europe\".'\n",
"\n",
"\n",
"def upload_config_to_gcs(url):\n",
" filename = os.path.basename(url)\n",
" destination = os.path.join(CONFIG_DIR, filename)\n",
" print(\"Copy\", url, \"to\", destination)\n",
" ! wget \"$url\" -O \"$filename\"\n",
" ! gsutil cp \"$filename\" \"$destination\"\n",
"\n",
"\n",
"upload_config_to_gcs(\n",
" \"https://raw.githubusercontent.com/tensorflow/models/master/official/vision/configs/experiments/image_classification/imagenet_resnet50_gpu.yaml\"\n",
")\n",
"upload_config_to_gcs(\n",
" \"https://raw.githubusercontent.com/tensorflow/models/master/official/vision/configs/experiments/image_classification/imagenet_resnetrs50_i160_gpu.yaml\"\n",
")\n",
"upload_config_to_gcs(\n",
" \"https://raw.githubusercontent.com/tensorflow/models/master/official/projects/maxvit/configs/experiments/maxvit_base_imagenet_gpu.yaml\"\n",
")\n",
"\n",
"# Define constants.\n",
"OBJECTIVE = \"icn\"\n",
"\n",
"# Evaluation constants.\n",
"EVALUATION_METRIC = \"accuracy\""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "b2356e904526"
},
"source": [
"## Training\n",
"\n",
"This section trains model with the following steps:\n",
"1. Prepare data by converting the input data into training format.\n",
"2. Run hyperparameter tuning jobs to train new models.\n",
"3. Find and export best models."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "IndQ_m6ddUEM"
},
"outputs": [],
"source": [
"# @title Prepare input data for training\n",
"\n",
"# @markdown This section converts input data to training format, and splits to train/test/validation dataset with given split ratios and number of shards.\n",
"\n",
"# @markdown Prepare data in the format as described [here](https://cloud.google.com/vertex-ai/docs/image-data/classification/prepare-data), and then convert them to the training formats as below:\n",
"# @markdown * `input_file_path`: The input file path for preparing data. Sample input file: `gs://cloud-samples-data/ai-platform/flowers/flowers.csv`\n",
"# @markdown * `input_file_type`: The input file type, can be \"csv\" or \"jsonl\".\n",
"# @markdown * `num_classes`: Number of classes in the dataset.\n",
"# @markdown * `split_ratio`: The proportion of data to split into train/validation/test, e.g. \"0.8,0.1,0.1\".\n",
"# @markdown * `num_shard`: The number of shards for train/validation/test, e.g. \"10,10,10\".\n",
"\n",
"# This job will convert input data as training format, with given split ratios\n",
"# and number of shards on train/test/validation.\n",
"\n",
"from google.cloud.aiplatform import hyperparameter_tuning as hpt\n",
"\n",
"# Data converter constants.\n",
"DATA_CONVERTER_JOB_PREFIX = \"data_converter\"\n",
"DATA_CONVERTER_CONTAINER = f\"{REGION_PREFIX}-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/data-converter\"\n",
"DATA_CONVERTER_MACHINE_TYPE = \"n1-highmem-8\"\n",
"\n",
"data_converter_job_name = common_util.get_job_name_with_datetime(\n",
" DATA_CONVERTER_JOB_PREFIX + \"_\" + OBJECTIVE\n",
")\n",
"\n",
"input_file_path = \"gs://cloud-samples-data/ai-platform/flowers/flowers.csv\" # @param {type:\"string\"} {isTemplate:true}\n",
"input_file_type = \"csv\" # @param [\"csv\", \"jsonl\"]\n",
"num_classes = 5 # @param {type:\"integer\"}\n",
"split_ratio = \"0.8,0.1,0.1\" # @param {type:\"string\"}\n",
"num_shard = \"10,10,10\" # @param {type:\"string\"}\n",
"data_converter_output_dir = os.path.join(BUCKET_URI, data_converter_job_name)\n",
"\n",
"worker_pool_specs = [\n",
" {\n",
" \"machine_spec\": {\n",
" \"machine_type\": DATA_CONVERTER_MACHINE_TYPE,\n",
" },\n",
" \"replica_count\": 1,\n",
" \"container_spec\": {\n",
" \"image_uri\": DATA_CONVERTER_CONTAINER,\n",
" \"command\": [],\n",
" \"args\": [\n",
" \"--input_file_path=%s\" % input_file_path,\n",
" \"--input_file_type=%s\" % input_file_type,\n",
" \"--objective=%s\" % OBJECTIVE,\n",
" \"--num_shard=%s\" % num_shard,\n",
" \"--split_ratio=%s\" % split_ratio,\n",
" \"--output_dir=%s\" % data_converter_output_dir,\n",
" ],\n",
" },\n",
" }\n",
"]\n",
"\n",
"data_converter_custom_job = aiplatform.CustomJob(\n",
" display_name=data_converter_job_name,\n",
" project=PROJECT_ID,\n",
" worker_pool_specs=worker_pool_specs,\n",
" staging_bucket=STAGING_BUCKET,\n",
")\n",
"\n",
"data_converter_custom_job.run()\n",
"\n",
"input_train_data_path = os.path.join(data_converter_output_dir, \"train.tfrecord*\")\n",
"input_validation_data_path = os.path.join(data_converter_output_dir, \"val.tfrecord*\")\n",
"label_map_path = os.path.join(data_converter_output_dir, \"label_map.yaml\")\n",
"print(\"input_train_data_path for training: \", input_train_data_path)\n",
"print(\"input_validation_data_path for training: \", input_validation_data_path)\n",
"print(\"label_map_path for prediction: \", label_map_path)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "2c3211e01b6d"
},
"outputs": [],
"source": [
"# @title Create and run Vertex AI custom job with hyperparameter tuning\n",
"\n",
"# @markdown This section use Vertex AI SDK to create and run the hyperparameter tuning job with Vertex AI Model Garden Training Dockers.\n",
"\n",
"# @markdown Select one of the following experiments:\n",
"# @markdown * [tfhub/EfficientNetV2](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/imageclassification-efficientnet): `Efficientnetv2-m`\n",
"# @markdown * [tfvision/ViT](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/imageclassification-vit): `ViT-ti16`, `ViT-s16`, `ViT-b16`, `ViT-l16`\n",
"# @markdown * [Proprietary/MaxViT](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/imageclassification-proprietary-maxvit): `MaxViT`\n",
"\n",
"# Input train and validation datasets can be found from the section above\n",
"# `Convert input data for training`.\n",
"# Set prepared datasets if exists.\n",
"# input_train_data_path = ''\n",
"# input_validation_data_path = ''\n",
"\n",
"# Training constants.\n",
"TRAINING_JOB_PREFIX = \"train\"\n",
"TRAIN_CONTAINER_URI = f\"{REGION_PREFIX}-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/tfvision-oss\"\n",
"TRAIN_MACHINE_TYPE = \"g2-standard-4\"\n",
"TRAIN_ACCELERATOR_TYPE = \"NVIDIA_L4\"\n",
"TRAIN_NUM_GPU = 1\n",
"\n",
"experiment = \"Efficientnetv2-m\" # @param [\"Efficientnetv2-m\",\"ViT-ti16\",\"ViT-s16\",\"ViT-b16\",\"ViT-l16\", \"MaxViT\"]\n",
"\n",
"train_job_name = common_util.get_job_name_with_datetime(\n",
" TRAINING_JOB_PREFIX + \"_\" + OBJECTIVE\n",
")\n",
"model_dir = os.path.join(BUCKET_URI, train_job_name)\n",
"\n",
"# The arguments here are mainly for test purposes. Kindly update them\n",
"# to get better performances.\n",
"common_args = {\n",
" \"input_train_data_path\": input_train_data_path,\n",
" \"input_validation_data_path\": input_validation_data_path,\n",
" \"objective\": OBJECTIVE,\n",
" \"model_dir\": model_dir,\n",
" \"num_classes\": num_classes,\n",
" \"global_batch_size\": 4,\n",
" \"prefetch_buffer_size\": 32,\n",
" \"train_steps\": 2000,\n",
" \"input_size\": \"224,224\",\n",
"}\n",
"\n",
"# Arguments for different experiments.\n",
"experiment_container_args_dict = {\n",
" \"Efficientnetv2-m\": dict(\n",
" common_args,\n",
" **{\n",
" \"experiment\": \"hub_model\",\n",
" },\n",
" ),\n",
" \"ViT-ti16\": dict(\n",
" common_args,\n",
" **{\n",
" \"experiment\": \"deit_imagenet_pretrain\",\n",
" \"model_name\": \"vit-ti16\",\n",
" \"init_checkpoint\": \"https://storage.googleapis.com/tf_model_garden/vision/vit/vit-deit-imagenet-ti16.tar.gz\",\n",
" \"input_size\": \"224,224\",\n",
" },\n",
" ),\n",
" \"ViT-s16\": dict(\n",
" common_args,\n",
" **{\n",
" \"experiment\": \"deit_imagenet_pretrain\",\n",
" \"model_name\": \"vit-s16\",\n",
" \"init_checkpoint\": \"https://storage.googleapis.com/tf_model_garden/vision/vit/vit-deit-imagenet-s16.tar.gz\",\n",
" \"input_size\": \"224,224\",\n",
" },\n",
" ),\n",
" \"ViT-b16\": dict(\n",
" common_args,\n",
" **{\n",
" \"experiment\": \"deit_imagenet_pretrain\",\n",
" \"model_name\": \"vit-b16\",\n",
" \"init_checkpoint\": \"https://storage.googleapis.com/tf_model_garden/vision/vit/vit-deit-imagenet-b16.tar.gz\",\n",
" \"input_size\": \"224,224\",\n",
" },\n",
" ),\n",
" \"ViT-l16\": dict(\n",
" common_args,\n",
" **{\n",
" \"experiment\": \"deit_imagenet_pretrain\",\n",
" \"model_name\": \"vit-l16\",\n",
" \"init_checkpoint\": \"https://storage.googleapis.com/tf_model_garden/vision/vit/vit-deit-imagenet-l16.tar.gz\",\n",
" \"input_size\": \"224,224\",\n",
" },\n",
" ),\n",
" \"MaxViT\": dict(\n",
" common_args,\n",
" **{\n",
" \"experiment\": \"maxvit_imagenet\",\n",
" \"config_file\": os.path.join(CONFIG_DIR, \"maxvit_base_imagenet_gpu.yaml\"),\n",
" },\n",
" ),\n",
"}\n",
"experiment_container_args = experiment_container_args_dict[experiment]\n",
"\n",
"\n",
"def upload_checkpoint_to_gcs(checkpoint_url):\n",
" filename = os.path.basename(checkpoint_url)\n",
" checkpoint_name = filename.replace(\".tar.gz\", \"\")\n",
" print(\"Download checkpoint from\", checkpoint_url, \"and store to\", CHECKPOINT_BUCKET)\n",
" ! wget $checkpoint_url -O $filename\n",
" ! mkdir -p $checkpoint_name\n",
" ! tar -xvzf $filename -C $checkpoint_name\n",
"\n",
" # Search for relative path to the checkpoint.\n",
" checkpoint_path = None\n",
" for root, dirs, files in os.walk(checkpoint_name):\n",
" for file in files:\n",
" if file.endswith(\".index\"):\n",
" checkpoint_path = os.path.join(root, os.path.splitext(file)[0])\n",
" checkpoint_path = os.path.relpath(checkpoint_path, checkpoint_name)\n",
" break\n",
"\n",
" ! gsutil cp -r $checkpoint_name $CHECKPOINT_BUCKET/\n",
" checkpoint_uri = os.path.join(CHECKPOINT_BUCKET, checkpoint_name, checkpoint_path)\n",
" print(\"Checkpoint uploaded to\", checkpoint_uri)\n",
" return checkpoint_uri\n",
"\n",
"\n",
"# Copy checkpoint to GCS bucket if specified.\n",
"init_checkpoint = experiment_container_args.get(\"init_checkpoint\")\n",
"if init_checkpoint:\n",
" experiment_container_args[\"init_checkpoint\"] = upload_checkpoint_to_gcs(\n",
" init_checkpoint\n",
" )\n",
"\n",
"# Use container that supports MaxViT\n",
"if experiment == \"MaxViT\":\n",
" TRAIN_CONTAINER_URI = f\"{REGION_PREFIX}-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/tfvision-oss-v2\"\n",
"\n",
"worker_pool_specs = [\n",
" {\n",
" \"machine_spec\": {\n",
" \"machine_type\": TRAIN_MACHINE_TYPE,\n",
" \"accelerator_type\": TRAIN_ACCELERATOR_TYPE,\n",
" # Each training job uses TRAIN_NUM_GPU GPUs.\n",
" \"accelerator_count\": TRAIN_NUM_GPU,\n",
" },\n",
" \"replica_count\": 1,\n",
" \"container_spec\": {\n",
" \"image_uri\": TRAIN_CONTAINER_URI,\n",
" \"args\": [\n",
" \"--mode=train_and_eval\",\n",
" \"--params_override=runtime.num_gpus=%d\" % TRAIN_NUM_GPU,\n",
" ]\n",
" + [\"--{}={}\".format(k, v) for k, v in experiment_container_args.items()],\n",
" },\n",
" }\n",
"]\n",
"\n",
"metric_spec = {\"model_performance\": \"maximize\"}\n",
"\n",
"\n",
"LEARNING_RATES = [5e-4, 1e-3]\n",
"# Models will be trained with each learning rate separately and max trial count is the number of learning rates.\n",
"MAX_TRIAL_COUNT = len(LEARNING_RATES)\n",
"parameter_spec = {\n",
" \"learning_rate\": hpt.DiscreteParameterSpec(values=LEARNING_RATES, scale=\"linear\"),\n",
"}\n",
"\n",
"print(worker_pool_specs, metric_spec, parameter_spec)\n",
"\n",
"# Check quota.\n",
"common_util.check_quota(\n",
" project_id=PROJECT_ID,\n",
" region=REGION,\n",
" accelerator_type=TRAIN_ACCELERATOR_TYPE,\n",
" accelerator_count=1,\n",
" is_for_training=True,\n",
")\n",
"\n",
"\n",
"# Add labels for the finetuning job.\n",
"labels = {\n",
" \"mg-source\": \"notebook\",\n",
" \"mg-notebook-name\": \"model_garden_tfvision_image_classification.ipynb\".split(\".\")[\n",
" 0\n",
" ],\n",
"}\n",
"\n",
"labels[\"mg-tune\"] = \"publishers-google-models-tfvision\"\n",
"versioned_model_id = experiment.lower().replace(\"_\", \"-\")\n",
"labels[\"versioned-mg-tune\"] = f\"{labels['mg-tune']}-{versioned_model_id}\"\n",
"\n",
"# Run the hyperparameter job.\n",
"train_custom_job = aiplatform.CustomJob(\n",
" display_name=train_job_name,\n",
" project=PROJECT_ID,\n",
" worker_pool_specs=worker_pool_specs,\n",
" staging_bucket=STAGING_BUCKET,\n",
" labels=labels,\n",
")\n",
"\n",
"train_hpt_job = aiplatform.HyperparameterTuningJob(\n",
" display_name=train_job_name,\n",
" custom_job=train_custom_job,\n",
" metric_spec=metric_spec,\n",
" parameter_spec=parameter_spec,\n",
" max_trial_count=MAX_TRIAL_COUNT,\n",
" parallel_trial_count=MAX_TRIAL_COUNT,\n",
" project=PROJECT_ID,\n",
" search_algorithm=None,\n",
")\n",
"\n",
"train_hpt_job.run()\n",
"\n",
"print(\"experiment is: \", experiment)\n",
"print(\"model_dir is: \", model_dir)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "38a46cfda376"
},
"outputs": [],
"source": [
"# @title Export best models as TF Saved Model format\n",
"\n",
"# @markdown This section exports best model.\n",
"\n",
"# Export models from TF checkpoints to TF saved model format.\n",
"# model_dir is from the section above.\n",
"\n",
"# Export constants.\n",
"EXPORT_JOB_PREFIX = \"export\"\n",
"EXPORT_CONTAINER_URI = f\"{REGION_PREFIX}-docker.pkg.dev/vertex-ai-restricted/vertex-vision-model-garden-dockers/tfvision-model-export\"\n",
"EXPORT_MACHINE_TYPE = \"n1-highmem-8\"\n",
"\n",
"\n",
"def get_best_trial(model_dir, max_trial_count, evaluation_metric):\n",
" best_trial_dir = \"\"\n",
" best_trial_evaluation_results = {}\n",
" best_performance = -1\n",
"\n",
" for i in range(max_trial_count):\n",
" current_trial = i + 1\n",
" current_trial_dir = os.path.join(model_dir, \"trial_\" + str(current_trial))\n",
" current_trial_best_ckpt_dir = os.path.join(current_trial_dir, \"best_ckpt\")\n",
" current_trial_best_ckpt_evaluation_filepath = os.path.join(\n",
" current_trial_best_ckpt_dir, \"info.json\"\n",
" )\n",
" ! gsutil cp $current_trial_best_ckpt_evaluation_filepath .\n",
" with open(\"info.json\", \"r\") as f:\n",
" eval_metric_results = json.load(f)\n",
" current_performance = eval_metric_results[evaluation_metric]\n",
" if current_performance > best_performance:\n",
" best_performance = current_performance\n",
" best_trial_dir = current_trial_dir\n",
" best_trial_evaluation_results = eval_metric_results\n",
" print(\"best_trial_dir: \", current_trial_best_ckpt_evaluation_filepath)\n",
" return best_trial_dir, best_trial_evaluation_results\n",
"\n",
"\n",
"best_trial_dir, best_trial_evaluation_results = get_best_trial(\n",
" model_dir, MAX_TRIAL_COUNT, EVALUATION_METRIC\n",
")\n",
"print(\"best_trial_dir: \", best_trial_dir)\n",
"print(\"best_trial_evaluation_results: \", best_trial_evaluation_results)\n",
"\n",
"worker_pool_specs = [\n",
" {\n",
" \"machine_spec\": {\n",
" \"machine_type\": EXPORT_MACHINE_TYPE,\n",
" },\n",
" \"replica_count\": 1,\n",
" \"container_spec\": {\n",
" \"image_uri\": EXPORT_CONTAINER_URI,\n",
" \"command\": [],\n",
" \"args\": [\n",
" \"--objective=%s\" % OBJECTIVE,\n",
" \"--input_image_size=%s\" % experiment_container_args[\"input_size\"],\n",
" \"--experiment=%s\" % experiment_container_args[\"experiment\"],\n",
" \"--config_file=%s/params.yaml\" % best_trial_dir,\n",
" \"--checkpoint_path=%s/best_ckpt\" % best_trial_dir,\n",
" \"--export_dir=%s/best_model\" % model_dir,\n",
" ],\n",
" },\n",
" }\n",
"]\n",
"\n",
"model_export_name = common_util.get_job_name_with_datetime(\n",
" EXPORT_JOB_PREFIX + \"_\" + OBJECTIVE\n",
")\n",
"model_export_custom_job = aiplatform.CustomJob(\n",
" display_name=model_export_name,\n",
" project=PROJECT_ID,\n",
" worker_pool_specs=worker_pool_specs,\n",
" staging_bucket=STAGING_BUCKET,\n",
")\n",
"\n",
"model_export_custom_job.run()\n",
"\n",
"print(\"best model is saved to: \", os.path.join(model_dir, \"best_model\"))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "c68112dc90b9"
},
"source": [
"## Deployment"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "NYuQowyZEtxK"
},
"outputs": [],
"source": [
"# @title Upload and deploy models\n",
"\n",
"# @markdown This section uploads and deploy models to model registry for online prediction. This example uses the exported best model from \"Train new models\" section.\n",
"\n",
"PREDICTION_CONTAINER_URI = f\"{REGION_PREFIX}-docker.pkg.dev/vertex-ai-restricted/prediction/tf_opt-gpu.2-11:latest\"\n",
"SERVING_CONTAINER_ARGS = [\"--allow_precompilation\", \"--allow_compression\"]\n",
"PREDICTION_ACCELERATOR_TYPE = \"NVIDIA_L4\"\n",
"PREDICTION_MACHINE_TYPE = \"g2-standard-12\"\n",
"UPLOAD_JOB_PREFIX = \"upload\"\n",
"DEPLOY_JOB_PREFIX = \"deploy\"\n",
"\n",
"trained_model_dir = os.path.join(model_dir, \"best_model/saved_model\")\n",
"upload_job_name = common_util.get_job_name_with_datetime(\n",
" UPLOAD_JOB_PREFIX + \"_\" + OBJECTIVE\n",
")\n",
"\n",
"serving_env = {\n",
" \"MODEL_ID\": \"tensorflow-hub-efficientnetv2\",\n",
" \"DEPLOY_SOURCE\": \"notebook\",\n",
"}\n",
"match experiment:\n",
" case \"Efficientnetv2-m\":\n",
" publisher_model_id = \"imageclassification-efficientnet\"\n",
" case \"ViT-ti16\" | \"ViT-s16\" | \"ViT-b16\" | \"ViT-l16\":\n",
" publisher_model_id = \"imageclassification-vit\"\n",
" case \"MaxViT\":\n",
" publisher_model_id = \"imageclassification-maxvit\"\n",
" case _:\n",
" raise ValueError(f\"Unknown experiment: {experiment}\")\n",
"\n",
"models[\"model_icn\"] = aiplatform.Model.upload(\n",
" display_name=upload_job_name,\n",
" artifact_uri=trained_model_dir,\n",
" serving_container_image_uri=PREDICTION_CONTAINER_URI,\n",
" serving_container_args=SERVING_CONTAINER_ARGS,\n",
" serving_container_environment_variables=serving_env,\n",
" model_garden_source_model_name=(\n",
" f\"publishers/google/models/{publisher_model_id}\"\n",
" ),\n",
")\n",
"\n",
"models[\"model_icn\"].wait()\n",
"\n",
"print(\"The uploaded model name is: \", upload_job_name)\n",
"\n",
"deploy_model_name = common_util.get_job_name_with_datetime(\n",
" DEPLOY_JOB_PREFIX + \"_\" + OBJECTIVE\n",
")\n",
"print(\"The deployed job name is: \", deploy_model_name)\n",
"\n",
"common_util.check_quota(\n",
" project_id=PROJECT_ID,\n",
" region=REGION,\n",
" accelerator_type=PREDICTION_ACCELERATOR_TYPE,\n",
" accelerator_count=1,\n",
" is_for_training=False,\n",
")\n",
"\n",
"endpoints[\"endpoint_icn\"] = models[\"model_icn\"].deploy(\n",
" deployed_model_display_name=deploy_model_name,\n",
" machine_type=PREDICTION_MACHINE_TYPE,\n",
" traffic_split={\"0\": 100},\n",
" accelerator_type=PREDICTION_ACCELERATOR_TYPE,\n",
" accelerator_count=1,\n",
" min_replica_count=1,\n",
" max_replica_count=1,\n",
" system_labels={\n",
" \"NOTEBOOK_NAME\": \"model_garden_tfvision_image_classification.ipynb\"\n",
" },\n",
")\n",
"\n",
"endpoint_id = endpoints[\"endpoint_icn\"].name\n",
"print(\"endpoint id is: \", endpoint_id)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "1ULa2VTQqWfo"
},
"source": [
"## Predict"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "vbIW9me1F2RY"
},
"outputs": [],
"source": [
"# @title Run predictions\n",
"\n",
"# @markdown Once deployment succeeds, you can send image to the endpoint for online prediction.\n",
"\n",
"# @markdown `test_filepath`: gcs uri to the test image file. The uri should start with \"gs://\".\n",
"\n",
"# endpoint_id was generated in the section above (`Upload and deploy models`).\n",
"endpoint_id = endpoints[\"endpoint_icn\"].name\n",
"\n",
"test_filepath = \"gs://cloud-samples-data/ai-platform/flowers/roses/9423755543_edb35141a3_n.jpg\" # @param {type:\"string\"} {isTemplate:true}\n",
"\n",
"\n",
"def get_label_map(label_map_yaml_filepath: str) -> Dict[int, str]:\n",
" \"\"\"Returns class id to label mapping given a filepath to the label map.\n",
"\n",
" Args:\n",
" label_map_yaml_filepath: A string of label map yaml file path.\n",
"\n",
" Returns:\n",
" A dictionary of class id to label mapping.\n",
" \"\"\"\n",
" label_map_filename = os.path.basename(label_map_yaml_filepath)\n",
" subprocess.check_output(\n",
" [\"gsutil\", \"cp\", label_map_yaml_filepath, label_map_filename],\n",
" stderr=subprocess.STDOUT,\n",
" )\n",
" with open(label_map_filename, \"rb\") as input_file:\n",
" label_map = yaml.safe_load(input_file.read())[\"label_map\"]\n",
" return label_map\n",
"\n",
"\n",
"def get_prediction_instances(test_filepath: str, new_width: int = -1) -> Any:\n",
" \"\"\"Generate instance from image path to pass to Vertex AI Endpoint for prediction.\n",
"\n",
" Args:\n",
" test_filepath: A string of test image path.\n",
" new_width: An integer of new image width.\n",
"\n",
" Returns:\n",
" A list of instances.\n",
" \"\"\"\n",
" if new_width <= 0:\n",
" test_file = os.path.basename(test_filepath)\n",
" subprocess.check_output(\n",
" [\"gsutil\", \"cp\", test_filepath, test_file], stderr=subprocess.STDOUT\n",
" )\n",
" with open(test_file, \"rb\") as input_file:\n",
" encoded_string = base64.b64encode(input_file.read()).decode(\"utf-8\")\n",
" else:\n",
" img = common_util.load_img(test_filepath)\n",
" width, height = img.size\n",
" print(\"original input image size: \", width, \" , \", height)\n",
" new_height = int(height * new_width / width)\n",
" new_img = img.resize((new_width, new_height))\n",
" print(\"resized input image size: \", new_width, \" , \", new_height)\n",
" buffered = io.BytesIO()\n",
" new_img.save(buffered, format=\"JPEG\")\n",
" encoded_string = base64.b64encode(buffered.getvalue()).decode(\"utf-8\")\n",
"\n",
" instances = [\n",
" {\n",
" \"encoded_image\": {\"b64\": encoded_string},\n",
" }\n",
" ]\n",
" return instances\n",
"\n",
"\n",
"# If the input image is too large, we will resize it for prediction.\n",
"instances = get_prediction_instances(test_filepath, new_width=1000)\n",
"\n",
"# The label map file was generated from the section above (`Convert input data for training`).\n",
"label_map = get_label_map(label_map_path)\n",
"\n",
"\n",
"def predict_custom_trained_model(\n",
" project: str,\n",
" endpoint_id: str,\n",
" instances: Union[Dict, List[Dict]],\n",
" location: str = \"us-central1\",\n",
"):\n",
" # The AI Platform services require regional API endpoints.\n",
" client_options = {\"api_endpoint\": f\"{location}-aiplatform.googleapis.com\"}\n",
" # Initialize client that will be used to create and send requests.\n",
" # This client only needs to be created once, and can be reused for multiple requests.\n",
" client = aiplatform.gapic.PredictionServiceClient(client_options=client_options)\n",
" parameters_dict = {}\n",
" parameters = json_format.ParseDict(parameters_dict, Value())\n",
" endpoint = client.endpoint_path(\n",
" project=project, location=location, endpoint=endpoint_id\n",
" )\n",
" response = client.predict(\n",
" endpoint=endpoint, instances=instances, parameters=parameters\n",
" )\n",
" return response.predictions, response.deployed_model_id\n",
"\n",
"\n",
"predictions, _ = predict_custom_trained_model(\n",
" project=PROJECT_ID, location=REGION, endpoint_id=endpoint_id, instances=instances\n",
")\n",
"\n",
"probs = dict(predictions[0])[\"probs\"]\n",
"max_prob = max(probs)\n",
"max_index = probs.index(max_prob)\n",
"print(\"The test image: \", test_filepath)\n",
"print(\"max_prob: \", max_prob, \", for label: \", label_map[max_index])\n",
"img = common_util.load_img(test_filepath)\n",
"common_util.display_image(img)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "f72e754f2802"
},
"source": [
"## Clean up resources"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "Ax6vQVZhp9pR"
},
"outputs": [],
"source": [
"# @title Clean up training jobs, models, endpoints and buckets\n",
"\n",
"try:\n",
" # Delete custom and hpt jobs.\n",
" if data_converter_custom_job.list(\n",
" filter=f'display_name=\"{data_converter_job_name}\"'\n",
" ):\n",
" data_converter_custom_job.delete()\n",
" if train_hpt_job.list(filter=f'display_name=\"{train_job_name}\"'):\n",
" train_hpt_job.delete()\n",
" if model_export_custom_job.list(filter=f'display_name=\"{model_export_name}\"'):\n",
" model_export_custom_job.delete()\n",
"except Exception as e:\n",
" print(e)\n",
"\n",
"# @markdown Delete the experiment models and endpoints to recycle the resources\n",
"# @markdown and avoid unnecessary continuous charges that may incur.\n",
"\n",
"# Undeploy model and delete endpoint.\n",
"for endpoint in endpoints.values():\n",
" endpoint.delete(force=True)\n",
"\n",
"# Delete models.\n",
"for model in models.values():\n",
" model.delete()\n",
"\n",
"delete_bucket = False # @param {type:\"boolean\"}\n",
"if delete_bucket:\n",
" ! gsutil -m rm -r $BUCKET_NAME"
]
}
],
"metadata": {
"colab": {
"name": "model_garden_tfvision_image_classification.ipynb",
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}