{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "id": "copyright" }, "outputs": [], "source": [ "# Copyright 2020 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "title" }, "source": [ "# Vertex client library: Hyperparameter tuning image classification model\n", "\n", "<table align=\"left\">\n", " <td>\n", " <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/master/notebooks/community/gapic/custom/showcase_hyperparmeter_tuning_image_classification.ipynb\">\n", " <img src=\"https://cloud.google.com/ml-engine/images/colab-logo-32px.png\" alt=\"Colab logo\"> Run in Colab\n", " </a>\n", " </td>\n", " <td>\n", " <a href=\"https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/master/notebooks/community/gapic/custom/showcase_hyperparmeter_tuning_image_classification.ipynb\">\n", " <img src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" alt=\"GitHub logo\">\n", " View on GitHub\n", " </a>\n", " </td>\n", "</table>\n", "<br/><br/><br/>" ] }, { "cell_type": "markdown", "metadata": { "id": "overview:custom,hpt" }, "source": [ "## Overview\n", "\n", "This tutorial demonstrates how to use the Vertex client library for Python to do hyperparameter tuning for a custom image classification model." 
] }, { "cell_type": "markdown", "metadata": { "id": "dataset:custom,cifar10,icn" }, "source": [ "### Dataset\n", "\n", "The dataset used for this tutorial is the [CIFAR10 dataset](https://www.tensorflow.org/datasets/catalog/cifar10) from [TensorFlow Datasets](https://www.tensorflow.org/datasets/catalog/overview). The version of the dataset you will use is built into TensorFlow. The trained model predicts which of ten classes an image belongs to: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, or truck." ] }, { "cell_type": "markdown", "metadata": { "id": "objective:custom,hpt" }, "source": [ "### Objective\n", "\n", "In this notebook, you learn how to create a hyperparameter tuning job for a custom image classification model from a Python script in a Docker container using the Vertex client library. You can alternatively tune hyperparameters using the `gcloud` command-line tool or online using the Google Cloud Console.\n", "\n", "\n", "The steps performed include:\n", "\n", "- Create a Vertex hyperparameter tuning job for training a custom model.\n", "- Tune the custom model.\n", "- Evaluate the study results." ] }, { "cell_type": "markdown", "metadata": { "id": "costs" }, "source": [ "### Costs\n", "\n", "This tutorial uses billable components of Google Cloud (GCP):\n", "\n", "* Vertex AI\n", "* Cloud Storage\n", "\n", "Learn about [Vertex AI\n", "pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage\n", "pricing](https://cloud.google.com/storage/pricing), and use the [Pricing\n", "Calculator](https://cloud.google.com/products/calculator/)\n", "to generate a cost estimate based on your projected usage." ] }, { "cell_type": "markdown", "metadata": { "id": "install_aip" }, "source": [ "## Installation\n", "\n", "Install the latest version of the Vertex client library." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "install_aip" }, "outputs": [], "source": [ "import os\n", "import sys\n", "\n", "# Google Cloud Notebook\n", "if os.path.exists(\"/opt/deeplearning/metadata/env_version\"):\n", " USER_FLAG = \"--user\"\n", "else:\n", " USER_FLAG = \"\"\n", "\n", "! pip3 install -U google-cloud-aiplatform $USER_FLAG" ] }, { "cell_type": "markdown", "metadata": { "id": "install_storage" }, "source": [ "Install the latest GA version of *google-cloud-storage* library as well." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "install_storage" }, "outputs": [], "source": [ "! pip3 install -U google-cloud-storage $USER_FLAG" ] }, { "cell_type": "markdown", "metadata": { "id": "restart" }, "source": [ "### Restart the kernel\n", "\n", "Once you've installed the Vertex client library and Google *cloud-storage*, you need to restart the notebook kernel so it can find the packages." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "restart" }, "outputs": [], "source": [ "if not os.getenv(\"IS_TESTING\"):\n", " # Automatically restart kernel after installs\n", " import IPython\n", "\n", " app = IPython.Application.instance()\n", " app.kernel.do_shutdown(True)" ] }, { "cell_type": "markdown", "metadata": { "id": "before_you_begin" }, "source": [ "## Before you begin\n", "\n", "### GPU runtime\n", "\n", "*Make sure you're running this notebook in a GPU runtime if you have that option. In Colab, select* **Runtime > Change Runtime Type > GPU**\n", "\n", "### Set up your Google Cloud project\n", "\n", "**The following steps are required, regardless of your notebook environment.**\n", "\n", "1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.\n", "\n", "2. 
[Make sure that billing is enabled for your project.](https://cloud.google.com/billing/docs/how-to/modify-project)\n", "\n", "3. [Enable the Vertex APIs and Compute Engine APIs.](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,compute_component)\n", "\n", "4. [The Google Cloud SDK](https://cloud.google.com/sdk) is already installed in Google Cloud Notebook.\n", "\n", "5. Enter your project ID in the cell below. Then run the cell to make sure the\n", "Cloud SDK uses the right project for all the commands in this notebook.\n", "\n", "**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "set_project_id" }, "outputs": [], "source": [ "PROJECT_ID = \"[your-project-id]\" # @param {type:\"string\"}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "autoset_project_id" }, "outputs": [], "source": [ "if PROJECT_ID == \"\" or PROJECT_ID is None or PROJECT_ID == \"[your-project-id]\":\n", " # Get your GCP project id from gcloud\n", " shell_output = !gcloud config list --format 'value(core.project)' 2>/dev/null\n", " PROJECT_ID = shell_output[0]\n", " print(\"Project ID:\", PROJECT_ID)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "set_gcloud_project_id" }, "outputs": [], "source": [ "! gcloud config set project $PROJECT_ID" ] }, { "cell_type": "markdown", "metadata": { "id": "region" }, "source": [ "#### Region\n", "\n", "You can also change the `REGION` variable, which is used for operations\n", "throughout the rest of this notebook. Below are regions supported for Vertex. We recommend that you choose the region closest to you.\n", "\n", "- Americas: `us-central1`\n", "- Europe: `europe-west4`\n", "- Asia Pacific: `asia-east1`\n", "\n", "You may not use a multi-regional bucket for training with Vertex. 
Not all regions provide support for all Vertex services. For the latest support per region, see the [Vertex locations documentation](https://cloud.google.com/vertex-ai/docs/general/locations)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "region" }, "outputs": [], "source": [ "REGION = \"us-central1\" # @param {type: \"string\"}" ] }, { "cell_type": "markdown", "metadata": { "id": "timestamp" }, "source": [ "#### Timestamp\n", "\n", "If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on the resources created, you create a timestamp for each session and append it to the names of the resources you create in this tutorial." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "timestamp" }, "outputs": [], "source": [ "from datetime import datetime\n", "\n", "TIMESTAMP = datetime.now().strftime(\"%Y%m%d%H%M%S\")" ] }, { "cell_type": "markdown", "metadata": { "id": "gcp_authenticate" }, "source": [ "### Authenticate your Google Cloud account\n", "\n", "**If you are using Google Cloud Notebook**, your environment is already authenticated. Skip this step.\n", "\n", "**If you are using Colab**, run the cell below and follow the instructions when prompted to authenticate your account via OAuth.\n", "\n", "**Otherwise**, follow these steps:\n", "\n", "In the Cloud Console, go to the [Create service account key](https://console.cloud.google.com/apis/credentials/serviceaccountkey) page.\n", "\n", "**Click Create service account**.\n", "\n", "In the **Service account name** field, enter a name, and click **Create**.\n", "\n", "In the **Grant this service account access to project** section, click the Role drop-down list. Type \"Vertex\" into the filter box, and select **Vertex Administrator**. Type \"Storage Object Admin\" into the filter box, and select **Storage Object Admin**.\n", "\n", "Click Create. 
A JSON file that contains your key downloads to your local environment.\n", "\n", "Enter the path to your service account key as the GOOGLE_APPLICATION_CREDENTIALS variable in the cell below and run the cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "gcp_authenticate" }, "outputs": [], "source": [ "# If you are running this notebook in Colab, run this cell and follow the\n", "# instructions to authenticate your GCP account. This provides access to your\n", "# Cloud Storage bucket and lets you submit training jobs and prediction\n", "# requests.\n", "\n", "# If on Google Cloud Notebook, then don't execute this code\n", "if not os.path.exists(\"/opt/deeplearning/metadata/env_version\"):\n", "    if \"google.colab\" in sys.modules:\n", "        from google.colab import auth as google_auth\n", "\n", "        google_auth.authenticate_user()\n", "\n", "    # If you are running this notebook locally, replace the string below with the\n", "    # path to your service account key and run this cell to authenticate your GCP\n", "    # account.\n", "    elif not os.getenv(\"IS_TESTING\"):\n", "        %env GOOGLE_APPLICATION_CREDENTIALS ''" ] }, { "cell_type": "markdown", "metadata": { "id": "bucket:custom" }, "source": [ "### Create a Cloud Storage bucket\n", "\n", "**The following steps are required, regardless of your notebook environment.**\n", "\n", "When you submit a custom training job using the Vertex client library, you upload a Python package\n", "containing your training code to a Cloud Storage bucket. Vertex runs\n", "the code from this package. In this tutorial, Vertex also saves the\n", "trained model that results from your job in the same bucket. You can then\n", "create an `Endpoint` resource based on this output in order to serve\n", "online predictions.\n", "\n", "Set the name of your Cloud Storage bucket below. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bucket" }, "outputs": [], "source": [ "BUCKET_NAME = \"gs://[your-bucket-name]\" # @param {type:\"string\"}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "autoset_bucket" }, "outputs": [], "source": [ "if BUCKET_NAME == \"\" or BUCKET_NAME is None or BUCKET_NAME == \"gs://[your-bucket-name]\":\n", " BUCKET_NAME = \"gs://\" + PROJECT_ID + \"aip-\" + TIMESTAMP" ] }, { "cell_type": "markdown", "metadata": { "id": "create_bucket" }, "source": [ "**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "create_bucket" }, "outputs": [], "source": [ "! gsutil mb -l $REGION $BUCKET_NAME" ] }, { "cell_type": "markdown", "metadata": { "id": "validate_bucket" }, "source": [ "Finally, validate access to your Cloud Storage bucket by examining its contents:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "validate_bucket" }, "outputs": [], "source": [ "! gsutil ls -al $BUCKET_NAME" ] }, { "cell_type": "markdown", "metadata": { "id": "setup_vars" }, "source": [ "### Set up variables\n", "\n", "Next, set up some variables used throughout the tutorial.\n", "### Import libraries and define constants" ] }, { "cell_type": "markdown", "metadata": { "id": "import_aip:protobuf" }, "source": [ "#### Import Vertex client library\n", "\n", "Import the Vertex client library into our Python environment." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "import_aip:protobuf" }, "outputs": [], "source": [ "import time\n", "\n", "from google.cloud.aiplatform import gapic as aip\n", "from google.protobuf import json_format\n", "from google.protobuf.json_format import MessageToJson, ParseDict\n", "from google.protobuf.struct_pb2 import Struct, Value" ] }, { "cell_type": "markdown", "metadata": { "id": "aip_constants" }, "source": [ "#### Vertex constants\n", "\n", "Set up the following constants for Vertex:\n", "\n", "- `API_ENDPOINT`: The Vertex API service endpoint for dataset, model, job, pipeline and endpoint services.\n", "- `PARENT`: The Vertex location root path for dataset, model, job, pipeline and endpoint resources." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "aip_constants" }, "outputs": [], "source": [ "# API service endpoint\n", "API_ENDPOINT = \"{}-aiplatform.googleapis.com\".format(REGION)\n", "\n", "# Vertex location root path for your dataset, model and endpoint resources\n", "PARENT = \"projects/\" + PROJECT_ID + \"/locations/\" + REGION" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "accelerators:training" }, "outputs": [], "source": [ "if os.getenv(\"IS_TESTING_TRAIN_GPU\"):\n", "    TRAIN_GPU, TRAIN_NGPU = (\n", "        aip.AcceleratorType.NVIDIA_TESLA_K80,\n", "        int(os.getenv(\"IS_TESTING_TRAIN_GPU\")),\n", "    )\n", "else:\n", "    TRAIN_GPU, TRAIN_NGPU = (aip.AcceleratorType.NVIDIA_TESLA_K80, 1)" ] }, { "cell_type": "markdown", "metadata": { "id": "container:training" }, "source": [ "#### Container (Docker) image\n", "\n", "Next, you will set the Docker container image for training.\n", "\n", "- Set the variable `TF` to the TensorFlow version of the container image. For example, `2-1` would be version 2.1, and `1-15` would be version 1.15. 
The following list shows some of the pre-built images available:\n", "\n", "    - TensorFlow 1.15\n", "      - `gcr.io/cloud-aiplatform/training/tf-cpu.1-15:latest`\n", "      - `gcr.io/cloud-aiplatform/training/tf-gpu.1-15:latest`\n", "    - TensorFlow 2.1\n", "      - `gcr.io/cloud-aiplatform/training/tf-cpu.2-1:latest`\n", "      - `gcr.io/cloud-aiplatform/training/tf-gpu.2-1:latest`\n", "    - TensorFlow 2.2\n", "      - `gcr.io/cloud-aiplatform/training/tf-cpu.2-2:latest`\n", "      - `gcr.io/cloud-aiplatform/training/tf-gpu.2-2:latest`\n", "    - TensorFlow 2.3\n", "      - `gcr.io/cloud-aiplatform/training/tf-cpu.2-3:latest`\n", "      - `gcr.io/cloud-aiplatform/training/tf-gpu.2-3:latest`\n", "    - TensorFlow 2.4\n", "      - `gcr.io/cloud-aiplatform/training/tf-cpu.2-4:latest`\n", "      - `gcr.io/cloud-aiplatform/training/tf-gpu.2-4:latest`\n", "    - XGBoost\n", "      - `gcr.io/cloud-aiplatform/training/xgboost-cpu.1-1`\n", "    - Scikit-learn\n", "      - `gcr.io/cloud-aiplatform/training/scikit-learn-cpu.0-23:latest`\n", "    - PyTorch\n", "      - `gcr.io/cloud-aiplatform/training/pytorch-cpu.1-4:latest`\n", "      - `gcr.io/cloud-aiplatform/training/pytorch-cpu.1-5:latest`\n", "      - `gcr.io/cloud-aiplatform/training/pytorch-cpu.1-6:latest`\n", "      - `gcr.io/cloud-aiplatform/training/pytorch-cpu.1-7:latest`\n", "\n", "For the latest list, see [Pre-built containers for training](https://cloud.google.com/vertex-ai/docs/training/pre-built-containers)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "container:training" }, "outputs": [], "source": [ "if os.getenv(\"IS_TESTING_TF\"):\n", "    TF = os.getenv(\"IS_TESTING_TF\")\n", "else:\n", "    TF = \"2-1\"\n", "\n", "if TF[0] == \"2\":\n", "    if TRAIN_GPU:\n", "        TRAIN_VERSION = \"tf-gpu.{}\".format(TF)\n", "    else:\n", "        TRAIN_VERSION = \"tf-cpu.{}\".format(TF)\n", "else:\n", "    if TRAIN_GPU:\n", "        TRAIN_VERSION = \"tf-gpu.{}\".format(TF)\n", "    else:\n", "        TRAIN_VERSION = \"tf-cpu.{}\".format(TF)\n", "\n", "TRAIN_IMAGE = \"gcr.io/cloud-aiplatform/training/{}:latest\".format(TRAIN_VERSION)\n", "\n", "print(\"Training:\", TRAIN_IMAGE, TRAIN_GPU, TRAIN_NGPU)" ] }, { "cell_type": "markdown", "metadata": { "id": "machine:training" }, "source": [ "#### Machine Type\n", "\n", "Next, set the machine type to use for training.\n", "\n", "- Set the variable `TRAIN_COMPUTE` to configure the compute resources for the VMs you will use for training.\n", "    - `machine type`\n", "        - `n1-standard`: 3.75 GB of memory per vCPU.\n", "        - `n1-highmem`: 6.5 GB of memory per vCPU.\n", "        - `n1-highcpu`: 0.9 GB of memory per vCPU.\n", "    - `vCPUs`: number of \\[2, 4, 8, 16, 32, 64, 96 \\]\n", "\n", "*Note: The following is not supported for training:*\n", "\n", " - `standard`: 2 vCPUs\n", " - `highcpu`: 2, 4 and 8 vCPUs\n", "\n", "*Note: You may also use n2 and e2 machine types for training and deployment, but they do not support GPUs*." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "machine:training" }, "outputs": [], "source": [ "if os.getenv(\"IS_TESTING_TRAIN_MACHINE\"):\n", "    MACHINE_TYPE = os.getenv(\"IS_TESTING_TRAIN_MACHINE\")\n", "else:\n", "    MACHINE_TYPE = \"n1-standard\"\n", "\n", "VCPU = \"4\"\n", "TRAIN_COMPUTE = MACHINE_TYPE + \"-\" + VCPU\n", "print(\"Train machine type\", TRAIN_COMPUTE)" ] }, { "cell_type": "markdown", "metadata": { "id": "tutorial_start:custom,hpt" }, "source": [ "# Tutorial\n", "\n", "Now you are ready to start creating your own hyperparameter tuning and training job for a custom image classification model." ] }, { "cell_type": "markdown", "metadata": { "id": "clients:custom,hpt" }, "source": [ "## Set up clients\n", "\n", "The Vertex client library works as a client/server model. On your side (the Python script) you will create a client that sends requests to and receives responses from the Vertex server.\n", "\n", "You will use different clients in this tutorial for different steps in the workflow. Set them all up upfront.\n", "\n", "- Model Service for `Model` resources.\n", "- Job Service for hyperparameter tuning." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "clients:custom,hpt" }, "outputs": [], "source": [ "# client options same for all services\n", "client_options = {\"api_endpoint\": API_ENDPOINT}\n", "\n", "\n", "def create_job_client():\n", " client = aip.JobServiceClient(client_options=client_options)\n", " return client\n", "\n", "\n", "def create_model_client():\n", " client = aip.ModelServiceClient(client_options=client_options)\n", " return client\n", "\n", "\n", "clients = {}\n", "clients[\"job\"] = create_job_client()\n", "clients[\"model\"] = create_model_client()\n", "\n", "for client in clients.items():\n", " print(client)" ] }, { "cell_type": "markdown", "metadata": { "id": "tune_custom_model:simple" }, "source": [ "## Tuning a model - Hello World\n", "\n", "There are two ways you can hyperparameter tune and train a custom model using a container image:\n", "\n", "- **Use a Google Cloud prebuilt container**. If you use a prebuilt container, you will additionally specify a Python package to install into the container image. This Python package contains your code for hyperparameter tuning and training a custom model.\n", "\n", "- **Use your own custom container image**. If you use your own container, the container needs to contain your code for hyperparameter tuning and training a custom model." ] }, { "cell_type": "markdown", "metadata": { "id": "train_custom_job_specification:prebuilt_container,hpt" }, "source": [ "## Prepare your hyperparameter tuning job specification\n", "\n", "Now that your clients are ready, your first step is to create a Job Specification for your hyperparameter tuning job. 
The job specification will consist of the following:\n", "\n", "- `trial_job_spec`: The specification for the custom job.\n", "    - `worker_pool_spec` : The specification of the type of machine(s) you will use for hyperparameter tuning and how many (single or distributed).\n", "    - `python_package_spec` : The specification of the Python package to be installed with the pre-built container.\n", "\n", "- `study_spec`: The specification for what to tune.\n", "    - `parameters`: This is the specification of the hyperparameters that you will tune for the custom training job.\n", "    - `metrics`: This is the specification on how to evaluate the result of each tuning trial." ] }, { "cell_type": "markdown", "metadata": { "id": "train_custom_job_machine_specification" }, "source": [ "### Prepare your machine specification\n", "\n", "Now define the machine specification for your custom hyperparameter tuning job. This tells Vertex what type of machine instance to provision for the hyperparameter tuning.\n", " - `machine_type`: The type of GCP instance to provision -- e.g., n1-standard-8.\n", " - `accelerator_type`: The type, if any, of hardware accelerator. In this tutorial if you previously set the variable `TRAIN_GPU != None`, you are using a GPU; otherwise you will use a CPU.\n", " - `accelerator_count`: The number of accelerators." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "train_custom_job_machine_specification" }, "outputs": [], "source": [ "if TRAIN_GPU:\n", " machine_spec = {\n", " \"machine_type\": TRAIN_COMPUTE,\n", " \"accelerator_type\": TRAIN_GPU,\n", " \"accelerator_count\": TRAIN_NGPU,\n", " }\n", "else:\n", " machine_spec = {\"machine_type\": TRAIN_COMPUTE, \"accelerator_count\": 0}" ] }, { "cell_type": "markdown", "metadata": { "id": "train_custom_job_disk_specification" }, "source": [ "### Prepare your disk specification\n", "\n", "(optional) Now define the disk specification for your custom hyperparameter tuning job. This tells Vertex what type and size of disk to provision in each machine instance for the hyperparameter tuning.\n", "\n", " - `boot_disk_type`: Either SSD or Standard. SSD is faster, and Standard is less expensive. Defaults to SSD.\n", " - `boot_disk_size_gb`: Size of disk in GB." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "train_custom_job_disk_specification" }, "outputs": [], "source": [ "DISK_TYPE = \"pd-ssd\" # [ pd-ssd, pd-standard]\n", "DISK_SIZE = 200 # GB\n", "\n", "disk_spec = {\"boot_disk_type\": DISK_TYPE, \"boot_disk_size_gb\": DISK_SIZE}" ] }, { "cell_type": "markdown", "metadata": { "id": "train_custom_job_worker_pool_specification:prebuilt_container" }, "source": [ "### Define the worker pool specification\n", "\n", "Next, you define the worker pool specification for your custom hyperparameter tuning job. 
The worker pool specification will consist of the following:\n", "\n", "- `replica_count`: The number of instances to provision of this machine type.\n", "- `machine_spec`: The hardware specification.\n", "- `disk_spec` : (optional) The disk storage specification.\n", "\n", "- `python_package_spec`: The Python training package to install on the VM instance(s) and which Python module to invoke, along with command line arguments for the Python module.\n", "\n", "Let's dive deeper now into the Python package specification:\n", "\n", "- `executor_image_uri`: This is the Docker image which is configured for your custom hyperparameter tuning job.\n", "\n", "- `package_uris`: This is a list of the locations (URIs) of your Python training packages to install on the provisioned instance. The locations need to be in a Cloud Storage bucket. These can be either individual Python files or a zip (archive) of an entire package. In the latter case, the job service will unzip (unarchive) the contents into the Docker image.\n", "\n", "- `python_module`: The Python module (script) to invoke for running the custom hyperparameter tuning job. In this example, you will be invoking `trainer.task` (the file `trainer/task.py`) -- note that you do not append the `.py` suffix.\n", "\n", "- `args`: The command line arguments to pass to the corresponding Python module. In this example, you will be setting:\n", "    - `\"--model-dir=\" + MODEL_DIR` : The Cloud Storage location where to store the model artifacts. There are two ways to tell the hyperparameter tuning script where to save the model artifacts:\n", "        - direct: You pass the Cloud Storage location as a command line argument to your training script (set variable `DIRECT = True`), or\n", "        - indirect: The service passes the Cloud Storage location as the environment variable `AIP_MODEL_DIR` to your training script (set variable `DIRECT = False`). 
In this case, you tell the service the model artifact location in the job specification.\n", "    - `\"--epochs=\" + EPOCHS`: The number of epochs for training.\n", "    - `\"--steps=\" + STEPS`: The number of steps (batches) per epoch.\n", "    - `\"--distribute=\" + TRAIN_STRATEGY` : The training distribution strategy to use for single or distributed hyperparameter tuning.\n", "        - `\"single\"`: single device.\n", "        - `\"mirror\"`: all GPU devices on a single compute instance.\n", "        - `\"multi\"`: all GPU devices on all compute instances." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "train_custom_job_worker_pool_specification:prebuilt_container" }, "outputs": [], "source": [ "JOB_NAME = \"custom_job_\" + TIMESTAMP\n", "MODEL_DIR = \"{}/{}\".format(BUCKET_NAME, JOB_NAME)\n", "\n", "if not TRAIN_NGPU or TRAIN_NGPU < 2:\n", "    TRAIN_STRATEGY = \"single\"\n", "else:\n", "    TRAIN_STRATEGY = \"mirror\"\n", "\n", "EPOCHS = 20\n", "STEPS = 100\n", "\n", "DIRECT = True\n", "if DIRECT:\n", "    CMDARGS = [\n", "        \"--model-dir=\" + MODEL_DIR,\n", "        \"--epochs=\" + str(EPOCHS),\n", "        \"--steps=\" + str(STEPS),\n", "        \"--distribute=\" + TRAIN_STRATEGY,\n", "    ]\n", "else:\n", "    CMDARGS = [\n", "        \"--epochs=\" + str(EPOCHS),\n", "        \"--steps=\" + str(STEPS),\n", "        \"--distribute=\" + TRAIN_STRATEGY,\n", "    ]\n", "\n", "worker_pool_spec = [\n", "    {\n", "        \"replica_count\": 1,\n", "        \"machine_spec\": machine_spec,\n", "        \"disk_spec\": disk_spec,\n", "        \"python_package_spec\": {\n", "            \"executor_image_uri\": TRAIN_IMAGE,\n", "            \"package_uris\": [BUCKET_NAME + \"/trainer_cifar10.tar.gz\"],\n", "            \"python_module\": \"trainer.task\",\n", "            \"args\": CMDARGS,\n", "        },\n", "    }\n", "]" ] }, { "cell_type": "markdown", "metadata": { "id": "create_study_spec:simple" }, "source": [ "### Create a study specification\n", "\n", "Let's start with a simple study. You will just use a single parameter -- the *learning rate*. 
Since it's just one parameter, it doesn't make much sense to do a random search. Instead, you will do a grid search over a list of values.\n", "\n", "- `metrics`:\n", "    - `metric_id`: In this example, the objective metric to report back is `'val_accuracy'`.\n", "    - `goal`: In this example, the hyperparameter tuning service will evaluate trials to maximize the value of the objective metric.\n", "- `parameters`: The specification for the hyperparameters to tune.\n", "    - `parameter_id`: The name of the hyperparameter that will be passed to the Python package as a command line argument.\n", "    - `scale_type`: The scale type determines the resolution the hyperparameter tuning service uses when searching over the search space.\n", "        - `UNIT_LINEAR_SCALE`: Uses a resolution that is the same everywhere in the search space.\n", "        - `UNIT_LOG_SCALE`: Values close to the bottom of the search space are further away.\n", "        - `UNIT_REVERSE_LOG_SCALE`: Values close to the top of the search space are further away.\n", "    - **search space**: This is where you will specify the search space of values for the hyperparameter to select for tuning.\n", "        - `integer_value_spec`: Specifies an integer range of values between a `min_value` and `max_value`.\n", "        - `double_value_spec`: Specifies a continuous range of values between a `min_value` and `max_value`.\n", "        - `discrete_value_spec`: Specifies a list of values.\n", "- `algorithm`: The search method for selecting hyperparameter values per trial:\n", "    - `GRID_SEARCH`: Searches every combination of the specified values -- which is used in this example.\n", "    - `RANDOM_SEARCH`: Random search.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "create_study_spec:simple" }, "outputs": [], "source": [ "study_spec = {\n", "    \"metrics\": [\n", "        {\n", "            \"metric_id\": \"val_accuracy\",\n", "            \"goal\": aip.StudySpec.MetricSpec.GoalType.MAXIMIZE,\n", "        }\n", "    ],\n", "    \"parameters\": [\n", "        {\n", "            \"parameter_id\": \"lr\",\n", " 
\"discrete_value_spec\": {\"values\": [0.001, 0.01, 0.1]},\n", " \"scale_type\": aip.StudySpec.ParameterSpec.ScaleType.UNIT_LINEAR_SCALE,\n", " }\n", " ],\n", " \"algorithm\": aip.StudySpec.Algorithm.GRID_SEARCH,\n", "}" ] }, { "cell_type": "markdown", "metadata": { "id": "assemble_custom_hpt_job_specification" }, "source": [ "### Assemble a hyperparameter tuning job specification\n", "\n", "Now assemble the complete description for the custom hyperparameter tuning specification:\n", "\n", "- `display_name`: The human readable name you assign to this custom hyperparameter tuning job.\n", "- `trial_job_spec`: The specification for the custom hyperparameter tuning job.\n", "- `study_spec`: The specification for what to tune.\n", "- `max_trial_count`: The maximum number of tuning trials.\n", "- `parallel_trial_count`: How many trials to try in parallel; otherwise, they are done sequentially." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "assemble_custom_hpt_job_specification" }, "outputs": [], "source": [ "hpt_job = {\n", " \"display_name\": JOB_NAME,\n", " \"trial_job_spec\": {\"worker_pool_specs\": worker_pool_spec},\n", " \"study_spec\": study_spec,\n", " \"max_trial_count\": 6,\n", " \"parallel_trial_count\": 1,\n", "}" ] }, { "cell_type": "markdown", "metadata": { "id": "examine_training_package" }, "source": [ "### Examine the hyperparameter tuning package\n", "\n", "#### Package layout\n", "\n", "Before you start the hyperparameter tuning, you will look at how a Python package is assembled for a custom hyperparameter tuning job. 
When unarchived, the package contains the following directory/file layout.\n", "\n", "- PKG-INFO\n", "- README.md\n", "- setup.cfg\n", "- setup.py\n", "- trainer\n", "    - \_\_init\_\_.py\n", "    - task.py\n", "\n", "The files `setup.cfg` and `setup.py` are the instructions for installing the package into the operating environment of the Docker image.\n", "\n", "The file `trainer/task.py` is the Python script for executing the custom hyperparameter tuning job. *Note*, when you referred to it in the worker pool specification, you replaced the directory slash with a dot (`trainer.task`) and dropped the file suffix (`.py`).\n", "\n", "#### Package Assembly\n", "\n", "In the following cells, you will assemble the training package." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "examine_training_package" }, "outputs": [], "source": [ "# Make folder for Python hyperparameter tuning script\n", "! rm -rf custom\n", "! mkdir custom\n", "\n", "# Add package information\n", "! touch custom/README.md\n", "\n", "setup_cfg = \"[egg_info]\\n\\ntag_build =\\n\\ntag_date = 0\"\n", "! echo \"$setup_cfg\" > custom/setup.cfg\n", "\n", "setup_py = \"import setuptools\\n\\nsetuptools.setup(\\n\\n install_requires=[\\n\\n 'tensorflow_datasets==1.3.0',\\n\\n ],\\n\\n packages=setuptools.find_packages())\"\n", "! echo \"$setup_py\" > custom/setup.py\n", "\n", "pkg_info = \"Metadata-Version: 1.0\\n\\nName: CIFAR10 image classification\\n\\nVersion: 0.0.0\\n\\nSummary: Demonstration hyperparameter tuning script\\n\\nHome-page: www.google.com\\n\\nAuthor: Google\\n\\nAuthor-email: aferlitsch@google.com\\n\\nLicense: Public\\n\\nDescription: Demo\\n\\nPlatform: Vertex\"\n", "! echo \"$pkg_info\" > custom/PKG-INFO\n", "\n", "# Make the training subfolder\n", "! mkdir custom/trainer\n", "! 
touch custom/trainer/__init__.py" ] }, { "cell_type": "markdown", "metadata": { "id": "taskpy_contents:hpt,simple" }, "source": [ "#### Task.py contents\n", "\n", "In the next cell, you write the contents of the hyperparameter tuning script task.py. This tutorial does not go into detail; the script is there for you to browse. In summary, the script:\n", "\n", "- Receives the hyperparameter value for a trial as a command-line argument (`parser.add_argument('--lr',...)`).\n", "- Mimics a training loop: on each iteration (epoch), the variable `acc` is set to the epoch number multiplied by the learning rate.\n", "- Reports the objective metric `val_accuracy` back to the hyperparameter tuning service using `report_hyperparameter_tuning_metric()`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "taskpy_contents:hpt,simple" }, "outputs": [], "source": [ "%%writefile custom/trainer/task.py\n", "# HP Tuning hello world example\n", "\n", "from __future__ import absolute_import, division, print_function, unicode_literals\n", "import tensorflow_datasets as tfds\n", "import tensorflow as tf\n", "from tensorflow.python.client import device_lib\n", "from hypertune import HyperTune\n", "import argparse\n", "import os\n", "import sys\n", "import time\n", "tfds.disable_progress_bar()\n", "\n", "parser = argparse.ArgumentParser()\n", "parser.add_argument('--lr', dest='lr',\n", " default=0.001, type=float,\n", " help='Learning rate.')\n", "parser.add_argument('--epochs', dest='epochs',\n", " default=10, type=int,\n", " help='Number of epochs.')\n", "parser.add_argument('--steps', dest='steps',\n", " default=200, type=int,\n", " help='Number of steps per epoch.')\n", "parser.add_argument('--model-dir',\n", " dest='model_dir',\n", " default='/tmp/saved_model',\n", " type=str,\n", " help='Model dir.')\n", "parser.add_argument('--distribute', dest='distribute', type=str, default='single',\n", " help='distributed training strategy')\n", "args = parser.parse_args()\n", "\n", "print('Python Version = 
{}'.format(sys.version))\n", "print('TensorFlow Version = {}'.format(tf.__version__))\n", "print('TF_CONFIG = {}'.format(os.environ.get('TF_CONFIG', 'Not found')))\n", "print(device_lib.list_local_devices())\n", "\n", "# Instantiate the HyperTune reporting object\n", "hpt = HyperTune()\n", "\n", "for epoch in range(1, args.epochs+1):\n", " # mimic metric result at the end of an epoch\n", " acc = args.lr * epoch\n", " # save the metric result to communicate back to the HPT service\n", " hpt.report_hyperparameter_tuning_metric(\n", " hyperparameter_metric_tag='val_accuracy',\n", " metric_value=acc,\n", " global_step=epoch)\n", " print('epoch: {}, accuracy: {}'.format(epoch, acc))\n", " time.sleep(1)" ] }, { "cell_type": "markdown", "metadata": { "id": "tarball_training_script" }, "source": [ "#### Store hyperparameter tuning script on your Cloud Storage bucket\n", "\n", "Next, you package the hyperparameter tuning folder into a compressed tarball, and then store it in your Cloud Storage bucket." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "tarball_training_script" }, "outputs": [], "source": [ "! rm -f custom.tar custom.tar.gz\n", "! tar cvf custom.tar custom\n", "! gzip custom.tar\n", "! gsutil cp custom.tar.gz $BUCKET_NAME/trainer_cifar10.tar.gz" ] }, { "cell_type": "markdown", "metadata": { "id": "report_hypertune" }, "source": [ "#### Reporting back the result of the trial using hypertune\n", "\n", "For each trial, your Python script needs to report back to the hyperparameter tuning service the objective metric that you specified as the criterion for evaluating the trial.\n", "\n", "In this example, the study specification names the objective metric `val_accuracy`, so the script must report the metric under that name.\n", "\n", "You report back the value of the objective metric using `HyperTune`. This Python module is used to communicate key/value pairs to the hyperparameter tuning service. 
To set up this reporting in your Python package, you will add code for the following three steps:\n", "\n", "1. Import the HyperTune class: `from hypertune import HyperTune`.\n", "2. Instantiate the reporting object: `hpt = HyperTune()`.\n", "3. At the end of every epoch, write the current value of the objective function to the log as a key/value pair using `hpt.report_hyperparameter_tuning_metric()`. In this example, the parameters are:\n", " - `hyperparameter_metric_tag`: The name of the objective metric to report back. The name must be identical to the name specified in the study specification.\n", " - `metric_value`: The value of the objective metric to report back to the hyperparameter service.\n", " - `global_step`: The epoch iteration for which the metric is reported." ] }, { "cell_type": "markdown", "metadata": { "id": "tune_custom_job" }, "source": [ "## Hyperparameter Tune the model\n", "\n", "Now start the hyperparameter tuning of your custom model on Vertex. Use this helper function `create_hyperparameter_tuning_job`, which takes the following parameter:\n", "\n", "- `hpt_job`: The specification for the hyperparameter tuning job.\n", "\n", "The helper function calls the job client service's `create_hyperparameter_tuning_job` method, with the following parameters:\n", "\n", "- `parent`: The Vertex location path to `Dataset`, `Model` and `Endpoint` resources.\n", "- `hyperparameter_tuning_job`: The specification for the hyperparameter tuning job.\n", "\n", "You will display a handful of the fields returned in the `response` object. The two of most interest are:\n", "\n", "`response.name`: The Vertex fully qualified identifier assigned to this custom hyperparameter tuning job. You save this identifier for use in subsequent steps.\n", "\n", "`response.state`: The current state of the custom hyperparameter tuning job."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "tune_custom_job" }, "outputs": [], "source": [ "def create_hyperparameter_tuning_job(hpt_job):\n", " response = clients[\"job\"].create_hyperparameter_tuning_job(\n", " parent=PARENT, hyperparameter_tuning_job=hpt_job\n", " )\n", " print(\"name:\", response.name)\n", " print(\"display_name:\", response.display_name)\n", " print(\"state:\", response.state)\n", " print(\"create_time:\", response.create_time)\n", " print(\"update_time:\", response.update_time)\n", " return response\n", "\n", "\n", "response = create_hyperparameter_tuning_job(hpt_job)" ] }, { "cell_type": "markdown", "metadata": { "id": "hpt_job_id:response" }, "source": [ "Now get the unique identifier for the hyperparameter tuning job you created." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "hpt_job_id:response" }, "outputs": [], "source": [ "# The full unique ID for the hyperparameter tuning job\n", "hpt_job_id = response.name\n", "# The short numeric ID for the hyperparameter tuning job\n", "hpt_job_short_id = hpt_job_id.split(\"/\")[-1]\n", "\n", "print(hpt_job_id)" ] }, { "cell_type": "markdown", "metadata": { "id": "get_hpt_job" }, "source": [ "### Get information on a hyperparameter tuning job\n", "\n", "Next, use this helper function `get_hyperparameter_tuning_job`, which takes the following parameter:\n", "\n", "- `name`: The Vertex fully qualified identifier for the hyperparameter tuning job.\n", "\n", "The helper function calls the job client service's `get_hyperparameter_tuning_job` method, with the following parameter:\n", "\n", "- `name`: The Vertex fully qualified identifier for the hyperparameter tuning job.\n", "\n", "If you recall, you got the Vertex fully qualified identifier for the hyperparameter tuning job in the `response.name` field when you called the `create_hyperparameter_tuning_job` method, and saved the identifier in the variable `hpt_job_id`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "get_hpt_job" }, "outputs": [], "source": [ "def get_hyperparameter_tuning_job(name, silent=False):\n", " response = clients[\"job\"].get_hyperparameter_tuning_job(name=name)\n", " if silent:\n", " return response\n", "\n", " print(\"name:\", response.name)\n", " print(\"display_name:\", response.display_name)\n", " print(\"state:\", response.state)\n", " print(\"create_time:\", response.create_time)\n", " print(\"update_time:\", response.update_time)\n", " return response\n", "\n", "\n", "response = get_hyperparameter_tuning_job(hpt_job_id)" ] }, { "cell_type": "markdown", "metadata": { "id": "wait_tuning_complete" }, "source": [ "## Wait for tuning to complete\n", "\n", "Hyperparameter tuning the above model may take upwards of 20 minutes.\n", "\n", "Once your model is done tuning, you can calculate the actual time it took to tune the model by subtracting `start_time` from `end_time`.\n", "\n", "For your model, you will need to know the location of the saved models for each trial, which the Python script saved in your Cloud Storage bucket at `MODEL_DIR + '/<trial_number>/saved_model.pb'`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "wait_tuning_complete" }, "outputs": [], "source": [ "while True:\n", " job_response = get_hyperparameter_tuning_job(hpt_job_id, True)\n", " if job_response.state != aip.JobState.JOB_STATE_SUCCEEDED:\n", " print(\"Study trials have not completed:\", job_response.state)\n", " if job_response.state == aip.JobState.JOB_STATE_FAILED:\n", " break\n", " else:\n", " if not DIRECT:\n", " MODEL_DIR = MODEL_DIR + \"/model\"\n", " print(\"Study trials have completed\")\n", " break\n", " time.sleep(60)" ] }, { "cell_type": "markdown", "metadata": { "id": "review_study_results" }, "source": [ "### Review the results of the study\n", "\n", "Now review the results of trials."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "review_study_results" }, "outputs": [], "source": [ "best = (None, None, None, 0.0)\n", "for trial in job_response.trials:\n", " print(trial)\n", " # Keep track of the best outcome\n", " if float(trial.final_measurement.metrics[0].value) > best[3]:\n", " try:\n", " best = (\n", " trial.id,\n", " float(trial.parameters[0].value),\n", " float(trial.parameters[1].value),\n", " float(trial.final_measurement.metrics[0].value),\n", " )\n", " except:\n", " best = (\n", " trial.id,\n", " float(trial.parameters[0].value),\n", " None,\n", " float(trial.final_measurement.metrics[0].value),\n", " )" ] }, { "cell_type": "markdown", "metadata": { "id": "best_trial" }, "source": [ "### Best trial\n", "\n", "Now look at which trial was the best:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "best_trial" }, "outputs": [], "source": [ "print(\"ID\", best[0])\n", "print(\"Learning Rate\", best[1])\n", "print(\"Decay\", best[2])\n", "print(\"Validation Accuracy\", best[3])" ] }, { "cell_type": "markdown", "metadata": { "id": "get_best_model" }, "source": [ "## Get the Best Model\n", "\n", "If you used the method of having the service tell the tuning script where to save the model artifacts (`DIRECT = False`), then the model artifacts for the best model are saved at:\n", "\n", " MODEL_DIR/<best_trial_id>/model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "get_best_model" }, "outputs": [], "source": [ "BEST_MODEL_DIR = MODEL_DIR + \"/\" + best[0] + \"/model\"" ] }, { "cell_type": "markdown", "metadata": { "id": "tune_custom_model:random" }, "source": [ "## Tuning a model - CIFAR10\n", "\n", "Now that you have seen the overall steps for hyperparameter tuning a custom training job using a Python package that mimics training a model, you will do a new hyperparameter tuning job for a custom training job for a CIFAR10 model.\n", "\n", "For this example, you will 
change two parts:\n", "\n", "1. Specify the CIFAR10 custom hyperparameter tuning Python package.\n", "2. Specify a study specification specific to the hyperparameters used in the CIFAR10 custom hyperparameter tuning Python package." ] }, { "cell_type": "markdown", "metadata": { "id": "create_study_spec:random" }, "source": [ "### Create a study specification\n", "\n", "In this study, you will tune for two hyperparameters using the random search algorithm:\n", "\n", "- **learning rate**: The search space is a set of discrete values.\n", "- **learning rate decay**: The search space is a continuous range between 1e-6 and 1e-2.\n", "\n", "The objective (goal) is to maximize the validation accuracy.\n", "\n", "You will run a maximum of six trials." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "create_study_spec:random" }, "outputs": [], "source": [ "study_spec = {\n", " \"metrics\": [\n", " {\n", " \"metric_id\": \"val_accuracy\",\n", " \"goal\": aip.StudySpec.MetricSpec.GoalType.MAXIMIZE,\n", " }\n", " ],\n", " \"parameters\": [\n", " {\n", " \"parameter_id\": \"lr\",\n", " \"discrete_value_spec\": {\"values\": [0.001, 0.01, 0.1]},\n", " \"scale_type\": aip.StudySpec.ParameterSpec.ScaleType.UNIT_LINEAR_SCALE,\n", " },\n", " {\n", " \"parameter_id\": \"decay\",\n", " \"double_value_spec\": {\"min_value\": 1e-6, \"max_value\": 1e-2},\n", " \"scale_type\": aip.StudySpec.ParameterSpec.ScaleType.UNIT_LINEAR_SCALE,\n", " },\n", " ],\n", " \"algorithm\": aip.StudySpec.Algorithm.RANDOM_SEARCH,\n", "}" ] }, { "cell_type": "markdown", "metadata": { "id": "assemble_custom_hpt_job_specification" }, "source": [ "### Assemble a hyperparameter tuning job specification\n", "\n", "Now assemble the complete description for the custom hyperparameter tuning specification:\n", "\n", "- `display_name`: The human readable name you assign to this custom hyperparameter tuning job.\n", "- `trial_job_spec`: The specification for the custom hyperparameter tuning 
job.\n", "- `study_spec`: The specification for what to tune.\n", "- `max_trial_count`: The maximum number of tuning trials.\n", "- `parallel_trial_count`: The number of trials to run in parallel; with a value of 1, trials run sequentially." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "assemble_custom_hpt_job_specification" }, "outputs": [], "source": [ "hpt_job = {\n", " \"display_name\": JOB_NAME,\n", " \"trial_job_spec\": {\"worker_pool_specs\": worker_pool_spec},\n", " \"study_spec\": study_spec,\n", " \"max_trial_count\": 6,\n", " \"parallel_trial_count\": 1,\n", "}" ] }, { "cell_type": "markdown", "metadata": { "id": "taskpy_contents:hpt,cifar10" }, "source": [ "#### Task.py contents\n", "\n", "In the next cell, you write the contents of the hyperparameter tuning script task.py. This tutorial does not go into detail; the script is there for you to browse. In summary, the script does the following:\n", "\n", "- Parse the command line arguments for the hyperparameter settings for the current trial.\n", " - Get the directory where to save the model artifacts from the command line (`--model-dir`), and if not specified, then from the environment variable `AIP_MODEL_DIR`.\n", "- Download and preprocess the CIFAR10 dataset.\n", "- Build a CNN model.\n", "- Compile the model using the learning rate and decay hyperparameter values.\n", "- Define a callback `HPTCallback` that obtains the validation accuracy at the end of each epoch (`on_epoch_end()`) and reports it to the hyperparameter tuning service using `hpt.report_hyperparameter_tuning_metric()`.\n", "- Train the model with the `fit()` method, passing the callback so that the validation accuracy is reported back to the hyperparameter tuning service."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "taskpy_contents:hpt,cifar10" }, "outputs": [], "source": [ "%%writefile custom/trainer/task.py\n", "# Custom Job for CIFAR10\n", "\n", "import tensorflow_datasets as tfds\n", "import tensorflow as tf\n", "from hypertune import HyperTune\n", "import argparse\n", "import os\n", "import sys\n", "\n", "# Command Line arguments\n", "parser = argparse.ArgumentParser()\n", "parser.add_argument('--model-dir', dest='model_dir',\n", " default=os.getenv('AIP_MODEL_DIR'), type=str, help='Model dir.')\n", "parser.add_argument('--lr', dest='lr',\n", " default=0.001, type=float,\n", " help='Learning rate.')\n", "parser.add_argument('--decay', dest='decay',\n", " default=0.98, type=float,\n", " help='Decay rate')\n", "parser.add_argument('--epochs', dest='epochs',\n", " default=10, type=int,\n", " help='Number of epochs.')\n", "parser.add_argument('--steps', dest='steps',\n", " default=200, type=int,\n", " help='Number of steps per epoch.')\n", "parser.add_argument('--distribute', dest='distribute', type=str, default='single',\n", " help='distributed training strategy')\n", "args = parser.parse_args()\n", "\n", "\n", "# Scaling CIFAR-10 data from [0, 255] to [0., 1.]\n", "def scale(image, label):\n", " image = tf.cast(image, tf.float32)\n", " image /= 255.0\n", " return image, label\n", "\n", "\n", "# Download the dataset\n", "datasets = tfds.load(name='cifar10', as_supervised=True)\n", "\n", "# Preparing dataset\n", "BUFFER_SIZE = 10000\n", "BATCH_SIZE = 64\n", "train_dataset = datasets['train'].map(scale).shuffle(BUFFER_SIZE).batch(BATCH_SIZE)\n", "test_dataset = datasets['test'].map(scale).batch(BATCH_SIZE)\n", "\n", "# Build the Keras model\n", "def build_and_compile_cnn_model():\n", " model = tf.keras.Sequential([\n", " tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(32, 32, 3)),\n", " tf.keras.layers.MaxPooling2D(),\n", " tf.keras.layers.Conv2D(32, 3, activation='relu'),\n", " 
tf.keras.layers.MaxPooling2D(),\n", " tf.keras.layers.Flatten(),\n", " tf.keras.layers.Dense(10, activation='softmax')\n", " ])\n", " model.compile(\n", " loss=tf.keras.losses.sparse_categorical_crossentropy,\n", " optimizer=tf.keras.optimizers.SGD(learning_rate=args.lr, decay=args.decay),\n", " metrics=['accuracy'])\n", " return model\n", "\n", "\n", "model = build_and_compile_cnn_model()\n", "\n", "# Instantiate the HyperTune reporting object\n", "hpt = HyperTune()\n", "\n", "# Reporting callback\n", "class HPTCallback(tf.keras.callbacks.Callback):\n", "\n", " def on_epoch_end(self, epoch, logs=None):\n", " global hpt\n", " hpt.report_hyperparameter_tuning_metric(\n", " hyperparameter_metric_tag='val_accuracy',\n", " metric_value=logs['val_accuracy'],\n", " global_step=epoch)\n", "\n", "# Train the model\n", "model.fit(train_dataset, epochs=5, steps_per_epoch=10, validation_data=test_dataset.take(8),\n", " callbacks=[HPTCallback()])\n", "model.save(args.model_dir)" ] }, { "cell_type": "markdown", "metadata": { "id": "tarball_training_script" }, "source": [ "#### Store hyperparameter tuning script on your Cloud Storage bucket\n", "\n", "Next, you package the hyperparameter tuning folder into a compressed tarball, and then store it in your Cloud Storage bucket." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "tarball_training_script" }, "outputs": [], "source": [ "! rm -f custom.tar custom.tar.gz\n", "! tar cvf custom.tar custom\n", "! gzip custom.tar\n", "! 
gsutil cp custom.tar.gz $BUCKET_NAME/trainer_cifar10.tar.gz" ] }, { "cell_type": "markdown", "metadata": { "id": "report_hypertune" }, "source": [ "#### Reporting back the result of the trial using hypertune\n", "\n", "For each trial, your Python script needs to report back to the hyperparameter tuning service the objective metric that you specified as the criterion for evaluating the trial.\n", "\n", "In this example, the study specification names the objective metric `val_accuracy`, so the script must report the metric under that name.\n", "\n", "You report back the value of the objective metric using `HyperTune`. This Python module is used to communicate key/value pairs to the hyperparameter tuning service. To set up this reporting in your Python package, you will add code for the following three steps:\n", "\n", "1. Import the HyperTune class: `from hypertune import HyperTune`.\n", "2. Instantiate the reporting object: `hpt = HyperTune()`.\n", "3. At the end of every epoch, write the current value of the objective function to the log as a key/value pair using `hpt.report_hyperparameter_tuning_metric()`. In this example, the parameters are:\n", " - `hyperparameter_metric_tag`: The name of the objective metric to report back. The name must be identical to the name specified in the study specification.\n", " - `metric_value`: The value of the objective metric to report back to the hyperparameter service.\n", " - `global_step`: The epoch iteration, starting at 0." ] }, { "cell_type": "markdown", "metadata": { "id": "tune_custom_job" }, "source": [ "## Hyperparameter Tune the model\n", "\n", "Now start the hyperparameter tuning of your custom model on Vertex. 
Use this helper function `create_hyperparameter_tuning_job`, which takes the following parameter:\n", "\n", "- `hpt_job`: The specification for the hyperparameter tuning job.\n", "\n", "The helper function calls the job client service's `create_hyperparameter_tuning_job` method, with the following parameters:\n", "\n", "- `parent`: The Vertex location path to `Dataset`, `Model` and `Endpoint` resources.\n", "- `hyperparameter_tuning_job`: The specification for the hyperparameter tuning job.\n", "\n", "You will display a handful of the fields returned in the `response` object. The two of most interest are:\n", "\n", "`response.name`: The Vertex fully qualified identifier assigned to this custom hyperparameter tuning job. You save this identifier for use in subsequent steps.\n", "\n", "`response.state`: The current state of the custom hyperparameter tuning job." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "tune_custom_job" }, "outputs": [], "source": [ "def create_hyperparameter_tuning_job(hpt_job):\n", " response = clients[\"job\"].create_hyperparameter_tuning_job(\n", " parent=PARENT, hyperparameter_tuning_job=hpt_job\n", " )\n", " print(\"name:\", response.name)\n", " print(\"display_name:\", response.display_name)\n", " print(\"state:\", response.state)\n", " print(\"create_time:\", response.create_time)\n", " print(\"update_time:\", response.update_time)\n", " return response\n", "\n", "\n", "response = create_hyperparameter_tuning_job(hpt_job)" ] }, { "cell_type": "markdown", "metadata": { "id": "job_id:response" }, "source": [ "Now get the unique identifier for the hyperparameter tuning job you created."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "job_id:response" }, "outputs": [], "source": [ "# The full unique ID for the hyperparameter tuning job\n", "hpt_job_id = response.name\n", "# The short numeric ID for the hyperparameter tuning job\n", "hpt_job_short_id = hpt_job_id.split(\"/\")[-1]\n", "\n", "print(hpt_job_id)" ] }, { "cell_type": "markdown", "metadata": { "id": "get_hpt_job" }, "source": [ "### Get information on a hyperparameter tuning job\n", "\n", "Next, use this helper function `get_hyperparameter_tuning_job`, which takes the following parameter:\n", "\n", "- `name`: The Vertex fully qualified identifier for the hyperparameter tuning job.\n", "\n", "The helper function calls the job client service's `get_hyperparameter_tuning_job` method, with the following parameter:\n", "\n", "- `name`: The Vertex fully qualified identifier for the hyperparameter tuning job.\n", "\n", "If you recall, you got the Vertex fully qualified identifier for the hyperparameter tuning job in the `response.name` field when you called the `create_hyperparameter_tuning_job` method, and saved the identifier in the variable `hpt_job_id`."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "get_hpt_job" }, "outputs": [], "source": [ "def get_hyperparameter_tuning_job(name, silent=False):\n", " response = clients[\"job\"].get_hyperparameter_tuning_job(name=name)\n", " if silent:\n", " return response\n", "\n", " print(\"name:\", response.name)\n", " print(\"display_name:\", response.display_name)\n", " print(\"state:\", response.state)\n", " print(\"create_time:\", response.create_time)\n", " print(\"update_time:\", response.update_time)\n", " return response\n", "\n", "\n", "response = get_hyperparameter_tuning_job(hpt_job_id)" ] }, { "cell_type": "markdown", "metadata": { "id": "wait_tuning_complete" }, "source": [ "## Wait for tuning to complete\n", "\n", "Hyperparameter tuning the above model may take upwards of 20 minutes.\n", "\n", "Once your model is done tuning, you can calculate the actual time it took to tune the model by subtracting `start_time` from `end_time`.\n", "\n", "For your model, you will need to know the location of the saved models for each trial, which the Python script saved in your Cloud Storage bucket at `MODEL_DIR + '/<trial_number>/saved_model.pb'`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "wait_tuning_complete" }, "outputs": [], "source": [ "while True:\n", " job_response = get_hyperparameter_tuning_job(hpt_job_id, True)\n", " if job_response.state != aip.JobState.JOB_STATE_SUCCEEDED:\n", " print(\"Study trials have not completed:\", job_response.state)\n", " if job_response.state == aip.JobState.JOB_STATE_FAILED:\n", " break\n", " else:\n", " if not DIRECT:\n", " MODEL_DIR = MODEL_DIR + \"/model\"\n", " print(\"Study trials have completed\")\n", " break\n", " time.sleep(60)" ] }, { "cell_type": "markdown", "metadata": { "id": "review_study_results" }, "source": [ "### Review the results of the study\n", "\n", "Now review the results of trials."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "review_study_results" }, "outputs": [], "source": [ "best = (None, None, None, 0.0)\n", "for trial in job_response.trials:\n", " print(trial)\n", " # Keep track of the best outcome\n", " if float(trial.final_measurement.metrics[0].value) > best[3]:\n", " try:\n", " best = (\n", " trial.id,\n", " float(trial.parameters[0].value),\n", " float(trial.parameters[1].value),\n", " float(trial.final_measurement.metrics[0].value),\n", " )\n", " except:\n", " best = (\n", " trial.id,\n", " float(trial.parameters[0].value),\n", " None,\n", " float(trial.final_measurement.metrics[0].value),\n", " )" ] }, { "cell_type": "markdown", "metadata": { "id": "best_trial" }, "source": [ "### Best trial\n", "\n", "Now look at which trial was the best:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "best_trial" }, "outputs": [], "source": [ "print(\"ID\", best[0])\n", "print(\"Learning Rate\", best[1])\n", "print(\"Decay\", best[2])\n", "print(\"Validation Accuracy\", best[3])" ] }, { "cell_type": "markdown", "metadata": { "id": "get_best_model" }, "source": [ "## Get the Best Model\n", "\n", "If you used the method of having the service tell the tuning script where to save the model artifacts (`DIRECT = False`), then the model artifacts for the best model are saved at:\n", "\n", " MODEL_DIR/<best_trial_id>/model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "get_best_model" }, "outputs": [], "source": [ "BEST_MODEL_DIR = MODEL_DIR + \"/\" + best[0] + \"/model\"" ] }, { "cell_type": "markdown", "metadata": { "id": "load_saved_model" }, "source": [ "## Load the saved model\n", "\n", "Your model is stored in a TensorFlow SavedModel format in a Cloud Storage bucket. 
Now load it from the Cloud Storage bucket so that you can evaluate the model and make predictions.\n", "\n", "To load the model, use the TF.Keras `tf.keras.models.load_model()` method, passing it the Cloud Storage path where the model is saved, specified by `MODEL_DIR`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "load_saved_model" }, "outputs": [], "source": [ "import tensorflow as tf\n", "\n", "model = tf.keras.models.load_model(MODEL_DIR)" ] }, { "cell_type": "markdown", "metadata": { "id": "evaluate_custom_model:image" }, "source": [ "## Evaluate the model\n", "\n", "Now find out how good the model is.\n", "\n", "### Load evaluation data\n", "\n", "You will load the CIFAR10 test (holdout) data from `tf.keras.datasets`, using the method `load_data()`. This will return the dataset as a tuple of two elements. The first element is the training data and the second is the test data. Each element is also a tuple of two elements: the image data, and the corresponding labels.\n", "\n", "You don't need the training data, which is why it is loaded as `(_, _)`.\n", "\n", "Before you can run the data through evaluation, you need to preprocess it:\n", "\n", "x_test:\n", "1. Normalize (rescale) the pixel data by dividing each pixel by 255. This replaces each single-byte integer pixel with a 32-bit floating point number between 0 and 1.\n", "\n", "y_test:<br/>\n", "2. The labels are currently scalar (sparse). If you look back at the `compile()` step in the `trainer/task.py` script, you will find that it was compiled for sparse labels. So you don't need to do anything more."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "evaluate_custom_model:image,cifar10" }, "outputs": [], "source": [ "import numpy as np\n", "from tensorflow.keras.datasets import cifar10\n", "\n", "(_, _), (x_test, y_test) = cifar10.load_data()\n", "x_test = (x_test / 255.0).astype(np.float32)\n", "\n", "print(x_test.shape, y_test.shape)" ] }, { "cell_type": "markdown", "metadata": { "id": "perform_evaluation_custom" }, "source": [ "### Perform the model evaluation\n", "\n", "Now evaluate how well the model in the custom job did." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "perform_evaluation_custom" }, "outputs": [], "source": [ "model.evaluate(x_test, y_test)" ] }, { "cell_type": "markdown", "metadata": { "id": "cleanup" }, "source": [ "# Cleaning up\n", "\n", "To clean up all GCP resources used in this project, you can [delete the GCP\n", "project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.\n", "\n", "Otherwise, you can delete the individual resources you created in this tutorial:\n", "\n", "- Dataset\n", "- Pipeline\n", "- Model\n", "- Endpoint\n", "- Batch Job\n", "- Custom Job\n", "- Hyperparameter Tuning Job\n", "- Cloud Storage Bucket" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "cleanup" }, "outputs": [], "source": [ "delete_dataset = True\n", "delete_pipeline = True\n", "delete_model = True\n", "delete_endpoint = True\n", "delete_batchjob = True\n", "delete_customjob = True\n", "delete_hptjob = True\n", "delete_bucket = True\n", "\n", "# Delete the dataset using the Vertex fully qualified identifier for the dataset\n", "try:\n", " if delete_dataset and \"dataset_id\" in globals():\n", " clients[\"dataset\"].delete_dataset(name=dataset_id)\n", "except Exception as e:\n", " print(e)\n", "\n", "# Delete the training pipeline using the Vertex fully qualified identifier for the pipeline\n", 
"try:\n", " if delete_pipeline and \"pipeline_id\" in globals():\n", " clients[\"pipeline\"].delete_training_pipeline(name=pipeline_id)\n", "except Exception as e:\n", " print(e)\n", "\n", "# Delete the model using the Vertex fully qualified identifier for the model\n", "try:\n", " if delete_model and \"model_to_deploy_id\" in globals():\n", " clients[\"model\"].delete_model(name=model_to_deploy_id)\n", "except Exception as e:\n", " print(e)\n", "\n", "# Delete the endpoint using the Vertex fully qualified identifier for the endpoint\n", "try:\n", " if delete_endpoint and \"endpoint_id\" in globals():\n", " clients[\"endpoint\"].delete_endpoint(name=endpoint_id)\n", "except Exception as e:\n", " print(e)\n", "\n", "# Delete the batch job using the Vertex fully qualified identifier for the batch job\n", "try:\n", " if delete_batchjob and \"batch_job_id\" in globals():\n", " clients[\"job\"].delete_batch_prediction_job(name=batch_job_id)\n", "except Exception as e:\n", " print(e)\n", "\n", "# Delete the custom job using the Vertex fully qualified identifier for the custom job\n", "try:\n", " if delete_customjob and \"job_id\" in globals():\n", " clients[\"job\"].delete_custom_job(name=job_id)\n", "except Exception as e:\n", " print(e)\n", "\n", "# Delete the hyperparameter tuning job using the Vertex fully qualified identifier for the hyperparameter tuning job\n", "try:\n", " if delete_hptjob and \"hpt_job_id\" in globals():\n", " clients[\"job\"].delete_hyperparameter_tuning_job(name=hpt_job_id)\n", "except Exception as e:\n", " print(e)\n", "\n", "if delete_bucket and \"BUCKET_NAME\" in globals():\n", " ! gsutil rm -r $BUCKET_NAME" ] } ], "metadata": { "colab": { "name": "showcase_hyperparmeter_tuning_image_classification.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }