notebooks/official/automl/get_started_automl

{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "id": "copyright" }, "outputs": [], "source": [ "# Copyright 2022 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "title:generic,gcp" }, "source": [ "# Get started with AutoML training\n", "\n", "<table align=\"left\">\n", " <td style=\"text-align: center\">\n", " <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/automl/get_started_automl_training.ipynb\">\n", " <img src=\"https://cloud.google.com/ml-engine/images/colab-logo-32px.png\" alt=\"Google Colaboratory logo\"><br> Open in Colab\n", " </a>\n", " </td>\n", " <td style=\"text-align: center\">\n", " <a href=\"https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fvertex-ai-samples%2Fmain%2Fnotebooks%2Fofficial%2Fautoml%2Fget_started_automl_training.ipynb\">\n", " <img width=\"32px\" src=\"https://cloud.google.com/ml-engine/images/colab-enterprise-logo-32px.png\" alt=\"Google Cloud Colab Enterprise logo\"><br> Open in Colab Enterprise\n", " </a>\n", " </td> \n", " <td style=\"text-align: center\">\n", " <a 
href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/automl/get_started_automl_training.ipynb\">\n", " <img src=\"https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32\" alt=\"Vertex AI logo\"><br> Open in Workbench\n", " </a>\n", " </td>\n", " <td style=\"text-align: center\">\n", " <a href=\"https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/automl/get_started_automl_training.ipynb\">\n", " <img src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" alt=\"GitHub logo\"><br> View on GitHub\n", " </a>\n", " </td>\n", "</table>" ] }, { "cell_type": "markdown", "metadata": { "id": "overview:mlops" }, "source": [ "## Overview\n", "\n", "\n", "This tutorial demonstrates how to use AutoML in production. It covers getting started with AutoML training.\n", "\n", "Learn more about [AutoML training](https://cloud.google.com/vertex-ai/docs/training-overview)." 
] }, { "cell_type": "markdown", "metadata": { "id": "objective:mlops,stage2,get_started_automl_training" }, "source": [ "### Objective\n", "\n", "In this tutorial, you learn how to use `AutoML` for training with `Vertex AI`.\n", "\n", "This tutorial uses the following Google Cloud ML services:\n", "\n", "- `AutoML training`\n", "- `Vertex AI Datasets`\n", "\n", "The steps performed include:\n", "\n", "- Train an image model\n", "- Export the image model as an edge model\n", "- Train a tabular model\n", "- Export the tabular model as a cloud model\n", "- Train a text model\n", "- Train a video model" ] }, { "cell_type": "markdown", "metadata": { "id": "recommendation:mlops,stage2,automl,training" }, "source": [ "### Recommendations\n", "\n", "When doing E2E MLOps on Google Cloud, the following are best practices for when to use AutoML:\n", "\n", "* **You have a limited amount of training data**\n", "\n", "* **You want to establish a baseline metric before experimenting with a custom model**" ] }, { "cell_type": "markdown", "metadata": { "id": "dataset:flowers,icn" }, "source": [ "### Datasets\n", "\n", "#### Image\n", "\n", "The image dataset used for this tutorial is the [Flowers dataset](https://www.tensorflow.org/datasets/catalog/tf_flowers) from [TensorFlow Datasets](https://www.tensorflow.org/datasets/catalog/overview). The version of the dataset in this tutorial is stored in a public Cloud Storage bucket. The trained model predicts the type of flower in a given image from a class of five flowers: daisy, dandelion, rose, sunflower, or tulip.\n", "\n", "#### Tabular\n", "\n", "The tabular dataset used for this tutorial is the GSOD dataset from [BigQuery public datasets](https://cloud.google.com/bigquery/public-data). 
The version of the dataset you use includes only the fields year, month, and day to predict the mean daily temperature (mean_temp).\n", "\n", "#### Text\n", "\n", "The text dataset used for this tutorial is the [Happy Moments dataset](https://www.kaggle.com/ritresearch/happydb) from [Kaggle Datasets](https://www.kaggle.com/ritresearch/happydb). The version of the dataset you use in this tutorial is stored in a public Cloud Storage bucket.\n", "\n", "#### Video\n", "\n", "The video dataset used for this tutorial is the golf swing recognition portion of the [Human Motion dataset](https://todo) from [MIT](http://cbcl.mit.edu/publications/ps/Kuehne_etal_iccv11.pdf). The version of the dataset you use in this tutorial is stored in a public Cloud Storage bucket. The trained model predicts the start frame where a golf swing begins." ] }, { "cell_type": "markdown", "metadata": { "id": "fb3451ce8e47" }, "source": [ "### Costs\n", "This tutorial uses billable components of Google Cloud:\n", "\n", "- Vertex AI\n", "- Cloud Storage\n", "\n", "Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage pricing](https://cloud.google.com/storage/pricing) and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage." ] }, { "cell_type": "markdown", "metadata": { "id": "2b9e4bcab250" }, "source": [ "## Get Started" ] }, { "cell_type": "markdown", "metadata": { "id": "install_mlops" }, "source": [ "### Install Vertex AI SDK for Python and other required packages" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "install_mlops" }, "outputs": [], "source": [ "import os\n", "\n", "# Install the packages\n", "\n", "! 
pip3 install --upgrade --quiet google-cloud-aiplatform \\\n", " google-cloud-storage " ] }, { "cell_type": "markdown", "metadata": { "id": "16220914acc5" }, "source": [ "### Restart runtime (Colab only)\n", "\n", "To use the newly installed packages, you must restart the runtime on Google Colab." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "157953ab28f0" }, "outputs": [], "source": [ "import sys\n", "\n", "if \"google.colab\" in sys.modules:\n", "\n", " import IPython\n", "\n", " app = IPython.Application.instance()\n", " app.kernel.do_shutdown(True)" ] }, { "cell_type": "markdown", "metadata": { "id": "b96b39fd4d7b" }, "source": [ "<div class=\"alert alert-block alert-warning\">\n", "<b>⚠️ The kernel is going to restart. Wait until it's finished before continuing to the next step. ⚠️</b>\n", "</div>" ] }, { "cell_type": "markdown", "metadata": { "id": "ff666ce4051c" }, "source": [ "### Authenticate your notebook environment (Colab only)\n", "\n", "Authenticate your environment on Google Colab." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "cc7251520a07" }, "outputs": [], "source": [ "import sys\n", "\n", "if \"google.colab\" in sys.modules:\n", "\n", " from google.colab import auth\n", "\n", " auth.authenticate_user()" ] }, { "cell_type": "markdown", "metadata": { "id": "b02382a1fea6" }, "source": [ "### Set Google Cloud project information\n", "\n", "To get started using Vertex AI, you must have an existing Google Cloud project. Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "set_project_id" }, "outputs": [], "source": [ "PROJECT_ID = \"[your-project-id]\" # @param {type:\"string\"}\n", "LOCATION = \"us-central1\" # @param {type: \"string\"}" ] }, { "cell_type": "markdown", "metadata": { "id": "bucket:mbsdk" }, "source": [ "### Create a Cloud Storage bucket\n", "\n", "Create a storage bucket to store intermediate artifacts such as datasets." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bucket" }, "outputs": [], "source": [ "BUCKET_URI = f\"gs://your-bucket-name-{PROJECT_ID}-unique\" # @param {type:\"string\"}" ] }, { "cell_type": "markdown", "metadata": { "id": "create_bucket" }, "source": [ "**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "create_bucket" }, "outputs": [], "source": [ "! gsutil mb -l $LOCATION $BUCKET_URI" ] }, { "cell_type": "markdown", "metadata": { "id": "setup_vars" }, "source": [ "### Set up variables\n", "\n", "Next, set up some variables used throughout the tutorial.\n", "### Import libraries and define constants" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "import_aip:mbsdk" }, "outputs": [], "source": [ "import sys\n", "\n", "import google.cloud.aiplatform as aiplatform" ] }, { "cell_type": "markdown", "metadata": { "id": "init_aip:mbsdk" }, "source": [ "### Initialize Vertex AI SDK for Python\n", "\n", "Initialize the Vertex AI SDK for Python for your project and corresponding bucket." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "init_aip:mbsdk" }, "outputs": [], "source": [ "aiplatform.init(project=PROJECT_ID, staging_bucket=BUCKET_URI)" ] }, { "cell_type": "markdown", "metadata": { "id": "automl_training_intro" }, "source": [ "## AutoML training job\n", "\n", "AutoML can be used to automatically train a wide variety of model types. AutoML automates the following:\n", "\n", "- Dataset preprocessing\n", "- Feature Engineering\n", "- Data feeding\n", "- Model Architecture selection\n", "- Hyperparameter tuning\n", "- Training the model\n", "\n", "Learn more about [Vertex AI for AutoML users](https://cloud.google.com/vertex-ai/docs/start/automl-users)" ] }, { "cell_type": "markdown", "metadata": { "id": "automl_image_intro" }, "source": [ "## AutoML image models\n", "\n", "AutoML can train the following types of image models:\n", "\n", "- classification\n", "- object detection\n", "- segmentation\n", "\n", "A model can be trained either for deployment to the cloud or for export to the edge.\n", "\n", "Learn more about [AutoML Model Types](https://cloud.google.com/vertex-ai/docs/start/automl-model-types)" ] }, { "cell_type": "markdown", "metadata": { "id": "data_preparation:image,u_dataset" }, "source": [ "### Data preparation\n", "\n", "The Vertex AI `Dataset` resource for images has some requirements for your data:\n", "\n", "- Images must be stored in a Cloud Storage bucket.\n", "- Each image file must be in an image format (PNG, JPEG, BMP, ...).\n", "- There must be an index file stored in your Cloud Storage bucket that contains the path and label for each image.\n", "- The index file must be either CSV or JSONL.\n", "\n", "Learn more about [Preparing image data](https://cloud.google.com/vertex-ai/docs/datasets/prepare-image)" ] }, { "cell_type": "markdown", "metadata": { "id": "data_import_format:icn,u_dataset,csv" }, "source": [ "#### CSV\n", "\n", "For image classification, the CSV index file has the following 
requirements:\n", "\n", "- No heading.\n", "- First column is the Cloud Storage path to the image.\n", "- Second column is the label.\n", "- Any remaining columns are additional labels for multi-label image classification.\n", "\n", "For image object detection, the CSV index file has the following requirements:\n", "\n", "- No heading.\n", "- First column is the Cloud Storage path to the image.\n", "- Second column is the label.\n", "- Third/Fourth columns are the upper left corner of the bounding box. Coordinates are normalized, between 0 and 1.\n", "- Fifth/Sixth/Seventh columns are not used and should be 0.\n", "- Eighth/Ninth columns are the lower right corner of the bounding box.\n", "\n", "##### ML_USE\n", "\n", "Each row may additionally specify which split to assign the data item to when the dataset is split for training; otherwise, the dataset is randomly split 80/10/10.\n", "\n", "The `ml_use` assignment is specified by prepending a column, as the first column, that holds the assignment. The value may be one of: training, test, or validation." 
] }, { "cell_type": "markdown", "metadata": { "id": "data_import_format:isg,u_dataset,jsonl" }, "source": [ "#### JSONL\n", "\n", "For image classification, the JSONL index file has the following requirements:\n", "\n", "- Each data item is a separate JSON object, on a separate line.\n", "- The key/value pair `image_gcs_uri` is the Cloud Storage path to the image.\n", "- The key/value pair `display_name` is the label for the image.\n", "\n", " { 'image_gcs_uri': image, \n", " 'classification_annotations': \n", " { 'display_name': label\n", " }\n", " }\n", " \n", "For multi-label, the labels are specified as a list of `display_name` key/value pairs:\n", "\n", " { 'image_gcs_uri': image, \n", " 'classification_annotations': [\n", " { 'display_name': label1\n", " },\n", " { 'display_name': labelN\n", " },\n", " ]\n", " }\n", " \n", "For object detection, the JSONL index file has the following requirements:\n", "\n", "- Each data item is a separate JSON object, on a separate line.\n", "- The key/value pair `image_gcs_uri` is the Cloud Storage path to the image.\n", "- The key/value pair `bounding_box_annotations` is a list of:\n", " - `display_name`: The label of the object\n", " - `x_min`, `y_min`, `x_max`, `y_max`: The coordinates for the bounding box\n", "\n", "{\n", " \"image_gcs_uri\": image,\n", " \"bounding_box_annotations\": [\n", " {\n", " \"display_name\": label,\n", " \"x_min\": \"X_MIN\",\n", " \"y_min\": \"Y_MIN\",\n", " \"x_max\": \"X_MAX\",\n", " \"y_max\": \"Y_MAX\"\n", " },\n", " {\n", " \"display_name\": \"OBJECT2_LABEL\",\n", " \"x_min\": \"X_MIN\",\n", " \"y_min\": \"Y_MIN\",\n", " \"x_max\": \"X_MAX\",\n", " \"y_max\": \"Y_MAX\"\n", " }\n", " ]\n", "}\n", "\n", "\n", "For image segmentation, the JSONL index file has the following requirements:\n", "\n", "- Each data item is a separate JSON object, on a separate line.\n", "- The key/value pair `image_gcs_uri` is the Cloud Storage path to the image.\n", "- The key/value pair `category_mask_uri` is the Cloud Storage path to 
the mask image in PNG format.\n", "- The key/value pair `annotation_spec_colors` is a list mapping mask colors to a label.\n", " - The key/value pair `display_name` is the label for the pixel color mask.\n", " - The key/value pair `color` holds the normalized RGB pixel values (between 0 and 1) of the mask for the corresponding label.\n", "\n", " { 'image_gcs_uri': image, \n", " 'segmentation_annotations': { 'category_mask_uri': mask_image, 'annotation_spec_colors' : [ \n", " { 'display_name': label, 'color': {\"red\": value, \"blue\": value, \"green\": value} }, ...\n", " ] \n", " }\n", " }\n", " \n", "##### ML_USE\n", "\n", "Each JSONL object may additionally specify which split to assign the data item to when the dataset is split for training; otherwise, the dataset is randomly split 80/10/10.\n", "\n", "\"data_item_resource_labels\": {\n", " \"aiplatform.googleapis.com/ml_use\": \"training|test|validation\"\n", " }\n", "\n", "*Note*: The dictionary key fields may alternatively be in camelCase. For example, 'image_gcs_uri' can also be 'imageGcsUri'." ] }, { "cell_type": "markdown", "metadata": { "id": "import_file:u_dataset,csv" }, "source": [ "#### Location of Cloud Storage training data.\n", "\n", "Now set the variable `IMPORT_FILE` to the location of the CSV index file in Cloud Storage." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "import_file:flowers,csv,icn" }, "outputs": [], "source": [ "IMPORT_FILE = \"gs://cloud-samples-data/ai-platform/flowers/flowers.csv\"" ] }, { "cell_type": "markdown", "metadata": { "id": "quick_peek:csv" }, "source": [ "#### Quick peek at your data\n", "\n", "This tutorial uses a version of the Flowers dataset that is stored in a public Cloud Storage bucket, using a CSV index file.\n", "\n", "Start by doing a quick peek at the data. You count the number of examples by counting the number of rows in the CSV index file (`wc -l`) and then peek at the first few rows." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "quick_peek:csv" }, "outputs": [], "source": [ "FILE = IMPORT_FILE\n", "\n", "count = ! gsutil cat $FILE | wc -l\n", "print(\"Number of Examples\", int(count[0]))\n", "\n", "print(\"First 10 rows\")\n", "! gsutil cat $FILE | head" ] }, { "cell_type": "markdown", "metadata": { "id": "create_dataset:image,icn" }, "source": [ "### Create the Dataset\n", "\n", "Next, create the `Dataset` resource using the `create` method for the `ImageDataset` class, which takes the following parameters:\n", "\n", "- `display_name`: The human readable name for the `Dataset` resource.\n", "- `gcs_source`: A list of one or more dataset index files to import the data items into the `Dataset` resource.\n", "- `import_schema_uri`: The data labeling schema for the data items:\n", " - `single_label`: Binary and multi-class classification\n", " - `multi_label`: Multi-label multi-class classification\n", " - `bounding_box`: Object detection\n", " - `image_segmentation`: Segmentation\n", "\n", "Learn more about [ImageDataset](https://cloud.google.com/vertex-ai/docs/datasets/prepare-image)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "create_dataset:image,icn" }, "outputs": [], "source": [ "dataset = aiplatform.ImageDataset.create(\n", " display_name=\"flowers\",\n", " gcs_source=[IMPORT_FILE],\n", " import_schema_uri=aiplatform.schema.dataset.ioformat.image.single_label_classification,\n", ")\n", "\n", "print(dataset.resource_name)" ] }, { "cell_type": "markdown", "metadata": { "id": "create_automl_pipeline:image,edge,icn" }, "source": [ "### Create and run training pipeline\n", "\n", "To train an AutoML model, you perform two steps: 1) create a training pipeline, and 2) run the pipeline.\n", "\n", "#### Create training pipeline\n", "\n", "An AutoML training pipeline is created with the `AutoMLImageTrainingJob` class, with the following parameters:\n", "\n", "- `display_name`: The human readable name for the `TrainingJob` resource.\n", "- `prediction_type`: The type of task to train the model for.\n", " - `classification`: An image classification model.\n", " - `object_detection`: An image object detection model.\n", "- `multi_label`: If a classification task, whether single (`False`) or multi-labeled (`True`).\n", "- `model_type`: The type of model for deployment.\n", " - `CLOUD`: Deployment on Google Cloud\n", " - `CLOUD_HIGH_ACCURACY_1`: Optimized for accuracy over latency for deployment on Google Cloud.\n", " - `CLOUD_LOW_LATENCY_1`: Optimized for latency over accuracy for deployment on Google Cloud.\n", " - `MOBILE_TF_VERSATILE_1`: Deployment on an edge device.\n", " - `MOBILE_TF_HIGH_ACCURACY_1`: Optimized for accuracy over latency for deployment on an edge device.\n", " - `MOBILE_TF_LOW_LATENCY_1`: Optimized for latency over accuracy for deployment on an edge device.\n", "- `base_model`: (optional) Transfer learning from existing `Model` resource -- supported for image classification only.\n", "\n", "The instantiated object is the DAG (directed acyclic graph) for the training job." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "create_automl_pipeline:image,edge,icn" }, "outputs": [], "source": [ "dag = aiplatform.AutoMLImageTrainingJob(\n", " display_name=\"flowers\",\n", " prediction_type=\"classification\",\n", " multi_label=False,\n", " model_type=\"MOBILE_TF_LOW_LATENCY_1\",\n", " base_model=None,\n", ")\n", "\n", "print(dag)" ] }, { "cell_type": "markdown", "metadata": { "id": "run_automl_pipeline:image" }, "source": [ "#### Run the training pipeline\n", "\n", "Next, you run the created DAG to start the training job by invoking the method `run`, with the following parameters:\n", "\n", "- `dataset`: The `Dataset` resource to train the model.\n", "- `model_display_name`: The human readable name for the trained model.\n", "- `training_fraction_split`: The fraction of the dataset to use for training.\n", "- `test_fraction_split`: The fraction of the dataset to use for test (holdout data).\n", "- `validation_fraction_split`: The fraction of the dataset to use for validation.\n", "- `budget_milli_node_hours`: (optional) Maximum training time specified in units of milli node-hours (1000 = 1 node-hour).\n", "- `disable_early_stopping`: If `False` (the default), training may be completed before using the entire budget if the service believes it cannot further improve on the model objective measurements. If `True`, the entire budget is used.\n", "\n", "The `run` method when completed returns the `Model` resource.\n", "\n", "The execution of the training pipeline can take more than 30 minutes." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "3eaba926cdfa" }, "outputs": [], "source": [ "import os\n", "\n", "if os.getenv(\"IS_TESTING\"):\n", " sys.exit(0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "run_automl_pipeline:image" }, "outputs": [], "source": [ "model = dag.run(\n", " dataset=dataset,\n", " model_display_name=\"flowers\",\n", " training_fraction_split=0.8,\n", " validation_fraction_split=0.1,\n", " test_fraction_split=0.1,\n", " budget_milli_node_hours=8000,\n", " disable_early_stopping=False,\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "evaluate_the_model:mbsdk" }, "source": [ "## Review model evaluation scores\n", "\n", "After your model training has finished, you can review the evaluation scores for it using the `list_model_evaluations()` method. This method will return an iterator for each evaluation slice." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "evaluate_the_model:mbsdk" }, "outputs": [], "source": [ "model_evaluations = model.list_model_evaluations()\n", "\n", "for model_evaluation in model_evaluations:\n", " print(model_evaluation.to_dict())" ] }, { "cell_type": "markdown", "metadata": { "id": "deploy_model:mbsdk,automatic" }, "source": [ "## Deploy the model\n", "\n", "Next, deploy your model for online prediction. To deploy the model, you invoke the `deploy` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "deploy_model:mbsdk,automatic" }, "outputs": [], "source": [ "endpoint = model.deploy()" ] }, { "cell_type": "markdown", "metadata": { "id": "make_prediction" }, "source": [ "## Send an online prediction request\n", "\n", "Send an online prediction to your deployed model." ] }, { "cell_type": "markdown", "metadata": { "id": "get_test_item" }, "source": [ "### Get test item\n", "\n", "You use an arbitrary example out of the dataset as a test item. 
Don't be concerned that the example was likely used in training the model. You're just looking at how to make a prediction." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "get_test_item:automl,icn,csv" }, "outputs": [], "source": [ "test_item = !gsutil cat $IMPORT_FILE | head -n1\n", "if len(str(test_item[0]).split(\",\")) == 3:\n", " _, test_item, test_label = str(test_item[0]).split(\",\")\n", "else:\n", " test_item, test_label = str(test_item[0]).split(\",\")\n", "\n", "print(test_item, test_label)" ] }, { "cell_type": "markdown", "metadata": { "id": "predict_request:mbsdk,icn" }, "source": [ "### Make the prediction\n", "\n", "Now that your `Model` resource is deployed to an `Endpoint` resource, you can do online predictions by sending prediction requests to the Endpoint resource.\n", "\n", "#### Request\n", "\n", "Since your test item is in a public Cloud Storage bucket in this example, you copy it to your bucket and read the contents of the image using `Cloud Storage SDK`. To pass the test data to the prediction service, you encode the bytes into base64 which makes the content safe from modification while transmitting binary data over the network.\n", "\n", "The format of each instance is:\n", "\n", " { 'content': base64_encoded_string }\n", "\n", "Since the `predict()` method can take multiple items (instances), send your single test item as a list of one test item.\n", "\n", "#### Response\n", "\n", "The response from the `predict()` call is a Python dictionary with the following entries:\n", "\n", "- `ids`: The internal assigned unique identifiers for each prediction request.\n", "- `displayNames`: The class names for each class label.\n", "- `confidences`: The predicted confidence, between 0 and 1, per class label.\n", "- `deployed_model_id`: The Vertex AI identifier for the deployed Model resource which did the predictions." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "1c1d53e89beb" }, "outputs": [], "source": [ "import base64\n", "\n", "from google.cloud import storage\n", "\n", "# Copy the test image to the Cloud Storage bucket as \"test.jpg\"\n", "test_image_local = \"{}/test.jpg\".format(BUCKET_URI)\n", "! gsutil cp $test_item $test_image_local\n", "\n", "# Download the test image in bytes format\n", "storage_client = storage.Client(project=PROJECT_ID)\n", "bucket = storage_client.bucket(bucket_name=BUCKET_URI[5:])\n", "test_content = bucket.get_blob(\"test.jpg\").download_as_bytes()\n", "\n", "# The format of each instance should conform to the deployed model's prediction input schema.\n", "instances = [{\"content\": base64.b64encode(test_content).decode(\"utf-8\")}]\n", "\n", "prediction = endpoint.predict(instances=instances)\n", "\n", "print(prediction)" ] }, { "cell_type": "markdown", "metadata": { "id": "3b1b67898533" }, "source": [ "#### Alternate method using [GFile](https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile)\n", "\n", "Alternatively, you can use [GFile](https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile) from TensorFlow to read the data from Cloud Storage directly. The following code snippet does the same:\n", "\n", "```\n", "import base64\n", "import tensorflow as tf\n", "\n", "# Read the test file using GFile\n", "with tf.io.gfile.GFile(test_item, \"rb\") as f:\n", " content = f.read()\n", "\n", "# The format of each instance should conform to the deployed model's prediction input schema.\n", "instances = [{\"content\": base64.b64encode(content).decode(\"utf-8\")}]\n", "\n", "prediction = endpoint.predict(instances=instances)\n", "\n", "print(prediction)\n", "```\n", "Note that `tf.io.gfile.GFile` supports multiple file system implementations, including local files, Google Cloud Storage (using a gs:// prefix), and HDFS (using an hdfs:// prefix)." 
] }, { "cell_type": "markdown", "metadata": { "id": "undeploy_model:mbsdk" }, "source": [ "#### Undeploy the model\n", "\n", "When you're done doing predictions, undeploy the model from the `Endpoint` resource. This deprovisions all compute resources and ends billing for the deployed model." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "undeploy_model:mbsdk" }, "outputs": [], "source": [ "endpoint.undeploy_all()" ] }, { "cell_type": "markdown", "metadata": { "id": "export_model:mbsdk,image" }, "source": [ "## Export as Edge model\n", "\n", "You can export an AutoML model trained for the edge as an `Edge` model, which you can then custom deploy to an edge device or download locally. Use the method `export_model()` to export the model to Cloud Storage, which takes the following parameters:\n", "\n", "- `artifact_destination`: The Cloud Storage location to store the exported model artifacts to.\n", "- `export_format_id`: The format to save the model as. For AutoML edge image models, the options are:\n", " - `tf-saved-model`: TensorFlow SavedModel for deployment to a container.\n", " - `tflite`: TensorFlow Lite for deployment to an edge or mobile device.\n", " - `edgetpu-tflite`: TensorFlow Lite for Edge TPU\n", " - `tf-js`: TensorFlow.js for web client\n", " - `core-ml`: Core ML for iOS devices\n", "\n", "- `sync`: Whether to perform the operation synchronously or asynchronously." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "export_model:mbsdk,image" }, "outputs": [], "source": [ "response = model.export_model(\n", " artifact_destination=BUCKET_URI, export_format_id=\"tflite\", sync=True\n", ")\n", "\n", "model_package = response[\"artifactOutputUri\"]" ] }, { "cell_type": "markdown", "metadata": { "id": "model_delete:mbsdk" }, "source": [ "#### Delete the model\n", "\n", "The method 'delete()' will delete the model." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "model_delete:mbsdk" }, "outputs": [], "source": [ "model.delete()" ] }, { "cell_type": "markdown", "metadata": { "id": "dataset_delete:mbsdk" }, "source": [ "#### Delete the dataset\n", "\n", "The method 'delete()' will delete the dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "dataset_delete:mbsdk" }, "outputs": [], "source": [ "dataset.delete()" ] }, { "cell_type": "markdown", "metadata": { "id": "endpoint_delete:mbsdk" }, "source": [ "#### Delete the endpoint\n", "\n", "The method 'delete()' will delete the endpoint." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "endpoint_delete:mbsdk" }, "outputs": [], "source": [ "endpoint.delete()" ] }, { "cell_type": "markdown", "metadata": { "id": "automl_tabular_intro" }, "source": [ "## AutoML tabular models\n", "\n", "AutoML can train the following types of tabular models:\n", "\n", "- classification\n", "- regression\n", "- forecasting\n", "\n", "A model can be trained either for automatic deployment to the cloud or for export and manual deployment to the cloud.\n", "\n", "Learn more about [AutoML Model Types](https://cloud.google.com/vertex-ai/docs/start/automl-model-types)" ] }, { "cell_type": "markdown", "metadata": { "id": "data_preparation:tabular,u_dataset" }, "source": [ "### Data preparation\n", "\n", "The Vertex AI `Dataset` resource for tabular has the following requirement for your tabular data:\n", "\n", "- The data must be in a CSV file or a BigQuery table.\n", "\n", "Learn more about [Preparing tabular data](https://cloud.google.com/vertex-ai/docs/datasets/prepare-tabular)" ] }, { "cell_type": "markdown", "metadata": { "id": "data_import_format:lbn,u_dataset,csv" }, "source": [ "#### CSV\n", "\n", "For tabular models, the CSV file has a few requirements:\n", "\n", "- The first row must be the heading -- note how this is different from Image, Text and Video where the requirement is 
no heading.\n", "- All but one column are features.\n", "- One column is the label, which you will specify when you subsequently create the training pipeline.\n", "\n", "##### ML_USE\n", "\n", "Each row may additionally specify which split to assign the data item to when the dataset is split for training; otherwise, the dataset is randomly split 80/10/10.\n", "\n", "The `ml_use` assignment is specified by prepending a column, as the first column, that holds the assignment. The value may be one of: training, test, or validation." ] }, { "cell_type": "markdown", "metadata": { "id": "import_file:u_dataset,bq" }, "source": [ "#### Location of BigQuery training data.\n", "\n", "Now set the variable `IMPORT_FILE` to the location of the data table in BigQuery." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "import_file:gsod,bq,lrg" }, "outputs": [], "source": [ "IMPORT_FILE = \"bq://bigquery-public-data.samples.gsod\"\n", "BQ_TABLE = \"bigquery-public-data.samples.gsod\"" ] }, { "cell_type": "markdown", "metadata": { "id": "create_dataset:tabular,bq,lrg" }, "source": [ "### Create the Dataset\n", "\n", "#### BigQuery input data\n", "\n", "Next, create the `Dataset` resource using the `create` method for the `TabularDataset` class, which takes the following parameters:\n", "\n", "- `display_name`: The human readable name for the `Dataset` resource.\n", "- `bq_source`: Import data items from a BigQuery table into the `Dataset` resource.\n", "- `labels`: User defined metadata. In this example, you store the location of the Cloud Storage bucket containing the user defined data.\n", "\n", "Learn more about [TabularDataset from BigQuery table](https://cloud.google.com/vertex-ai/docs/datasets/create-dataset-api#aiplatform_create_dataset_tabular_bigquery_sample-python)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "create_dataset:tabular,bq,lrg" }, "outputs": [], "source": [ "dataset = aiplatform.TabularDataset.create(\n", " display_name=\"gsod\",\n", " bq_source=[IMPORT_FILE],\n", " labels={\"user_metadata\": BUCKET_URI[5:]},\n", ")\n", "\n", "label_column = \"mean_temp\"\n", "\n", "print(dataset.resource_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "set_transformations:gsod" }, "outputs": [], "source": [ "TRANSFORMATIONS = [\n", " {\"auto\": {\"column_name\": \"year\"}},\n", " {\"auto\": {\"column_name\": \"month\"}},\n", " {\"auto\": {\"column_name\": \"day\"}},\n", "]\n", "\n", "label_column = \"mean_temp\"" ] }, { "cell_type": "markdown", "metadata": { "id": "create_automl_pipeline:tabular,lrg,transformations" }, "source": [ "### Create and run training pipeline\n", "\n", "To train an AutoML model, you perform two steps: 1) create a training pipeline, and 2) run the pipeline.\n", "\n", "#### Create training pipeline\n", "\n", "An AutoML training pipeline is created with the `AutoMLTabularTrainingJob` class, with the following parameters:\n", "\n", "- `display_name`: The human readable name for the `TrainingJob` resource.\n", "- `optimization_prediction_type`: The type of task to train the model for.\n", " - `classification`: A tabular classification model.\n", " - `regression`: A tabular regression model.\n", "- `column_transformations`: (Optional) Transformations to apply to the input columns.\n", "- `optimization_objective`: The optimization objective to minimize or maximize.\n", " - binary classification:\n", " - `minimize-log-loss`\n", " - `maximize-au-roc`\n", " - `maximize-au-prc`\n", " - `maximize-precision-at-recall`\n", " - `maximize-recall-at-precision`\n", " - multi-class classification:\n", " - `minimize-log-loss`\n", " - regression:\n", " - `minimize-rmse`\n", " - `minimize-mae`\n", " - `minimize-rmsle`" ] }, { "cell_type": "code", "execution_count": 
null, "metadata": { "id": "create_automl_pipeline:tabular,lrg,transformations" }, "outputs": [], "source": [ "dag = aiplatform.AutoMLTabularTrainingJob(\n", " display_name=\"gsod\",\n", " optimization_prediction_type=\"regression\",\n", " optimization_objective=\"minimize-rmse\",\n", " column_transformations=TRANSFORMATIONS,\n", ")\n", "\n", "print(dag)" ] }, { "cell_type": "markdown", "metadata": { "id": "run_automl_pipeline:tabular" }, "source": [ "#### Run the training pipeline\n", "\n", "Next, you run the created DAG to start the training job by invoking the method `run`, with the following parameters:\n", "\n", "- `dataset`: The `Dataset` resource to train the model on.\n", "- `model_display_name`: The human readable name for the trained model.\n", "- `training_fraction_split`: The fraction of the dataset to use for training.\n", "- `test_fraction_split`: The fraction of the dataset to use for test (holdout data).\n", "- `validation_fraction_split`: The fraction of the dataset to use for validation.\n", "- `target_column`: The name of the column to train as the label.\n", "- `budget_milli_node_hours`: (Optional) Maximum training time, specified in milli node hours (1,000 = 1 node hour).\n", "- `disable_early_stopping`: If `True`, the entire budget is used. If `False`, training may complete before the entire budget is used if the service believes it cannot further improve on the model objective measurements.\n", "\n", "The `run` method when completed returns the `Model` resource.\n", "\n", "The execution of the training pipeline can take 30 minutes or longer."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "run_automl_pipeline:tabular" }, "outputs": [], "source": [ "model = dag.run(\n", " dataset=dataset,\n", " model_display_name=\"gsod\",\n", " training_fraction_split=0.8,\n", " validation_fraction_split=0.1,\n", " test_fraction_split=0.1,\n", " budget_milli_node_hours=8000,\n", " disable_early_stopping=False,\n", " target_column=\"mean_temp\",\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "evaluate_the_model:mbsdk" }, "source": [ "## Review model evaluation scores\n", "\n", "After your model training has finished, you can review the evaluation scores for it using the `list_model_evaluations()` method. This method will return an iterator for each evaluation slice." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "evaluate_the_model:mbsdk" }, "outputs": [], "source": [ "model_evaluations = model.list_model_evaluations()\n", "\n", "for model_evaluation in model_evaluations:\n", " print(model_evaluation.to_dict())" ] }, { "cell_type": "markdown", "metadata": { "id": "deploy_model:mbsdk,dedicated" }, "source": [ "## Deploy the model\n", "\n", "Next, deploy your model for online prediction. To deploy the model, you invoke the `deploy` method, with the following parameters:\n", "\n", "- `machine_type`: The type of compute machine." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "deploy_model:mbsdk,dedicated" }, "outputs": [], "source": [ "endpoint = model.deploy(machine_type=\"n1-standard-4\")" ] }, { "cell_type": "markdown", "metadata": { "id": "undeploy_model:mbsdk" }, "source": [ "#### Undeploy the model\n", "\n", "When you're done doing predictions, undeploy the model from the `Endpoint` resource. This deprovisions all compute resources and ends billing for the deployed model." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "undeploy_model:mbsdk" }, "outputs": [], "source": [ "endpoint.undeploy_all()" ] }, { "cell_type": "markdown", "metadata": { "id": "export_model:mbsdk,tabular" }, "source": [ "## Export as cloud model\n", "\n", "You can export an AutoML cloud model as a TensorFlow SavedModel, which you can then custom deploy or download locally. Use the method `export_model()` to export the model to Cloud Storage, which takes the following parameters:\n", "\n", "- `artifact_destination`: The Cloud Storage location to store the SavedModel artifacts in.\n", "- `export_format_id`: The format to export the model in. For AutoML cloud there is just one option:\n", " - `tf-saved-model`: TensorFlow SavedModel\n", "- `sync`: Whether to perform the operation synchronously or asynchronously." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "export_model:mbsdk,tabular" }, "outputs": [], "source": [ "response = model.export_model(\n", " artifact_destination=BUCKET_URI, export_format_id=\"tf-saved-model\", sync=True\n", ")\n", "\n", "model_package = response[\"artifactOutputUri\"]" ] }, { "cell_type": "markdown", "metadata": { "id": "model_delete:mbsdk" }, "source": [ "#### Delete the model\n", "\n", "The method 'delete()' will delete the model." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "model_delete:mbsdk" }, "outputs": [], "source": [ "model.delete()" ] }, { "cell_type": "markdown", "metadata": { "id": "dataset_delete:mbsdk" }, "source": [ "#### Delete the dataset\n", "\n", "The method 'delete()' will delete the dataset." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "dataset_delete:mbsdk" }, "outputs": [], "source": [ "dataset.delete()" ] }, { "cell_type": "markdown", "metadata": { "id": "endpoint_delete:mbsdk" }, "source": [ "#### Delete the endpoint\n", "\n", "The method 'delete()' will delete the endpoint." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "endpoint_delete:mbsdk" }, "outputs": [], "source": [ "endpoint.delete()" ] }, { "cell_type": "markdown", "metadata": { "id": "automl_text_intro" }, "source": [ "## AutoML text models\n", "\n", "AutoML can train the following types of text models:\n", "\n", "- classification\n", "- sentiment analysis\n", "- entity extraction\n", "\n", "Learn more about [AutoML Model Types](https://cloud.google.com/vertex-ai/docs/start/automl-model-types)" ] }, { "cell_type": "markdown", "metadata": { "id": "data_preparation:text,u_dataset" }, "source": [ "### Data preparation\n", "\n", "The Vertex AI `Dataset` resource for text has a couple of requirements for your text data.\n", "\n", "- Text examples must be stored in a CSV or JSONL file.\n", "\n", "Learn more about [Preparing text data](https://cloud.google.com/vertex-ai/docs/datasets/prepare-text)" ] }, { "cell_type": "markdown", "metadata": { "id": "data_import_format:tcn,u_dataset,csv" }, "source": [ "#### CSV\n", "\n", "For text classification, the CSV file has a few requirements:\n", "\n", "- No heading.\n", "- First column is the text example or Cloud Storage path to text file (.txt suffix).\n", "- Second column the label.\n", "- Any remaining columns are additional labels for multi-label text classification.\n", "\n", "For text sentiment analysis, the CSV file has a few requirements:\n", "\n", "- No heading.\n", "- First column is the text example or Cloud Storage path to text file (.txt suffix).\n", "- Second column is the sentiment value.\n", "- Third column is the maximum possible sentiment value.\n", "\n", "##### ML_USE\n", "\n", 
"Each row may additionally specify which split to assign the data item to when the dataset is split for training; otherwise, the dataset will be randomly split: 80/10/10.\n", "\n", "The `ml_use` assignment is specified by prepending a column for specifying the assignment -- as the first column. The value may be one of: training, test, or validation." ] }, { "cell_type": "markdown", "metadata": { "id": "766c838de8a0" }, "source": [ "#### JSONL \n", "\n", "For text classification, the JSONL file has a few requirements:\n", "\n", "- Each data item is a separate JSON object, on a separate line.\n", "- The key/value pair `text_gcs_uri` is the Cloud Storage path to the text file.\n", "- The key/value pair `text_content` is the alternate way of specifying the text as inlined.\n", "- The key/value pair `display_name` is the label for the text.\n", "\n", "{\n", " \"classification_annotation\": {\n", " \"display_name\": label\n", " },\n", " \"text_content\": text\n", "}\n", "{\n", " \"classification_annotation\": {\n", " \"display_name\": label\n", " },\n", " \"text_gcs_uri\": \"gcs_uri_to_file\"\n", "}\n", "\n", " \n", "For multi-label, the labels are specified as a list of `display_name` key/value pairs:\n", "\n", " 'classification_annotations': [\n", " { 'display_name': label1\n", " },\n", " { 'display_name': labelN\n", " },\n", " ]\n", "\n", "For text sentiment analysis, the JSONL file has a few requirements:\n", "\n", "- Each data item is a separate JSON object, on a separate line.\n", "- The key/value pair `text_gcs_uri` is the Cloud Storage path to the text file.\n", "- The key/value pair `text_content` is the alternate way of specifying the text as inlined.\n", "- The key/value pair `sentiment` is the sentiment value as an integer value greater than 0.\n", "- The key/value pair `sentiment_max`is the maximum possible value for the sentiment.\n", "\n", "{\n", " \"sentiment_annotation\": {\n", " \"sentiment\": number,\n", " \"sentiment_max\": number\n", " },\n", " 
\"text_content\": text,\n", "}\n", "{\n", " \"sentiment_annotation\": {\n", " \"sentiment\": number,\n", " \"sentiment_max\": number\n", " },\n", " \"text_gcs_uri\": \"gcs_uri_to_file\"\n", "}\n", "\n", "\n", "For text entity extraction, the JSONL file has a few requirements:\n", "\n", "- Each data item is a separate JSON object, on a separate line.\n", "- The key/value pair `text_gcs_uri` is the Cloud Storage path to the text file.\n", "- The key/value pair `text_content` is the alternate way of specifying the text as inlined.\n", "- The key/value pair `start_offset` is the character offset of the start of the text.\n", "- The key/value pair `end_offset` is the character offset of the end of the text.\n", "- The key/value pair `display_name` is the label for the text.\n", "\n", "{\n", " \"text_segment_annotations\": [\n", " {\n", " \"start_offset\":number,\n", " \"end_offset\":number,\n", " \"display_name\": label\n", " },\n", " ...\n", " ],\n", " \"textContent\": \"inline_text\"\n", "}\n", "{\n", " \"textSegmentAnnotations\": [\n", " {\n", " \"start_offset\": number,\n", " \"end_offset\": number,\n", " \"displayName\": label\n", " },\n", " ...\n", " ],\n", " \"text_gcs_uri\": \"gcs_uri_to_file\"\n", "}\n", "\n", "##### ML_USE\n", "\n", "Each JSONL object may additionally specify which split to assign the data item to when the dataset is split for training; otherwise, the dataset will be randomly split: 80/10/10.\n", "\n", "\"data_item_resource_labels\": {\n", " \"aiplatform.googleapis.com/ml_use\": \"training|test|validation\"\n", " }\n", "\n", "*Note*: The dictionary key fields may alternatively be in camelCase. For example, 'text_gcs_uri' can also be 'textGcsUri'." ] }, { "cell_type": "markdown", "metadata": { "id": "import_file:u_dataset,csv" }, "source": [ "#### Location of Cloud Storage training data.\n", "\n", "Now set the variable `IMPORT_FILE` to the location of the CSV index file in Cloud Storage." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "import_file:happydb,csv,tcn" }, "outputs": [], "source": [ "IMPORT_FILE = \"gs://cloud-ml-data/NL-classification/happiness.csv\"" ] }, { "cell_type": "markdown", "metadata": { "id": "quick_peek:csv" }, "source": [ "#### Quick peek at your data\n", "\n", "This tutorial uses a version of the Happy Moments dataset that is stored in a public Cloud Storage bucket, using a CSV index file.\n", "\n", "Start by doing a quick peek at the data. You count the number of examples by counting the number of rows in the CSV index file (`wc -l`) and then peek at the first few rows." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "quick_peek:csv" }, "outputs": [], "source": [ "FILE = IMPORT_FILE\n", "\n", "count = ! gsutil cat $FILE | wc -l\n", "print(\"Number of Examples\", int(count[0]))\n", "\n", "print(\"First 10 rows\")\n", "! gsutil cat $FILE | head" ] }, { "cell_type": "markdown", "metadata": { "id": "create_dataset:text,tcn" }, "source": [ "### Create the Dataset\n", "\n", "Next, create the `Dataset` resource using the `create` method for the `TextDataset` class, which takes the following parameters:\n", "\n", "- `display_name`: The human readable name for the `Dataset` resource.\n", "- `gcs_source`: A list of one or more dataset index files to import the data items into the `Dataset` resource.\n", "- `import_schema_uri`: The data labeling schema for the data items.\n", " - `single_label`: Binary and multi-class classification\n", " - `multi_label`: Multi-label multi-class classification\n", " - `sentiment`: Sentiment analysis\n", " - `extraction`: Entity extraction\n", "\n", "Learn more about [TextDataset](https://cloud.google.com/vertex-ai/docs/datasets/prepare-text)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "create_dataset:text,tcn" }, "outputs": [], "source": [ "dataset = aiplatform.TextDataset.create(\n", " display_name=\"happydb\",\n", " gcs_source=[IMPORT_FILE],\n", " import_schema_uri=aiplatform.schema.dataset.ioformat.text.single_label_classification,\n", ")\n", "\n", "print(dataset.resource_name)" ] }, { "cell_type": "markdown", "metadata": { "id": "create_automl_pipeline:text,tcn" }, "source": [ "### Create and run training pipeline\n", "\n", "To train an AutoML model, you perform two steps: 1) create a training pipeline, and 2) run the pipeline.\n", "\n", "#### Create training pipeline\n", "\n", "An AutoML training pipeline is created with the `AutoMLTextTrainingJob` class, with the following parameters:\n", "\n", "- `display_name`: The human readable name for the `TrainingJob` resource.\n", "- `prediction_type`: The type of task to train the model for.\n", " - `classification`: A text classification model.\n", " - `sentiment`: A text sentiment analysis model.\n", " - `extraction`: A text entity extraction model.\n", "- `multi_label`: If a classification task, whether single (False) or multi-labeled (True).\n", "- `sentiment_max`: If a sentiment analysis task, the maximum sentiment value.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "create_automl_pipeline:text,tcn" }, "outputs": [], "source": [ "dag = aiplatform.AutoMLTextTrainingJob(\n", " display_name=\"happydb\",\n", " prediction_type=\"classification\",\n", " multi_label=False,\n", ")\n", "\n", "print(dag)" ] }, { "cell_type": "markdown", "metadata": { "id": "run_automl_pipeline:text" }, "source": [ "#### Run the training pipeline\n", "\n", "Next, you run the created DAG to start the training job by invoking the method `run`, with the following parameters:\n", "\n", "- `dataset`: The `Dataset` resource to train the model.\n", "- `model_display_name`: The human readable name for the trained model.\n", "- 
`training_fraction_split`: The fraction of the dataset to use for training.\n", "- `test_fraction_split`: The fraction of the dataset to use for test (holdout data).\n", "- `validation_fraction_split`: The fraction of the dataset to use for validation.\n", "\n", "The `run` method when completed returns the `Model` resource.\n", "\n", "The execution of the training pipeline can take 30 minutes or longer." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "run_automl_pipeline:text" }, "outputs": [], "source": [ "model = dag.run(\n", " dataset=dataset,\n", " model_display_name=\"happydb\",\n", " training_fraction_split=0.8,\n", " validation_fraction_split=0.1,\n", " test_fraction_split=0.1,\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "evaluate_the_model:mbsdk" }, "source": [ "## Review model evaluation scores\n", "\n", "After your model training has finished, you can review the evaluation scores for it using the `list_model_evaluations()` method. This method will return an iterator for each evaluation slice." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "evaluate_the_model:mbsdk" }, "outputs": [], "source": [ "model_evaluations = model.list_model_evaluations()\n", "\n", "for model_evaluation in model_evaluations:\n", " print(model_evaluation.to_dict())" ] }, { "cell_type": "markdown", "metadata": { "id": "deploy_model:mbsdk,automatic" }, "source": [ "## Deploy the model\n", "\n", "Next, deploy your model for online prediction. To deploy the model, you invoke the `deploy` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "deploy_model:mbsdk,automatic" }, "outputs": [], "source": [ "endpoint = model.deploy()" ] }, { "cell_type": "markdown", "metadata": { "id": "undeploy_model:mbsdk" }, "source": [ "#### Undeploy the model\n", "\n", "When you're done doing predictions, undeploy the model from the `Endpoint` resource. 
This deprovisions all compute resources and ends billing for the deployed model." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "undeploy_model:mbsdk" }, "outputs": [], "source": [ "endpoint.undeploy_all()" ] }, { "cell_type": "markdown", "metadata": { "id": "model_delete:mbsdk" }, "source": [ "#### Delete the model\n", "\n", "The method 'delete()' will delete the model." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "model_delete:mbsdk" }, "outputs": [], "source": [ "model.delete()" ] }, { "cell_type": "markdown", "metadata": { "id": "dataset_delete:mbsdk" }, "source": [ "#### Delete the dataset\n", "\n", "The method 'delete()' will delete the dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "dataset_delete:mbsdk" }, "outputs": [], "source": [ "dataset.delete()" ] }, { "cell_type": "markdown", "metadata": { "id": "endpoint_delete:mbsdk" }, "source": [ "#### Delete the endpoint\n", "\n", "The method 'delete()' will delete the endpoint." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "endpoint_delete:mbsdk" }, "outputs": [], "source": [ "endpoint.delete()" ] }, { "cell_type": "markdown", "metadata": { "id": "automl_video_intro" }, "source": [ "## AutoML video models\n", "\n", "AutoML can train the following types of video models:\n", "\n", "- classification\n", "- object tracking\n", "- action recognition\n", "\n", "A model can be trained either for deployment to the cloud or for export to the edge.\n", "\n", "Learn more about [AutoML Model Types](https://cloud.google.com/vertex-ai/docs/start/automl-model-types)" ] }, { "cell_type": "markdown", "metadata": { "id": "data_preparation:text,u_dataset" }, "source": [ "### Data preparation\n", "\n", "The Vertex AI `Dataset` resource for video has a couple of requirements for your video data.\n", "\n", "- Videos must be stored in Cloud Storage and referenced from a CSV or JSONL index file.\n", "\n", "Learn more about [Preparing video data](https://cloud.google.com/vertex-ai/docs/datasets/prepare-video)" ] }, { "cell_type": "markdown", "metadata": { "id": "427212b48840" }, "source": [ "#### CSV\n", "\n", "For video classification, the CSV file has a few requirements:\n", "\n", "- No heading.\n", "- First column is the Cloud Storage path to the video file.\n", "- Second column is the label.\n", "- Third column is the start time (seconds) in the video to classify.\n", "- Fourth column is the end time (seconds) in the video to classify.\n", "\n", "For multi-label classification, each label is a separate row entry.\n", "\n", "For video object tracking, the CSV file has a few requirements:\n", "\n", "- No heading.\n", "- First column is the Cloud Storage path to the video file.\n", "- Second column is the label.\n", "- Third column is unused (blank).\n", "- Fourth column is the start time (seconds) in the video to start tracking the object.\n", "- The fifth through eighth columns are the bounding box coordinates of the object to track.\n", " - x_min\n", " - y_min\n", " - x_max\n", " - y_max\n", " \n", 
"For action recognition, the CSV file has a few requirements:\n", "\n", "- No heading.\n", "- Each row can be one of the following four formats:\n", "\n", "VIDEO_URI, TIME_SEGMENT_START, TIME_SEGMENT_END, LABEL, ANNOTATION_FRAME_TIMESTAMP\n", "\n", "VIDEO_URI, , , LABEL, ANNOTATION_FRAME_TIMESTAMP\n", "\n", "VIDEO_URI, TIME_SEGMENT_START, TIME_SEGMENT_END, LABEL, ANNOTATION_SEGMENT_START, ANNOTATION_SEGMENT_END\n", "\n", "VIDEO_URI, , , LABEL, ANNOTATION_SEGMENT_START, ANNOTATION_SEGMENT_END\n", "\n", "\n", "##### ML_USE\n", "\n", "Each row may additionally specify which split to assign the data item to when the dataset is split for training; otherwise, the dataset will be randomly split: 80/10/10.\n", "\n", "The `ml_use` assignment is specified by prepending a column for specifying the assignment -- as the first column. The value may be one of: training, or test." ] }, { "cell_type": "markdown", "metadata": { "id": "461301339727" }, "source": [ "#### JSONL\n", "\n", "For video classification, the CSV file has a few requirements:\n", "\n", "- Each data item is a separate JSON object, on a separate line.\n", "- The key/value pair `video_gcs_uri` is the Cloud Storage path to the text file.\n", "- The key/value pair `display_name` is the label for the text.\n", "- The key/value pair `start_time` is the start time (seconds) for classifying.\n", "- The key/value pair `end_time` is the end time (seconds) for classifying.\n", "\n", "\n", " {\n", " \"video_gcs_uri\": video,\n", " \"time_segment_annotations\": [{\n", " \"display_name\": label,\n", " \"start_time\": \"start_time_of_segment\",\n", " \"end_time\": \"end_time_of_segment\"\n", " }]\n", " }\n", "\n", "For video object tracking, the CSV file has a few requirements:\n", "\n", "- Each data item is a separate JSON object, on a separate line.\n", "- The key/value pair `video_gcs_uri` is the Cloud Storage path to the text file.\n", "\n", " {\n", " \"video_gcs_uri\": video,\n", " \"temporal_bounding_box_annotations\": 
[{\n", " \"display_name\": label,\n", " \"x_min\": \"leftmost_coordinate_of_the_bounding box\",\n", " \"x_max\": \"rightmost_coordinate_of_the_bounding box\",\n", " \"y_min\": \"topmost_coordinate_of_the_bounding box\",\n", " \"y_max\": \"bottommost_coordinate_of_the_bounding box\",\n", " \"time_offset\": \"timeframe_object-detected\"\n", " }]\n", " }\n", "\n", "For video action recognition, the JSONL file has a few requirements:\n", "\n", "- Each data item is a separate JSON object, on a separate line.\n", "- The key/value pair `video_gcs_uri` is the Cloud Storage path to the video file.\n", "\n", " {\n", " \"video_gcs_uri\": video,\n", " \"time_segments\": [{\n", " \"start_time\": \"start_time_of_fully_annotated_segment\",\n", " \"end_time\": \"end_time_of_segment\"}],\n", " \"time_segment_annotations\": [{\n", " \"display_name\": label,\n", " \"start_time\": \"start_time_of_segment\",\n", " \"end_time\": \"end_time_of_segment\"\n", " }]\n", " }\n", "\n", "##### ML_USE\n", "\n", "Each JSONL object may additionally specify which split to assign the data item to when the dataset is split for training; otherwise, the dataset will be randomly split: 80/20.\n", "\n", "\"data_item_resource_labels\": {\n", " \"aiplatform.googleapis.com/ml_use\": \"training|test\"\n", " }\n", "\n", "*Note*: The dictionary key fields may alternatively be in camelCase. For example, 'video_gcs_uri' can also be 'videoGcsUri'." ] }, { "cell_type": "markdown", "metadata": { "id": "import_file:u_dataset,csv" }, "source": [ "#### Location of Cloud Storage training data.\n", "\n", "Now set the variable `IMPORT_FILE` to the location of the CSV index file in Cloud Storage." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "import_file:hmdb,csv,vcn" }, "outputs": [], "source": [ "IMPORT_FILE = \"gs://automl-video-demo-data/hmdb_split1_5classes_train_inf.csv\"" ] }, { "cell_type": "markdown", "metadata": { "id": "quick_peek:csv" }, "source": [ "#### Quick peek at your data\n", "\n", "This tutorial uses a version of the HMDB human motion dataset that is stored in a public Cloud Storage bucket, using a CSV index file.\n", "\n", "Start by doing a quick peek at the data. You count the number of examples by counting the number of rows in the CSV index file (`wc -l`) and then peek at the first few rows." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "quick_peek:csv" }, "outputs": [], "source": [ "FILE = IMPORT_FILE\n", "\n", "count = ! gsutil cat $FILE | wc -l\n", "print(\"Number of Examples\", int(count[0]))\n", "\n", "print(\"First 10 rows\")\n", "! gsutil cat $FILE | head" ] }, { "cell_type": "markdown", "metadata": { "id": "create_dataset:video,vcn" }, "source": [ "### Create the Dataset\n", "\n", "Next, create the `Dataset` resource using the `create` method for the `VideoDataset` class, which takes the following parameters:\n", "\n", "- `display_name`: The human readable name for the `Dataset` resource.\n", "- `gcs_source`: A list of one or more dataset index files to import the data items into the `Dataset` resource.\n", "- `import_schema_uri`: The data labeling schema for the data items.\n", " - `classification`: Binary and multi-class classification\n", " - `object_tracking`: Object tracking\n", " - `action_recognition`: Action recognition\n", "\n", "Learn more about [VideoDataset](https://cloud.google.com/vertex-ai/docs/datasets/prepare-video)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "create_dataset:video,vcn" }, "outputs": [], "source": [ "dataset = aiplatform.VideoDataset.create(\n", " display_name=\"human_motion\",\n", " gcs_source=[IMPORT_FILE],\n", " import_schema_uri=aiplatform.schema.dataset.ioformat.video.classification,\n", ")\n", "\n", "print(dataset.resource_name)" ] }, { "cell_type": "markdown", "metadata": { "id": "create_automl_pipeline:video,vcn" }, "source": [ "### Create and run training pipeline\n", "\n", "To train an AutoML model, you perform two steps: 1) create a training pipeline, and 2) run the pipeline.\n", "\n", "#### Create training pipeline\n", "\n", "An AutoML training pipeline is created with the `AutoMLVideoTrainingJob` class, with the following parameters:\n", "\n", "- `display_name`: The human readable name for the `TrainingJob` resource.\n", "- `prediction_type`: The type of task to train the model for.\n", " - `classification`: A video classification model.\n", " - `object_tracking`: A video object tracking model.\n", " - `action_recognition`: A video action recognition model." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "create_automl_pipeline:video,vcn" }, "outputs": [], "source": [ "dag = aiplatform.AutoMLVideoTrainingJob(\n", " display_name=\"human_motion\",\n", " prediction_type=\"classification\",\n", ")\n", "\n", "print(dag)" ] }, { "cell_type": "markdown", "metadata": { "id": "run_automl_pipeline:video" }, "source": [ "#### Run the training pipeline\n", "\n", "Next, you run the created DAG to start the training job by invoking the method `run`, with the following parameters:\n", "\n", "- `dataset`: The `Dataset` resource to train the model on.\n", "- `model_display_name`: The human readable name for the trained model.\n", "- `training_fraction_split`: The fraction of the dataset to use for training.\n", "- `test_fraction_split`: The fraction of the dataset to use for test (holdout data).\n", "\n", "The `run` method when completed returns the `Model` resource.\n", "\n", "The execution of the training pipeline can take 30 minutes or longer." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "run_automl_pipeline:video" }, "outputs": [], "source": [ "model = dag.run(\n", " dataset=dataset,\n", " model_display_name=\"human_motion\",\n", " training_fraction_split=0.8,\n", " test_fraction_split=0.2,\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "evaluate_the_model:mbsdk" }, "source": [ "## Review model evaluation scores\n", "\n", "After your model training has finished, you can review the evaluation scores for it using the `list_model_evaluations()` method. This method will return an iterator for each evaluation slice." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "evaluate_the_model:mbsdk" }, "outputs": [], "source": [ "model_evaluations = model.list_model_evaluations()\n", "\n", "for model_evaluation in model_evaluations:\n", " print(model_evaluation.to_dict())" ] }, { "cell_type": "markdown", "metadata": { "id": "model_delete:mbsdk" }, "source": [ "#### Delete the model\n", "\n", "The method 'delete()' will delete the model." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "model_delete:mbsdk" }, "outputs": [], "source": [ "model.delete()" ] }, { "cell_type": "markdown", "metadata": { "id": "dataset_delete:mbsdk" }, "source": [ "#### Delete the dataset\n", "\n", "The method 'delete()' will delete the dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "dataset_delete:mbsdk" }, "outputs": [], "source": [ "dataset.delete()" ] }, { "cell_type": "markdown", "metadata": { "id": "cleanup" }, "source": [ "# Cleaning up\n", "\n", "To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud\n", "project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.\n", "\n", "Otherwise, you can delete the individual resources you created in this tutorial.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "cleanup" }, "outputs": [], "source": [ "# Delete the Cloud Storage bucket\n", "delete_bucket = False # Set True for deletion\n", "if delete_bucket:\n", " ! gsutil rm -r $BUCKET_URI" ] } ], "metadata": { "colab": { "name": "get_started_automl_training.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }
