notebooks/official/pipelines/automl_tabular_classification_beans.ipynb

{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "id": "copyright" }, "outputs": [], "source": [ "# Copyright 2021 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "title:generic" }, "source": [ "# Vertex AI Pipelines: AutoML Tabular pipelines using google-cloud-pipeline-components\n", "\n", "<table align=\"left\">\n", " <td style=\"text-align: center\">\n", " <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/pipelines/automl_tabular_classification_beans.ipynb\">\n", " <img src=\"https://cloud.google.com/ml-engine/images/colab-logo-32px.png\" alt=\"Google Colaboratory logo\"><br> Open in Colab\n", " </a>\n", " </td>\n", " <td style=\"text-align: center\">\n", " <a href=\"https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fvertex-ai-samples%2Fmain%2Fnotebooks%2Fofficial%2Fpipelines%2Fautoml_tabular_classification_beans.ipynb\">\n", " <img width=\"32px\" src=\"https://cloud.google.com/ml-engine/images/colab-enterprise-logo-32px.png\" alt=\"Google Cloud Colab Enterprise logo\"><br> Open in Colab Enterprise\n", " </a>\n", " </td> \n", " <td style=\"text-align: center\">\n", " <a href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/pipelines/automl_tabular_classification_beans.ipynb\">\n", " <img src=\"https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32\" alt=\"Vertex AI logo\"><br> Open in Workbench\n", " </a>\n", " </td>\n", " <td style=\"text-align: center\">\n", " <a href=\"https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/pipelines/automl_tabular_classification_beans.ipynb\">\n", " <img src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" alt=\"GitHub logo\"><br> View on GitHub\n", " </a>\n", " </td>\n", "</table>" ] }, { "cell_type": "markdown", "metadata": { "id": "overview:pipelines,automl,beans" }, "source": [ "## Overview\n", "\n", "This notebook shows how to use the components from [*google_cloud_pipeline_components*](https://github.com/kubeflow/pipelines/tree/master/components/google-cloud) SDK to build an AutoML tabular classification workflow in [Vertex AI Pipelines](https://cloud.google.com/vertex-ai/docs/pipelines).\n", "\n", "You build a pipeline in this notebook that looks as below:\n", "\n", "<a href=\"https://storage.googleapis.com/amy-jo/images/mp/beans.png\" target=\"_blank\"><img src=\"https://storage.googleapis.com/amy-jo/images/mp/beans.png\" width=\"95%\"/></a>\n", "\n", "Learn more about [Vertex AI Pipelines](https://cloud.google.com/vertex-ai/docs/pipelines/introduction) and [AutoML 
components](https://cloud.google.com/vertex-ai/docs/pipelines/vertex-automl-component). Learn more about [Classification for tabular data](https://cloud.google.com/vertex-ai/docs/tabular-data/classification-regression/overview)." ] }, { "cell_type": "markdown", "metadata": { "id": "objective:pipelines,automl" }, "source": [ "### Objective\n", "\n", "In this tutorial, you learn to use Vertex AI Pipelines and Google Cloud Pipeline Components to build an AutoML tabular classification model.\n", "\n", "\n", "This tutorial uses the following Vertex AI services:\n", "\n", "- Vertex AI Pipelines\n", "- Google Cloud Pipeline Components\n", "- Vertex AutoML\n", "- Vertex AI Model\n", "- Vertex AI Endpoint\n", "\n", "The steps performed include:\n", "\n", "- Create a KFP pipeline that creates a Vertex AI Dataset.\n", "- Add a component to the pipeline that trains an AutoML tabular classification model resource.\n", "- Add a component that creates a Vertex AI endpoint resource.\n", "- Add a component that deploys the model resource to the endpoint resource.\n", "- Compile the KFP pipeline.\n", "- Execute the KFP pipeline using Vertex AI Pipelines.\n", "\n", "Learn more about the [Vertex AI components in Google Cloud Pipeline Components SDK](https://google-cloud-pipeline-components.readthedocs.io/en/latest/google_cloud_pipeline_components.aiplatform.html#module-google_cloud_pipeline_components.aiplatform)." ] }, { "cell_type": "markdown", "metadata": { "id": "dataset:beans,lcn" }, "source": [ "### Dataset\n", "\n", "This tutorial uses the UCI Machine Learning ['Dry beans dataset'](https://archive.ics.uci.edu/ml/datasets/Dry+Bean+Dataset), from: KOKLU, M. and OZKAN, I.A., (2020), \"Multiclass Classification of Dry Beans Using Computer Vision and Machine Learning Techniques\". In Computers and Electronics in Agriculture, 174, 105507. [DOI](https://doi.org/10.1016/j.compag.2020.105507)." ] }, { "cell_type": "markdown", "metadata": { "id": "costs" }, "source": [ "### Costs\n", "\n", "This tutorial uses billable components of Google Cloud:\n", "\n", "* Vertex AI\n", "* Cloud Storage\n", "\n", "Learn about [Vertex AI\n", "pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage\n", "pricing](https://cloud.google.com/storage/pricing), and use the [Pricing\n", "Calculator](https://cloud.google.com/products/calculator/)\n", "to generate a cost estimate based on your projected usage." ] }, { "cell_type": "markdown", "metadata": { "id": "61RBz8LLbxCR" }, "source": [ "## Get started" ] }, { "cell_type": "markdown", "metadata": { "id": "No17Cw5hgx12" }, "source": [ "### Install Vertex AI SDK for Python and other required packages\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "install_aip:mbsdk" }, "outputs": [], "source": [ "! pip3 install --upgrade --quiet google-cloud-aiplatform \\\n", " google-cloud-storage \\\n", " kfp \\\n", " google-cloud-pipeline-components" ] }, { "cell_type": "markdown", "metadata": { "id": "R5Xep4W9lq-Z" }, "source": [ "### Restart runtime (Colab only)\n", "\n", "To use the newly installed packages, you must restart the runtime on Google Colab." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "D-ZBOjErv5mM" }, "outputs": [], "source": [ "import sys\n", "\n", "if \"google.colab\" in sys.modules:\n", "\n", " import IPython\n", "\n", " app = IPython.Application.instance()\n", " app.kernel.do_shutdown(True)" ] }, { "cell_type": "markdown", "metadata": { "id": "SbmM4z7FOBpM" }, "source": [ "<div class=\"alert alert-block alert-warning\">\n", "<b>⚠️ The kernel is going to restart. Wait until it's finished before continuing to the next step. ⚠️</b>\n", "</div>\n" ] }, { "cell_type": "markdown", "metadata": { "id": "check_versions" }, "source": [ "#### Check the package versions\n", "\n", "Check the versions of the packages you installed. The KFP SDK version should be >=1.8." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "check_versions:kfp,gcpc" }, "outputs": [], "source": [ "! python3 -c \"import kfp; print('KFP SDK version: {}'.format(kfp.__version__))\"\n", "! python3 -c \"import google_cloud_pipeline_components; print('google_cloud_pipeline_components version: {}'.format(google_cloud_pipeline_components.__version__))\"" ] }, { "cell_type": "markdown", "metadata": { "id": "dmWOrTJ3gx13" }, "source": [ "### Authenticate your notebook environment (Colab only)\n", "\n", "Authenticate your environment on Google Colab.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "NyKGtVQjgx13" }, "outputs": [], "source": [ "import sys\n", "\n", "if \"google.colab\" in sys.modules:\n", "\n", " from google.colab import auth\n", "\n", " auth.authenticate_user()" ] }, { "cell_type": "markdown", "metadata": { "id": "DF4l8DTdWgPY" }, "source": [ "### Set Google Cloud project information \n", "\n", "To get started using Vertex AI, you must have an existing Google Cloud project. Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "set_project_id" }, "outputs": [], "source": [ "PROJECT_ID = \"[your-project-id]\" # @param {type:\"string\"}\n", "LOCATION = \"us-central1\" # @param {type:\"string\"}" ] }, { "cell_type": "markdown", "metadata": { "id": "zgPO1eR3CYjk" }, "source": [ "### Create a Cloud Storage bucket\n", "\n", "Create a storage bucket to store intermediate artifacts such as datasets." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "MzGDU7TWdts_" }, "outputs": [], "source": [ "BUCKET_URI = f\"gs://your-bucket-name-{PROJECT_ID}-unique\" # @param {type:\"string\"}" ] }, { "cell_type": "markdown", "metadata": { "id": "create_bucket" }, "source": [ "**If your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "NIq7R4HZCfIc" }, "outputs": [], "source": [ "! gsutil mb -l {LOCATION} -p {PROJECT_ID} {BUCKET_URI}" ] }, { "cell_type": "markdown", "metadata": { "id": "set_service_account" }, "source": [ "### Service Account\n", "\n", "**If you don't know your service account**, try to get your service account using `gcloud` command by executing the second cell below." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "set_service_account" }, "outputs": [], "source": [ "SERVICE_ACCOUNT = \"\" # @param {type:\"string\"}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "autoset_service_account" }, "outputs": [], "source": [ "import sys\n", "\n", "IS_COLAB = \"google.colab\" in sys.modules\n", "if (\n", " SERVICE_ACCOUNT == \"\"\n", " or SERVICE_ACCOUNT is None\n", " or SERVICE_ACCOUNT == \"[your-service-account]\"\n", "):\n", " # Get your service account from gcloud\n", " if not IS_COLAB:\n", " shell_output = !gcloud auth list 2>/dev/null\n", " SERVICE_ACCOUNT = shell_output[2].replace(\"*\", \"\").strip()\n", "\n", " if IS_COLAB:\n", " shell_output = ! gcloud projects describe $PROJECT_ID\n", " project_number = shell_output[-1].split(\":\")[1].strip().replace(\"'\", \"\")\n", " SERVICE_ACCOUNT = f\"{project_number}-compute@developer.gserviceaccount.com\"\n", "\n", " print(\"Service Account:\", SERVICE_ACCOUNT)" ] }, { "cell_type": "markdown", "metadata": { "id": "set_service_account:pipelines" }, "source": [ "#### Set service account access for Vertex AI Pipelines\n", "\n", "Run the following commands to grant your service account access to read and write pipeline artifacts in the bucket that you created in the previous step. You only need to run these once per service account." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "set_service_account:pipelines" }, "outputs": [], "source": [ "! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectCreator $BUCKET_URI\n", "\n", "! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectViewer $BUCKET_URI" ] }, { "cell_type": "markdown", "metadata": { "id": "setup_vars" }, "source": [ "### Import libraries and define constants" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "import_aip:mbsdk" }, "outputs": [], "source": [ "from typing import NamedTuple\n", "\n", "import google.cloud.aiplatform as aiplatform\n", "import kfp\n", "from google.cloud import bigquery\n", "from kfp import compiler, dsl\n", "from kfp.dsl import (Artifact, ClassificationMetrics, Input, Metrics, Output,\n", " component)" ] }, { "cell_type": "markdown", "metadata": { "id": "aip_constants:endpoint" }, "source": [ "#### Vertex AI constants\n", "\n", "Setup up the following constants for Vertex AI pipeline:\n", "- `PIPELINE_NAME`: Set name for the pipeline.\n", "- `PIPELINE_ROOT`: Cloud Storage bucket path to store pipeline artifacts." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "aip_constants:endpoint" }, "outputs": [], "source": [ "# set path for storing the pipeline artifacts\n", "PIPELINE_NAME = \"automl-tabular-beans-training\"\n", "PIPELINE_ROOT = \"{}/pipeline_root/beans\".format(BUCKET_URI)" ] }, { "cell_type": "markdown", "metadata": { "id": "init_aip:mbsdk" }, "source": [ "### Initialize Vertex AI SDK for Python\n", "\n", "To get started using Vertex AI, you must [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).\n", "\n", "Initialize the Vertex AI SDK for Python for your project and corresponding bucket." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "init_aip:mbsdk" }, "outputs": [], "source": [ "aiplatform.init(project=PROJECT_ID, location=LOCATION, staging_bucket=BUCKET_URI)" ] }, { "cell_type": "markdown", "metadata": { "id": "define_component:eval" }, "source": [ "## Define a custom component for metrics evaluation \n", "\n", "In this tutorial, you define one custom pipeline component. The remaining components are pre-built\n", "components for Vertex AI services.\n", "\n", "The custom pipeline component you define is a Python function-based component.\n", "Python function-based components make it easier to iterate quickly by letting you build your component code as a Python function and generating the component specification for you.\n", "\n", "Note the `@component` decorator. When you evaluate the `classification_model_eval` function, the component is compiled to what is essentially a task factory function, that can be used in the the pipeline definition.\n", "\n", "In addition, a **tabular_eval_component.yaml** component definition file is generated. The component **yaml** file can be shared & placed under version control, and used later to define a pipeline step.\n", "\n", "The component definition specifies a base image for the component to use, and specifies that the `google-cloud-aiplatform` package should be installed. When not specified, the base image defaults to Python 3.7.\n", "\n", "The custom pipeline component you create in this step retrieves the classification model evaluation metrics generated by the AutoML tabular training process. Then, it parses the evaluation data, and renders the ROC curve and confusion matrix for the model. Further, the set thresholds are checked against the generated metrics to determine whether the model is accurate enough for deployment.\n", "\n", "**Note**: This custom component is specific to an AutoML tabular classification task." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "define_component:eval" }, "outputs": [], "source": [ "@component(\n", " base_image=\"gcr.io/deeplearning-platform-release/tf2-cpu.2-6:latest\",\n", " packages_to_install=[\"google-cloud-aiplatform\"],\n", ")\n", "def classification_model_eval_metrics(\n", " project: str,\n", " location: str,\n", " thresholds_dict_str: str,\n", " model: Input[Artifact],\n", " metrics: Output[Metrics],\n", " metricsc: Output[ClassificationMetrics],\n", ") -> NamedTuple(\"Outputs\", [(\"dep_decision\", str)]): # Return parameter.\n", "\n", " import json\n", " import logging\n", "\n", " from google.cloud import aiplatform\n", "\n", " aiplatform.init(project=project)\n", "\n", " # Fetch model eval info\n", " def get_eval_info(model):\n", " response = model.list_model_evaluations()\n", " metrics_list = []\n", " metrics_string_list = []\n", " for evaluation in response:\n", " evaluation = evaluation.to_dict()\n", " print(\"model_evaluation\")\n", " print(\" name:\", evaluation[\"name\"])\n", " print(\" metrics_schema_uri:\", evaluation[\"metricsSchemaUri\"])\n", " metrics = evaluation[\"metrics\"]\n", " for metric in metrics.keys():\n", " logging.info(\"metric: %s, value: %s\", metric, metrics[metric])\n", " metrics_str = json.dumps(metrics)\n", " metrics_list.append(metrics)\n", " metrics_string_list.append(metrics_str)\n", "\n", " return (\n", " evaluation[\"name\"],\n", " metrics_list,\n", " metrics_string_list,\n", " )\n", "\n", " # Use the given metrics threshold(s) to determine whether the model is\n", " # accurate enough to deploy.\n", " def classification_thresholds_check(metrics_dict, thresholds_dict):\n", " for k, v in thresholds_dict.items():\n", " logging.info(\"k {}, v {}\".format(k, v))\n", " if k in [\"auRoc\", \"auPrc\"]: # higher is better\n", " if metrics_dict[k] < v: # if under threshold, don't deploy\n", " logging.info(\"{} < {}; returning False\".format(metrics_dict[k], v))\n", " return False\n", " logging.info(\"threshold checks passed.\")\n", " return True\n", "\n", " def log_metrics(metrics_list, metricsc):\n", " test_confusion_matrix = metrics_list[0][\"confusionMatrix\"]\n", " logging.info(\"rows: %s\", test_confusion_matrix[\"rows\"])\n", "\n", " # log the ROC curve\n", " fpr = []\n", " tpr = []\n", " thresholds = []\n", " for item in metrics_list[0][\"confidenceMetrics\"]:\n", " fpr.append(item.get(\"falsePositiveRate\", 0.0))\n", " tpr.append(item.get(\"recall\", 0.0))\n", " thresholds.append(item.get(\"confidenceThreshold\", 0.0))\n", " print(f\"fpr: {fpr}\")\n", " print(f\"tpr: {tpr}\")\n", " print(f\"thresholds: {thresholds}\")\n", " metricsc.log_roc_curve(fpr, tpr, thresholds)\n", "\n", " # log the confusion matrix\n", " annotations = []\n", " for item in test_confusion_matrix[\"annotationSpecs\"]:\n", " annotations.append(item[\"displayName\"])\n", " logging.info(\"confusion matrix annotations: %s\", annotations)\n", " metricsc.log_confusion_matrix(\n", " annotations,\n", " test_confusion_matrix[\"rows\"],\n", " )\n", "\n", " # log textual metrics info as well\n", " for metric in metrics_list[0].keys():\n", " if metric != \"confidenceMetrics\":\n", " val_string = json.dumps(metrics_list[0][metric])\n", " metrics.log_metric(metric, val_string)\n", "\n", " logging.getLogger().setLevel(logging.INFO)\n", "\n", " # extract the model resource name from the input Model Artifact\n", " model_resource_path = model.metadata[\"resourceName\"]\n", " logging.info(\"model path: %s\", model_resource_path)\n", "\n", " # Get 
the trained model resource\n", " model = aiplatform.Model(model_resource_path)\n", "\n", " # Get model evaluation metrics from the the trained model\n", " eval_name, metrics_list, metrics_str_list = get_eval_info(model)\n", " logging.info(\"got evaluation name: %s\", eval_name)\n", " logging.info(\"got metrics list: %s\", metrics_list)\n", " log_metrics(metrics_list, metricsc)\n", "\n", " thresholds_dict = json.loads(thresholds_dict_str)\n", " deploy = classification_thresholds_check(metrics_list[0], thresholds_dict)\n", " if deploy:\n", " dep_decision = \"true\"\n", " else:\n", " dep_decision = \"false\"\n", " logging.info(\"deployment decision is %s\", dep_decision)\n", "\n", " return (dep_decision,)\n", "\n", "\n", "compiler.Compiler().compile(\n", " classification_model_eval_metrics, \"tabular_eval_component.yaml\"\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "define_pipeline:gcpc,beans,lcn" }, "source": [ "## Define pipeline \n", "\n", "Define the pipeline for AutoML tabular classification using the components from `google_cloud_pipeline_components` package." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "define_pipeline:gcpc,beans,lcn" }, "outputs": [], "source": [ "@kfp.dsl.pipeline(name=PIPELINE_NAME, pipeline_root=PIPELINE_ROOT)\n", "def pipeline(\n", " bq_source: str,\n", " DATASET_DISPLAY_NAME: str,\n", " TRAINING_DISPLAY_NAME: str,\n", " MODEL_DISPLAY_NAME: str,\n", " ENDPOINT_DISPLAY_NAME: str,\n", " MACHINE_TYPE: str,\n", " project: str,\n", " gcp_region: str,\n", " thresholds_dict_str: str,\n", "):\n", " from google_cloud_pipeline_components.v1.automl.training_job import \\\n", " AutoMLTabularTrainingJobRunOp\n", " from google_cloud_pipeline_components.v1.dataset.create_tabular_dataset.component import \\\n", " tabular_dataset_create as TabularDatasetCreateOp\n", " from google_cloud_pipeline_components.v1.endpoint.create_endpoint.component import \\\n", " endpoint_create as EndpointCreateOp\n", " from google_cloud_pipeline_components.v1.endpoint.deploy_model.component import \\\n", " model_deploy as ModelDeployOp\n", "\n", " dataset_create_op = TabularDatasetCreateOp(\n", " project=project,\n", " location=gcp_region,\n", " display_name=DATASET_DISPLAY_NAME,\n", " bq_source=bq_source,\n", " )\n", "\n", " training_op = AutoMLTabularTrainingJobRunOp(\n", " project=project,\n", " location=gcp_region,\n", " display_name=TRAINING_DISPLAY_NAME,\n", " optimization_prediction_type=\"classification\",\n", " optimization_objective=\"minimize-log-loss\",\n", " budget_milli_node_hours=1000,\n", " model_display_name=MODEL_DISPLAY_NAME,\n", " column_specs={\n", " \"Area\": \"numeric\",\n", " \"Perimeter\": \"numeric\",\n", " \"MajorAxisLength\": \"numeric\",\n", " \"MinorAxisLength\": \"numeric\",\n", " \"AspectRation\": \"numeric\",\n", " \"Eccentricity\": \"numeric\",\n", " \"ConvexArea\": \"numeric\",\n", " \"EquivDiameter\": \"numeric\",\n", " \"Extent\": \"numeric\",\n", " \"Solidity\": \"numeric\",\n", " \"roundness\": \"numeric\",\n", " \"Compactness\": \"numeric\",\n", " \"ShapeFactor1\": \"numeric\",\n", " \"ShapeFactor2\": \"numeric\",\n", " \"ShapeFactor3\": \"numeric\",\n", " \"ShapeFactor4\": \"numeric\",\n", " \"Class\": \"categorical\",\n", " },\n", " dataset=dataset_create_op.outputs[\"dataset\"],\n", " target_column=\"Class\",\n", " )\n", "\n", " model_eval_task = classification_model_eval_metrics(\n", " project=project,\n", " location=gcp_region,\n", " thresholds_dict_str=thresholds_dict_str,\n", " 
model=training_op.outputs[\"model\"],\n", " )\n", "\n", " with dsl.If(\n", " model_eval_task.outputs[\"dep_decision\"] == \"true\",\n", " name=\"deploy_decision\",\n", " ):\n", "\n", " endpoint_op = EndpointCreateOp(\n", " project=project,\n", " location=gcp_region,\n", " display_name=ENDPOINT_DISPLAY_NAME,\n", " )\n", "\n", " ModelDeployOp(\n", " model=training_op.outputs[\"model\"],\n", " endpoint=endpoint_op.outputs[\"endpoint\"],\n", " dedicated_resources_min_replica_count=1,\n", " dedicated_resources_max_replica_count=1,\n", " dedicated_resources_machine_type=MACHINE_TYPE,\n", " )" ] }, { "cell_type": "markdown", "metadata": { "id": "compile_pipeline" }, "source": [ "## Compile the pipeline\n", "\n", "Next, compile the pipeline to a yaml file." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "compile_pipeline" }, "outputs": [], "source": [ "compiler.Compiler().compile(\n", " pipeline_func=pipeline,\n", " package_path=\"tabular_classification_pipeline.yaml\",\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "run_pipeline:model" }, "source": [ "## Run the pipeline\n", "\n", "Pass the input parameters required for the pipeline and run it. The defined pipeline takes the following parameters:\n", "\n", "- `bq_source`: BigQuery source for the tabular dataset.\n", "- `DATASET_DISPLAY_NAME`: Display name for the Vertex AI managed dataset.\n", "- `TRAINIG_DISPLAY_NAME`: Display name for the AutoML training job.\n", "- `MODEL_DISPLAY_NAME`: Display name for the Vertex AI model generated as a result of the training job.\n", "- `ENDPOINT_DISPLAY_NAME`: Display name for the Vertex AI endpoint where the model is deployed.\n", "- `MACHINE_TYPE`: Machine type for the serving container.\n", "- `project`: Id of the project where the pipeline is run.\n", "- `gcp_region`: Region for setting the pipeline location.\n", "- `thresholds_dict_str`: Dictionary of thresholds based on which the model deployment is conditioned.\n", "- `pipeline_root`: To override the pipeline root path specified in the pipeline job's definition, specify a path that your pipeline job can access, such as a Cloud Storage bucket URI." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "1adf9b056954" }, "outputs": [], "source": [ "# Set the display-names for Vertex AI resources\n", "PIPELINE_DISPLAY_NAME = \"[your-pipeline-display-name]\" # @param {type:\"string\"}\n", "DATASET_DISPLAY_NAME = \"[your-dataset-display-name]\" # @param {type:\"string\"}\n", "MODEL_DISPLAY_NAME = \"[your-model-display-name]\" # @param {type:\"string\"}\n", "TRAINING_DISPLAY_NAME = \"[your-training-job-display-name]\" # @param {type:\"string\"}\n", "ENDPOINT_DISPLAY_NAME = \"[your-endpoint-display-name]\" # @param {type:\"string\"}\n", "\n", "# Otherwise, use the default display-names\n", "if PIPELINE_DISPLAY_NAME == \"[your-pipeline-display-name]\":\n", " PIPELINE_DISPLAY_NAME = \"pipeline_beans-unique\"\n", "\n", "if DATASET_DISPLAY_NAME == \"[your-dataset-display-name]\":\n", " DATASET_DISPLAY_NAME = \"dataset_beans-unique\"\n", "\n", "if MODEL_DISPLAY_NAME == \"[your-model-display-name]\":\n", " MODEL_DISPLAY_NAME = \"model_beans-unique\"\n", "\n", "if TRAINING_DISPLAY_NAME == \"[your-training-job-display-name]\":\n", " TRAINING_DISPLAY_NAME = \"automl_training_beans-unique\"\n", "\n", "if ENDPOINT_DISPLAY_NAME == \"[your-endpoint-display-name]\":\n", " ENDPOINT_DISPLAY_NAME = \"endpoint_beans-unique\"\n", "\n", "# Set machine type\n", "MACHINE_TYPE = \"n1-standard-4\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "run_pipeline:model" }, "outputs": [], "source": [ "# Validate region of the given source (BigQuery) against region of the pipeline\n", "bq_source = \"aju-dev-demos.beans.beans1\"\n", "\n", "client = bigquery.Client()\n", "bq_region = client.get_table(bq_source).location.lower()\n", "try:\n", " assert bq_region in LOCATION\n", " print(f\"Region validated: {LOCATION}\")\n", "except AssertionError:\n", " print(\n", " \"Please make sure the region of BigQuery (source) and that of the pipeline are the same.\"\n", " )\n", "\n", "# Configure the pipeline\n", "job = aiplatform.PipelineJob(\n", " display_name=PIPELINE_DISPLAY_NAME,\n", " template_path=\"tabular_classification_pipeline.yaml\",\n", " pipeline_root=PIPELINE_ROOT,\n", " parameter_values={\n", " \"project\": PROJECT_ID,\n", " \"gcp_region\": LOCATION,\n", " \"bq_source\": f\"bq://{bq_source}\",\n", " \"thresholds_dict_str\": '{\"auRoc\": 0.95}',\n", " \"DATASET_DISPLAY_NAME\": DATASET_DISPLAY_NAME,\n", " \"TRAINING_DISPLAY_NAME\": TRAINING_DISPLAY_NAME,\n", " \"MODEL_DISPLAY_NAME\": MODEL_DISPLAY_NAME,\n", " \"ENDPOINT_DISPLAY_NAME\": ENDPOINT_DISPLAY_NAME,\n", " \"MACHINE_TYPE\": MACHINE_TYPE,\n", " },\n", " enable_caching=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "view_pipeline_run:model" }, "source": [ "Run the pipeline job. Click on the generated link to see your run in the Cloud Console." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "114ab8ff24ac" }, "outputs": [], "source": [ "# Run the job\n", "job.run()" ] }, { "cell_type": "markdown", "metadata": { "id": "compare_pipeline_runs" }, "source": [ "## Check the parameters and metrics of the pipeline run from their tracked metadata\n", "\n", "Next, you use the Vertex AI SDK for Python to check the parameters and metrics of the pipeline run. Wait until the pipeline run has finished to run the next cell." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "compare_pipeline_runs" }, "outputs": [], "source": [ "pipeline_df = aiplatform.get_pipeline_df(pipeline=PIPELINE_NAME)\n", "print(pipeline_df.head(2))" ] }, { "cell_type": "markdown", "metadata": { "id": "cleanup:pipelines" }, "source": [ "## Cleaning up\n", "\n", "To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud\n", "project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.\n", "\n", "Otherwise, you can delete the individual resources you created in this tutorial." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "cleanup:pipelines" }, "outputs": [], "source": [ "# Delete the Vertex AI Pipeline Job\n", "job.delete()\n", "\n", "# List and filter the Vertex AI Endpoint\n", "endpoints = aiplatform.Endpoint.list(\n", " filter=f\"display_name={ENDPOINT_DISPLAY_NAME}\", order_by=\"create_time\"\n", ")\n", "# Delete the Vertex AI Endpoint\n", "if len(endpoints) > 0:\n", " endpoint = endpoints[0]\n", " endpoint.delete(force=True)\n", "\n", "# List and filter the Vertex AI model\n", "models = aiplatform.Model.list(\n", " filter=f\"display_name={MODEL_DISPLAY_NAME}\", order_by=\"create_time\"\n", ")\n", "# Delete the Vertex AI model\n", "if len(models) > 0:\n", " model = models[0]\n", " model.delete()\n", "\n", "# List and filter the Vertex AI Dataset\n", "datasets = aiplatform.TabularDataset.list(\n", " filter=f\"display_name={DATASET_DISPLAY_NAME}\", order_by=\"create_time\"\n", ")\n", "# Delete the Vertex AI Dataset\n", "if len(datasets) > 0:\n", " dataset = datasets[0]\n", " dataset.delete()\n", "\n", "# Delete the Cloud Storage bucket\n", "delete_bucket = False # Set True for deletion\n", "if delete_bucket:\n", " ! gsutil rm -r $BUCKET_URI" ] } ], "metadata": { "colab": { "name": "automl_tabular_classification_beans.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }