notebooks/official/automl/automl_tabular_on_vertex_pipelines.ipynb

{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "id": "18ebbd838e32" }, "outputs": [], "source": [ "# Copyright 2022 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "mThXALJl9Yue" }, "source": [ "# AutoML Tabular Workflow pipelines\n", "\n", "<table align=\"left\">\n", " <td style=\"text-align: center\">\n", " <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/automl/automl_tabular_on_vertex_pipelines.ipynb\">\n", " <img src=\"https://cloud.google.com/ml-engine/images/colab-logo-32px.png\" alt=\"Google Colaboratory logo\"><br> Open in Colab\n", " </a>\n", " </td>\n", " <td style=\"text-align: center\">\n", " <a href=\"https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fvertex-ai-samples%2Fmain%2Fnotebooks%2Fofficial%2Fautoml%2Fautoml_tabular_on_vertex_pipelines.ipynb\">\n", " <img width=\"32px\" src=\"https://cloud.google.com/ml-engine/images/colab-enterprise-logo-32px.png\" alt=\"Google Cloud Colab Enterprise logo\"><br> Open in Colab Enterprise\n", " </a>\n", " </td> \n", " <td style=\"text-align: center\">\n", " <a href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/automl/automl_tabular_on_vertex_pipelines.ipynb\">\n", " <img src=\"https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32\" alt=\"Vertex AI logo\"><br> Open in Workbench\n", " </a>\n", " </td>\n", " <td style=\"text-align: center\">\n", " <a href=\"https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/automl/automl_tabular_on_vertex_pipelines.ipynb\">\n", " <img src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" alt=\"GitHub logo\"><br> View on GitHub\n", " </a>\n", " </td>\n", "</table>" ] }, { "cell_type": "markdown", "metadata": { "id": "fcc745968395" }, "source": [ "## Overview\n", "\n", "In this tutorial, you will use two Vertex AI Tabular Workflows pipelines to train AutoML models using different configurations. You will see how `get_automl_tabular_pipeline_and_parameters` gives you the ability to customize the default AutoML Tabular pipeline, and how `get_skip_architecture_search_pipeline_and_parameters` allows you to reduce the training time and cost for an AutoML model by using the tuning results from a previous pipeline run.\n", "\n", "Learn more about [Tabular Workflow for E2E AutoML](https://cloud.google.com/vertex-ai/docs/tabular-data/tabular-workflows/e2e-automl)." 
] }, { "cell_type": "markdown", "metadata": { "id": "f887ec5c06c5" }, "source": [ "### Objective\n", "\n", "In this tutorial, you learn how to create two regression models using [Vertex AI Pipelines](https://cloud.google.com/vertex-ai/docs/pipelines/introduction) downloaded from [Google Cloud Pipeline Components](https://cloud.google.com/vertex-ai/docs/pipelines/components-introduction) (GCPC). These pipelines will be Vertex AI Tabular Workflow pipelines which are maintained by Google. These pipelines will showcase different ways to customize the Vertex Tabular training process.\n", "\n", "This tutorial uses the following Google Cloud ML services:\n", "\n", "- `AutoML Training`\n", "- `Vertex AI Datasets`\n", "\n", "The steps performed are:\n", "\n", "- Create a training pipeline that reduces the search space from the default to save time.\n", "- Create a training pipeline that reuses the architecture search results from the previous pipeline to save time." ] }, { "cell_type": "markdown", "metadata": { "id": "eac26958afe8" }, "source": [ "### Dataset\n", "\n", "The dataset you will be using is [Bank Marketing](https://archive.ics.uci.edu/ml/datasets/bank+marketing).\n", "The data is for direct marketing campaigns (phone calls) of a Portuguese banking institution. The binary classification goal is to predict if a client will subscribe a term deposit. For this notebook, we randomly selected 90% of the rows in the original dataset and saved them in a train.csv file hosted on Cloud Storage. To download the file, click [here](https://storage.googleapis.com/cloud-samples-data/vertex-ai/tabular-workflows/datasets/bank-marketing/train.csv)." ] }, { "cell_type": "markdown", "metadata": { "id": "181d4dfbf917" }, "source": [ "### Costs\n", "\n", "This tutorial uses billable components of Google Cloud:\n", "\n", "* Vertex AI\n", "* Cloud Storage\n", "\n", "Learn about [Vertex AI\n", "pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage\n", "pricing](https://cloud.google.com/storage/pricing), and use the [Pricing\n", "Calculator](https://cloud.google.com/products/calculator/)\n", "to generate a cost estimate based on your projected usage." ] }, { "cell_type": "markdown", "metadata": { "id": "f0316df526f8" }, "source": [ "## Get started" ] }, { "cell_type": "markdown", "metadata": { "id": "install_aip:mbsdk" }, "source": [ "### Install Vertex AI SDK for Python and other required packages" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "E7-SzYTR9bo2" }, "outputs": [], "source": [ "!pip3 install --upgrade --quiet google-cloud-pipeline-components==1.0.45 \\\n", " google-cloud-aiplatform" ] }, { "cell_type": "markdown", "metadata": { "id": "ff555b32bab8" }, "source": [ "### Restart runtime (Colab only)\n", "\n", "To use the newly installed packages, you must restart the runtime on Google Colab." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "f09b4dff629a" }, "outputs": [], "source": [ "import sys\n", "\n", "if \"google.colab\" in sys.modules:\n", "\n", " import IPython\n", "\n", " app = IPython.Application.instance()\n", " app.kernel.do_shutdown(True)" ] }, { "cell_type": "markdown", "metadata": { "id": "4a2b7b59bbf7" }, "source": [ "<div class=\"alert alert-block alert-warning\">\n", "<b>⚠️ The kernel is going to restart. Wait until it's finished before continuing to the next step. 
⚠️</b>\n", "</div>" ] }, { "cell_type": "markdown", "metadata": { "id": "f82e28c631cc" }, "source": [ "### Authenticate your notebook environment (Colab only)\n", "\n", "Authenticate your environment on Google Colab." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "46604f70e831" }, "outputs": [], "source": [ "import sys\n", "\n", "if \"google.colab\" in sys.modules:\n", "\n", " from google.colab import auth\n", "\n", " auth.authenticate_user()" ] }, { "cell_type": "markdown", "metadata": { "id": "91842ef41bbd" }, "source": [ "### Set Google Cloud project information\n", "\n", "To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com). Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "set_project_id" }, "outputs": [], "source": [ "PROJECT_ID = \"[your-project-id]\" # @param {type:\"string\"}\n", "LOCATION = \"us-central1\" # @param {type: \"string\"}" ] }, { "cell_type": "markdown", "metadata": { "id": "bucket:mbsdk" }, "source": [ "### Create a Cloud Storage bucket\n", "\n", "Create a storage bucket to store intermediate artifacts such as datasets." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bucket" }, "outputs": [], "source": [ "BUCKET_URI = f\"gs://your-bucket-name-{PROJECT_ID}-unique\" # @param {type:\"string\"}" ] }, { "cell_type": "markdown", "metadata": { "id": "create_bucket" }, "source": [ "**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "create_bucket" }, "outputs": [], "source": [ "! gsutil mb -l $LOCATION $BUCKET_URI" ] }, { "cell_type": "markdown", "metadata": { "id": "44accda192d5" }, "source": [ "#### Service Account\n", "\n", "You use a service account to create Vertex AI Pipeline jobs. If you don't want to use your project's Compute Engine service account, set `SERVICE_ACCOUNT` to another service account ID." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "e0c9c4f84849" }, "outputs": [], "source": [ "SERVICE_ACCOUNT = \"[your-service-account]\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "604ae09ab6d3" }, "outputs": [], "source": [ "import sys\n", "\n", "IS_COLAB = \"google.colab\" in sys.modules\n", "if (\n", " SERVICE_ACCOUNT == \"\"\n", " or SERVICE_ACCOUNT is None\n", " or SERVICE_ACCOUNT == \"[your-service-account]\"\n", "):\n", " # Get your service account from gcloud\n", " if not IS_COLAB:\n", " shell_output = !gcloud auth list 2>/dev/null\n", " SERVICE_ACCOUNT = shell_output[2].replace(\"*\", \"\").strip()\n", "\n", " else: # IS_COLAB:\n", " shell_output = ! gcloud projects describe $PROJECT_ID\n", " project_number = shell_output[-1].split(\":\")[1].strip().replace(\"'\", \"\")\n", " SERVICE_ACCOUNT = f\"{project_number}-compute@developer.gserviceaccount.com\"\n", "\n", " print(\"Service Account:\", SERVICE_ACCOUNT)" ] }, { "cell_type": "markdown", "metadata": { "id": "d1ecb60964d5" }, "source": [ "#### Set service account access for Vertex AI Pipelines\n", "Run the following commands to grant your service account access to read and write pipeline artifacts in the bucket that you created in the previous step. 
You only need to run this step once per service account." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "a592f0a380c2" }, "outputs": [], "source": [ "! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectCreator $BUCKET_URI\n", "\n", "! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectViewer $BUCKET_URI" ] }, { "cell_type": "markdown", "metadata": { "id": "fbbc3479a1da" }, "source": [ "## Import libraries and define constants" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8G6YmJT1yqkV" }, "outputs": [], "source": [ "import json\n", "# Import required modules\n", "import os\n", "from typing import Any, Dict, List\n", "\n", "from google.cloud import aiplatform, storage\n", "from google_cloud_pipeline_components.experimental.automl.tabular import \\\n", " utils as automl_tabular_utils" ] }, { "cell_type": "markdown", "metadata": { "id": "c0423f260423" }, "source": [ "## Initialize Vertex AI SDK for Python\n", "\n", "Initialize the Vertex AI SDK for Python for your project." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ad69f2590268" }, "outputs": [], "source": [ "aiplatform.init(project=PROJECT_ID, location=LOCATION)" ] }, { "cell_type": "markdown", "metadata": { "id": "3LWH3PRF5o2v" }, "source": [ "### Define helper functions" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "g9FPFT8c5oC0" }, "outputs": [], "source": [ "def get_bucket_name_and_path(uri):\n", " no_prefix_uri = uri[len(\"gs://\") :]\n", " splits = no_prefix_uri.split(\"/\")\n", " return splits[0], \"/\".join(splits[1:])\n", "\n", "\n", "def download_from_gcs(uri):\n", " bucket_name, path = get_bucket_name_and_path(uri)\n", " storage_client = storage.Client(project=PROJECT_ID)\n", " bucket = storage_client.get_bucket(bucket_name)\n", " blob = bucket.blob(path)\n", " return blob.download_as_string()\n", "\n", "\n", "def write_to_gcs(uri: str, content: str):\n", " bucket_name, path = get_bucket_name_and_path(uri)\n", " storage_client = storage.Client()\n", " bucket = storage_client.get_bucket(bucket_name)\n", " blob = bucket.blob(path)\n", " blob.upload_from_string(content)\n", "\n", "\n", "def generate_auto_transformation(column_names: List[str]) -> List[Dict[str, Any]]:\n", " transformations = []\n", " for column_name in column_names:\n", " transformations.append({\"auto\": {\"column_name\": column_name}})\n", " return transformations\n", "\n", "\n", "def write_auto_transformations(uri: str, column_names: List[str]):\n", " transformations = generate_auto_transformation(column_names)\n", " write_to_gcs(uri, json.dumps(transformations))\n", "\n", "\n", "def get_task_detail(\n", " task_details: List[Dict[str, Any]], task_name: str\n", ") -> List[Dict[str, Any]]:\n", " for task_detail in task_details:\n", " if task_detail.task_name == task_name:\n", " return task_detail\n", "\n", "\n", "def get_deployed_model_uri(\n", " task_details,\n", "):\n", " ensemble_task = get_task_detail(task_details, \"model-upload\")\n", " return ensemble_task.outputs[\"model\"].artifacts[0].uri\n", "\n", "\n", "def get_no_custom_ops_model_uri(task_details):\n", " ensemble_task = get_task_detail(task_details, \"automl-tabular-ensemble\")\n", " return download_from_gcs(\n", " ensemble_task.outputs[\"model_without_custom_ops\"].artifacts[0].uri\n", " )\n", "\n", "\n", "def get_feature_attributions(\n", " task_details,\n", "):\n", " ensemble_task = get_task_detail(task_details, \"feature-attribution-2\")\n", " return 
download_from_gcs(\n", " ensemble_task.outputs[\"feature_attributions\"].artifacts[0].uri\n", " )\n", "\n", "\n", "def get_evaluation_metrics(\n", " task_details,\n", "):\n", " ensemble_task = get_task_detail(task_details, \"model-evaluation-2\")\n", " return download_from_gcs(\n", " ensemble_task.outputs[\"evaluation_metrics\"].artifacts[0].uri\n", " )\n", "\n", "\n", "def load_and_print_json(s):\n", " parsed = json.loads(s)\n", " print(json.dumps(parsed, indent=2, sort_keys=True))" ] }, { "cell_type": "markdown", "metadata": { "id": "gvNFMRmBegZq" }, "source": [ "### Define training specification" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "eV4JrwB8wAkg" }, "outputs": [], "source": [ "run_evaluation = True # @param {type:\"boolean\"}\n", "run_distillation = False # @param {type:\"boolean\"}\n", "root_dir = os.path.join(BUCKET_URI, \"automl_tabular_pipeline\")\n", "prediction_type = \"classification\"\n", "optimization_objective = \"minimize-log-loss\"\n", "target_column = \"deposit\"\n", "data_source_csv_filenames = \"gs://cloud-samples-data/vertex-ai/tabular-workflows/datasets/bank-marketing/train.csv\"\n", "data_source_bigquery_table_path = None # format: bq://bq_project.bq_dataset.bq_table\n", "\n", "timestamp_split_key = None # timestamp column name when using timestamp split\n", "stratified_split_key = None # target column name when using stratified split\n", "training_fraction = 0.8\n", "validation_fraction = 0.1\n", "test_fraction = 0.1\n", "\n", "predefined_split_key = None\n", "if predefined_split_key:\n", " training_fraction = None\n", " validation_fraction = None\n", " test_fraction = None\n", "\n", "weight_column = None\n", "\n", "features = [\n", " \"age\",\n", " \"job\",\n", " \"marital\",\n", " \"education\",\n", " \"default\",\n", " \"balance\",\n", " \"housing\",\n", " \"loan\",\n", " \"contact\",\n", " \"day\",\n", " \"month\",\n", " \"duration\",\n", " \"campaign\",\n", " \"pdays\",\n", " \"previous\",\n", " \"poutcome\",\n", "]\n", "transformations = generate_auto_transformation(features)\n", "transform_config_path = os.path.join(root_dir, \"transform_config_unique.json\")\n", "write_to_gcs(transform_config_path, json.dumps(transformations))" ] }, { "cell_type": "markdown", "metadata": { "id": "zyWGg2s09xOk" }, "source": [ "## VPC related config\n", "\n", "If you need to use a custom Dataflow subnetwork, you can set it through the `dataflow_subnetwork` parameter. The requirements are:\n", "1. `dataflow_subnetwork` must be fully qualified subnetwork name.\n", " [[reference](https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications)]\n", "1. The following service accounts must have [Compute Network User role](https://cloud.google.com/compute/docs/access/iam#compute.networkUser) assigned on the specified dataflow subnetwork [[reference](https://cloud.google.com/dataflow/docs/guides/specifying-networks#shared)]:\n", " 1. Compute Engine default service account: PROJECT_NUMBER-compute@developer.gserviceaccount.com\n", " 1. Dataflow service account: service-PROJECT_NUMBER@dataflow-service-producer-prod.iam.gserviceaccount.com\n", "\n", "If your project has VPC-SC enabled, please make sure:\n", "\n", "1. The dataflow subnetwork used in VPC-SC is configured properly for Dataflow.\n", " [[reference](https://cloud.google.com/dataflow/docs/guides/routes-firewall)]\n", "1. 
`dataflow_use_public_ips` is set to False.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_TePNlLl9v1q" }, "outputs": [], "source": [ "# Dataflow's fully qualified subnetwork name. When empty, the default subnetwork is used.\n", "# The fully qualified subnetwork name is in the form of\n", "# https://www.googleapis.com/compute/v1/projects/HOST_PROJECT_ID/regions/REGION_NAME/subnetworks/SUBNETWORK_NAME\n", "# reference: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications\n", "dataflow_subnetwork = None # @param {type:\"string\"}\n", "# Specifies whether Dataflow workers use public IP addresses.\n", "dataflow_use_public_ips = True # @param {type:\"boolean\"}" ] }, { "cell_type": "markdown", "metadata": { "id": "N-iXXE14voyR" }, "source": [ "## Customize search space and change training configuration\n", "\n", "You create an AutoML Tabular pipeline with the following customizations:\n", "- Limit the hyperparameter search space\n", "- Change the machine types used for tuning and training" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "sG46cXVueb66" }, "outputs": [], "source": [ "study_spec_parameters_override = [\n", " {\n", " \"parameter_id\": \"model_type\",\n", " \"categorical_value_spec\": {\n", " \"values\": [\n", " \"nn\"\n", " ] # The default value is [\"nn\", \"boosted_trees\"]; this reduces the search space\n", " },\n", " }\n", "]\n", "\n", "worker_pool_specs_override = [\n", " {\"machine_spec\": {\"machine_type\": \"n1-standard-8\"}}, # override for TF chief node\n", " {}, # override for TF worker node, since it's not used, leave it empty\n", " {}, # override for TF ps node, since it's not used, leave it empty\n", " {\n", " \"machine_spec\": {\n", " \"machine_type\": \"n1-standard-4\" # override for TF evaluator node\n", " }\n", " },\n", "]\n", "\n", "# Number of weak models in the final ensemble model is\n", "# stage_2_num_selected_trials * 5. If unspecified, 5 is the default value for\n", "# stage_2_num_selected_trials.\n", "stage_2_num_selected_trials = 5\n", "\n", "# The pipeline outputs a TF saved model that contains the following TF custom op:\n", "# - https://github.com/google/struct2tensor\n", "#\n", "# There are a few ways to run the model:\n", "# - Official prediction server docker image\n", "# Please follow the \"Run the model server\" section in\n", "# https://cloud.google.com/vertex-ai/docs/export/export-model-tabular#run-server\n", "# - Python or C++ runtimes like TF Serving\n", "# Please set export_additional_model_without_custom_ops so the pipeline\n", "# outputs an additional model that does not depend on struct2tensor.\n", "# - `get_no_custom_ops_model_uri` shows how to get the model artifact URI.\n", "# - The input to the model is a dictionary of feature name to tensor. 
Use\n", "# `saved_model_cli show --dir {saved_model.pb's path} --signature_def serving_default --tag serve`\n", "# to find out more details.\n", "export_additional_model_without_custom_ops = False\n", "\n", "train_budget_milli_node_hours = 1000 # 1 hour\n", "\n", "(\n", " template_path,\n", " parameter_values,\n", ") = automl_tabular_utils.get_automl_tabular_pipeline_and_parameters(\n", " PROJECT_ID,\n", " LOCATION,\n", " root_dir,\n", " target_column,\n", " prediction_type,\n", " optimization_objective,\n", " transform_config_path,\n", " train_budget_milli_node_hours,\n", " data_source_csv_filenames=data_source_csv_filenames,\n", " data_source_bigquery_table_path=data_source_bigquery_table_path,\n", " weight_column=weight_column,\n", " predefined_split_key=predefined_split_key,\n", " timestamp_split_key=timestamp_split_key,\n", " stratified_split_key=stratified_split_key,\n", " training_fraction=training_fraction,\n", " validation_fraction=validation_fraction,\n", " test_fraction=test_fraction,\n", " study_spec_parameters_override=study_spec_parameters_override,\n", " stage_1_tuner_worker_pool_specs_override=worker_pool_specs_override,\n", " cv_trainer_worker_pool_specs_override=worker_pool_specs_override,\n", " run_evaluation=run_evaluation,\n", " run_distillation=run_distillation,\n", " dataflow_subnetwork=dataflow_subnetwork,\n", " dataflow_use_public_ips=dataflow_use_public_ips,\n", " export_additional_model_without_custom_ops=export_additional_model_without_custom_ops,\n", ")\n", "\n", "job_id = \"automl-tabular-unique\"\n", "job = aiplatform.PipelineJob(\n", " display_name=job_id,\n", " location=LOCATION, # launches the pipeline job in the specified location\n", " template_path=template_path,\n", " job_id=job_id,\n", " pipeline_root=root_dir,\n", " parameter_values=parameter_values,\n", " enable_caching=False,\n", ")\n", "\n", "job.run()\n", "\n", "\n", "pipeline_task_details = job.gca_resource.job_detail.task_details\n", "\n", "if export_additional_model_without_custom_ops:\n", " print(\n", " \"trained model without custom TF ops:\",\n", " get_no_custom_ops_model_uri(pipeline_task_details),\n", " )\n", "\n", "if run_evaluation:\n", " print(\"evaluation metrics:\")\n", " load_and_print_json(get_evaluation_metrics(pipeline_task_details))\n", "\n", " print(\"feature attributions:\")\n", " load_and_print_json(get_feature_attributions(pipeline_task_details))\n", "\n", "automl_tabular_pipeline_job_name = job_id" ] }, { "cell_type": "markdown", "metadata": { "id": "_sF8a2RKtRhg" }, "source": [ "## Skip architecture search\n", "Instead of doing architecture search everytime, you can reuse the existing architecture search result. This could help:\n", "1. reducing the variation of the output model\n", "2. reducing training cost\n", "\n", "The existing architecture search result is stored in the `tuning_result_output` output of the `automl-tabular-stage-1-tuner` component. You can manually input it or get it programmatically." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8hlPps2Rtpq-" }, "outputs": [], "source": [ "stage_1_tuner_task = get_task_detail(\n", " pipeline_task_details, \"automl-tabular-stage-1-tuner\"\n", ")\n", "\n", "stage_1_tuning_result_artifact_uri = (\n", " stage_1_tuner_task.outputs[\"tuning_result_output\"].artifacts[0].uri\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "c5F12ZL_uZZ3" }, "source": [ "### Run the skip architecture search pipeline\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "4x4GA5sMuewX" }, "outputs": [], "source": [ "(\n", " template_path,\n", " parameter_values,\n", ") = automl_tabular_utils.get_skip_architecture_search_pipeline_and_parameters(\n", " PROJECT_ID,\n", " LOCATION,\n", " root_dir,\n", " target_column,\n", " prediction_type,\n", " optimization_objective,\n", " transform_config_path,\n", " train_budget_milli_node_hours,\n", " data_source_csv_filenames=data_source_csv_filenames,\n", " data_source_bigquery_table_path=data_source_bigquery_table_path,\n", " weight_column=weight_column,\n", " predefined_split_key=predefined_split_key,\n", " timestamp_split_key=timestamp_split_key,\n", " stratified_split_key=stratified_split_key,\n", " training_fraction=training_fraction,\n", " validation_fraction=validation_fraction,\n", " test_fraction=test_fraction,\n", " stage_1_tuning_result_artifact_uri=stage_1_tuning_result_artifact_uri,\n", " run_evaluation=run_evaluation,\n", " dataflow_subnetwork=dataflow_subnetwork,\n", " dataflow_use_public_ips=dataflow_use_public_ips,\n", ")\n", "\n", "job_id = \"automl-tabular-skip-architecture-search-unique\"\n", "job = aiplatform.PipelineJob(\n", " display_name=job_id,\n", " location=LOCATION, # launches the pipeline job in the specified location\n", " template_path=template_path,\n", " job_id=job_id,\n", " pipeline_root=root_dir,\n", " parameter_values=parameter_values,\n", " enable_caching=False,\n", ")\n", "\n", "job.run()\n", "\n", "# Get model URI\n", "skip_architecture_search_pipeline_task_details = (\n", " job.gca_resource.job_detail.task_details\n", ")\n", "\n", "if export_additional_model_without_custom_ops:\n", " print(\n", " \"trained model without custom TF ops:\",\n", " get_no_custom_ops_model_uri(pipeline_task_details),\n", " )\n", "\n", "automl_tabular_skip_architecture_search_pipeline_job_name = job_id" ] }, { "cell_type": "markdown", "metadata": { "id": "43342a43176e" }, "source": [ "## Clean up Vertex and BigQuery resources\n", "\n", "To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud\n", "project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.\n", "\n", "Otherwise, you can delete the individual resources you created in this tutorial:\n", "\n", "- Cloud Storage Bucket" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "acd787ad23d6" }, "outputs": [], "source": [ "def get_task_detail(\n", " task_details: List[Dict[str, Any]], task_name: str\n", ") -> List[Dict[str, Any]]:\n", " for task_detail in task_details:\n", " if task_detail.task_name == task_name:\n", " return task_detail" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5354389ff0dc" }, "outputs": [], "source": [ "# Get the automl tabular training pipeline object\n", "automl_tabular_pipeline_job = aiplatform.PipelineJob.get(\n", " 
f\"projects/{PROJECT_ID}/locations/{LOCATION}/pipelineJobs/{automl_tabular_pipeline_job_name}\"\n", ")\n", "\n", "# fetch automl tabular training pipeline task details\n", "pipeline_task_details = automl_tabular_pipeline_job.gca_resource.job_detail.task_details\n", "\n", "# fetch model from automl tabular training pipeline and delete the model\n", "model_task = get_task_detail(pipeline_task_details, \"model-upload-2\")\n", "model_resourceName = model_task.outputs[\"model\"].artifacts[0].metadata[\"resourceName\"]\n", "model = aiplatform.Model(model_resourceName)\n", "model.delete()\n", "\n", "# Delete the automl tabular pipeline\n", "automl_tabular_pipeline_job.delete()\n", "\n", "# Get the automl tabular skip architecture search pipeline object\n", "automl_tabular_skip_architecture_search_pipeline_job = aiplatform.PipelineJob.get(\n", " f\"projects/{PROJECT_ID}/locations/{LOCATION}/pipelineJobs/{automl_tabular_skip_architecture_search_pipeline_job_name}\"\n", ")\n", "\n", "# fetch automl tabular skip architecture search pipeline task details\n", "pipeline_task_details = (\n", " automl_tabular_skip_architecture_search_pipeline_job.gca_resource.job_detail.task_details\n", ")\n", "\n", "# fetch model from automl tabular skip architecture search pipeline and delete the model\n", "model_task = get_task_detail(pipeline_task_details, \"model-upload\")\n", "model_resourceName = model_task.outputs[\"model\"].artifacts[0].metadata[\"resourceName\"]\n", "model = aiplatform.Model(model_resourceName)\n", "model.delete()\n", "\n", "# Delete the automl tabular skip architecture search pipeline\n", "automl_tabular_skip_architecture_search_pipeline_job.delete()\n", "\n", "# Delete Cloud Storage objects that were created\n", "delete_bucket = False # Set True for deletion\n", "if delete_bucket:\n", " ! gsutil -m rm -r $BUCKET_URI" ] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "automl_tabular_on_vertex_pipelines.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }