notebooks/official/tabular_workflows/tabnet_on_vertex_pipelines.ipynb

{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "id": "18ebbd838e32" }, "outputs": [], "source": [ "# Copyright 2022 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "mThXALJl9Yue" }, "source": [ "# Tabular Workflows: TabNet Pipeline\n", "\n", "<table align=\"left\">\n", " <td style=\"text-align: center\">\n", " <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/tabular_workflows/tabnet_on_vertex_pipelines.ipynb\">\n", " <img src=\"https://cloud.google.com/ml-engine/images/colab-logo-32px.png\" alt=\"Google Colaboratory logo\"><br> Open in Colab\n", " </a>\n", " </td>\n", " <td style=\"text-align: center\">\n", " <a href=\"https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fvertex-ai-samples%2Fmain%2Fnotebooks%2Fofficial%2Ftabular_workflows%2Ftabnet_on_vertex_pipelines.ipynb\">\n", " <img width=\"32px\" src=\"https://cloud.google.com/ml-engine/images/colab-enterprise-logo-32px.png\" alt=\"Google Cloud Colab Enterprise logo\"><br> Open in Colab Enterprise\n", " </a>\n", " </td> \n", " <td style=\"text-align: center\">\n", " <a href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/tabular_workflows/tabnet_on_vertex_pipelines.ipynb\">\n", " <img src=\"https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32\" alt=\"Vertex AI logo\"><br> Open in Workbench\n", " </a>\n", " </td>\n", " <td style=\"text-align: center\">\n", " <a href=\"https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/tabular_workflows/tabnet_on_vertex_pipelines.ipynb\">\n", " <img src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" alt=\"GitHub logo\"><br> View on GitHub\n", " </a>\n", " </td>\n", "</table>" ] }, { "cell_type": "markdown", "metadata": { "id": "962e636b5cee" }, "source": [ "**_NOTE_**: This notebook has been tested in the following environment:\n", "\n", "* Python version = 3.9" ] }, { "cell_type": "markdown", "metadata": { "id": "fcc745968395" }, "source": [ "## Overview\n", "\n", "This notebook showcases how to run the TabNet algorithm using Vertex AI Tabular Workflows.\n", "\n", "Learn more about [Tabular Workflow for TabNet](https://cloud.google.com/vertex-ai/docs/tabular-data/tabular-workflows/tabnet)." ] }, { "cell_type": "markdown", "metadata": { "id": "f887ec5c06c5" }, "source": [ "### Objective\n", "\n", "In this tutorial, you learn how to create classification models on tabular data using two of the Vertex AI TabNet Tabular Workflows. 
Each workflow is a managed instance of [Vertex AI Pipelines](https://cloud.google.com/vertex-ai/docs/pipelines/introduction).\n", "\n", "This tutorial uses the following Google Cloud ML services and resources:\n", "\n", "- Vertex AI Training\n", "- Vertex AI Pipelines\n", "- Cloud Storage\n", "\n", "The steps performed include:\n", "\n", "- Create a TabNet CustomJob. This is the best option if you know which hyperparameters to use for training.\n", "- Create a TabNet HyperparameterTuningJob. This allows you to get the best set of hyperparameters for your dataset.\n", "\n", "After training, each pipeline returns a link to the Vertex Model UI. You can use the UI to deploy the model, get online predictions, or run batch prediction." ] }, {
 "cell_type": "markdown", "metadata": { "id": "eac26958afe8" }, "source": [ "### Dataset\n", "\n", "The dataset you use in this notebook is the [Bank Marketing](https://archive.ics.uci.edu/ml/datasets/bank+marketing) dataset.\n", "It consists of data related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The objective of the binary classification task in this notebook is to predict if a client subscribes to a term deposit or not. \n", "\n", "For this notebook, a randomly selected subset that makes up 90% of the original dataset was saved to a `train.csv` file and hosted on Cloud Storage. To download the file, click [here](https://storage.googleapis.com/cloud-samples-data-us-central1/vertex-ai/tabular-workflows/datasets/bank-marketing/train.csv)." ] }, {
 "cell_type": "markdown", "metadata": { "id": "181d4dfbf917" }, "source": [ "### Costs\n", "\n", "This tutorial uses billable components of Google Cloud:\n", "\n", "* Vertex AI\n", "* Cloud Storage\n", "\n", "Learn about [Vertex AI\n", "pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage\n", "pricing](https://cloud.google.com/storage/pricing), and use the [Pricing\n", "Calculator](https://cloud.google.com/products/calculator/)\n", "to generate a cost estimate based on your projected usage." ] }, {
 "cell_type": "markdown", "metadata": { "id": "2b9e4bcab250" }, "source": [ "## Get Started" ] }, {
 "cell_type": "markdown", "metadata": { "id": "be898f74332d" }, "source": [ "### Install Vertex AI SDK for Python and other required packages" ] }, {
 "cell_type": "code", "execution_count": null, "metadata": { "id": "2b4ef9b72d43" }, "outputs": [], "source": [ "! pip3 install --upgrade --quiet google-cloud-aiplatform \\\n", " google-cloud-pipeline-components" ] }, {
 "cell_type": "markdown", "metadata": { "id": "16220914acc5" }, "source": [ "### Restart runtime (Colab only)\n", "\n", "To use the newly installed packages, you must restart the runtime on Google Colab." ] }, {
 "cell_type": "code", "execution_count": null, "metadata": { "id": "157953ab28f0" }, "outputs": [], "source": [ "import sys\n", "\n", "if \"google.colab\" in sys.modules:\n", "\n", " import IPython\n", "\n", " app = IPython.Application.instance()\n", " app.kernel.do_shutdown(True)" ] }, {
 "cell_type": "markdown", "metadata": { "id": "b96b39fd4d7b" }, "source": [ "<div class=\"alert alert-block alert-warning\">\n", "<b>⚠️ The kernel is going to restart. Wait until it's finished before continuing to the next step. ⚠️</b>\n", "</div>" ] }, {
 "cell_type": "markdown", "metadata": { "id": "ff666ce4051c" }, "source": [ "### Authenticate your notebook environment (Colab only)\n", "\n", "Authenticate your environment on Google Colab."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "cc7251520a07" }, "outputs": [], "source": [ "import sys\n", "\n", "if \"google.colab\" in sys.modules:\n", "\n", " from google.colab import auth\n", "\n", " auth.authenticate_user()" ] }, {
 "cell_type": "markdown", "metadata": { "id": "b02382a1fea6" }, "source": [ "### Set Google Cloud project information\n", "\n", "To get started using Vertex AI, you must have an existing Google Cloud project. Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment)." ] }, {
 "cell_type": "code", "execution_count": null, "metadata": { "id": "wsePm9c4jmpT" }, "outputs": [], "source": [ "PROJECT_ID = \"[your-project-id]\" # @param {type:\"string\"}\n", "LOCATION = \"us-central1\" # @param {type: \"string\"}" ] }, {
 "cell_type": "markdown", "metadata": { "id": "bucket:custom" }, "source": [ "### Create a Cloud Storage bucket\n", "\n", "Create a storage bucket to store intermediate artifacts such as datasets.\n", "\n", "When you submit a training job using the Cloud SDK, you upload a Python package\n", "containing your training code to a Cloud Storage bucket. Vertex AI runs\n", "the code from this package. In this tutorial, Vertex AI also saves the\n", "trained model that results from your job in the same bucket. Using this model artifact, you can then\n", "create a Vertex AI Model resource and use it for prediction." ] }, {
 "cell_type": "code", "execution_count": null, "metadata": { "id": "bucket" }, "outputs": [], "source": [ "BUCKET_URI = f\"gs://your-bucket-name-{PROJECT_ID}-unique\" # @param {type:\"string\"}" ] }, {
 "cell_type": "markdown", "metadata": { "id": "create_bucket" }, "source": [ "**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket." ] }, {
 "cell_type": "code", "execution_count": null, "metadata": { "id": "Oz8J0vmSlugt" }, "outputs": [], "source": [ "! gsutil mb -l {LOCATION} -p {PROJECT_ID} {BUCKET_URI}" ] }, {
 "cell_type": "markdown", "metadata": { "id": "zebLBGXOky2A" }, "source": [ "### Notes about service account and permissions\n", "\n", "**By default, no configuration is required.** If you run into any permission-related issues, make sure the service accounts have the required roles listed in the [Service accounts for Tabular Workflow for TabNet, Tabular Workflow for Wide & Deep, and Prophet documentation](https://cloud.google.com/vertex-ai/docs/tabular-data/tabular-workflows/service-accounts#fte-workflow)." ] }, {
 "cell_type": "markdown", "metadata": { "id": "44accda192d5" }, "source": [ "#### Service Account\n", "\n", "You use a service account to create Vertex AI Pipeline jobs. If you don't want to use your project's Compute Engine service account, set `SERVICE_ACCOUNT` to another service account ID."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "c65d12a97f45" }, "outputs": [], "source": [ "SERVICE_ACCOUNT = \"[your-service-account]\" # @param {type:\"string\"}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "604ae09ab6d3" }, "outputs": [], "source": [ "import sys\n", "\n", "IS_COLAB = \"google.colab\" in sys.modules\n", "if (\n", " SERVICE_ACCOUNT == \"\"\n", " or SERVICE_ACCOUNT is None\n", " or SERVICE_ACCOUNT == \"[your-service-account]\"\n", "):\n", " # Get your service account from gcloud\n", " if not IS_COLAB:\n", " shell_output = !gcloud auth list 2>/dev/null\n", " SERVICE_ACCOUNT = shell_output[2].replace(\"*\", \"\").strip()\n", "\n", " else: # IS_COLAB:\n", " shell_output = ! gcloud projects describe $PROJECT_ID\n", " project_number = shell_output[-1].split(\":\")[1].strip().replace(\"'\", \"\")\n", " SERVICE_ACCOUNT = f\"{project_number}-compute@developer.gserviceaccount.com\"\n", "\n", " print(\"Service Account:\", SERVICE_ACCOUNT)" ] }, { "cell_type": "markdown", "metadata": { "id": "d1ecb60964d5" }, "source": [ "#### Set service account access for Vertex AI Pipelines\n", "Run the following commands to grant your service account access to read and write pipeline artifacts in the bucket that you created in the previous step. You only need to run this step once per service account." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "a592f0a380c2" }, "outputs": [], "source": [ "! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectCreator $BUCKET_URI\n", "\n", "! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectViewer $BUCKET_URI" ] }, { "cell_type": "markdown", "metadata": { "id": "fbbc3479a1da" }, "source": [ "### Import libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8G6YmJT1yqkV" }, "outputs": [], "source": [ "# Import required modules\n", "import os\n", "from typing import Any, Dict, List\n", "\n", "from google.cloud import aiplatform, storage\n", "from google_cloud_pipeline_components.preview.automl.tabular import \\\n", " utils as automl_tabular_utils" ] }, { "cell_type": "markdown", "metadata": { "id": "c0423f260423" }, "source": [ "### Initialize Vertex AI SDK for Python\n", "\n", "Initialize the Vertex AI SDK for Python for your project." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ad69f2590268" }, "outputs": [], "source": [ "aiplatform.init(project=PROJECT_ID, location=LOCATION)" ] }, { "cell_type": "markdown", "metadata": { "id": "3LWH3PRF5o2v" }, "source": [ "## Define helper functions\n", "Define the following helper functions:\n", "\n", "- `get_model_artifacts_path`: Gets the model artifacts path from task details.\n", "- `get_model_uri`: Gets the model uri from the task details.\n", "- `get_bucket_name_and_path`: Gets the bucket name and path.\n", "- `download_from_gcs`: Downloads the content from the bucket.\n", "- `write_to_gcs`: Uploads content into the bucket.\n", "- `get_task_detail`: Gets the task details by using task name.\n", "- `get_model_name`: Gets the model name from pipeline job ID.\n", "- `get_evaluation_metrics`: Gets the evaluation metrics from pipeline task details.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "g9FPFT8c5oC0" }, "outputs": [], "source": [ "# Get the model artifacts path from task details.\n", "\n", "\n", "def get_model_artifacts_path(task_details: List[Dict[str, Any]], task_name: str) -> str:\n", " task = get_task_detail(task_details, task_name)\n", " return task.outputs[\"unmanaged_container_model\"].artifacts[0].uri\n", "\n", "\n", "# Get the model uri from the task details.\n", "def get_model_uri(task_details: List[Dict[str, Any]]) -> str:\n", " task = get_task_detail(task_details, \"model-upload\")\n", " # in format https://<location>-aiplatform.googleapis.com/v1/projects/<project_number>/locations/<location>/models/<model_id>\n", " model_id = task.outputs[\"model\"].artifacts[0].uri.split(\"/\")[-1]\n", " return f\"https://console.cloud.google.com/vertex-ai/locations/{LOCATION}/models/{model_id}?project={PROJECT_ID}\"\n", "\n", "\n", "# Get the bucket name and path.\n", "def get_bucket_name_and_path(uri: str) -> str:\n", " no_prefix_uri = uri[len(\"gs://\") :]\n", " splits = no_prefix_uri.split(\"/\")\n", " return splits[0], \"/\".join(splits[1:])\n", "\n", "\n", "# Get the content from the bucket.\n", "def download_from_gcs(uri: str) -> str:\n", " bucket_name, path = get_bucket_name_and_path(uri)\n", " storage_client = storage.Client(project=PROJECT_ID)\n", " bucket = storage_client.get_bucket(bucket_name)\n", " blob = bucket.blob(path)\n", " return blob.download_as_string()\n", "\n", "\n", "# Upload content into the bucket.\n", "def write_to_gcs(uri: str, content: str):\n", " bucket_name, path = get_bucket_name_and_path(uri)\n", " storage_client = storage.Client()\n", " bucket = storage_client.get_bucket(bucket_name)\n", " blob = bucket.blob(path)\n", " blob.upload_from_string(content)\n", "\n", "\n", "# Get the task details by using task name.\n", "def get_task_detail(\n", " task_details: List[Dict[str, Any]], task_name: str\n", ") -> List[Dict[str, Any]]:\n", " for task_detail in task_details:\n", " if task_detail.task_name == task_name:\n", " return task_detail\n", "\n", "\n", "# Get the model name from pipeline job ID.\n", "def get_model_name(job_id: str) -> str:\n", " pipeline_task_details = aiplatform.PipelineJob.get(\n", " job_id\n", " ).gca_resource.job_detail.task_details\n", " upload_task_details = get_task_detail(pipeline_task_details, \"model-upload\")\n", " return upload_task_details.outputs[\"model\"].artifacts[0].metadata[\"resourceName\"]\n", "\n", "\n", "# Get the evaluation metrics.\n", "def get_evaluation_metrics(\n", " task_details: List[Dict[str, Any]],\n", ") -> str:\n", " ensemble_task = 
get_task_detail(task_details, \"model-evaluation\")\n", " return download_from_gcs(\n", " ensemble_task.outputs[\"evaluation_metrics\"].artifacts[0].uri\n", " )" ] }, {
 "cell_type": "markdown", "metadata": { "id": "7a7332a3f8e2" }, "source": [ "## Define training specifications\n", "\n", "Before creating the training job, you complete the following steps in this section:\n", "\n", "1. Configure the source dataset.\n", "2. Configure the feature transformation process.\n", "3. Configure the feature selection process.\n", "4. Set up the parameters needed for running the training process.\n", "\n", "### Configure the dataset\n", "\n", "You define either of the following parameters:\n", "\n", "- `data_source_csv_filenames`: The CSV data source. You specify the Cloud Storage path to the `train.csv` file described in the dataset section.\n", "- `data_source_bigquery_table_path`: The BigQuery data source. As you use the Cloud Storage source, this is kept as `None`.\n", "\n", "***Note***: The dataset's location has to be the same as the service location (i.e., `LOCATION`) set for launching the training pipeline.\n" ] }, {
 "cell_type": "code", "execution_count": null, "metadata": { "id": "b6dd1af1d336" }, "outputs": [], "source": [ "data_source_csv_filenames = \"gs://cloud-samples-data-us-central1/vertex-ai/tabular-workflows/datasets/bank-marketing/train.csv\"\n", "data_source_bigquery_table_path = (\n", " None # @param {type:\"string\"}, format: bq://bq_project.bq_dataset.bq_table\n", ")" ] }, {
 "cell_type": "markdown", "metadata": { "id": "bf417de96807" }, "source": [ "### Configure feature transformation\n", "\n", "Transformations can be specified using Feature Transform Engine (FTE) specific configurations. FTE supports both TensorFlow-based row-level and BigQuery-based dataset-level transformations.\n", "\n", "* **TensorFlow-based row-level transformations**:\n", " * Full automatic transformations: FTE automatically configures a set of built-in transformations for each input column based on its data statistics. This can be set via `tf_auto_transform_features` in the training pipeline.\n", " * Fully specified transformations: All transformations on input columns are explicitly specified with FTE's built-in transformations. Chaining of multiple transformations on a single column is also supported. These transformations can be saved to a JSON configuration file and specified via the `tf_transformations_path` argument of the training pipeline (see the sketch after this list).\n", " * Custom transformations: Custom, bring-your-own transform function, where you can define and import your own transform function and use it with FTE's other built-in transformations. You can specify custom transformations as an array of JSON objects and pass them through the `tf_custom_transformation_definitions` argument of the training pipeline.\n", " \n", "\n", "* **BigQuery-based dataset-level transformations**:\n", " * Fully specified transformations: All transformations on input columns are explicitly specified with FTE's built-in transformations. These transformations can be specified as an array of JSON objects via the `dataset_level_transformations` argument of the training pipeline.\n", " * Custom transformations: Custom, bring-your-own transform function, where you can define and import your own transform function and use it with FTE's other built-in transformations. You can specify custom transformations as an array of JSON objects and pass them through the `dataset_level_custom_transformation_definitions` argument of the training pipeline.\n",
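 "\n", "For illustration, the following sketch shows one way a fully specified TF transformation configuration might be written to Cloud Storage and referenced through the `tf_transformations_path` argument. The transformation names and JSON schema shown here are illustrative assumptions rather than verified values, and the file path is arbitrary; consult the feature transformation documentation linked below for the supported transformations and their exact format. This notebook itself uses automatic transformations, configured in the next cell.\n", "\n", "```python\n", "# Sketch only -- not executed in this notebook.\n", "import json\n", "\n", "# Illustrative (unverified) transformation names and schema:\n", "tf_transformations = [\n", "    {\"transformation\": \"ZScale\", \"input_columns\": [\"age\", \"balance\"]},\n", "    {\"transformation\": \"Vocabulary\", \"input_columns\": [\"job\", \"education\"]},\n", "]\n", "\n", "# `write_to_gcs` is the helper defined earlier; the file name is arbitrary.\n", "tf_transformations_path = os.path.join(BUCKET_URI, \"config/tf_transformations.json\")\n", "write_to_gcs(tf_transformations_path, json.dumps(tf_transformations))\n", "\n", "# Then pass `tf_transformations_path=tf_transformations_path` to the training\n", "# pipeline instead of `tf_auto_transform_features`.\n", "```\n",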
 "\n", "Below, you configure full automatic transformations by specifying a list of input features to pass to the `tf_auto_transform_features` argument of the training pipeline.\n", "\n", "Learn more about [feature transformation configurations](https://google-cloud-pipeline-components.readthedocs.io/en/google-cloud-pipeline-components-1.0.31/google_cloud_pipeline_components.experimental.automl.tabular.html#google_cloud_pipeline_components.experimental.automl.tabular.FeatureTransformEngineOp)." ] }, {
 "cell_type": "code", "execution_count": null, "metadata": { "id": "fce334e09df6" }, "outputs": [], "source": [ "auto_transform_features = [\n", " \"age\",\n", " \"job\",\n", " \"marital\",\n", " \"education\",\n", " \"default\",\n", " \"balance\",\n", " \"housing\",\n", " \"loan\",\n", " \"contact\",\n", " \"day\",\n", " \"month\",\n", " \"duration\",\n", " \"campaign\",\n", " \"pdays\",\n", " \"previous\",\n", " \"poutcome\",\n", "]" ] }, {
 "cell_type": "markdown", "metadata": { "id": "-t1fAaCFs8Os" }, "source": [ "### Configure feature selection\n", "\n", "In addition to transformations, you can also apply feature selection via the Feature Transform Engine to use only highly ranked features, evaluated by supported algorithms. If enabled, feature selection is applied right after the dataset-level transformations and excludes any feature that isn't selected.\n", "\n", "To enable it, you need to set `run_feature_selection` to `True`.\n", "\n", "To configure the algorithm to use and the number of features to select, set the `feature_selection_algorithm` and `max_selected_features` parameters.\n", "\n", "Learn more about [feature selection algorithms and configurations](https://google-cloud-pipeline-components.readthedocs.io/en/google-cloud-pipeline-components-1.0.31/google_cloud_pipeline_components.experimental.automl.tabular.html#google_cloud_pipeline_components.experimental.automl.tabular.FeatureTransformEngineOp)." ] }, {
 "cell_type": "code", "execution_count": null, "metadata": { "id": "YroYjOTJwytk" }, "outputs": [], "source": [ "RUN_FEATURE_SELECTION = True # @param {type:\"boolean\"}\n", "\n", "FEATURE_SELECTION_ALGORITHM = \"AMI\" # @param {type:\"string\"}\n", "\n", "MAX_SELECTED_FEATURES = 10 # @param {type:\"integer\"}" ] }, {
 "cell_type": "markdown", "metadata": { "id": "4b28b609b259" }, "source": [ "### Set up training configuration\n", "\n", "Now, you define the following parameters for training:\n", "\n", "- `target_column`: The target column name.\n", "- `prediction_type`: The type of prediction the model is to produce: 'classification' or 'regression'.\n", "- `predefined_split_key`: The predefined_split column name.\n", "- `timestamp_split_key`: The timestamp_split column name.\n", "- `stratified_split_key`: The stratified_split column name.\n", "- `training_fraction`: The training fraction.\n", "- `validation_fraction`: The validation fraction.\n", "- `test_fraction`: The test fraction.\n", "- `weight_column`: The weight column name.\n", "- `run_evaluation`: Whether to run evaluation steps during training."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "eV4JrwB8wAkg" }, "outputs": [], "source": [ "run_evaluation = True # @param {type:\"boolean\"}\n", "prediction_type = \"classification\"\n", "target_column = \"deposit\"\n", "\n", "# Fraction split\n", "training_fraction = 0.8\n", "validation_fraction = 0.1\n", "test_fraction = 0.1\n", "\n", "timestamp_split_key = None # timestamp column name when using timestamp split\n", "stratified_split_key = None # target column name when using stratified split\n", "\n", "predefined_split_key = None\n", "if predefined_split_key:\n", " training_fraction = None\n", " validation_fraction = None\n", " test_fraction = None\n", "\n", "weight_column = None" ] }, {
 "cell_type": "markdown", "metadata": { "id": "zyWGg2s09xOk" }, "source": [ "## Set up VPC configuration for Dataflow\n", "\n", "In this section, you define the following parameters:\n", "\n", "- `dataflow_subnetwork`: Dataflow's fully qualified subnetwork name. When empty, the default subnetwork is used. See an [example](https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications).\n", "- `dataflow_use_public_ips`: Specifies whether Dataflow workers use public IP addresses.\n", "\n", "If you need to use a custom Dataflow subnetwork, you can set it through the `dataflow_subnetwork` parameter. The requirements are:\n", "1. `dataflow_subnetwork` must be a fully qualified subnetwork name.\n [[reference](https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications)]\n", "1. The following service accounts must have [Compute Network User role](https://cloud.google.com/compute/docs/access/iam#compute.networkUser) assigned on the specified dataflow subnetwork [[reference](https://cloud.google.com/dataflow/docs/guides/specifying-networks#shared)]:\n", " 1. Compute Engine default service account: PROJECT_NUMBER-compute@developer.gserviceaccount.com\n", " 1. Dataflow service account: service-PROJECT_NUMBER@dataflow-service-producer-prod.iam.gserviceaccount.com\n", "\n", "If your project has VPC-SC enabled, please make sure of the following:\n", "\n", "1. The Dataflow subnetwork used in VPC-SC is configured properly for Dataflow.\n See [reference](https://cloud.google.com/dataflow/docs/guides/routes-firewall).\n", "1. `dataflow_use_public_ips` is set to `False`.\n" ] }, {
 "cell_type": "code", "execution_count": null, "metadata": { "id": "_TePNlLl9v1q" }, "outputs": [], "source": [ "dataflow_subnetwork = \"\" # @param {type:\"string\"}\n", "dataflow_use_public_ips = True # @param {type:\"boolean\"}" ] }, {
 "cell_type": "markdown", "metadata": { "id": "N-iXXE14voyR" }, "source": [ "## Customize TabNet CustomJob configuration and create pipeline\n", "\n", "Creating a TabNet CustomJob is the best choice if you know exactly which hyperparameter values to use for model training. It uses fewer training resources than a HyperparameterTuningJob.\n", "\n", "In the example below, you configure the following key parameters:\n", "\n", "- `root_dir`: The root GCS directory for the pipeline components.\n", "- `worker_pool_specs_override`: The dictionary for overriding training and evaluation worker pool specs. The dictionary should follow a [particular format]( https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172). 
TabNet supports training using both CPUs and GPUs.\n", "- `learning_rate`: The learning rate used by the linear optimizer.\n", "- `max_steps`: Number of steps to run the trainer for.\n", "- `max_train_secs`: Amount of time in seconds to run the trainer for.\n", "\n", "Learn more about [pipeline inputs and model hyperparameters](https://google-cloud-pipeline-components.readthedocs.io/en/google-cloud-pipeline-components-1.0.23/google_cloud_pipeline_components.experimental.automl.tabular.html#google_cloud_pipeline_components.experimental.automl.tabular.utils.get_tabnet_trainer_pipeline_and_parameters).\n", "\n", "Learn more about the parameters needed for [creating a pipeline job](https://cloud.google.com/vertex-ai/docs/pipelines/run-pipeline#create_a_pipeline_run)." ] }, {
 "cell_type": "code", "execution_count": null, "metadata": { "id": "sG46cXVueb66" }, "outputs": [], "source": [ "# set a unique display name for your pipeline\n", "pipeline_job_id = \"tabnet-unique\" # @param {type: \"string\"}\n", "# set the root dir\n", "pipeline_job_root_dir = os.path.join(BUCKET_URI, \"tabnet_custom_job\")\n", "# set the worker pool specs\n", "worker_pool_specs_override = [\n", " {\"machine_spec\": {\"machine_type\": \"c2-standard-16\"}} # Override for TF chief node\n", "]\n", "# set the learning rate\n", "learning_rate = 0.01\n", "# max_steps and/or max_train_secs must be set. If both are\n", "# specified, training stops after either condition is met.\n", "# By default, max_train_secs is set to -1.\n", "max_steps = 20\n", "\n", "max_train_secs = -1\n", "\n", "# To test GPU training, the worker_pool_specs_override can be specified like this.\n", "# worker_pool_specs_override = [\n", "# {\"machine_spec\": {\n", "# 'machine_type': \"n1-highmem-32\",\n", "# \"accelerator_type\": \"NVIDIA_TESLA_V100\",\n", "# \"accelerator_count\": 2\n", "# }\n", "# }\n", "# ]\n", "\n", "# define the pipeline\n", "# If your system does not use Python, you can save the JSON file (`template_path`),\n", "# and use another programming language to submit the pipeline.\n", "(\n", " template_path,\n", " parameter_values,\n", ") = automl_tabular_utils.get_tabnet_trainer_pipeline_and_parameters(\n", " project=PROJECT_ID,\n", " location=LOCATION,\n", " root_dir=pipeline_job_root_dir,\n", " max_steps=max_steps,\n", " max_train_secs=max_train_secs,\n", " learning_rate=learning_rate,\n", " target_column=target_column,\n", " prediction_type=prediction_type,\n", " tf_auto_transform_features=auto_transform_features,\n", " run_feature_selection=RUN_FEATURE_SELECTION,\n", " feature_selection_algorithm=FEATURE_SELECTION_ALGORITHM,\n", " max_selected_features=MAX_SELECTED_FEATURES,\n", " training_fraction=training_fraction,\n", " validation_fraction=validation_fraction,\n", " test_fraction=test_fraction,\n", " data_source_csv_filenames=data_source_csv_filenames,\n", " data_source_bigquery_table_path=data_source_bigquery_table_path,\n", " worker_pool_specs_override=worker_pool_specs_override,\n", " dataflow_use_public_ips=dataflow_use_public_ips,\n", " dataflow_subnetwork=dataflow_subnetwork,\n", " run_evaluation=run_evaluation,\n", ")\n", "\n", "# create the pipeline job\n", "training_pipeline_job = aiplatform.PipelineJob(\n", " display_name=pipeline_job_id,\n", " template_path=template_path,\n", " job_id=pipeline_job_id,\n", " pipeline_root=pipeline_job_root_dir,\n", " parameter_values=parameter_values,\n", " enable_caching=False,\n", ")\n", "\n", "# run the pipeline\n", "training_pipeline_job.run(service_account=SERVICE_ACCOUNT)" ] }, {
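 "cell_type": "markdown", "metadata": { "id": "evaluation-metrics-sketch-md" }, "source": [ "### Inspect the evaluation metrics (optional)\n", "\n", "Because `run_evaluation` is set to `True`, the pipeline runs a model evaluation step. The following cell is a minimal sketch that uses the `get_evaluation_metrics` helper defined earlier to download and print the raw metrics output. It assumes the evaluation task is named `model-evaluation`, as hard-coded in that helper; the task name can vary across pipeline versions." ] }, {
 "cell_type": "code", "execution_count": null, "metadata": { "id": "evaluation-metrics-sketch-code" }, "outputs": [], "source": [ "# Sketch: download the metrics produced by the pipeline's evaluation task,\n", "# using the helper functions defined earlier in this notebook.\n", "custom_job_task_details = aiplatform.PipelineJob.get(\n", " pipeline_job_id\n", ").gca_resource.job_detail.task_details\n", "\n", "print(get_evaluation_metrics(custom_job_task_details))" ] }, {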
"cell_type": "markdown", "metadata": { "id": "cbf2cf98842b" }, "source": [ "### Go to the Vertex Model UI\n", "\n", "Through the link generated from the below cell, you can deploy the model and run online prediction or batch prediction." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "719a784573ce" }, "outputs": [], "source": [ "tabnet_trainer_pipeline_task_details = aiplatform.PipelineJob.get(\n", " pipeline_job_id\n", ").gca_resource.job_detail.task_details\n", "CUSTOM_JOB_MODEL = get_model_name(pipeline_job_id)\n", "print(\"model uri:\", get_model_uri(tabnet_trainer_pipeline_task_details))\n", "print(\n", " \"model artifacts:\",\n", " get_model_artifacts_path(tabnet_trainer_pipeline_task_details, \"tabnet-trainer\"),\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "_sF8a2RKtRhg" }, "source": [ "## Customize TabNet HyperparameterTuningJob configuration and create pipeline\n", "\n", "To get the best set of hyperparameters on your dataset, it's recommended that you run a HyperparameterTuningJob.\n", "\n", "Hyperparameters that can be tuned are set with the optional `study_spec_parameters_override` parameter. You provide a helper function named `get_tabnet_study_spec_parameters_override` to get these hyperparameters. To this helper function, you provide:\n", "\n", "- `dataset_size_bucket`: one of 'small' (< 1M rows), 'medium' (1M - 100M rows), or 'large' (> 100M rows)).\n", "- `training_budget_bucket`: one of 'small' (< \\\\$600), 'medium' (\\\\$600 - \\\\$2400), or 'large' (> \\\\$2400)).\n", "- `prediction_type`: The type of prediction the model is to produce. “classification” or “regression”.\n", "\n", "Then, you get the list of hyperparameters and ranges. `study_spec_parameters_override` can be empty or one or more of the above hyperparameters can be specified. For hyperparameters not specified, you can set their ranges in the pipeline. Learn more about the [hyperparameters available for tuning](https://google-cloud-pipeline-components.readthedocs.io/en/google-cloud-pipeline-components-1.0.23/google_cloud_pipeline_components.experimental.automl.tabular.html#google_cloud_pipeline_components.experimental.automl.tabular.utils.get_tabnet_trainer_pipeline_and_parameters).\n", "\n", "In addition to hyperparameters, HyperparameterTuningJob takes the following values:\n", "\n", "- `root_dir`: The root GCS directory for the pipeline components.\n", "- `worker_pool_specs_override`: The dictionary for overriding training and evaluation worker pool specs. The dictionary should follow a [particular format]( https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172). TabNet supports training using both CPUs and GPUs.\n", "- `study_spec_metric_id`: Metric to optimize, possible values: ['loss', 'average_loss', 'rmse', 'mae', 'mql', 'accuracy', 'auc', 'precision', 'recall'].\n", "- `study_spec_metric_goal`: Optimization goal of the metric, possible values: \"MAXIMIZE\", \"MINIMIZE\".\n", "- `max_trial_count`: The desired total number of trials.\n", "- `parallel_trial_count`: The desired number of trials to run in parallel.\n", "- `max_failed_trial_count`: The number of failed trials that need to be seen before failing the HyperparameterTuningJob. If set to 0, Vertex AI decides how many trials must fail before the whole job fails.\n", "- `study_spec_algorithm`: The search algorithm specified for the study. 
"cell_type": "markdown", "metadata": { "id": "_sF8a2RKtRhg" }, "source": [ "## Customize TabNet HyperparameterTuningJob configuration and create pipeline\n", "\n", "To get the best set of hyperparameters on your dataset, it's recommended that you run a HyperparameterTuningJob.\n", "\n", "Hyperparameters that can be tuned are set with the optional `study_spec_parameters_override` parameter. You use the helper function `get_tabnet_study_spec_parameters_override` to get these hyperparameters. To this helper function, you provide:\n", "\n", "- `dataset_size_bucket`: one of 'small' (< 1M rows), 'medium' (1M - 100M rows), or 'large' (> 100M rows).\n", "- `training_budget_bucket`: one of 'small' (< \\\\$600), 'medium' (\\\\$600 - \\\\$2400), or 'large' (> \\\\$2400).\n", "- `prediction_type`: The type of prediction the model is to produce: 'classification' or 'regression'.\n", "\n", "Then, you get the list of hyperparameters and ranges. `study_spec_parameters_override` can be empty, or one or more of the above hyperparameters can be specified. For hyperparameters not specified, you can set their ranges in the pipeline. Learn more about the [hyperparameters available for tuning](https://google-cloud-pipeline-components.readthedocs.io/en/google-cloud-pipeline-components-1.0.23/google_cloud_pipeline_components.experimental.automl.tabular.html#google_cloud_pipeline_components.experimental.automl.tabular.utils.get_tabnet_trainer_pipeline_and_parameters).\n", "\n", "In addition to hyperparameters, HyperparameterTuningJob takes the following values:\n", "\n", "- `root_dir`: The root GCS directory for the pipeline components.\n", "- `worker_pool_specs_override`: The dictionary for overriding training and evaluation worker pool specs. The dictionary should follow a [particular format]( https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172). TabNet supports training using both CPUs and GPUs.\n", "- `study_spec_metric_id`: Metric to optimize, possible values: ['loss', 'average_loss', 'rmse', 'mae', 'mql', 'accuracy', 'auc', 'precision', 'recall'].\n", "- `study_spec_metric_goal`: Optimization goal of the metric, possible values: \"MAXIMIZE\", \"MINIMIZE\".\n", "- `max_trial_count`: The desired total number of trials.\n", "- `parallel_trial_count`: The desired number of trials to run in parallel.\n", "- `max_failed_trial_count`: The number of failed trials that need to be seen before failing the HyperparameterTuningJob. If set to 0, Vertex AI decides how many trials must fail before the whole job fails.\n", "- `study_spec_algorithm`: The search algorithm specified for the study. One of 'ALGORITHM_UNSPECIFIED', 'GRID_SEARCH', or 'RANDOM_SEARCH'.\n", "\n", "Learn more about the [HyperparameterTuningJob parameters](https://google-cloud-pipeline-components.readthedocs.io/en/google-cloud-pipeline-components-1.0.23/google_cloud_pipeline_components.experimental.automl.tabular.html#google_cloud_pipeline_components.experimental.automl.tabular.utils.get_tabnet_hyperparameter_tuning_job_pipeline_and_parameters).\n", "\n", "Multiple trials can be configured. The pipeline returns the best trial based on the metric specified in `study_spec_metric_id`. In the example below, you return the trial with the lowest loss value." ] }, {
 "cell_type": "code", "execution_count": null, "metadata": { "id": "8hlPps2Rtpq-" }, "outputs": [], "source": [ "# set a unique display name for pipeline\n", "pipeline_job_id = \"tabnet-hpt-unique\" # @param {type: \"string\"}\n", "# set the root dir\n", "pipeline_job_root_dir = os.path.join(BUCKET_URI, \"tabnet_hyperparameter_tuning_job\")\n", "# set the worker pool specs\n", "worker_pool_specs_override = [\n", " {\"machine_spec\": {\"machine_type\": \"c2-standard-16\"}} # Override for TF chief node\n", "]\n", "# set the metric\n", "study_spec_metric_id = \"loss\"\n", "# set the objective for metric\n", "study_spec_metric_goal = \"MINIMIZE\"\n", "\n", "# To test GPU training, the worker_pool_specs_override can be specified like this.\n", "# worker_pool_specs_override = [\n", "# {\n", "# \"machine_spec\":{\n", "# \"machine_type\":\"n1-highmem-32\",\n", "# \"accelerator_type\":\"NVIDIA_TESLA_V100\",\n", "# \"accelerator_count\":2\n", "# }\n", "# }\n", "# ]\n", "\n", "\n", "# define the component to get the hyperparameters\n", "# max_steps and/or max_train_secs must be set. If both are\n", "# specified, training stops after either condition is met.\n", "# By default, max_train_secs is set to -1 and max_steps is set to\n", "# an appropriate range given dataset_size and training budget.\n", "study_spec_parameters_override = (\n", " automl_tabular_utils.get_tabnet_study_spec_parameters_override(\n", " dataset_size_bucket=\"small\",\n", " prediction_type=prediction_type,\n", " training_budget_bucket=\"small\",\n", " )\n", ")\n", "\n", "# define the hyperparameter tuning pipeline\n", "# If your system does not use Python, you can save the JSON file (`template_path`),\n", "# and use another programming language to submit the pipeline.\n", "(\n", " template_path,\n", " parameter_values,\n", ") = automl_tabular_utils.get_tabnet_hyperparameter_tuning_job_pipeline_and_parameters(\n", " project=PROJECT_ID,\n", " location=LOCATION,\n", " root_dir=pipeline_job_root_dir,\n", " target_column=target_column,\n", " prediction_type=prediction_type,\n", " tf_auto_transform_features=auto_transform_features,\n", " run_feature_selection=RUN_FEATURE_SELECTION,\n", " feature_selection_algorithm=FEATURE_SELECTION_ALGORITHM,\n", " max_selected_features=MAX_SELECTED_FEATURES,\n", " training_fraction=training_fraction,\n", " validation_fraction=validation_fraction,\n", " test_fraction=test_fraction,\n", " data_source_csv_filenames=data_source_csv_filenames,\n", " data_source_bigquery_table_path=data_source_bigquery_table_path,\n", " study_spec_metric_id=study_spec_metric_id,\n", " study_spec_metric_goal=study_spec_metric_goal,\n", " study_spec_parameters_override=study_spec_parameters_override,\n", " max_trial_count=1,\n", " parallel_trial_count=1,\n", " max_failed_trial_count=0,\n", " worker_pool_specs_override=worker_pool_specs_override,\n", " 
dataflow_use_public_ips=dataflow_use_public_ips,\n", " dataflow_subnetwork=dataflow_subnetwork,\n", " run_evaluation=True,\n", ")\n", "\n", "# create the pipeline job\n", "tuning_pipeline_job = aiplatform.PipelineJob(\n", " display_name=pipeline_job_id,\n", " template_path=template_path,\n", " job_id=pipeline_job_id,\n", " pipeline_root=pipeline_job_root_dir,\n", " parameter_values=parameter_values,\n", " enable_caching=False,\n", ")\n", "\n", "# run the pipeline job\n", "tuning_pipeline_job.run(service_account=SERVICE_ACCOUNT)" ] }, { "cell_type": "markdown", "metadata": { "id": "2749d8cec287" }, "source": [ "### Go to the Vertex Model UI\n", "\n", "Through the link generated from the below cell, you can deploy the model and run online prediction or batch prediction." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "730af2836871" }, "outputs": [], "source": [ "tabnet_hpt_pipeline_task_details = aiplatform.PipelineJob.get(\n", " pipeline_job_id\n", ").gca_resource.job_detail.task_details\n", "HPT_JOB_MODEL = get_model_name(pipeline_job_id)\n", "\n", "print(\"model uri:\", get_model_uri(tabnet_hpt_pipeline_task_details))\n", "print(\n", " \"model artifacts:\",\n", " get_model_artifacts_path(\n", " tabnet_hpt_pipeline_task_details, \"get-best-hyperparameter-tuning-job-trial\"\n", " ),\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "43342a43176e" }, "source": [ "## Clean up Vertex and BigQuery resources\n", "\n", "To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud\n", "project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.\n", "\n", "Otherwise, you can delete the individual resources you created in this tutorial:\n", "\n", "- Pipeline from CustomJob pipeline\n", "- Pipeline from HyperparameterTuningJob pipeline\n", "- Model from CustomJob pipeline\n", "- Model from HyperparameterTuningJob pipeline\n", "- Cloud Storage Bucket (set `delete_bucket` to True to delete the bucket)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ad8d12061a65" }, "outputs": [], "source": [ "# Delete the training pipeline job\n", "training_pipeline_job.delete()\n", "\n", "# Delete the tuning pipeline job\n", "tuning_pipeline_job.delete()\n", "\n", "# Delete model resources\n", "custom_job_model = aiplatform.Model(CUSTOM_JOB_MODEL)\n", "hpt_job_model = aiplatform.Model(HPT_JOB_MODEL)\n", "custom_job_model.delete()\n", "hpt_job_model.delete()\n", "\n", "# Delete Cloud Storage objects that were created\n", "delete_bucket = False # Set True for deletion\n", "if delete_bucket:\n", " ! gsutil -m rm -r $BUCKET_URI" ] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "tabnet_on_vertex_pipelines.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }