notebooks/official/automl/sdk_automl_tabular_regression_batch_bq.ipynb (771 lines of code) (raw):

{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "id": "copyright" }, "outputs": [], "source": [ "# Copyright 2022 Google LLC.\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "title" }, "source": [ "# Vertex AI SDK for Python: AutoML training tabular regression model for batch prediction using BigQuery\n", "\n", "<table align=\"left\">\n", " <td style=\"text-align: center\">\n", " <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/automl/sdk_automl_tabular_regression_batch_bq.ipynb\">\n", " <img src=\"https://cloud.google.com/ml-engine/images/colab-logo-32px.png\" alt=\"Google Colaboratory logo\"><br> Open in Colab\n", " </a>\n", " </td>\n", " <td style=\"text-align: center\">\n", " <a href=\"https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fvertex-ai-samples%2Fmain%2Fnotebooks%2Fofficial%2Fautoml%2Fsdk_automl_tabular_regression_batch_bq.ipynb\">\n", " <img width=\"32px\" src=\"https://cloud.google.com/ml-engine/images/colab-enterprise-logo-32px.png\" alt=\"Google Cloud Colab Enterprise logo\"><br> Open in Colab Enterprise\n", " </a>\n", " </td> \n", " <td style=\"text-align: center\">\n", " <a href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/automl/sdk_automl_tabular_regression_batch_bq.ipynb\">\n", " <img src=\"https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32\" alt=\"Vertex AI logo\"><br> Open in Workbench\n", " </a>\n", " </td>\n", " <td style=\"text-align: center\">\n", " <a href=\"https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/automl/sdk_automl_tabular_regression_batch_bq.ipynb\">\n", " <img src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" alt=\"GitHub logo\"><br> View on GitHub\n", " </a>\n", " </td>\n", "</table>\n" ] }, { "cell_type": "markdown", "metadata": { "id": "overview:automl" }, "source": [ "## Overview\n", "\n", "\n", "This tutorial demonstrates how to use the Vertex AI SDK for Python to create tabular regression models and generate batch prediction using a Google Cloud [AutoML](https://cloud.google.com/vertex-ai/docs/start/automl-users) model.\n", "\n", "Learn more about [Regression for tabular data](https://cloud.google.com/vertex-ai/docs/tabular-data/classification-regression/overview)." ] }, { "cell_type": "markdown", "metadata": { "id": "objective:automl,training,online_prediction" }, "source": [ "### Objective\n", "\n", "In this tutorial, you learn how to create an AutoML tabular regression model and deploy it for batch prediction using the Vertex AI SDK for Python. You can alternatively create and deploy models using the `gcloud` command-line tool or batch using the Cloud Console.\n", "\n", "This tutorial uses the following Google Cloud ML services:\n", "\n", "- Vertex AI Datasets (Tabular)\n", "- Vertex AI Training (AutoML Tabular Training)\n", "- Vertex AI Model Registry\n", "- Vertex AI batch prediction\n", "\n", "The steps performed include:\n", "\n", "- Create a Vertex AI dataset resource.\n", "- Train an AutoML tabular regression model resource.\n", "- Obtain the evaluation metrics for the model resource.\n", "- Make a batch prediction." ] }, { "cell_type": "markdown", "metadata": { "id": "dataset:gsod,lrg" }, "source": [ "### Dataset\n", "\n", "The dataset used for this tutorial is the [GSOD dataset](https://console.cloud.google.com/marketplace/product/noaa-public/gsod) from [BigQuery public datasets](https://cloud.google.com/bigquery/public-data). In this version of the dataset, you use the year, month, and day fields to predict the mean daily temperature (`mean_temp`)." ] }, { "cell_type": "markdown", "metadata": { "id": "costs" }, "source": [ "### Costs\n", "\n", "This tutorial uses billable components of Google Cloud:\n", "\n", "* Vertex AI\n", "* BigQuery / BigQuery ML\n", "\n", "Learn about [Vertex AI\n", "pricing](https://cloud.google.com/vertex-ai/pricing), [Cloud Storage\n", "pricing](https://cloud.google.com/storage/pricing) and [BigQuery pricing](https://cloud.google.com/bigquery/pricing) and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage." ] }, { "cell_type": "markdown", "metadata": { "id": "61RBz8LLbxCR" }, "source": [ "## Get started" ] }, { "cell_type": "markdown", "metadata": { "id": "No17Cw5hgx12" }, "source": [ "### Install Vertex AI SDK for Python and other required packages\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "870f1b093d9c" }, "outputs": [], "source": [ "# Install the packages\n", "! pip3 install --upgrade --quiet google-cloud-aiplatform \\\n", " 'google-cloud-bigquery[bqstorage,pandas]' \\\n", " google-cloud-storage \n", " " ] }, { "cell_type": "markdown", "metadata": { "id": "R5Xep4W9lq-Z" }, "source": [ "### Restart runtime (Colab only)\n", "\n", "To use the newly installed packages, you must restart the runtime on Google Colab." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "XRvKdaPDTznN" }, "outputs": [], "source": [ "import sys\n", "\n", "if \"google.colab\" in sys.modules:\n", "\n", " import IPython\n", "\n", " app = IPython.Application.instance()\n", " app.kernel.do_shutdown(True)" ] }, { "cell_type": "markdown", "metadata": { "id": "SbmM4z7FOBpM" }, "source": [ "<div class=\"alert alert-block alert-warning\">\n", "<b>⚠️ The kernel is going to restart. Wait until it's finished before continuing to the next step. ⚠️</b>\n", "</div>\n" ] }, { "cell_type": "markdown", "metadata": { "id": "dmWOrTJ3gx13" }, "source": [ "### Authenticate your notebook environment (Colab only)\n", "\n", "Authenticate your environment on Google Colab.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "NyKGtVQjgx13" }, "outputs": [], "source": [ "import sys\n", "\n", "if \"google.colab\" in sys.modules:\n", "\n", " from google.colab import auth\n", "\n", " auth.authenticate_user()" ] }, { "cell_type": "markdown", "metadata": { "id": "DF4l8DTdWgPY" }, "source": [ "### Set Google Cloud project information\n", "\n", "To get started using Vertex AI, you must have an existing Google Cloud project. Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "set_project_id" }, "outputs": [], "source": [ "PROJECT_ID = \"[your-project-id]\" # @param {type:\"string\"}\n", "LOCATION = \"us-central1\" # @param {type:\"string\"}\n", "\n", "# Set the project id\n", "! gcloud config set project {PROJECT_ID}" ] }, { "cell_type": "markdown", "metadata": { "id": "setup_vars" }, "source": [ "### Set up variables\n", "\n", "Next, set up some variables used throughout the tutorial.\n", "### Import libraries and define constants" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "import_aip:mbsdk" }, "outputs": [], "source": [ "from google.cloud import aiplatform, bigquery" ] }, { "cell_type": "markdown", "metadata": { "id": "init_aip:mbsdk" }, "source": [ "## Initialize Vertex AI SDK for Python\n", "\n", "Initialize the Vertex AI SDK for Python for your project and the corresponding bucket." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "init_aip:mbsdk" }, "outputs": [], "source": [ "aiplatform.init(project=PROJECT_ID, location=LOCATION)" ] }, { "cell_type": "markdown", "metadata": { "id": "tutorial_start:automl" }, "source": [ "## Tutorial\n", "\n", "Now you're ready to start creating your own AutoML tabular regression model." ] }, { "cell_type": "markdown", "metadata": { "id": "import_file:u_dataset,bq" }, "source": [ "### Location of BigQuery training data.\n", "\n", "Set the `IMPORT_File` variable to the location of the data table in BigQuery." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "import_file:gsod,bq,lrg" }, "outputs": [], "source": [ "IMPORT_FILE = \"bigquery-public-data.samples.gsod\"" ] }, { "cell_type": "markdown", "metadata": { "id": "07d8e5973590" }, "source": [ "#### Prepare the batch prediction data\n", "\n", "Create two datasets from the original data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "63a447565c9c" }, "outputs": [], "source": [ "# Create client in default location\n", "bq_client = bigquery.Client(\n", " project=PROJECT_ID,\n", " credentials=aiplatform.initializer.global_config.credentials,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "56ade963c098" }, "outputs": [], "source": [ "# Create training dataset in default location\n", "TRAINING_INPUT_DATASET_ID = \"gsod_training_unique\"\n", "bq_dataset = bigquery.Dataset(f\"{PROJECT_ID}.{TRAINING_INPUT_DATASET_ID}\")\n", "bq_dataset = bq_client.create_dataset(bq_dataset)\n", "print(f\"Created dataset {bq_client.project}.{bq_dataset.dataset_id}\")\n", "\n", "# Create test dataset in default location\n", "PREDICTION_INPUT_DATASET_ID = \"gsod_prediction_unique\"\n", "bq_dataset = bigquery.Dataset(f\"{PROJECT_ID}.{PREDICTION_INPUT_DATASET_ID}\")\n", "bq_dataset = bq_client.create_dataset(bq_dataset)\n", "print(f\"Created dataset {bq_client.project}.{bq_dataset.dataset_id}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5b790c91182f" }, "outputs": [], "source": [ "# Select top 3000 rows of dataset\n", "TRAINING_SIZE = 3000\n", "query = f\"\"\"\n", " SELECT *\n", " FROM {IMPORT_FILE}\n", " LIMIT {TRAINING_SIZE}\n", " \"\"\"\n", "\n", "TRAINING_INPUT_TABLE_ID = f\"{PROJECT_ID}.{TRAINING_INPUT_DATASET_ID}.test\"\n", "job_config = bigquery.QueryJobConfig(destination=TRAINING_INPUT_TABLE_ID)\n", "\n", "query_job = bq_client.query(query, job_config=job_config) # API request\n", "query_job.result() # Waits for query to finish" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8f841da4574b" }, "outputs": [], "source": [ "# Select a subset of the original dataset for testing\n", "PREDICTION_SIZE = 100\n", "query = f\"\"\"\n", " SELECT *\n", " FROM {IMPORT_FILE}\n", " LIMIT {PREDICTION_SIZE}\n", " OFFSET {TRAINING_SIZE} \n", " \"\"\"\n", "\n", "PREDICTION_INPUT_TABLE_ID = f\"{PROJECT_ID}.{PREDICTION_INPUT_DATASET_ID}.prediction\"\n", "job_config = bigquery.QueryJobConfig(destination=PREDICTION_INPUT_TABLE_ID)\n", "\n", "query_job = bq_client.query(query, job_config=job_config) # API request\n", "query_job.result() # Waits for query to finish" ] }, { "cell_type": "markdown", "metadata": { "id": "create_dataset:tabular,bq,lrg" }, "source": [ "### Create the Dataset\n", "\n", "Use `TabularDataset.create()` method to create a tabular dataset resource, which takes the following parameters:\n", "\n", "- `display_name`: The human readable name for the dataset resource.\n", "- `gcs_source`: A list of one or more dataset index files to import the data items into the dataset resource.\n", "- `bq_source`: Alternatively, import data items from a BigQuery table into the dataset resource.\n", "\n", "This operation may take several minutes." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "create_dataset:tabular,bq,lrg" }, "outputs": [], "source": [ "dataset = aiplatform.TabularDataset.create(\n", " display_name=\"NOAA historical weather data_unique\",\n", " bq_source=[f\"bq://{TRAINING_INPUT_TABLE_ID}\"],\n", ")\n", "\n", "label_column = \"mean_temp\"\n", "\n", "print(dataset.resource_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "set_transformations:gsod" }, "outputs": [], "source": [ "COLUMN_SPECS = {\n", " \"year\": \"auto\",\n", " \"month\": \"auto\",\n", " \"day\": \"auto\",\n", "}\n", "\n", "label_column = \"mean_temp\"" ] }, { "cell_type": "markdown", "metadata": { "id": "create_automl_pipeline:tabular,lrg,transformations" }, "source": [ "### Create and run training pipeline\n", "\n", "To train an AutoML model, create and run a training pipeline.\n", "\n", "#### Create training pipeline\n", "\n", "Create an AutoML training pipeline using the `AutoMLTabularTrainingJob` class, with the following parameters:\n", "\n", "- `display_name`: The human readable name for the training job resource.\n", "- `optimization_prediction_type`: The type task to train the model for.\n", " - `classification`: A tabular classification model.\n", " - `regression`: A tabular regression model.\n", "- `column_transformations`: (Optional): Transformations to apply to the input columns.\n", "- `optimization_objective`: The optimization objective (minimize or maximize).\n", " - binary classification:\n", " - `minimize-log-loss`\n", " - `maximize-au-roc`\n", " - `maximize-au-prc`\n", " - `maximize-precision-at-recall`\n", " - `maximize-recall-at-precision`\n", " - multi-class classification:\n", " - `minimize-log-loss`\n", " - regression:\n", " - `minimize-rmse`\n", " - `minimize-mae`\n", " - `minimize-rmsle`\n", "\n", "The instantiated object is the DAG (directed acyclic graph) for the training pipeline." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "create_automl_pipeline:tabular,lrg,transformations" }, "outputs": [], "source": [ "training_job = aiplatform.AutoMLTabularTrainingJob(\n", " display_name=\"job_unique\",\n", " optimization_prediction_type=\"regression\",\n", " optimization_objective=\"minimize-rmse\",\n", " column_specs=COLUMN_SPECS,\n", ")\n", "\n", "print(training_job)" ] }, { "cell_type": "markdown", "metadata": { "id": "run_automl_pipeline:tabular" }, "source": [ "#### Run the training pipeline\n", "\n", "Run the training job by invoking the `run` method with the following parameters:\n", "\n", "- `dataset`: The dataset resource to train the model.\n", "- `model_display_name`: The human readable name for the trained model.\n", "- `training_fraction_split`: The percentage of the dataset to use for training.\n", "- `test_fraction_split`: The percentage of the dataset to use for test (holdout data).\n", "- `validation_fraction_split`: The percentage of the dataset to use for validation.\n", "- `target_column`: The name of the column to train as the label.\n", "- `budget_milli_node_hours`: (optional) Maximum training time specified in unit of millihours (1000 = hour).\n", "- `disable_early_stopping`: By default, the model training stops early if the model performance doesn't improve. Setting `disable_early_stopping` = `True` overrides this behavior, allowing the model to train for the entire specified duration.\n", "\n", "The `run` method, upon completion, returns the model resource.\n", "\n", "The execution of the training pipeline may take upto 3 hours." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "run_automl_pipeline:tabular" }, "outputs": [], "source": [ "model = training_job.run(\n", " dataset=dataset,\n", " model_display_name=\"model_unique\",\n", " training_fraction_split=0.6,\n", " validation_fraction_split=0.2,\n", " test_fraction_split=0.2,\n", " budget_milli_node_hours=1000,\n", " disable_early_stopping=False,\n", " target_column=label_column,\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "evaluate_the_model:mbsdk" }, "source": [ "## Review model evaluation scores\n", "After model training is complete, you can review its evaluation scores." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "evaluate_the_model:mbsdk" }, "outputs": [], "source": [ "# Get evaluations\n", "model_evaluations = model.list_model_evaluations()\n", "\n", "model_evaluation = list(model_evaluations)[0]\n", "print(model_evaluation)" ] }, { "cell_type": "markdown", "metadata": { "id": "337625a4c33c" }, "source": [ "## Send a batch prediction request\n", "\n", "Now you can make a batch prediction." ] }, { "cell_type": "markdown", "metadata": { "id": "93bc53cc960d" }, "source": [ "### Create a results dataset\n", "\n", "Create a dataset to store the prediction results." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "f49e90431fcd" }, "outputs": [], "source": [ "# Create results dataset in default location\n", "RESULTS_DATASET_ID = \"gsod_results_unique\"\n", "bq_dataset = bigquery.Dataset(f\"{PROJECT_ID}.{RESULTS_DATASET_ID}\")\n", "bq_dataset = bq_client.create_dataset(bq_dataset)\n", "print(f\"Created dataset {bq_client.project}.{bq_dataset.dataset_id}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "36c070503d2f" }, "source": [ "### Make the batch prediction request\n", "\n", "You can make a batch prediction by invoking the `batch_predict()` method, with the following parameters:\n", "\n", "- `job_display_name`: The human readable name for the batch prediction job.\n", "- `gcs_source`: A list of one or more batch request input files.\n", "- `gcs_destination_prefix`: The Cloud Storage location for storing the batch prediction resuls.\n", "- `instances_format`: The format for the input instances, either 'bigquery', 'csv' or 'jsonl'. Defaults to 'jsonl'.\n", "- `predictions_format`: The format for the output predictions, either 'csv' or 'jsonl'. Defaults to 'jsonl'.\n", "- `machine_type`: The type of machine to use for training.\n", "- `accelerator_type`: The hardware accelerator type.\n", "- `accelerator_count`: The number of accelerators to attach to a worker replica.\n", "- `sync`: Set `True` to wait until the completion of the job.\n", "\n", "Batch prediction job takes roughly 1 hour to finish." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bbd40d78ea46" }, "outputs": [], "source": [ "# Note: The bigquery_source and bigquery_destination_prefix must be in the same location\n", "PREDICTION_RESULTS_DATASET_ID = f\"{PROJECT_ID}.{RESULTS_DATASET_ID}\"\n", "\n", "batch_predict_job = model.batch_predict(\n", " job_display_name=\"tabular_regression_batch_predict_job\",\n", " bigquery_source=f\"bq://{PREDICTION_INPUT_TABLE_ID}\",\n", " instances_format=\"bigquery\",\n", " predictions_format=\"bigquery\",\n", " bigquery_destination_prefix=f\"bq://{PREDICTION_RESULTS_DATASET_ID}\",\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "4fc95aae00b0" }, "source": [ "### View the batch prediction results\n", "\n", "Use the BigQuery Python client to query the destination table and return results as a Pandas dataframe." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "fafe1b0f654b" }, "outputs": [], "source": [ "dataframe = (\n", " bq_client.query(f\"SELECT * FROM `{PREDICTION_RESULTS_DATASET_ID}.*`\")\n", " .result()\n", " .to_dataframe()\n", ")\n", "\n", "print(dataframe.head())" ] }, { "cell_type": "markdown", "metadata": { "id": "cleanup:mbsdk" }, "source": [ "# Cleaning up\n", "\n", "To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud\n", "project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.\n", "\n", "Otherwise, you can delete the individual resources you created in this tutorial:\n", "\n", "- Model\n", "- AutoML Training Job\n", "- Batch Job" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "3be2c2bf9146" }, "outputs": [], "source": [ "# Delete BigQuery datasets\n", "bq_client.delete_dataset(\n", " f\"{PROJECT_ID}.{TRAINING_INPUT_DATASET_ID}\",\n", " delete_contents=True,\n", " not_found_ok=True,\n", ")\n", "\n", "bq_client.delete_dataset(\n", " f\"{PROJECT_ID}.{PREDICTION_INPUT_DATASET_ID}\",\n", " delete_contents=True,\n", " not_found_ok=True,\n", ")\n", "\n", "bq_client.delete_dataset(\n", " f\"{PROJECT_ID}.{RESULTS_DATASET_ID}\", delete_contents=True, not_found_ok=True\n", ")\n", "\n", "# Delete Vertex AI resources\n", "dataset.delete()\n", "model.delete()\n", "training_job.delete()\n", "batch_predict_job.delete()" ] } ], "metadata": { "colab": { "name": "sdk_automl_tabular_regression_batch_bq.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }