notebooks/official/model_evaluation/automl_text_classification_model_evaluation.ipynb (1,381 lines of code) (raw):
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "copyright"
},
"outputs": [],
"source": [
"# Copyright 2022 Google LLC\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "148c71404373"
},
"source": [
"Starting on September 15, 2024, you can only customize classification, entity extraction, and sentiment analysis models by moving to Vertex AI Gemini prompts and tuning. Training or updating models for Vertex AI AutoML for Text classification, entity extraction, and sentiment analysis objectives will no longer be available. You can continue using existing Vertex AI AutoML Text objectives until June 15, 2025. For more information about how Gemini offers enhanced user experience through improved prompting capabilities, see \n",
"[Introduction to tuning](https://cloud.google.com/vertex-ai/generative-ai/docs/models/tune-gemini-overview)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "title:generic"
},
"source": [
"# Vertex AI Pipelines: AutoML text classification pipelines using google-cloud-pipeline-components\n",
"\n",
"<table align=\"left\">\n",
" <td>\n",
" <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/model_evaluation/automl_text_classification_model_evaluation.ipynb\">\n",
" <img src=\"https://cloud.google.com/ml-engine/images/colab-logo-32px.png\" alt=\"Colab logo\"> Run in Colab\n",
" </a>\n",
" </td>\n",
" <td>\n",
" <a href=\"https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/model_evaluation/automl_text_classification_model_evaluation.ipynb\">\n",
" <img src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" alt=\"GitHub logo\">\n",
" View on GitHub\n",
" </a>\n",
" </td>\n",
" <td>\n",
" <a href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/model_evaluation/automl_text_classification_model_evaluation.ipynb\">\n",
" <img src=\"https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32\" alt=\"Vertex AI logo\">\n",
" Open in Vertex AI Workbench\n",
" </a>\n",
" </td>\n",
"</table>\n",
"<br/><br/><br/>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "962e636b5cee"
},
"source": [
"**_NOTE_**: This notebook has been tested in the following environment:\n",
"\n",
"* Python version = 3.9"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "overview:pipelines,automl"
},
"source": [
"## Overview\n",
"\n",
"This notebook demonstrates how to use the Vertex AI classification model evaluation component to evaluate an AutoML text classification model. Model evaluation helps you determine your model performance based on the evaluation metrics and improve the model if necessary. \n",
"\n",
"Learn more about [Vertex AI Model Evaluation](https://cloud.google.com/vertex-ai/docs/evaluation/introduction) and [Classification on text data](https://cloud.google.com/vertex-ai/docs/training-overview#classification_for_text)."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "objective:pipelines,automl"
},
"source": [
"### Objective\n",
"\n",
"In this tutorial, you learn how to use `Vertex AI Pipelines` and `Google Cloud Pipeline Components` to build and evaluate an `AutoML` text classification model.\n",
"\n",
"\n",
"This tutorial uses the following Google Cloud ML services and resources:\n",
"\n",
"- Vertex AI `Datasets`\n",
"- Vertex AI `Training`(AutoML Text Classification) \n",
"- Vertex AI `Model Registry`\n",
"- Vertex AI `Pipelines`\n",
"- Vertex AI `Batch Predictions`\n",
"\n",
"The steps performed include:\n",
"\n",
"- Create a Vertex AI `Dataset`.\n",
"- Train an Automl Text Classification model on the `Dataset` resource.\n",
"- Import the trained `AutoML model resource` into the pipeline.\n",
"- Run a `Batch Prediction` job.\n",
"- Evaluate the AutoML model using the `Classification Evaluation Component`.\n",
"- Import the evaluation metrics to the AutoML model resource."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "dd81fd5c3454"
},
"source": [
"### Dataset\n",
"\n",
"The dataset used for this tutorial is the [Happy Moments dataset](https://www.kaggle.com/ritresearch/happydb) from [Kaggle Datasets](https://www.kaggle.com/ritresearch/happydb). The version of the dataset you use in this tutorial is stored in a public Cloud Storage bucket."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "costs"
},
"source": [
"### Costs\n",
"\n",
"This tutorial uses billable components of Google Cloud:\n",
"\n",
"* Vertex AI\n",
"* Cloud Storage\n",
"\n",
"Learn about [Vertex AI\n",
"pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage\n",
"pricing](https://cloud.google.com/storage/pricing), and use the [Pricing\n",
"Calculator](https://cloud.google.com/products/calculator/)\n",
"to generate a cost estimate based on your projected usage."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "install_aip:mbsdk"
},
"source": [
"## Installation\n",
"\n",
"Install the packages required for executing this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "install_aip:mbsdk"
},
"outputs": [],
"source": [
"! pip3 install --upgrade google-cloud-aiplatform \\\n",
" google-cloud-storage \\\n",
" kfp google-cloud-pipeline-components==1.0.25 \\\n",
" ndjson --quiet"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "restart"
},
"source": [
"### Colab only: Uncomment the following cell to restart the kernel"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "D-ZBOjErv5mM"
},
"outputs": [],
"source": [
"# Automatically restart kernel after installs so that your environment can access the new packages\n",
"# import IPython\n",
"\n",
"# app = IPython.Application.instance()\n",
"# app.kernel.do_shutdown(True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "a19566219b28"
},
"source": [
"## Before you begin\n",
"\n",
"### Set up your Google Cloud project\n",
"\n",
"**The following steps are required, regardless of your notebook environment.**\n",
"\n",
"1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.\n",
"\n",
"2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).\n",
"\n",
"3. [Enable the Vertex AI and Dataflow APIs](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com,dataflow.googleapis.com).\n",
"4. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk)."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "before_you_begin:nogpu"
},
"source": [
"#### Set your project ID\n",
"\n",
"**If you don't know your project ID**, try the following:\n",
"* Run `gcloud config list`.\n",
"* Run `gcloud projects list`.\n",
"* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "set_project_id"
},
"outputs": [],
"source": [
"PROJECT_ID = \"[your-project-id]\" # @param {type:\"string\"}\n",
"\n",
"# Set the project id\n",
"! gcloud config set project {PROJECT_ID}"
]
},
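{
"cell_type": "markdown",
"metadata": {
"id": "autoset_project_id_md"
},
"source": [
"Optionally, if you left the placeholder unchanged, you can try to detect the project ID from your local `gcloud` configuration. The following cell is a convenience sketch that assumes the Cloud SDK is installed and a default project has been set."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "autoset_project_id"
},
"outputs": [],
"source": [
"# Optional fallback: detect the project ID from the local gcloud config if\n",
"# the placeholder above was left unchanged, then point gcloud at it\n",
"if PROJECT_ID == \"\" or PROJECT_ID == \"[your-project-id]\":\n",
"    shell_output = ! gcloud config get-value project 2>/dev/null\n",
"    PROJECT_ID = shell_output[0]\n",
"    print(\"Detected project ID:\", PROJECT_ID)\n",
"    ! gcloud config set project {PROJECT_ID}"
]
},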
{
"cell_type": "markdown",
"metadata": {
"id": "region"
},
"source": [
"#### Region\n",
"\n",
"You can also change the `REGION` variable used by Vertex AI. Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "2dw8q9fdQEH5"
},
"outputs": [],
"source": [
"REGION = \"us-central1\" # @param {type: \"string\"}"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "gcp_authenticate"
},
"source": [
"### Authenticate your Google Cloud account\n",
"\n",
"Depending on your Jupyter environment, you may have to manually authenticate. Follow the relevant instructions below.\n",
"\n",
"**1. Vertex AI Workbench**\n",
"* Do nothing as you are already authenticated.\n",
"\n",
"**2. Local JupyterLab instance, uncomment and run:**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ce6043da7b33"
},
"outputs": [],
"source": [
"# ! gcloud auth login"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0367eac06a10"
},
"source": [
"**3. Colab, uncomment and run:**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "21ad4dbb4a61"
},
"outputs": [],
"source": [
"# from google.colab import auth\n",
"# auth.authenticate_user()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "c13224697bfb"
},
"source": [
"**4. Service account or other**\n",
"* See how to grant Cloud Storage permissions to your service account at https://cloud.google.com/storage/docs/gsutil/commands/iam#ch-examples."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "bucket:mbsdk"
},
"source": [
"### Create a Cloud Storage bucket\n",
"\n",
"Create a storage bucket to store intermediate artifacts such as datasets."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "bucket"
},
"outputs": [],
"source": [
"BUCKET_URI = f\"gs://your-bucket-name-{PROJECT_ID}-unique\" # @param {type:\"string\"}"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "autoset_bucket"
},
"source": [
"**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "91c46850b49b"
},
"outputs": [],
"source": [
"! gsutil mb -l {REGION} -p {PROJECT_ID} {BUCKET_URI}"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "set_service_account"
},
"source": [
"#### Service Account\n",
"\n",
"**If you don't know your service account**, try to get your service account using `gcloud` command by executing the second cell below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "set_service_account"
},
"outputs": [],
"source": [
"SERVICE_ACCOUNT = \"[your-service-account]\" # @param {type:\"string\"}"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "autoset_service_account"
},
"outputs": [],
"source": [
"import sys\n",
"\n",
"IS_COLAB = \"google.colab\" in sys.modules\n",
"\n",
"if (\n",
" SERVICE_ACCOUNT == \"\"\n",
" or SERVICE_ACCOUNT is None\n",
" or SERVICE_ACCOUNT == \"[your-service-account]\"\n",
"):\n",
" # Get your service account from gcloud\n",
" if not IS_COLAB:\n",
" shell_output = !gcloud auth list 2>/dev/null\n",
" SERVICE_ACCOUNT = shell_output[2].replace(\"*\", \"\").strip()\n",
"\n",
" if IS_COLAB:\n",
" shell_output = ! gcloud projects describe $PROJECT_ID\n",
" project_number = shell_output[-1].split(\":\")[1].strip().replace(\"'\", \"\")\n",
" SERVICE_ACCOUNT = f\"{project_number}-compute@developer.gserviceaccount.com\"\n",
"\n",
" print(\"Service Account:\", SERVICE_ACCOUNT)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "set_service_account:pipelines"
},
"source": [
"#### Set service account access for Vertex AI Pipelines\n",
"\n",
"Run the following commands to grant your service account the access to read and write pipeline artifacts in the bucket that you created in the previous step. You only need to run this step once per service account."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "set_service_account:pipelines"
},
"outputs": [],
"source": [
"! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectCreator $BUCKET_URI\n",
"\n",
"! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectViewer $BUCKET_URI"
]
},
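{
"cell_type": "markdown",
"metadata": {
"id": "verify_iam_bindings_md"
},
"source": [
"Optionally, verify that the role bindings were applied to the bucket."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "verify_iam_bindings"
},
"outputs": [],
"source": [
"# Inspect the bucket's IAM policy to confirm the new role bindings\n",
"! gsutil iam get $BUCKET_URI"
]
},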
{
"cell_type": "markdown",
"metadata": {
"id": "setup_vars"
},
"source": [
"### Import libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "import_aip:mbsdk"
},
"outputs": [],
"source": [
"import json\n",
"\n",
"import kfp\n",
"import matplotlib.pyplot as plt\n",
"import ndjson\n",
"from google.cloud import aiplatform, aiplatform_v1, storage\n",
"from kfp.v2 import compiler # noqa: F811"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "init_aip:mbsdk"
},
"source": [
"### Initialize Vertex AI SDK for Python\n",
"\n",
"Initialize the Vertex AI SDK for Python for your project and corresponding bucket."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "init_aip:mbsdk"
},
"outputs": [],
"source": [
"aiplatform.init(project=PROJECT_ID, staging_bucket=BUCKET_URI)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "define_pipeline:gcpc,automl,happydb,tcn"
},
"source": [
"## Train and deploy AutoML Text Classification model \n",
"\n",
"In this notebook, you execute all the steps from dataset building to model deployment and evaluation using Vertex AI pipelines. \n",
"\n",
"As the first step, you build the training and deployment pipeline. The pipeline includes the following tasks:\n",
"1. Create a Vertex AI Text Dataset.\n",
"2. Trains an Automl Text Classification model.\n",
"3. Creates a Vertex AI Endpoint.\n",
"4. Deploys the AutoML model to the Vertex AI Endpoint.\n",
"\n",
"The pipeline uses pre-built components for each of the tasks from the `Google Cloud Pipeline Components` package.\n",
"\n",
"Learn more about the [Google Cloud Pipeline Components](https://cloud.google.com/vertex-ai/docs/pipelines/components-introduction).\n",
"\n",
"Set the parameters required for the training and deployment pipeline."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "90e2a99bdcf8"
},
"outputs": [],
"source": [
"# Specify the GCS path for the text dataset\n",
"IMPORT_FILE = \"gs://cloud-ml-data/NL-classification/happiness.csv\"\n",
"\n",
"# provide dataset display name\n",
"DATASET_DISPLAY_NAME = \"happydb-dataset-unique\"\n",
"\n",
"# provide training job display name\n",
"TRAINING_JOB_DISPLAY_NAME = \"happydb-automl-job-unique\"\n",
"\n",
"# provide model display name\n",
"MODEL_DISPLAY_NAME = \"happydb-automl-model-unique\"\n",
"\n",
"# provide endpoint display name\n",
"ENDPOINT_DISPLAY_NAME = \"happydb-classification-endpoint-unique\"\n",
"\n",
"# provide pipeline job display name\n",
"TRAINING_PIPELINE_DISPLAY_NAME = \"happydb-training-pipeline-unique\"\n",
"\n",
"# provide Cloud Storage root folder path for saving the artifacts\n",
"PIPELINE_ROOT = f\"{BUCKET_URI}/pipeline_root/happydb\"\n",
"\n",
"# provide path to store the compiled pipeline package\n",
"TRAINING_PIPELINE_PATH = \"automl_text_classification_pipeline.json\""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "02efd309b96b"
},
"source": [
"Define the Vertex AI pipeline. \n",
"\n",
"Learn more about building [Vertex AI pipelines](https://cloud.google.com/vertex-ai/docs/pipelines/build-pipeline)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "define_pipeline:gcpc,automl,happydb,tcn"
},
"outputs": [],
"source": [
"@kfp.dsl.pipeline(name=TRAINING_PIPELINE_DISPLAY_NAME)\n",
"def pipeline(\n",
" import_file: str,\n",
" dataset_display_name: str,\n",
" training_job_display_name: str,\n",
" model_display_name: str,\n",
" endpoint_display_name: str,\n",
" project: str = PROJECT_ID,\n",
" region: str = REGION,\n",
" training_split: float = 0.4,\n",
" validation_split: float = 0.3,\n",
" test_split: float = 0.3,\n",
"):\n",
" from google_cloud_pipeline_components import aiplatform as gcc_aip\n",
" from google_cloud_pipeline_components.v1.endpoint import (EndpointCreateOp,\n",
" ModelDeployOp)\n",
"\n",
" # component to create the dataset\n",
" dataset_create_task = gcc_aip.TextDatasetCreateOp(\n",
" display_name=dataset_display_name,\n",
" gcs_source=import_file,\n",
" import_schema_uri=aiplatform.schema.dataset.ioformat.text.multi_label_classification,\n",
" project=project,\n",
" )\n",
" # component to run AutoML training job\n",
" training_run_task = gcc_aip.AutoMLTextTrainingJobRunOp(\n",
" dataset=dataset_create_task.outputs[\"dataset\"],\n",
" display_name=training_job_display_name,\n",
" prediction_type=\"classification\",\n",
" multi_label=True,\n",
" training_fraction_split=training_split,\n",
" validation_fraction_split=validation_split,\n",
" test_fraction_split=test_split,\n",
" model_display_name=model_display_name,\n",
" project=project,\n",
" )\n",
" # component to create an endpoint\n",
" endpoint_op = EndpointCreateOp(\n",
" project=project,\n",
" location=region,\n",
" display_name=endpoint_display_name,\n",
" )\n",
" # component to deploy the model the endpoint\n",
" _ = ModelDeployOp(\n",
" model=training_run_task.outputs[\"model\"],\n",
" endpoint=endpoint_op.outputs[\"endpoint\"],\n",
" automatic_resources_min_replica_count=1,\n",
" automatic_resources_max_replica_count=1,\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "compile_pipeline"
},
"source": [
"### Compile the pipeline\n",
"\n",
"Next, compile the pipeline to a json package."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "compile_pipeline"
},
"outputs": [],
"source": [
"compiler.Compiler().compile(\n",
" pipeline_func=pipeline,\n",
" package_path=TRAINING_PIPELINE_PATH,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "run_pipeline:automl,text"
},
"source": [
"### Run the training and deployment pipeline\n",
"\n",
"Now, create a Vertex AI pipeline job to run the pipeline. Note that during the pipeline definition, training, validation and test split are by default specified as 0.4, 0.3 and 0.3 respectively. Change it as needed.\n",
"\n",
"For creating the pipeline job, you specify the following parameters:\n",
"\n",
"- `display_name`: The name of the pipeline, this shows up in the Google Cloud console.\n",
"- `template_path`: The path of PipelineJob or PipelineSpec JSON or YAML file. It can be a local path, a Google Cloud Storage URI or an Artifact Registry URI.\n",
"- `parameter_values`: The mapping from runtime parameter names to its values that control the pipeline run.\n",
"- `enable_caching`: Set as True to turn on caching for the run.\n",
"\n",
"Learn more about [PipelineJob](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.PipelineJob)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "run_pipeline:automl,text"
},
"outputs": [],
"source": [
"# set the values to be passed as input parameters to the pipeline\n",
"training_parameters = {\n",
" \"import_file\": IMPORT_FILE,\n",
" \"dataset_display_name\": DATASET_DISPLAY_NAME,\n",
" \"training_job_display_name\": TRAINING_JOB_DISPLAY_NAME,\n",
" \"model_display_name\": MODEL_DISPLAY_NAME,\n",
" \"endpoint_display_name\": ENDPOINT_DISPLAY_NAME,\n",
"}\n",
"\n",
"# create a pipeline job\n",
"training_job = aiplatform.PipelineJob(\n",
" display_name=TRAINING_PIPELINE_DISPLAY_NAME,\n",
" template_path=TRAINING_PIPELINE_PATH,\n",
" pipeline_root=PIPELINE_ROOT,\n",
" parameter_values=training_parameters,\n",
" enable_caching=False,\n",
")\n",
"\n",
"# run the job\n",
"training_job.run(sync=True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "view_pipeline_run:automl,text"
},
"source": [
"Click on the generated link to see your run in the Cloud Console.\n",
"\n",
"In the UI, many of the pipeline DAG nodes will expand or collapse when you click on them. Here is a partially-expanded view of the DAG (click image to see larger version).\n",
"\n",
"<a href=\"https://storage.googleapis.com/amy-jo/images/mp/automl_text_classif.png\" target=\"_blank\"><img src=\"https://storage.googleapis.com/amy-jo/images/mp/automl_text_classif.png\" width=\"40%\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "b875cffc34c4"
},
"source": [
"Fetch the created model by filtering the display name."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "9af2a70fc560"
},
"outputs": [],
"source": [
"models = aiplatform.Model.list(\n",
" filter=f\"display_name={MODEL_DISPLAY_NAME}\", order_by=\"create_time\"\n",
")\n",
"if models:\n",
" model = models[0]\n",
"print(model)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6efda0ed8c17"
},
"source": [
"Fetch the availble evaluation metrics for the model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "679ded8b6097"
},
"outputs": [],
"source": [
"# Get evaluations\n",
"model_evaluations = model.list_model_evaluations()\n",
"\n",
"model_evaluation = list(model_evaluations)[0]\n",
"\n",
"# Print the evaluation metrics\n",
"for evaluation in model_evaluations:\n",
" evaluation = evaluation.to_dict()\n",
" print(\"Model's evaluation metrics from Training:\\n\")\n",
" metrics = evaluation[\"metrics\"]\n",
" for metric in metrics.keys():\n",
" print(f\"metric: {metric}, value: {metrics[metric]}\\n\")"
]
},
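{
"cell_type": "markdown",
"metadata": {
"id": "headline_metric_md"
},
"source": [
"For a quick comparison, you often want a single headline metric rather than the full dump. The following sketch reads `auPrc` (area under the precision-recall curve) from the first evaluation in the list; `get` is used because the exact set of metric keys can vary."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "headline_metric"
},
"outputs": [],
"source": [
"# Pull one headline metric from the first evaluation. AutoML text\n",
"# classification evaluations typically include auPrc; `get` keeps this\n",
"# safe if the key is absent.\n",
"latest_metrics = model_evaluation.to_dict()[\"metrics\"]\n",
"print(\"auPrc:\", latest_metrics.get(\"auPrc\"))"
]
},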
{
"cell_type": "markdown",
"metadata": {
"id": "f54ab89a020c"
},
"source": [
"### Run batch predictions on the model\n",
"\n",
"For evaluating the model, a batch of test data along with ground truth is required. Before evaluating the model, you generate a batch prediction job for the model to see if the model is able to generate the predictions in batches. In Vertex AI, you need not deploy a model in order to run batch prediction jobs on it. \n",
"\n",
"To create a batch prediction job, you must first format your input instances(in JSONL format) and store them in a Google Cloud Storage bucket. You also need to provide a Google Cloud Storage bucket to save the results.\n",
"\n",
"#### Format input instances\n",
"In this step, the instances are formatted in JSONL. Each line in the JSONL document needs to be formatted as below.\n",
"\n",
"```\n",
"{ \"content\": \"gs://sourcebucket/datasets/texts/source_text.txt\", \"mimeType\": \"text/plain\"}\n",
"```\n",
"\n",
"The `content` field in the JSON structure must be a Google Cloud Storage URI to a document that contains the text input for the model.\n",
"\n",
"Learn more about [batch predictions](https://cloud.google.com/ai-platform-unified/docs/predictions/batch-predictions#text)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "f23865fbab9a"
},
"outputs": [],
"source": [
"# define a set of test samples\n",
"instances = [\n",
" {\n",
" \"Text\": \"I went on a successful date with someone I felt sympathy and connection with.\",\n",
" \"Labels\": \"affection\",\n",
" },\n",
" {\n",
" \"Text\": \"I was happy when my son got 90% marks in his examination\",\n",
" \"Labels\": \"affection\",\n",
" },\n",
" {\"Text\": \"I went to the gym this morning and did yoga.\", \"Labels\": \"exercise\"},\n",
" {\n",
" \"Text\": \"We had a serious talk with some friends of ours who have been flaky lately. They understood and we had a good evening hanging out.\",\n",
" \"Labels\": \"bonding\",\n",
" },\n",
" {\n",
" \"Text\": \"I went with grandchildren to butterfly display at Crohn Conservatory\",\n",
" \"Labels\": \"affection\",\n",
" },\n",
" {\"Text\": \"I meditated last night.\", \"Labels\": \"leisure\"},\n",
" {\n",
" \"Text\": \"I made a new recipe for peasant bread, and it came out spectacular!\",\n",
" \"Labels\": \"achievement\",\n",
" },\n",
" {\n",
" \"Text\": \"I got gift from my elder brother which was really surprising me\",\n",
" \"Labels\": \"affection\",\n",
" },\n",
" {\"Text\": \"YESTERDAY MY MOMS BIRTHDAY SO I ENJOYED\", \"Labels\": \"enjoy_the_moment\"},\n",
" {\n",
" \"Text\": \"Watching cupcake wars with my three teen children\",\n",
" \"Labels\": \"affection\",\n",
" },\n",
" {\"Text\": \"I came in 3rd place in my Call of Duty video game.\", \"Labels\": \"leisure\"},\n",
" {\n",
" \"Text\": \"I completed my 5 miles run without break. It makes me feel strong.\",\n",
" \"Labels\": \"exercise\",\n",
" },\n",
" {\"Text\": \"went to movies with my friends it was fun\", \"Labels\": \"bonding\"},\n",
" {\n",
" \"Text\": \"I was shorting Gold and made $200 from the trade.\",\n",
" \"Labels\": \"achievement\",\n",
" },\n",
" {\n",
" \"Text\": \"Hearing Songs It can be nearly impossible to go from angry to happy, so you're just looking for the thought that eases you out of your angry feeling and moves you in the direction of happiness. It may take a while, but as long as you're headed in a more positive direction youall be doing yourself a world of good.\",\n",
" \"Labels\": \"enjoy_the_moment\",\n",
" },\n",
" {\n",
" \"Text\": \"My son performed very well for a test preparation.\",\n",
" \"Labels\": \"affection\",\n",
" },\n",
" {\"Text\": \"I helped my neighbour to fix their car damages.\", \"Labels\": \"bonding\"},\n",
" {\n",
" \"Text\": \"Managed to get the final trophy in a game I was playing.\",\n",
" \"Labels\": \"achievement\",\n",
" },\n",
" {\n",
" \"Text\": \"A hot kiss with my girl friend last night made my day\",\n",
" \"Labels\": \"bonding\",\n",
" },\n",
" {\n",
" \"Text\": \"My new BCAAs came in the mail. Yay! Strawberry Lemonade flavored aminos make my heart happy.\",\n",
" \"Labels\": \"affection\",\n",
" },\n",
" {\"Text\": \"Got A in class.\", \"Labels\": \"achievement\"},\n",
" {\n",
" \"Text\": \"My sister called me from abroad this morning after some long years. Such a happy occassion for all family members.\",\n",
" \"Labels\": \"affection\",\n",
" },\n",
" {\n",
" \"Text\": \"The cake I made today came out amazing. It tasted amazing as well.\",\n",
" \"Labels\": \"achievement\",\n",
" },\n",
" {\n",
" \"Text\": \"There are two types of people in the world: those who choose to be happy, and those who choose to be unhappy. Contrary to popular belief, happiness doesn't come from fame, fortune, other people, or material possessions\",\n",
" \"Labels\": \"enjoy_the_moment\",\n",
" },\n",
" {\n",
" \"Text\": \"My grandmother start to walk from the bed after a long time.\",\n",
" \"Labels\": \"affection\",\n",
" },\n",
" {\"Text\": \"i was able to hit a top spin serve in tennis\", \"Labels\": \"achievement\"},\n",
" {\n",
" \"Text\": \"I napped with my husband on the bed this afternoon and it was sweet to cuddle so close to him.\",\n",
" \"Labels\": \"affection\",\n",
" },\n",
" {\n",
" \"Text\": \"My co-woker started playing a Carley Rae Jepsen song from her phone while ringing out customers.\",\n",
" \"Labels\": \"leisure\",\n",
" },\n",
" {\n",
" \"Text\": \"My son woke me up to a fantastic breakfast of eggs, his special hamburger patty and pancakes.\",\n",
" \"Labels\": \"affection\",\n",
" },\n",
" {\n",
" \"Text\": \"After a long time my brother gave a suprise visit to my house yesterday.\",\n",
" \"Labels\": \"affection\",\n",
" },\n",
"]\n",
"\n",
"# define the input file name\n",
"BATCH_JOB_INPUT_FILE = \"happiness-batch-prediction-input.jsonl\""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6d0156222582"
},
"source": [
"#### Save the data to Cloud Storage bucket\n",
"\n",
"Create a new Cloud Storage blob, upload individual instances as text files to the bucket, and then create the JSONL file with URIs for the instances."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "aa27e20fc2ec"
},
"outputs": [],
"source": [
"# Instantiate the Storage client and create the new bucket\n",
"storage_client = storage.Client()\n",
"bucket = storage_client.bucket(BUCKET_URI[5:])\n",
"# Iterate over the prediction instances and create a new text file\n",
"input_file_data = []\n",
"for count, instance in enumerate(instances):\n",
" instance_name = f\"input_{count}.txt\"\n",
" instance_file_uri = f\"{BUCKET_URI}/batch-prediction-input/{instance_name}\"\n",
" # Add the data to store in the JSONL input file.\n",
" tmp_data = {\"content\": instance_file_uri, \"mimeType\": \"text/plain\"}\n",
" input_file_data.append(tmp_data)\n",
"\n",
" # Create the new instance file\n",
" blob = bucket.blob(\"batch-prediction-input/\" + instance_name)\n",
" blob.upload_from_string(instance[\"Text\"])\n",
"\n",
"\n",
"input_str = \"\\n\".join([str(d) for d in input_file_data])\n",
"file_blob = bucket.blob(f\"{BATCH_JOB_INPUT_FILE}\")\n",
"file_blob.upload_from_string(input_str)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "90fe8584fe74"
},
"source": [
"#### Create and run the batch prediction job"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "d330cc0a3582"
},
"outputs": [],
"source": [
"# provide display name for the batch prediction job\n",
"BATCH_JOB_DISPLAY_NAME = \"happydb-batch-prediction-job-unique\"\n",
"\n",
"# create the batch prediction job\n",
"batch_prediction_job = model.batch_predict(\n",
" job_display_name=BATCH_JOB_DISPLAY_NAME,\n",
" gcs_source=f\"{BUCKET_URI}/{BATCH_JOB_INPUT_FILE}\",\n",
" gcs_destination_prefix=f\"{BUCKET_URI}/output\",\n",
" sync=True,\n",
")\n",
"batch_prediction_job_name = batch_prediction_job.resource_name"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "b3ca5b6c88ef"
},
"outputs": [],
"source": [
"# fetch the job details\n",
"batch_job = aiplatform.jobs.BatchPredictionJob(batch_prediction_job_name)\n",
"print(f\"Batch prediction job state: {str(batch_job.state)}\")"
]
},
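{
"cell_type": "markdown",
"metadata": {
"id": "check_job_state_md"
},
"source": [
"Because `batch_predict` was called with `sync=True`, the job should already be in a terminal state. As a quick sanity check, the sketch below compares the state against the `JobState` enum before you read the outputs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "check_job_state"
},
"outputs": [],
"source": [
"# Sanity check: confirm the job reached the SUCCEEDED terminal state\n",
"# before reading its outputs\n",
"if batch_job.state == aiplatform_v1.types.JobState.JOB_STATE_SUCCEEDED:\n",
"    print(\"Batch prediction job succeeded.\")\n",
"else:\n",
"    print(f\"Job finished in state {batch_job.state}; inspect it in the console.\")"
]
},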
{
"cell_type": "markdown",
"metadata": {
"id": "8ec70df091f1"
},
"source": [
"#### Get predictions from the batch prediction job\n",
"\n",
"Load the batch predictions that are saved to the specified output Cloud Storage path."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "82c88319987e"
},
"outputs": [],
"source": [
"bp_iter_outputs = batch_job.iter_outputs()\n",
"\n",
"prediction_results = list()\n",
"for blob in bp_iter_outputs:\n",
" if blob.name.split(\"/\")[-1].startswith(\"prediction\"):\n",
" prediction_results.append(blob.name)\n",
"\n",
"for prediction_result in prediction_results:\n",
" gfile_name = f\"gs://{bp_iter_outputs.bucket.name}/{prediction_result}\".replace(\n",
" BUCKET_URI + \"/\", \"\"\n",
" )\n",
" data = bucket.get_blob(gfile_name).download_as_string()\n",
" data = ndjson.loads(data)\n",
" print(data)"
]
},
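{
"cell_type": "markdown",
"metadata": {
"id": "parse_predictions_md"
},
"source": [
"Each output line pairs the input `instance` with a `prediction` object. For AutoML text classification, the prediction typically carries parallel `displayNames` and `confidences` arrays; the sketch below assumes that schema and prints the highest-confidence label for each instance in the last file loaded above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "parse_predictions"
},
"outputs": [],
"source": [
"# Print the top label per instance, assuming the standard AutoML text\n",
"# classification output schema (parallel displayNames/confidences arrays)\n",
"for result in data:\n",
"    prediction = result.get(\"prediction\", {})\n",
"    labels = prediction.get(\"displayNames\", [])\n",
"    confidences = prediction.get(\"confidences\", [])\n",
"    if labels:\n",
"        top_label, top_conf = max(zip(labels, confidences), key=lambda p: p[1])\n",
"        print(f\"{result['instance']['content']}: {top_label} ({top_conf:.2f})\")"
]
},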
{
"cell_type": "markdown",
"metadata": {
"id": "10aefe12628a"
},
"source": [
"## Create input file with ground truth for evaluation \n",
"\n",
"Evaluation component needs ground truth to be part of the input file against which the predicted results can be compared and evaluated."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "8f51edb0d581"
},
"outputs": [],
"source": [
"# set the file name for saving the input with ground truth data\n",
"BATCH_JOB_INPUT_EVAL_FILE = \"happydb-input-with-groundtruth.jsonl\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "c6b057194bb4"
},
"outputs": [],
"source": [
"# Instantiate the Storage client and create the new bucket\n",
"storage_client = storage.Client()\n",
"bucket = storage_client.bucket(BUCKET_URI[5:])\n",
"# Iterate over the prediction instances, creating a new TXT file\n",
"# for each.\n",
"input_file_data = []\n",
"for count, instance in enumerate(instances):\n",
" instance_name = f\"input_{count}.txt\"\n",
" instance_file_uri = (\n",
" f\"{BUCKET_URI}/evaluation-batch-prediction-input/{instance_name}\"\n",
" )\n",
" # Add the data to store in the JSONL input file.\n",
" # ground_truth variable in each json instance is needed to act as ground_truth for the evaluation task\n",
" tmp_data = {\n",
" \"content\": instance_file_uri,\n",
" \"mimeType\": \"text/plain\",\n",
" \"ground_truth\": instance[\"Labels\"],\n",
" }\n",
" input_file_data.append(tmp_data)\n",
"\n",
" # Create the new instance file\n",
" blob = bucket.blob(\"evaluation-batch-prediction-input/\" + instance_name)\n",
" blob.upload_from_string(instance[\"Text\"])\n",
"\n",
"input_str = json.dumps(input_file_data[0])\n",
"for i in input_file_data[1:]:\n",
" input_str = input_str + \"\\n\" + json.dumps(i)\n",
"file_blob = bucket.blob(f\"{BATCH_JOB_INPUT_EVAL_FILE}\")\n",
"file_blob.upload_from_string(input_str)"
]
},
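{
"cell_type": "markdown",
"metadata": {
"id": "verify_eval_input_md"
},
"source": [
"Optionally, read back the first few lines of the uploaded file to confirm that each JSONL record contains the content URI, the MIME type, and the `ground_truth` label."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "verify_eval_input"
},
"outputs": [],
"source": [
"# Sanity check: print the first few JSONL records that were just uploaded\n",
"for line in file_blob.download_as_text().splitlines()[:3]:\n",
"    print(line)"
]
},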
{
"cell_type": "markdown",
"metadata": {
"id": "da26980e4892"
},
"source": [
"## Create a pipeline for model evaluation\n",
"\n",
"In this section, you run a batch prediction job and evaluate the results from a Vertex AI pipeline by calling `evaluate` function. Learn more about [evaluate function](https://github.com/googleapis/python-aiplatform/blob/main/google/cloud/aiplatform/models.py#L5127)..\n",
"\n",
"Set the parameters for the evaluation pipeline."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ba99dc63345d"
},
"source": [
"### Define parameters to run the evaluate function\n",
"\n",
"Specify the required parameters to run `evaluate` function. \n",
"\n",
"The following is the instruction of `evaluate` function paramters:\n",
"\n",
"- `prediction_type`: The problem type being addressed by this evaluation run. 'classification' and 'regression' are the currently supported problem types.\n",
"- `target_field_name`: Name of the column to be used as the target for classification.\n",
"- `gcs_source_uris`: List of the Cloud Storage bucket uris of input instances for batch prediction.\n",
"- `class_labels`: The list of all class names for the target field in the dataset.\n",
"- `generate_feature_attributions`: Optional. Whether the model evaluation job should generate feature attributions. Defaults to False if not specified."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "8db9156f3050"
},
"outputs": [],
"source": [
"DATA_SOURCE = f\"{BUCKET_URI}/{BATCH_JOB_INPUT_EVAL_FILE}\"\n",
"CLASS_LABELS = [\n",
" \"affection\",\n",
" \"exercise\",\n",
" \"bonding\",\n",
" \"leisure\",\n",
" \"achievement\",\n",
" \"enjoy_the_moment\",\n",
" \"nature\",\n",
"]\n",
"\n",
"evaluation_job = model.evaluate(\n",
" prediction_type=\"classification\",\n",
" target_field_name=\"ground_truth\",\n",
" gcs_source_uris=[DATA_SOURCE],\n",
" class_labels=CLASS_LABELS,\n",
" generate_feature_attributions=False,\n",
")\n",
"\n",
"print(\"Waiting model evaluation is in process\")\n",
"evaluation_job.wait()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "40944be9d8e7"
},
"source": [
"## Check the evaluation result\n",
"\n",
"To see if the pipeline ran successfully, click on the generated link above to see the pipeline graph in the Cloud Console.\n",
"\n",
"In the displayed pipeline, the nodes expand or collapse when you click on them. An example of a partially-expanded view of the pipeline can be seen below (click image to see larger version).\n",
"\n",
"<img src=\"images/automl-text-classification-evaluation-image.PNG\">"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "c9b563bdc3d5"
},
"source": [
"### Get the model evaluation result\n",
"\n",
"After the evalution pipeline is finished, run the below cell to print the evaluation metrics."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "81f66e4bd162"
},
"outputs": [],
"source": [
"model_evaluation = evaluation_job.get_model_evaluation()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "f05de1ad1e67"
},
"outputs": [],
"source": [
"# Iterate over the pipeline tasks\n",
"for (\n",
" task\n",
") in model_evaluation._backing_pipeline_job._gca_resource.job_detail.task_details:\n",
" # Obtain the artifacts from the evaluation task\n",
" if (\n",
" (\"model-evaluation\" in task.task_name)\n",
" and (\"model-evaluation-import\" not in task.task_name)\n",
" and (\n",
" task.state == aiplatform_v1.types.PipelineTaskDetail.State.SUCCEEDED\n",
" or task.state == aiplatform_v1.types.PipelineTaskDetail.State.SKIPPED\n",
" )\n",
" ):\n",
" evaluation_metrics = task.outputs.get(\"evaluation_metrics\").artifacts[\n",
" 0\n",
" ] # ['artifacts']\n",
" evaluation_metrics_gcs_uri = evaluation_metrics.uri\n",
"\n",
"print(evaluation_metrics)\n",
"print(evaluation_metrics_gcs_uri)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "34c1905f54e3"
},
"source": [
"### Visualize the metrics\n",
"\n",
"Visualize the available metrics like `auRoc` and `logLoss` using a bar-chart."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "274de9ff8dc5"
},
"outputs": [],
"source": [
"metrics = []\n",
"values = []\n",
"for i in evaluation_metrics.metadata.items():\n",
" metrics.append(i[0])\n",
" values.append(i[1])\n",
"plt.figure(figsize=(15, 5))\n",
"plt.bar(x=metrics, height=values)\n",
"plt.title(\"Evaluation Metrics\")\n",
"plt.ylabel(\"Value\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "91082815d1b3"
},
"source": [
"### Check model evaluations in model registry\n",
"\n",
"To ensure that the model evaluations are successfully imported into the model resource, list the evaluations and print them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "8a35f4f09cc2"
},
"outputs": [],
"source": [
"# get the model evaluation configuration from the pipeline job\n",
"for (\n",
" task\n",
") in model_evaluation._backing_pipeline_job._gca_resource.job_detail.task_details:\n",
" if \"model-evaluation-import\" in task.task_name:\n",
" val = json.loads(task.execution.metadata.get(\"output:gcp_resources\"))\n",
" model_evaluation = val[\"resources\"][0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "70aaaca32f36"
},
"outputs": [],
"source": [
"# Print the evaluation metrics\n",
"model_evaluation_id = model_evaluation[\"resourceUri\"].split(\"/\")[-1]\n",
"print(model_evaluation_id)\n",
"\n",
"# get evaluations from the model\n",
"evaluation = model.get_model_evaluation()\n",
"evaluation = evaluation.to_dict()\n",
"print(\"Model's evaluation metrics:\\n\")\n",
"metrics = evaluation[\"metrics\"]\n",
"for metric in metrics.keys():\n",
" print(f\"metric: {metric}, value: {metrics[metric]}\\n\")"
]
},
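{
"cell_type": "markdown",
"metadata": {
"id": "list_all_evaluations_md"
},
"source": [
"You can also list every evaluation attached to the model. After the import step, the pipeline-imported evaluation should appear alongside the one produced during training."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "list_all_evaluations"
},
"outputs": [],
"source": [
"# List all evaluations attached to the model resource\n",
"for evaluation in model.list_model_evaluations():\n",
"    print(evaluation.resource_name)"
]
},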
{
"cell_type": "markdown",
"metadata": {
"id": "cleanup:pipelines"
},
"source": [
"## Cleaning up\n",
"\n",
"To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud\n",
"project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.\n",
"\n",
"Otherwise, you can delete the individual resources you created in this tutorial:\n",
"\n",
"- Evaluation job\n",
"- Batch prediction job\n",
"- Training and deployment job\n",
"- Endpoint\n",
"- Model\n",
"- Dataset\n",
"- Cloud Storage bucket (Set `delete_bucket` to True for deletion)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "92e77f0a7931"
},
"outputs": [],
"source": [
"delete_bucket = False\n",
"\n",
"# # delete the evaluation job\n",
"evaluation_job.delete()\n",
"\n",
"# # delete the batch prediction job\n",
"batch_prediction_job.delete()\n",
"\n",
"# delete the training job\n",
"training_job.delete()\n",
"\n",
"# list the endpoints filtering the display name\n",
"endpoints = aiplatform.Endpoint.list(\n",
" filter=f\"display_name={ENDPOINT_DISPLAY_NAME}\", order_by=\"create_time\"\n",
")\n",
"\n",
"# delete the endpoint\n",
"if endpoints:\n",
" endpoint = endpoints[0]\n",
" endpoint.undeploy_all()\n",
" endpoint.delete()\n",
" print(\"Deleted endpoint:\", endpoint)\n",
"\n",
"# list the models filtering the display name\n",
"models = aiplatform.Model.list(\n",
" filter=f\"display_name={MODEL_DISPLAY_NAME}\", order_by=\"create_time\"\n",
")\n",
"# delete the model\n",
"if models:\n",
" model = models[0]\n",
" model.delete()\n",
" print(\"Deleted model:\", model)\n",
"\n",
"# list the datasets filtering the display name\n",
"datasets = aiplatform.TextDataset.list(\n",
" filter=f\"display_name={DATASET_DISPLAY_NAME}\", order_by=\"create_time\"\n",
")\n",
"# delete the dataset\n",
"if datasets:\n",
" dataset = datasets[0]\n",
" dataset.delete()\n",
" print(\"Deleted dataset:\", dataset)\n",
"\n",
"# delete the Cloud Storage bucket\n",
"if delete_bucket and os.getenv(\"IS_TESTING\"):\n",
" ! gsutil rm -r $BUCKET_URI"
]
}
],
"metadata": {
"colab": {
"name": "automl_text_classification_model_evaluation.ipynb",
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}