{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "ur8xi4C7S06n" }, "outputs": [], "source": [ "# @title Copyright & License (click to expand)\n", "# Copyright 2021 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "fsv4jGuU89rX" }, "source": [ "# Vertex AI Model Monitoring with Explainable AI Feature Attributions\n", "\n", "<table align=\"left\">\n", " <td style=\"text-align: center\">\n", " <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/model_monitoring/model_monitoring.ipynb\">\n", " <img src=\"https://cloud.google.com/ml-engine/images/colab-logo-32px.png\" alt=\"Colab logo\"><br> Open in Colab\n", " </a>\n", " </td>\n", " <td style=\"text-align: center\">\n", " <a href=\"https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fvertex-ai-samples%2Fmain%2Fnotebooks%2Fofficial%2Fmodel_monitoring%2Fmodel_monitoring.ipynb\">\n", " <img width=\"32px\" src=\"https://cloud.google.com/ml-engine/images/colab-enterprise-logo-32px.png\" alt=\"Google Cloud Colab Enterprise logo\"><br> Open in Colab Enterprise\n", " </a>\n", " </td> \n", " <td style=\"text-align: center\">\n", " <a 
href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/model_monitoring/model_monitoring.ipynb\">\n", " <img src=\"https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32\" alt=\"Vertex AI logo\"><br> Open in Workbench\n", " </a>\n", " </td>\n", " <td style=\"text-align: center\">\n", " <a href=\"https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/model_monitoring/model_monitoring.ipynb\">\n", " <img src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" alt=\"GitHub logo\"><br>\n", " View on GitHub\n", " </a>\n", " </td>\n", "</table>" ] }, { "cell_type": "markdown", "metadata": { "id": "lA32H1oKGgpf" }, "source": [ "## Overview\n", "\n", "### What is Vertex AI Model Monitoring?\n", "\n", "Modern applications rely on a well-established set of capabilities to monitor the health of their services. Examples include:\n", "\n", "* software versioning\n", "* rigorous deployment processes\n", "* event logging\n", "* alerting/notification of situations requiring intervention\n", "* on-demand and automated diagnostic tracing\n", "* automated performance and functional testing\n", "\n", "You should be able to manage your ML services with the same degree of power and flexibility with which you can manage your applications. That's what MLOps is all about: managing ML services with the best practices Google and the broader computing industry have learned from generations of experience deploying well-engineered, reliable, and scalable services.\n", "\n", "Model monitoring is only one piece of the MLOps puzzle. It helps answer the following questions:\n", "\n", "* How well do recent service requests match the training data used to build your model? 
This is called **training-serving skew**.\n", "* How significantly are service requests evolving over time? This is called **drift detection**.\n", "\n", "[Vertex Explainable AI](https://cloud.google.com/vertex-ai/docs/explainable-ai/overview) adds another facet to model monitoring, called feature attribution monitoring. Explainable AI enables you to understand the relative contribution of each feature to a resulting prediction. In essence, it assesses the magnitude of each feature's influence.\n", "\n", "If production traffic differs from training data, or varies substantially over time, **either in terms of model predictions or feature attributions**, that's likely to impact the quality of the answers your model produces. When that happens, you'd like to be alerted automatically and responsively, so that **you can anticipate problems before they affect your customer experiences or your revenue streams**.\n", "\n", "Learn more about [Vertex AI Model Monitoring](https://cloud.google.com/vertex-ai/docs/model-monitoring)." ] }, { "cell_type": "markdown", "metadata": { "id": "t6Cd51FkG09E" }, "source": [ "### Objective\n", "\n", "In this notebook, you learn to use the Vertex AI Model Monitoring service to detect drift and anomalies in prediction requests from a deployed Vertex AI model resource. 
\n", "\n", "This tutorial uses the following Google Cloud ML services:\n", "\n", "- Vertex AI Model Monitoring\n", "- Vertex AI Prediction\n", "- Vertex AI model resource\n", "- Vertex AI endpoint resource\n", "\n", "The steps performed include:\n", "\n", "- Upload a pre-trained model as a Vertex AI model resource.\n", "- Create a Vertex AI endpoint resource.\n", "- Deploy the model resource to the endpoint resource.\n", "- Configure the endpoint resource for model monitoring.\n", "- Initialize the baseline distribution for model monitoring.\n", "- Generate synthetic prediction requests.\n", "- Understand how to interpret the statistics, visualizations, and other data reported by the model monitoring feature." ] }, { "cell_type": "markdown", "metadata": { "id": "edba71dc9840" }, "source": [ "### Model\n", "\n", "This tutorial uses a pre-trained model whose artifacts are stored in a public Cloud Storage bucket. For an online gaming site, the model predicts the probability that a player churns, that is, stops being an active player." ] }, { "cell_type": "markdown", "metadata": { "id": "5abcd585354f" }, "source": [ "### Costs \n", "\n", "This tutorial uses billable components of Google Cloud:\n", "\n", "* Vertex AI\n", "* BigQuery\n", "* Cloud Storage\n", "\n", "Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing), \n", "[Cloud Storage pricing](https://cloud.google.com/storage/pricing), \n", "and [BigQuery pricing](https://cloud.google.com/bigquery/pricing),\n", "and use the [Pricing\n", "Calculator](https://cloud.google.com/products/calculator/)\n", "to generate a cost estimate based on your projected usage." 
] }, { "cell_type": "markdown", "metadata": { "id": "f3848df1e5b0" }, "source": [ "## Get started" ] }, { "cell_type": "markdown", "metadata": { "id": "a2c2cb2109a0" }, "source": [ "### Install Vertex AI SDK for Python and other required packages\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "c562886219d0" }, "outputs": [], "source": [ "# Don't bother installing tensorflow or explainable_ai_sdk on Colab\n", "\n", "# Install required packages.\n", "! pip3 install --upgrade --quiet \\\n", " google-cloud-aiplatform[full] \\\n", " google-cloud-bigquery \\\n", " explainable_ai_sdk \\\n", " tensorflow" ] }, { "cell_type": "markdown", "metadata": { "id": "restart" }, "source": [ "### Restart runtime (Colab only)\n", "\n", "To use the newly installed packages, you must restart the runtime on Google Colab." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "D-ZBOjErv5mM" }, "outputs": [], "source": [ "import sys\n", "\n", "if \"google.colab\" in sys.modules:\n", "\n", " import IPython\n", "\n", " app = IPython.Application.instance()\n", " app.kernel.do_shutdown(True)" ] }, { "cell_type": "markdown", "metadata": { "id": "ee775571c2b5" }, "source": [ "<div class=\"alert alert-block alert-warning\">\n", "<b>⚠️ The kernel is going to restart. Wait until it's finished before continuing to the next step. 
⚠️</b>\n", "</div>\n" ] }, { "cell_type": "markdown", "metadata": { "id": "92e68cfc3a90" }, "source": [ "### Authenticate your notebook environment (Colab only)\n", "\n", "Authenticate your environment on Google Colab.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "46604f70e831" }, "outputs": [], "source": [ "import sys\n", "\n", "if \"google.colab\" in sys.modules:\n", "\n", " from google.colab import auth\n", "\n", " auth.authenticate_user()" ] }, { "cell_type": "markdown", "metadata": { "id": "4f872cd812d0" }, "source": [ "### Set Google Cloud project information and initialize Vertex AI SDK for Python\n", "\n", "To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com). Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "acbc2e1a15fe" }, "outputs": [], "source": [ "PROJECT_ID = \"[your-project-id]\" # @param {type:\"string\"}\n", "LOCATION = \"us-central1\" # @param {type:\"string\"}\n", "\n", "\n", "from google.cloud import aiplatform\n", "\n", "aiplatform.init(project=PROJECT_ID, location=LOCATION)" ] }, { "cell_type": "markdown", "metadata": { "id": "42c8a7c56abd" }, "source": [ "#### User Email\n", "\n", "Set your user email address to receive monitoring alerts." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ce2589511bb6" }, "outputs": [], "source": [ "import os\n", "\n", "USER_EMAIL = \"[your-email-address]\" # @param {type:\"string\"}\n", "\n", "if os.getenv(\"IS_TESTING\"):\n", " USER_EMAIL = \"noreply@google.com\"" ] }, { "cell_type": "markdown", "metadata": { "id": "bucket:mbsdk" }, "source": [ "### Create a Cloud Storage bucket\n", "\n", "Create a storage bucket to store intermediate artifacts such as datasets." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bucket" }, "outputs": [], "source": [ "BUCKET_URI = f\"gs://your-bucket-name-{PROJECT_ID}-unique\" # @param {type:\"string\"}" ] }, { "cell_type": "markdown", "metadata": { "id": "autoset_bucket" }, "source": [ "**If your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "91c46850b49b" }, "outputs": [], "source": [ "! gsutil mb -l $LOCATION -p $PROJECT_ID $BUCKET_URI" ] }, { "cell_type": "markdown", "metadata": { "id": "8RJ3_20etd31" }, "source": [ "### Notes about service accounts and permissions\n", "\n", "**By default, no configuration is required.** If you run into a permission-related issue, make sure the service accounts below have the required roles:\n", "\n", "|Service account email|Description|Roles|\n", "|---|---|---|\n", "|PROJECT_NUMBER-compute@developer.gserviceaccount.com|Compute Engine default service account|Dataflow Admin, Dataflow Worker, Storage Admin, BigQuery Admin, Vertex AI User|\n", "|service-PROJECT_NUMBER@gcp-sa-aiplatform.iam.gserviceaccount.com|AI Platform Service Agent|Vertex AI Service Agent|\n", "\n", "\n", "1. Go to https://console.cloud.google.com/iam-admin/iam.\n", "2. Check the \"Include Google-provided role grants\" checkbox.\n", "3. Find the service account emails listed above.\n", "4. 
Grant the corresponding roles.\n", "\n", "### Using data source from a different project\n", "- For the BQ data source, grant both service accounts the \"BigQuery Data Viewer\" role.\n", "- For the CSV data source, grant both service accounts the \"Storage Object Viewer\" role." ] }, { "cell_type": "markdown", "metadata": { "id": "XoEqT2Y4DJmf" }, "source": [ "### Import libraries and define constants" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8353fb6d69a0" }, "outputs": [], "source": [ "# Import required packages.\n", "import os\n", "import pprint as pp\n", "\n", "import matplotlib.pyplot as plt\n", "from google.cloud import bigquery\n", "from google.cloud.aiplatform import model_monitoring\n", "from google.cloud.aiplatform.explain.metadata.tf.v2 import \\\n", " saved_model_metadata_builder" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "51010fc06d8c" }, "outputs": [], "source": [ "if os.getenv(\"IS_TESTING\"):\n", " ! gcloud --quiet components install beta\n", " ! gcloud --quiet components update\n", "\n", "! gcloud config set ai/region $LOCATION\n", "os.environ[\"GOOGLE_CLOUD_PROJECT\"] = PROJECT_ID" ] }, { "cell_type": "markdown", "metadata": { "id": "init_bq" }, "source": [ "### Create BigQuery client\n", "\n", "In this tutorial, you use data from the same public BigQuery table that was used to train the pre-trained model. You create a client interface, which you subsequently use to access the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "init_bq" }, "outputs": [], "source": [ "bqclient = bigquery.Client(project=PROJECT_ID)" ] }, { "cell_type": "markdown", "metadata": { "id": "tvgnzT1CKxrO" }, "source": [ "### The example model\n", "\n", "The model you use in this notebook is based on [this blog post](https://cloud.google.com/blog/topics/developers-practitioners/churn-prediction-game-developers-using-google-analytics-4-ga4-and-bigquery-ml). 
The idea behind this model is that your company has extensive log data describing how your game users have interacted with the site. The raw data contains the following categories of information:\n", "\n", "- identity - unique player identity numbers\n", "- demographic features - information about the player, such as the geographic region in which a player is located\n", "- behavioral features - counts of the number of times a player has triggered certain game events, such as reaching a new level\n", "- churn propensity - the label or target feature, which provides an estimated probability that this player churns, that is, stops being an active player\n", "\n", "The blog article referenced above explains how to use BigQuery to store the raw data, pre-process the data for machine learning, and train the corresponding model. Because this notebook focuses on model monitoring rather than model training, you reuse a pre-trained version of this model, which has been exported to Cloud Storage. In the next section, you set up your environment and import this model into your own project." ] }, { "cell_type": "markdown", "metadata": { "id": "btZeLzqQ7pXc" }, "source": [ "### Define some helper data structures\n", "\n", "Run the following cell to define some data structures used throughout this notebook." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "0zBhvh642JdH" }, "outputs": [], "source": [ "# @title Utility data structures\n", "\n", "# Sampling distributions for categorical features...\n", "DAYOFWEEK = {1: 1040, 2: 1223, 3: 1352, 4: 1217, 5: 1078, 6: 1011, 7: 1110}\n", "\n", "LANGUAGE = {\n", " \"en-us\": 4807,\n", " \"en-gb\": 678,\n", " \"ja-jp\": 419,\n", " \"en-au\": 310,\n", " \"en-ca\": 299,\n", " \"de-de\": 147,\n", " \"en-in\": 130,\n", " \"en\": 127,\n", " \"fr-fr\": 94,\n", " \"pt-br\": 81,\n", " \"es-us\": 65,\n", " \"zh-tw\": 64,\n", " \"zh-hans-cn\": 55,\n", " \"es-mx\": 53,\n", " \"nl-nl\": 37,\n", " \"fr-ca\": 34,\n", " \"en-za\": 29,\n", " \"vi-vn\": 29,\n", " \"en-nz\": 29,\n", " \"es-es\": 25,\n", "}\n", "\n", "OS = {\"IOS\": 3980, \"ANDROID\": 3798, \"null\": 253}\n", "\n", "MONTH = {6: 3125, 7: 1838, 8: 1276, 9: 1718, 10: 74}\n", "\n", "COUNTRY = {\n", " \"United States\": 4395,\n", " \"India\": 486,\n", " \"Japan\": 450,\n", " \"Canada\": 354,\n", " \"Australia\": 327,\n", " \"United Kingdom\": 303,\n", " \"Germany\": 144,\n", " \"Mexico\": 102,\n", " \"France\": 97,\n", " \"Brazil\": 93,\n", " \"Taiwan\": 72,\n", " \"China\": 65,\n", " \"Saudi Arabia\": 49,\n", " \"Pakistan\": 48,\n", " \"Egypt\": 46,\n", " \"Netherlands\": 45,\n", " \"Vietnam\": 42,\n", " \"Philippines\": 39,\n", " \"South Africa\": 38,\n", "}\n", "\n", "# Means and standard deviations for numerical features...\n", "MEAN_SD = {\n", " \"julianday\": (204.6, 34.7),\n", " \"cnt_user_engagement\": (30.8, 53.2),\n", " \"cnt_level_start_quickplay\": (7.8, 28.9),\n", " \"cnt_level_end_quickplay\": (5.0, 16.4),\n", " \"cnt_level_complete_quickplay\": (2.1, 9.9),\n", " \"cnt_level_reset_quickplay\": (2.0, 19.6),\n", " \"cnt_post_score\": (4.9, 13.8),\n", " \"cnt_spend_virtual_currency\": (0.4, 1.8),\n", " \"cnt_ad_reward\": (0.1, 0.6),\n", " \"cnt_challenge_a_friend\": (0.0, 0.3),\n", " \"cnt_completed_5_levels\": (0.1, 0.4),\n", " 
\"cnt_use_extra_steps\": (0.4, 1.7),\n", "}\n", "\n", "DEFAULT_INPUT = {\n", " \"cnt_ad_reward\": 0,\n", " \"cnt_challenge_a_friend\": 0,\n", " \"cnt_completed_5_levels\": 1,\n", " \"cnt_level_complete_quickplay\": 3,\n", " \"cnt_level_end_quickplay\": 5,\n", " \"cnt_level_reset_quickplay\": 2,\n", " \"cnt_level_start_quickplay\": 6,\n", " \"cnt_post_score\": 34,\n", " \"cnt_spend_virtual_currency\": 0,\n", " \"cnt_use_extra_steps\": 0,\n", " \"cnt_user_engagement\": 120,\n", " \"country\": \"Denmark\",\n", " \"dayofweek\": 3,\n", " \"julianday\": 254,\n", " \"language\": \"da-dk\",\n", " \"month\": 9,\n", " \"operating_system\": \"IOS\",\n", " \"user_pseudo_id\": \"104B0770BAE16E8B53DF330C95881893\",\n", "}" ] }, { "cell_type": "markdown", "metadata": { "id": "1mhT_d_Bi-Kf" }, "source": [ "### Generate model metadata for Vertex Explainable AI\n", "\n", "Run the following cell to extract metadata from the exported model, which is needed for generating the explanations for a prediction request." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "d1a984c57356" }, "outputs": [], "source": [ "MODEL_PATH = \"gs://mco-mm/churn\"\n", "\n", "params = {\"sampled_shapley_attribution\": {\"path_count\": 10}}\n", "EXPLAIN_PARAMS = aiplatform.explain.ExplanationParameters(params)\n", "\n", "builder = saved_model_metadata_builder.SavedModelMetadataBuilder(\n", " model_path=MODEL_PATH, outputs_to_explain=[\"churned_probs\"]\n", ")\n", "EXPLAIN_META = builder.get_metadata_protobuf()" ] }, { "cell_type": "markdown", "metadata": { "id": "lAOk8UqvCL0S" }, "source": [ "## Upload your model\n", "\n", "The churn propensity model you use in this notebook has been trained in BigQuery ML and exported to a Cloud Storage bucket. This illustrates how you can easily export a trained model and move a model from one cloud service to another. \n", "\n", "Run the next cell to import this model into your project. 
**If you've already imported your model, you can skip this step.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "qJKpxpexu5ZS" }, "outputs": [], "source": [ "MODEL_NAME = \"churn\"\n", "IMAGE = \"us-docker.pkg.dev/cloud-aiplatform/prediction/tf2-cpu.2-5:latest\"\n", "\n", "model = aiplatform.Model.upload(\n", " display_name=MODEL_NAME,\n", " artifact_uri=MODEL_PATH,\n", " serving_container_image_uri=IMAGE,\n", " explanation_parameters=EXPLAIN_PARAMS,\n", " explanation_metadata=EXPLAIN_META,\n", " sync=True,\n", ")\n", "\n", "MODEL_ID = model.resource_name.split(\"/\")[-1]" ] }, { "cell_type": "markdown", "metadata": { "id": "e2030b028cef" }, "source": [ "Once the above cell completes, you should see a new model on the Vertex AI Model Registry page in the Cloud Console." ] }, { "cell_type": "markdown", "metadata": { "id": "d7cbb0fb73cc" }, "source": [ "## Deploy your Model resource to an Endpoint resource\n", "\n", "Now that you've imported your model into your project, you need to create an endpoint to serve your model. An endpoint can be thought of as a channel through which your model provides prediction services. Once the endpoint is established, you can make online prediction requests to your model via the public internet. Your endpoint is also serverless, in the sense that Google Cloud ensures high availability by reducing single points of failure, and scalability by dynamically allocating resources to meet the demand for your service. In this way, you can focus on your model quality and are freed from administrative and infrastructure concerns.\n", "\n", "Run the next cell to deploy your model to an endpoint. 
**This takes about ten minutes to complete.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "AuZQ5D1_CfR7" }, "outputs": [], "source": [ "endpoint = model.deploy(machine_type=\"n1-standard-4\")\n", "print(f\"endpoint display name: {endpoint.display_name}\")\n", "print(f\"endpoint resource name: {endpoint.resource_name}\")\n", "ENDPOINT = endpoint.resource_name\n", "ENDPOINT_ID = ENDPOINT.split(\"/\")[-1]" ] }, { "cell_type": "markdown", "metadata": { "id": "HmhoNm4saoCQ" }, "source": [ "Once the above cell completes, you should see a new endpoint \n", "on the Vertex AI Endpoints page in the Cloud Console." ] }, { "cell_type": "markdown", "metadata": { "id": "NKsA_lfl9Ryw" }, "source": [ "## Run a prediction test\n", "\n", "Now that you have imported a model and deployed that model to an endpoint, you are ready to verify that it's working. Run the next cell to send a test prediction request. If everything works as expected, you should receive a response encoded in a text representation called JSON, along with a pie chart summarizing the results.\n", "\n", "**Try this now by running the next cell.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "QNEb7fDJ9NXc" }, "outputs": [], "source": [ "try:\n", "    resp = endpoint.predict([DEFAULT_INPUT])\n", "    # Each prediction holds the class labels and their probabilities.\n", "    for prediction in resp.predictions:\n", "        vals = prediction[\"churned_values\"]\n", "        probs = prediction[\"churned_probs\"]\n", "        for val, prob in zip(vals, probs):\n", "            print(val, prob)\n", "        plt.pie(probs, labels=vals)\n", "    pp.pprint(resp)\n", "except Exception as ex:\n", "    print(\"prediction request failed\", ex)" ] }, { "cell_type": "markdown", "metadata": { "id": "a1eb4131bb5e" }, "source": [ "### Test results\n", "\n", "Taking a look at the results, you see the following elements:\n", "\n", "- **churned_values** - a set of possible values (0 and 1) for the target field\n", "- **churned_probs** - a corresponding set of probabilities for each possible target field value (5x10^-40 and 1.0, 
respectively)\n", "- **predicted_churn** - based on the probabilities, the predicted value of the target field (1)\n", "\n", "This response encodes the model's prediction in a format that is readily digestible by software, which makes this service ideal for automated use by an application." ] }, { "cell_type": "markdown", "metadata": { "id": "be7830c55cdc" }, "source": [ "## Run an explanation test\n", "\n", "You can run a test of Explainable AI on this endpoint. Run the next cell to send a test explanation request. The response you receive encodes the feature importance of this prediction in a text representation called JSON, along with a bar chart summarizing the results.\n", "\n", "**Try this now by running the next cell.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "dcbd4e755931" }, "outputs": [], "source": [ "try:\n", "    features = []\n", "    scores = []\n", "    resp = endpoint.explain([DEFAULT_INPUT])\n", "    # Collect the attribution score of every input feature.\n", "    for explanation in resp.explanations:\n", "        for attribution in explanation.attributions:\n", "            for name in attribution.feature_attributions:\n", "                features.append(name)\n", "                scores.append(attribution.feature_attributions[name])\n", "    # Sort the features by score so the bar chart is ordered.\n", "    features = [x for _, x in sorted(zip(scores, features))]\n", "    scores = sorted(scores)\n", "    fig, ax = plt.subplots()\n", "    fig.set_size_inches(9, 9)\n", "    ax.barh(features, scores)\n", "    fig.show()\n", "except Exception as ex:\n", "    print(\"explanation request failed\", ex)" ] }, { "cell_type": "markdown", "metadata": { "id": "8rF5iLuXCT7i" }, "source": [ "## Start your monitoring job\n", "\n", "Now that you've created an endpoint to serve prediction requests on your model, you're ready to start a monitoring job to keep an eye on model quality and to alert you if and when input begins to deviate in a way that may impact your model's prediction quality.\n", "\n", "In this section, you configure and create a model monitoring job based on the churn propensity model you imported from BigQuery ML." 
] }, { "cell_type": "markdown", "metadata": { "id": "wW2gLBQ3Zkhq" }, "source": [ "### Configure the following fields:\n", "\n", "1. Log sample rate - Your prediction requests and responses are logged to BigQuery tables, which are automatically created when you create a monitoring job. This parameter specifies the fraction of requests to log (for example, 0.8 logs about 80% of requests).\n", "1. Monitor interval - time window, in hours, over which to analyze your data and report anomalies. The minimum window is one hour (1).\n", "1. Target field - prediction target column name in the training dataset\n", "1. Skew detection threshold - skew threshold for each feature you want to monitor\n", "1. Prediction drift threshold - drift threshold for each feature you want to monitor\n", "1. Attribution skew detection threshold - feature importance skew threshold\n", "1. Attribution prediction drift threshold - feature importance drift threshold" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "plpASmM2YIVO" }, "outputs": [], "source": [ "JOB_NAME = \"churn\"\n", "\n", "# Sampling rate (optional, default=0.8).\n", "LOG_SAMPLE_RATE = 0.8 # @param {type:\"number\"}\n", "\n", "# Monitoring interval in hours (optional, default=1).\n", "MONITOR_INTERVAL = 1 # @param {type:\"number\"}\n", "\n", "# URI to training dataset.\n", "DATASET_BQ_URI = \"bq://mco-mm.bqmlga4.train\" # @param {type:\"string\"}\n", "# Prediction target column name in training dataset.\n", "TARGET = \"churned\"\n", "\n", "# Skew and drift thresholds.\n", "\n", "DEFAULT_THRESHOLD_VALUE = 0.001\n", "\n", "SKEW_THRESHOLDS = {\n", " \"country\": DEFAULT_THRESHOLD_VALUE,\n", " \"cnt_user_engagement\": DEFAULT_THRESHOLD_VALUE,\n", "}\n", "DRIFT_THRESHOLDS = {\n", " \"country\": DEFAULT_THRESHOLD_VALUE,\n", " \"cnt_user_engagement\": DEFAULT_THRESHOLD_VALUE,\n", "}\n", "ATTRIB_SKEW_THRESHOLDS = {\n", " \"country\": DEFAULT_THRESHOLD_VALUE,\n", " \"cnt_user_engagement\": DEFAULT_THRESHOLD_VALUE,\n", "}\n", "ATTRIB_DRIFT_THRESHOLDS = {\n", 
" \"country\": DEFAULT_THRESHOLD_VALUE,\n", " \"cnt_user_engagement\": DEFAULT_THRESHOLD_VALUE,\n", "}" ] }, { "cell_type": "markdown", "metadata": { "id": "e10f3d0fa538" }, "source": [ "You can change the threshold values and the configuration settings, so that you can monitor other features in the model as well." ] }, { "cell_type": "markdown", "metadata": { "id": "mjVSViZR-dP2" }, "source": [ "### Create your monitoring job\n", "\n", "The following code uses the Google Python client library to translate your configuration settings into a programmatic request to start a model monitoring job. Instantiating a monitoring job can take some time. If everything looks good with your request, you'll get a successful API response. Then, you'll need to check your email to receive a notification that the job is running." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "-62TYm2iYv3K" }, "outputs": [], "source": [ "skew_config = model_monitoring.SkewDetectionConfig(\n", " data_source=DATASET_BQ_URI,\n", " skew_thresholds=SKEW_THRESHOLDS,\n", " attribute_skew_thresholds=ATTRIB_SKEW_THRESHOLDS,\n", " target_field=TARGET,\n", ")\n", "\n", "drift_config = model_monitoring.DriftDetectionConfig(\n", " drift_thresholds=DRIFT_THRESHOLDS,\n", " attribute_drift_thresholds=ATTRIB_DRIFT_THRESHOLDS,\n", ")\n", "\n", "explanation_config = model_monitoring.ExplanationConfig()\n", "objective_config = model_monitoring.ObjectiveConfig(\n", " skew_config, drift_config, explanation_config\n", ")\n", "\n", "# Create sampling configuration\n", "random_sampling = model_monitoring.RandomSampleConfig(sample_rate=LOG_SAMPLE_RATE)\n", "\n", "# Create schedule configuration\n", "schedule_config = model_monitoring.ScheduleConfig(monitor_interval=MONITOR_INTERVAL)\n", "\n", "# Create alerting configuration.\n", "emails = [USER_EMAIL]\n", "alerting_config = model_monitoring.EmailAlertConfig(\n", " user_emails=emails, enable_logging=True\n", ")\n", "\n", "# Create the monitoring 
job.\n", "job = aiplatform.ModelDeploymentMonitoringJob.create(\n", " display_name=JOB_NAME,\n", " logging_sampling_strategy=random_sampling,\n", " schedule_config=schedule_config,\n", " alert_config=alerting_config,\n", " objective_configs=objective_config,\n", " project=PROJECT_ID,\n", " location=LOCATION,\n", " endpoint=endpoint,\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "SaXYVFFslRru" }, "source": [ "### Receiving the email alert\n", "After a minute or two, you should receive an email at the address you configured above for `USER_EMAIL`. This email confirms successful deployment of your monitoring job. Here's a sample of what this email might look like:\n", "<br>\n", "<br>\n", "<img src=\"https://storage.googleapis.com/mco-general/img/mm6.png\" />\n", "<br>\n", "As your monitoring job collects data, measurements are stored in Cloud Storage and you are free to examine your data at any time. The \"Statistics and Anomalies Root Path\" specifies the location of your measurements in Cloud Storage. A later cell in this notebook lists an example of the layout of these measurements in Cloud Storage. If you substitute the Cloud Storage URL from your job creation email, you can view the structure and content of the data files for your own monitoring job." ] }, { "cell_type": "markdown", "metadata": { "id": "6f38e8423bce" }, "source": [ "### Create the sampling distribution\n", "\n", "Next, you send a first test prediction request. The model monitoring service analyzes the distribution of features and automatically creates a baseline against which to monitor deviations.\n", "\n", "*Note:* You need to wait for the email notification before making the first prediction request." 
] }, { "cell_type": "markdown", "metadata": { "id": "3960076190ab" }, "source": [ "## Initialize the parsing for automatically generating the input schema\n", "\n", "After your `Endpoint` receives 1000 prediction requests, the model monitoring service automatically parses the requests and creates the `input schema`.\n", "\n", "### Create the data for 1000 instances\n", "\n", "In this example, the first 1000 entries in the BigQuery training data are used as the first 1000 prediction requests. \n", "\n", "*Note:* In this context, each instance is a prediction request. In other words, sending 1000 prediction requests of a single instance is the same as sending a single prediction request with 1000 instances." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "cb26c3dea306" }, "outputs": [], "source": [ "# Strip the 'bq://' prefix to get the table ID, then read the first 1000 rows.\n", "table = bigquery.TableReference.from_string(DATASET_BQ_URI[5:])\n", "\n", "rows = bqclient.list_rows(table, max_results=1000)\n", "\n", "# Build one instance per row, dropping the target column and replacing nulls.\n", "instances = []\n", "for row in rows:\n", "    instance = {}\n", "    for key, value in row.items():\n", "        if key == TARGET:\n", "            continue\n", "        if value is None:\n", "            value = \"\"\n", "        instance[key] = value\n", "    instances.append(instance)\n", "\n", "print(len(instances))" ] }, { "cell_type": "markdown", "metadata": { "id": "6d002569dadc" }, "source": [ "### Make the initial prediction request\n", "\n", "Next, you send the 1000 instances in a single prediction request to your `Vertex AI Endpoint` resource using the `predict()` method." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "b2d69d89cd66" }, "outputs": [], "source": [ "response = endpoint.predict(instances=instances)\n", "\n", "predictions = response.predictions\n", "\n", "# Print the prediction for the first instance.\n", "print(predictions[0])" ] }, { "cell_type": "markdown", "metadata": { "id": "e990a8821178" }, "source": [ "### Automatic generation of the input schema\n", "\n", "After the model monitoring service receives 1000 prediction request instances, it starts analyzing them to automatically generate an `input schema` for the feature inputs.\n", "\n", "### Automatic generation of the baseline distribution\n", "\n", "After the `input schema` is generated, the monitoring service creates a batch job to analyze the training data to determine the baseline distribution. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "670b5bc98c2a" }, "outputs": [], "source": [ "# Pause a bit for the baseline distribution to be calculated\n", "if os.getenv(\"IS_TESTING\"):\n", " import time\n", "\n", " time.sleep(120)" ] }, { "cell_type": "markdown", "metadata": { "id": "8f9455fc894d" }, "source": [ "### Example of monitoring data stored in Cloud Storage" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "XV-vru2Pm1oX" }, "outputs": [], "source": [ "!gsutil ls gs://cloud-ai-platform-fdfb4810-148b-4c86-903c-dbdff879f6e1/*/*" ] }, { "cell_type": "markdown", "metadata": { "id": "XgUwU0sDpUUD" }, "source": [ "### Cloud Storage layout\n", "Notice the following components in these Cloud Storage paths:\n", "\n", "- **cloud-ai-platform-..** - This is a bucket created for you and assigned to capture your service's prediction data. 
Each monitoring job you create triggers the creation of a new folder in this bucket.\n", "- **[model_monitoring|instance_schemas]/job-..** - This is your unique monitoring job number, which you can see above in both the response to your job creation request and the email notification.\n", "- **instance_schemas/job-../analysis** - This is the monitoring job's understanding and encoding of your training data's schema (field names, types, etc.).\n", "- **instance_schemas/job-../predict** - This is the first prediction made to your model after the current monitoring job was enabled.\n", "- **model_monitoring/job-../serving** - This folder is used to record data relevant to drift calculations. It contains measurement summaries for every hour your model serves traffic.\n", "- **model_monitoring/job-../training** - This folder is used to record data relevant to training-serving skew calculations. It contains an ongoing summary of prediction data relative to training data.\n", "- **model_monitoring/job-../feature_attribution_score** - This folder is used to record data relevant to feature attribution calculations. It contains an ongoing summary of feature attribution scores relative to training data." ] }, { "cell_type": "markdown", "metadata": { "id": "8V2zo7-MMd7G" }, "source": [ "### You can create monitoring jobs with other user interfaces\n", "\n", "In the previous cells, you created a monitoring job using the Python client library. Alternatively, you can use the *gcloud* command-line tool or the Cloud Console to create a model monitoring job.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "bQohDTJgLQlW" }, "source": [ "## Interpret your results\n", "\n", "Vertex AI Model Monitoring detects an anomaly when the threshold set for a feature is exceeded. 
The following cells give you a sense of the alerting and reporting experience after model monitoring anomalies have been detected.\n", "\n", "Vertex AI Model Monitoring automatically notifies you of detected anomalies through email, but you can also [set up alerts through Cloud Logging](https://cloud.google.com/vertex-ai/docs/model-monitoring/using-model-monitoring#monitor-job)." ] }, { "cell_type": "markdown", "metadata": { "id": "uGPI92qbOFUR" }, "source": [ "### Here's what a sample email alert looks like...\n", "\n", "<img src=\"https://storage.googleapis.com/mco-general/img/mm7.png\" />\n" ] }, { "cell_type": "markdown", "metadata": { "id": "HoaqsxpaRs1m" }, "source": [ "This email warns you that the *cnt_level_start_quickplay*, *cnt_user_engagement*, and *country* feature values seen in production have skewed beyond your threshold relative to the training data. It also tells you that the *cnt_user_engagement* and *country* feature attribution values are skewed relative to your training data, again per your threshold specification." ] }, { "cell_type": "markdown", "metadata": { "id": "w4jVIVq4VzB_" }, "source": [ "### Monitoring results in the Cloud Console\n", "\n", "You can examine your model monitoring data from the Cloud Console. Below are screenshots of those capabilities."
] }, { "cell_type": "markdown", "metadata": { "id": "2OdIMVBAPZi_" }, "source": [ "#### Monitoring Status\n", "\n", "You can verify that a given endpoint has an active model monitoring job via the Endpoint summary page:\n", "\n", "<img src=\"https://storage.googleapis.com/mco-general/img/mm1.png\" />\n", "\n", "#### Monitoring Alerts\n", "\n", "You can examine the alert details by clicking into the endpoint of interest and selecting the alerts panel:\n", "\n", "<img src=\"https://storage.googleapis.com/mco-general/img/mm2.png\" />\n", "\n", "#### Feature Value Distributions\n", "\n", "You can also examine the recorded training and production feature distributions by drilling down into a given feature, like this:\n", "\n", "<img src=\"https://storage.googleapis.com/mco-general/img/mm9.png\" />\n", "\n", "which yields graphical representations of the feature distribution during both training and production, like this:\n", "\n", "<img src=\"https://storage.googleapis.com/mco-general/img/mm8.png\" />" ] }, { "cell_type": "markdown", "metadata": { "id": "TpV-iwP9qw9c" }, "source": [ "## Clean up\n", "\n", "When you are finished with this notebook, you can clean up all Google Cloud resources used in this project by [deleting the Google Cloud\n", "project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.\n", "\n", "Alternatively, you can preserve the project and delete the individual resources you created in this tutorial by executing the following cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "d6cc924aa1fb" }, "outputs": [], "source": [ "# Undeploy the model and delete the endpoint\n", "endpoint.undeploy_all()\n", "endpoint.delete()\n", "\n", "model.delete()\n", "\n", "# Delete BQ table and dataset\n", "rmtable = f\"bq rm -f model_deployment_monitoring_{ENDPOINT_ID}.serving_predict\"\n", "! 
$rmtable\n", "rmdataset = f\"bq rm -f model_deployment_monitoring_{ENDPOINT_ID}\"\n", "! $rmdataset\n", "\n", "# Delete Cloud Storage bucket\n", "delete_bucket = False\n", "\n", "if delete_bucket:\n", " ! gsutil rm -rf {BUCKET_URI}" ] }, { "cell_type": "markdown", "metadata": { "id": "j3Dh15h3-NoO" }, "source": [ "## Learn more about model monitoring\n", "\n", "**Congratulations!** You've now learned what model monitoring is, how to configure and enable it, and how to find and interpret the results. Check out the following resources to learn more about model monitoring and ML Ops.\n", "\n", "- [TensorFlow Data Validation](https://www.tensorflow.org/tfx/guide/tfdv)\n", "- [Data Understanding, Validation, and Monitoring At Scale](https://blog.tensorflow.org/2018/09/introducing-tensorflow-data-validation.html)\n", "- [Vertex Product Documentation](https://cloud.google.com/vertex-ai)\n", "- [Vertex AI Model Monitoring Reference Docs](https://cloud.google.com/vertex-ai/docs/reference)\n", "- [Vertex AI Model Monitoring blog article](https://cloud.google.com/blog/topics/developers-practitioners/monitor-models-training-serving-skew-vertex-ai)\n", "- [Explainable AI Whitepaper](https://storage.googleapis.com/cloud-ai-whitepapers/AI%20Explainability%20Whitepaper.pdf)" ] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "model_monitoring.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }