{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "ai-explanations-tabular.ipynb", "provenance": [], "collapsed_sections": [] }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" } }, "cells": [ { "cell_type": "code", "metadata": { "cellView": "both", "colab_type": "code", "deletable": true, "editable": true, "id": "qnMpW5Y9nv2l", "colab": {} }, "source": [ "# Copyright 2020 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." 
], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "deletable": true, "editable": true, "id": "mHF9VCProKJN" }, "source": [ "# AI Explanations: Explaining a tabular data model\n", "\n", "<table align=\"left\">\n", " <td>\n", " <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/ml-on-gcp/blob/master/tutorials/explanations/ai-explanations-tabular.ipynb\">\n", " <img src=\"https://cloud.google.com/ml-engine/images/colab-logo-32px.png\" alt=\"Colab logo\"> Run in Colab\n", " </a>\n", " </td>\n", " <td>\n", " <a href=\"https://github.com/GoogleCloudPlatform/ml-on-gcp/tree/master/tutorials/explanations/ai-explanations-tabular.ipynb\">\n", " <img src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" alt=\"GitHub logo\">\n", " View on GitHub\n", " </a>\n", " </td>\n", "</table>" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "deletable": true, "editable": true, "id": "hZzRVxNtH-zG" }, "source": [ "## Overview\n", "\n", "This tutorial shows how to train a Keras model on tabular data and deploy it to the AI Explanations service to get feature attributions on your deployed model.\n", "\n", "If you've already got a trained model and want to deploy it to AI Explanations, skip to the **Export the model as a TF 1 SavedModel** section." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "deletable": true, "editable": true, "id": "iN69d4D9Flrh" }, "source": [ "### Dataset\n", "\n", "The dataset used for this tutorial was created by combining two BigQuery Public Datasets: [London Bikeshare data](https://console.cloud.google.com/marketplace/details/greater-london-authority/london-bicycles?filter=solution-type%3Adataset&q=london%20bicycle%20hires&id=95374cac-2834-4fa2-a71f-fc033ccb5ce4) and [NOAA weather data](https://console.cloud.google.com/marketplace/details/noaa-public/gsod?filter=solution-type:dataset&q=noaa&id=c6c1b652-3958-4a47-9e58-552a546df47f). 
" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "deletable": true, "editable": true, "id": "Su2qu-4CW-YH" }, "source": [ "### Objective\n", "\n", "The goal is to train a model using the Keras Sequential API that predicts how long a bike trip took based on the trip start time, distance, day of week, and various weather data during that day. \n", "\n", "This tutorial focuses more on deploying the model to AI Explanations than on the design of the model itself. " ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "deletable": true, "editable": true, "id": "912RD_3fxGeH" }, "source": [ "### Costs\n", "\n", "This tutorial uses billable components of Google Cloud Platform (GCP):\n", "\n", "* AI Platform for:\n", " * Prediction\n", " * Explanation: AI Explanations comes at no extra charge to prediction prices. However, explanation requests take longer to process than normal predictions, so heavy usage of AI Explanations along with auto-scaling may result in more nodes being started and thus more charges\n", "* Cloud Storage for:\n", " * Storing model files for deploying to Cloud AI Platform\n", "\n", "Learn about [AI Platform\n", "pricing](https://cloud.google.com/ml-engine/docs/pricing) and [Cloud Storage\n", "pricing](https://cloud.google.com/storage/pricing), and use the [Pricing\n", "Calculator](https://cloud.google.com/products/calculator/)\n", "to generate a cost estimate based on your projected usage." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "deletable": true, "editable": true, "id": "rgLXkyHEvTVD" }, "source": [ "## Before you begin\n", "\n", "Make sure you're running this notebook in a **GPU runtime** if you have that option. 
In Colab, select **Runtime** --> **Change runtime type**\n" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "deletable": true, "editable": true, "id": "avDUUQEGTnUo" }, "source": [ "This tutorial assumes you are running the notebook either in **Colab** or **Cloud AI Platform Notebooks**." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "deletable": true, "editable": true, "id": "i2qsxysTVc-l" }, "source": [ "### Set up your GCP project\n", "\n", "**The following steps are required, regardless of your notebook environment.**\n", "\n", "1. [Select or create a GCP project.](https://console.cloud.google.com/cloud-resource-manager)\n", "\n", "2. [Make sure that billing is enabled for your project.](https://cloud.google.com/billing/docs/how-to/modify-project)\n", "\n", "3. [Enable the AI Platform Training & Prediction and Compute Engine APIs.](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,compute_component)\n", "\n", "4. Enter your project ID in the cell below. Then run the cell to make sure the\n", "Cloud SDK uses the right project for all the commands in this notebook.\n", "\n", "**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands." ] }, { "cell_type": "code", "metadata": { "cellView": "both", "colab_type": "code", "deletable": true, "editable": true, "id": "4qxwBA4RM9Lu", "colab": {} }, "source": [ "PROJECT_ID = \"<your-project-id>\"" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "deletable": true, "editable": true, "id": "TSy-f05IO4LB" }, "source": [ "### Authenticate your GCP account\n", "\n", "**If you are using AI Platform Notebooks**, your environment is already\n", "authenticated. Skip this step." 
] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "deletable": true, "editable": true, "id": "fZQUrHdXNJnk" }, "source": [ "**If you are using Colab**, run the cell below and follow the instructions\n", "when prompted to authenticate your account via OAuth." ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "deletable": true, "editable": true, "id": "W9i6oektpgld", "colab": {} }, "source": [ "import sys, os\n", "import warnings\n", "# Import the submodule explicitly: `googleapiclient.discovery.build` is used later\n", "import googleapiclient.discovery\n", "\n", "warnings.filterwarnings('ignore')\n", "os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' \n", "# If you are running this notebook in Colab, follow the\n", "# instructions to authenticate your GCP account. This provides access to your\n", "# Cloud Storage bucket and lets you submit training jobs and prediction\n", "# requests.\n", "\n", "def install_dlvm_packages():\n", " !pip install tabulate\n", "\n", "if 'google.colab' in sys.modules:\n", " from google.colab import auth as google_auth\n", " google_auth.authenticate_user()\n", " !pip install witwidget --quiet\n", " !pip install tensorflow==1.15.0 --quiet\n", " !gcloud config set project $PROJECT_ID\n", "\n", "elif \"DL_PATH\" in os.environ:\n", " install_dlvm_packages()\n" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "deletable": true, "editable": true, "id": "tT061irlJwkg" }, "source": [ "### Create a Cloud Storage bucket\n", "\n", "**The following steps are required, regardless of your notebook environment.**\n", "\n", "When you submit a training job using the Cloud SDK, you upload a Python package\n", "containing your training code to a Cloud Storage bucket. AI Platform runs\n", "the code from this package. In this tutorial, AI Platform also saves the\n", "trained model that results from your job in the same bucket. 
You can then\n", "create an AI Platform model version based on this output in order to serve\n", "online predictions.\n", "\n", "Set the name of your Cloud Storage bucket below. It must be unique across all\n", "Cloud Storage buckets. \n", "\n", "You may also change the `REGION` variable, which is used for operations\n", "throughout the rest of this notebook. Make sure to [choose a region where Cloud\n", "AI Platform services are\n", "available](https://cloud.google.com/ml-engine/docs/tensorflow/regions). You may\n", "not use a Multi-Regional Storage bucket for training with AI Platform." ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "deletable": true, "editable": true, "id": "bTxmbDg1I0x1", "colab": {} }, "source": [ "BUCKET_NAME = \"<your-bucket-name>\"\n", "REGION = \"us-central1\"" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "deletable": true, "editable": true, "id": "fsmCk2dwJnLZ" }, "source": [ "**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket." ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "deletable": true, "editable": true, "id": "160PRO3aJqLD", "colab": {} }, "source": [ "!gsutil mb -l $REGION gs://$BUCKET_NAME" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "PyxoF-iqqD1t" }, "source": [ "### Import libraries\n", "\n", "Import the libraries we'll be using in this tutorial. This tutorial has been tested with **TensorFlow versions 1.14 and 1.15**." 
] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "MEDlLSWK15UL", "colab": {} }, "source": [ "import tensorflow as tf \n", "import pandas as pd\n", "import numpy as np \n", "import json\n", "import time\n", "\n", "from sklearn.preprocessing import LabelEncoder\n", "from sklearn.preprocessing import MinMaxScaler\n", "from tabulate import tabulate\n", "\n", "# Should be 1.15.0\n", "print(tf.__version__)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "deletable": true, "editable": true, "id": "aRVMEU2Qshm4" }, "source": [ "## Downloading and preprocessing data\n", "\n", "In this section you'll download the data to train your model from a public GCS bucket. The original data is from the BigQuery datasets linked above. For your convenience, we've joined the London bike and NOAA weather tables, done some preprocessing, and provided a subset of that dataset here.\n" ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "v7HLNsvekxvz", "colab": {} }, "source": [ "# Copy the data to your notebook instance\n", "!gsutil cp 'gs://explanations_sample_data/bike-data.csv' ./" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "deletable": true, "editable": true, "id": "8zr6lj66UlMn" }, "source": [ "### Read the data with Pandas\n", "\n", "We'll use Pandas to read the data into a `DataFrame` and then do some additional pre-processing." 
] }, { "cell_type": "code", "metadata": { "colab_type": "code", "deletable": true, "editable": true, "id": "Icz22E69smnD", "colab": {} }, "source": [ "data = pd.read_csv('bike-data.csv')\n", "\n", "# Shuffle the data\n", "data = data.sample(frac=1, random_state=2)\n", "\n", "# Drop rows with null values\n", "data = data[data['wdsp'] != 999.9]\n", "data = data[data['dewp'] != 9999.9]\n", "\n", "# Rename some columns for readability\n", "data=data.rename(columns = {'day_of_week':'weekday'})\n", "data=data.rename(columns = {'max':'max_temp'})\n", "data=data.rename(columns = {'dewp': 'dew_point'})\n", "\n", "# Drop columns we won't use to train this model\n", "data = data.drop(columns=['start_station_name', 'end_station_name', 'bike_id', 'snow_ice_pellets'])\n", "\n", "# Convert trip duration from seconds to minutes so it's easier to understand\n", "data['duration'] = data['duration'].apply(lambda x:float(x / 60))" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "vxZryg4xmdy0", "colab": {} }, "source": [ "# Preview the first 5 rows\n", "data.head()" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "YXLNDcfUvlr8", "colab": {} }, "source": [ "# Save duration to its own DataFrame and remove it from the original DataFrame\n", "labels = data['duration']\n", "data = data.drop(columns=['duration'])" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "deletable": true, "editable": true, "id": "iSrzwuchvcgv" }, "source": [ "### Split data into train and test sets\n", "\n", "We'll split our data into train and test sets using an 80 / 20 train / test split." 
] }, { "cell_type": "code", "metadata": { "colab_type": "code", "deletable": true, "editable": true, "id": "D5PIljnYveDN", "colab": {} }, "source": [ "# Use 80/20 train/test split\n", "train_size = int(len(data) * .8)\n", "print (\"Train size: %d\" % train_size)\n", "print (\"Test size: %d\" % (len(data) - train_size))\n", "\n", "# Split our data into train and test sets\n", "train_data = data[:train_size]\n", "train_labels = labels[:train_size]\n", "\n", "test_data = data[train_size:]\n", "test_labels = labels[train_size:]" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "kV_NEAQwwH0e" }, "source": [ "## Build, train, and evaluate our model with Keras\n", "\n", "We'll use tf.keras to build a simple Sequential model that takes our 10 features as input and predicts trip duration in minutes (numerical value)." ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "3kQz8Q0DsBM7", "colab": {} }, "source": [ "# Build our model\n", "model = tf.keras.Sequential(name=\"bike_predict\")\n", "model.add(tf.keras.layers.Dense(64, input_dim=len(train_data.iloc[0]), activation='relu'))\n", "model.add(tf.keras.layers.Dense(32, activation='relu'))\n", "model.add(tf.keras.layers.Dense(1))" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "UvAcjSUcs_l7", "colab": {} }, "source": [ "# Compile the model and see a summary\n", "optimizer = tf.keras.optimizers.Adam(0.001)\n", "model.compile(loss='mean_squared_logarithmic_error', optimizer=optimizer)\n", "model.summary()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "GcOkuHPVwjiM" }, "source": [ "### Create an input data pipeline with tf.data" ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "ZUu9wFklwmm6", "colab": {} }, "source": [ "batch_size = 256\n", "epochs = 3\n", "\n", "input_train = 
tf.data.Dataset.from_tensor_slices(train_data)\n", "output_train = tf.data.Dataset.from_tensor_slices(train_labels)\n", "input_train = input_train.batch(batch_size).repeat()\n", "output_train = output_train.batch(batch_size).repeat()\n", "train_dataset = tf.data.Dataset.zip((input_train, output_train))" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "l98aRzfPwo5e" }, "source": [ "### Train the model" ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "h1x_8CR0wtRs", "colab": {} }, "source": [ "# This will take about a minute to run\n", "# To keep training time short, we're not using the full dataset\n", "model.fit(train_dataset, steps_per_epoch=train_size // batch_size, epochs=epochs)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "QPr0A8bjw0wm" }, "source": [ "### Evaluate the trained model locally" ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "3Elbvna4vU30", "colab": {} }, "source": [ "# Run evaluation\n", "results = model.evaluate(test_data, test_labels)\n", "print(results)" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "bIh6uds2x2tr", "colab": {} }, "source": [ "# Send test instances to model for prediction\n", "predict = model.predict(test_data[:5])" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "aFjBh4DVx7QL", "colab": {} }, "source": [ "# Preview predictions on the first 5 examples from our test dataset\n", "for i, val in enumerate(predict):\n", " print('Predicted duration: {}'.format(round(val[0])))\n", " print('Actual duration: {} \\n'.format(test_labels.iloc[i]))" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "deletable": true, "editable": true, "id": "gAO6-zv6osJ8" }, "source": [ "## Export 
the model as a TF 1 SavedModel\n", "\n", "AI Explanations currently supports TensorFlow 1.x. In order to deploy our model in a format compatible with AI Explanations, we'll follow the steps below to convert our Keras model to a TF Estimator, and then use the `export_saved_model` method to generate the SavedModel and save it in GCS." ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "fbvzBm1lji7b", "colab": {} }, "source": [ "## Convert our Keras model to an estimator\n", "keras_estimator = tf.keras.estimator.model_to_estimator(keras_model=model, model_dir='savedmodel_export')" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "KLM43L2FjmFu", "colab": {} }, "source": [ "# We need this serving input function to export our model in the next cell\n", "serving_fn = tf.estimator.export.build_raw_serving_input_receiver_fn(\n", " {'dense_input': model.input}\n", ")" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "JBA8ejrJjnLB", "colab": {} }, "source": [ "export_path = keras_estimator.export_saved_model(\n", " 'gs://' + BUCKET_NAME + '/explanations',\n", " serving_input_receiver_fn=serving_fn\n", ").decode('utf-8')\n", "print(export_path)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "-f8elyM8KMNX" }, "source": [ "Use TensorFlow's `saved_model_cli` to inspect the model's SignatureDef. We'll use this information when we deploy our model to AI Explanations in the next section." 
] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "yFg5r-7s1BKr", "colab": {} }, "source": [ "!saved_model_cli show --dir $export_path --all" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "y270ZNinycoy" }, "source": [ "## Deploy the model to AI Explanations\n", "\n", "In order to deploy the model to AI Explanations, we need to generate an `explanation_metadata.json` file and upload it to the Cloud Storage bucket with our SavedModel. Then we'll deploy the model using `gcloud`." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "cUdUVjjGbvQy" }, "source": [ "### Prepare explanation metadata\n", "\n", "We need to tell AI Explanations the names of the input and output tensors our model is expecting, which we print below. \n", "\n", "The value for `input_baselines` tells the explanations service what the baseline input should be for our model. Here we're using the median for all of our input features. That means the baseline prediction for this model will be the trip duration our model predicts for the median of each feature in our dataset. \n", "\n", "Since this model accepts a single numpy array with all numerical features, we can optionally pass an `index_feature_mapping` list to AI Explanations to make the API response easier to parse. When we provide a list of feature names via this parameter, the service will return a key / value mapping of each feature with its corresponding attribution value." 
] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "UolAW3lcVTGl", "colab": {} }, "source": [ "# Print the names of our tensors\n", "print('Model input tensor: ', model.input.name)\n", "print('Model output tensor: ', model.output.name)" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "qpZiW9Cq6IY4", "colab": {} }, "source": [ "explanation_metadata = {\n", " \"inputs\": {\n", " \"data\": {\n", " \"input_tensor_name\": model.input.name,\n", " \"input_baselines\": [train_data.median().values.tolist()],\n", " \"encoding\": \"bag_of_features\", \n", " \"index_feature_mapping\": train_data.columns.tolist()\n", " }\n", " },\n", " \"outputs\": {\n", " \"duration\": {\n", " \"output_tensor_name\": model.output.name\n", " }\n", " },\n", " \"framework\": \"tensorflow\"\n", " }" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "rT3iG5pDdrHi" }, "source": [ "Since this is a regression model (predicting a numerical value), the baseline prediction will be the same for every example we send to the model. If this were instead a classification model, each class would have a different baseline prediction." 
] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "b6dyTQ1e9Tan", "colab": {} }, "source": [ "# Write the json to a local file\n", "with open('explanation_metadata.json', 'w') as output_file:\n", " json.dump(explanation_metadata, output_file)" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "zmVJKgch6PYJ", "colab": {} }, "source": [ "!gsutil cp explanation_metadata.json $export_path" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "J6MKKy6Xb2MT" }, "source": [ "### Create the model" ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "S2OaOycmb4o0", "colab": {} }, "source": [ "MODEL = 'bike'" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "0bwCxEr5b8BP", "colab": {} }, "source": [ "# Create the model if it doesn't exist yet (you only need to run this once)\n", "!gcloud ai-platform models create $MODEL --enable-logging --regions=us-central1" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "qp4qfnZib-zQ" }, "source": [ "### Create the model version \n", "\n", "Creating the version will take ~5-10 minutes. Note that your first deploy may take longer." 
] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "LQlcQFG_AB4o", "colab": {} }, "source": [ "# Each time you create a version the name should be unique\n", "VERSION = 'v1'" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "3l5t2o1t7dal", "colab": {} }, "source": [ "# Create the version with gcloud\n", "explain_method = 'integrated-gradients'\n", "!gcloud beta ai-platform versions create $VERSION \\\n", "--model $MODEL \\\n", "--origin $export_path \\\n", "--runtime-version 1.15 \\\n", "--framework TENSORFLOW \\\n", "--python-version 3.7 \\\n", "--machine-type n1-standard-4 \\\n", "--explanation-method $explain_method \\\n", "--num-integral-steps 25" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "eWkkRFhEMbFa", "colab": {} }, "source": [ "# Make sure the model deployed correctly. State should be `READY` in the following log\n", "!gcloud ai-platform versions describe $VERSION --model $MODEL" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "deletable": true, "editable": true, "id": "JzevJps9IOcU" }, "source": [ "## Getting predictions and explanations on deployed model\n", "\n", "Now that your model is deployed, you can use the AI Platform Prediction API to get predictions along with feature attributions. We'll send a single test example and see which features were most important in the model's prediction. You can also make the same request with `gcloud`." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "CJ-2ErWJDvcg" }, "source": [ "### Format our explanation request\n", "\n", "To make our AI Explanations request, we need to create a JSON object with our test data for prediction." 
] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "D_PR2BcHD40-", "colab": {} }, "source": [ "# Format data for prediction to our model.\n", "# The API expects `instances` to be a list, so wrap the single example in one.\n", "prediction_json = [{'dense_input': test_data.iloc[0].values.tolist()}]" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "kw7_f9QVD8Y_" }, "source": [ "### Making the explain request\n", "\n", "The following `predict_json` function will make an `explain` request to the AI Platform Prediction API." ] }, { "cell_type": "code", "metadata": { "id": "jHFobgrbYSL9", "colab_type": "code", "colab": {} }, "source": [ "# This is adapted from a sample in the docs\n", "# Find it here: https://cloud.google.com/ai-platform/prediction/docs/online-predict#python\n", "\n", "def predict_json(project, model, instances, version=None):\n", "    \"\"\"Send JSON data to a deployed model for prediction and explanation.\n", "\n", "    Args:\n", "        project (str): project where the AI Platform Model is deployed.\n", "        model (str): model name.\n", "        instances ([Mapping[str: Any]]): Keys should be the names of Tensors\n", "            your deployed model expects as inputs. Values should be datatypes\n", "            convertible to Tensors, or (potentially nested) lists of datatypes\n", "            convertible to tensors.\n", "        version: str, version of the model to target.\n", "    Returns:\n", "        Mapping[str: any]: dictionary of prediction and explanation results\n", "            defined by the model.\n", "    \"\"\"\n", "\n", "    service = googleapiclient.discovery.build('ml', 'v1')\n", "    name = 'projects/{}/models/{}'.format(project, model)\n", "\n", "    if version is not None:\n", "        name += '/versions/{}'.format(version)\n", "\n", "    response = service.projects().explain(\n", "        name=name,\n", "        body={'instances': instances}\n", "    ).execute()\n", "\n", "    if 'error' in response:\n", "        raise RuntimeError(response['error'])\n", "\n", "    return response" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "3H8bfxJMYsc9", "colab_type": "code", "colab": {} }, "source": [ "response = predict_json(PROJECT_ID, MODEL, prediction_json, VERSION)\n", "print(response)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "0nKR8RelNnkK" }, "source": [ "### Understanding the explanations response\n", "\n", "First, let's look at the trip duration our model predicted and compare it to the actual value." ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "825KoNgHR-tv", "colab": {} }, "source": [ "explanations = response['explanations'][0]['attributions_by_label'][0]\n", "\n", "predicted = round(explanations['example_score'], 2)\n", "print('Predicted duration: ' + str(predicted) + ' minutes')\n", "print('Actual duration: ' + str(test_labels.iloc[0]) + ' minutes')" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "QmObtmXIONDp" }, "source": [ "Next let's look at the feature attributions for this particular example. 
Positive attribution values mean a particular feature pushed our model prediction up by that amount, and negative attribution values mean it pushed the prediction down." ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "6HKvAImeM_qi", "colab": {} }, "source": [ "feature_names = test_data.columns.tolist()\n", "attributions = explanations['attributions']\n", "rows = []\n", "# Use row 0 of the test set: it is the example we sent in the explain request\n", "for i, val in enumerate(feature_names):\n", "    rows.append([val, test_data.iloc[0].tolist()[i], attributions[val][0]])\n", "print(tabulate(rows, headers=['Feature name', 'Feature value', 'Attribution value']))" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "BZiM7kywQy6j" }, "source": [ "## Sanity check our explanations\n", "\n", "To better make sense of the feature attributions we're getting, we should compare them with our model's baseline. In most cases, the sum of your attribution values + the baseline should be very close to your model's predicted value for each input. Also note that for regression models, the `baseline_score` returned from AI Explanations will be the same for each example sent to your model. For classification models, each class will have its own baseline.\n", "\n", "In this section we'll send 10 test examples to our model for prediction in order to compare the feature attributions with the baseline. Then we'll run each test example's attributions through two sanity checks in the `sanity_check_explanations` function." 
] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "CSf6psVDSDrN", "colab": {} }, "source": [ "# Prepare 10 test examples to send to our model for prediction\n", "pred_batch = []\n", "for i in range(10):\n", "    pred_batch.append({'dense_input': test_data.iloc[i].values.tolist()})" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "sUUjKw-fDFYe", "colab": {} }, "source": [ "# Make the request using the function we defined above\n", "batch_explain = predict_json(PROJECT_ID, MODEL, pred_batch, VERSION)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "vEevMCrMNjxm" }, "source": [ "In the function below we perform two sanity checks for models using Integrated Gradients (IG) explanations and one sanity check for models using Sampled Shapley." ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "B_WQXkE6RLe4", "colab": {} }, "source": [ "def sanity_check_explanations(example, mean_tgt_value=None, variance_tgt_value=None):\n", "    passed_test = 0\n", "    total_test = 1\n", "    # `attributions` is a dict where keys are the feature names\n", "    # and values are the feature attributions for each feature\n", "    attribution_vals = [x[0] for x in example['attributions_by_label'][0]['attributions'].values()]\n", "    baseline_score = example['attributions_by_label'][0]['baseline_score']\n", "    sum_with_baseline = np.sum(attribution_vals) + baseline_score\n", "    predicted_val = example['attributions_by_label'][0]['example_score']\n", "    # Sanity check 1\n", "    # Warn if the prediction at the input is (nearly) equal to the prediction\n", "    # at the baseline. If it is, use a different baseline. Some suggestions\n", "    # are: random input, training set mean.\n", 
"    if abs(predicted_val - baseline_score) <= 0.05:\n", "        print('Warning: example score and baseline score are too close.')\n", "        print('You might not get attributions.')\n", "    else:\n", "        passed_test += 1\n", "\n", "    # Sanity check 2 (only for models using Integrated Gradients explanations)\n", "    # Ideally, the sum of the integrated gradients equals the difference between\n", "    # the prediction at the input and the prediction at the baseline. Any\n", "    # discrepancy between these two values is due to errors in approximating\n", "    # the integral.\n", "    if explain_method == 'integrated-gradients':\n", "        total_test += 1\n", "        want_integral = predicted_val - baseline_score\n", "        got_integral = sum(attribution_vals)\n", "        if abs(want_integral-got_integral)/abs(want_integral) > 0.05:\n", "            print('Warning: Integral approximation error exceeds 5%.')\n", "            print('Please try increasing the number of integrated gradient steps.')\n", "        else:\n", "            passed_test += 1\n", "\n", "    print(passed_test, ' out of ', total_test, ' sanity checks passed.')" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "dkpK830AtRkJ", "colab": { "base_uri": "https://localhost:8080/", "height": 187 }, "outputId": "8fe50733-2e9a-40f6-ad96-5a472d09d5ea" }, "source": [ "for i in batch_explain['explanations']:\n", "    sanity_check_explanations(i)" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "2 out of 2 sanity checks passed.\n", "2 out of 2 sanity checks passed.\n", "2 out of 2 sanity checks passed.\n", "2 out of 2 sanity checks passed.\n", "2 out of 2 sanity checks passed.\n", "2 out of 2 sanity checks passed.\n", "2 out of 2 sanity checks passed.\n", "2 out of 2 sanity checks passed.\n", "2 out of 2 sanity checks passed.\n", "2 out of 2 sanity checks passed.\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "C5ur65baOmnn" }, "source": [ "## Understanding AI Explanations with the What-If Tool\n", "\n", "In this section we'll use the [What-If Tool](https://pair-code.github.io/what-if-tool/) to better understand how our model is making predictions. See the cell below the What-If Tool for visualization ideas." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "OFO6s8ZsvKT_" }, "source": [ "The What-If Tool expects data with keys for each feature name, but our model expects a flat list. The code below builds keyed records for the What-If Tool and defines helper functions that convert each record back into the flat list our model expects." ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "-yajpXi4Oyc3", "colab": {} }, "source": [ "# This is the number of data points we'll send to the What-If Tool\n", "WHAT_IF_TOOL_SIZE = 500\n", "\n", "from witwidget.notebook.visualization import WitWidget, WitConfigBuilder\n", "\n", "def create_list(ex_dict):\n", "    new_list = []\n", "    for i in feature_names:\n", "        new_list.append(ex_dict[i])\n", "    return new_list\n", "\n", "def example_dict_to_input(example_dict):\n", "    return { 'dense_input': create_list(example_dict) }\n", "\n", "from collections import OrderedDict\n", "wit_data = test_data.iloc[:WHAT_IF_TOOL_SIZE].copy()\n", "wit_data['duration'] = test_labels[:WHAT_IF_TOOL_SIZE]\n", "wit_data_dict = wit_data.to_dict(orient='records', into=OrderedDict)" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "colab_type": "code", "id": "3I2bEUe7Pr2Y", "colab": {} }, "source": [ "config_builder = WitConfigBuilder(\n", "    wit_data_dict\n", "  ).set_ai_platform_model(\n", "    PROJECT_ID,\n", "    MODEL,\n", "    VERSION,\n", "    adjust_example=example_dict_to_input\n", "  ).set_target_feature('duration').set_model_type('regression')\n", "WitWidget(config_builder)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "hWmNoC6FxOAt" }, "source": [ "### What-If Tool visualization ideas\n", "\n", "On 
the x-axis, you'll see the predicted trip duration for the test inputs you passed to the What-If Tool. Each circle represents one of your test examples. If you click on a circle, you'll be able to see the feature values for that example along with the attribution values for each feature.\n", "\n", "* You can edit individual feature values and re-run prediction directly within the What-If Tool. Try changing `distance`, then click **Run inference** to see how that affects the model's prediction.\n", "* You can sort the features for an individual example by their attribution value. Try changing the sort order from the attributions dropdown.\n", "* The What-If Tool also lets you create custom visualizations. You can do this by changing the values in the dropdown menus above the scatter plot visualization. For example, you can sort data points by inference error, or by their similarity to a single data point." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "deletable": true, "editable": true, "id": "x27DXeUGzb-M" }, "source": [ "## Cleaning up\n", "\n", "To clean up all GCP resources used in this project, you can [delete the GCP\n", "project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.\n", "\n", "Alternatively, you can clean up individual resources by running the following\n", "commands:" ] }, { "cell_type": "code", "metadata": { "colab_type": "code", "deletable": true, "editable": true, "id": "no210oWF68Uk", "colab": {} }, "source": [ "# Delete model version resource\n", "!gcloud ai-platform versions delete $VERSION --quiet --model $MODEL\n", "\n", "# Delete model resource\n", "!gcloud ai-platform models delete $MODEL --quiet\n", "\n", "# Delete Cloud Storage objects that were created\n", "!gsutil -m rm -r $BUCKET_NAME" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "deletable": true, "editable": true, "id": "3F2g4OjbJ3gZ" 
}, "source": [ "If your Cloud Storage bucket doesn't contain any other objects and you would like to delete it, run `gsutil rm -r gs://$BUCKET_NAME`." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "deletable": true, "editable": true, "id": "K0UXLWaBJnrY" }, "source": [ "## What's next?\n", "\n", "To learn more about AI Explanations or the What-If Tool, check out the resources below.\n", "\n", "* [AI Explanations documentation](https://cloud.google.com/ml-engine/docs/ai-explanations)\n", "* [Documentation for using the What-If Tool with Cloud AI Platform models](https://cloud.google.com/ml-engine/docs/using-what-if-tool)\n", "* [What-If Tool documentation and demos](https://pair-code.github.io/what-if-tool/)\n", "* [Integrated gradients paper](https://arxiv.org/abs/1703.01365)" ] } ] }