{
"cells": [
{
"cell_type": "markdown",
"id": "98605bcd",
"metadata": {},
"source": [
"# Analysis of Synthetic Data\n",
"\n",
"This notebook demonstrates a hypothetical scenario of how likely a programmer should be given access to a GPT2 model for inferencing, based on information such as their favorite programming language, preference for tabs vs spaces, OS, location and so forth. Each programmer will be given a score between [0,10] where a score between [7,10] indicates access given to the programmer and [0,7) indicates access denied. The data were synthetically generated via the [PyPI package, Fibber.io](https://pypi.org/project/fibber/).\n",
"\n",
"First, we need to specify the version of the RAI components which are available in the workspace. This was specified when the components were uploaded, and will have defaulted to '1':"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "53b4eeac",
"metadata": {},
"outputs": [],
"source": [
"version_string = \"1\""
]
},
{
"cell_type": "markdown",
"id": "06008690",
"metadata": {},
"source": [
"We also need to give the name of the compute cluster we want to use in AzureML. Later in this notebook, we will create it if it does not already exist:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f1ad79f9",
"metadata": {},
"outputs": [],
"source": [
"compute_name = \"rai-cluster\""
]
},
{
"cell_type": "markdown",
"id": "9fc65dc7",
"metadata": {},
"source": [
"Finally, we need to specify a version for the data and components we will create while running this notebook. This should be unique for the workspace, but the specific value doesn't matter:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "78053935",
"metadata": {},
"outputs": [],
"source": [
"rai_programmer_example_version_string = \"1\""
]
},
{
"cell_type": "markdown",
"id": "35c15429",
"metadata": {},
"source": [
"## Configure workspace details and get a handle to the workspace\n",
"\n",
"To connect to a workspace, we need identifier parameters - a subscription, resource group and workspace name. We will use these details in the MLClient from `azure.ai.ml` to get a handle to the required Azure Machine Learning workspace."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "10627d86",
"metadata": {},
"outputs": [],
"source": [
"# Enter details of your AML workspace\n",
"subscription_id = \"<SUBSCRIPTION_ID>\"\n",
"resource_group = \"<RESOURCE_GROUP>\"\n",
"workspace = \"<AML_WORKSPACE_NAME>\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "83d33e34",
"metadata": {},
"outputs": [],
"source": [
"# Handle to the workspace\n",
"from azure.ai.ml import MLClient\n",
"from azure.identity import DefaultAzureCredential\n",
"\n",
"credential = DefaultAzureCredential()\n",
"ml_client = MLClient(\n",
" credential=credential,\n",
" subscription_id=subscription_id,\n",
" resource_group_name=resource_group,\n",
" workspace_name=workspace,\n",
")\n",
"print(ml_client)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "54738095",
"metadata": {},
"outputs": [],
"source": [
"# Get handle to azureml registry for the RAI built in components\n",
"registry_name = \"azureml\"\n",
"ml_client_registry = MLClient(\n",
" credential=credential,\n",
" subscription_id=subscription_id,\n",
" resource_group_name=resource_group,\n",
" registry_name=registry_name,\n",
")\n",
"print(ml_client_registry)"
]
},
{
"cell_type": "markdown",
"id": "73be2b63",
"metadata": {},
"source": [
"## Accessing the Data\n",
"\n",
"We supply the synthetic data as a pair of parquet files and accompanying `MLTable` file. We can read them in and take a brief look:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5f875f18",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"id": "17d53df4",
"metadata": {},
"source": [
"Now define the paths to the data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4c7bbe58",
"metadata": {},
"outputs": [],
"source": [
"train_data_path = \"data-programmer-regression/train/\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8c4eb082",
"metadata": {},
"outputs": [],
"source": [
"test_data_path = \"data-programmer-regression/test/\""
]
},
{
"cell_type": "markdown",
"id": "a2c4ebb4",
"metadata": {},
"source": [
"Load some data for a quick view:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1027fa92",
"metadata": {},
"outputs": [],
"source": [
"import mltable\n",
"\n",
"tbl = mltable.load(train_data_path)\n",
"train_df: pd.DataFrame = tbl.to_pandas_dataframe()\n",
"\n",
"# test dataset should have less than 5000 rows\n",
"test_df = mltable.load(test_data_path).to_pandas_dataframe()\n",
"assert len(test_df.index) <= 5000\n",
"\n",
"display(train_df)"
]
},
{
"cell_type": "markdown",
"id": "1115ac59",
"metadata": {},
"source": [
"The (synthetic) data are about a collection of programmers, with a 'score' column which we wish to predict:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5b42df3d",
"metadata": {},
"outputs": [],
"source": [
"target_column_name = \"score\""
]
},
{
"cell_type": "markdown",
"id": "52e79b04",
"metadata": {},
"source": [
"First, we need to upload the datasets to our workspace."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "62eb02a2",
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.ml.entities import Data\n",
"from azure.ai.ml.constants import AssetTypes\n",
"\n",
"input_train_data = \"Programmers_Train_MLTable\"\n",
"input_test_data = \"Programmers_Test_MLTable\"\n",
"\n",
"try:\n",
" # Try getting data already registered in workspace\n",
" train_data = ml_client.data.get(\n",
" name=input_train_data, version=rai_programmer_example_version_string\n",
" )\n",
" test_data = ml_client.data.get(\n",
" name=input_test_data, version=rai_programmer_example_version_string\n",
" )\n",
"except Exception as e:\n",
" # If no data of specified version exist, create new one\n",
" train_data = Data(\n",
" path=train_data_path,\n",
" type=AssetTypes.MLTABLE,\n",
" description=\"RAI programmers training data\",\n",
" name=input_train_data,\n",
" version=rai_programmer_example_version_string,\n",
" )\n",
" ml_client.data.create_or_update(train_data)\n",
"\n",
" test_data = Data(\n",
" path=test_data_path,\n",
" type=AssetTypes.MLTABLE,\n",
" description=\"RAI programmers test data\",\n",
" name=input_test_data,\n",
" version=rai_programmer_example_version_string,\n",
" )\n",
" ml_client.data.create_or_update(test_data)"
]
},
{
"cell_type": "markdown",
"id": "6815ba75",
"metadata": {},
"source": [
"# Creating the Model\n",
"\n",
"To simplify the model creation process, we're going to use a pipeline.\n",
"\n",
"We create a directory for the training script:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e78d869b",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"os.makedirs(\"register_model_src\", exist_ok=True)\n",
"os.makedirs(\"programmer_component_src\", exist_ok=True)"
]
},
{
"cell_type": "markdown",
"id": "ea86e55d",
"metadata": {},
"source": [
"Next, we write out our training script:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a523f144",
"metadata": {},
"outputs": [],
"source": [
"%%writefile programmer_component_src/training_script_reg.py\n",
"\n",
"import argparse\n",
"import os\n",
"import shutil\n",
"import tempfile\n",
"\n",
"\n",
"from azureml.core import Run\n",
"\n",
"import mlflow\n",
"import mlflow.sklearn\n",
"\n",
"import mltable\n",
"\n",
"import pandas as pd\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.impute import SimpleImputer\n",
"from sklearn.preprocessing import StandardScaler, OneHotEncoder\n",
"from sklearn.compose import ColumnTransformer\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"def parse_args():\n",
" # setup arg parser\n",
" parser = argparse.ArgumentParser()\n",
"\n",
" # add arguments\n",
" parser.add_argument(\"--training_data\", type=str, help=\"Path to training data\")\n",
" parser.add_argument(\"--target_column_name\", type=str, help=\"Name of target column\")\n",
" parser.add_argument(\"--model_output\", type=str, help=\"Path of output model\")\n",
"\n",
" # parse args\n",
" args = parser.parse_args()\n",
"\n",
" # return args\n",
" return args\n",
"\n",
"def create_regression_pipeline(X, y):\n",
" pipe_cfg = {\n",
" 'num_cols': X.dtypes[X.dtypes == 'int64'].index.values.tolist(),\n",
" 'cat_cols': X.dtypes[X.dtypes == 'object'].index.values.tolist(),\n",
" }\n",
" num_pipe = Pipeline([\n",
" ('num_imputer', SimpleImputer(strategy='median')),\n",
" ('num_scaler', StandardScaler())\n",
" ])\n",
" cat_pipe = Pipeline([\n",
" ('cat_imputer', SimpleImputer(strategy='constant', fill_value='?')),\n",
" ('cat_encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))\n",
" ])\n",
" feat_pipe = ColumnTransformer([\n",
" ('num_pipe', num_pipe, pipe_cfg['num_cols']),\n",
" ('cat_pipe', cat_pipe, pipe_cfg['cat_cols'])\n",
" ])\n",
"\n",
" # Append classifier to preprocessing pipeline.\n",
" # Now we have a full prediction pipeline.\n",
" pipeline = Pipeline(steps=[('preprocessor', feat_pipe),\n",
" ('model', LinearRegression())])\n",
" return pipeline.fit(X, y)\n",
"\n",
"def main(args):\n",
" current_experiment = Run.get_context().experiment\n",
" tracking_uri = current_experiment.workspace.get_mlflow_tracking_uri()\n",
" print(\"tracking_uri: {0}\".format(tracking_uri))\n",
" mlflow.set_tracking_uri(tracking_uri)\n",
" mlflow.set_experiment(current_experiment.name)\n",
" \n",
" # Read in data\n",
" print(\"Reading data\")\n",
" tbl = mltable.load(args.training_data)\n",
" all_data = tbl.to_pandas_dataframe()\n",
"\n",
" print(\"Extracting X_train, y_train\")\n",
" print(\"all_data cols: {0}\".format(all_data.columns))\n",
" y_train = all_data[args.target_column_name]\n",
" X_train = all_data.drop(labels=args.target_column_name, axis=\"columns\")\n",
" print(\"X_train cols: {0}\".format(X_train.columns))\n",
"\n",
" print(\"Training model\")\n",
" # The estimator can be changed to suit\n",
" model = create_regression_pipeline(X_train, y_train)\n",
"\n",
" # Saving model with mlflow - leave this section unchanged\n",
" with tempfile.TemporaryDirectory() as td:\n",
" print(\"Saving model with MLFlow to temporary directory\")\n",
" tmp_output_dir = os.path.join(td, \"my_model_dir\")\n",
" mlflow.sklearn.save_model(sk_model=model, path=tmp_output_dir)\n",
"\n",
" print(\"Copying MLFlow model to output path\")\n",
" for file_name in os.listdir(tmp_output_dir):\n",
" print(\" Copying: \", file_name)\n",
" # As of Python 3.8, copytree will acquire dirs_exist_ok as\n",
" # an option, removing the need for listdir\n",
" shutil.copy2(src=os.path.join(tmp_output_dir, file_name), dst=os.path.join(args.model_output, file_name))\n",
"\n",
"\n",
"# run script\n",
"if __name__ == \"__main__\":\n",
" # add space in logs\n",
" print(\"*\" * 60)\n",
" print(\"\\n\\n\")\n",
"\n",
" # parse args\n",
" args = parse_args()\n",
"\n",
" # run main function\n",
" main(args)\n",
"\n",
" # add space in logs\n",
" print(\"*\" * 60)\n",
" print(\"\\n\\n\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "65af35fc",
"metadata": {},
"outputs": [],
"source": [
"%%writefile register_model_src/register.py\n",
"\n",
"# ---------------------------------------------------------\n",
"# Copyright (c) Microsoft Corporation. All rights reserved.\n",
"# ---------------------------------------------------------\n",
"\n",
"import argparse\n",
"import json\n",
"import os\n",
"import time\n",
"\n",
"\n",
"from azureml.core import Run\n",
"\n",
"import mlflow\n",
"import mlflow.sklearn\n",
"\n",
"# Based on example:\n",
"# https://docs.microsoft.com/en-us/azure/machine-learning/how-to-train-cli\n",
"# which references\n",
"# https://github.com/Azure/azureml-examples/tree/main/cli/jobs/train/lightgbm/iris\n",
"\n",
"\n",
"def parse_args():\n",
" # setup arg parser\n",
" parser = argparse.ArgumentParser()\n",
"\n",
" # add arguments\n",
" parser.add_argument(\"--model_input_path\", type=str, help=\"Path to input model\")\n",
" parser.add_argument(\n",
" \"--model_info_output_path\", type=str, help=\"Path to write model info JSON\"\n",
" )\n",
" parser.add_argument(\n",
" \"--model_base_name\", type=str, help=\"Name of the registered model\"\n",
" )\n",
" parser.add_argument(\n",
" \"--model_name_suffix\", type=int, help=\"Set negative to use epoch_secs\"\n",
" )\n",
"\n",
" # parse args\n",
" args = parser.parse_args()\n",
"\n",
" # return args\n",
" return args\n",
"\n",
"\n",
"def main(args):\n",
" current_experiment = Run.get_context().experiment\n",
" tracking_uri = current_experiment.workspace.get_mlflow_tracking_uri()\n",
" print(\"tracking_uri: {0}\".format(tracking_uri))\n",
" mlflow.set_tracking_uri(tracking_uri)\n",
" mlflow.set_experiment(current_experiment.name)\n",
"\n",
" print(\"Loading model\")\n",
" mlflow_model = mlflow.sklearn.load_model(args.model_input_path)\n",
"\n",
" if args.model_name_suffix < 0:\n",
" suffix = int(time.time())\n",
" else:\n",
" suffix = args.model_name_suffix\n",
" registered_name = \"{0}_{1}\".format(args.model_base_name, suffix)\n",
" print(f\"Registering model as {registered_name}\")\n",
"\n",
" print(\"Registering via MLFlow\")\n",
" mlflow.sklearn.log_model(\n",
" sk_model=mlflow_model,\n",
" registered_model_name=registered_name,\n",
" artifact_path=registered_name,\n",
" )\n",
"\n",
" print(\"Writing JSON\")\n",
" dict = {\"id\": \"{0}:1\".format(registered_name)}\n",
" output_path = os.path.join(args.model_info_output_path, \"model_info.json\")\n",
" with open(output_path, \"w\") as of:\n",
" json.dump(dict, fp=of)\n",
"\n",
"\n",
"# run script\n",
"if __name__ == \"__main__\":\n",
" # add space in logs\n",
" print(\"*\" * 60)\n",
" print(\"\\n\\n\")\n",
"\n",
" # parse args\n",
" args = parse_args()\n",
"\n",
" # run main function\n",
" main(args)\n",
"\n",
" # add space in logs\n",
" print(\"*\" * 60)\n",
" print(\"\\n\\n\")\n"
]
},
{
"cell_type": "markdown",
"id": "e115dd6e",
"metadata": {},
"source": [
"Now, we can build this into an AzureML component:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3d54e43f",
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.ml import load_component\n",
"\n",
"yaml_contents = f\"\"\"\n",
"$schema: http://azureml/sdk-2-0/CommandComponent.json\n",
"name: rai_programmers_training_component\n",
"display_name: Programmers training component for RAI example\n",
"version: {rai_programmer_example_version_string}\n",
"type: command\n",
"inputs:\n",
" training_data:\n",
" type: path\n",
" target_column_name:\n",
" type: string\n",
"outputs:\n",
" model_output:\n",
" type: path\n",
"code: ./programmer_component_src/\n",
"environment: azureml://registries/azureml/environments/responsibleai-tabular/versions/14\n",
"command: >-\n",
" python training_script_reg.py\n",
" --training_data ${{{{inputs.training_data}}}}\n",
" --target_column_name ${{{{inputs.target_column_name}}}}\n",
" --model_output ${{{{outputs.model_output}}}}\n",
"\"\"\"\n",
"\n",
"yaml_filename = \"ProgrammersRegTrainingComp.yaml\"\n",
"\n",
"with open(yaml_filename, \"w\") as f:\n",
" f.write(yaml_contents)\n",
"\n",
"train_model_component = load_component(source=yaml_filename)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f9cde33c",
"metadata": {},
"outputs": [],
"source": [
"yaml_contents = f\"\"\"\n",
"$schema: http://azureml/sdk-2-0/CommandComponent.json\n",
"name: register_model\n",
"display_name: Register Model\n",
"version: {rai_programmer_example_version_string}\n",
"type: command\n",
"is_deterministic: False\n",
"inputs:\n",
" model_input_path:\n",
" type: path\n",
" model_base_name:\n",
" type: string\n",
" model_name_suffix: # Set negative to use epoch_secs\n",
" type: integer\n",
" default: -1\n",
"outputs:\n",
" model_info_output_path:\n",
" type: path\n",
"code: ./register_model_src/\n",
"environment: azureml://registries/azureml/environments/responsibleai-tabular/versions/14\n",
"command: >-\n",
" python register.py\n",
" --model_input_path ${{{{inputs.model_input_path}}}}\n",
" --model_base_name ${{{{inputs.model_base_name}}}}\n",
" --model_name_suffix ${{{{inputs.model_name_suffix}}}}\n",
" --model_info_output_path ${{{{outputs.model_info_output_path}}}}\n",
"\n",
"\"\"\"\n",
"\n",
"yaml_filename = \"register.yaml\"\n",
"\n",
"with open(yaml_filename, \"w\") as f:\n",
" f.write(yaml_contents)\n",
"\n",
"register_component = load_component(source=yaml_filename)"
]
},
{
"cell_type": "markdown",
"id": "6d165e2b",
"metadata": {},
"source": [
"We need a compute target on which to run our jobs. The following checks whether the compute specified above is present; if not, then the compute target is created."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1e40fc38",
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.ml.entities import AmlCompute\n",
"\n",
"all_compute_names = [x.name for x in ml_client.compute.list()]\n",
"\n",
"if compute_name in all_compute_names:\n",
" print(f\"Found existing compute: {compute_name}\")\n",
"else:\n",
" my_compute = AmlCompute(\n",
" name=compute_name,\n",
" size=\"Standard_D2_v2\",\n",
" min_instances=0,\n",
" max_instances=4,\n",
" idle_time_before_scale_down=3600,\n",
" )\n",
" ml_client.compute.begin_create_or_update(my_compute).result()\n",
" print(\"Initiated compute creation\")"
]
},
{
"cell_type": "markdown",
"id": "9d8eb868",
"metadata": {},
"source": [
"## Running a training pipeline\n",
"\n",
"Now that we have our training component, we can run it. We begin by generating a unique name for the mode;"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ad76242b",
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"\n",
"model_name_suffix = int(time.time())\n",
"model_name = \"rai_programmer_example_reg\""
]
},
{
"cell_type": "markdown",
"id": "d49615a7",
"metadata": {},
"source": [
"Next, we define our training pipeline. This has two components. The first is the training component which we defined above. The second is a component to register the model in AzureML:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cb6c6cec",
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.ml import dsl, Input\n",
"\n",
"programmers_train_mltable = Input(\n",
" type=\"mltable\",\n",
" path=f\"azureml:{input_train_data}:{rai_programmer_example_version_string}\",\n",
" mode=\"download\",\n",
")\n",
"programmers_test_mltable = Input(\n",
" type=\"mltable\",\n",
" path=f\"azureml:{input_test_data}:{rai_programmer_example_version_string}\",\n",
" mode=\"download\",\n",
")\n",
"\n",
"\n",
"@dsl.pipeline(\n",
" compute=compute_name,\n",
" description=\"Register Model for RAI Programmers example\",\n",
" experiment_name=f\"RAI_Programmers_Example_Model_Training_{model_name_suffix}\",\n",
")\n",
"def my_training_pipeline(target_column_name, training_data):\n",
" trained_model = train_model_component(\n",
" target_column_name=target_column_name, training_data=training_data\n",
" )\n",
" trained_model.set_limits(timeout=3600)\n",
"\n",
" _ = register_component(\n",
" model_input_path=trained_model.outputs.model_output,\n",
" model_base_name=model_name,\n",
" model_name_suffix=model_name_suffix,\n",
" )\n",
"\n",
" return {}\n",
"\n",
"\n",
"model_registration_pipeline_job = my_training_pipeline(\n",
" target_column_name, programmers_train_mltable\n",
")"
]
},
{
"cell_type": "markdown",
"id": "2fa66ea6",
"metadata": {},
"source": [
"With the training pipeline defined, we can submit it for execution in AzureML. We define a helper function to wait for the job to complete:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f854eef5",
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.ml.entities import PipelineJob\n",
"from IPython.core.display import HTML\n",
"from IPython.display import display\n",
"\n",
"\n",
"def submit_and_wait(ml_client, pipeline_job) -> PipelineJob:\n",
" created_job = ml_client.jobs.create_or_update(pipeline_job)\n",
" assert created_job is not None\n",
"\n",
" print(\"Pipeline job can be accessed in the following URL:\")\n",
" display(HTML('<a href=\"{0}\">{0}</a>'.format(created_job.studio_url)))\n",
"\n",
" while created_job.status not in [\n",
" \"Completed\",\n",
" \"Failed\",\n",
" \"Canceled\",\n",
" \"NotResponding\",\n",
" ]:\n",
" time.sleep(30)\n",
" created_job = ml_client.jobs.get(created_job.name)\n",
" print(\"Latest status : {0}\".format(created_job.status))\n",
" assert created_job.status == \"Completed\"\n",
" return created_job\n",
"\n",
"\n",
"# This is the actual submission\n",
"training_job = submit_and_wait(ml_client, model_registration_pipeline_job)"
]
},
{
"cell_type": "markdown",
"id": "0722395e",
"metadata": {},
"source": [
"## Creating the RAI Insights\n",
"\n",
"Now that we have our model, we can generate RAI insights for it. We will need the `id` of the registered model, which will be as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7d3e6e6e",
"metadata": {},
"outputs": [],
"source": [
"expected_model_id = f\"{model_name}_{model_name_suffix}:1\"\n",
"azureml_model_id = f\"azureml:{expected_model_id}\""
]
},
{
"cell_type": "markdown",
"id": "310aa659",
"metadata": {},
"source": [
"Next, we load the RAI components, so that we can construct a pipeline:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d67b942e",
"metadata": {},
"outputs": [],
"source": [
"label = \"latest\"\n",
"\n",
"rai_constructor_component = ml_client_registry.components.get(\n",
" name=\"rai_tabular_insight_constructor\", label=label\n",
")\n",
"\n",
"# We get latest version and use the same version for all components\n",
"version = rai_constructor_component.version\n",
"print(\"The current version of RAI built-in components is: \" + version)\n",
"\n",
"rai_explanation_component = ml_client_registry.components.get(\n",
" name=\"rai_tabular_explanation\", version=version\n",
")\n",
"\n",
"rai_causal_component = ml_client_registry.components.get(\n",
" name=\"rai_tabular_causal\", version=version\n",
")\n",
"\n",
"rai_counterfactual_component = ml_client_registry.components.get(\n",
" name=\"rai_tabular_counterfactual\", version=version\n",
")\n",
"\n",
"rai_erroranalysis_component = ml_client_registry.components.get(\n",
" name=\"rai_tabular_erroranalysis\", version=version\n",
")\n",
"\n",
"rai_gather_component = ml_client_registry.components.get(\n",
" name=\"rai_tabular_insight_gather\", version=version\n",
")\n",
"\n",
"rai_scorecard_component = ml_client_registry.components.get(\n",
" name=\"rai_tabular_score_card\", version=version\n",
")"
]
},
{
"cell_type": "markdown",
"id": "933b22cd",
"metadata": {},
"source": [
"## Score card generation config\n",
"For score card generation, we need some additional configuration in a separate json file. Here we configure the following model performance metrics for reporting:\n",
"- mean absolute error\n",
"- mean squared error"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "872e1fac",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"score_card_config_dict = {\n",
" \"Model\": {\n",
" \"ModelName\": \"GPT2 Access\",\n",
" \"ModelType\": \"Regression\",\n",
" \"ModelSummary\": \"This is a regression model to analyzer how likely a programmer is given access to gpt 2\",\n",
" },\n",
" \"Metrics\": {\"mean_absolute_error\": {\"threshold\": \"<=20\"}, \"mean_squared_error\": {}},\n",
" \"FeatureImportance\": {\"top_n\": 6},\n",
" \"DataExplorer\": {\"features\": [\"YOE\", \"age\"]},\n",
" \"Fairness\": {\n",
" \"metric\": [\"mean_squared_error\", \"mean_absolute_error\"],\n",
" \"sensitive_features\": [\"IDE\", \"style\"],\n",
" \"fairness_evaluation_kind\": \"difference\",\n",
" },\n",
"}\n",
"\n",
"score_card_config_filename = \"rai_programmer_regression_score_card_config.json\"\n",
"\n",
"with open(score_card_config_filename, \"w\") as f:\n",
" json.dump(score_card_config_dict, f)"
]
},
{
"cell_type": "markdown",
"id": "c98cd2d9",
"metadata": {},
"source": [
"We can now specify our pipeline. Complex objects (such as lists of column names) have to be converted to JSON strings before being passed to the components. Note that the timeout for the counterfactual job is noticeably longer, since generating counterfactual points is a comparatively slow process:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a62105a7",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"from azure.ai.ml import Input\n",
"from azure.ai.ml.constants import AssetTypes\n",
"\n",
"score_card_config_path = Input(\n",
" type=\"uri_file\", path=score_card_config_filename, mode=\"download\"\n",
")\n",
"\n",
"categorical_columns = json.dumps(\n",
" [\"location\", \"style\", \"job title\", \"OS\", \"Employer\", \"IDE\", \"Programming language\"]\n",
")\n",
"treatment_features = json.dumps([\"Number of github repos contributed to\", \"YOE\"])\n",
"desired_range = json.dumps([5, 10])\n",
"filter_columns = json.dumps([\"style\", \"Employer\"])\n",
"\n",
"\n",
"@dsl.pipeline(\n",
" compute=compute_name,\n",
" description=\"Example RAI computation on programmers data\",\n",
" experiment_name=f\"RAI_Programmers_Example_RAIInsights_Computation_{model_name_suffix}\",\n",
")\n",
"def rai_programmer_regression_pipeline(\n",
" target_column_name,\n",
" train_data,\n",
" test_data,\n",
" score_card_config_path,\n",
"):\n",
" # Initiate the RAIInsights\n",
" create_rai_job = rai_constructor_component(\n",
" title=\"RAI Dashboard Example\",\n",
" task_type=\"regression\",\n",
" model_info=expected_model_id,\n",
" model_input=Input(type=AssetTypes.MLFLOW_MODEL, path=azureml_model_id),\n",
" train_dataset=train_data,\n",
" test_dataset=test_data,\n",
" target_column_name=target_column_name,\n",
" categorical_column_names=categorical_columns,\n",
" # If your model has extra dependencies, and your Responsible AI job failed to\n",
" # load mlflow model with ValueError, try set use_model_dependency to True.\n",
" # If you have further questions, contact askamlrai@microsoft.com\n",
" use_model_dependency=True,\n",
" )\n",
" create_rai_job.set_limits(timeout=7200)\n",
"\n",
" # Add an explanation\n",
" explain_job = rai_explanation_component(\n",
" comment=\"Explanation for the programmers dataset\",\n",
" rai_insights_dashboard=create_rai_job.outputs.rai_insights_dashboard,\n",
" )\n",
" explain_job.set_limits(timeout=7200)\n",
"\n",
" # Add causal analysis\n",
" causal_job = rai_causal_component(\n",
" rai_insights_dashboard=create_rai_job.outputs.rai_insights_dashboard,\n",
" treatment_features=treatment_features,\n",
" )\n",
" causal_job.set_limits(timeout=7200)\n",
"\n",
" # Add counterfactual analysis\n",
" counterfactual_job = rai_counterfactual_component(\n",
" rai_insights_dashboard=create_rai_job.outputs.rai_insights_dashboard,\n",
" total_cfs=10,\n",
" desired_range=desired_range,\n",
" )\n",
" counterfactual_job.set_limits(timeout=7200)\n",
"\n",
" # Add error analysis\n",
" erroranalysis_job = rai_erroranalysis_component(\n",
" rai_insights_dashboard=create_rai_job.outputs.rai_insights_dashboard,\n",
" filter_features=filter_columns,\n",
" )\n",
" erroranalysis_job.set_limits(timeout=7200)\n",
"\n",
" # Combine everything\n",
" rai_gather_job = rai_gather_component(\n",
" constructor=create_rai_job.outputs.rai_insights_dashboard,\n",
" insight_1=explain_job.outputs.explanation,\n",
" insight_2=causal_job.outputs.causal,\n",
" insight_3=counterfactual_job.outputs.counterfactual,\n",
" insight_4=erroranalysis_job.outputs.error_analysis,\n",
" )\n",
" rai_gather_job.set_limits(timeout=7200)\n",
"\n",
" rai_gather_job.outputs.dashboard.mode = \"upload\"\n",
" rai_gather_job.outputs.ux_json.mode = \"upload\"\n",
"\n",
" # Generate score card in pdf format for a summary report on model performance,\n",
" # and observe distrbution of error between prediction vs ground truth.\n",
" rai_scorecard_job = rai_scorecard_component(\n",
" dashboard=rai_gather_job.outputs.dashboard,\n",
" pdf_generation_config=score_card_config_path,\n",
" )\n",
"\n",
" return {\n",
" \"dashboard\": rai_gather_job.outputs.dashboard,\n",
" \"ux_json\": rai_gather_job.outputs.ux_json,\n",
" \"scorecard\": rai_scorecard_job.outputs.scorecard,\n",
" }"
]
},
{
"cell_type": "markdown",
"id": "6b5b14a9",
"metadata": {},
"source": [
"Next, we define the pipeline object itself, and ensure that the outputs will be available for download:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e4d86ec2",
"metadata": {},
"outputs": [],
"source": [
"import uuid\n",
"from azure.ai.ml import Output\n",
"\n",
"insights_pipeline_job = rai_programmer_regression_pipeline(\n",
" target_column_name=target_column_name,\n",
" train_data=programmers_train_mltable,\n",
" test_data=programmers_test_mltable,\n",
" score_card_config_path=score_card_config_path,\n",
")\n",
"\n",
"rand_path = str(uuid.uuid4())\n",
"insights_pipeline_job.outputs.dashboard = Output(\n",
" path=f\"azureml://datastores/workspaceblobstore/paths/{rand_path}/dashboard/\",\n",
" mode=\"upload\",\n",
" type=\"uri_folder\",\n",
")\n",
"insights_pipeline_job.outputs.ux_json = Output(\n",
" path=f\"azureml://datastores/workspaceblobstore/paths/{rand_path}/ux_json/\",\n",
" mode=\"upload\",\n",
" type=\"uri_folder\",\n",
")\n",
"insights_pipeline_job.outputs.scorecard = Output(\n",
" path=f\"azureml://datastores/workspaceblobstore/paths/{rand_path}/scorecard/\",\n",
" mode=\"upload\",\n",
" type=\"uri_folder\",\n",
")"
]
},
{
"cell_type": "markdown",
"id": "25f34573",
"metadata": {},
"source": [
"And submit the pipeline to AzureML for execution:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2ca757f7",
"metadata": {},
"outputs": [],
"source": [
"insights_job = submit_and_wait(ml_client, insights_pipeline_job)"
]
},
{
"cell_type": "markdown",
"id": "1381768a",
"metadata": {},
"source": [
"The dashboard should appear in the AzureML portal in the registered model view. The following cell computes the expected URI:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e86ab611",
"metadata": {},
"outputs": [],
"source": [
"sub_id = ml_client._operation_scope.subscription_id\n",
"rg_name = ml_client._operation_scope.resource_group_name\n",
"ws_name = ml_client.workspace_name\n",
"\n",
"expected_uri = f\"https://ml.azure.com/model/{expected_model_id}/model_analysis?wsid=/subscriptions/{sub_id}/resourcegroups/{rg_name}/workspaces/{ws_name}\"\n",
"\n",
"print(f\"Please visit {expected_uri} to see your analysis\")"
]
},
{
"cell_type": "markdown",
"id": "3513d61c",
"metadata": {},
"source": [
"## Downloading the Scorecard PDF\n",
"\n",
"We can download the scorecard PDF from our pipeline as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "55e1350c",
"metadata": {},
"outputs": [],
"source": [
"target_directory = \".\"\n",
"\n",
"ml_client.jobs.download(\n",
" insights_job.name, download_path=target_directory, output_name=\"scorecard\"\n",
")"
]
}
],
"metadata": {
"celltoolbar": "Raw Cell Format",
"kernelspec": {
"display_name": "Python 3.10 - SDK V2",
"language": "python",
"name": "python310-sdkv2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.13"
},
"vscode": {
"interpreter": {
"hash": "8fd340b5477ca1a0b454d48a3973beff39fee032ada47a04f6f3725b469a8988"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}