sdk/python/responsible-ai/tabular/responsibleaidashboard-housing-classification-model-debugging/responsibleaidashboard-housing-classification-model-debugging.ipynb

{ "cells": [ { "cell_type": "markdown", "id": "587f4596", "metadata": {}, "source": [ "# Debug housing price predictions" ] }, { "cell_type": "markdown", "id": "df6550b0", "metadata": {}, "source": [ "This notebook demonstrates the use of the AzureML RAI components to assess a classification model trained on Kaggle's apartments dataset (https://www.kaggle.com/alphaepsilon/housing-prices-dataset). The model predicts if the house sells for more than median price or not. It is a reimplementation of the [notebook of the same name](https://github.com/microsoft/responsible-ai-toolbox/blob/main/notebooks/responsibleaidashboard/responsibleaidashboard-housing-classification-model-debugging.ipynb) in the [Responsible AI toolbox repo](https://github.com/microsoft/responsible-ai-toolbox).\n", "\n", "First, we need to specify the version of the RAI components which are available in the workspace. This was specified when the components were uploaded, and will have defaulted to '1':" ] }, { "cell_type": "code", "execution_count": null, "id": "30f41ad0", "metadata": {}, "outputs": [], "source": [ "version_string = \"1\"" ] }, { "cell_type": "markdown", "id": "f6f81e5d", "metadata": {}, "source": [ "We also need to give the name of the compute cluster we want to use in AzureML. Later in this notebook, we will create it if it does not already exist:" ] }, { "cell_type": "code", "execution_count": null, "id": "ec86d5f4", "metadata": {}, "outputs": [], "source": [ "compute_name = \"rai-cluster\"" ] }, { "cell_type": "markdown", "id": "23b96bc7", "metadata": {}, "source": [ "Finally, we need to specify a version for the data and components we will create while running this notebook. This should be unique for the workspace, but the specific value doesn't matter:" ] }, { "cell_type": "code", "execution_count": null, "id": "dd1ea3ee", "metadata": {}, "outputs": [], "source": [ "rai_housing_example_version_string = \"1\"" ] }, { "cell_type": "markdown", "id": "0a4a4bb7", "metadata": {}, "source": [ "## Configure workspace details and get a handle to the workspace\n", "\n", "To connect to a workspace, we need identifier parameters - a subscription, resource group and workspace name. We will use these details in the MLClient from `azure.ai.ml` to get a handle to the required Azure Machine Learning workspace." 
] }, { "cell_type": "code", "execution_count": null, "id": "80ecc1c1", "metadata": {}, "outputs": [], "source": [ "# Enter details of your AML workspace\n", "subscription_id = \"<SUBSCRIPTION_ID>\"\n", "resource_group = \"<RESOURCE_GROUP>\"\n", "workspace = \"<AML_WORKSPACE_NAME>\"" ] }, { "cell_type": "code", "execution_count": null, "id": "2b3b145d", "metadata": {}, "outputs": [], "source": [ "# Handle to the workspace\n", "from azure.ai.ml import MLClient\n", "from azure.identity import DefaultAzureCredential\n", "\n", "credential = DefaultAzureCredential()\n", "ml_client = MLClient(\n", " credential=credential,\n", " subscription_id=subscription_id,\n", " resource_group_name=resource_group,\n", " workspace_name=workspace,\n", ")\n", "print(ml_client)" ] }, { "cell_type": "code", "execution_count": null, "id": "f105e9a1", "metadata": {}, "outputs": [], "source": [ "# Get handle to azureml registry for the RAI built in components\n", "registry_name = \"azureml\"\n", "ml_client_registry = MLClient(\n", " credential=credential,\n", " subscription_id=subscription_id,\n", " resource_group_name=resource_group,\n", " registry_name=registry_name,\n", ")\n", "print(ml_client_registry)" ] }, { "cell_type": "markdown", "id": "b20f25c3", "metadata": {}, "source": [ "## Accessing the Data\n", "\n", "The following section examines the code necessary to create datasets and a model using components in AzureML." ] }, { "cell_type": "markdown", "id": "4736295c", "metadata": {}, "source": [ "### Fetching the data" ] }, { "cell_type": "code", "execution_count": null, "id": "8f8daaf5", "metadata": {}, "outputs": [], "source": [ "train_data_path = \"data-housing-classification/train/\"" ] }, { "cell_type": "code", "execution_count": null, "id": "1bfcb5c5", "metadata": {}, "outputs": [], "source": [ "test_data_path = \"data-housing-classification/test/\"" ] }, { "cell_type": "markdown", "id": "40495cbd", "metadata": {}, "source": [ "Load some data for a quick view:" ] }, { "cell_type": "code", "execution_count": null, "id": "273f7614", "metadata": {}, "outputs": [], "source": [ "import os\n", "import pandas as pd\n", "import mltable\n", "\n", "tbl = mltable.load(train_data_path)\n", "train_df: pd.DataFrame = tbl.to_pandas_dataframe()\n", "\n", "# test dataset should have less than 5000 rows\n", "test_df = mltable.load(test_data_path).to_pandas_dataframe()\n", "assert len(test_df.index) <= 5000\n", "\n", "display(train_df)" ] }, { "cell_type": "markdown", "id": "8a3cc435", "metadata": {}, "source": [ "We are going to create two Datasets in AzureML, one for the train and one for the test datasets. We can then define the Datasets, and create them in AzureML. This will also upload the Parquet files." 
] }, { "cell_type": "code", "execution_count": null, "id": "3194459a", "metadata": {}, "outputs": [], "source": [ "from azure.ai.ml.entities import Data\n", "from azure.ai.ml.constants import AssetTypes\n", "\n", "input_train_data = \"housing_train_pq\"\n", "input_test_data = \"housing_test_pq\"\n", "\n", "\n", "try:\n", " # Try getting data already registered in workspace\n", " train_data = ml_client.data.get(\n", " name=input_train_data, version=rai_housing_example_version_string\n", " )\n", " test_data = ml_client.data.get(\n", " name=input_test_data, version=rai_housing_example_version_string\n", " )\n", "except Exception as e:\n", " train_data = Data(\n", " path=train_data_path,\n", " type=AssetTypes.MLTABLE,\n", " description=\"RAI housing example training data\",\n", " name=input_train_data,\n", " version=rai_housing_example_version_string,\n", " )\n", " ml_client.data.create_or_update(train_data)\n", "\n", " test_data = Data(\n", " path=test_data_path,\n", " type=AssetTypes.MLTABLE,\n", " description=\"RAI housing example test data\",\n", " name=input_test_data,\n", " version=rai_housing_example_version_string,\n", " )\n", " ml_client.data.create_or_update(test_data)" ] }, { "cell_type": "markdown", "id": "afa1edb8", "metadata": {}, "source": [ "## A model training pipeline\n", "\n", "To simplify the model creation process, we're going to use a pipeline. This will have two stages:\n", "\n", "1. The actual training component\n", "1. A model registration component\n", "\n", "We have to register the model in AzureML in order for our RAI insights components to use it.\n", "\n", "### The Training Component\n", "\n", "The training component is for this particular model. In this case, we are going to train an `LGBMClassifier` on the input data and save it using MLFlow. 
We need command line arguments to specify the location of the input data, the location where MLFlow should write the output model, and the name of the target column in the dataset.\n", "\n", "We start by creating a directory to hold the component source:" ] }, { "cell_type": "code", "execution_count": null, "id": "dfcce044", "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "os.makedirs(\"component_src\", exist_ok=True)\n", "os.makedirs(\"register_model_src\", exist_ok=True)" ] }, { "cell_type": "markdown", "id": "b549b192", "metadata": {}, "source": [ "Now, the component source code itself:" ] }, { "cell_type": "code", "execution_count": null, "id": "d8e07198", "metadata": {}, "outputs": [], "source": [ "%%writefile component_src/housing_training_script.py\n", "\n", "import argparse\n", "import os\n", "import shutil\n", "import tempfile\n", "\n", "\n", "from azureml.core import Run\n", "\n", "import mlflow\n", "import mlflow.sklearn\n", "\n", "import mltable\n", "\n", "import pandas as pd\n", "from lightgbm import LGBMClassifier\n", "\n", "def parse_args():\n", " # setup arg parser\n", " parser = argparse.ArgumentParser()\n", "\n", " # add arguments\n", " parser.add_argument(\"--training_data\", type=str, help=\"Path to training data\")\n", " parser.add_argument(\"--target_column_name\", type=str, help=\"Name of target column\")\n", " parser.add_argument(\"--model_output\", type=str, help=\"Path of output model\")\n", "\n", " # parse args\n", " args = parser.parse_args()\n", "\n", " # return args\n", " return args\n", "\n", "\n", "def main(args):\n", " current_experiment = Run.get_context().experiment\n", " tracking_uri = current_experiment.workspace.get_mlflow_tracking_uri()\n", " print(\"tracking_uri: {0}\".format(tracking_uri))\n", " mlflow.set_tracking_uri(tracking_uri)\n", " mlflow.set_experiment(current_experiment.name)\n", "\n", " # Read in data\n", " print(\"Reading data\")\n", " tbl = mltable.load(args.training_data)\n", " all_data = tbl.to_pandas_dataframe()\n", "\n", " print(\"Extracting X_train, y_train\")\n", " print(\"all_data cols: {0}\".format(all_data.columns))\n", " y_train = all_data[args.target_column_name]\n", " X_train = all_data.drop(labels=args.target_column_name, axis=\"columns\")\n", " print(\"X_train cols: {0}\".format(X_train.columns))\n", "\n", " print(\"Training model\")\n", " # The estimator can be changed to suit\n", " model = LGBMClassifier(n_estimators=5)\n", " model.fit(X_train, y_train)\n", "\n", " # Saving model with mlflow - leave this section unchanged\n", " with tempfile.TemporaryDirectory() as td:\n", " print(\"Saving model with MLFlow to temporary directory\")\n", " tmp_output_dir = os.path.join(td, \"my_model_dir\")\n", " mlflow.sklearn.save_model(sk_model=model, path=tmp_output_dir)\n", "\n", " print(\"Copying MLFlow model to output path\")\n", " for file_name in os.listdir(tmp_output_dir):\n", " print(\" Copying: \", file_name)\n", " # As of Python 3.8, copytree will acquire dirs_exist_ok as\n", " # an option, removing the need for listdir\n", " shutil.copy2(src=os.path.join(tmp_output_dir, file_name), dst=os.path.join(args.model_output, file_name))\n", "\n", "\n", "# run script\n", "if __name__ == \"__main__\":\n", " # add space in logs\n", " print(\"*\" * 60)\n", " print(\"\\n\\n\")\n", "\n", " # parse args\n", " args = parse_args()\n", "\n", " # run main function\n", " main(args)\n", "\n", " # add space in logs\n", " print(\"*\" * 60)\n", " print(\"\\n\\n\")" ] }, { "cell_type": "code", "execution_count": null, "id": 
"4570e574", "metadata": {}, "outputs": [], "source": [ "%%writefile register_model_src/register.py\n", "\n", "# ---------------------------------------------------------\n", "# Copyright (c) Microsoft Corporation. All rights reserved.\n", "# ---------------------------------------------------------\n", "\n", "import argparse\n", "import json\n", "import os\n", "import time\n", "\n", "\n", "from azureml.core import Run\n", "\n", "import mlflow\n", "import mlflow.sklearn\n", "\n", "# Based on example:\n", "# https://docs.microsoft.com/en-us/azure/machine-learning/how-to-train-cli\n", "# which references\n", "# https://github.com/Azure/azureml-examples/tree/main/cli/jobs/train/lightgbm/iris\n", "\n", "\n", "def parse_args():\n", " # setup arg parser\n", " parser = argparse.ArgumentParser()\n", "\n", " # add arguments\n", " parser.add_argument(\"--model_input_path\", type=str, help=\"Path to input model\")\n", " parser.add_argument(\n", " \"--model_info_output_path\", type=str, help=\"Path to write model info JSON\"\n", " )\n", " parser.add_argument(\n", " \"--model_base_name\", type=str, help=\"Name of the registered model\"\n", " )\n", " parser.add_argument(\n", " \"--model_name_suffix\", type=int, help=\"Set negative to use epoch_secs\"\n", " )\n", "\n", " # parse args\n", " args = parser.parse_args()\n", "\n", " # return args\n", " return args\n", "\n", "\n", "def main(args):\n", " current_experiment = Run.get_context().experiment\n", " tracking_uri = current_experiment.workspace.get_mlflow_tracking_uri()\n", " print(\"tracking_uri: {0}\".format(tracking_uri))\n", " mlflow.set_tracking_uri(tracking_uri)\n", " mlflow.set_experiment(current_experiment.name)\n", "\n", " print(\"Loading model\")\n", " mlflow_model = mlflow.sklearn.load_model(args.model_input_path)\n", "\n", " if args.model_name_suffix < 0:\n", " suffix = int(time.time())\n", " else:\n", " suffix = args.model_name_suffix\n", " registered_name = \"{0}_{1}\".format(args.model_base_name, suffix)\n", " print(f\"Registering model as {registered_name}\")\n", "\n", " print(\"Registering via MLFlow\")\n", " mlflow.sklearn.log_model(\n", " sk_model=mlflow_model,\n", " registered_model_name=registered_name,\n", " artifact_path=registered_name,\n", " )\n", "\n", " print(\"Writing JSON\")\n", " dict = {\"id\": \"{0}:1\".format(registered_name)}\n", " output_path = os.path.join(args.model_info_output_path, \"model_info.json\")\n", " with open(output_path, \"w\") as of:\n", " json.dump(dict, fp=of)\n", "\n", "\n", "# run script\n", "if __name__ == \"__main__\":\n", " # add space in logs\n", " print(\"*\" * 60)\n", " print(\"\\n\\n\")\n", "\n", " # parse args\n", " args = parse_args()\n", "\n", " # run main function\n", " main(args)\n", "\n", " # add space in logs\n", " print(\"*\" * 60)\n", " print(\"\\n\\n\")\n" ] }, { "cell_type": "markdown", "id": "cbad789b", "metadata": {}, "source": [ "Now that the training script is saved on our local drive, we create a YAML file to describe it as a component to AzureML. 
This involves defining the inputs and outputs, specifying the AzureML environment which can run the script, and telling AzureML how to invoke the training script:" ] }, { "cell_type": "code", "execution_count": null, "id": "9feb7548", "metadata": {}, "outputs": [], "source": [ "from azure.ai.ml import load_component\n", "\n", "# Inside the f-string, '{{{{' and '}}}}' render as literal '{{' and '}}',\n", "# producing the AzureML ${{...}} binding syntax in the written YAML.\n", "yaml_contents = f\"\"\"\n", "$schema: http://azureml/sdk-2-0/CommandComponent.json\n", "name: rai_housing_training_component\n", "display_name: Housing training component for RAI example\n", "version: {rai_housing_example_version_string}\n", "type: command\n", "inputs:\n", " training_data:\n", " type: path\n", " target_column_name:\n", " type: string\n", "outputs:\n", " model_output:\n", " type: path\n", "code: ./component_src/\n", "environment: azureml://registries/azureml/environments/responsibleai-tabular/versions/14\n", "command: >-\n", " python housing_training_script.py\n", " --training_data ${{{{inputs.training_data}}}}\n", " --target_column_name ${{{{inputs.target_column_name}}}}\n", " --model_output ${{{{outputs.model_output}}}}\n", "\"\"\"\n", "\n", "yaml_filename = \"RAIHousingTrainingComponent.yaml\"\n", "\n", "with open(yaml_filename, \"w\") as f:\n", " f.write(yaml_contents)\n", "\n", "train_model_component = load_component(source=yaml_filename)" ] }, { "cell_type": "code", "execution_count": null, "id": "0d46d6cf", "metadata": {}, "outputs": [], "source": [ "yaml_contents = f\"\"\"\n", "$schema: http://azureml/sdk-2-0/CommandComponent.json\n", "name: register_model\n", "display_name: Register Model\n", "version: {rai_housing_example_version_string}\n", "type: command\n", "is_deterministic: False\n", "inputs:\n", " model_input_path:\n", " type: path\n", " model_base_name:\n", " type: string\n", " model_name_suffix: # Set negative to use epoch_secs\n", " type: integer\n", " default: -1\n", "outputs:\n", " model_info_output_path:\n", " type: path\n", "code: ./register_model_src/\n", "environment: azureml://registries/azureml/environments/responsibleai-tabular/versions/14\n", "command: >-\n", " python register.py\n", " --model_input_path ${{{{inputs.model_input_path}}}}\n", " --model_base_name ${{{{inputs.model_base_name}}}}\n", " --model_name_suffix ${{{{inputs.model_name_suffix}}}}\n", " --model_info_output_path ${{{{outputs.model_info_output_path}}}}\n", "\n", "\"\"\"\n", "\n", "yaml_filename = \"register.yaml\"\n", "\n", "with open(yaml_filename, \"w\") as f:\n", " f.write(yaml_contents)\n", "\n", "register_component = load_component(source=yaml_filename)" ] }, { "cell_type": "markdown", "id": "59d994bd", "metadata": {}, "source": [ "We need a compute target on which to run our jobs. The following checks whether the compute specified above is present; if not, then the compute target is created."
] }, { "cell_type": "code", "execution_count": null, "id": "577a5d68", "metadata": {}, "outputs": [], "source": [ "from azure.ai.ml.entities import AmlCompute\n", "\n", "all_compute_names = [x.name for x in ml_client.compute.list()]\n", "\n", "if compute_name in all_compute_names:\n", " print(f\"Found existing compute: {compute_name}\")\n", "else:\n", " my_compute = AmlCompute(\n", " name=compute_name,\n", " size=\"Standard_D2_v2\",\n", " min_instances=0,\n", " max_instances=4,\n", " idle_time_before_scale_down=3600,\n", " )\n", " ml_client.compute.begin_create_or_update(my_compute).result()\n", " print(\"Initiated compute creation\")" ] }, { "cell_type": "markdown", "id": "c19ec415", "metadata": {}, "source": [ "### Running a training pipeline\n", "\n", "The component to register the model is part of the suite of RAI components, so we do not have to define it here. As such, we are now ready to run the training pipeline itself.\n", "\n", "We start by defining the name under which we want to register the model:" ] }, { "cell_type": "code", "execution_count": null, "id": "2ad4a77a", "metadata": {}, "outputs": [], "source": [ "import time\n", "\n", "model_name_suffix = int(time.time())\n", "model_name = \"rai_housing_classifier\"" ] }, { "cell_type": "markdown", "id": "df5f3361", "metadata": {}, "source": [ "Next, we define the pipeline using objects from the AzureML SDKv2. As mentioned above, there are two component jobs: one to train the model, and one to register it:" ] }, { "cell_type": "code", "execution_count": null, "id": "449dcda8", "metadata": {}, "outputs": [], "source": [ "from azure.ai.ml import dsl, Input\n", "\n", "target_feature = \"Sold_HigherThan_Median\"\n", "categorical_features = []\n", "\n", "housing_train_pq = Input(\n", " type=\"mltable\",\n", " path=f\"azureml:{input_train_data}:{rai_housing_example_version_string}\",\n", " mode=\"download\",\n", ")\n", "housing_test_pq = Input(\n", " type=\"mltable\",\n", " path=f\"azureml:{input_test_data}:{rai_housing_example_version_string}\",\n", " mode=\"download\",\n", ")\n", "\n", "\n", "@dsl.pipeline(\n", " compute=compute_name,\n", " description=\"Register Model for RAI Housing example\",\n", " experiment_name=f\"RAI_Housing_Example_Model_Training_{model_name_suffix}\",\n", ")\n", "def my_training_pipeline(target_column_name, training_data):\n", " trained_model = train_model_component(\n", " target_column_name=target_column_name, training_data=training_data\n", " )\n", " trained_model.set_limits(timeout=3600)\n", "\n", " _ = register_component(\n", " model_input_path=trained_model.outputs.model_output,\n", " model_base_name=model_name,\n", " model_name_suffix=model_name_suffix,\n", " )\n", "\n", " return {}\n", "\n", "\n", "model_registration_pipeline_job = my_training_pipeline(target_feature, housing_train_pq)" ] }, { "cell_type": "markdown", "id": "6397292c", "metadata": {}, "source": [ "With the pipeline definition created, we can submit it to AzureML. 
We define a helper function to do the submission, which waits for the submitted job to complete:" ] }, { "cell_type": "code", "execution_count": null, "id": "0c33154c", "metadata": {}, "outputs": [], "source": [ "from azure.ai.ml.entities import PipelineJob\n", "from IPython.core.display import HTML\n", "from IPython.display import display\n", "\n", "\n", "def submit_and_wait(ml_client, pipeline_job) -> PipelineJob:\n", " created_job = ml_client.jobs.create_or_update(pipeline_job)\n", " assert created_job is not None\n", "\n", " print(\"Pipeline job can be accessed at the following URL:\")\n", " display(HTML('<a href=\"{0}\">{0}</a>'.format(created_job.studio_url)))\n", "\n", " while created_job.status not in [\n", " \"Completed\",\n", " \"Failed\",\n", " \"Canceled\",\n", " \"NotResponding\",\n", " ]:\n", " time.sleep(30)\n", " created_job = ml_client.jobs.get(created_job.name)\n", " print(\"Latest status: {0}\".format(created_job.status))\n", " assert created_job.status == \"Completed\"\n", " return created_job\n", "\n", "\n", "# This is the actual submission\n", "training_job = submit_and_wait(ml_client, model_registration_pipeline_job)" ] }, { "cell_type": "markdown", "id": "9b29ec13", "metadata": {}, "source": [ "## Creating the RAI Insights\n", "\n", "We have a registered model, and can now run a pipeline to create the RAI insights. First, we construct the ID of the model we just registered:" ] }, { "cell_type": "code", "execution_count": null, "id": "3bae879b", "metadata": {}, "outputs": [], "source": [ "expected_model_id = f\"{model_name}_{model_name_suffix}:1\"\n", "azureml_model_id = f\"azureml:{expected_model_id}\"" ] },
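{ "cell_type": "markdown", "id": "b0c1d2e3", "metadata": {}, "source": [ "As an optional sanity check (a sketch; `registered_model` is just a local name, and we assume the training pipeline above completed), we can fetch the registered model to confirm it exists before building the RAI pipeline:" ] }, { "cell_type": "code", "execution_count": null, "id": "b0c1d2e4", "metadata": {}, "outputs": [], "source": [ "# Optional sanity check: confirm the model registration succeeded\n", "registered_model = ml_client.models.get(\n", "    name=f\"{model_name}_{model_name_suffix}\", version=\"1\"\n", ")\n", "print(registered_model.id)" ] },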
{ "cell_type": "markdown", "id": "5147d43f", "metadata": {}, "source": [ "Now, we create the RAI pipeline itself. There are four 'component stages' in this pipeline:\n", "\n", "1. Construct an empty `RAIInsights` object\n", "1. Run the RAI tool components\n", "1. Gather the tool outputs into a single `RAIInsights` object\n", "1. Generate a scorecard PDF from the gathered insights\n", "\n", "We start by loading the RAI component definitions for use in our pipeline:" ] }, { "cell_type": "code", "execution_count": null, "id": "b1efc849", "metadata": {}, "outputs": [], "source": [ "label = \"latest\"\n", "\n", "rai_constructor_component = ml_client_registry.components.get(\n", " name=\"rai_tabular_insight_constructor\", label=label\n", ")\n", "\n", "# We get the latest version and use the same version for all components\n", "version = rai_constructor_component.version\n", "print(\"The current version of RAI built-in components is: \" + version)\n", "\n", "rai_explanation_component = ml_client_registry.components.get(\n", " name=\"rai_tabular_explanation\", version=version\n", ")\n", "\n", "rai_causal_component = ml_client_registry.components.get(\n", " name=\"rai_tabular_causal\", version=version\n", ")\n", "\n", "rai_counterfactual_component = ml_client_registry.components.get(\n", " name=\"rai_tabular_counterfactual\", version=version\n", ")\n", "\n", "rai_erroranalysis_component = ml_client_registry.components.get(\n", " name=\"rai_tabular_erroranalysis\", version=version\n", ")\n", "\n", "rai_gather_component = ml_client_registry.components.get(\n", " name=\"rai_tabular_insight_gather\", version=version\n", ")\n", "\n", "rai_scorecard_component = ml_client_registry.components.get(\n", " name=\"rai_tabular_score_card\", version=version\n", ")" ] }, { "cell_type": "markdown", "id": "6358cae7", "metadata": {}, "source": [ "## Scorecard generation config\n", "\n", "For scorecard generation, we need some additional configuration in a separate JSON file. Here we configure the following model performance metrics for reporting:\n", "\n", "- accuracy\n", "- precision" ] }, { "cell_type": "code", "execution_count": null, "id": "40af5cce", "metadata": {}, "outputs": [], "source": [ "import json\n", "\n", "score_card_config_dict = {\n", " \"Model\": {\n", " \"ModelName\": \"Housing classification\",\n", " \"ModelType\": \"Classification\",\n", " \"ModelSummary\": \"<model summary>\",\n", " },\n", " \"Metrics\": {\"accuracy_score\": {\"threshold\": \">=0.7\"}, \"precision_score\": {}},\n", "}\n", "\n", "score_card_config_filename = \"rai_housing_classification_score_card_config.json\"\n", "\n", "with open(score_card_config_filename, \"w\") as f:\n", " json.dump(score_card_config_dict, f)\n", "\n", "score_card_config_path = Input(\n", " type=\"uri_file\", path=score_card_config_filename, mode=\"download\"\n", ")" ] }, { "cell_type": "markdown", "id": "f2aeb94d", "metadata": {}, "source": [ "Now the pipeline itself. This creates an empty `RAIInsights` object, adds the analyses, and then gathers everything into the final `RAIInsights` output. More complex objects (such as lists of feature names) have to be converted to JSON strings before being passed to the components.
Note that every component job below is given a generous timeout, since some of the analyses (counterfactual generation in particular) are relatively slow:" ] }, { "cell_type": "code", "execution_count": null, "id": "88eae294", "metadata": {}, "outputs": [], "source": [ "import json\n", "from azure.ai.ml import Input\n", "from azure.ai.ml.constants import AssetTypes\n", "\n", "classes_in_target = json.dumps([\"Less than median\", \"More than median\"])\n", "treatment_features = json.dumps(\n", " [\"OverallCond\", \"OverallQual\", \"Fireplaces\", \"GarageCars\", \"ScreenPorch\"]\n", ")\n", "\n", "\n", "@dsl.pipeline(\n", " compute=compute_name,\n", " description=\"Example RAI computation on housing data\",\n", " experiment_name=f\"RAI_Housing_Example_RAIInsights_Computation_{model_name_suffix}\",\n", ")\n", "def rai_classification_pipeline(\n", " target_column_name,\n", " train_data,\n", " test_data,\n", " score_card_config_path,\n", "):\n", " # Initiate the RAIInsights\n", " create_rai_job = rai_constructor_component(\n", " title=\"RAI Dashboard Example\",\n", " task_type=\"classification\",\n", " model_info=expected_model_id,\n", " model_input=Input(type=AssetTypes.MLFLOW_MODEL, path=azureml_model_id),\n", " train_dataset=train_data,\n", " test_dataset=test_data,\n", " target_column_name=target_column_name,\n", " categorical_column_names=json.dumps(categorical_features),\n", " classes=classes_in_target,\n", " # If your model has extra dependencies and your Responsible AI job fails to\n", " # load the MLflow model with a ValueError, try setting use_model_dependency to True.\n", " # If you have further questions, contact askamlrai@microsoft.com\n", " use_model_dependency=True,\n", " )\n", " create_rai_job.set_limits(timeout=7200)\n", "\n", " # Add an explanation\n", " explain_job = rai_explanation_component(\n", " comment=\"Explanation for the housing dataset\",\n", " rai_insights_dashboard=create_rai_job.outputs.rai_insights_dashboard,\n", " )\n", " explain_job.set_limits(timeout=7200)\n", "\n", " # Add causal analysis\n", " causal_job = rai_causal_component(\n", " rai_insights_dashboard=create_rai_job.outputs.rai_insights_dashboard,\n", " treatment_features=treatment_features,\n", " )\n", " causal_job.set_limits(timeout=7200)\n", "\n", " # Add counterfactual analysis\n", " counterfactual_job = rai_counterfactual_component(\n", " rai_insights_dashboard=create_rai_job.outputs.rai_insights_dashboard,\n", " total_cfs=10,\n", " desired_class=\"opposite\",\n", " )\n", " counterfactual_job.set_limits(timeout=7200)\n", "\n", " # Add error analysis\n", " erroranalysis_job = rai_erroranalysis_component(\n", " rai_insights_dashboard=create_rai_job.outputs.rai_insights_dashboard,\n", " )\n", " erroranalysis_job.set_limits(timeout=7200)\n", "\n", " # Combine everything\n", " rai_gather_job = rai_gather_component(\n", " constructor=create_rai_job.outputs.rai_insights_dashboard,\n", " insight_1=explain_job.outputs.explanation,\n", " insight_2=causal_job.outputs.causal,\n", " insight_3=counterfactual_job.outputs.counterfactual,\n", " insight_4=erroranalysis_job.outputs.error_analysis,\n", " )\n", " rai_gather_job.set_limits(timeout=7200)\n", "\n", " rai_gather_job.outputs.dashboard.mode = \"upload\"\n", " rai_gather_job.outputs.ux_json.mode = \"upload\"\n", "\n", " # Generate a scorecard in PDF format: a summary report on model performance\n", " # and the distribution of error between predictions and ground truth.\n", " rai_scorecard_job = rai_scorecard_component(\n", " dashboard=rai_gather_job.outputs.dashboard,\n", " pdf_generation_config=score_card_config_path,\n", " )\n", "\n", " return {\n", " \"dashboard\": rai_gather_job.outputs.dashboard,\n", " \"ux_json\": rai_gather_job.outputs.ux_json,\n", " \"scorecard\": rai_scorecard_job.outputs.scorecard,\n", " }" ] }, { "cell_type": "markdown", "id": "2f99088d", "metadata": {}, "source": [ "With the pipeline defined, we can instantiate it and configure where its outputs are written:" ] }, { "cell_type": "code", "execution_count": null, "id": "ec38eac3", "metadata": {}, "outputs": [], "source": [ "import uuid\n", "from azure.ai.ml import Output\n", "\n", "# Pipeline to construct the RAI Insights\n", "insights_pipeline_job = rai_classification_pipeline(\n", " target_column_name=target_feature,\n", " train_data=housing_train_pq,\n", " test_data=housing_test_pq,\n", " score_card_config_path=score_card_config_path,\n", ")\n", "\n", "# Workaround to enable the download\n", "rand_path = str(uuid.uuid4())\n", "insights_pipeline_job.outputs.dashboard = Output(\n", " path=f\"azureml://datastores/workspaceblobstore/paths/{rand_path}/dashboard/\",\n", " mode=\"upload\",\n", " type=\"uri_folder\",\n", ")\n", "insights_pipeline_job.outputs.ux_json = Output(\n", " path=f\"azureml://datastores/workspaceblobstore/paths/{rand_path}/ux_json/\",\n", " mode=\"upload\",\n", " type=\"uri_folder\",\n", ")\n", "insights_pipeline_job.outputs.scorecard = Output(\n", " path=f\"azureml://datastores/workspaceblobstore/paths/{rand_path}/scorecard/\",\n", " mode=\"upload\",\n", " type=\"uri_folder\",\n", ")" ] }, { "cell_type": "markdown", "id": "bbcdb183", "metadata": {}, "source": [ "Now, submit the pipeline job and wait for it to complete:" ] }, { "cell_type": "code", "execution_count": null, "id": "01d13bfa", "metadata": {}, "outputs": [], "source": [ "insights_job = submit_and_wait(ml_client, insights_pipeline_job)" ] }, { "cell_type": "markdown", "id": "3f24f236", "metadata": {}, "source": [ "## Downloading the Scorecard PDF\n", "\n", "We can download the scorecard PDF from our pipeline as follows:" ] }, { "cell_type": "code", "execution_count": null, "id": "c4edd429", "metadata": {}, "outputs": [], "source": [ "target_directory = \".\"\n", "\n", "ml_client.jobs.download(\n", " insights_job.name, download_path=target_directory, output_name=\"scorecard\"\n", ")" ] },
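{ "cell_type": "markdown", "id": "c0d1e2f3", "metadata": {}, "source": [ "The `dashboard` and `ux_json` outputs declared on the pipeline can be fetched with the same pattern if needed (an optional sketch reusing the output names defined above):" ] }, { "cell_type": "code", "execution_count": null, "id": "c0d1e2f4", "metadata": {}, "outputs": [], "source": [ "# Optional: download the dashboard data using the same pattern as the scorecard\n", "ml_client.jobs.download(\n", "    insights_job.name, download_path=target_directory, output_name=\"dashboard\"\n", ")" ] },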
{ "cell_type": "markdown", "id": "ba936b36", "metadata": {}, "source": [ "Once this is complete, we can go to the Registered Models view in the AzureML portal and find the model we have just registered. On the 'Model Details' page, there is a \"Responsible AI dashboard\" tab where we can view the insights we have just uploaded." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.10 - SDK V2", "language": "python", "name": "python310-sdkv2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" }, "vscode": { "interpreter": { "hash": "8fd340b5477ca1a0b454d48a3973beff39fee032ada47a04f6f3725b469a8988" } } }, "nbformat": 4, "nbformat_minor": 5 }