# Student Attrition Classification RAI dashboard
This notebook demonstrates the use of the `responsibleai` API to assess a classification model trained on a Fabricated Student Attrition classification dataset. The model predicts **if a university student will be retained for the next year at the university or prematurely leave the university (known as student attrition)** based on the independent features:

- FirstGenerationinCollegeFlag
- Gender
- Race
- HSGraduateorGED
- Age_Term_Min
- Age_Term_Max
- Total_Terms
- Entry_Type_DualEnrollment
- Entry_Type_EarlyAdmission
- Entry_Type_FirstTimeinCollege
- Entry_Type_Other
- Entry_Type_Re-Entry
- Entry_Type_Transfer
- AcademicProbation
- AcademicSuspension
- AcademicSuspensionFor1Year
- AcademicWarning
- ExtendProbationForLowGpa
- GoodAcademicStanding
- ProbationAfterSuspen/Dismiss
- TransferedToNonBusiness
- CumulativeGPA
- CumulativeCreditHoursEarnedPerTerm
- Blended
- FullyOnline
- RemoteLearning
- RemoteLearningBlended
- Traditional
- Adjunct
- Faculty
- Unknown_IntructorType
- PELL_Eligible

The target column is `Attrition`.


The Data Dictionary can be accessed through the following link: [Data_dictionary_Education](link-URL)

The Notebook walks through the API calls necessary to create a widget with model analysis insights, then guides a visual analysis of the model.

## **Installation**  

If you are **running the notebook for the first time**, follow these steps for smooth execution:

1. Uncomment the cell below.
2. Run the cell.
3. After the cell finishes, comment it out again.
4. Restart the kernel.
5. Continue running the remaining cells.


**Reminder** -- Be sure to set your kernel to "Python 3.8 - AzureML," via the drop-down menu at the right end of the taskbar. 

In [None]:
%pip install azure-ai-ml
%pip install scikit-learn

## **User Configuration**  
Confirm that the compute name listed here matches the one created using the included ARM template. If it does not, change this name so they match.

In [None]:
# Pass the name of your compute (see Step 6 below for its use)
compute_name = "raitextcluster"

## **After changing the above cell click on Run All.**
**The notebook will follow the below steps and complete execution in 15-30 minutes depending upon compute configurations**

## Automated Notebook steps:

**Step 1:** Loading the Data.

**Step 2:** Pre-processing.

**Step 3:** Splitting into Train Test datasets.

**Step 4:** Registering the datasets as data assets in AML.

**Step 5:** Define training and registering scripts for use in Training Pipeline.

**Step 6:** Create the compute cluster (if one with the given name does not already exist).

**Step 7:** Executing Model Training pipeline.

**Step 8:** Define components for Responsible AI Dashboard Generation Pipeline (The components are explained in later parts).

**Step 9:** Execute Dashboard Generation Pipeline (generate scorecard and save in directory).

**Step 10:** Click on the link at the end of the notebook to access the dashboard generated.

## Loading required modules

In [None]:
import pandas as pd
import numpy as np
import os

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

import zipfile
from io import BytesIO
import requests

## Accessing the Data

The following section examines the code necessary to create datasets and a model using components in AzureML.

In [None]:
def get_data(data_location, independent_features, target_feature, drop_col=None):
    """
    Function to read data in Pandas dataframe
    [TODO: Add any preprocessing steps within this function]

    Parameters
    ----------
    data_location: string
        Path of the Dataset
    independent_features: list
        List of names of the independent features
    target_feature: string
        Name of the target/dependent features
    drop_col: list
        List of column names to drop

    Returns
    -------
    df: Pandas DataFrame
        Pandas dataframe containing the dataset with the names passed
    """
    column_names = independent_features + [target_feature]

    # Download the blob data from the provided URL
    response = requests.get(data_location)
    response.raise_for_status()
    blob_content = response.content

    with zipfile.ZipFile(BytesIO(blob_content), "r") as zip_ref:
        file_list = zip_ref.namelist()
        if not file_list:
            raise ValueError("The downloaded archive contains no files")
        # Assume the first file in the zip contains the data
        inner_blob_content = zip_ref.read(file_list[0])
        df = pd.read_csv(BytesIO(inner_blob_content))

    # df = pd.read_csv(data_location)
    # Move the target column to the end, then apply the expected column names
    feature_columns = list(df.columns)
    feature_columns.remove(target_feature)
    df = df[feature_columns + [target_feature]]
    df.columns = column_names
    if drop_col is not None:
        df.drop(drop_col, axis=1, inplace=True)
    df = df.dropna()
    return df
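
The zip-handling step inside `get_data` can be tried without network access. The following is a minimal sketch rather than the notebook's own code: it builds a small archive in memory and reads its first member. The helper name `read_first_csv_from_zip` and the sample rows are hypothetical, and the standard `csv` module stands in for `pd.read_csv` to keep the sketch dependency-free.

```python
import csv
import io
import zipfile


def read_first_csv_from_zip(blob_content):
    """Return the rows of the first file in a zip archive as dictionaries."""
    with zipfile.ZipFile(io.BytesIO(blob_content), "r") as zip_ref:
        file_list = zip_ref.namelist()
        if not file_list:
            raise ValueError("The archive contains no files")
        text = zip_ref.read(file_list[0]).decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


# Build a tiny archive in memory (hypothetical sample data).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("sample.csv", "Gender,Attrition\nF,0\nM,1\n")

rows = read_first_csv_from_zip(buffer.getvalue())
print(rows)
```

Raising an error on an empty archive makes the failure visible at the download step instead of later, when the dataframe is first used.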

### Reading & Encoding the dataset

We load the data directly from a public URL and apply basic pre-processing steps.

**Categorical codes for "Attrition":**

 **Attrition: The student left the university prematurely**
 
 **Retain: The student was retained for the next year**

In [None]:
data_df = get_data(
    # data_location="./Fabricated_Student_Attrition_Data.csv",
    data_location="https://publictestdatasets.blob.core.windows.net/data/RAI_fabricated_student_attrition_data.zip",
    target_feature="Attrition",
    independent_features=[
        "FirstGenerationinCollegeFlag",
        "Gender",
        "Race",
        "HSGraduateorGED",
        "Age_Term_Min",
        "Age_Term_Max",
        "Total_Terms",
        "Entry_Type_DualEnrollment",
        "Entry_Type_EarlyAdmission",
        "Entry_Type_FirstTimeinCollege",
        "Entry_Type_Other",
        "Entry_Type_Re-Entry",
        "Entry_Type_Transfer",
        "AcademicProbation",
        "AcademicSuspension",
        "AcademicSuspensionFor1Year",
        "AcademicWarning",
        "ExtendProbationForLowGpa",
        "GoodAcademicStanding",
        "ProbationAfterSuspen/Dismiss",
        "TransferedToNonBusiness",
        "CumulativeGPA",
        "CumulativeCreditHoursEarnedPerTerm",
        "Blended",
        "FullyOnline",
        "RemoteLearning",
        "RemoteLearningBlended",
        "Traditional",
        "Adjunct",
        "Faculty",
        "Unknown_IntructorType",
        "PELL_Eligible",
    ],
)

data_encoded = data_df.copy()

attrition_encoding = {
    1: "Attrition",
    0: "Retain",
}

data_encoded.replace({"Attrition": attrition_encoding}, inplace=True)
data_encoded

### Splitting the Data into training and test datasets

In [None]:
data_train, data_test = train_test_split(
    data_encoded, test_size=0.25, random_state=31415, stratify=data_encoded["Attrition"]
)

if len(data_test) <= 5000:
    print("Proceed with the analysis")
else:
    print("Reduce your test data size")
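
The `stratify` argument keeps the ratio of `Attrition` to `Retain` labels roughly equal in the train and test sets. The sketch below illustrates that idea with the standard library only. It is not scikit-learn's algorithm, and the function name `stratified_split` and the 80/20 label mix are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Hypothetical labels: 80 retained students and 20 who left.
labels = ["Retain"] * 80 + ["Attrition"] * 20
train_idx, test_idx = stratified_split(labels, test_size=0.25, seed=31415)

test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

With these inputs the test set holds 20 `Retain` and 5 `Attrition` rows, the same 4:1 ratio as the full label list.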

### Get the Data to AzureML

With the data now split into 'train' and 'test' DataFrames, we save them out to files in preparation for upload into AzureML:

In [None]:
train_data_path = "./data_student_attrition_classification/train/"
test_data_path = "./data_student_attrition_classification/test/"

os.makedirs(train_data_path, exist_ok=True)
os.makedirs(test_data_path, exist_ok=True)

train_filename = train_data_path + "student_attrition_classification_train.parquet"
test_filename = test_data_path + "student_attrition_classification_test.parquet"

data_train.to_parquet(train_filename, index=False)
data_test.to_parquet(test_filename, index=False)

We are going to create two data assets in AzureML, one for the training data and one for the test data. The first step is to create an `MLClient` to perform the upload. The cells below build the client from the subscription ID, resource group, and workspace name, so fill in those placeholders first:

In [None]:
# Enter details of your AML workspace
subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<AML_WORKSPACE_NAME>"

In [None]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
ml_client = MLClient(
    credential=credential,
    subscription_id=subscription_id,
    resource_group_name=resource_group,
    workspace_name=workspace,
)
print(ml_client)

In [None]:
# Define Version string (optional)
rai_student_attrition_classification_example_version_string = "1"

### Create an asset MLtable (or URI file) to register the Data into workspace
This is essential, as the dashboard recognizes only registered assets.

Reference:
https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-data-assets?tabs=Python-SDK

In [None]:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

#### Change the asset name of the below file if the train/test data has changed

In [None]:
input_train_data = "train_student_attrition_classification"

try:
    # Try getting data already registered in workspace
    train_data = ml_client.data.get(
        name=input_train_data,
        version=rai_student_attrition_classification_example_version_string,
    )

except Exception as e:
    train_data = Data(
        path=train_filename,
        type=AssetTypes.URI_FILE,
        description="RAI student attrition classification example training data",
        name=input_train_data,
        version=rai_student_attrition_classification_example_version_string,
    )
    ml_client.data.create_or_update(train_data)

In [None]:
input_test_data = "test_student_attrition_classification"

try:
    # Try getting data already registered in workspace
    test_data = ml_client.data.get(
        name=input_test_data,
        version=rai_student_attrition_classification_example_version_string,
    )

except Exception as e:
    test_data = Data(
        path=test_filename,
        type=AssetTypes.URI_FILE,
        description="RAI student attrition classification example test data",
        name=input_test_data,
        version=rai_student_attrition_classification_example_version_string,
    )
    ml_client.data.create_or_update(test_data)
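
The two registration cells follow a "get, otherwise create" pattern. The sketch below shows that control flow in isolation, under stated assumptions: `InMemoryRegistry` and `get_or_register` are hypothetical stand-ins that need no workspace, and they do not reproduce the `MLClient` API.

```python
class InMemoryRegistry:
    """Hypothetical stand-in for a workspace's data-asset store."""

    def __init__(self):
        self._assets = {}

    def get(self, name, version):
        return self._assets[(name, version)]  # raises KeyError if missing

    def create_or_update(self, name, version, path):
        asset = {"name": name, "version": version, "path": path}
        self._assets[(name, version)] = asset
        return asset


def get_or_register(registry, name, version, path):
    """Reuse a registered asset, registering it only when the lookup fails."""
    try:
        return registry.get(name, version), False
    except KeyError:
        return registry.create_or_update(name, version, path), True


registry = InMemoryRegistry()
first, created_first = get_or_register(registry, "train_data", "1", "./train.parquet")
second, created_second = get_or_register(registry, "train_data", "1", "./other.parquet")
print(created_first, created_second, second["path"])
```

The second call reuses the existing entry and ignores the new path. This is why the asset name (or version) needs to change whenever the underlying train or test data changes.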

## A model training pipeline

To simplify the model creation process, we're going to use a pipeline. This will have two stages:

1. The actual training component
2. A model registration component

We have to register the model in AzureML in order for our RAI insights components to use it.

### The Training Component

The training component is specific to this model. In this case, we are going to train a `LogisticRegression` classifier on the input data and save it using MLFlow. We need command line arguments to specify the location of the input data, the location where MLFlow should write the output model, and the name of the target column in the dataset.

We start by creating a directory to hold the component source:

In [None]:
os.makedirs("./component_src", exist_ok=True)
os.makedirs("./register_model_src", exist_ok=True)

**Create the training script**  
This cell creates a machine learning pipeline that trains a Logistic classifier using labeled data and then saves the trained model to a specified output path using MLFlow. 
- The code reads in the training data as a pandas dataframe from a specified path, extracts the target column name, and separates the target column from the feature columns. 
- Feature columns are then preprocessed using both a standard scaler for numeric data and a one-hot encoder for categorical data. 
- Preprocessed feature columns and target column are then fed into the logistic regression classifier. 
- The trained model is saved to a temporary directory and then copied to the specified output path. 
- Code takes command-line arguments for the paths of the training data, the output model, and the name of the target column. 
- The code also uses the Azure Machine Learning (AML) Python SDK to log the model and tracking information with MLFlow. 
- Additional comments in the code provide details on each section of the pipeline.
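
The command-line interface described above can be checked locally before a pipeline run. This is a minimal sketch that mirrors the three arguments of the training script. The paths passed in are hypothetical, since AzureML substitutes the real ones at run time.

```python
import argparse


def build_parser():
    """Build a parser with the three arguments the training script expects."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--training_data", type=str, help="Path to training data")
    parser.add_argument("--target_column_name", type=str, help="Name of target column")
    parser.add_argument("--model_output", type=str, help="Path of output model")
    return parser


# Passing an explicit list avoids reading sys.argv, which is handy in a notebook.
args = build_parser().parse_args(
    [
        "--training_data", "./train.parquet",
        "--target_column_name", "Attrition",
        "--model_output", "./model_out",
    ]
)
print(args.target_column_name)
```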

In [None]:
%%writefile component_src/classification_training_script.py

import argparse
import os
import shutil
import tempfile


from azureml.core import Run

import mlflow
import mlflow.sklearn
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer

import pandas as pd
from sklearn.linear_model import LogisticRegression

def parse_args():
    # setup arg parser
    parser = argparse.ArgumentParser()

    # add arguments
    parser.add_argument("--training_data", type=str, help="Path to training data")
    parser.add_argument("--target_column_name", type=str, help="Name of target column")
    parser.add_argument("--model_output", type=str, help="Path of output model")

    # parse args
    args = parser.parse_args()

    # return args
    return args


def main(args):
    current_experiment = Run.get_context().experiment
    tracking_uri = current_experiment.workspace.get_mlflow_tracking_uri()
    print("tracking_uri: {0}".format(tracking_uri))
    mlflow.set_tracking_uri(tracking_uri)
    mlflow.set_experiment(current_experiment.name)

    # Read in data
    print("Reading data")
    all_data = pd.read_parquet(args.training_data)

    print("Extracting X_train, y_train")
    print("all_data cols: {0}".format(all_data.columns))
    y_train = all_data[args.target_column_name]
    X_train = all_data.drop(labels=args.target_column_name, axis="columns")
    print("X_train cols: {0}".format(X_train.columns))

    print("Executing Model Training pipeline")
    # We create the preprocessing pipelines for both numeric and categorical data.
    numeric_transformer = Pipeline(steps=[
        ('scaler', StandardScaler())])

    categorical_transformer = Pipeline(steps=[
        ('onehot', OneHotEncoder(handle_unknown='ignore'))])

    continuous_features_names = [
        'Age_Term_Min', 'Age_Term_Max', 'Total_Terms',
        'Entry_Type_DualEnrollment', 'Entry_Type_EarlyAdmission',
        'Entry_Type_FirstTimeinCollege', 'Entry_Type_Other',
        'Entry_Type_Re-Entry', 'Entry_Type_Transfer',
        'AcademicProbation', 'AcademicSuspension',
        'AcademicSuspensionFor1Year', 'AcademicWarning',
        'ExtendProbationForLowGpa', 'GoodAcademicStanding',
        'ProbationAfterSuspen/Dismiss', 'TransferedToNonBusiness',
        'CumulativeGPA', 'CumulativeCreditHoursEarnedPerTerm',
        'Blended', 'FullyOnline', 'RemoteLearning',
        'RemoteLearningBlended', 'Traditional', 'Adjunct', 'Faculty',
        'Unknown_IntructorType', 'PELL_Eligible',
    ]
    categorical_features_names = [
        'FirstGenerationinCollegeFlag', 'Gender', 'Race', 'HSGraduateorGED',
    ]

    transformations = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, continuous_features_names),
            ('cat', categorical_transformer, categorical_features_names)])

    # Append classifier to preprocessing pipeline.
    # Now we have a full prediction pipeline.
    # The estimator can be changed to suit
    model = Pipeline(steps=[('preprocessor', transformations),
                          ('classifier', LogisticRegression(solver='lbfgs', max_iter=1000))])

    model.fit(X_train, y_train)

    # Saving model with mlflow - leave this section unchanged
    with tempfile.TemporaryDirectory() as td:
        print("Saving model with MLFlow to temporary directory")
        tmp_output_dir = os.path.join(td, "my_model_dir")
        mlflow.sklearn.save_model(sk_model=model, path=tmp_output_dir)

        print("Copying MLFlow model to output path")
        for file_name in os.listdir(tmp_output_dir):
            print("  Copying: ", file_name)
            # As of Python 3.8, copytree will acquire dirs_exist_ok as
            # an option, removing the need for listdir
            shutil.copy2(src=os.path.join(tmp_output_dir, file_name), dst=os.path.join(args.model_output, file_name))


# run script
if __name__ == "__main__":
    # add space in logs
    print("*" * 60)
    print("\n\n")

    # parse args
    args = parse_args()

    # run main function
    main(args)

    # add space in logs
    print("*" * 60)
    print("\n\n")

**Define the YAML file**

This code snippet defines an Azure Machine Learning Command Component for training a classification model on a dataset. It starts by defining a YAML configuration file that specifies the inputs and outputs of the component, the command to run, and the environment to use. The YAML file is then saved to disk.

Next, the code uses the Azure ML Python SDK to load the Command Component from the YAML file. The resulting object can be used to run the component on a dataset, passing in the input paths and output paths as arguments.

Overall, this code provides a simple and reusable way to define and run machine learning training components in Azure ML.

In [None]:
from azure.ai.ml import load_component

yaml_contents = (
    f"""
$schema: http://azureml/sdk-2-0/CommandComponent.json
name: rai_classification_training_component
display_name: Classification training component for RAI example
version: {rai_student_attrition_classification_example_version_string}
type: command
inputs:
  training_data:
    type: path
  target_column_name:
    type: string
outputs:
  model_output:
    type: path
code: ./component_src/
environment: azureml://registries/azureml/environments/responsibleai-tabular/versions/14
"""
    + r"""
command: >-
  python classification_training_script.py
  --training_data ${{{{inputs.training_data}}}}
  --target_column_name ${{{{inputs.target_column_name}}}}
  --model_output ${{{{outputs.model_output}}}}
"""
)

yaml_filename = "RAIStudentAttritionTrainingComponent.yaml"

with open(yaml_filename, "w") as f:
    f.write(yaml_contents.format(yaml_contents))

train_model_component = load_component(source=yaml_filename)
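
The runs of four braces in the YAML command are easy to misread. The component YAML needs the literal text `${{inputs.training_data}}`, and both f-strings and `str.format()` collapse each doubled brace into a single one. The sketch below shows the two routes side by side. The `version` value is a hypothetical placeholder.

```python
version = "1"  # hypothetical version string

# Route 1: a single f-string. Each "{{" collapses to "{" when the f-string is
# evaluated, so four braces in the source leave two in the result.
via_fstring = f"version: {version}\ncommand: --training_data ${{{{inputs.training_data}}}}"

# Route 2: an f-string for the interpolated part plus a raw string that is
# later passed through str.format(), which performs the same collapsing.
head = f"version: {version}\n"
tail = r"command: --training_data ${{{{inputs.training_data}}}}"
via_format = head + tail.format()

print(via_fstring)
```

Both routes end with `${{inputs.training_data}}`. Note that calling `.format()` on text that still contains single braces would raise an error, so the second route only works when the interpolated part has none left.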

The model registration script below loads a trained model, registers it via MLFlow, and saves the registered model information to a JSON file. Users need to provide the necessary arguments to register the model, including the path to the input model, the path to the output model info JSON file, the base name of the registered model, and an optional suffix for the registered model name.

To use this script, the following arguments must be defined: 
- model_input_path: Path to the input model  
- model_info_output_path: Path to write the model info JSON  
- model_base_name: Name of the registered model  
- model_name_suffix: An integer value to add as a suffix to the registered model name. If this is negative, the epoch time is used as the suffix.
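
The suffix rule can be sketched as a small function. The name `build_registered_name` and the `now` parameter are hypothetical additions that make the behaviour easy to check. The registration script itself calls `time.time()` directly.

```python
import time


def build_registered_name(model_base_name, model_name_suffix, now=None):
    """Append the suffix, falling back to epoch seconds when it is negative."""
    if model_name_suffix < 0:
        current = time.time() if now is None else now
        suffix = int(current)
    else:
        suffix = model_name_suffix
    return "{0}_{1}".format(model_base_name, suffix)


print(build_registered_name("my_model", 7))
print(build_registered_name("my_model", -1, now=1700000000.9))
```

The first call prints `my_model_7`. The second prints `my_model_1700000000`, because the supplied clock value is truncated to whole seconds.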

In [None]:
%%writefile register_model_src/register.py

# ---------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# ---------------------------------------------------------

import argparse
import json
import os
import time


from azureml.core import Run

import mlflow
import mlflow.sklearn

# Based on example:
# https://docs.microsoft.com/en-us/azure/machine-learning/how-to-train-cli
# which references
# https://github.com/Azure/azureml-examples/tree/main/cli/jobs/train/lightgbm/iris


def parse_args():
    # setup arg parser
    parser = argparse.ArgumentParser()

    # add arguments
    parser.add_argument("--model_input_path", type=str, help="Path to input model")
    parser.add_argument(
        "--model_info_output_path", type=str, help="Path to write model info JSON"
    )
    parser.add_argument(
        "--model_base_name", type=str, help="Name of the registered model"
    )
    parser.add_argument(
        "--model_name_suffix", type=int, help="Set negative to use epoch_secs"
    )

    # parse args
    args = parser.parse_args()

    # return args
    return args


def main(args):
    current_experiment = Run.get_context().experiment
    tracking_uri = current_experiment.workspace.get_mlflow_tracking_uri()
    print("tracking_uri: {0}".format(tracking_uri))
    mlflow.set_tracking_uri(tracking_uri)
    mlflow.set_experiment(current_experiment.name)

    print("Loading model")
    mlflow_model = mlflow.sklearn.load_model(args.model_input_path)

    if args.model_name_suffix < 0:
        suffix = int(time.time())
    else:
        suffix = args.model_name_suffix
    registered_name = "{0}_{1}".format(args.model_base_name, suffix)
    print(f"Registering model as {registered_name}")

    print("Registering via MLFlow")
    mlflow.sklearn.log_model(
        sk_model=mlflow_model,
        registered_model_name=registered_name,
        artifact_path=registered_name,
    )

    print("Writing JSON")
    # Use a descriptive name rather than shadowing the built-in `dict`
    model_info = {"id": "{0}:1".format(registered_name)}
    output_path = os.path.join(args.model_info_output_path, "model_info.json")
    with open(output_path, "w") as of:
        json.dump(model_info, fp=of)


# run script
if __name__ == "__main__":
    # add space in logs
    print("*" * 60)
    print("\n\n")

    # parse args
    args = parse_args()

    # run main function
    main(args)

    # add space in logs
    print("*" * 60)
    print("\n\n")

Now that the model registration script is saved on our local drive, we create a YAML file to describe it as a component to AzureML. This involves defining the inputs and outputs, specifying the AzureML environment which can run the script, and telling AzureML how to invoke the model registration script:

In [None]:
yaml_contents = f"""
$schema: http://azureml/sdk-2-0/CommandComponent.json
name: register_model
display_name: Register Model
version: {rai_student_attrition_classification_example_version_string}
type: command
is_deterministic: False
inputs:
  model_input_path:
    type: path
  model_base_name:
    type: string
  model_name_suffix: # Set negative to use epoch_secs
    type: integer
    default: -1
outputs:
  model_info_output_path:
    type: path
code: ./register_model_src/
environment: azureml://registries/azureml/environments/responsibleai-tabular/versions/14
command: >-
  python register.py
  --model_input_path ${{{{inputs.model_input_path}}}}
  --model_base_name ${{{{inputs.model_base_name}}}}
  --model_name_suffix ${{{{inputs.model_name_suffix}}}}
  --model_info_output_path ${{{{outputs.model_info_output_path}}}}

"""

yaml_filename = "register.yaml"

with open(yaml_filename, "w") as f:
    f.write(yaml_contents)

register_component = load_component(source=yaml_filename)

We will create a new compute cluster to run the jobs if one with the name given at the beginning of the notebook does not already exist.

In [None]:
from azure.ai.ml.entities import AmlCompute

all_compute_names = [x.name for x in ml_client.compute.list()]

if compute_name in all_compute_names:
    print(f"Found existing compute: {compute_name}")
else:
    my_compute = AmlCompute(
        name=compute_name,
        size="Standard_DS4_v2",
        min_instances=0,
        max_instances=1,
        idle_time_before_scale_down=3600,
    )
    ml_client.compute.begin_create_or_update(my_compute).result()
    print("Initiated compute creation")

### Running a training pipeline

The two YAML files (RAIStudentAttritionTrainingComponent.yaml and register.yaml) define the two components in the model training pipeline.

We start by defining the model name and a time-based suffix for the registered model:

In [None]:
import time

model_name_suffix = int(time.time())
model_name = "rai_student_attrition_classsification_model"

Next, we define the pipeline using objects from the AzureML SDKv2. As mentioned above, there are two component jobs: one to train the model, and one to register it:

In [None]:
from azure.ai.ml import dsl, Input

target_feature = "Attrition"
categorical_features = [
    "FirstGenerationinCollegeFlag",
    "Gender",
    "Race",
    "HSGraduateorGED",
]

loan_train_pq = Input(
    type="uri_file",
    path=f"azureml:{input_train_data}:{rai_student_attrition_classification_example_version_string}",
    mode="download",
)
loan_test_pq = Input(
    type="uri_file",
    path=f"azureml:{input_test_data}:{rai_student_attrition_classification_example_version_string}",
    mode="download",
)


@dsl.pipeline(
    compute=compute_name,
    description="Register Model for RAI Student Attrition classification example",
    experiment_name=f"RAI_classification_Example_Model_Training_{model_name_suffix}",
)
def my_training_pipeline(target_column_name, training_data):
    trained_model = train_model_component(
        target_column_name=target_column_name, training_data=training_data
    )
    trained_model.set_limits(timeout=1200)

    _ = register_component(
        model_input_path=trained_model.outputs.model_output,
        model_base_name=model_name,
        model_name_suffix=model_name_suffix,
    )

    return {}


model_registration_pipeline_job = my_training_pipeline(target_feature, loan_train_pq)

With the pipeline definition created, we can submit it to AzureML. We define a helper function to do the submission, which waits for the submitted job to complete:

In [None]:
from azure.ai.ml.entities import PipelineJob
from IPython.display import HTML, display


def submit_and_wait(ml_client, pipeline_job) -> PipelineJob:
    created_job = ml_client.jobs.create_or_update(pipeline_job)
    assert created_job is not None

    print("Pipeline job can be accessed in the following URL:")
    display(HTML('<a href="{0}">{0}</a>'.format(created_job.studio_url)))

    while created_job.status not in [
        "Completed",
        "Failed",
        "Canceled",
        "NotResponding",
    ]:
        time.sleep(30)
        created_job = ml_client.jobs.get(created_job.name)
        print("Latest status : {0}".format(created_job.status))
    assert created_job.status == "Completed"
    return created_job


# This is the actual submission
training_job = submit_and_wait(ml_client, model_registration_pipeline_job)
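
`submit_and_wait` polls until the job reaches a terminal state, with no upper bound on the wait. The sketch below isolates that loop and adds an optional poll limit. The function `wait_for_terminal_status`, its `max_polls` and `sleep` parameters, and the fake job are hypothetical and exist only for illustration.

```python
import time

TERMINAL_STATES = {"Completed", "Failed", "Canceled", "NotResponding"}


def wait_for_terminal_status(get_status, poll_seconds=30, max_polls=None, sleep=time.sleep):
    """Poll get_status() until it returns a terminal state or polls run out."""
    polls = 0
    status = get_status()
    while status not in TERMINAL_STATES:
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError(f"Job still '{status}' after {polls} polls")
        sleep(poll_seconds)
        polls += 1
        status = get_status()
    return status


# Hypothetical job that reports "Running" twice before completing.
statuses = iter(["Running", "Running", "Completed"])
final_status = wait_for_terminal_status(
    lambda: next(statuses), poll_seconds=0, sleep=lambda seconds: None
)
print(final_status)
```

Injecting the `sleep` function lets the loop be exercised instantly. In the notebook, the equivalent of `get_status` is a call to `ml_client.jobs.get(...)` followed by reading `.status`.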

## Creating the RAI Insights

We have a registered model, and can now run a pipeline to create the RAI insights. First off, compute the name of the model we registered:

In [None]:
expected_model_id = f"{model_name}_{model_name_suffix}:1"
azureml_model_id = f"azureml:{expected_model_id}"


Now, we create the RAI pipeline itself. There are four 'component stages' in this pipeline:

1. Construct an empty `RAIInsights` object
1. Run the RAI tool components
1. Gather the tool outputs into a single `RAIInsights` object
1. (Optional) Generate a score card in PDF format summarizing model performance and key aspects from the RAI tool components

We start by loading the RAI component definitions for use in our pipeline:

In [None]:
# Get handle to azureml registry for the RAI built in components
registry_name = "azureml"

ml_client_registry = MLClient(
    credential=credential,
    subscription_id=subscription_id,
    resource_group_name=resource_group,
    registry_name=registry_name,
)
print(ml_client_registry)

## Add different components of ResponsibleAI dashboard to the Pipeline

Reference:
https://learn.microsoft.com/en-us/azure/machine-learning/how-to-responsible-ai-insights-sdk-cli?tabs=python

In [None]:
label = "latest"

rai_constructor_component = ml_client_registry.components.get(
    name="rai_tabular_insight_constructor", label=label
)

# We get latest version and use the same version for all components
version = rai_constructor_component.version
print("The current version of RAI built-in components is: " + version)

rai_counterfactual_component = ml_client_registry.components.get(
    name="rai_tabular_counterfactual", version=version
)
rai_erroranalysis_component = ml_client_registry.components.get(
    name="rai_tabular_erroranalysis", version=version
)

rai_explanation_component = ml_client_registry.components.get(
    name="rai_tabular_explanation", version=version
)

rai_gather_component = ml_client_registry.components.get(
    name="rai_tabular_insight_gather", version=version
)

rai_scorecard_component = ml_client_registry.components.get(
    name="rai_tabular_score_card", version=version
)

## Score card generation config
For score card generation, we need some additional configuration in a separate JSON file. Here we configure the following model performance metrics for reporting:
- accuracy
- precision

In [None]:
import json

score_card_config_dict = {
    "Model": {
        "ModelName": "Student Attrition classification",
        "ModelType": "Classification",
        "ModelSummary": "<model summary>",
    },
    "Metrics": {"accuracy_score": {"threshold": ">=0.5"}, "precision_score": {}},
}

score_card_config_filename = (
    "rai_student_attrition_classification_score_card_config.json"
)

with open(score_card_config_filename, "w") as f:
    json.dump(score_card_config_dict, f)

score_card_config_path = Input(
    type="uri_file", path=score_card_config_filename, mode="download"
)
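
Before submitting the pipeline, it can help to confirm that the configuration file round-trips through JSON intact. The following sketch writes the same dictionary to a temporary directory and reads it back. The file name and the two sanity checks are assumptions for illustration. The authoritative schema is whatever the score card component expects.

```python
import json
import os
import tempfile

config = {
    "Model": {
        "ModelName": "Student Attrition classification",
        "ModelType": "Classification",
        "ModelSummary": "<model summary>",
    },
    "Metrics": {"accuracy_score": {"threshold": ">=0.5"}, "precision_score": {}},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = os.path.join(tmp_dir, "score_card_config.json")
    with open(config_path, "w") as f:
        json.dump(config, f)
    with open(config_path) as f:
        reloaded = json.load(f)

# Hypothetical sanity checks on the reloaded configuration.
missing_sections = [key for key in ("Model", "Metrics") if key not in reloaded]
metric_names = sorted(reloaded["Metrics"])
print(missing_sections, metric_names)
```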

Now the pipeline itself. This creates an empty `RAIInsights` object, adds the analyses, and then gathers everything into the final `RAIInsights` output. Where complex objects need to be passed (such as a list of treatment feature names), they must be encoded as JSON strings.
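
As a small illustration of the JSON encoding mentioned above, the sketch below turns two lists into strings with `json.dumps` and recovers one of them with `json.loads`. The variable names ending in `encoded` and `decoded` are hypothetical.

```python
import json

categorical_features = ["FirstGenerationinCollegeFlag", "Gender", "Race", "HSGraduateorGED"]
classes = ["Retain", "Attrition"]

# Encode each list as a single string argument.
encoded_features = json.dumps(categorical_features)
encoded_classes = json.dumps(classes)
print(encoded_classes)

# A receiving component can recover the original list with json.loads.
decoded_classes = json.loads(encoded_classes)
```

The printed value is the string `["Retain", "Attrition"]`, which is what the component receives in place of a Python list.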

Note that every component below uses the same one-hour timeout. Counterfactual generation is a comparatively slow process, so increase its timeout if the job does not finish in time.

In [None]:
import json
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes

classes_in_target = json.dumps(["Retain", "Attrition"])


@dsl.pipeline(
    compute=compute_name,
    description="Example RAI computation on Student Attrition Classification",
    experiment_name=f"RAI_Student_Attrition_Classification_Example_RAIInsights_Computation_{model_name_suffix}",
)
def rai_classification_pipeline(
    target_column_name,
    train_data,
    test_data,
    score_card_config_path,
):
    # Initiate the RAIInsights
    create_rai_job = rai_constructor_component(
        title="RAI Dashboard Example",
        task_type="classification",
        model_info=expected_model_id,
        model_input=Input(type=AssetTypes.MLFLOW_MODEL, path=azureml_model_id),
        train_dataset=train_data,
        test_dataset=test_data,
        target_column_name=target_column_name,
        categorical_column_names=json.dumps(categorical_features),
        classes=classes_in_target,
    )
    create_rai_job.set_limits(timeout=3600)

    # Add an explanation
    explain_job = rai_explanation_component(
        comment="Explanation for the classification dataset",
        rai_insights_dashboard=create_rai_job.outputs.rai_insights_dashboard,
    )
    explain_job.set_limits(timeout=3600)

    # Add counterfactual analysis
    counterfactual_job = rai_counterfactual_component(
        rai_insights_dashboard=create_rai_job.outputs.rai_insights_dashboard,
        total_cfs=10,
        desired_class="opposite",
    )
    counterfactual_job.set_limits(timeout=3600)

    # Add error analysis
    erroranalysis_job = rai_erroranalysis_component(
        rai_insights_dashboard=create_rai_job.outputs.rai_insights_dashboard,
    )
    erroranalysis_job.set_limits(timeout=3600)

    # Combine everything
    rai_gather_job = rai_gather_component(
        constructor=create_rai_job.outputs.rai_insights_dashboard,
        insight_1=explain_job.outputs.explanation,
        # insight_2=causal_job.outputs.causal,
        insight_3=counterfactual_job.outputs.counterfactual,
        insight_4=erroranalysis_job.outputs.error_analysis,
    )
    rai_gather_job.set_limits(timeout=3600)

    rai_gather_job.outputs.dashboard.mode = "upload"
    rai_gather_job.outputs.ux_json.mode = "upload"

    # Generate score card in pdf format for a summary report on model performance,
    # and observe distribution of error between prediction vs ground truth.
    rai_scorecard_job = rai_scorecard_component(
        dashboard=rai_gather_job.outputs.dashboard,
        pdf_generation_config=score_card_config_path,
    )

    return {
        "dashboard": rai_gather_job.outputs.dashboard,
        "ux_json": rai_gather_job.outputs.ux_json,
        "scorecard": rai_scorecard_job.outputs.scorecard,
    }

Next, we define the pipeline object itself, and ensure that the outputs will be available for download:

In [None]:
from datetime import datetime
from azure.ai.ml import Output

# Pipeline to construct the RAI Insights
insights_pipeline_job = rai_classification_pipeline(
    target_column_name=target_feature,
    train_data=loan_train_pq,
    test_data=loan_test_pq,
    score_card_config_path=score_card_config_path,
)

# Workaround to enable the download
timestamp = datetime.now().strftime("%Y%m%d_%H_%M_%S")
path = f"RAI_Student_Attrition_RAIInsights_{model_name_suffix}_{timestamp}"
insights_pipeline_job.outputs.dashboard = Output(
    path=f"azureml://datastores/workspaceblobstore/paths/{path}/dashboard/",
    mode="upload",
    type="uri_folder",
)
insights_pipeline_job.outputs.ux_json = Output(
    path=f"azureml://datastores/workspaceblobstore/paths/{path}/ux_json/",
    mode="upload",
    type="uri_folder",
)
insights_pipeline_job.outputs.scorecard = Output(
    path=f"azureml://datastores/workspaceblobstore/paths/{path}/scorecard/",
    mode="upload",
    type="uri_folder",
)
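
The three output locations share one timestamped folder. The sketch below reproduces the path construction with a fixed clock so that the result is reproducible. The helper `build_output_uri`, the suffix value, and the date are hypothetical.

```python
from datetime import datetime


def build_output_uri(model_name_suffix, output_name, now):
    """Build a datastore URI for one pipeline output under a timestamped folder."""
    timestamp = now.strftime("%Y%m%d_%H_%M_%S")
    folder = f"RAI_Student_Attrition_RAIInsights_{model_name_suffix}_{timestamp}"
    return f"azureml://datastores/workspaceblobstore/paths/{folder}/{output_name}/"


# Hypothetical suffix and a fixed clock so the result is reproducible.
fixed_now = datetime(2024, 1, 2, 3, 4, 5)
uris = {
    name: build_output_uri(1700000000, name, fixed_now)
    for name in ("dashboard", "ux_json", "scorecard")
}
print(uris["scorecard"])
```

Because the timestamp is part of the folder name, repeated runs with the same model suffix write to separate folders instead of overwriting earlier outputs.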

And submit the pipeline to AzureML for execution:

In [None]:
insights_job = submit_and_wait(ml_client, insights_pipeline_job)

The dashboard should appear in the AzureML portal in the registered model view. The last code cell of this notebook computes the expected URI.

## Downloading the Scorecard PDF

We can download the scorecard PDF from our pipeline as follows:

In [None]:
target_directory = "."

ml_client.jobs.download(
    insights_job.name, download_path=target_directory, output_name="scorecard"
)

## To Access the Dashboard follow the link below

In [None]:
sub_id = ml_client._operation_scope.subscription_id
rg_name = ml_client._operation_scope.resource_group_name
ws_name = ml_client.workspace_name

expected_uri = f"https://ml.azure.com/model/{expected_model_id}/model_analysis?wsid=/subscriptions/{sub_id}/resourcegroups/{rg_name}/workspaces/{ws_name}"

print(f"Please visit {expected_uri} to see your analysis")

Once this is complete, we can go to the Registered Models view in the AzureML portal, and find the model we have just registered. On the 'Model Details' page, there is a "Responsible AI dashboard" tab where we can view the insights which we have just uploaded.