{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "98605bcd",
"metadata": {},
"source": [
"# Mulitlabel Text Classification scenario with RAI Dashboard\n",
"\n",
"The [Covid19 Emergency Event Dataset](https://huggingface.co/datasets/joelito/covid19_emergency_event) is a multilabel dataset that presents a corpus of legislative documents manually annotated for exceptional measures against COVID-19. Each document has 8 possible labels, where the document can be tagged with up to all 8 labels, and each label representing a specific measurement against COVID-19. The events are:\n",
"\n",
"event1: State of Emergency\n",
"event2: Restrictions of fundamental rights and civil liberties\n",
"event3: Restrictions of daily liberties\n",
"event4: Closures / lockdown\n",
"event5: Suspension of international cooperation and commitments\n",
"event6: Police mobilization\n",
"event7: Army mobilization\n",
"event8: Government oversight\n",
"\n"
]
},
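{
"attachments": {},
"cell_type": "markdown",
"id": "98605bce",
"metadata": {},
"source": [
"For illustration, a single multilabel example pairs a text with eight binary indicators, one per event. The row below is a made-up toy example (not taken from the real dataset), just to show the shape of the data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "98605bcf",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Toy illustration of the multilabel format: one text, eight binary event flags.\n",
"# The values here are invented; the real data is loaded later in this notebook.\n",
"pd.DataFrame(\n",
"    [\n",
"        {\n",
"            \"text\": \"The government declares a state of emergency and closes schools.\",\n",
"            \"event1\": True,\n",
"            \"event2\": False,\n",
"            \"event3\": False,\n",
"            \"event4\": True,\n",
"            \"event5\": False,\n",
"            \"event6\": False,\n",
"            \"event7\": False,\n",
"            \"event8\": False,\n",
"        }\n",
"    ]\n",
")"
]
},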
{
"attachments": {},
"cell_type": "markdown",
"id": "e870e274",
"metadata": {},
"source": [
"Install datasets to retrieve this dataset from huggingface:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9730612a",
"metadata": {},
"outputs": [],
"source": [
"%pip install datasets\n",
"%pip install \"pandas<2.0.0\""
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "e434de4d",
"metadata": {},
"source": [
"First, we need to specify the version of the RAI components which are available in the workspace. This was specified when the components were uploaded."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "53b4eeac",
"metadata": {},
"outputs": [],
"source": [
"version_string = \"0.0.20\""
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "06008690",
"metadata": {},
"source": [
"We also need to give the name of the compute cluster we want to use in AzureML. Later in this notebook, we will create it if it does not already exist:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f1ad79f9",
"metadata": {},
"outputs": [],
"source": [
"compute_name = \"cpucluster\""
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "9fc65dc7",
"metadata": {},
"source": [
"Finally, we need to specify a version for the data and components we will create while running this notebook. This should be unique for the workspace, but the specific value doesn't matter:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "78053935",
"metadata": {},
"outputs": [],
"source": [
"rai_example_version_string = \"23\""
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "73be2b63",
"metadata": {},
"source": [
"## Accessing the Data\n",
"\n",
"We supply the data as a pair of parquet files and accompanying `MLTable` file. We can download them, preprocess them, and take a brief look:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5f875f18",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import json\n",
"import datasets\n",
"import pandas as pd\n",
"\n",
"from sklearn import preprocessing\n",
"\n",
"NUM_TEST_SAMPLES = 100"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ccbfd923",
"metadata": {},
"outputs": [],
"source": [
"def load_covid19_emergency_event_dataset(split):\n",
" dataset = datasets.load_dataset(\"joelito/covid19_emergency_event\", split=split)\n",
" dataset = pd.DataFrame(\n",
" {\n",
" \"language\": dataset[\"language\"],\n",
" \"text\": dataset[\"text\"],\n",
" \"event1\": dataset[\"event1\"],\n",
" \"event2\": dataset[\"event2\"],\n",
" \"event3\": dataset[\"event3\"],\n",
" \"event4\": dataset[\"event4\"],\n",
" \"event5\": dataset[\"event5\"],\n",
" \"event6\": dataset[\"event6\"],\n",
" \"event7\": dataset[\"event7\"],\n",
" \"event8\": dataset[\"event8\"],\n",
" }\n",
" )\n",
" dataset = dataset[dataset.language == \"en\"].reset_index(drop=True)\n",
" dataset = dataset.drop(columns=\"language\")\n",
" return dataset\n",
"\n",
"\n",
"pd_test_data = load_covid19_emergency_event_dataset(\"test\")\n",
"test_data = pd_test_data[:NUM_TEST_SAMPLES]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "17d53df4",
"metadata": {},
"source": [
"Now create the mltable:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4c7bbe58",
"metadata": {},
"outputs": [],
"source": [
"pq_filename = \"covid_data.parquet\"\n",
"\n",
"\n",
"def create_ml_table_file_contents(pq_filename):\n",
" return (\n",
" \"$schema: http://azureml/sdk-2-0/MLTable.json\\n\"\n",
" \"type: mltable\\n\"\n",
" \"paths:\\n\"\n",
" \" - file: ./{0}\\n\"\n",
" \"transformations:\\n\"\n",
" \" - read_parquet\\n\"\n",
" ).format(pq_filename)\n",
"\n",
"\n",
"def write_to_parquet(data, path, pq_filename):\n",
" os.makedirs(path, exist_ok=True)\n",
" data.to_parquet(os.path.join(path, pq_filename), index=False)\n",
"\n",
"\n",
"def create_ml_table_file(path, contents):\n",
" with open(os.path.join(path, \"MLTable\"), \"w\") as f:\n",
" f.write(contents)\n",
"\n",
"\n",
"test_data_path = \"multilabel_covid_test_data\"\n",
"\n",
"write_to_parquet(test_data, test_data_path, pq_filename)\n",
"\n",
"mltable_file_contents = create_ml_table_file_contents(pq_filename)\n",
"create_ml_table_file(test_data_path, mltable_file_contents)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "a2c4ebb4",
"metadata": {},
"source": [
"Load some data for a quick view:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1027fa92",
"metadata": {},
"outputs": [],
"source": [
"import mltable\n",
"\n",
"tbl = mltable.load(test_data_path)\n",
"test_df: pd.DataFrame = tbl.to_pandas_dataframe()\n",
"\n",
"display(test_df)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "1115ac59",
"metadata": {},
"source": [
"The label column contains the classes:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5b42df3d",
"metadata": {},
"outputs": [],
"source": [
"target_column_name = [\n",
" \"event1\",\n",
" \"event2\",\n",
" \"event3\",\n",
" \"event4\",\n",
" \"event5\",\n",
" \"event6\",\n",
" \"event7\",\n",
" \"event8\",\n",
"]\n",
"encoded_target_column_name = json.dumps(target_column_name)"
]
},
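{
"attachments": {},
"cell_type": "markdown",
"id": "5b42df3e",
"metadata": {},
"source": [
"As an optional sanity check, we can count how many documents in the test slice are tagged with each event:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5b42df3f",
"metadata": {},
"outputs": [],
"source": [
"# Number of positive examples per event in the test slice\n",
"test_df[target_column_name].sum()"
]
},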
{
"attachments": {},
"cell_type": "markdown",
"id": "52e79b04",
"metadata": {},
"source": [
"First, we need to upload the datasets to our workspace. We start by creating an `MLClient` for interactions with AzureML:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3a0bddc4",
"metadata": {},
"outputs": [],
"source": [
"# Enter details of your AML workspace\n",
"subscription_id = \"<SUBSCRIPTION_ID>\"\n",
"resource_group = \"<RESOURCE_GROUP>\"\n",
"workspace = \"<AML_WORKSPACE_NAME>\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "395435fc",
"metadata": {},
"outputs": [],
"source": [
"# Handle to the workspace\n",
"from azure.ai.ml import MLClient\n",
"from azure.identity import DefaultAzureCredential\n",
"\n",
"try:\n",
" credential = DefaultAzureCredential()\n",
" ml_client = MLClient(\n",
" credential=credential,\n",
" subscription_id=subscription_id,\n",
" resource_group_name=resource_group,\n",
" workspace_name=workspace,\n",
" )\n",
"except Exception:\n",
" # If in compute instance we can get the config automatically\n",
" from azureml.core import Workspace\n",
"\n",
" workspace = Workspace.from_config()\n",
" workspace.write_config()\n",
" ml_client = MLClient.from_config(\n",
" credential=DefaultAzureCredential(exclude_shared_token_cache_credential=True),\n",
" logging_enable=True,\n",
" )\n",
"\n",
"print(ml_client)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "7b501735",
"metadata": {},
"source": [
"We can now upload the data to AzureML:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "62eb02a2",
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.ml.entities import Data\n",
"from azure.ai.ml.constants import AssetTypes\n",
"\n",
"input_test_data = \"Covid19_Events_Test_MLTable\"\n",
"\n",
"try:\n",
" test_data = ml_client.data.get(\n",
" name=input_test_data,\n",
" version=rai_example_version_string,\n",
" )\n",
"except Exception:\n",
" test_data = Data(\n",
" path=test_data_path,\n",
" type=AssetTypes.MLTABLE,\n",
" description=\"RAI Covid 19 Events Multilabel test data\",\n",
" name=input_test_data,\n",
" version=rai_example_version_string,\n",
" )\n",
" ml_client.data.create_or_update(test_data)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "6815ba75",
"metadata": {},
"source": [
"# Creating the Model\n",
"\n",
"To simplify the model creation process, we're going to use a pipeline.\n",
"\n",
"We create a directory for the training script:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e78d869b",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"os.makedirs(\"covid19_events_component_src\", exist_ok=True)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "ea86e55d",
"metadata": {},
"source": [
"Next, we write out our script to retrieve the trained model:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a523f144",
"metadata": {},
"outputs": [],
"source": [
"%%writefile covid19_events_component_src/training_script.py\n",
"\n",
"import argparse\n",
"import logging\n",
"import json\n",
"import os\n",
"import time\n",
"\n",
"\n",
"import mlflow\n",
"import mlflow.pyfunc\n",
"\n",
"import zipfile\n",
"from azureml.core import Run\n",
"\n",
"from transformers import AutoModelForSequenceClassification, \\\n",
" AutoTokenizer, pipeline\n",
"\n",
"from azureml.rai.utils import PyfuncModel\n",
"from raiutils.common.retries import retry_function\n",
"\n",
"try:\n",
" from urllib import urlretrieve\n",
"except ImportError:\n",
" from urllib.request import urlretrieve\n",
"\n",
"\n",
"_logger = logging.getLogger(__file__)\n",
"logging.basicConfig(level=logging.INFO)\n",
"\n",
"\n",
"COVID19_EVENTS_LABELS = [\"event1\", \"event2\", \"event3\", \"event4\",\n",
" \"event5\", \"event6\", \"event7\", \"event8\"]\n",
"COVID19_EVENTS_MODEL_NAME = \"covid19_events_model\"\n",
"\n",
"\n",
"def parse_args():\n",
" # setup arg parser\n",
" parser = argparse.ArgumentParser()\n",
"\n",
" # add arguments\n",
" parser.add_argument(\n",
" \"--model_output_path\", type=str, help=\"Path to write model info JSON\"\n",
" )\n",
" parser.add_argument(\n",
" \"--model_base_name\", type=str, help=\"Name of the registered model\"\n",
" )\n",
" parser.add_argument(\n",
" \"--model_name_suffix\", type=int, help=\"Set negative to use epoch_secs\"\n",
" )\n",
" parser.add_argument(\n",
" \"--device\", type=int, help=(\n",
" \"Device for CPU/GPU supports. Setting this to -1 will leverage \"\n",
" \"CPU, >=0 will run the model on the associated CUDA device id.\")\n",
" )\n",
"\n",
" # parse args\n",
" args = parser.parse_args()\n",
"\n",
" # return args\n",
" return args\n",
"\n",
"\n",
"class FetchModel(object):\n",
" def __init__(self):\n",
" pass\n",
"\n",
" def fetch(self):\n",
" zipfilename = COVID19_EVENTS_MODEL_NAME + '.zip'\n",
" url = ('https://publictestdatasets.blob.core.windows.net/models/' +\n",
" COVID19_EVENTS_MODEL_NAME + '.zip')\n",
" urlretrieve(url, zipfilename)\n",
" with zipfile.ZipFile(zipfilename, 'r') as unzip:\n",
" unzip.extractall(COVID19_EVENTS_MODEL_NAME)\n",
"\n",
"\n",
"def create_multilabel_pipeline(device):\n",
" fetcher = FetchModel()\n",
" action_name = \"Model download\"\n",
" err_msg = \"Failed to download model\"\n",
" max_retries = 4\n",
" retry_delay = 60\n",
" retry_function(fetcher.fetch, action_name, err_msg,\n",
" max_retries=max_retries,\n",
" retry_delay=retry_delay)\n",
" labels = COVID19_EVENTS_LABELS\n",
" num_labels = len(labels)\n",
" id2label = {idx: label for idx, label in enumerate(labels)}\n",
" label2id = {label: idx for idx, label in enumerate(labels)}\n",
" model = AutoModelForSequenceClassification.from_pretrained(\n",
" COVID19_EVENTS_MODEL_NAME, num_labels=num_labels,\n",
" problem_type=\"multi_label_classification\",\n",
" id2label=id2label,\n",
" label2id=label2id)\n",
"\n",
" if device >= 0:\n",
" model = model.cuda()\n",
"\n",
" tokenizer = AutoTokenizer.from_pretrained(\"bert-base-uncased\")\n",
" # build a pipeline object to do predictions\n",
" pred = pipeline(\n",
" \"text-classification\",\n",
" model=model,\n",
" tokenizer=tokenizer,\n",
" device=device,\n",
" return_all_scores=True\n",
" )\n",
" return pred\n",
"\n",
"\n",
"def main(args):\n",
" current_experiment = Run.get_context().experiment\n",
" tracking_uri = current_experiment.workspace.get_mlflow_tracking_uri()\n",
" _logger.info(\"tracking_uri: {0}\".format(tracking_uri))\n",
" mlflow.set_tracking_uri(tracking_uri)\n",
" mlflow.set_experiment(current_experiment.name)\n",
"\n",
" _logger.info(\"Getting device\")\n",
" device = args.device\n",
"\n",
" # build a pipeline object to do predictions\n",
" _logger.info(\"Building pipeline\")\n",
"\n",
" pred = create_multilabel_pipeline(device)\n",
"\n",
" if args.model_name_suffix < 0:\n",
" suffix = int(time.time())\n",
" else:\n",
" suffix = args.model_name_suffix\n",
" registered_name = \"{0}_{1}\".format(args.model_base_name, suffix)\n",
" _logger.info(f\"Registering model as {registered_name}\")\n",
"\n",
" my_mlflow = PyfuncModel(pred)\n",
"\n",
" # Saving model with mlflow\n",
" _logger.info(\"Saving with mlflow\")\n",
" mlflow.pyfunc.log_model(\n",
" python_model=my_mlflow,\n",
" registered_model_name=registered_name,\n",
" artifact_path=registered_name,\n",
" )\n",
"\n",
" _logger.info(\"Writing JSON\")\n",
" dict = {\"id\": \"{0}:1\".format(registered_name)}\n",
" output_path = os.path.join(args.model_output_path, \"model_info.json\")\n",
" with open(output_path, \"w\") as of:\n",
" json.dump(dict, fp=of)\n",
"\n",
"\n",
"# run script\n",
"if __name__ == \"__main__\":\n",
" # add space in logs\n",
" print(\"*\" * 60)\n",
" print(\"\\n\\n\")\n",
"\n",
" # parse args\n",
" args = parse_args()\n",
"\n",
" # run main function\n",
" main(args)\n",
"\n",
" # add space in logs\n",
" print(\"*\" * 60)\n",
" print(\"\\n\\n\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "e115dd6e",
"metadata": {},
"source": [
"Now, we can build this into an AzureML component:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3d54e43f",
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.ml import load_component\n",
"\n",
"yaml_contents = f\"\"\"\n",
"$schema: http://azureml/sdk-2-0/CommandComponent.json\n",
"name: rai_covid19_events_training_component\n",
"display_name: Covid 19 Events training component for RAI example\n",
"version: {rai_example_version_string}\n",
"type: command\n",
"inputs:\n",
" model_base_name:\n",
" type: string\n",
" model_name_suffix: # Set negative to use epoch_secs\n",
" type: integer\n",
" default: -1\n",
" device: # set to >= 0 to use GPU\n",
" type: integer\n",
" default: 0\n",
"outputs:\n",
" model_output_path:\n",
" type: path\n",
"code: ./covid19_events_component_src/\n",
"environment: azureml://registries/azureml/environments/responsibleai-text/versions/13\n",
"command: >-\n",
" python training_script.py\n",
" --model_base_name ${{{{inputs.model_base_name}}}}\n",
" --model_name_suffix ${{{{inputs.model_name_suffix}}}}\n",
" --device ${{{{inputs.device}}}}\n",
" --model_output_path ${{{{outputs.model_output_path}}}}\n",
"\"\"\"\n",
"\n",
"yaml_filename = \"Covid19EventsTextTrainingComp.yaml\"\n",
"\n",
"with open(yaml_filename, \"w\") as f:\n",
" f.write(yaml_contents)\n",
"\n",
"train_component_definition = load_component(source=yaml_filename)\n",
"\n",
"ml_client.components.create_or_update(train_component_definition)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "6d165e2b",
"metadata": {},
"source": [
"We need a compute target on which to run our jobs. The following checks whether the compute specified above is present; if not, then the compute target is created."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1e40fc38",
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.ml.entities import AmlCompute\n",
"\n",
"all_compute_names = [x.name for x in ml_client.compute.list()]\n",
"\n",
"if compute_name in all_compute_names:\n",
" print(f\"Found existing compute: {compute_name}\")\n",
"else:\n",
" my_compute = AmlCompute(\n",
" name=compute_name,\n",
" size=\"STANDARD_DS3_V2\",\n",
" min_instances=0,\n",
" max_instances=4,\n",
" idle_time_before_scale_down=3600,\n",
" )\n",
" ml_client.compute.begin_create_or_update(my_compute)\n",
" print(\"Initiated compute creation\")"
]
},
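{
"attachments": {},
"cell_type": "markdown",
"id": "1e40fc39",
"metadata": {},
"source": [
"Compute creation is asynchronous. As an optional check (a minimal sketch), we can fetch the compute target and inspect its provisioning state before submitting any jobs:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1e40fc3a",
"metadata": {},
"outputs": [],
"source": [
"# Optional: confirm the compute target exists and check its provisioning state\n",
"compute_target = ml_client.compute.get(compute_name)\n",
"print(f\"{compute_target.name}: {compute_target.provisioning_state}\")"
]
},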
{
"attachments": {},
"cell_type": "markdown",
"id": "9d8eb868",
"metadata": {},
"source": [
"## Running a training pipeline\n",
"\n",
"Now that we have our training component, we can run it. We begin by generating a unique name for the mode;"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ad76242b",
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"\n",
"model_base_name = \"multilabel_hf_model\"\n",
"model_name_suffix = \"12492\"\n",
"device = -1"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "d49615a7",
"metadata": {},
"source": [
"Next, we define our training pipeline. This has two components. The first is the training component which we defined above. The second is a component to register the model in AzureML:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cb6c6cec",
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.ml import dsl, Input\n",
"\n",
"train_model_component = ml_client.components.get(\n",
" name=\"rai_covid19_events_training_component\", version=rai_example_version_string\n",
")\n",
"\n",
"\n",
"@dsl.pipeline(\n",
" compute=compute_name,\n",
" description=\"Register Model for RAI Covid 19 Events Multilabel example\",\n",
" experiment_name=f\"RAI_Covid19_Events_Multilabel_Example_Model_Training_{model_name_suffix}\",\n",
")\n",
"def my_training_pipeline(model_base_name, model_name_suffix, device):\n",
" trained_model = train_component_definition(\n",
" model_base_name=model_base_name,\n",
" model_name_suffix=model_name_suffix,\n",
" device=device,\n",
" )\n",
" trained_model.set_limits(timeout=3600)\n",
"\n",
" return {}\n",
"\n",
"\n",
"model_registration_pipeline_job = my_training_pipeline(\n",
" model_base_name, model_name_suffix, device\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "2fa66ea6",
"metadata": {},
"source": [
"With the training pipeline defined, we can submit it for execution in AzureML. We define a helper function to wait for the job to complete:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f854eef5",
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.ml.entities import PipelineJob\n",
"\n",
"\n",
"def submit_and_wait(ml_client, pipeline_job) -> PipelineJob:\n",
" created_job = ml_client.jobs.create_or_update(pipeline_job)\n",
" assert created_job is not None\n",
"\n",
" while created_job.status not in [\n",
" \"Completed\",\n",
" \"Failed\",\n",
" \"Canceled\",\n",
" \"NotResponding\",\n",
" ]:\n",
" time.sleep(30)\n",
" created_job = ml_client.jobs.get(created_job.name)\n",
" print(\"Latest status : {0}\".format(created_job.status))\n",
" assert created_job.status == \"Completed\"\n",
" return created_job\n",
"\n",
"\n",
"# This is the actual submission\n",
"training_job = submit_and_wait(ml_client, model_registration_pipeline_job)"
]
},
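{
"attachments": {},
"cell_type": "markdown",
"id": "f854eef6",
"metadata": {},
"source": [
"As an optional alternative to the polling loop above, the SDK can stream a job's logs until it reaches a terminal state; on an already-completed job this simply prints the stored logs:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f854eef7",
"metadata": {},
"outputs": [],
"source": [
"# Optional: stream the pipeline job logs (blocks while the job is running;\n",
"# returns promptly here because the job above has already completed)\n",
"ml_client.jobs.stream(training_job.name)"
]
},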
{
"attachments": {},
"cell_type": "markdown",
"id": "0722395e",
"metadata": {},
"source": [
"## Creating the RAI Text Insights\n",
"\n",
"Now that we have our model, we can generate RAI Text insights for it. We will need the `id` of the registered model, which will be as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7d3e6e6e",
"metadata": {},
"outputs": [],
"source": [
"expected_model_id = f\"{model_base_name}_{model_name_suffix}:1\"\n",
"azureml_model_id = f\"azureml:{expected_model_id}\""
]
},
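{
"attachments": {},
"cell_type": "markdown",
"id": "7d3e6e6f",
"metadata": {},
"source": [
"Optionally, we can verify the registration by fetching version 1 of the model from the workspace (this assumes the training pipeline above completed successfully):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7d3e6e70",
"metadata": {},
"outputs": [],
"source": [
"# Verify that the model was registered by the training pipeline\n",
"registered_model = ml_client.models.get(\n",
"    name=f\"{model_base_name}_{model_name_suffix}\", version=\"1\"\n",
")\n",
"print(registered_model.id)"
]
},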
{
"attachments": {},
"cell_type": "markdown",
"id": "310aa659",
"metadata": {},
"source": [
"Next, we load the RAI components, so that we can construct a pipeline:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d67b942e",
"metadata": {},
"outputs": [],
"source": [
"covid19_test_mltable = Input(\n",
" type=\"mltable\",\n",
" path=f\"{input_test_data}:{rai_example_version_string}\",\n",
" mode=\"download\",\n",
")\n",
"\n",
"registry_name = \"azureml\"\n",
"credential = DefaultAzureCredential()\n",
"\n",
"ml_client_registry = MLClient(\n",
" credential=credential,\n",
" subscription_id=ml_client.subscription_id,\n",
" resource_group_name=ml_client.resource_group_name,\n",
" registry_name=registry_name,\n",
")\n",
"\n",
"rai_text_insights_component = ml_client_registry.components.get(\n",
" name=\"rai_text_insights\", version=version_string\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "c98cd2d9",
"metadata": {},
"source": [
"We can now specify our pipeline. Complex objects (such as lists of column names) have to be converted to JSON strings before being passed to the components."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a62105a7",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"from azure.ai.ml import Input\n",
"from azure.ai.ml.constants import AssetTypes\n",
"\n",
"\n",
"@dsl.pipeline(\n",
" compute=compute_name,\n",
" description=\"Example RAI computation on Covid 19 Events Multilabel data\",\n",
" experiment_name=f\"RAI_Covid19_Events_Multilabel_Example_RAIInsights_Computation_{model_name_suffix}\",\n",
")\n",
"def rai_covid19_text_classification_pipeline(\n",
" target_column_name,\n",
" test_data,\n",
" classes,\n",
" use_model_dependency,\n",
"):\n",
" # Initiate the RAIInsights\n",
" rai_text_job = rai_text_insights_component(\n",
" task_type=\"multilabel_text_classification\",\n",
" model_info=expected_model_id,\n",
" model_input=Input(type=AssetTypes.MLFLOW_MODEL, path=azureml_model_id),\n",
" test_dataset=test_data,\n",
" target_column_name=target_column_name,\n",
" classes=classes,\n",
" use_model_dependency=use_model_dependency,\n",
" )\n",
" rai_text_job.set_limits(timeout=7200)\n",
"\n",
" rai_text_job.outputs.dashboard.mode = \"upload\"\n",
" rai_text_job.outputs.ux_json.mode = \"upload\"\n",
"\n",
" return {\n",
" \"dashboard\": rai_text_job.outputs.dashboard,\n",
" \"ux_json\": rai_text_job.outputs.ux_json,\n",
" }"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "6b5b14a9",
"metadata": {},
"source": [
"Next, we define the pipeline object itself, and ensure that the outputs will be available for download:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e4d86ec2",
"metadata": {},
"outputs": [],
"source": [
"import uuid\n",
"from azure.ai.ml import Output\n",
"\n",
"insights_pipeline_job = rai_covid19_text_classification_pipeline(\n",
" target_column_name=encoded_target_column_name,\n",
" test_data=covid19_test_mltable,\n",
" classes=\"[]\",\n",
" use_model_dependency=True,\n",
")\n",
"\n",
"rand_path = str(uuid.uuid4())\n",
"insights_pipeline_job.outputs.dashboard = Output(\n",
" path=f\"azureml://datastores/workspaceblobstore/paths/{rand_path}/dashboard/\",\n",
" mode=\"upload\",\n",
" type=\"uri_folder\",\n",
")\n",
"insights_pipeline_job.outputs.ux_json = Output(\n",
" path=f\"azureml://datastores/workspaceblobstore/paths/{rand_path}/ux_json/\",\n",
" mode=\"upload\",\n",
" type=\"uri_folder\",\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "25f34573",
"metadata": {},
"source": [
"And submit the pipeline to AzureML for execution:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2ca757f7",
"metadata": {},
"outputs": [],
"source": [
"insights_job = submit_and_wait(ml_client, insights_pipeline_job)"
]
},
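{
"attachments": {},
"cell_type": "markdown",
"id": "2ca757f8",
"metadata": {},
"source": [
"Once the job has completed, the dashboard artifacts can also be downloaded locally if desired. This optional sketch uses the named `dashboard` output defined above; the local folder name is arbitrary:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2ca757f9",
"metadata": {},
"outputs": [],
"source": [
"# Optional: download the dashboard output of the completed RAI insights job\n",
"ml_client.jobs.download(\n",
"    name=insights_job.name,\n",
"    output_name=\"dashboard\",\n",
"    download_path=\"rai_dashboard_outputs\",\n",
")"
]
},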
{
"attachments": {},
"cell_type": "markdown",
"id": "1381768a",
"metadata": {},
"source": [
"The dashboard should appear in the AzureML portal in the registered model view. The following cell computes the expected URI:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e86ab611",
"metadata": {},
"outputs": [],
"source": [
"sub_id = ml_client._operation_scope.subscription_id\n",
"rg_name = ml_client._operation_scope.resource_group_name\n",
"ws_name = ml_client.workspace_name\n",
"\n",
"expected_uri = f\"https://ml.azure.com/model/{expected_model_id}/model_analysis?wsid=/subscriptions/{sub_id}/resourcegroups/{rg_name}/workspaces/{ws_name}\"\n",
"\n",
"print(f\"Please visit {expected_uri} to see your analysis\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "93a8dff9",
"metadata": {},
"source": [
"## Constructing the pipeline in YAML\n",
"\n",
"It is also possible to specify the pipeline as a YAML file, and submit that using the command line. We will now create a YAML specification of the above pipeline and submit that:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "624bb0cd",
"metadata": {},
"outputs": [],
"source": [
"yaml_contents = f\"\"\"\n",
"$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json\n",
"experiment_name: AML_RAI_Multilabel_Text_Sample_{rai_example_version_string}\n",
"type: pipeline\n",
"\n",
"compute: azureml:cpucluster\n",
"\n",
"inputs:\n",
" hf_model_info: {expected_model_id}\n",
" my_test_data:\n",
" type: mltable\n",
" path: azureml:{input_test_data}:{rai_example_version_string}\n",
" mode: download\n",
"\n",
"settings:\n",
" default_datastore: azureml:workspaceblobstore\n",
" default_compute: azureml:cpucluster\n",
" continue_on_step_failure: false\n",
"\n",
"jobs:\n",
" analyse_model:\n",
" type: command\n",
" component: azureml://registries/azureml/components/rai_text_insights/versions/{version_string}\n",
" inputs:\n",
" task_type: multilabel_text_classification\n",
" model_input:\n",
" type: mlflow_model\n",
" path: {azureml_model_id}\n",
" model_info: ${{{{parent.inputs.hf_model_info}}}}\n",
" test_dataset: ${{{{parent.inputs.my_test_data}}}}\n",
" target_column_name: {target_column_name}\n",
" maximum_rows_for_test_dataset: 5000\n",
" classes: '[]'\n",
" enable_explanation: True\n",
" enable_error_analysis: True\n",
"\"\"\"\n",
"\n",
"yaml_pipeline_filename = \"rai_text_example.yaml\"\n",
"\n",
"with open(yaml_pipeline_filename, \"w\") as f:\n",
" f.write(yaml_contents)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "1fd5f2dd",
"metadata": {},
"source": [
"The created file can then be submitted using the Azure CLI:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3bf9bb1c",
"metadata": {},
"outputs": [],
"source": [
"cmd_line = [\n",
" \"az\",\n",
" \"ml\",\n",
" \"job\",\n",
" \"create\",\n",
" \"--resource-group\",\n",
" rg_name,\n",
" \"--workspace\",\n",
" ws_name,\n",
" \"--file\",\n",
" yaml_pipeline_filename,\n",
"]\n",
"\n",
"import subprocess\n",
"\n",
"try:\n",
" cmd = subprocess.run(cmd_line, check=True, shell=True, capture_output=True)\n",
"except subprocess.CalledProcessError as cpe:\n",
" print(f\"Error invoking: {cpe.args}\")\n",
" print(cpe.stdout)\n",
" print(cpe.stderr)\n",
" raise\n",
"else:\n",
" print(\"Azure CLI submission completed\")"
]
}
],
"metadata": {
"celltoolbar": "Raw Cell Format",
"kernelspec": {
"display_name": "Python 3.10 - SDK V2",
"language": "python",
"name": "python310-sdkv2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.15"
},
"vscode": {
"interpreter": {
"hash": "8fd340b5477ca1a0b454d48a3973beff39fee032ada47a04f6f3725b469a8988"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}