sdk/python/jobs/pipelines/1h_automl_in_pipeline/automl-image-classification-multilabel-in-pipeline/automl-image-classification-multilabel-in-pipeline.ipynb (507 lines of code) (raw):
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# AutoML Image Classification Multilabel in pipeline\n",
"\n",
"**Requirements** - In order to benefit from this tutorial, you will need:\n",
"- A basic understanding of Machine Learning\n",
"- An Azure account with an active subscription - [Create an account for free](https://azure.microsoft.com/free/?WT.mc_id=A261C142F)\n",
"- An Azure ML workspace with computer cluster - [Configure workspace](../../configuration.ipynb)\n",
"- A python environment\n",
"- Installed Azure Machine Learning Python SDK v2 - [install instructions](../../../README.md) - check the getting started section\n",
"\n",
"**Learning Objectives** - By the end of this tutorial, you should be able to:\n",
"- Create a pipeline with Image Classification Multilabel AutoML task.\n",
"\n",
"**Motivations** - This notebook explains how to use Image Classification Multilabel AutoML task inside pipeline."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. Connect to Azure Machine Learning Workspace\n",
"\n",
"The [workspace](https://docs.microsoft.com/en-us/azure/machine-learning/concept-workspace) is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section we will connect to the workspace in which the job will be run.\n",
"\n",
"## 1.1 Import the required libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# import required libraries\n",
"from azure.identity import DefaultAzureCredential\n",
"\n",
"from azure.ai.ml import MLClient, Input, command, Output\n",
"from azure.ai.ml.automl import (\n",
" image_classification_multilabel,\n",
" SearchSpace,\n",
" ClassificationMultilabelPrimaryMetrics,\n",
")\n",
"from azure.ai.ml.dsl import pipeline\n",
"from azure.ai.ml.sweep import BanditPolicy, Choice, Uniform\n",
"from azure.ai.ml.entities import Environment"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1.2. Configure workspace details and get a handle to the workspace\n",
"\n",
"To connect to a workspace, we need identifier parameters - a subscription, resource group and workspace name. We will use these details in the `MLClient` from `azure.ai.ml` to get a handle to the required Azure Machine Learning workspace. We use the default [default azure authentication](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity.defaultazurecredential?view=azure-python) for this tutorial. Check the [configuration notebook](../../configuration.ipynb) for more details on how to configure credentials and connect to a workspace."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"credential = DefaultAzureCredential()\n",
"ml_client = None\n",
"try:\n",
" ml_client = MLClient.from_config(credential)\n",
"except Exception as ex:\n",
" print(ex)\n",
" # Enter details of your AML workspace\n",
" subscription_id = \"<SUBSCRIPTION_ID>\"\n",
" resource_group = \"<RESOURCE_GROUP>\"\n",
" workspace = \"<AML_WORKSPACE_NAME>\"\n",
" ml_client = MLClient(credential, subscription_id, resource_group, workspace)\n",
"print(ml_client)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2. MLTable with input Training Data\n",
"\n",
"In order to generate models for computer vision tasks with automated machine learning, you need to bring labeled image data as input for model training in the form of an MLTable. You can create an MLTable from labeled training data in JSONL format. If your labeled training data is in a different format (like, pascal VOC or COCO), you can use a conversion script to first convert it to JSONL, and then create an MLTable. Alternatively, you can use Azure Machine Learning's [data labeling tool](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-image-labeling-projects) to manually label images, and export the labeled data to use for training your AutoML model.\n",
"\n",
"In this notebook, we use a toy dataset called Fridge Objects, which consists of 128 images of 4 labels of beverage container {`can`, `carton`, `milk bottle`, `water bottle`} photos taken on different backgrounds. It also includes a labels file in .csv format. This is one of the most common data formats for Image Classification Multi-Label: one csv file that contains the mapping of labels to a folder of images.\n",
"\n",
"\n",
"All images in this notebook are hosted in [this repository](https://github.com/microsoft/computervision-recipes) and are made available under the [MIT license](https://github.com/microsoft/computervision-recipes/blob/master/LICENSE)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.1 Download Data\n",
"\n",
"We first download and unzip the data locally. By default, the data would be downloaded in `./data` folder in current directory. \n",
"If you prefer to download the data at a different location, update it in `dataset_parent_dir = ...` in the next cell."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import urllib\n",
"from zipfile import ZipFile\n",
"\n",
"# Change to a different location if you prefer\n",
"dataset_parent_dir = \"./data\"\n",
"\n",
"# create data folder if it doesnt exist.\n",
"os.makedirs(dataset_parent_dir, exist_ok=True)\n",
"\n",
"# download data\n",
"download_url = \"https://automlsamplenotebookdata-adcuc7f7bqhhh8a4.b02.azurefd.net/image-classification/multilabelFridgeObjects.zip\"\n",
"\n",
"# Extract current dataset name from dataset url\n",
"dataset_name = os.path.split(download_url)[-1].split(\".\")[0]\n",
"# Get dataset path for later use\n",
"dataset_dir = os.path.join(dataset_parent_dir, dataset_name)\n",
"\n",
"# Get the data zip file path\n",
"data_file = os.path.join(dataset_parent_dir, f\"{dataset_name}.zip\")\n",
"\n",
"# Download the dataset\n",
"urllib.request.urlretrieve(download_url, filename=data_file)\n",
"\n",
"# extract files\n",
"with ZipFile(data_file, \"r\") as zip:\n",
" print(\"extracting files...\")\n",
" zip.extractall(path=dataset_parent_dir)\n",
" print(\"done\")\n",
"# delete zip file\n",
"os.remove(data_file)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is a sample image from this dataset:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from IPython.display import Image\n",
"\n",
"sample_image = os.path.join(dataset_dir, \"images\", \"56.jpg\")\n",
"Image(filename=sample_image)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.2 Upload the images to Datastore through an AML Data asset (URI Folder)\n",
"\n",
"In order to use the data for training in Azure ML, we upload it to our default Azure Blob Storage of our Azure ML Workspace.\n",
"\n",
"Reference to URI FOLDER data asset example for further details: https://github.com/Azure/azureml-examples/blob/samuel100/data-samples/sdk/assets/data/data.ipynb"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Uploading image files by creating a 'data asset URI FOLDER':\n",
"\n",
"from azure.ai.ml.entities import Data\n",
"from azure.ai.ml.constants import AssetTypes, InputOutputModes\n",
"from azure.ai.ml import Input\n",
"\n",
"my_data = Data(\n",
" path=dataset_dir,\n",
" type=AssetTypes.URI_FOLDER,\n",
" description=\"Fridge-items images multilabel\",\n",
" name=\"fridge-items-images-multilabel\",\n",
")\n",
"\n",
"uri_folder_data_asset = ml_client.data.create_or_update(my_data)\n",
"\n",
"print(uri_folder_data_asset)\n",
"print(\"\")\n",
"print(\"Path to folder in Blob Storage:\")\n",
"print(uri_folder_data_asset.path)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.3 Convert the downloaded data to JSON metadata\n",
"\n",
"In this example, the fridge object dataset is annotated in the CSV file, where each image corresponds to a line. It defines a mapping of the filename to the labels. Since this is a multi-label classification problem, each image can be associated to multiple labels. \n",
"\n",
"For documentation on preparing the datasets beyond this notebook, please refer to the [documentation on how to prepare datasets](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-prepare-datasets-for-automl-images).\n",
"\n",
"\n",
"In order to use this data to create an AzureML MLTable, we first need to convert it to the required JSONL format. The following script is creating two `.jsonl` files (one for training and one for validation) in the corresponding MLTable folder. The train / validation ratio corresponds to 20% of the data going into the validation file. For further details on jsonl file used for image classification task in automated ml, please refer to the [data schema documentation for multi-label image classification task](https://learn.microsoft.com/en-us/azure/machine-learning/reference-automl-images-schema#image-classification-multi-label)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import os\n",
"\n",
"# We'll copy each JSONL file within its related MLTable folder\n",
"training_mltable_path = os.path.join(dataset_parent_dir, \"training-mltable-folder\")\n",
"validation_mltable_path = os.path.join(dataset_parent_dir, \"validation-mltable-folder\")\n",
"\n",
"# First, let's create the folders if they don't exist\n",
"os.makedirs(training_mltable_path, exist_ok=True)\n",
"os.makedirs(validation_mltable_path, exist_ok=True)\n",
"\n",
"train_validation_ratio = 5\n",
"\n",
"# Path to the training and validation files\n",
"train_annotations_file = os.path.join(training_mltable_path, \"train_annotations.jsonl\")\n",
"validation_annotations_file = os.path.join(\n",
" validation_mltable_path, \"validation_annotations.jsonl\"\n",
")\n",
"\n",
"# Baseline of json line dictionary\n",
"json_line_sample = {\n",
" \"image_url\": uri_folder_data_asset.path,\n",
" \"label\": [],\n",
"}\n",
"\n",
"# Path to the labels file.\n",
"labelFile = os.path.join(dataset_dir, \"labels.csv\")\n",
"\n",
"# Read each annotation and convert it to jsonl line\n",
"with open(train_annotations_file, \"w\") as train_f:\n",
" with open(validation_annotations_file, \"w\") as validation_f:\n",
" with open(labelFile, \"r\") as labels:\n",
" for i, line in enumerate(labels):\n",
" # Skipping the title line and any empty lines.\n",
" if i == 0 or len(line.strip()) == 0:\n",
" continue\n",
" line_split = line.strip().split(\",\")\n",
" if len(line_split) != 2:\n",
" print(f\"Skipping the invalid line: {line}\")\n",
" continue\n",
" json_line = dict(json_line_sample)\n",
" json_line[\"image_url\"] += f\"images/{line_split[0]}\"\n",
" json_line[\"label\"] = line_split[1].strip().split(\" \")\n",
"\n",
" if i % train_validation_ratio == 0:\n",
" # validation annotation\n",
" validation_f.write(json.dumps(json_line) + \"\\n\")\n",
" else:\n",
" # train annotation\n",
" train_f.write(json.dumps(json_line) + \"\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.4 Create MLTable data input\n",
"\n",
"Create MLTable data input using the jsonl files created above.\n",
"\n",
"For documentation on creating your own MLTable assets for jobs beyond this notebook, please refer to below resources\n",
"- [MLTable YAML Schema](https://learn.microsoft.com/en-us/azure/machine-learning/reference-yaml-mltable) - covers how to write MLTable YAML, which is required for each MLTable asset.\n",
"- [Create MLTable data asset](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-data-assets?tabs=Python-SDK#create-a-mltable-data-asset) - covers how to create MLTable data asset. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def create_ml_table_file(filename):\n",
" \"\"\"Create ML Table definition\"\"\"\n",
"\n",
" return (\n",
" \"paths:\\n\"\n",
" \" - file: ./{0}\\n\"\n",
" \"transformations:\\n\"\n",
" \" - read_json_lines:\\n\"\n",
" \" encoding: utf8\\n\"\n",
" \" invalid_lines: error\\n\"\n",
" \" include_path_column: false\\n\"\n",
" \" - convert_column_types:\\n\"\n",
" \" - columns: image_url\\n\"\n",
" \" column_type: stream_info\"\n",
" ).format(filename)\n",
"\n",
"\n",
"def save_ml_table_file(output_path, mltable_file_contents):\n",
" with open(os.path.join(output_path, \"MLTable\"), \"w\") as f:\n",
" f.write(mltable_file_contents)\n",
"\n",
"\n",
"# Create and save train mltable\n",
"train_mltable_file_contents = create_ml_table_file(\n",
" os.path.basename(train_annotations_file)\n",
")\n",
"save_ml_table_file(training_mltable_path, train_mltable_file_contents)\n",
"\n",
"# Save train and validation mltable\n",
"validation_mltable_file_contents = create_ml_table_file(\n",
" os.path.basename(validation_annotations_file)\n",
")\n",
"save_ml_table_file(validation_mltable_path, validation_mltable_file_contents)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3. Basic pipeline job with Image Classification Multilabel task\n",
"\n",
"## 3.1 Build pipeline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# note that the used docker image doesn't suit for all size of gpu compute. Please use the following command to create gpu compute if experiment failed\n",
"# !az ml compute create -n gpu-cluster --type amlcompute --min-instances 0 --max-instances 4 --size Standard_NC12"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## create custom pipeline environment"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Define pipeline\n",
"@pipeline(\n",
" description=\"AutoML Image Clasiification Multilabel Pipeline\",\n",
")\n",
"def automl_image_classification_multilabel(\n",
" image_classification_multilabel_train_data,\n",
" image_classification_multilabel_validation_data,\n",
"):\n",
" # define the automl image_classification_multilabel task with automl function\n",
" image_classification_multilabel_node = image_classification_multilabel(\n",
" training_data=image_classification_multilabel_train_data,\n",
" validation_data=image_classification_multilabel_validation_data,\n",
" target_column_name=\"label\",\n",
" primary_metric=ClassificationMultilabelPrimaryMetrics.IOU,\n",
" # currently need to specify outputs \"mlflow_model\" explictly to reference it in following nodes\n",
" outputs={\"best_model\": Output(type=\"mlflow_model\")},\n",
" )\n",
" image_classification_multilabel_node.set_limits(\n",
" max_trials=10, max_concurrent_trials=2, timeout_minutes=180\n",
" )\n",
"\n",
" image_classification_multilabel_node.extend_search_space(\n",
" [\n",
" SearchSpace(\n",
" model_name=Choice([\"vitb16r224\"]),\n",
" learning_rate=Uniform(0.005, 0.05),\n",
" number_of_epochs=Choice([15, 30]),\n",
" gradient_accumulation_step=Choice([1, 2]),\n",
" ),\n",
" SearchSpace(\n",
" model_name=Choice([\"seresnext\"]),\n",
" learning_rate=Uniform(0.005, 0.05),\n",
" # model-specific, valid_resize_size should be larger or equal than valid_crop_size\n",
" validation_resize_size=Choice([288, 320, 352]),\n",
" validation_crop_size=Choice([224, 256]), # model-specific\n",
" training_crop_size=Choice([224, 256]), # model-specific\n",
" ),\n",
" ]\n",
" )\n",
"\n",
" image_classification_multilabel_node.set_sweep(\n",
" sampling_algorithm=\"Random\",\n",
" early_termination=BanditPolicy(\n",
" evaluation_interval=2, slack_factor=0.2, delay_evaluation=6\n",
" ),\n",
" )\n",
"\n",
" # define command function for registering the model\n",
" command_func = command(\n",
" inputs=dict(\n",
" model_input_path=Input(type=\"mlflow_model\"),\n",
" model_base_name=\"image_classification_multilabel_example_model\",\n",
" ),\n",
" code=\"./register.py\",\n",
" command=\"python register.py \"\n",
" + \"--model_input_path ${{inputs.model_input_path}} \"\n",
" + \"--model_base_name ${{inputs.model_base_name}}\",\n",
" environment=\"azureml://registries/azureml/environments/sklearn-1.5/labels/latest\",\n",
" )\n",
" register_model = command_func(\n",
" model_input_path=image_classification_multilabel_node.outputs.best_model\n",
" )\n",
"\n",
"\n",
"pipeline = automl_image_classification_multilabel(\n",
" image_classification_multilabel_train_data=Input(\n",
" path=training_mltable_path, type=\"mltable\"\n",
" ),\n",
" image_classification_multilabel_validation_data=Input(\n",
" path=validation_mltable_path, type=\"mltable\"\n",
" ),\n",
")\n",
"\n",
"# set pipeline level compute\n",
"pipeline.settings.default_compute = \"gpu-cluster\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3.2 Submit pipeline job"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# submit the pipeline job\n",
"pipeline_job = ml_client.jobs.create_or_update(\n",
" pipeline, experiment_name=\"pipeline_samples\"\n",
")\n",
"pipeline_job"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Wait until the job completes\n",
"ml_client.jobs.stream(pipeline_job.name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Next Steps\n",
"You can see further examples of running a pipeline job [here](../)"
]
}
],
"metadata": {
"description": {
"description": "Create pipeline with automl node"
},
"kernelspec": {
"display_name": "Python 3.10 - SDK V2",
"language": "python",
"name": "python310-sdkv2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.7"
},
"vscode": {
"interpreter": {
"hash": "a3e1ce86190527341b095dce2d981b591205330162e59d5b85eea3038817dc05"
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}