{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"# Distributed Pytorch training - CIFAR 10 dataset\n",
"\n",
"### Requirements/Prerequisites\n",
"- An Azure acoount with active subscription [Create an account for free](https://azure.microsoft.com/free/?WT.mc_id=A261C142F)\n",
"- Azure Machine Learning workspace [Configure workspace](../../../configuration.ipynb) \n",
"- Python Environment\n",
"- Install Azure ML Python SDK Version 2\n",
"### Learning Objectives\n",
"- Connect to workspace using Python SDK v2\n",
"- Setting up the _Command_ to download data from a web url to AML workspace blob storage by running a _job_.\n",
"- Use this data stored in AML workspace blob storage as the input to the train job _command_.\n",
"- Distributed training of Pytorch model.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"# 1. Connect to Azure Machine Learning Workspace"
]
},
{
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"## 1.1 Import required libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# import required libraries\n",
"from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential\n",
"\n",
"from azure.ai.ml import MLClient, Input\n",
"from azure.ai.ml.dsl import pipeline\n",
"from azure.ai.ml import load_component"
]
},
{
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"## 1.2 Connect to workspace using DefaultAzureCredential\n",
"`DefaultAzureCredential` should be capable of handling most Azure SDK authentication scenarios. \n",
"\n",
"Reference for more available credentials if it does not work for you: [configure credential example](../../configuration.ipynb), [azure-identity reference doc](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity?view=azure-python)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [],
"source": [
"credential = DefaultAzureCredential()\n",
"ml_client = None\n",
"try:\n",
" ml_client = MLClient.from_config(credential)\n",
"except Exception as ex:\n",
" print(ex)\n",
" # Enter details of your AML workspace\n",
" subscription_id = \"<SUBSCRIPTION_ID>\"\n",
" resource_group = \"<RESOURCE_GROUP>\"\n",
" workspace = \"<AML_WORKSPACE_NAME>\"\n",
" ml_client = MLClient(credential, subscription_id, resource_group, workspace)"
]
},
{
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"## 1.3 Get handle to workspace"
]
},
{
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"#### Retrieving credentials from the `ml_client`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ml_client.workspace_name"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [],
"source": [
"# print(ml_client)\n",
"\n",
"workspace = ml_client.workspace_name\n",
"subscription_id = ml_client.workspaces.get(workspace).id.split(\"/\")[2]\n",
"resource_group = ml_client.workspaces.get(workspace).resource_group"
]
},
{
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"# 2. Configure and Run Command \n",
"\n",
"In this section we will be configuring and running two standalone jobs. \n",
"- `command` for reading and writing data\n",
"- `command` for distributed training job.\n",
"\n",
"\n",
"The `command` allows user to configure the following key aspects.\n",
"- `code` - This is the path where the code to run the command is located\n",
"- `command` - This is the command that needs to be run\n",
"- `inputs` - This is the dictionary of inputs using name value pairs to the command. The key is a name for the input within the context of the job and the value is the input value. Inputs can be referenced in the `command` using the `${{inputs.<input_name>}}` expression. To use files or folders as inputs, we can use the `Input` class. The `Input` class supports three parameters:\n",
" - `type` - The type of input. This can be a `uri_file` or `uri_folder`. The default is `uri_folder`. \n",
" - `path` - The path to the file or folder. These can be local or remote files or folders. For remote files - http/https, wasb are supported. \n",
" - Azure ML `data`/`dataset` or `datastore` are of type `uri_folder`. To use `data`/`dataset` as input, you can use registered dataset in the workspace using the format '<data_name>:<version>'. For e.g Input(type='uri_folder', path='my_dataset:1')\n",
" - `mode` - \tMode of how the data should be delivered to the compute target. Allowed values are `ro_mount`, `rw_mount` and `download`. Default is `ro_mount`\n",
"- `environment` - This is the environment needed for the command to run. Curated or custom environments from the workspace can be used. Or a custom environment can be created and used as well. Check out the [environment](../../../../assets/environment/environment.ipynb) notebook for more examples.\n",
"- `compute` - The compute on which the command will run. In this example we are using [serverless compute (preview)](https://learn.microsoft.com/azure/machine-learning/how-to-use-serverless-compute?view=azureml-api-2&tabs=python) so there is no need to specify any compute. You can also replace serverless with any other compute in the workspace. You can run it on the local machine by using `local` for the compute. This will run the command on the local machine and all the run details and output of the job will be uploaded to the Azure ML workspace.\n",
"- `distribution` - Distribution configuration for distributed training scenarios. Azure Machine Learning supports PyTorch, TensorFlow, and MPI-based distributed \n"
]
},
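   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "Before configuring the actual jobs, here is a minimal sketch of an `Input` that references a registered data asset with an explicit `mode`. Note that `my_dataset:1` is a placeholder name/version used for illustration only, not an asset created by this notebook."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "from azure.ai.ml import Input\n",
     "from azure.ai.ml.constants import AssetTypes, InputOutputModes\n",
     "\n",
     "# \"my_dataset:1\" is a hypothetical registered data asset (<data_name>:<version>);\n",
     "# replace it with an asset that exists in your workspace before using it in a job\n",
     "example_input = Input(\n",
     "    type=AssetTypes.URI_FOLDER,\n",
     "    path=\"my_dataset:1\",\n",
     "    mode=InputOutputModes.RO_MOUNT,  # read-only mount, which is also the default\n",
     ")"
    ]
   },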
{
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"### 2.1 Configure Command for reading and writing data\n",
"The CIFAR 10 dataset, a compressed file, is downloaded from a public url. The `read_write_data.py` code which is in the `src` folder does the extraction of files using the `tarfile library`."
]
},
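   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "For reference, below is a minimal sketch of what such an extraction script could look like. It is illustrative only - the job itself runs the real `read_write_data.py` from `./src`. The argument names mirror those passed in the command configured in the next cell."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "# illustrative sketch only - the job runs the real read_write_data.py in ./src\n",
     "import argparse\n",
     "import tarfile\n",
     "\n",
     "parser = argparse.ArgumentParser()\n",
     "parser.add_argument(\"--input_data\", type=str, help=\"path to cifar-10-python.tar.gz\")\n",
     "parser.add_argument(\"--output_folder\", type=str, help=\"folder to extract the dataset into\")\n",
     "args = parser.parse_args()\n",
     "\n",
     "# extract the compressed CIFAR-10 archive into the output folder,\n",
     "# which the job mounts from the workspace blob storage\n",
     "with tarfile.open(args.input_data, \"r:gz\") as archive:\n",
     "    archive.extractall(path=args.output_folder)"
    ]
   },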
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"name": "inputs",
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [],
"source": [
"from azure.ai.ml import command\n",
"from azure.ai.ml.entities import Data\n",
"from azure.ai.ml import Input\n",
"from azure.ai.ml import Output\n",
"from azure.ai.ml.constants import AssetTypes\n",
"\n",
"\n",
"inputs = {\n",
" \"cifar_zip\": Input(\n",
" type=AssetTypes.URI_FILE,\n",
" path=\"https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz\",\n",
" ),\n",
"}\n",
"\n",
"outputs = {\n",
" \"cifar\": Output(\n",
" type=AssetTypes.URI_FOLDER,\n",
" path=f\"azureml://subscriptions/{subscription_id}/resourcegroups/{resource_group}/workspaces/{workspace}/datastores/workspaceblobstore/paths/CIFAR-10\",\n",
" )\n",
"}\n",
"\n",
"job = command(\n",
" code=\"./src\", # local path where the code is stored\n",
" command=\"python read_write_data.py --input_data ${{inputs.cifar_zip}} --output_folder ${{outputs.cifar}}\",\n",
" inputs=inputs,\n",
" outputs=outputs,\n",
" environment=\"azureml://registries/azureml/environments/sklearn-1.5/labels/latest\",\n",
")\n",
"\n",
"# submit the command\n",
"returned_job = ml_client.jobs.create_or_update(job)\n",
"# get a URL for the status of the job\n",
"returned_job.studio_url"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ml_client.jobs.stream(returned_job.name)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [],
"source": [
"print(returned_job.name)\n",
"print(returned_job.experiment_name)\n",
"print(returned_job.outputs.cifar)\n",
"print(returned_job.outputs.cifar.path)"
]
},
{
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"### 2.2 Configure Command for distributed training using Pytorch"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"name": "job",
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [],
"source": [
"from azure.ai.ml import command\n",
"from azure.ai.ml.entities import Data\n",
"from azure.ai.ml import Input\n",
"from azure.ai.ml import Output\n",
"from azure.ai.ml.constants import AssetTypes\n",
"\n",
"# === Note on path ===\n",
"# can be can be a local path or a cloud path. AzureML supports https://`, `abfss://`, `wasbs://` and `azureml://` URIs.\n",
"# Local paths are automatically uploaded to the default datastore in the cloud.\n",
"# More details on supported paths: https://docs.microsoft.com/azure/machine-learning/how-to-read-write-data-v2#supported-paths\n",
"\n",
"inputs = {\n",
" \"cifar\": Input(\n",
" type=AssetTypes.URI_FOLDER, path=returned_job.outputs.cifar.path\n",
" ), # path=\"azureml:azureml_stoic_cartoon_wgb3lgvgky_output_data_cifar:1\"), #path=\"azureml://datastores/workspaceblobstore/paths/azureml/stoic_cartoon_wgb3lgvgky/cifar/\"),\n",
" \"epoch\": 10,\n",
" \"batchsize\": 64,\n",
" \"workers\": 2,\n",
" \"lr\": 0.01,\n",
" \"momen\": 0.9,\n",
" \"prtfreq\": 200,\n",
" \"output\": \"./outputs\",\n",
"}\n",
"\n",
"from azure.ai.ml.entities import ResourceConfiguration\n",
"\n",
"job = command(\n",
" code=\"./src\", # local path where the code is stored\n",
" command=\"python train.py --data-dir ${{inputs.cifar}} --epochs ${{inputs.epoch}} --batch-size ${{inputs.batchsize}} --workers ${{inputs.workers}} --learning-rate ${{inputs.lr}} --momentum ${{inputs.momen}} --print-freq ${{inputs.prtfreq}} --model-dir ${{inputs.output}}\",\n",
" inputs=inputs,\n",
" environment=\"azureml:AzureML-acpt-pytorch-2.2-cuda12.1@latest\",\n",
" instance_count=2, # In this, only 2 node cluster was created.\n",
" distribution={\n",
" \"type\": \"PyTorch\",\n",
" # set process count to the number of gpus per node\n",
" # NC6s_v3 has only 1 GPU\n",
" \"process_count_per_instance\": 1,\n",
" },\n",
")\n",
"job.resources = ResourceConfiguration(\n",
" instance_type=\"Standard_NC6s_v3\", instance_count=2\n",
") # Serverless compute resources"
]
},
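   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "With `distribution={\"type\": \"PyTorch\"}`, Azure ML launches `process_count_per_instance` processes on each node and sets the standard PyTorch environment variables (`MASTER_ADDR`, `MASTER_PORT`, `RANK`, `WORLD_SIZE`, `LOCAL_RANK`). Below is a minimal, illustrative sketch of how a training script such as `train.py` can initialize `torch.distributed` from those variables; the actual script used by this job lives in `./src`."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "# illustrative sketch only - the job runs the real train.py from ./src\n",
     "import os\n",
     "\n",
     "import torch\n",
     "import torch.distributed as dist\n",
     "\n",
     "# Azure ML's PyTorch distribution sets MASTER_ADDR, MASTER_PORT, RANK and\n",
     "# WORLD_SIZE on every process, so the default env:// init can read them\n",
     "if \"WORLD_SIZE\" in os.environ:\n",
     "    dist.init_process_group(\n",
     "        backend=\"nccl\" if torch.cuda.is_available() else \"gloo\"\n",
     "    )\n",
     "\n",
     "    # LOCAL_RANK selects the GPU on each node (Standard_NC6s_v3 has one GPU)\n",
     "    local_rank = int(os.environ.get(\"LOCAL_RANK\", 0))\n",
     "    device = torch.device(f\"cuda:{local_rank}\" if torch.cuda.is_available() else \"cpu\")\n",
     "    print(f\"rank {dist.get_rank()} of {dist.get_world_size()} on {device}\")\n",
     "\n",
     "    # the model would then be wrapped for data-parallel training, e.g.\n",
     "    # model = torch.nn.parallel.DistributedDataParallel(\n",
     "    #     model.to(device), device_ids=[local_rank]\n",
     "    # )"
    ]
   },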
{
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"source": [
"# 3. Submit the job"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [],
"source": [
"ml_client.jobs.create_or_update(job)"
]
},
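   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "Stream the logs of the training job until it completes."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "ml_client.jobs.stream(returned_job.name)"
    ]
   }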
],
"metadata": {
"kernel_info": {
"name": "python310-sdkv2"
},
"kernelspec": {
"display_name": "Python 3.10 - SDK V2",
"language": "python",
"name": "python310-sdkv2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
},
"nteract": {
"version": "nteract-front-end@1.0.0"
}
},
"nbformat": 4,
"nbformat_minor": 1
}