sdk/python/jobs/single-step/pytorch/distributed-training/distributed-cifar10.ipynb

{ "cells": [ { "cell_type": "markdown", "metadata": { "nteract": { "transient": { "deleting": false } } }, "source": [ "# Distributed PyTorch training - CIFAR-10 dataset\n", "\n", "### Requirements/Prerequisites\n", "- An Azure account with an active subscription [Create an account for free](https://azure.microsoft.com/free/?WT.mc_id=A261C142F)\n", "- An Azure Machine Learning workspace [Configure workspace](../../../configuration.ipynb)\n", "- A Python environment\n", "- The Azure ML Python SDK v2 installed\n", "### Learning Objectives\n", "- Connect to a workspace using the Python SDK v2\n", "- Set up a _command_ that downloads data from a web URL to AML workspace blob storage by running a _job_\n", "- Use the data stored in AML workspace blob storage as the input to the training _command_\n", "- Run distributed training of a PyTorch model\n" ] }, { "cell_type": "markdown", "metadata": { "nteract": { "transient": { "deleting": false } } }, "source": [ "# 1. Connect to Azure Machine Learning Workspace" ] }, { "cell_type": "markdown", "metadata": { "nteract": { "transient": { "deleting": false } } }, "source": [ "## 1.1 Import required libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# import required libraries\n", "from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential\n", "\n", "from azure.ai.ml import MLClient, Input\n", "from azure.ai.ml.dsl import pipeline\n", "from azure.ai.ml import load_component" ] }, { "cell_type": "markdown", "metadata": { "nteract": { "transient": { "deleting": false } } }, "source": [ "## 1.2 Connect to workspace using DefaultAzureCredential\n", "`DefaultAzureCredential` should be capable of handling most Azure SDK authentication scenarios.\n", "\n", "If it does not work for you, see the [configure credential example](../../configuration.ipynb) and the [azure-identity reference documentation](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity?view=azure-python) for more available credentials." ] }
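, { "cell_type": "markdown", "metadata": { "nteract": { "transient": { "deleting": false } } }, "source": [ "If `DefaultAzureCredential` cannot obtain a token in your environment, one possible fallback is `InteractiveBrowserCredential`, which is already imported above. The cell below is a minimal sketch of such a fallback chain, assuming a browser is available for the interactive login; the `get_token` probe forces authentication up front, so a misconfigured credential fails here rather than at the first workspace operation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A minimal sketch of a credential fallback chain (optional).\n", "# Assumption: a browser is available for the interactive login.\n", "try:\n", "    fallback_credential = DefaultAzureCredential()\n", "    # probe the credential so that failures surface immediately\n", "    fallback_credential.get_token(\"https://management.azure.com/.default\")\n", "except Exception:\n", "    # fall back to an interactive browser prompt\n", "    fallback_credential = InteractiveBrowserCredential()" ] }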
, { "cell_type": "code", "execution_count": null, "metadata": { "jupyter": { "outputs_hidden": false, "source_hidden": false }, "nteract": { "transient": { "deleting": false } } }, "outputs": [], "source": [ "credential = DefaultAzureCredential()\n", "ml_client = None\n", "try:\n", "    # use the workspace details from a config.json file, if present\n", "    ml_client = MLClient.from_config(credential)\n", "except Exception as ex:\n", "    print(ex)\n", "    # Enter details of your AML workspace\n", "    subscription_id = \"<SUBSCRIPTION_ID>\"\n", "    resource_group = \"<RESOURCE_GROUP>\"\n", "    workspace = \"<AML_WORKSPACE_NAME>\"\n", "    ml_client = MLClient(credential, subscription_id, resource_group, workspace)" ] }, { "cell_type": "markdown", "metadata": { "nteract": { "transient": { "deleting": false } } }, "source": [ "## 1.3 Get handle to workspace" ] }, { "cell_type": "markdown", "metadata": { "nteract": { "transient": { "deleting": false } } }, "source": [ "#### Retrieving workspace details from the `ml_client`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ml_client.workspace_name" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "jupyter": { "outputs_hidden": false, "source_hidden": false }, "nteract": { "transient": { "deleting": false } } }, "outputs": [], "source": [ "# look up the subscription id and resource group from the workspace ARM id\n", "workspace = ml_client.workspace_name\n", "subscription_id = ml_client.workspaces.get(workspace).id.split(\"/\")[2]\n", "resource_group = ml_client.workspaces.get(workspace).resource_group" ] }, { "cell_type": "markdown", "metadata": { "nteract": { "transient": { "deleting": false } } }, "source": [ "# 2. Configure and Run Command\n", "\n", "In this section we will configure and run two standalone jobs:\n", "- a `command` for reading and writing data\n", "- a `command` for the distributed training job\n", "\n", "\n", "The `command` allows the user to configure the following key aspects.\n", "- `code` - This is the path where the code to run the command is located\n", "- `command` - This is the command that needs to be run\n", "- `inputs` - This is the dictionary of inputs using name-value pairs to the command. The key is a name for the input within the context of the job and the value is the input value. Inputs can be referenced in the `command` using the `${{inputs.<input_name>}}` expression. To use files or folders as inputs, we can use the `Input` class (see the illustrative constructions after this list). The `Input` class supports three parameters:\n", "  - `type` - The type of input. This can be a `uri_file` or `uri_folder`. The default is `uri_folder`.\n", "  - `path` - The path to the file or folder. These can be local or remote files or folders. For remote files, http/https and wasb paths are supported.\n", "    - Azure ML `data`/`dataset` and `datastore` assets are of type `uri_folder`. To use `data`/`dataset` as input, you can reference a registered dataset in the workspace using the format `<data_name>:<version>`, e.g. `Input(type='uri_folder', path='my_dataset:1')`\n", "  - `mode` - Mode of how the data should be delivered to the compute target. Allowed values are `ro_mount`, `rw_mount` and `download`. Default is `ro_mount`\n", "- `environment` - This is the environment needed for the command to run. Curated or custom environments from the workspace can be used, or a custom environment can be created and used as well. Check out the [environment](../../../../assets/environment/environment.ipynb) notebook for more examples.\n", "- `compute` - The compute on which the command will run. In this example we are using [serverless compute (preview)](https://learn.microsoft.com/azure/machine-learning/how-to-use-serverless-compute?view=azureml-api-2&tabs=python), so there is no need to specify any compute. You can also replace serverless with any other compute in the workspace, or run the command on your local machine by using `local` for the compute; all the run details and outputs of the job will still be uploaded to the Azure ML workspace.\n", "- `distribution` - Distribution configuration for distributed training scenarios. Azure Machine Learning supports PyTorch-, TensorFlow-, and MPI-based distributed training.\n" ] }
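, { "cell_type": "markdown", "metadata": { "nteract": { "transient": { "deleting": false } } }, "source": [ "Before configuring the two jobs, the cell below sketches a few illustrative `Input` constructions covering the `type`, `path` and `mode` parameters described above. The paths are placeholders, not real assets in your workspace." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative Input constructions (the paths below are placeholders)\n", "from azure.ai.ml import Input\n", "from azure.ai.ml.constants import AssetTypes\n", "\n", "# a single file fetched from a public https URL\n", "file_input = Input(type=AssetTypes.URI_FILE, path=\"https://example.com/data.csv\")\n", "\n", "# a registered data asset in the workspace, addressed as <name>:<version>\n", "asset_input = Input(type=AssetTypes.URI_FOLDER, path=\"my_dataset:1\")\n", "\n", "# a datastore folder, downloaded to the compute instead of mounted\n", "folder_input = Input(\n", "    type=AssetTypes.URI_FOLDER,\n", "    path=\"azureml://datastores/workspaceblobstore/paths/some/folder/\",\n", "    mode=\"download\",\n", ")" ] }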
, { "cell_type": "markdown", "metadata": { "nteract": { "transient": { "deleting": false } } }, "source": [ "### 2.1 Configure Command for reading and writing data\n", "The CIFAR-10 dataset is downloaded as a compressed file from a public URL. The `read_write_data.py` script in the `src` folder extracts the files using the `tarfile` library." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "jupyter": { "outputs_hidden": false, "source_hidden": false }, "name": "inputs", "nteract": { "transient": { "deleting": false } } }, "outputs": [], "source": [ "from azure.ai.ml import command\n", "from azure.ai.ml.entities import Data\n", "from azure.ai.ml import Input\n", "from azure.ai.ml import Output\n", "from azure.ai.ml.constants import AssetTypes\n", "\n", "\n", "inputs = {\n", "    \"cifar_zip\": Input(\n", "        type=AssetTypes.URI_FILE,\n", "        path=\"https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz\",\n", "    ),\n", "}\n", "\n", "outputs = {\n", "    \"cifar\": Output(\n", "        type=AssetTypes.URI_FOLDER,\n", "        path=f\"azureml://subscriptions/{subscription_id}/resourcegroups/{resource_group}/workspaces/{workspace}/datastores/workspaceblobstore/paths/CIFAR-10\",\n", "    )\n", "}\n", "\n", "job = command(\n", "    code=\"./src\",  # local path where the code is stored\n", "    command=\"python read_write_data.py --input_data ${{inputs.cifar_zip}} --output_folder ${{outputs.cifar}}\",\n", "    inputs=inputs,\n", "    outputs=outputs,\n", "    environment=\"azureml://registries/azureml/environments/sklearn-1.5/labels/latest\",\n", ")\n", "\n", "# submit the command\n", "returned_job = ml_client.jobs.create_or_update(job)\n", "# get a URL for the status of the job\n", "returned_job.studio_url" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# stream the job logs until the job completes\n", "ml_client.jobs.stream(returned_job.name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "jupyter": { "outputs_hidden": false, "source_hidden": false }, "nteract": { "transient": { "deleting": false } } }, "outputs": [], "source": [ "# inspect the completed job and the path of its output\n", "print(returned_job.name)\n", "print(returned_job.experiment_name)\n", "print(returned_job.outputs.cifar)\n", "print(returned_job.outputs.cifar.path)" ] }, { "cell_type": "markdown", "metadata": { "nteract": { "transient": { "deleting": false } } }, "source": [ "### 2.2 Configure Command for distributed training using PyTorch" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "jupyter": { "outputs_hidden": false, "source_hidden": false }, "name": "job", "nteract": { "transient": { "deleting": false } } }, "outputs": [], "source": [ "from azure.ai.ml import command\n", "from azure.ai.ml.entities import Data\n", "from azure.ai.ml import Input\n", "from azure.ai.ml import Output\n", "from azure.ai.ml.constants import AssetTypes\n", "from azure.ai.ml.entities import ResourceConfiguration\n", "\n", "# === Note on path ===\n", "# The path can be a local path or a cloud path. AzureML supports `https://`,\n", "# `abfss://`, `wasbs://` and `azureml://` URIs.\n", "# Local paths are automatically uploaded to the default datastore in the cloud.\n", "# More details on supported paths: https://docs.microsoft.com/azure/machine-learning/how-to-read-write-data-v2#supported-paths\n", "\n", "inputs = {\n", "    \"cifar\": Input(\n", "        type=AssetTypes.URI_FOLDER, path=returned_job.outputs.cifar.path\n", "    ),  # path=\"azureml:azureml_stoic_cartoon_wgb3lgvgky_output_data_cifar:1\"), #path=\"azureml://datastores/workspaceblobstore/paths/azureml/stoic_cartoon_wgb3lgvgky/cifar/\"),\n", "    \"epoch\": 10,\n", "    \"batchsize\": 64,\n", "    \"workers\": 2,\n", "    \"lr\": 0.01,\n", "    \"momen\": 0.9,\n", "    \"prtfreq\": 200,\n", "    \"output\": \"./outputs\",\n", "}\n", "\n", "job = command(\n", "    code=\"./src\",  # local path where the code is stored\n", "    command=\"python train.py --data-dir ${{inputs.cifar}} --epochs ${{inputs.epoch}} --batch-size ${{inputs.batchsize}} --workers ${{inputs.workers}} --learning-rate ${{inputs.lr}} --momentum ${{inputs.momen}} --print-freq ${{inputs.prtfreq}} --model-dir ${{inputs.output}}\",\n", "    inputs=inputs,\n", "    environment=\"azureml:AzureML-acpt-pytorch-2.2-cuda12.1@latest\",\n", "    instance_count=2,  # number of nodes in the distributed job\n", "    distribution={\n", "        \"type\": \"PyTorch\",\n", "        # set process count to the number of GPUs per node;\n", "        # Standard_NC6s_v3 has only 1 GPU\n", "        \"process_count_per_instance\": 1,\n", "    },\n", ")\n", "# serverless compute resources for the job\n", "job.resources = ResourceConfiguration(\n", "    instance_type=\"Standard_NC6s_v3\", instance_count=2\n", ")" ] }
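, { "cell_type": "markdown", "metadata": { "nteract": { "transient": { "deleting": false } } }, "source": [ "For context: with `\"type\": \"PyTorch\"`, Azure ML launches `process_count_per_instance` processes on each of the `instance_count` nodes and sets the `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, `RANK` and `LOCAL_RANK` environment variables for each process, so the training script can initialize `torch.distributed` directly from the environment. The cell below is an illustrative skeleton of that per-process setup, not the actual `train.py` in `src`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative skeleton of per-process setup under the PyTorch distribution type.\n", "# This is NOT the actual src/train.py; it only shows the wiring enabled by the\n", "# environment variables that Azure ML sets for each launched process.\n", "import os\n", "\n", "import torch\n", "import torch.distributed as dist\n", "from torch.nn.parallel import DistributedDataParallel as DDP\n", "\n", "\n", "def setup_and_wrap(model: torch.nn.Module) -> DDP:\n", "    # the default env:// init method reads MASTER_ADDR, MASTER_PORT,\n", "    # WORLD_SIZE and RANK from the environment\n", "    dist.init_process_group(backend=\"nccl\")\n", "    local_rank = int(os.environ[\"LOCAL_RANK\"])\n", "    torch.cuda.set_device(local_rank)\n", "    model = model.cuda(local_rank)\n", "    # gradients are all-reduced across 2 nodes x 1 GPU = 2 processes\n", "    return DDP(model, device_ids=[local_rank])" ] }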
, { "cell_type": "markdown", "metadata": { "nteract": { "transient": { "deleting": false } } }, "source": [ "# 3. Submit the job" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "jupyter": { "outputs_hidden": false, "source_hidden": false }, "nteract": { "transient": { "deleting": false } } }, "outputs": [], "source": [ "# submit the distributed training job\n", "ml_client.jobs.create_or_update(job)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernel_info": { "name": "python310-sdkv2" }, "kernelspec": { "display_name": "Python 3.10 - SDK V2", "language": "python", "name": "python310-sdkv2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.4" }, "nteract": { "version": "nteract-front-end@1.0.0" } }, "nbformat": 4, "nbformat_minor": 1 }