
{ "cells": [ { "cell_type": "markdown", "id": "b1aadfd2", "metadata": {}, "source": [ "# Distributed PyTorch Image Classification\n", "\n", "**Learning Objectives** - By the end of this tutorial you should be able to use Azure Machine Learning (AzureML) to:\n", "- quickly implement basic commands for data preparation\n", "- test and run a multi-node multi-gpu pytorch job\n", "- use mlflow to analyze your metrics\n", "\n", "**Requirements** - In order to benefit from this tutorial, you need:\n", "- to have provisioned an AzureML workspace\n", "- to have permissions to provision a minimal cpu and gpu cluster or simply use [serverless compute (preview)](https://learn.microsoft.com/azure/machine-learning/how-to-use-serverless-compute?view=azureml-api-2&tabs=python)\n", "- to have [installed Azure Machine Learning Python SDK v2](https://github.com/Azure/azureml-examples/blob/sdk-preview/sdk/README.md)\n", "\n", "**Motivations** - Let's consider the following scenario: we want to explore training different image classifiers on distinct kinds of problems, based on a large public dataset that is available at a given url. This ML pipeline will be future-looking, in particular we want:\n", "- **genericity**: to be fairly independent from the data we're ingesting (so that we could switch to internal proprietary data in the future),\n", "- **configurability**: to run different versions of that training with simple configuration changes,\n", "- **scalability**: to iterate on the pipeline on small sample, then smoothly transition to running at scale." ] }, { "cell_type": "markdown", "id": "6641f516", "metadata": {}, "source": [ "### Connect to AzureML\n", "\n", "Before we dive in the code, we'll need to create an instance of MLClient to connect to Azure ML.\n", "\n", "We are using `DefaultAzureCredential` to get access to workspace. `DefaultAzureCredential` should be capable of handling most Azure SDK authentication scenarios.\n", "\n", "Reference for more available credentials if it does not work for you: [configure credential example](https://github.com/Azure/azureml-examples/blob/sdk-preview/sdk/jobs/configuration.ipynb), [azure-identity reference doc](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity?view=azure-python)." 
] }, { "cell_type": "code", "execution_count": null, "id": "c92cd1bc", "metadata": {}, "outputs": [], "source": [ "# authentication package\n", "from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential\n", "\n", "try:\n", " credential = DefaultAzureCredential()\n", " # Check if given credential can get token successfully.\n", " credential.get_token(\"https://management.azure.com/.default\")\n", "except Exception as ex:\n", " # Fall back to InteractiveBrowserCredential in case DefaultAzureCredential not work\n", " credential = InteractiveBrowserCredential()" ] }, { "cell_type": "code", "execution_count": null, "id": "c8791906", "metadata": {}, "outputs": [], "source": [ "# handle to the workspace\n", "from azure.ai.ml import MLClient\n", "\n", "# get a handle to the workspace\n", "ml_client = MLClient(\n", " subscription_id=\"<SUBSCRIPTION_ID>\",\n", " resource_group_name=\"<RESOURCE_GROUP>\",\n", " workspace_name=\"<AML_WORKSPACE_NAME>\",\n", " credential=credential,\n", ")\n", "cpu_cluster = None\n", "gpu_cluster = None" ] }, { "cell_type": "markdown", "id": "41d3ce5e", "metadata": {}, "source": [ "### Provision the required resources for this notebook (Optional)\n", "\n", "We'll need 2 clusters for this notebook, a CPU cluster and a GPU cluster. First, let's create a minimal cpu cluster." ] }, { "cell_type": "code", "execution_count": null, "id": "a708daab", "metadata": {}, "outputs": [], "source": [ "from azure.ai.ml.entities import AmlCompute\n", "\n", "cpu_compute_target = \"cpu-cluster\"\n", "\n", "try:\n", " # let's see if the compute target already exists\n", " cpu_cluster = ml_client.compute.get(cpu_compute_target)\n", " print(\n", " f\"You already have a cluster named {cpu_compute_target}, we'll reuse it as is.\"\n", " )\n", "\n", "except Exception:\n", " print(\"Creating a new cpu compute target...\")\n", "\n", " # Let's create the Azure ML compute object with the intended parameters\n", " cpu_cluster = AmlCompute(\n", " # Name assigned to the compute cluster\n", " name=\"cpu-cluster\",\n", " # Azure ML Compute is the on-demand VM service\n", " type=\"amlcompute\",\n", " # VM Family\n", " size=\"STANDARD_DS3_V2\",\n", " # Minimum running nodes when there is no job running\n", " min_instances=0,\n", " # Nodes in cluster\n", " max_instances=4,\n", " # How many seconds will the node running after the job termination\n", " idle_time_before_scale_down=180,\n", " # Dedicated or LowPriority. The latter is cheaper but there is a chance of job termination\n", " tier=\"Dedicated\",\n", " )\n", "\n", " # Now, we pass the object to MLClient's create_or_update method\n", " cpu_cluster = ml_client.begin_create_or_update(cpu_cluster)\n", "\n", "print(\n", " f\"AMLCompute with name {cpu_cluster.name} is created, the compute size is {cpu_cluster.size}\"\n", ")" ] }, { "cell_type": "markdown", "id": "5481d489", "metadata": {}, "source": [ "For GPUs, we're creating the cluster below with the smallest VM family." 
] }, { "cell_type": "code", "execution_count": null, "id": "027a2b2e", "metadata": {}, "outputs": [], "source": [ "from azure.ai.ml.entities import AmlCompute\n", "\n", "gpu_compute_target = \"gpu-cluster\"\n", "\n", "try:\n", " # let's see if the compute target already exists\n", " gpu_cluster = ml_client.compute.get(gpu_compute_target)\n", " print(\n", " f\"You already have a cluster named {gpu_compute_target}, we'll reuse it as is.\"\n", " )\n", "\n", "except Exception:\n", " print(\"Creating a new gpu compute target...\")\n", "\n", " gpu_cluster = AmlCompute(\n", " name=\"gpu-cluster\",\n", " type=\"amlcompute\",\n", " size=\"STANDARD_NC6s_v3\", # 1 x NVIDIA Tesla V100\n", " min_instances=0,\n", " max_instances=4,\n", " idle_time_before_scale_down=180,\n", " tier=\"Dedicated\",\n", " )\n", "\n", " gpu_cluster = ml_client.begin_create_or_update(gpu_cluster)\n", "\n", "print(\n", " f\"AMLCompute with name {gpu_cluster.name} is created, the compute size is {gpu_cluster.size}\"\n", ")" ] }, { "cell_type": "markdown", "id": "202ab207", "metadata": {}, "source": [ "# 1. Unpack a public image archives with a simple command (no code)\n", "\n", "To train our classifier, we'll consume the [Stanford Dogs Dataset](http://vision.stanford.edu/aditya86/ImageNetDogs/) or the [Places2 dataset](http://places2.csail.mit.edu/download.html). If we were to use this locally, the sequence would be very basic: download a large tar archive, untar and put in different train/validation folders, upload to the cloud for consumption by the training script.\n", "\n", "We'll do just that, but in the cloud, without too much pain." ] }, { "cell_type": "markdown", "id": "08cca166", "metadata": {}, "source": [ "## 1.1. Unpack a first small dataset for testing\n", "\n", "The Azure ML SDK provides `entities` to implement any step of a workflow. In the example below, we create a `CommandJob` with just a shell command. We parameterize this command by using a string template syntax provided by the SDK:\n", "\n", "> ```\n", "> tar xvfm ${{inputs.archive}} --no-same-owner -C ${{outputs.images}}\n", "> ```\n", "\n", "Creating the component just consists in declaring the names of the inputs, outputs, and specifying an environment. For this simple job we'll use a curated environment from AzureML. After that, we'll be able to reuse that component multiple times in our pipeline design.\n", "\n", "Note: in this job, we're using an input type `uri_file` with a direct url. In this case, Azure ML will download the file from the url and provide it for the job to execute." 
] }, { "cell_type": "code", "execution_count": null, "id": "e70c8d83", "metadata": {}, "outputs": [], "source": [ "from azure.ai.ml import command\n", "from azure.ai.ml import Input, Output\n", "from azure.ai.ml.constants import AssetTypes\n", "\n", "dogs_dataset_command_job = command(\n", " display_name=\"untar_dogs\", # optional: this will show in the UI\n", " # this component has no code, just a simple unzip command\n", " command=\"tar xvfm ${{inputs.archive}} --no-same-owner -C ${{outputs.images}}\",\n", " # I/O specifications, each using a specific key and type\n", " inputs={\n", " \"archive\": Input(\n", " type=AssetTypes.URI_FILE,\n", " path=\"http://vision.stanford.edu/aditya86/ImageNetDogs/images.tar\",\n", " )\n", " },\n", " outputs={\n", " # two outputs, used in command as outputs.*\n", " \"images\": Output(\n", " type=AssetTypes.URI_FOLDER,\n", " mode=\"upload\",\n", " path=\"azureml://datastores/workspaceblobstore/paths/tutorial-datasets/dogs/\",\n", " ),\n", " },\n", " # we're using a curated environment\n", " environment=\"azureml://registries/azureml/environments/sklearn-1.5/labels/latest\",\n", " compute=\"cpu-cluster\"\n", " if (cpu_cluster)\n", " else None, # No compute needs to be passed to use serverless\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "01415285", "metadata": {}, "outputs": [], "source": [ "import webbrowser\n", "\n", "# submit the command\n", "returned_job = ml_client.create_or_update(\n", " dogs_dataset_command_job,\n", ")\n", "\n", "# get a URL for the status of the job\n", "print(\"The url to see your live job running is returned by the sdk:\")\n", "print(returned_job.studio_url)\n", "# open the browser with this url\n", "webbrowser.open(returned_job.studio_url)\n", "\n", "# print the pipeline run id\n", "print(\n", " f\"The pipeline details can be access programmatically using identifier: {returned_job.name}\"\n", ")" ] }, { "cell_type": "markdown", "id": "d4f625a5", "metadata": {}, "source": [ "## 1.2. Unpack a second larger dataset for training [optional]\n", "\n", "If you'd like to test the distributed training job below with a more complex dataset, the code below will unpack the [Places2 dataset](http://places2.csail.mit.edu/download.html) dataset images, which has 1.8 million images in 365 categories. This will require a larger VM than the one you provisioned earlier. We recommend you provision a [STANDARD_DS12_V2](https://docs.microsoft.com/en-us/azure/virtual-machines/dv2-dsv2-series-memory). The code below will use compute cluster name `cpu-cluster-lg`." 
] }, { "cell_type": "markdown", "id": "88190d2b", "metadata": {}, "source": [ "```python\n", "from azure.ai.ml import command\n", "from azure.ai.ml import Input, Output\n", "from azure.ai.ml.constants import AssetTypes\n", "\n", "places2_command_job = command(\n", " display_name=\"untar_places2\", # optional: this will show in the UI\n", " # this component has no code, just a simple unzip command\n", " command=\"&&\\n\".join(\n", " [\n", " # two lines of commands, one for training, one for validation\n", " \"tar xvfm ${{inputs.archive}} --no-same-owner -C ${{outputs.valid_images}} places365_standard/val/\",\n", " \"tar xvfm ${{inputs.archive}} --no-same-owner -C ${{outputs.train_images}} places365_standard/train/\",\n", " ]\n", " ),\n", " # I/O specifications, each using a specific key and type\n", " inputs={\n", " \"archive\": Input(\n", " type=AssetTypes.URI_FILE,\n", " path=\"http://data.csail.mit.edu/places/places365/places365standard_easyformat.tar\",\n", " )\n", " },\n", " outputs={\n", " # two outputs, used in command as outputs.*\n", " \"train_images\": Output(\n", " type=AssetTypes.URI_FOLDER,\n", " mode=\"upload\",\n", " path=\"azureml://datastores/workspaceblobstore/paths/tutorial-datasets/places2/train/\"\n", " ),\n", " \"valid_images\": Output(\n", " type=AssetTypes.URI_FOLDER,\n", " mode=\"upload\",\n", " path=\"azureml://datastores/workspaceblobstore/paths/tutorial-datasets/places2/valid/\"\n", " ),\n", " },\n", " # we're using a curated environment\n", " environment=\"azureml://registries/azureml/environments/sklearn-1.5/labels/latest\",\n", " compute=\"cpu-cluster-lg\",\n", ")\n", "\n", "# submit the command\n", "returned_job = ml_client.create_or_update(places2_command_job)\n", "\n", "# get a URL for the status of the job\n", "print(\"The url to see your live job running is returned by the sdk:\")\n", "print(returned_job.studio_url)\n", "```" ] }, { "cell_type": "markdown", "id": "2e26664b", "metadata": {}, "source": [ "# 2. Training a distributed gpu job\n", "\n", "Implementing a distributed pytorch training is complex. Of course in this tutorial we've written one for you, but the point is: it takes time, it takes several iterations, each requiring you to try your code locally, then in the cloud, then try it at scale, until satisfied and then run a full blown production model training. This trial/error process can be made easier if we can create reusable code we can iterate on quickly, and that can be configured to run from small to large scale.\n", "\n", "So, to develop our training pipeline, we set a couple constraints for ourselves:\n", "- we want to minimize the effort to iterate on the pipeline code when porting it in the cloud,\n", "- we want to use the same code for small scale and large scale testing\n", "- we do not want to manipulate large data locally (ex: download/upload that data could take multiple hours),\n", "\n", "We've implemented a distributed pytorch training script that we can load as a command job. For this, we've decided to parameterize this job with relevant training arguments (see below).\n", "\n", "We can now test this code by running it on a smaller dataset in Azure ML. Here, we will use the dogs dataset both for training and validation. Of course, the model will not be valid. But training will be short (8 mins on 2 x STANDARD_NC6 for 1 epoch) to allow us to iterate if needed." 
] }, { "cell_type": "code", "execution_count": null, "id": "d6681745", "metadata": {}, "outputs": [], "source": [ "from azure.ai.ml import command\n", "from azure.ai.ml import Input\n", "from azure.ai.ml.entities import ResourceConfiguration\n", "\n", "training_job = command(\n", " # local path where the code is stored\n", " code=\"./src/pytorch_dl_train/\",\n", " # describe the command to run the python script, with all its parameters\n", " # use the syntax below to inject parameter values from code\n", " command=\"\"\"python train.py \\\n", " --train_images ${{inputs.train_images}} \\\n", " --valid_images ${{inputs.valid_images}} \\\n", " --batch_size ${{inputs.batch_size}} \\\n", " --num_workers ${{inputs.num_workers}} \\\n", " --prefetch_factor ${{inputs.prefetch_factor}} \\\n", " --model_arch ${{inputs.model_arch}} \\\n", " --model_arch_pretrained ${{inputs.model_arch_pretrained}} \\\n", " --num_epochs ${{inputs.num_epochs}} \\\n", " --learning_rate ${{inputs.learning_rate}} \\\n", " --momentum ${{inputs.momentum}} \\\n", " --register_model_as ${{inputs.register_model_as}} \\\n", " --enable_profiling ${{inputs.enable_profiling}}\n", " \"\"\",\n", " inputs={\n", " \"train_images\": Input(\n", " type=\"uri_folder\",\n", " path=\"azureml://datastores/workspaceblobstore/paths/tutorial-datasets/dogs/\",\n", " # path=\"azureml://datastores/workspaceblobstore/paths/tutorial-datasets/places2/train/\",\n", " mode=\"download\", # use download to make access faster, mount if dataset is larger than VM\n", " ),\n", " \"valid_images\": Input(\n", " type=\"uri_folder\",\n", " path=\"azureml://datastores/workspaceblobstore/paths/tutorial-datasets/dogs/\",\n", " # path=\"azureml://datastores/workspaceblobstore/paths/tutorial-datasets/places2/valid/\",\n", " mode=\"download\", # use download to make access faster, mount if dataset is larger than VM\n", " ),\n", " \"batch_size\": 64,\n", " \"num_workers\": 5, # number of cpus for pre-fetching\n", " \"prefetch_factor\": 2, # number of batches fetched in advance\n", " \"model_arch\": \"resnet18\",\n", " \"model_arch_pretrained\": True,\n", " \"num_epochs\": 7,\n", " \"learning_rate\": 0.01,\n", " \"momentum\": 0.01,\n", " \"register_model_as\": \"dogs_dev\",\n", " # \"register_model_as\": \"places_dev\",\n", " \"enable_profiling\": False,\n", " },\n", " environment=\"AzureML-acpt-pytorch-2.2-cuda12.1@latest\",\n", " compute=\"gpu-cluster\"\n", " if (gpu_cluster)\n", " else None, # No compute needs to be passed to use serverless\n", " distribution={\n", " \"type\": \"PyTorch\",\n", " # set process count to the number of gpus on the node\n", " # NC6 has only 1\n", " \"process_count_per_instance\": 1,\n", " },\n", " # set instance count to the number of nodes you want to use\n", " instance_count=2,\n", " display_name=\"pytorch_training_sample\",\n", " description=\"training a torchvision model\",\n", ")\n", "if gpu_cluster == None:\n", " training_job.resources = ResourceConfiguration(\n", " instance_type=\"Standard_NC6s_v3\", instance_count=2\n", " ) # resources for serverless job" ] }, { "cell_type": "markdown", "id": "dc7b9c5f", "metadata": {}, "source": [ "Once we create that job, we submit it through `MLClient`." 
] }, { "cell_type": "code", "execution_count": null, "id": "8252fbc8", "metadata": {}, "outputs": [], "source": [ "import webbrowser\n", "\n", "# submit the job\n", "returned_job = ml_client.jobs.create_or_update(\n", " training_job,\n", " # Project's name\n", " experiment_name=\"e2e_image_sample\",\n", ")\n", "\n", "# get a URL for the status of the job\n", "print(\"The url to see your live job running is returned by the sdk:\")\n", "print(returned_job.studio_url)\n", "# open the browser with this url\n", "webbrowser.open(returned_job.studio_url)\n", "\n", "# print the pipeline run id\n", "print(\n", " f\"The pipeline details can be access programmatically using identifier: {returned_job.name}\"\n", ")\n", "# saving it for later in this notebook\n", "small_scale_run_id = returned_job.name" ] }, { "cell_type": "markdown", "id": "2cf19ded", "metadata": {}, "source": [ "You can iterate on this design as much as you'd like, updating the local code of the job and re-submit the pipeline.\n", "\n", "Note: in the code above, we have commented out the lines you'd need to test this training job on the Places 2 dataset (1.8m images)." ] }, { "cell_type": "markdown", "id": "c782c68e", "metadata": {}, "source": [ "# 3. Analyze experiments using MLFlow\n", "\n", "Azure ML natively integrates with MLFlow so that if your code already supports MLFlow logging, you will not have to modify it to report your metrics within Azure ML. The component above is using MLFlow internally to report relevant metrics, logs and artifacts. Look for `mlflow` calls within the script `train.py`.\n", "\n", "To access this data in the Azure ML Studio, click on the component in the pipeline to open the Details panel, then choose the **Metrics** panel.\n", "\n", "You can also access those metrics programmatically using mlflow. We'll demo a couple examples below.\n", "\n", "## 3.1. Connect to Azure ML using MLFlow client\n", "\n", "Connecting to Azure ML using MLFlow required to `pip install azureml-mlflow mlflow` (both). You can use the `MLClient` to obtain a tracking uri to connect with the mlflow client. In the example below, we'll get all the runs related to the training experiment:" ] }, { "cell_type": "code", "execution_count": null, "id": "6af554a2", "metadata": {}, "outputs": [], "source": [ "import mlflow\n", "from mlflow.tracking import MlflowClient\n", "import matplotlib.pyplot as plt\n", "\n", "mlflow.set_tracking_uri(ml_client.workspaces.get().mlflow_tracking_uri)\n", "\n", "# search for the training step within the pipeline\n", "mlflow.set_experiment(\"e2e_image_sample\")\n", "\n", "# search for all runs and return as a pandas dataframe\n", "mlflow_runs = mlflow.search_runs()\n", "\n", "# display all runs as a dataframe in the notebook\n", "mlflow_runs" ] }, { "cell_type": "markdown", "id": "af9bea13", "metadata": {}, "source": [ "## 3.2. Analyze metrics accross multiple jobs\n", "\n", "You can also use mlflow to search all your runs, filter by some specific properties and get the results as a pandas dataframe. 
Once you get that dataframe, you can implement any analysis on top of it.\n", "\n", "Below, we're extracting all runs and show the effect of profiling on the epoch training time.\n", "\n", "![mlflow runs in a pandas dataframe](./media/pytorch_train_mlflow_runs.png)" ] }, { "cell_type": "code", "execution_count": null, "id": "477c5a0f", "metadata": {}, "outputs": [], "source": [ "runs = mlflow.search_runs(\n", " # we're using mlflow syntax to restrict to a specific parameter\n", " filter_string=f\"params.model_arch = 'resnet18'\"\n", ")\n", "\n", "# we're keeping only some relevant columns\n", "columns = [\n", " \"run_id\",\n", " \"status\",\n", " \"end_time\",\n", " \"metrics.epoch_train_time\",\n", " \"metrics.epoch_train_acc\",\n", " \"metrics.epoch_valid_acc\",\n", " \"params.enable_profiling\",\n", "]\n", "\n", "# showing the raw results in notebook\n", "runs[columns].dropna()" ] }, { "cell_type": "markdown", "id": "c81fc568", "metadata": {}, "source": [ "## 3.3. Analyze the metrics of a specific job\n", "\n", "Using MLFlow, you can retrieve all the metrics produces by a given run. You can then leverage any usual tool to draw the analysis that is relevant for you. In the example below, we're plotting accuracy per epoch.\n", "\n", "![plot training and validation accuracy over epochs](./media/pytorch_train_mlflow_plot.png)" ] }, { "cell_type": "code", "execution_count": null, "id": "7969e008", "metadata": {}, "outputs": [], "source": [ "# here we're using the small scale training on validation data\n", "training_run_id = small_scale_run_id\n", "\n", "# alternatively, you can directly use a known training step id\n", "# training_run_id = \"...\"\n", "\n", "# open a client to get metric history\n", "client = MlflowClient()\n", "\n", "print(f\"Obtaining results for run id {training_run_id}\")\n", "\n", "# create a plot\n", "plt.rcdefaults()\n", "fig, ax = plt.subplots()\n", "ax.set_xlabel(\"epoch\")\n", "\n", "for metric in [\"epoch_train_acc\", \"epoch_valid_acc\"]:\n", " # get all values taken by the metric\n", " try:\n", " metric_history = client.get_metric_history(training_run_id, metric)\n", " except:\n", " print(f\"Metric {metric} could not be found in history\")\n", " continue\n", "\n", " epochs = [metric_entry.step for metric_entry in metric_history]\n", " metric_array = [metric_entry.value for metric_entry in metric_history]\n", " ax.plot(epochs, metric_array, label=metric)\n", "\n", "plt.legend()" ] }, { "cell_type": "markdown", "id": "fcb98d44", "metadata": {}, "source": [ "## 3.4. Retrieve artifacts for local analysis (ex: tensorboard)\n", "\n", "MLFlow also allows you to record artifacts during training. The script `train.py` leverages the [PyTorch profiler](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html) to produce logs for analyzing GPU performance. It uses mlflow to record those logs as artifacts.\n", "\n", "To benefit from that, use the option `enable_profiling=True` in the submission code of section 2.\n", "\n", "In the following, we'll download those locally to inspect with other tools such as tensorboard." 
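, "\n", "\n", "For reference, attaching such profiler output to a run from within the training script usually boils down to a single `mlflow` call; the local folder name below is just an illustration and not necessarily what `train.py` uses:\n", "\n", "```python\n", "# illustrative: attach a local folder of profiler traces to the active mlflow run\n", "import mlflow\n", "\n", "mlflow.log_artifacts(\"./outputs/profiler/tensorboard_logs\", artifact_path=\"profiler/tensorboard_logs\")\n", "```"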
] }, { "cell_type": "code", "execution_count": null, "id": "f2796344", "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "# here we're using the small scale training on validation data\n", "training_run_id = small_scale_run_id\n", "\n", "# alternatively, you can directly use a known training step id\n", "# training_run_id = \"...\"\n", "\n", "# open a client to get metric history\n", "client = MlflowClient()\n", "\n", "# create local directory to store artefacts\n", "os.makedirs(\"./logs/\", exist_ok=True)\n", "\n", "for artifact in client.list_artifacts(training_run_id, path=\"profiler/markdown/\"):\n", " print(f\"Downloading artifact {artifact.path}\")\n", " client.download_artifacts(training_run_id, path=artifact.path, dst_path=\"./logs\")\n", "else:\n", " print(f\"No artefacts were found for profiler/markdown/ in run id {training_run_id}\")\n", "\n", "for artifact in client.list_artifacts(\n", " training_run_id, path=\"profiler/tensorboard_logs/\"\n", "):\n", " print(f\"Downloading artifact {artifact.path}\")\n", " client.download_artifacts(training_run_id, path=artifact.path, dst_path=\"./logs\")\n", "else:\n", " print(f\"No artefacts were found for profiler/markdown/ in run id {training_run_id}\")" ] }, { "cell_type": "markdown", "id": "631a3912", "metadata": {}, "source": [ "We can now run tensorboard locally with the downloaded artifacts to run some analysis of GPU performance (see example snapshot below).\n", "\n", "```\n", "tensorboard --logdir=\"./logs/profiler/tensorboard_logs/\"\n", "```\n", "\n", "![tensorboard logs generated by pytorch profiler](./media/pytorch_train_tensorboard_logs.png)" ] }, { "cell_type": "markdown", "id": "9f658d73", "metadata": {}, "source": [ "![](media/mlflow_plot.png)" ] } ], "metadata": { "description": { "description": "Prepare data, test and run a multi-node multi-gpu pytorch job. Use mlflow to analyze your metrics" }, "kernelspec": { "display_name": "Python 3.10 - SDK v2", "language": "python", "name": "python310-sdkv2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" } }, "nbformat": 4, "nbformat_minor": 5 }