{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Profiling TensorFlow Multi GPU Multi Node Training Job with Amazon SageMaker Debugger (SageMaker API)\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n",
"\n",
"\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"This notebook walks you through creating a TensorFlow training job with the SageMaker Debugger profiling feature enabled. It will create a multi GPU multi node training using Horovod. \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### (Optional) Install SageMaker and SMDebug\n",
"To use the new Debugger profiling features released in December 2020, ensure that you have the latest versions of Boto3, SageMaker, and SMDebug libraries installed. Use the following cell (switch `install_needed` to `True`) to update the libraries and restarts the Jupyter kernel to apply the updates."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"import IPython\n",
"install_needed = False # should only be True once\n",
"if install_needed:\n",
" print(\"installing deps and restarting kernel\")\n",
" !{sys.executable} -m pip install -U boto3 sagemaker smdebug\n",
" IPython.Application.instance().kernel.do_shutdown(True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Create a Training Job with Debugger Enabled<a class=\"anchor\" id=\"option-1\"></a>\n",
"\n",
"You will learn how to use the Boto3 SageMaker client's `create_training_job()` function to start a training job."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Start a SageMaker session and retrieve the current region and the default Amazon S3 bucket URI"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sagemaker\n",
"\n",
"session = sagemaker.Session()\n",
"region = session.boto_region_name\n",
"bucket = session.default_bucket()\n",
"print(region, bucket)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Upload a training script to the S3 bucket"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import boto3, tarfile\n",
"\n",
"source = \"source.tar.gz\"\n",
"project = \"debugger-boto3-profiling-test\"\n",
"\n",
"tar = tarfile.open(source, \"w:gz\")\n",
"tar.add(\"entry_point/tf-hvd-train.py\")\n",
"tar.close()\n",
"\n",
"s3 = boto3.client(\"s3\")\n",
"s3.upload_file(source, bucket, project + \"/\" + source)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"upload_file_path = f\"s3://{bucket}/{project}/{source}\"\n",
"print(upload_file_path)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create a Boto3 SageMaker client object"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sm = boto3.Session(region_name=region).client(\"sagemaker\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configure the request body of the `create_training_job()` function\n",
"\n",
"The following parameters are required to include to the request body for `create_training_job()` function.\n",
"\n",
"- `TrainingJobName` - Specify a prefix or a full name if you want to modify\n",
"- `HyperParameters` - Set up the following items:\n",
" - `sagemaker_program` and `sagemaker_submit_directory` - The S3 bucket URI of the training script. This enables SageMaker to read the training script from the URI and start a training job.\n",
" - `sagemaker_mpi` options - Configure these key-value pairs to set up distributed training.\n",
" - You can also add other hyperparameters for your model.\n",
"- `AlgorithmSpecification` - Specify `TrainingImage`. In this example, an official TensorFlow DLC image is used. You can also use your own training container images here. \n",
"- `RoleArn` - **To run the following cell, you must specify the right SageMaker execution role ARN that you want to use for training**.\n",
"- The `DebugHookConfig` and `DebugRuleConfigurations` are preconfigured for watching loss values and a loss not decreasing issue.\n",
"- The `ProfilerConfig` and `ProfilerRuleConfigurations` are preconfigured to collect system and framework metrics, initiate all profiling rules, and create a Debugger profiling report.\n",
"\n",
"**Important**: For `DebugRuleConfigurations` and `ProfilerRuleConfigurations`, **to run the following cell, you must specify the right Debugger rule image URI from [Amazon SageMaker Debugger Registry URLs for Built-in Rule Evaluators](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-docker-images-rules.html)**. The `sagemaker.debugger.get_rule_container_image_uri(region)` function retrieves the Debugger rule image automatically. For example:\n",
"- If you are in `us-east-1`, the right image URI is **503895931360**.dkr.ecr.**us-east-1**.amazonaws.com/sagemaker-debugger-rules:latest.\n",
"- If you are in `us-west-2`, the right image URI is **895741380848**.dkr.ecr.**us-west-2**.amazonaws.com/sagemaker-debugger-rules:latest.\n"
]
},
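{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, the following cell prints the Debugger rule image URI that `sagemaker.debugger.get_rule_container_image_uri()` resolves for your current region. This is the same helper used in the `create_training_job()` request below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Print the Debugger rule evaluator image URI for the current region\n",
"print(sagemaker.debugger.get_rule_container_image_uri(region))"
]
},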
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import datetime\n",
"\n",
"training_job_name = \"profiler-boto3-\" + datetime.datetime.now().strftime(\"%Y-%m-%d-%H-%M-%S\")\n",
"\n",
"sm.create_training_job(\n",
" TrainingJobName=training_job_name,\n",
" HyperParameters={\n",
" \"sagemaker_program\": \"entry_point/tf-hvd-train.py\",\n",
" \"sagemaker_submit_directory\": \"s3://\" + bucket + \"/\" + project + \"/\" + source,\n",
" \"sagemaker_mpi_custom_mpi_options\": \"-verbose -x HOROVOD_TIMELINE=./hvd_timeline.json -x NCCL_DEBUG=INFO -x OMPI_MCA_btl_vader_single_copy_mechanism=none\",\n",
" \"sagemaker_mpi_enabled\": \"true\",\n",
" \"sagemaker_mpi_num_of_processes_per_host\": \"4\",\n",
" },\n",
" AlgorithmSpecification={\n",
" \"TrainingImage\": \"763104351884.dkr.ecr.\"\n",
" + region\n",
" + \".amazonaws.com/tensorflow-training:2.4.1-gpu-py37-cu110-ubuntu18.04\",\n",
" \"TrainingInputMode\": \"File\",\n",
" \"EnableSageMakerMetricsTimeSeries\": False,\n",
" },\n",
" # You must specify your SageMaker execution role ARN here\n",
" RoleArn=sagemaker.get_execution_role(),\n",
" OutputDataConfig={\"S3OutputPath\": \"s3://\" + bucket + \"/\" + project + \"/output\"},\n",
" ResourceConfig={\"InstanceType\": \"ml.p3.8xlarge\", \"InstanceCount\": 2, \"VolumeSizeInGB\": 30},\n",
" StoppingCondition={\"MaxRuntimeInSeconds\": 86400},\n",
" DebugHookConfig={\n",
" \"S3OutputPath\": \"s3://\" + bucket + \"/\" + project + \"/debug-output\",\n",
" \"CollectionConfigurations\": [\n",
" {\"CollectionName\": \"losses\", \"CollectionParameters\": {\"train.save_interval\": \"50\"}}\n",
" ],\n",
" },\n",
" DebugRuleConfigurations=[\n",
" {\n",
" \"RuleConfigurationName\": \"LossNotDecreasing\",\n",
" # You must specify the correct image URI from https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-docker-images-rules.html\n",
" \"RuleEvaluatorImage\": sagemaker.debugger.get_rule_container_image_uri(region),\n",
" \"RuleParameters\": {\"rule_to_invoke\": \"LossNotDecreasing\"},\n",
" }\n",
" ],\n",
" ProfilerConfig={\n",
" \"S3OutputPath\": \"s3://\" + bucket + \"/\" + project + \"/profiler-output\",\n",
" \"ProfilingIntervalInMilliseconds\": 500,\n",
" \"ProfilingParameters\": {\n",
" \"DataloaderProfilingConfig\": '{\"StartStep\": 5, \"NumSteps\": 3, \"MetricsRegex\": \".*\", }',\n",
" \"DetailedProfilingConfig\": '{\"StartStep\": 5, \"NumSteps\": 3, }',\n",
" \"PythonProfilingConfig\": '{\"StartStep\": 5, \"NumSteps\": 3, \"ProfilerName\": \"cprofile\", \"cProfileTimer\": \"total_time\"}',\n",
" \"LocalPath\": \"/opt/ml/output/profiler/\", # Optional. Local path for profiling outputs\n",
" },\n",
" },\n",
" ProfilerRuleConfigurations=[\n",
" {\n",
" \"RuleConfigurationName\": \"ProfilerReport\",\n",
" # You must specify the correct image URI from https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-docker-images-rules.html\n",
" \"RuleEvaluatorImage\": sagemaker.debugger.get_rule_container_image_uri(region),\n",
" \"RuleParameters\": {\"rule_to_invoke\": \"ProfilerReport\"},\n",
" }\n",
" ],\n",
")"
]
},
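{
"cell_type": "markdown",
"metadata": {},
"source": [
"`create_training_job()` returns as soon as the job request is accepted, so the job runs asynchronously. The following cell is a minimal status check using the Boto3 `describe_training_job()` function; rerun it to watch the job move through its phases."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Query the current status of the training job created above\n",
"job_description = sm.describe_training_job(TrainingJobName=training_job_name)\n",
"print(job_description[\"TrainingJobStatus\"], \"-\", job_description[\"SecondaryStatus\"])"
]
},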
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Analyze Profiling Data\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Install the SMDebug client library to use Debugger analysis tools"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pip\n",
"\n",
"\n",
"def import_or_install(package):\n",
" try:\n",
" __import__(package)\n",
" except ImportError:\n",
" pip.main([\"install\", package])\n",
"\n",
"\n",
"import_or_install(\"smdebug\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Use SMDebug to retrieve saved output data and use analysis tools\n",
"\n",
"While the training is still in progress you can visualize the performance data in SageMaker Studio or in the notebook.\n",
"Debugger provides utilities to plot system metrics in form of timeline charts or heatmaps. Checkout out the notebook \n",
"[interactive_analysis_profiling_data.ipynb](https://github.com/aws/amazon-sagemaker-examples/blob/master/sagemaker-debugger/debugger_interactive_analysis_profiling/interactive_analysis_profiling_data.ipynb) for more details. In the following code cell we plot the total CPU and GPU utilization as timeseries charts. To visualize other metrics such as I/O, memory, network you simply need to extend the list passed to `select_dimension` and `select_events`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob\n",
"\n",
"tj = TrainingJob(training_job_name, region)\n",
"tj.wait_for_sys_profiling_data_to_be_available()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from smdebug.profiler.analysis.notebook_utils.timeline_charts import TimelineCharts\n",
"\n",
"system_metrics_reader = tj.get_systems_metrics_reader()\n",
"system_metrics_reader.refresh_event_file_list()\n",
"\n",
"view_timeline_charts = TimelineCharts(\n",
" system_metrics_reader,\n",
" framework_metrics_reader=None,\n",
" select_dimensions=[\"CPU\", \"GPU\"],\n",
" select_events=[\"total\"],\n",
")"
]
},
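{
"cell_type": "markdown",
"metadata": {},
"source": [
"Debugger can also render the system metrics as a heatmap, which makes utilization gaps across workers easier to spot. The following cell is a minimal sketch that assumes the `Heatmap` utility in `smdebug.profiler.analysis.notebook_utils.heatmap` accepts the same system metrics reader and selection parameters used above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from smdebug.profiler.analysis.notebook_utils.heatmap import Heatmap\n",
"\n",
"# Plot per-worker CPU and GPU utilization as a heatmap\n",
"view_heatmap = Heatmap(\n",
" system_metrics_reader,\n",
" select_dimensions=[\"CPU\", \"GPU\"],\n",
" select_events=[\"total\"],\n",
")"
]
},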
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Download Debugger Profiling Report"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The profiling report rule will create an html report `profiler-report.html` with a summary of builtin rules and recommenades of next steps. You can find this report in your S3 bucket. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"rule_output_path = (\n",
" \"s3://\"\n",
" + bucket\n",
" + \"/\"\n",
" + project\n",
" + \"/output/\"\n",
" + training_job_name\n",
" + \"/rule-output/ProfilerReport/profiler-output/\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! aws s3 ls {rule_output_path} --recursive"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! aws s3 cp {rule_output_path} . --recursive"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from IPython.display import FileLink\n",
"\n",
"display(\"Click link below to view the profiler report\", FileLink(\"profiler-report.html\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note: If you are using JupyterLab, make sure you click `Trust HTML` at the top left corner after you open the report.**"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Notebook CI Test Results\n",
"\n",
"This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "conda_python3",
"language": "python",
"name": "conda_python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.13"
}
},
"nbformat": 4,
"nbformat_minor": 4
}