training/distributed-training/yolov5.ipynb

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Distributed data parallel YOLOv5 training with PyTorch and SageMaker distributed\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/training|distributed_training|pytorch|data_parallel|yolov5|yolov5.ipynb)\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "[Amazon SageMaker's distributed library](https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training.html) can be used to train deep learning models faster and cheaper. The [data parallel](https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel.html) feature in this library (`smdistributed.dataparallel`) is a distributed data parallel training framework for PyTorch, TensorFlow, and MXNet.\n", "\n", "This notebook demonstrates how to use `smdistributed.dataparallel` with PyTorch (version 1.10.2) on [Amazon SageMaker](https://aws.amazon.com/sagemaker/) to train a YOLOv5 model on the [BCCD dataset](https://public.roboflow.com/object-detection/bccd/4) using an [Amazon FSx for Lustre file system](https://aws.amazon.com/fsx/lustre/) as the data source.\n", "\n", "The outline of steps is as follows:\n", "\n", "1. Stage the BCCD dataset on [Amazon S3](https://aws.amazon.com/s3/)\n", "2. Create an Amazon FSx for Lustre file system and import data into the file system from S3\n", "3. Build the Docker training image and push it to [Amazon ECR](https://aws.amazon.com/ecr/)\n", "4. Configure data input channels for SageMaker\n", "5. Configure hyperparameters\n", "6. Define training metrics\n", "7. 
Define the training job and start training\n", "\n", "**NOTE:** With a large training dataset, we recommend using [Amazon FSx](https://aws.amazon.com/fsx/) as the input file system for the SageMaker training job. FSx file input significantly cuts down training startup time on SageMaker because it avoids downloading the training data each time you start the training job (as is done with S3 input) and provides good data read throughput." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Amazon SageMaker Initialization\n", "\n", "Initialize the notebook instance. Get the AWS Region and a SageMaker execution role.\n", "\n", "### SageMaker role\n", "\n", "The following code cell defines `role`, which is the IAM role ARN used to create and run SageMaker training and hosting jobs. This is the same IAM role used to create this SageMaker Notebook instance. \n", "\n", "`role` must have permission to create a SageMaker training job and host a model. For granular policies you can use to grant these permissions, see [Amazon SageMaker Roles](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).\n", "\n", "Because this notebook uses FSx, make sure to attach the `FSx Access` permission to this IAM role." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "! 
python3 -m pip install --upgrade sagemaker\n", "import sagemaker\n", "from sagemaker import get_execution_role\n", "from sagemaker.estimator import Estimator\n", "import boto3\n", "\n", "sagemaker_session = sagemaker.Session()\n", "bucket = sagemaker_session.default_bucket()\n", "\n", "role = (\n", " get_execution_role()\n", ") # provide a pre-existing role ARN as an alternative to creating a new role\n", "role_name = role.split(\"/\")[-1] # the role name is the last path segment of the ARN\n", "print(f\"SageMaker Execution Role: {role}\")\n", "print(f\"The name of the Execution role: {role_name}\")\n", "\n", "client = boto3.client(\"sts\")\n", "account = client.get_caller_identity()[\"Account\"]\n", "print(f\"AWS account: {account}\")\n", "\n", "session = boto3.session.Session()\n", "region = session.region_name\n", "print(f\"AWS region: {region}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To verify that the role above has the required permissions:\n", "\n", "1. Go to the IAM console: https://console.aws.amazon.com/iam/home.\n", "2. Select **Roles**.\n", "3. Enter the role name in the search box to search for that role. \n", "4. Select the role.\n", "5. Use the **Permissions** tab to verify this role has the required permissions attached." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prepare SageMaker Training Images\n", "\n", "By default, SageMaker uses the latest [Amazon Deep Learning Container Images (DLC)](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) PyTorch training image. In this step, we use it as a base image and install the additional dependencies required for training the YOLOv5 model.\n", "\n", "### Build and Push Docker Image to ECR\n", "\n", "Run the following cells to build the Docker image and push it to ECR." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "image = \"yolov5-smdataparallel-sagemaker\" # use any image name you want, as long as it is all lower case\n", "tag = \"pt1.10.2\" # use any tag name you want" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<subprocess.Popen at 0x7fb6a153b048>" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import subprocess\n", "\n", "# Log in to the AWS Deep Learning Containers registry so the base image can be pulled.\n", "# The login password is piped from `aws ecr get-login-password` into `docker login`.\n", "subprocess.Popen(\n", " \"aws ecr get-login-password --region \"\n", " + region\n", " + \" | docker login --username AWS --password-stdin 763104351884.dkr.ecr.\"\n", " + region\n", " + \".amazonaws.com\",\n", " shell=True,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pygmentize ./Dockerfile" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pygmentize ./build_and_push.sh" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "! chmod +x build_and_push.sh; bash build_and_push.sh {region} {image} {tag}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import subprocess\n", "\n", "# Fetch a fresh copy of the SMDDP examples repository, which contains the YOLOv5 training code.\n", "subprocess.Popen(\n", " \"rm -rf SMDDP-Examples && git clone https://github.com/HerringForks/SMDDP-Examples\", shell=True\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preparing FSx Input for SageMaker\n", "\n", "1. Download and prepare your training dataset on S3.\n", "2. [Create an FSx for Lustre file system linked to the S3 bucket that holds your training data](https://docs.aws.amazon.com/fsx/latest/LustreGuide/create-fs-linked-data-repo.html). Make sure to add an endpoint to your VPC that allows S3 access.\n", "3. 
[Configure your SageMaker training job to use FSx](https://aws.amazon.com/blogs/machine-learning/speed-up-training-on-amazon-sagemaker-using-amazon-efs-or-amazon-fsx-for-lustre-file-systems). \n", "\n", "### Important Caveats\n", "\n", "1. When launching the SageMaker notebook instance, use the same `subnet`, `vpc`, and `security group` that are used with FSx. The same configuration will be used by your SageMaker training job.\n", "2. Make sure you [set appropriate inbound/outbound rules in the `security group`](https://docs.aws.amazon.com/fsx/latest/LustreGuide/limit-access-security-groups.html). In particular, the ports listed there must be open for SageMaker to access the FSx file system in the training job. \n", "3. Make sure the SageMaker IAM role used to launch this SageMaker training job has access to `AmazonFSx`.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SageMaker PyTorch Estimator function options\n", "\n", "In the following code block, you can update the estimator function to use a different instance type, instance count, and distribution strategy. You also pass in the training script, `train.py`, from the cloned `SMDDP-Examples` repository.\n", "\n", "**Instance types**\n", "\n", "1. ml.p3.16xlarge\n", "1. ml.p3dn.24xlarge [Recommended]\n", "1. ml.p4d.24xlarge [Recommended]\n", "\n", "**Instance count**\n", "\n", "You should use at least 2 instances, but you can also use 1 for testing this example.\n", "\n", "**Distribution strategy**\n", "\n", "Note that to use DDP mode, you need to update the \"distribution\" strategy. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "from sagemaker.pytorch import PyTorch" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "instance_type = \"ml.p4d.24xlarge\" # Other supported instance types: ml.p3.16xlarge, ml.p3dn.24xlarge\n", "instance_count = 8 # You can use 2, 4, 8 etc.\n", "docker_image = f\"{account}.dkr.ecr.{region}.amazonaws.com/{image}:{tag}\" # The ECR image built with the Dockerfile above\n", "username = \"AWS\"\n", "subnets = [\"<subnet-xxx>\"] # Should be the same subnet used for FSx. Example: subnet-0f9XXXX\n", "security_group_ids = [\"<sg-xxx>\"] # Should be the same security group used for FSx. Example: sg-03ZZZZZZ\n", "job_name = \"<job-xxx>\" # This name is used for the SageMaker training job, which makes the job easy to find in the SageMaker training jobs console.\n", "file_system_id = \"<fs-xxx>\" # FSx file system ID with your training dataset. 
Example: 'fs-0bYYYYYY'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hyperparameters = {\n", " \"data\": \"data/bccd.yaml\",\n", " \"cfg\": \"models/yolov5x.yaml\",\n", " \"epochs\": 10,\n", " \"batch-size\": 1024,\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "estimator = PyTorch(\n", " entry_point=\"train.py\",\n", " role=role,\n", " image_uri=docker_image, # your image\n", " source_dir=\"./SMDDP-Examples/pytorch/object_detection/yolov5\",\n", " instance_count=instance_count,\n", " instance_type=instance_type,\n", " framework_version=\"1.10.2\",\n", " py_version=\"py38\",\n", " sagemaker_session=sagemaker_session,\n", " hyperparameters=hyperparameters,\n", " subnets=subnets,\n", " security_group_ids=security_group_ids,\n", " debugger_hook_config=False,\n", " distribution={\"smdistributed\": {\"dataparallel\": {\"enabled\": True}}},\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Configure FSx Input for your SageMaker Training job\n", "\n", "from sagemaker.inputs import FileSystemInput\n", "\n", "file_system_directory_path = \"<xxx>\" # NOTE: use the fsx path to your data\n", "file_system_access_mode = \"rw\"\n", "file_system_type = \"FSxLustre\"\n", "train_fs = FileSystemInput(\n", " file_system_id=file_system_id,\n", " file_system_type=file_system_type,\n", " directory_path=file_system_directory_path,\n", " file_system_access_mode=file_system_access_mode,\n", ")\n", "data_channels = {\"train\": train_fs}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Submit SageMaker training job\n", "estimator.fit(inputs=data_channels, job_name=job_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "Now we have trained the YOLOv5 model on SageMaker using SMDDP. 
Now the model is ready to be deployed and used for inference." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook CI Test Results\n", "\n", "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", "\n", "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/training|distributed_training|pytorch|data_parallel|yolov5|yolov5.ipynb)\n", "\n", "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/training|distributed_training|pytorch|data_parallel|yolov5|yolov5.ipynb)\n", "\n", "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/training|distributed_training|pytorch|data_parallel|yolov5|yolov5.ipynb)\n", "\n", "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/training|distributed_training|pytorch|data_parallel|yolov5|yolov5.ipynb)\n", "\n", "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/training|distributed_training|pytorch|data_parallel|yolov5|yolov5.ipynb)\n", "\n", "![This eu-west-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/training|distributed_training|pytorch|data_parallel|yolov5|yolov5.ipynb)\n", "\n", "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/training|distributed_training|pytorch|data_parallel|yolov5|yolov5.ipynb)\n", "\n", "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/training|distributed_training|pytorch|data_parallel|yolov5|yolov5.ipynb)\n", "\n", "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/training|distributed_training|pytorch|data_parallel|yolov5|yolov5.ipynb)\n", "\n", "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/training|distributed_training|pytorch|data_parallel|yolov5|yolov5.ipynb)\n", "\n", "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/training|distributed_training|pytorch|data_parallel|yolov5|yolov5.ipynb)\n", "\n", "![This ap-southeast-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/training|distributed_training|pytorch|data_parallel|yolov5|yolov5.ipynb)\n", "\n", "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/training|distributed_training|pytorch|data_parallel|yolov5|yolov5.ipynb)\n", "\n", "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/training|distributed_training|pytorch|data_parallel|yolov5|yolov5.ipynb)\n", "\n", "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/training|distributed_training|pytorch|data_parallel|yolov5|yolov5.ipynb)\n" ] } ], "metadata": { "kernelspec": { "display_name": "conda_amazonei_pytorch_latest_p36", "language": "python", "name": "conda_amazonei_pytorch_latest_p36" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" } }, "nbformat": 4, "nbformat_minor": 4 }
