miscellaneous/distributed_tensorflow_mask_rcnn/mask-rcnn-scriptmode-s3.ipynb

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Distributed Training of Mask-RCNN in Amazon SageMaker using S3\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-s3.ipynb)\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "This notebook is a step-by-step tutorial on distributed training of [Mask R-CNN](https://arxiv.org/abs/1703.06870) implemented in the [TensorFlow](https://www.tensorflow.org/) framework. Mask R-CNN is also referred to as a heavyweight object detection model, and it is part of [MLPerf](https://www.mlperf.org/training-results-0-6/).\n", "\n", "Concretely, we will describe the steps for training [TensorPack Faster-RCNN/Mask-RCNN](https://github.com/tensorpack/tensorpack/tree/master/examples/FasterRCNN) and [AWS Samples Mask R-CNN](https://github.com/aws-samples/mask-rcnn-tensorflow) in [Amazon SageMaker](https://aws.amazon.com/sagemaker/) using [Amazon S3](https://aws.amazon.com/s3/) as the data source.\n", "\n", "The outline of steps is as follows:\n", "\n", "1. Stage the COCO 2017 dataset in [Amazon S3](https://aws.amazon.com/s3/)\n", "2. Build the SageMaker training image and push it to [Amazon ECR](https://aws.amazon.com/ecr/)\n", "3. Configure data input channels\n", "4. Configure hyper-parameters\n", "5. Define training metrics\n", "6. Define the training job and start training\n", "\n", "Before we get started, let us initialize two Python variables ```aws_region``` and ```s3_bucket``` that we will use throughout the notebook. 
The ```s3_bucket``` must be located in the region of this notebook instance." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "\n", "session = boto3.session.Session()\n", "aws_region = session.region_name\n", "s3_bucket = # your-s3-bucket-name\n", "\n", "\n", "try:\n", " s3_client = boto3.client('s3')\n", " response = s3_client.get_bucket_location(Bucket=s3_bucket)\n", " print(f\"Bucket region: {response['LocationConstraint']}\")\n", "except:\n", " print(f\"Access Error: Check if '{s3_bucket}' S3 bucket is in '{aws_region}' region\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Stage COCO 2017 dataset in Amazon S3\n", "\n", "We use [COCO 2017 dataset](http://cocodataset.org/#home) for training. We download COCO 2017 training and validation dataset to this notebook instance, extract the files from the dataset archives, and upload the extracted files to your Amazon [S3 bucket](https://docs.aws.amazon.com/en_pv/AmazonS3/latest/gsg/CreatingABucket.html) with the prefix ```mask-rcnn/sagemaker/input/train```. The ```prepare-s3-bucket.sh``` script executes this step.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cat ./prepare-s3-bucket.sh" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Using your *Amazon S3 bucket* as argument, run the cell below. If you have already uploaded COCO 2017 dataset to your Amazon S3 bucket *in this AWS region*, you may skip this step. The expected time to execute this step is 20 minutes." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "!./prepare-s3-bucket.sh {s3_bucket}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Build and push SageMaker training images\n", "\n", "For this step, the [IAM Role](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html) attached to this notebook instance needs full access to the Amazon ECR service. If you created this notebook instance using the ```./stack-sm.sh``` script in this repository, the IAM Role attached to this notebook instance is already set up with full access to the ECR service. \n", "\n", "Below, we have a choice of two different implementations:\n", "\n", "1. The [TensorPack Faster-RCNN/Mask-RCNN](https://github.com/tensorpack/tensorpack/tree/master/examples/FasterRCNN) implementation supports a maximum per-GPU batch size of 1, and does not support mixed precision. It can be used with mainstream TensorFlow releases.\n", "\n", "2. [AWS Samples Mask R-CNN](https://github.com/aws-samples/mask-rcnn-tensorflow) is an optimized implementation that supports a maximum per-GPU batch size of 4 and supports mixed precision. This implementation uses custom TensorFlow ops. The required custom TensorFlow ops are available in [AWS Deep Learning Container](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) images in the ```tensorflow-training``` repository with image tag ```1.15.2-gpu-py36-cu100-ubuntu18.04```, or later.\n", "\n", "It is recommended that you build and push both SageMaker training images and use either image for training later.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TensorPack Faster-RCNN/Mask-RCNN\n", "\n", "Use the ```./container-script-mode/build_tools/build_and_push.sh``` script to build and push the TensorPack Faster-RCNN/Mask-RCNN training image to Amazon ECR. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cat ./container-script-mode/build_tools/build_and_push.sh" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using your *AWS region* as argument, run the cell below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "! ./container-script-mode/build_tools/build_and_push.sh {aws_region}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set ```tensorpack_image``` below to Amazon ECR URI of the image you pushed above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tensorpack_image = # mask-rcnn-tensorpack-sagemaker ECR URI" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### AWS Samples Mask R-CNN\n", "Use ```./container-optimized-script-mode/build_tools/build_and_push.sh``` script to build and push the AWS Samples Mask R-CNN training image to Amazon ECR." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cat ./container-optimized-script-mode/build_tools/build_and_push.sh" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using your *AWS region* as argument, run the cell below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "! ./container-optimized-script-mode/build_tools/build_and_push.sh {aws_region}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Set ```aws_samples_image``` below to Amazon ECR URI of the image you pushed above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "aws_samples_image = # mask-rcnn-tensorflow-sagemaker ECR URI" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SageMaker Initialization \n", "\n", "First we upgrade SageMaker to 2.3.0 API. 
If your notebook is already using the latest SageMaker 2.x API, you may skip the next cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! pip install --upgrade pip\n", "! pip install sagemaker" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have staged the data and we have built and pushed the training Docker image to Amazon ECR. Now we are ready to start using Amazon SageMaker.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "import sagemaker\n", "from sagemaker import get_execution_role\n", "from sagemaker.tensorflow.estimator import TensorFlow\n", "\n", "role = (\n", " get_execution_role()\n", ") # provide a pre-existing role ARN as an alternative to creating a new role\n", "print(f\"SageMaker Execution Role:{role}\")\n", "\n", "client = boto3.client(\"sts\")\n", "account = client.get_caller_identity()[\"Account\"]\n", "print(f\"AWS account:{account}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we set ```training_image``` to the Amazon ECR image URI you saved in a previous step. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "training_image = # set to tensorpack_image or aws_samples_image \n", "print(f'Training image: {training_image}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define SageMaker Data Channels\n", "In this step, we define the SageMaker *train* data channel. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.inputs import TrainingInput\n", "\n", "prefix = \"mask-rcnn/sagemaker\" # prefix in your S3 bucket\n", "\n", "s3train = f\"s3://{s3_bucket}/{prefix}/input/train\"\n", "train_input = TrainingInput(\n", " s3_data=s3train, distribution=\"FullyReplicated\", s3_data_type=\"S3Prefix\", input_mode=\"File\"\n", ")\n", "\n", "data_channels = {\"train\": train_input}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we define the model output location in S3 bucket." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_output_location = f\"s3://{s3_bucket}/{prefix}/output\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configure Hyper-parameters\n", "Next, we define the hyper-parameters. \n", "\n", "Note, some hyper-parameters are different between the two implementations. The batch size per GPU in TensorPack Faster-RCNN/Mask-RCNN is fixed at 1, but is configurable in AWS Samples Mask-RCNN. 
The learning rate schedule is specified in units of steps in TensorPack Faster-RCNN/Mask-RCNN, but in epochs in AWS Samples Mask-RCNN.\n", "\n", "The default learning rate schedule values shown below correspond to training for a total of 24 epochs, at 120,000 images per epoch.\n", "\n", "<table align='left'>\n", " <caption>TensorPack Faster-RCNN/Mask-RCNN Hyper-parameters</caption>\n", " <tr>\n", " <th style=\"text-align:center\">Hyper-parameter</th>\n", " <th style=\"text-align:center\">Description</th>\n", " <th style=\"text-align:center\">Default</th>\n", " </tr>\n", " <tr>\n", " <td style=\"text-align:center\">mode_fpn</td>\n", " <td style=\"text-align:left\">Flag to indicate use of Feature Pyramid Network (FPN) in the Mask R-CNN model backbone</td>\n", " <td style=\"text-align:center\">\"True\"</td>\n", " </tr>\n", " <tr>\n", " <td style=\"text-align:center\">mode_mask</td>\n", " <td style=\"text-align:left\">A value of \"False\" means Faster-RCNN model, \"True\" means Mask R-CNN model</td>\n", " <td style=\"text-align:center\">\"True\"</td>\n", " </tr>\n", " <tr>\n", " <td style=\"text-align:center\">eval_period</td>\n", " <td style=\"text-align:left\">Evaluation period during training, in epochs</td>\n", " <td style=\"text-align:center\">1</td>\n", " </tr>\n", " <tr>\n", " <td style=\"text-align:center\">lr_schedule</td>\n", " <td style=\"text-align:left\">Learning rate schedule in training steps</td>\n", " <td style=\"text-align:center\">'[240000, 320000, 360000]'</td>\n", " </tr>\n", " <tr>\n", " <td style=\"text-align:center\">batch_norm</td>\n", " <td style=\"text-align:left\">Batch normalization option ('FreezeBN', 'SyncBN', 'GN', 'None') </td>\n", " <td style=\"text-align:center\">'FreezeBN'</td>\n", " </tr>\n", " <tr>\n", " <td style=\"text-align:center\">images_per_epoch</td>\n", " <td style=\"text-align:left\">Images per epoch </td>\n", " <td style=\"text-align:center\">120000</td>\n", " </tr>\n", " <tr>\n", " <td 
style=\"text-align:center\">data_train</td>\n", " <td style=\"text-align:left\">Training data under data directory</td>\n", " <td style=\"text-align:center\">'coco_train2017'</td>\n", " </tr>\n", " <tr>\n", " <td style=\"text-align:center\">data_val</td>\n", " <td style=\"text-align:left\">Validation data under data directory</td>\n", " <td style=\"text-align:center\">'coco_val2017'</td>\n", " </tr>\n", " <tr>\n", " <td style=\"text-align:center\">resnet_arch</td>\n", " <td style=\"text-align:left\">Must be 'resnet50' or 'resnet101'</td>\n", " <td style=\"text-align:center\">'resnet50'</td>\n", " </tr>\n", " <tr>\n", " <td style=\"text-align:center\">backbone_weights</td>\n", " <td style=\"text-align:left\">ResNet backbone weights</td>\n", " <td style=\"text-align:center\">'ImageNet-R50-AlignPadding.npz'</td>\n", " </tr>\n", " <tr>\n", " <td style=\"text-align:center\">load_model</td>\n", " <td style=\"text-align:left\">Pre-trained model to load</td>\n", " <td style=\"text-align:center\"></td>\n", " </tr>\n", " <tr>\n", " <td style=\"text-align:center\">config:</td>\n", " <td style=\"text-align:left\">Any hyperparameter prefixed with <b>config:</b> is set as a model config parameter</td>\n", " <td style=\"text-align:center\"></td>\n", " </tr>\n", "</table>\n", "\n", " \n", "<table align='left'>\n", " <caption>AWS Samples Mask-RCNN Hyper-parameters</caption>\n", " <tr>\n", " <th style=\"text-align:center\">Hyper-parameter</th>\n", " <th style=\"text-align:center\">Description</th>\n", " <th style=\"text-align:center\">Default</th>\n", " </tr>\n", " <tr>\n", " <td style=\"text-align:center\">mode_fpn</td>\n", " <td style=\"text-align:left\">Flag to indicate use of Feature Pyramid Network (FPN) in the Mask R-CNN model backbone</td>\n", " <td style=\"text-align:center\">\"True\"</td>\n", " </tr>\n", " <tr>\n", " <td style=\"text-align:center\">mode_mask</td>\n", " <td style=\"text-align:left\">A value of \"False\" means Faster-RCNN model, \"True\" means Mask R-CNN 
model</td>\n", " <td style=\"text-align:center\">\"True\"</td>\n", " </tr>\n", " <tr>\n", " <td style=\"text-align:center\">eval_period</td>\n", " <td style=\"text-align:left\">Evaluation period during training, in epochs</td>\n", " <td style=\"text-align:center\">1</td>\n", " </tr>\n", " <tr>\n", " <td style=\"text-align:center\">lr_epoch_schedule</td>\n", " <td style=\"text-align:left\">Learning rate schedule in epochs</td>\n", " <td style=\"text-align:center\">'[(16, 0.1), (20, 0.01), (24, None)]'</td>\n", " </tr>\n", " <tr>\n", " <td style=\"text-align:center\">batch_size_per_gpu</td>\n", " <td style=\"text-align:left\">Batch size per GPU (minimum 1, maximum 4)</td>\n", " <td style=\"text-align:center\">4</td>\n", " </tr>\n", " <tr>\n", " <td style=\"text-align:center\">batch_norm</td>\n", " <td style=\"text-align:left\">Batch normalization option ('FreezeBN', 'SyncBN', 'GN', 'None') </td>\n", " <td style=\"text-align:center\">'FreezeBN'</td>\n", " </tr>\n", " <tr>\n", " <td style=\"text-align:center\">images_per_epoch</td>\n", " <td style=\"text-align:left\">Images per epoch </td>\n", " <td style=\"text-align:center\">120000</td>\n", " </tr>\n", " <tr>\n", " <td style=\"text-align:center\">data_train</td>\n", " <td style=\"text-align:left\">Training data under data directory</td>\n", " <td style=\"text-align:center\">'train2017'</td>\n", " </tr>\n", " <tr>\n", " <td style=\"text-align:center\">data_val</td>\n", " <td style=\"text-align:left\">Validation data under data directory</td>\n", " <td style=\"text-align:center\">'val2017'</td>\n", " </tr>\n", " <tr>\n", " <td style=\"text-align:center\">resnet_arch</td>\n", " <td style=\"text-align:left\">Must be 'resnet50' or 'resnet101'</td>\n", " <td style=\"text-align:center\">'resnet50'</td>\n", " </tr>\n", " <tr>\n", " <td style=\"text-align:center\">backbone_weights</td>\n", " <td style=\"text-align:left\">ResNet backbone weights</td>\n", " <td 
style=\"text-align:center\">'ImageNet-R50-AlignPadding.npz'</td>\n", " </tr>\n", " <tr>\n", " <td style=\"text-align:center\">load_model</td>\n", " <td style=\"text-align:left\">Pre-trained model to load</td>\n", " <td style=\"text-align:center\"></td>\n", " </tr>\n", " <tr>\n", " <td style=\"text-align:center\">config:</td>\n", " <td style=\"text-align:left\">Any hyperparameter prefixed with <b>config:</b> is set as a model config parameter</td>\n", " <td style=\"text-align:center\"></td>\n", " </tr>\n", "</table>" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hyperparameters = {\n", " \"mode_fpn\": \"True\",\n", " \"mode_mask\": \"True\",\n", " \"eval_period\": 1,\n", " \"batch_norm\": \"FreezeBN\",\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define Training Metrics\n", "Next, we define the regular expressions that SageMaker uses to extract algorithm metrics from training logs and send them to [Amazon CloudWatch metrics](https://docs.aws.amazon.com/en_pv/AmazonCloudWatch/latest/monitoring/working_with_metrics.html). These algorithm metrics are visualized in the SageMaker console." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metric_definitions = [\n", " {\"Name\": \"fastrcnn_losses/box_loss\", \"Regex\": \".*fastrcnn_losses/box_loss:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"fastrcnn_losses/label_loss\", \"Regex\": \".*fastrcnn_losses/label_loss:\\\\s*(\\\\S+).*\"},\n", " {\n", " \"Name\": \"fastrcnn_losses/label_metrics/accuracy\",\n", " \"Regex\": \".*fastrcnn_losses/label_metrics/accuracy:\\\\s*(\\\\S+).*\",\n", " },\n", " {\n", " \"Name\": \"fastrcnn_losses/label_metrics/false_negative\",\n", " \"Regex\": \".*fastrcnn_losses/label_metrics/false_negative:\\\\s*(\\\\S+).*\",\n", " },\n", " {\n", " \"Name\": \"fastrcnn_losses/label_metrics/fg_accuracy\",\n", " \"Regex\": \".*fastrcnn_losses/label_metrics/fg_accuracy:\\\\s*(\\\\S+).*\",\n", " },\n", " {\n", " \"Name\": \"fastrcnn_losses/num_fg_label\",\n", " \"Regex\": \".*fastrcnn_losses/num_fg_label:\\\\s*(\\\\S+).*\",\n", " },\n", " {\"Name\": \"maskrcnn_loss/accuracy\", \"Regex\": \".*maskrcnn_loss/accuracy:\\\\s*(\\\\S+).*\"},\n", " {\n", " \"Name\": \"maskrcnn_loss/fg_pixel_ratio\",\n", " \"Regex\": \".*maskrcnn_loss/fg_pixel_ratio:\\\\s*(\\\\S+).*\",\n", " },\n", " {\"Name\": \"maskrcnn_loss/maskrcnn_loss\", \"Regex\": \".*maskrcnn_loss/maskrcnn_loss:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"maskrcnn_loss/pos_accuracy\", \"Regex\": \".*maskrcnn_loss/pos_accuracy:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"mAP(bbox)/IoU=0.5\", \"Regex\": \".*mAP\\\\(bbox\\\\)/IoU=0\\\\.5:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"mAP(bbox)/IoU=0.5:0.95\", \"Regex\": \".*mAP\\\\(bbox\\\\)/IoU=0\\\\.5:0\\\\.95:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"mAP(bbox)/IoU=0.75\", \"Regex\": \".*mAP\\\\(bbox\\\\)/IoU=0\\\\.75:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"mAP(bbox)/large\", \"Regex\": \".*mAP\\\\(bbox\\\\)/large:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"mAP(bbox)/medium\", \"Regex\": \".*mAP\\\\(bbox\\\\)/medium:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": 
\"mAP(bbox)/small\", \"Regex\": \".*mAP\\\\(bbox\\\\)/small:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"mAP(segm)/IoU=0.5\", \"Regex\": \".*mAP\\\\(segm\\\\)/IoU=0\\\\.5:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"mAP(segm)/IoU=0.5:0.95\", \"Regex\": \".*mAP\\\\(segm\\\\)/IoU=0\\\\.5:0\\\\.95:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"mAP(segm)/IoU=0.75\", \"Regex\": \".*mAP\\\\(segm\\\\)/IoU=0\\\\.75:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"mAP(segm)/large\", \"Regex\": \".*mAP\\\\(segm\\\\)/large:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"mAP(segm)/medium\", \"Regex\": \".*mAP\\\\(segm\\\\)/medium:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"mAP(segm)/small\", \"Regex\": \".*mAP\\\\(segm\\\\)/small:\\\\s*(\\\\S+).*\"},\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define SageMaker Training Job\n", "\n", "Next, we use SageMaker [Estimator](https://sagemaker.readthedocs.io/en/stable/estimators.html) API to define a SageMaker Training Job. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Select script\n", "\n", "In script-mode, first we have to select an entry point script that acts as interface with SageMaker and launches the training job. For training [TensorPack Faster-RCNN/Mask-RCNN](https://github.com/tensorpack/tensorpack/tree/master/examples/FasterRCNN) model, set ```script``` to ```\"tensorpack-mask-rcnn.py\"```. For training [AWS Samples Mask R-CNN](https://github.com/aws-samples/mask-rcnn-tensorflow) model, set ```script``` to ```\"aws-mask-rcnn.py\"```." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "script= # \"tensorpack-mask-rcnn.py\" or \"aws-mask-rcnn.py\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Select distribution mode\n", "\n", "We use Message Passing Interface (MPI) to distribute the training job across multiple hosts. 
The ```custom_mpi_options``` below is only used by the [AWS Samples Mask R-CNN](https://github.com/aws-samples/mask-rcnn-tensorflow) model, and can be safely commented out for the [TensorPack Faster-RCNN/Mask-RCNN](https://github.com/tensorpack/tensorpack/tree/master/examples/FasterRCNN) model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mpi_distribution = {\"mpi\": {\"enabled\": True, \"custom_mpi_options\": \"-x TENSORPACK_FP16=1 \"}}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Define SageMaker TensorFlow Estimator\n", "We recommend using 32 GPUs, so we set ```instance_count=4``` and ```instance_type='ml.p3.16xlarge'```, because there are 8 Tesla V100 GPUs per ```ml.p3.16xlarge``` instance. We recommend using a 100 GB [Amazon EBS](https://aws.amazon.com/ebs/) storage volume with each training instance, so we set ```volume_size = 100```. \n", "\n", "We run the training job in your private VPC, so we need to set the ```subnets``` and ```security_group_ids``` prior to running the cell below. You may specify multiple subnet ids in the ```subnets``` list. The subnets included in the ```subnets``` list must be part of the output of the ```./stack-sm.sh``` CloudFormation stack script used to create this notebook instance. Specify only one security group id in the ```security_group_ids``` list. The security group id must be part of the output of the ```./stack-sm.sh``` script.\n", "\n", "For ```instance_type``` below, you have the option to use ```ml.p3.16xlarge``` with 16 GB per-GPU memory and 25 Gbps network interconnectivity, or ```ml.p3dn.24xlarge``` with 32 GB per-GPU memory and 100 Gbps network interconnectivity. The ```ml.p3dn.24xlarge``` instance type offers significantly better performance than ```ml.p3.16xlarge``` for Mask R-CNN distributed TensorFlow training.\n", "\n", "We use MPI to distribute the training job across multiple hosts." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Give Amazon SageMaker Training Jobs Access to FileSystem Resources in Your Amazon VPC.\n", "security_group_ids = # ['sg-xxxxxxxx'] \n", "subnets = # [ 'subnet-xxxxxxx']\n", "sagemaker_session = sagemaker.session.Session(boto_session=session)\n", "\n", "mask_rcnn_estimator = TensorFlow(image_uri=training_image,\n", " role=role, \n", " py_version='py3',\n", " instance_count=4, \n", " instance_type='ml.p3.16xlarge',\n", " distribution=mpi_distribution,\n", " entry_point=script,\n", " volume_size = 100,\n", " max_run = 400000,\n", " output_path=s3_output_location,\n", " sagemaker_session=sagemaker_session, \n", " hyperparameters = hyperparameters,\n", " metric_definitions = metric_definitions,\n", " subnets=subnets,\n", " security_group_ids=security_group_ids)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we launch the SageMaker training job. See ```Training Jobs``` in SageMaker console to monitor the training job. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import time\n", "\n", "job_name = f\"mask-rcnn-s3-script-mode{int(time.time())}\"\n", "print(f\"Launching Training Job: {job_name}\")\n", "\n", "# set wait=True below if you want to print logs in cell output\n", "mask_rcnn_estimator.fit(inputs=data_channels, job_name=job_name, logs=\"All\", wait=False)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook CI Test Results\n", "\n", "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", "\n", "![This us-east-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-s3.ipynb)\n", "\n", "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-s3.ipynb)\n", "\n", "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-s3.ipynb)\n", "\n", "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-s3.ipynb)\n", "\n", "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-s3.ipynb)\n", "\n", "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-s3.ipynb)\n", "\n", "![This eu-west-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-s3.ipynb)\n", "\n", "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-s3.ipynb)\n", "\n", "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-s3.ipynb)\n", "\n", "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-s3.ipynb)\n", "\n", "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-s3.ipynb)\n", "\n", "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-s3.ipynb)\n", "\n", "![This ap-northeast-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-s3.ipynb)\n", "\n", "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-s3.ipynb)\n", "\n", "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-s3.ipynb)\n" ] } ], "metadata": { "kernelspec": { "display_name": "conda_tensorflow_p36", "language": "python", "name": "conda_tensorflow_p36" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.10" } }, "nbformat": 4, "nbformat_minor": 4 }
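
The metric definitions in the notebook pair a CloudWatch metric name with a regular expression whose first capture group holds the metric value. As a minimal sketch, two of those patterns can be tried locally with Python's `re` module before launching a job. The sample log lines and the `extract_metrics` helper are hypothetical, added only for illustration; the real training log format is not shown in the notebook and may differ.

```python
import re

# Two patterns copied from the notebook's metric_definitions list.
metric_definitions = [
    {"Name": "maskrcnn_loss/accuracy", "Regex": r".*maskrcnn_loss/accuracy:\s*(\S+).*"},
    {"Name": "mAP(bbox)/IoU=0.5", "Regex": r".*mAP\(bbox\)/IoU=0\.5:\s*(\S+).*"},
]

# Hypothetical log lines (an assumption, not taken from a real training log).
sample_log_lines = [
    "maskrcnn_loss/accuracy: 0.91",
    "mAP(bbox)/IoU=0.5: 0.55",
]


def extract_metrics(lines, definitions):
    """Return (metric name, captured text) pairs for every pattern that matches a line."""
    found = []
    for line in lines:
        for definition in definitions:
            match = re.search(definition["Regex"], line)
            if match:
                found.append((definition["Name"], match.group(1)))
    return found


results = extract_metrics(sample_log_lines, metric_definitions)
for name, value in results:
    print(f"{name} -> {value}")
```

With Python's `re` semantics, the `IoU=0.5` pattern also matches a line that reports `IoU=0.5:0.95`, so it may be worth checking each pattern against real log lines from a short test run.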
