# Distributed Data Parallel EfficientNet Training with PyTorch and SageMaker Distributed


---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. 

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/training|distributed_training|pytorch|data_parallel|efficientnet|pytorch_smdataparallel_efficientnet_demo.ipynb)

---


[Amazon SageMaker's distributed library](https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training.html) can be used to train deep learning models faster and cheaper. The [data parallel](https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel.html) feature in this library (`smdistributed.dataparallel`) is a distributed data parallel training framework for PyTorch, TensorFlow, and MXNet.

This notebook demonstrates how to use `smdistributed.dataparallel` with PyTorch on [Amazon SageMaker](https://aws.amazon.com/sagemaker/) to train an EfficientNet model on a large image dataset such as [ImageNet](https://image-net.org/download.php), using an [Amazon FSx for Lustre file system](https://aws.amazon.com/fsx/lustre/) as the data source.

The outline of steps is as follows:

1. Stage the ImageNet dataset files on [Amazon S3](https://aws.amazon.com/s3/).
2. Create an [Amazon FSx for Lustre file system](https://docs.aws.amazon.com/fsx/) and import data into it from S3.
3. Configure a data channel for training using the Amazon FSx for Lustre file system.
4. Build a Docker training image and push it to [Amazon ECR](https://aws.amazon.com/ecr/).
5. Configure the estimator function options, such as distribution strategy and hyperparameters.
6. Start training.

**NOTE:** With a large training dataset such as ImageNet, we recommend using [Amazon FSx](https://aws.amazon.com/fsx/) as the input file system for the SageMaker training job. FSx input significantly cuts down training startup time because it avoids downloading the training data each time you start a training job (as happens with S3 input), and it provides good data read throughput.


**NOTE:** This example requires SageMaker Python SDK v2.X.

## Amazon SageMaker Initialization

Initialize the notebook instance. Get the AWS Region and a SageMaker execution role.

### SageMaker role

The following code cell defines `role` which is the IAM role ARN used to create and run SageMaker training and hosting jobs. This is the same IAM role used to create this SageMaker Notebook instance. 

`role` must have permission to create a SageMaker training job and host a model. For granular policies you can use to grant these permissions, see [Amazon SageMaker Roles](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html). If you do not require fine-grained permissions for this demo, you can use the IAM managed policy AmazonSageMakerFullAccess.

Because this notebook uses FSx, make sure to also attach FSx access permissions to this IAM role. If you do not require fine-grained permissions for this demo, you can use the IAM managed policy AmazonFSxFullAccess.

In [None]:
%%time
! python3 -m pip install --upgrade sagemaker
import sagemaker
from sagemaker import get_execution_role
from sagemaker.estimator import Estimator
import boto3

sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()

role = (
    get_execution_role()
)  # provide a pre-existing role ARN as an alternative to creating a new role
role_name = role.split("/")[-1]  # the role name is the last segment of the ARN
print(f"SageMaker Execution Role: {role}")
print(f"The name of the Execution role: {role_name}")

client = boto3.client("sts")
account = client.get_caller_identity()["Account"]
print(f"AWS account: {account}")

session = boto3.session.Session()
region = session.region_name
print(f"AWS region: {region}")

To verify that the role above has required permissions:

1. Go to the [IAM console](https://console.aws.amazon.com/iam/home).
2. Select **Roles**.
3. Enter the role name in the search box to search for that role. 
4. Select the role.
5. Use the **Permissions** tab to verify this role has required permissions attached.
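
The console check above can also be approximated from code. The following is a minimal sketch, assuming an IAM client created with `boto3.client("iam")` and assuming the permissions come from the two managed policies named earlier. A role that grants equivalent access through inline or custom policies would be reported as missing them. `missing_managed_policies` is a hypothetical helper name, not part of any SDK.

```python
# Managed policies this notebook suggests for a quick setup.
REQUIRED_POLICIES = {"AmazonSageMakerFullAccess", "AmazonFSxFullAccess"}


def missing_managed_policies(iam_client, role_name, required=REQUIRED_POLICIES):
    """Return the required managed policy names that are not attached to the role.

    Only the first page of results is inspected, which is enough for roles
    with a handful of attached policies.
    """
    response = iam_client.list_attached_role_policies(RoleName=role_name)
    attached = {policy["PolicyName"] for policy in response["AttachedPolicies"]}
    return sorted(set(required) - attached)


# Usage inside the notebook (the caller needs iam:ListAttachedRolePolicies):
#   import boto3
#   print(missing_managed_policies(boto3.client("iam"), role.split("/")[-1]))
```

An empty list means both managed policies are attached.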

## Preparing FSx Input for SageMaker

Pre-requisite: [Create an Amazon S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html).

1. Download, prepare, and upload your training dataset to Amazon S3. Follow [steps 2-5 from this guide to download and decompress the training images into a local directory](https://github.com/HerringForks/SMDDP-Examples/tree/main/pytorch/image_classification/efficientnet#quick-start-guide). Then, [upload the data to Amazon S3](https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html).
2. Follow these [steps to create an FSx for Lustre file system linked to the S3 bucket that holds your training data](https://docs.aws.amazon.com/fsx/latest/LustreGuide/getting-started-step1.html). Make sure to add an endpoint to your VPC allowing S3 access.

**Important Caveats**

- You need to use the same `subnet`, `vpc` and `security group` used with FSx when launching the SageMaker notebook instance. The same configuration will be used by your SageMaker training job.
- Make sure you set the [appropriate inbound/outbound rules in the `security group`](https://docs.aws.amazon.com/fsx/latest/LustreGuide/limit-access-security-groups.html). Specifically, the ports listed in that guide must be open for SageMaker to access the FSx file system in the training job.
- Make sure the `SageMaker IAM Role` used to launch this SageMaker training job has access to `AmazonFSx`.

### Specify the FSx information for the training job
Go to your [FSx console](https://console.aws.amazon.com/fsx/) and open your recently created file system. In the **Summary** section, note the _File system ID_ and the _Mount name_. Also, in the **Network & security** tab, click the network interface to display the _Subnet ID_ and _Security groups_. Use this information to complete the cell below, then run it to configure the FSx input for your SageMaker training job.

In [None]:
# Configure FSx Input for your SageMaker Training job
from sagemaker.inputs import FileSystemInput

# FSx file system ID with your training dataset. Example: 'fs-0bYYYYYY'
file_system_id = "<fsx_id>"
# FSx path with your training data. Example: '/fsx_mount_name/imagenet'
file_system_directory_path = "/<fsx_mount_name>/<path_to_data>"
file_system_access_mode = "rw"
file_system_type = "FSxLustre"
train_fs = FileSystemInput(
    file_system_id=file_system_id,
    file_system_type=file_system_type,
    directory_path=file_system_directory_path,
    file_system_access_mode=file_system_access_mode,
)
# Specify the training data channel using the FSx filesystem. This will be provided to the SageMaker training job later
data_channels = {"train": train_fs}

# The following variables will be used later in the notebook
fsx_subnets = ["<subnet_id>"]  # Should be the subnet used for FSx. Example: subnet-0f9XXXX
fsx_security_group_ids = [
    "<security_group_id>"
]  # Should be the security group used for FSx. sg-03ZZZZZZ
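
The values entered above can also be looked up programmatically rather than copied from the console. The following is a minimal sketch, assuming an FSx client created with `boto3.client("fsx")` and permission to call `DescribeFileSystems`. `fsx_lustre_summary` is a hypothetical helper name.

```python
def fsx_lustre_summary(fsx_client, file_system_id):
    """Look up the mount name, subnets, and network interfaces of a Lustre file system."""
    response = fsx_client.describe_file_systems(FileSystemIds=[file_system_id])
    file_system = response["FileSystems"][0]
    return {
        "mount_name": file_system["LustreConfiguration"]["MountName"],
        "subnet_ids": file_system["SubnetIds"],
        "network_interface_ids": file_system["NetworkInterfaceIds"],
    }


# Usage inside the notebook:
#   import boto3
#   info = fsx_lustre_summary(boto3.client("fsx"), file_system_id)
#   print(info["mount_name"], info["subnet_ids"])
```

The security groups are attached to the network interfaces rather than to the file system itself, so they can be found by passing the returned interface IDs to the EC2 `describe_network_interfaces` call.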

## Prepare SageMaker Training Images

By default, SageMaker uses the latest [Amazon Deep Learning Container Images (DLC)](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) PyTorch training image. In this step, we use it as a base image and install the additional dependencies required for training the EfficientNet model.

Run the cell below to indicate the account that hosts the DLC. Note that the account ID of the PyTorch DLC image might vary depending on AWS Regions. To look up the right DLC image URI with a corresponding account ID for your Region, see [Available Deep Learning Containers Images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md).

In [None]:
dlc_account_id = 763104351884  # By default, set the account ID used for most regions

**Minimum PyTorch version**

Note that the training script used in this notebook uses the SageMaker distributed data parallel library as a backend for PyTorch. That capability was first introduced in version 1.4.0 of the library, which was integrated with support for PyTorch 1.10.2. Hence, the base DLC should use PyTorch >= 1.10.2.
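
A version requirement like this is easy to get wrong with plain string comparison, because `"1.9.1" > "1.10.2"` is true when compared as text. The following is a small sketch of a numeric check. The helper names are hypothetical, and the version string would typically come from `torch.__version__` inside the container.

```python
import re

MIN_PYTORCH = (1, 10, 2)


def parse_version(version):
    """Turn '1.12.1+cu113' into (1, 12, 1), ignoring any local build suffix."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part or 0) for part in match.groups())


def supports_smddp_backend(version):
    """Compare versions as integer tuples rather than as strings."""
    return parse_version(version) >= MIN_PYTORCH


print(supports_smddp_backend("1.9.1"))         # False
print(supports_smddp_backend("1.10.2"))        # True
print(supports_smddp_backend("1.12.1+cu113"))  # True
```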

Run the cells below to assign a name for your image and observe the dockerfile.

In [None]:
image = (
    "pt-smdataparallel-efficientnet-sagemaker"  # Example: pt-smdataparallel-efficientnet-sagemaker
)
tag = "latest"  # Example: latest

In [None]:
!pygmentize ./Dockerfile

### Get the training script
In this [GitHub repository](https://github.com/HerringForks/SMDDP-Examples/tree/main/pytorch/image_classification), we have forked an EfficientNet example from [NVIDIA/DeepLearningExamples](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Classification/ConvNets) and modified the training script to work with the SageMaker distributed data parallel library. Starting from v1.4.0, the library is available as a backend option for the PyTorch distributed package. Hence, only two changes are required to adapt a PyTorch DDP script:

- Import the `smdistributed.dataparallel.torch.torch_smddp` module.
- Use `"smddp"` as the backend when calling `dist.init_process_group`.

Example:
    
```python
import smdistributed.dataparallel.torch.torch_smddp
import torch.distributed as dist

dist.init_process_group(backend='smddp')
```
Learn more about how to [modify a PyTorch training script to use SageMaker data parallel library](https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-modify-sdp-pt.html).
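
When the same script also needs to run outside SageMaker, for example on a local GPU machine where the library is not installed, one option is to choose the backend at runtime. This is a sketch under that assumption and is not how the forked training script does it. `select_backend` is a hypothetical helper, and the `"nccl"` fallback is an illustrative choice.

```python
import importlib


def select_backend(fallback="nccl"):
    """Return 'smddp' if the SageMaker data parallel module imports, else the fallback."""
    try:
        # Importing this module is what makes the "smddp" backend available.
        importlib.import_module("smdistributed.dataparallel.torch.torch_smddp")
    except ImportError:
        return fallback
    return "smddp"


# In a training script, the result would be passed to the process group setup:
#   import torch.distributed as dist
#   dist.init_process_group(backend=select_backend())
```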


Run the cell below to clone the repository that contains the adaptation of EfficientNet with PyTorch-SMDataParallel.

In [None]:
# Note that the requirements file is removed as those dependencies are built into the docker container.
!rm -rf SMDDP-Examples
!git clone --recursive https://github.com/HerringForks/SMDDP-Examples.git && \
    rm SMDDP-Examples/pytorch/image_classification/requirements.txt

### Build the docker container and push it to ECR

The last step to prepare the training image is to build the Docker container and push it to [Amazon ECR](https://aws.amazon.com/ecr/). To do that, we provide the following script, which takes care of setting up the Amazon ECR repository, handling permissions, building the Docker container, and pushing it to Amazon ECR.

In [None]:
!pygmentize ./build_and_push.sh

In [None]:
! aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {dlc_account_id}.dkr.ecr.{region}.amazonaws.com
! chmod +x build_and_push.sh; bash build_and_push.sh {dlc_account_id} {region} {image} {tag}

### Save the docker image name from Amazon ECR
Now that the docker image is in Amazon ECR, it is ready to be used in the training job. Save the name in a variable.

In [None]:
docker_image = f"{account}.dkr.ecr.{region}.amazonaws.com/{image}:{tag}"
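
Before launching a job, it can be worth confirming that the push succeeded. The following is a minimal sketch, assuming an ECR client created with `boto3.client("ecr")` in the same account and Region as the repository. `pushed_image_digest` is a hypothetical helper name, and the underlying call raises a client error if the tag is not present in the repository.

```python
def pushed_image_digest(ecr_client, repository, tag):
    """Return the digest of repository:tag as reported by Amazon ECR."""
    response = ecr_client.describe_images(
        repositoryName=repository,
        imageIds=[{"imageTag": tag}],
    )
    return response["imageDetails"][0]["imageDigest"]


# Usage inside the notebook:
#   import boto3
#   print(pushed_image_digest(boto3.client("ecr", region_name=region), image, tag))
```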

## Configure SageMaker PyTorch Estimator function options

In the following code blocks, you can update the estimator function to use a different instance type, instance count, distribution strategy and hyperparameters. You're also passing an entry point to the training script you downloaded in previous steps.

**Instance types**

`smdistributed.dataparallel` supports model training on SageMaker with the following instance types only. For best performance, it is recommended you use an instance type that supports [Amazon Elastic Fabric Adapter (EFA)](https://aws.amazon.com/hpc/efa/).

1. `ml.p3.16xlarge`
1. `ml.p3dn.24xlarge` [Recommended]
1. `ml.p4d.24xlarge` [Recommended]

**Instance count**

To get the best performance and the most out of `smdistributed.dataparallel`, you should use at least 2 instances, but you can also use 1 for testing this example.

In [None]:
instance_type = "ml.p4d.24xlarge"  # Other supported instance types: ml.p3.16xlarge, ml.p3dn.24xlarge
instance_count = 2  # You can use 2, 4, 8, etc.
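
The instance type and count together determine how many GPU workers take part in data parallel training, which in turn scales the effective batch size. The following is a small sketch of that arithmetic, assuming one worker process per GPU. Each of the three instance types listed above has 8 GPUs. The per-GPU batch size of 128 is a hypothetical value for illustration, not a default of the training script.

```python
# GPUs per instance for the instance types listed above.
GPUS_PER_INSTANCE = {
    "ml.p3.16xlarge": 8,
    "ml.p3dn.24xlarge": 8,
    "ml.p4d.24xlarge": 8,
}


def world_size(instance_type, instance_count):
    """Total number of GPU workers across all instances."""
    if instance_type not in GPUS_PER_INSTANCE:
        raise ValueError(f"Unsupported instance type: {instance_type}")
    return GPUS_PER_INSTANCE[instance_type] * instance_count


def global_batch_size(per_gpu_batch_size, instance_type, instance_count):
    """Number of samples processed per optimizer step across all workers."""
    return per_gpu_batch_size * world_size(instance_type, instance_count)


print(world_size("ml.p4d.24xlarge", 2))              # 16
print(global_batch_size(128, "ml.p4d.24xlarge", 2))  # 2048
```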

**Distribution strategy**

Note that to use DDP mode, you need to update the `distribution` strategy, and set it to use `smdistributed dataparallel`.

In [None]:
dist_strategy = {"smdistributed": {"dataparallel": {"enabled": True}}}

**Hyperparameters**

The EfficientNet training script used in this notebook provides an internal mechanism to select a set of default hyperparameters based on model, training mode, precision, and platform. You can read more in the [default configuration section](https://github.com/HerringForks/SMDDP-Examples/tree/main/pytorch/image_classification/efficientnet#default-configuration).
If you want to override any of the values, you can specify the hyperparameters manually in the cell below. For example, you could add `"epochs": 10` to run for only 10 epochs.

In [None]:
# Configure the hyperparameters
model_name = "efficientnet-b4"  # Either efficientnet-b0 or efficientnet-b4
hyperparameters = {
    "model": model_name,
    "mode": "benchmark_training",  # Use benchmark_training_short for a quick run with synthetic data
    "precision": "AMP",  # TF32 or AMP
    "platform": "P4D",
}
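
The `"epochs": 10` override mentioned above can be layered on top of a base configuration without editing it in place. This is a minimal sketch. The variable names are illustrative and kept separate from the `hyperparameters` dictionary used by the rest of the notebook.

```python
base_hyperparameters = {
    "model": "efficientnet-b4",
    "mode": "benchmark_training",
    "precision": "AMP",
    "platform": "P4D",
}

# Later keys win, so overrides replace or extend the base values
# without mutating the original dictionary.
overrides = {"epochs": 10}
hyperparameters_with_overrides = {**base_hyperparameters, **overrides}

print(hyperparameters_with_overrides["epochs"])  # 10
print("epochs" in base_hyperparameters)          # False
```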

**Configure metrics to be displayed for the training job**

In this example, we show how to record a custom training throughput metric based on a regular expression. Learn more about [defining training metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/training-metrics.html).

In [None]:
metric_definitions = [
    {"Name": "train_throughput", "Regex": "train.compute_ips : (.*?) "},
]
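
A metric regex is worth testing locally before launching a long job, since a pattern that never matches silently produces no metric. The sketch below applies the pattern above to a hypothetical log line. The real log format is whatever the training script prints, so the sample text here is an assumption.

```python
import re

pattern = re.compile("train.compute_ips : (.*?) ")

# Hypothetical log line for illustration only.
sample = "epoch 0 train.compute_ips : 2510.7 img/s train.loss : 6.91"
match = pattern.search(sample)
throughput = float(match.group(1))  # the first capture group is the metric value
print(throughput)  # 2510.7

# The pattern ends with a space, so a value at the very end of a line is not captured.
print(pattern.search("train.compute_ips : 2510.7"))  # None
```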

**Create the estimator function and pass the parameters**

Use all parameters from previous sections to configure the estimator function.

In [None]:
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="entry_point.py",
    role=role,
    image_uri=docker_image,
    source_dir=".",
    instance_count=instance_count,
    instance_type=instance_type,
    py_version="py38",
    sagemaker_session=sagemaker_session,
    hyperparameters=hyperparameters,
    metric_definitions=metric_definitions,
    subnets=fsx_subnets,
    security_group_ids=fsx_security_group_ids,
    debugger_hook_config=False,
    distribution=dist_strategy,
)

## Start the SageMaker training job
The last step before launching the training job is to assign it a name, so you can identify it easily in the [SageMaker console](https://console.aws.amazon.com/sagemaker/). Training job names must be unique within an AWS Region in your account, so change the name before rerunning the job.

In [None]:
job_name = f"pt-smddp-{model_name}-{instance_count}p4d"
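
Training job names must be unique within an account and Region, so reusing a fixed name fails on the second run. One common approach is to append a timestamp. The following is a sketch of that idea, with a check against the naming rules for training jobs (letters, digits, and hyphens, at most 63 characters). `unique_job_name` is a hypothetical helper name.

```python
import re
import time

# Allowed characters for SageMaker training job names; length is checked separately.
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")
MAX_JOB_NAME_LENGTH = 63


def unique_job_name(prefix):
    """Append a UTC timestamp to the prefix and validate the resulting name."""
    name = f"{prefix}-{time.strftime('%Y%m%d-%H%M%S', time.gmtime())}"
    if len(name) > MAX_JOB_NAME_LENGTH or not JOB_NAME_PATTERN.match(name):
        raise ValueError(f"Invalid training job name: {name}")
    return name


print(unique_job_name("pt-smddp-efficientnet-b4-2p4d"))
```

The returned value can be passed as `job_name` in place of a fixed string.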

In [None]:
# Submit SageMaker training job
estimator.fit(inputs=data_channels, job_name=job_name)

## Next steps

Now that you have a trained model, you can deploy an endpoint to host the model. After you deploy the endpoint, you can then test it with inference requests. The following cell will store the model_data variable to be used with the inference notebook.

In [None]:
model_data = estimator.model_data
print("Storing {} as model_data".format(model_data))
%store model_data

## Clean Up

To avoid incurring unnecessary charges, follow these [steps to use the AWS Management Console to delete resources such as endpoints, notebook instances, S3 buckets, and CloudWatch logs](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-cleanup.html).

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.

![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/training|distributed_training|pytorch|data_parallel|efficientnet|pytorch_smdataparallel_efficientnet_demo.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/training|distributed_training|pytorch|data_parallel|efficientnet|pytorch_smdataparallel_efficientnet_demo.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/training|distributed_training|pytorch|data_parallel|efficientnet|pytorch_smdataparallel_efficientnet_demo.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/training|distributed_training|pytorch|data_parallel|efficientnet|pytorch_smdataparallel_efficientnet_demo.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/training|distributed_training|pytorch|data_parallel|efficientnet|pytorch_smdataparallel_efficientnet_demo.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/training|distributed_training|pytorch|data_parallel|efficientnet|pytorch_smdataparallel_efficientnet_demo.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/training|distributed_training|pytorch|data_parallel|efficientnet|pytorch_smdataparallel_efficientnet_demo.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/training|distributed_training|pytorch|data_parallel|efficientnet|pytorch_smdataparallel_efficientnet_demo.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/training|distributed_training|pytorch|data_parallel|efficientnet|pytorch_smdataparallel_efficientnet_demo.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/training|distributed_training|pytorch|data_parallel|efficientnet|pytorch_smdataparallel_efficientnet_demo.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/training|distributed_training|pytorch|data_parallel|efficientnet|pytorch_smdataparallel_efficientnet_demo.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/training|distributed_training|pytorch|data_parallel|efficientnet|pytorch_smdataparallel_efficientnet_demo.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/training|distributed_training|pytorch|data_parallel|efficientnet|pytorch_smdataparallel_efficientnet_demo.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/training|distributed_training|pytorch|data_parallel|efficientnet|pytorch_smdataparallel_efficientnet_demo.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/training|distributed_training|pytorch|data_parallel|efficientnet|pytorch_smdataparallel_efficientnet_demo.ipynb)
