sagemaker/04_distributed_training_model_parallelism/sagemaker-notebook.ipynb

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Huggingface Sagemaker-sdk - Distributed Training Demo\n", "\n", "### Model Parallelism using `SageMakerTrainer` " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. [Introduction](#Introduction) \n", "2. [Development Environment and Permissions](#Development-Environment-and-Permissions)\n", " 1. [Installation](#Installation) \n", " 2. [Development environment](#Development-environment) \n", " 3. [Permissions](#Permissions)\n", "3. [Processing](#Preprocessing) \n", " 1. [Tokenization](#Tokenization) \n", " 2. [Uploading data to sagemaker_session_bucket](#Uploading-data-to-sagemaker_session_bucket) \n", "4. [Fine-tuning & starting Sagemaker Training Job](#Fine-tuning-\\&-starting-Sagemaker-Training-Job) \n", " 1. [Creating an Estimator and start a training job](#Creating-an-Estimator-and-start-a-training-job) \n", " 2. [Estimator Parameters](#Estimator-Parameters) \n", " 3. [Download fine-tuned model from s3](#Download-fine-tuned-model-from-s3)\n", " 3. [Attach to old training job to an estimator ](#Attach-to-old-training-job-to-an-estimator) \n", "5. [_Coming soon_:Push model to the Hugging Face hub](#Push-model-to-the-Hugging-Face-hub)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction\n", "\n", "Welcome to our end-to-end distributed Text-Classification example. In this demo, we will use the Hugging Face `transformers` and `datasets` library together with a Amazon sagemaker-sdk extension to run GLUE `mnli` benchmark on a multi-node multi-gpu cluster using [SageMaker Model Parallelism Library](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-intro.html). The demo will use the new smdistributed library to run training on multiple gpus. We extended the `Trainer` API to a the `SageMakerTrainer` to use the model parallelism library. Therefore you only have to change the imports in your `train.py`.\n", "\n", "```python\n", "from transformers.sagemaker import SageMakerTrainingArguments as TrainingArguments\n", "from transformers.sagemaker import SageMakerTrainer as Trainer\n", "```\n", "\n", "_**NOTE: You can run this demo in Sagemaker Studio, your local machine or Sagemaker Notebook Instances**_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Development Environment and Permissions " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Installation\n", "\n", "_*Note:* we only install the required libraries from Hugging Face and AWS. You also need PyTorch or Tensorflow, if you haven´t it installed_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install \"sagemaker>=2.48.0\" --upgrade" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Development environment " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sagemaker.huggingface" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Permissions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_If you are going to use Sagemaker in a local environment. You need access to an IAM Role with the required permissions for Sagemaker. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Development environment " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sagemaker.huggingface" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Permissions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_If you are going to use SageMaker in a local environment, you need access to an IAM Role with the required permissions for SageMaker. You can find out more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html)._" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "import sagemaker\n", "import boto3\n", "sess = sagemaker.Session()\n", "# sagemaker session bucket -> used for uploading data, models and logs\n", "# sagemaker will automatically create this bucket if it doesn't exist\n", "sagemaker_session_bucket=None\n", "if sagemaker_session_bucket is None and sess is not None:\n", "    # set to default bucket if a bucket name is not given\n", "    sagemaker_session_bucket = sess.default_bucket()\n", "\n", "try:\n", "    role = sagemaker.get_execution_role()\n", "except ValueError:\n", "    iam = boto3.client('iam')\n", "    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']\n", "\n", "sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)\n", "\n", "print(f\"sagemaker role arn: {role}\")\n", "print(f\"sagemaker bucket: {sess.default_bucket()}\")\n", "print(f\"sagemaker session region: {sess.boto_region_name}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Fine-tuning & starting Sagemaker Training Job\n", "\n", "To create a SageMaker training job we need a `HuggingFace` Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks. In the Estimator we define which fine-tuning script should be used as `entry_point`, which `instance_type` should be used, which `hyperparameters` are passed in, and so on.\n", "\n", "\n", "\n", "```python\n", "huggingface_estimator = HuggingFace(entry_point='train.py',\n", "                                    source_dir='./scripts',\n", "                                    base_job_name='huggingface-sdk-extension',\n", "                                    instance_type='ml.p3.2xlarge',\n", "                                    instance_count=1,\n", "                                    transformers_version='4.4',\n", "                                    pytorch_version='1.6',\n", "                                    py_version='py36',\n", "                                    role=role,\n", "                                    hyperparameters = {'epochs': 1,\n", "                                                       'train_batch_size': 32,\n", "                                                       'model_name':'distilbert-base-uncased'\n", "                                                       })\n", "```\n", "\n", "When we create a SageMaker training job, SageMaker takes care of starting and managing all the required EC2 instances for us with the `huggingface` container, uploads the provided fine-tuning script `train.py`, and downloads the data from our `sagemaker_session_bucket` into the container at `/opt/ml/input/data`. Then, it starts the training job by running:\n", "\n", "```bash\n", "/opt/conda/bin/python train.py --epochs 1 --model_name distilbert-base-uncased --train_batch_size 32\n", "```\n", "\n", "The `hyperparameters` you define in the `HuggingFace` estimator are passed in as named arguments.\n", "\n", "SageMaker provides useful properties about the training environment through various environment variables, including the following:\n", "\n", "* `SM_MODEL_DIR`: A string that represents the path where the training job writes the model artifacts to. After training, artifacts in this directory are uploaded to S3 for model hosting.\n", "\n", "* `SM_NUM_GPUS`: An integer representing the number of GPUs available to the host.\n", "\n", "* `SM_CHANNEL_XXXX`: A string that represents the path to the directory that contains the input data for the specified channel. For example, if you specify two input channels in the HuggingFace estimator’s fit call, named `train` and `test`, the environment variables `SM_CHANNEL_TRAIN` and `SM_CHANNEL_TEST` are set. A sketch of how a training script can consume these variables follows below.\n", "\n", "\n", "To run your training job locally you can define `instance_type='local'` or `instance_type='local_gpu'` for GPU usage. _Note: this does not work within SageMaker Studio._\n" ] },
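{ "cell_type": "markdown", "metadata": {}, "source": [ "For illustration, a training script can pick these environment variables up as defaults for its own arguments. The snippet below is a minimal sketch, not part of the actual `run_glue.py` used later; the argument names are hypothetical:\n", "\n", "```python\n", "import argparse\n", "import os\n", "\n", "parser = argparse.ArgumentParser()\n", "# hyperparameters defined in the estimator arrive as named arguments\n", "parser.add_argument(\"--epochs\", type=int, default=1)\n", "# paths injected by SageMaker via environment variables\n", "parser.add_argument(\"--model_dir\", type=str, default=os.environ[\"SM_MODEL_DIR\"])\n", "parser.add_argument(\"--train_dir\", type=str, default=os.environ.get(\"SM_CHANNEL_TRAIN\"))\n", "args, _ = parser.parse_known_args()\n", "```" ] },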
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Creating an Estimator and starting a training job\n", "\n", "In this example we are going to use `run_glue.py` from the transformers example scripts. We modified it to use the `SageMakerTrainer` instead of the `Trainer` to enable model parallelism. You can find the code [here](https://github.com/huggingface/transformers/tree/master/examples/pytorch/text-classification).\n", "\n", "```python\n", "from transformers.sagemaker import SageMakerTrainingArguments as TrainingArguments, SageMakerTrainer as Trainer\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "from sagemaker.huggingface import HuggingFace" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# hyperparameters, which are passed into the training job\n", "hyperparameters={\n", "    'model_name_or_path':'roberta-large',\n", "    'task_name': 'mnli',\n", "    'per_device_train_batch_size': 16,\n", "    'per_device_eval_batch_size': 16,\n", "    'do_train': True,\n", "    'do_eval': True,\n", "    'do_predict': True,\n", "    'num_train_epochs': 2,\n", "    'output_dir':'/opt/ml/model',\n", "    'max_steps': 500,\n", "}\n", "\n", "# configuration for running training on smdistributed Model Parallel\n", "mpi_options = {\n", "    \"enabled\" : True,\n", "    \"processes_per_host\" : 8,\n", "}\n", "smp_options = {\n", "    \"enabled\":True,\n", "    \"parameters\": {\n", "        \"microbatches\": 4,\n", "        \"placement_strategy\": \"spread\",\n", "        \"pipeline\": \"interleaved\",\n", "        \"optimize\": \"speed\",\n", "        \"partitions\": 4,\n", "        \"ddp\": True,\n", "    }\n", "}\n", "\n", "distribution={\n", "    \"smdistributed\": {\"modelparallel\": smp_options},\n", "    \"mpi\": mpi_options\n", "}\n", "\n", "# git configuration to download our fine-tuning script\n", "git_config = {'repo': 'https://github.com/huggingface/transformers.git','branch': 'v4.26.0'}\n", "\n", "# instance configurations\n", "instance_type='ml.p3.16xlarge'\n", "instance_count=1\n", "volume_size=200\n", "\n", "# metric definitions to extract the results\n", "metric_definitions=[\n", "    {'Name': 'train_runtime', 'Regex':\"train_runtime.*=\\D*(.*?)$\"},\n", "    {'Name': 'train_samples_per_second', 'Regex': \"train_samples_per_second.*=\\D*(.*?)$\"},\n", "    {'Name': 'epoch', 'Regex': \"epoch.*=\\D*(.*?)$\"},\n", "    {'Name': 'f1', 'Regex': \"f1.*=\\D*(.*?)$\"},\n", "    {'Name': 'exact_match', 'Regex': \"exact_match.*=\\D*(.*?)$\"}]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# estimator\n", "huggingface_estimator = HuggingFace(entry_point='run_glue.py',\n", "                                    source_dir='./examples/pytorch/text-classification',\n", "                                    git_config=git_config,\n", "                                    metric_definitions=metric_definitions,\n", "                                    instance_type=instance_type,\n", "                                    instance_count=instance_count,\n", "                                    volume_size=volume_size,\n", "                                    role=role,\n", "                                    transformers_version='4.26.0',\n", "                                    pytorch_version='1.13.1',\n", "                                    py_version='py39',\n", "                                    distribution=distribution,\n", "                                    hyperparameters=hyperparameters,\n", "                                    debugger_hook_config=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "huggingface_estimator.hyperparameters()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# starting the training job; run_glue.py downloads the GLUE dataset itself, so no input channels are passed\n", "huggingface_estimator.fit()" ] },
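{ "cell_type": "markdown", "metadata": {}, "source": [ "Once the job has finished, the trained model artifact (`model.tar.gz`) lives in S3. As a sketch of the \"Download fine-tuned model from s3\" step, you can fetch it with the sagemaker SDK's `S3Downloader` (assuming the training job above has completed):\n", "\n", "```python\n", "from sagemaker.s3 import S3Downloader\n", "\n", "# S3 URI of the trained model artifact\n", "print(huggingface_estimator.model_data)\n", "\n", "# download model.tar.gz into the current directory\n", "S3Downloader.download(\n", "    s3_uri=huggingface_estimator.model_data,\n", "    local_path=\".\",\n", "    sagemaker_session=sess,\n", ")\n", "```" ] },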
"huggingface_estimator.fit()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Deploying the endpoint\n", "\n", "To deploy our endpoint, we call `deploy()` on our HuggingFace estimator object, passing in our desired number of instances and instance type." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictor = huggingface_estimator.deploy(1,\"ml.g4dn.xlarge\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, we use the returned predictor object to call the endpoint." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sentiment_input= {\"inputs\":\"I love using the new Inference DLC.\"}\n", "\n", "predictor.predict(sentiment_input)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we delete the endpoint again." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictor.delete_model()\n", "predictor.delete_endpoint()" ] } ], "metadata": { "instance_type": "ml.t3.medium", "interpreter": { "hash": "c281c456f1b8161c8906f4af2c08ed2c40c50136979eaae69688b01f70e9f4a9" }, "kernelspec": { "display_name": "conda_pytorch_p39", "language": "python", "name": "conda_pytorch_p39" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.15" } }, "nbformat": 4, "nbformat_minor": 4 }