sagemaker/05_spot_instances/sagemaker-notebook.ipynb

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Spot Instances - Amazon SageMaker x Hugging Face Transformers\n", "### Learn how to use Spot Instances and Checkpointing and save up to 90% training cost" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Amazon EC2 Spot Instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances.html) are a way to take advantage of unused EC2 capacity in the AWS cloud. A Spot Instance is an instance that uses spare EC2 capacity that is available for less than the On-Demand price. The hourly price for a Spot Instance is called a Spot price. If you want to learn more about Spot Instances, you should check out the concepts of it in the [documentation](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances.html#spot-pricing). One concept we should nevertheless briefly address here is `Spot Instance interruption`. \n", "\n", "> Amazon EC2 terminates, stops, or hibernates your Spot Instance when Amazon EC2 needs the capacity back or the Spot price exceeds the maximum price for your request. Amazon EC2 provides a Spot Instance interruption notice, which gives the instance a two-minute warning before it is interrupted.\n", "\n", "[Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html) and the [Hugging Face DLCs](https://huggingface.co/docs/sagemaker/main) make it easy to train transformer models using managed Spot instances. Managed spot training can optimize the cost of training models up to 90% over on-demand instances. \n", "\n", "As we learned spot instances can be interrupted, causing jobs to potentially stop before they are finished. To prevent any loss of model weights or information Amazon SageMaker offers support for [remote S3 Checkpointing](https://docs.aws.amazon.com/sagemaker/latest/dg/model-checkpoints.html) where data from a local path to Amazon S3 is saved. When the job is restarted, SageMaker copies the data from Amazon S3 back into the local path.\n", "\n", "![imgs](imgs/overview.png)\n", "\n", "In this example, we will learn how to use [managed Spot Training](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html) and [S3 checkpointing](https://docs.aws.amazon.com/sagemaker/latest/dg/model-checkpoints.html) with Hugging Face Transformers to save up to 90% of the training costs. \n", "\n", "We are going to:\n", "\n", "- preprocess a dataset in the notebook and upload it to Amazon S3\n", "- configure checkpointing and spot training in the `HuggingFace` estimator\n", "- run training on a spot instance\n", "\n", "_**NOTE: You can run this demo in Sagemaker Studio, your local machine, or Sagemaker Notebook Instances**_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Development Environment and Permissions**\n", "\n", "*Note: we only install the required libraries from Hugging Face and AWS. You also need PyTorch or Tensorflow, if you haven´t it installed*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "!pip install \"sagemaker>=2.140.0\" \"transformers==4.26.1\" \"datasets[s3]==2.10.1\" --upgrade" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Permissions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*If you are going to use Sagemaker in a local environment (not SageMaker Studio or Notebook Instances). You need access to an IAM Role with the required permissions for Sagemaker. 
You can find [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) more about it.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "import sagemaker\n", "import boto3\n", "sess = sagemaker.Session()\n", "# sagemaker session bucket -> used for uploading data, models and logs\n", "# sagemaker will automatically create this bucket if it not exists\n", "sagemaker_session_bucket=None\n", "if sagemaker_session_bucket is None and sess is not None:\n", " # set to default bucket if a bucket name is not given\n", " sagemaker_session_bucket = sess.default_bucket()\n", "\n", "try:\n", " role = sagemaker.get_execution_role()\n", "except ValueError:\n", " iam = boto3.client('iam')\n", " role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']\n", "\n", "sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)\n", "\n", "print(f\"sagemaker role arn: {role}\")\n", "print(f\"sagemaker bucket: {sess.default_bucket()}\")\n", "print(f\"sagemaker session region: {sess.boto_region_name}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preprocessing\n", "\n", "We are using the `datasets` library to download and preprocess the `emotion` dataset. After preprocessing, the dataset will be uploaded to our `sagemaker_session_bucket` to be used within our training job. The [emotion](https://github.com/dair-ai/emotion_dataset) dataset consists of 16000 training examples, 2000 validation examples, and 2000 testing examples." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "from datasets import load_dataset\n", "from transformers import AutoTokenizer\n", "\n", "# model_id used for training and preprocessing\n", "model_id = 'distilbert-base-uncased'\n", "\n", "# dataset used\n", "dataset_name = 'emotion'\n", "\n", "# s3 key prefix for the data\n", "s3_prefix = 'samples/datasets/emotion'\n", "\n", "# download tokenizer\n", "tokenizer = AutoTokenizer.from_pretrained(model_id)\n", "\n", "# tokenizer helper function\n", "def tokenize(batch):\n", " return tokenizer(batch['text'], padding='max_length', truncation=True)\n", "\n", "# load dataset\n", "train_dataset, test_dataset = load_dataset(dataset_name, split=['train', 'test'])\n", "\n", "# tokenize dataset\n", "train_dataset = train_dataset.map(tokenize, batched=True)\n", "test_dataset = test_dataset.map(tokenize, batched=True)\n", "\n", "# set format for pytorch\n", "train_dataset = train_dataset.rename_column(\"label\", \"labels\")\n", "train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])\n", "test_dataset = test_dataset.rename_column(\"label\", \"labels\")\n", "test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After we processed the `datasets` we are going to use the new `FileSystem` [integration](https://huggingface.co/docs/datasets/filesystems.html) to upload our dataset to S3." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# save train_dataset to s3\n", "training_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/train'\n", "train_dataset.save_to_disk(training_input_path)\n", "\n", "# save test_dataset to s3\n", "test_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/test'\n", "test_dataset.save_to_disk(test_input_path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configure checkpointing and spot training in the `HuggingFace` estimator\n", "\n", "After we have uploaded we can configure our spot training and make sure we have checkpointing enabled to not lose any progress if interruptions happen. \n", "\n", "To configure spot training we need to define the `max_wait` and `max_run` in the `HuggingFace` estimator and set `use_spot_instances` to `True`. \n", "\n", "- `max_wait`: Duration in seconds until Amazon SageMaker will stop the managed spot training if not completed yet\n", "- `max_run`: Max duration in seconds for training the training job\n", "\n", "`max_wait` also needs to be greater than `max_run`, because `max_wait` is the duration for waiting/accessing spot instances (can take time when no spot capacity is free) + the expected duration of the training job. \n", "\n", "**Example**\n", "\n", "If you expect your training to take 3600 seconds (1 hour) you can set `max_run` to `4000` seconds (buffer) and `max_wait` to `7200` to include a `3200` seconds waiting time for your spot capacity." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# enables spot training\n", "use_spot_instances=True\n", "# max time including spot start + training time\n", "max_wait=7200\n", "# expected training time\n", "max_run=4000" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To enable checkpointing we need to define `checkpoint_s3_uri` in the `HuggingFace` estimator. `checkpoint_s3_uri` is a S3 URI in which to save the checkpoints. By default Amazon SageMaker will save now any file, which is written to `/opt/ml/checkpoints` in the training job to `checkpoint_s3_uri`. \n", "\n", "*It is possible to adjust `/opt/ml/checkpoints` by overwriting `checkpoint_local_path` in the `HuggingFace` estimator*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# s3 uri where our checkpoints will be uploaded during training\n", "base_job_name = \"emotion-checkpointing\"\n", "\n", "checkpoint_s3_uri = f's3://{sess.default_bucket()}/{base_job_name}/checkpoints'\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next step is to create our `HuggingFace` estimator, provide our `hyperparameters` and add our spot and checkpointing configurations." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "from sagemaker.huggingface import HuggingFace\n", "\n", "# hyperparameters, which are passed into the training job\n", "hyperparameters={\n", " 'epochs': 1, # number of training epochs\n", " 'train_batch_size': 32, # batch size for training\n", " 'eval_batch_size': 64, # batch size for evaluation\n", " 'learning_rate': 3e-5, # learning rate used during training\n", " 'model_id':model_id, # pre-trained model id \n", " 'fp16': True, # Whether to use 16-bit (mixed) precision training\n", " 'output_dir':'/opt/ml/checkpoints' # make sure files are saved to the checkpoint directory\n", "}\n", "\n", "# create the Estimator\n", "huggingface_estimator = HuggingFace(\n", " entry_point = 'train.py', # fine-tuning script used in training jon\n", " source_dir = './scripts', # directory where fine-tuning script is stored\n", " instance_type = 'ml.p3.2xlarge', # instances type used for the training job\n", " instance_count = 1, # the number of instances used for training\n", " base_job_name = base_job_name, # the name of the training job\n", " role = role, # Iam role used in training job to access AWS ressources, e.g. S3\n", " transformers_version = '4.26.0', # the transformers version used in the training job\n", " pytorch_version = '1.13.1', # the pytorch_version version used in the training job\n", " py_version = 'py39', # the python version used in the training job\n", " hyperparameters = hyperparameters, # the hyperparameter used for running the training job\n", " use_spot_instances = use_spot_instances,# wether to use spot instances or not\n", " max_wait = max_wait, # max time including spot start + training time\n", " max_run = max_run, # max expected training time\n", " checkpoint_s3_uri = checkpoint_s3_uri, # s3 uri where our checkpoints will be uploaded during training\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When using remote S3 checkpointing you have to make sure that your `train.py` also supports checkpointing. `Transformers` and the `Trainer` offers utilities on how to do this. You only need to add the following snippet to your `Trainer` training script\n", "\n", "```python\n", "from transformers.trainer_utils import get_last_checkpoint\n", "\n", "# check if checkpoint existing if so continue training\n", "if get_last_checkpoint(args.output_dir) is not None:\n", " logger.info(\"***** continue training *****\")\n", " last_checkpoint = get_last_checkpoint(args.output_dir)\n", " trainer.train(resume_from_checkpoint=last_checkpoint)\n", "else:\n", " trainer.train()\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run training on a spot instance\n", "\n", "The last step of this example is to start our managed Spot Training. Therefore we simple call the `.fit` method of our estimator and provide our dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# define train data object\n", "data = {\n", " 'train': training_input_path,\n", " 'test': test_input_path\n", "}\n", "\n", "# starting the train job with our uploaded datasets as input\n", "huggingface_estimator.fit(data)\n", "\n", "# Training seconds: 874\n", "# Billable seconds: 262\n", "# Managed Spot Training savings: 70.0%" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After the training is successful run you should see your spot savings in the logs. 
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] } ], "metadata": { "instance_type": "ml.t3.medium", "interpreter": { "hash": "c281c456f1b8161c8906f4af2c08ed2c40c50136979eaae69688b01f70e9f4a9" }, "kernelspec": { "display_name": "conda_pytorch_p39", "language": "python", "name": "conda_pytorch_p39" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.15" } }, "nbformat": 4, "nbformat_minor": 4 }