notebooks/text-classification/fine_tune_bert.ipynb (324 lines of code) (raw):

{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "2015aaa8", "metadata": {}, "source": [ "# Getting started with AWS Trainium and Hugging Face Transformers\n", "\n", "*This tutorial is available in two different formats, as [web page](https://huggingface.co/docs/optimum-neuron/training_tutorials/fine_tune_bert) and [notebook version](https://github.com/huggingface/optimum-neuron/blob/main/notebooks/text-classification/fine_tune_bert.ipynb)*.\n", "\n", "This guide will help you to get started with [AWS Trainium](https://aws.amazon.com/machine-learning/trainium/?nc1=h_ls) and Hugging Face Transformers. It will cover how to set up a Trainium instance on AWS, load & fine-tune a transformers model for text-classification.\n", "\n", "You will learn how to:\n", "\n", "1. Setup AWS environment\n", "2. Load and process the dataset\n", "3. Fine-tune BERT using Hugging Face Transformers and Optimum Neuron\n", "\n", "Before we can start, make sure you have a [Hugging Face Account](https://huggingface.co/join) to save artifacts and experiments.\n", "\n", "## Quick intro: AWS Trainium\n", "\n", "[AWS Trainium (Trn1)](https://aws.amazon.com/de/ec2/instance-types/trn1/) is a purpose-built EC2 for deep learning (DL) training workloads. Trainium is the successor of [AWS Inferentia](https://aws.amazon.com/ec2/instance-types/inf1/?nc1=h_ls) focused on high-performance training workloads claiming up to 50% cost-to-train savings over comparable GPU-based instances.\n", "\n", "Trainium has been optimized for training natural language processing, computer vision, and recommender models used. The accelerator supports a wide range of data types, including FP32, TF32, BF16, FP16, UINT8, and configurable FP8.\n", "\n", "The biggest Trainium instance, the `trn1.32xlarge` comes with over 500GB of memory, making it easy to fine-tune ~10B parameter models on a single instance. Below you will find an overview of the available instance types. More details [here](https://aws.amazon.com/en/ec2/instance-types/trn1/#Product_details):\n", "\n", "| instance size | accelerators | accelerator memory | vCPU | CPU Memory | price per hour |\n", "| --- | --- | --- | --- | --- | --- |\n", "| trn1.2xlarge | 1 | 32 | 8 | 32 | $1.34 |\n", "| trn1.32xlarge | 16 | 512 | 128 | 512 | $21.50 |\n", "| trn1n.32xlarge (2x bandwidth) | 16 | 512 | 128 | 512 | $24.78 |\n", "\n", "---\n", "\n", "Now we know what Trainium offers, let's get started. 🚀\n", "\n", "*Note: This tutorial was created on a trn1.2xlarge AWS EC2 Instance.* \n", "\n", "## 1. Setup AWS environment\n", "\n", "In this tutorial, we will use the `trn1.2xlarge` instance on AWS with 1 Accelerator, including two Neuron Cores and the [Hugging Face Neuron Deep Learning AMI](https://aws.amazon.com/marketplace/pp/prodview-gr3e6yiscria2).\n", "\n", "Once the instance is up and running, we can ssh into it. But instead of developing inside a terminal we want to use a `Jupyter` environment, which we can use for preparing our dataset and launching the training. For this, we need to add a port for forwarding in the `ssh` command, which will tunnel our localhost traffic to the Trainium instance.\n", "\n", "```bash\n", "PUBLIC_DNS=\"\" # IP address, e.g. ec2-3-80-....\n", "KEY_PATH=\"\" # local path to key, e.g. ssh/trn.pem\n", "\n", "ssh -L 8080:localhost:8080 -i ${KEY_NAME}.pem ubuntu@$PUBLIC_DNS\n", "```\n", "\n", "We need to make sure we have the `training` extra installed, to get all the necessary dependencies:\n", "\n", "```bash\n", "python -m pip install .[training]\n", "```\n", "\n", "We can now start our **`jupyter`** server.\n", "\n", "```bash\n", "python -m notebook --allow-root --port=8080\n", "```\n", "\n", "You should see a familiar **`jupyter`** output with a URL to the notebook.\n", "\n", "**`http://localhost:8080/?token=8c1739aff1755bd7958c4cfccc8d08cb5da5234f61f129a9`**\n", "\n", "We can click on it, and a **`jupyter`** environment opens in our local browser.\n", "\n", "![jupyter.webp](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/optimum/neuron/tutorial-fine-tune-bert-jupyter.png)\n", "\n", "We are going to use the Jupyter environment only for preparing the dataset and then `torchrun` for launching our training script on both neuron cores for distributed training. Lets create a new notebook and get started. \n", "\n", "## 2. Load and process the dataset\n", "\n", "We are training a Text Classification model on the [emotion](https://huggingface.co/datasets/dair-ai/emotion) dataset to keep the example straightforward. The `emotion` is a dataset of English Twitter messages with six basic emotions: anger, fear, joy, love, sadness, and surprise.\n", "\n", "We will use the `load_dataset()` method from the [🤗 Datasets](https://huggingface.co/docs/datasets/index) library to load the `emotion`." ] }, { "cell_type": "code", "execution_count": null, "id": "ace04f1e", "metadata": {}, "outputs": [], "source": [ "from datasets import load_dataset\n", "\n", "\n", "# Dataset id from huggingface.co/dataset\n", "dataset_id = \"dair-ai/emotion\"\n", "\n", "# Load raw dataset\n", "raw_dataset = load_dataset(dataset_id)\n", "\n", "print(f\"Train dataset size: {len(raw_dataset['train'])}\")\n", "print(f\"Test dataset size: {len(raw_dataset['test'])}\")\n", "\n", "# Train dataset size: 16000\n", "# Test dataset size: 2000" ] }, { "attachments": {}, "cell_type": "markdown", "id": "f3269156", "metadata": {}, "source": [ "Let’s check out an example of the dataset." ] }, { "cell_type": "code", "execution_count": null, "id": "b87eab11", "metadata": {}, "outputs": [], "source": [ "from random import randrange\n", "\n", "\n", "random_id = randrange(len(raw_dataset['train']))\n", "raw_dataset['train'][random_id]\n", "# {'text': 'i also like to listen to jazz whilst painting it makes me feel more artistic and ambitious actually look to the rainbow', 'label': 1}" ] }, { "attachments": {}, "cell_type": "markdown", "id": "1dd8f764", "metadata": {}, "source": [ "We must convert our \"Natural Language\" to token IDs to train our model. This is done by a Tokenizer, which tokenizes the inputs (including converting the tokens to their corresponding IDs in the pre-trained vocabulary). if you want to learn more about this, out [chapter 6](https://huggingface.co/course/chapter6/1?fw=pt) of the [Hugging Face Course](https://huggingface.co/course/chapter1/1).\n", "\n", "In order to avoid graph recompilation, inputs should have a fixed shape. We need to truncate or pad all samples to the same length." ] }, { "cell_type": "code", "execution_count": null, "id": "6c7e8ffd", "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "from transformers import AutoTokenizer\n", "\n", "\n", "# Model id to load the tokenizer\n", "model_id = \"bert-base-uncased\"\n", "\n", "# Load Tokenizer\n", "tokenizer = AutoTokenizer.from_pretrained(model_id)\n", "\n", "# Tokenize helper function\n", "def tokenize(batch):\n", " return tokenizer(batch['text'], padding='max_length', truncation=True,return_tensors=\"pt\")\n", "def tokenize_function(example):\n", " return tokenizer(\n", " example[\"text\"],\n", " padding=\"max_length\",\n", " truncation=True,\n", " )\n", "\n", "# Tokenize dataset\n", "tokenized_emotions = raw_dataset.map(tokenize, batched=True, remove_columns=[\"text\"])" ] }, { "attachments": {}, "cell_type": "markdown", "id": "6b47509d", "metadata": {}, "source": [ "## 3. Fine-tune BERT using Hugging Face Transformers\n", "\n", "We can use the **[Trainer](https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.Trainer)** and **[TrainingArguments](https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.TrainingArguments)** to fine-tune PyTorch-based transformer models.\n", "\n", "We prepared a simple [train.py](https://github.com/huggingface/optimum-neuron/blob/main/notebooks/text-classification/scripts/train.py) training script to perform training and evaluation on the dataset. Below is an excerpt:\n", "\n", "\n", "```python\n", "from transformers import Trainer, TrainingArguments\n", "\n", "def parse_args():\n", "\t...\n", "\n", "def training_function(args):\n", "\n", " ...\n", "\n", " # Download the model from huggingface.co/models\n", " model = AutoModelForSequenceClassification.from_pretrained(\n", " args.model_id, num_labels=num_labels, label2id=label2id, id2label=id2label\n", " )\n", "\n", " training_args = TrainingArguments(\n", "\t\t\t...\n", " )\n", "\n", " # Create Trainer instance\n", " trainer = Trainer(\n", " model=model,\n", " args=training_args,\n", " train_dataset=tokenized_emotions[\"train\"],\n", " eval_dataset=tokenized_emotions[\"validation\"],\n", " processing_class=tokenizer,\n", " )\n", "\n", "\n", " # Start training\n", " trainer.train()\n", "```" ] }, { "attachments": {}, "cell_type": "markdown", "id": "147f5f0c", "metadata": {}, "source": [ "We can load the training script into our environment using the `wget` command or manually copy it into the notebook from [here](https://github.com/huggingface/optimum-neuron/blob/notebooks/text-classification/scripts/train.py)." ] }, { "cell_type": "code", "execution_count": null, "id": "bdecdef2", "metadata": {}, "outputs": [], "source": [ "!wget https://raw.githubusercontent.com/huggingface/optimum-neuron/main/notebooks/text-classification/scripts/train.py" ] }, { "attachments": {}, "cell_type": "markdown", "id": "f3ba7064", "metadata": {}, "source": [ "We will use `torchrun` to launch our training script on both neuron cores for distributed training, thus allowing data parallelism. `torchrun` is a tool that automatically distributes a PyTorch model across multiple accelerators. We can pass the number of accelerators as `nproc_per_node` arguments alongside our hyperparameters.\n", "\n", "We'll use the following command to launch training:" ] }, { "cell_type": "code", "execution_count": null, "id": "1dd5cc6f", "metadata": {}, "outputs": [], "source": [ "!torchrun --nproc_per_node=2 train.py \\\n", " --model_id bert-base-uncased \\\n", " --lr 5e-5 \\\n", " --per_device_train_batch_size 8 \\\n", " --bf16 True \\\n", " --epochs 3" ] }, { "attachments": {}, "cell_type": "markdown", "id": "6701f479", "metadata": {}, "source": [ "After compilation, it will only take few minutes to complete the training.\n", "\n", "```python\n", "***** train metrics *****\n", " epoch = 3.0\n", " eval_loss = 0.1761\n", " eval_runtime = 0:00:03.73\n", " eval_samples_per_second = 267.956\n", " eval_steps_per_second = 16.881\n", " total_flos = 1470300GF\n", " train_loss = 0.2024\n", " train_runtime = 0:07:27.14\n", " train_samples_per_second = 53.674\n", " train_steps_per_second = 6.709\n", "\n", "```\n", "\n", "\n", "\n", "Last but not least, terminate the EC2 instance to avoid unnecessary charges. Looking at the price-performance, our training only costs **`20ct`** (**`1.34$/h * 0.13h = 0.18$`**)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "28d93f05", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "aws_neuronx_venv_pytorch_2_1", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 5 }