{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "2015aaa8",
"metadata": {},
"source": [
"# Getting started with AWS Trainium and Hugging Face Transformers\n",
"\n",
"*This tutorial is available in two different formats, as [web page](https://huggingface.co/docs/optimum-neuron/training_tutorials/fine_tune_bert) and [notebook version](https://github.com/huggingface/optimum-neuron/blob/main/notebooks/text-classification/fine_tune_bert.ipynb)*.\n",
"\n",
"This guide will help you to get started with [AWS Trainium](https://aws.amazon.com/machine-learning/trainium/?nc1=h_ls) and Hugging Face Transformers. It will cover how to set up a Trainium instance on AWS, load & fine-tune a transformers model for text-classification.\n",
"\n",
"You will learn how to:\n",
"\n",
"1. Setup AWS environment\n",
"2. Load and process the dataset\n",
"3. Fine-tune BERT using Hugging Face Transformers and Optimum Neuron\n",
"\n",
"Before we can start, make sure you have a [Hugging Face Account](https://huggingface.co/join) to save artifacts and experiments.\n",
"\n",
"## Quick intro: AWS Trainium\n",
"\n",
"[AWS Trainium (Trn1)](https://aws.amazon.com/de/ec2/instance-types/trn1/) is a purpose-built EC2 for deep learning (DL) training workloads. Trainium is the successor of [AWS Inferentia](https://aws.amazon.com/ec2/instance-types/inf1/?nc1=h_ls) focused on high-performance training workloads claiming up to 50% cost-to-train savings over comparable GPU-based instances.\n",
"\n",
"Trainium has been optimized for training natural language processing, computer vision, and recommender models used. The accelerator supports a wide range of data types, including FP32, TF32, BF16, FP16, UINT8, and configurable FP8.\n",
"\n",
"The biggest Trainium instance, the `trn1.32xlarge` comes with over 500GB of memory, making it easy to fine-tune ~10B parameter models on a single instance. Below you will find an overview of the available instance types. More details [here](https://aws.amazon.com/en/ec2/instance-types/trn1/#Product_details):\n",
"\n",
"| instance size | accelerators | accelerator memory | vCPU | CPU Memory | price per hour |\n",
"| --- | --- | --- | --- | --- | --- |\n",
"| trn1.2xlarge | 1 | 32 | 8 | 32 | $1.34 |\n",
"| trn1.32xlarge | 16 | 512 | 128 | 512 | $21.50 |\n",
"| trn1n.32xlarge (2x bandwidth) | 16 | 512 | 128 | 512 | $24.78 |\n",
"\n",
"---\n",
"\n",
"Now we know what Trainium offers, let's get started. 🚀\n",
"\n",
"*Note: This tutorial was created on a trn1.2xlarge AWS EC2 Instance.* \n",
"\n",
"## 1. Setup AWS environment\n",
"\n",
"In this tutorial, we will use the `trn1.2xlarge` instance on AWS with 1 Accelerator, including two Neuron Cores and the [Hugging Face Neuron Deep Learning AMI](https://aws.amazon.com/marketplace/pp/prodview-gr3e6yiscria2).\n",
"\n",
"Once the instance is up and running, we can ssh into it. But instead of developing inside a terminal we want to use a `Jupyter` environment, which we can use for preparing our dataset and launching the training. For this, we need to add a port for forwarding in the `ssh` command, which will tunnel our localhost traffic to the Trainium instance.\n",
"\n",
"```bash\n",
"PUBLIC_DNS=\"\" # IP address, e.g. ec2-3-80-....\n",
"KEY_PATH=\"\" # local path to key, e.g. ssh/trn.pem\n",
"\n",
"ssh -L 8080:localhost:8080 -i ${KEY_NAME}.pem ubuntu@$PUBLIC_DNS\n",
"```\n",
"\n",
"We need to make sure we have the `training` extra installed, to get all the necessary dependencies:\n",
"\n",
"```bash\n",
"python -m pip install .[training]\n",
"```\n",
"\n",
"We can now start our **`jupyter`** server.\n",
"\n",
"```bash\n",
"python -m notebook --allow-root --port=8080\n",
"```\n",
"\n",
"You should see a familiar **`jupyter`** output with a URL to the notebook.\n",
"\n",
"**`http://localhost:8080/?token=8c1739aff1755bd7958c4cfccc8d08cb5da5234f61f129a9`**\n",
"\n",
"We can click on it, and a **`jupyter`** environment opens in our local browser.\n",
"\n",
"\n",
"\n",
"We are going to use the Jupyter environment only for preparing the dataset and then `torchrun` for launching our training script on both neuron cores for distributed training. Lets create a new notebook and get started. \n",
"\n",
"## 2. Load and process the dataset\n",
"\n",
"We are training a Text Classification model on the [emotion](https://huggingface.co/datasets/dair-ai/emotion) dataset to keep the example straightforward. The `emotion` is a dataset of English Twitter messages with six basic emotions: anger, fear, joy, love, sadness, and surprise.\n",
"\n",
"We will use the `load_dataset()` method from the [🤗 Datasets](https://huggingface.co/docs/datasets/index) library to load the `emotion`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ace04f1e",
"metadata": {},
"outputs": [],
"source": [
"from datasets import load_dataset\n",
"\n",
"\n",
"# Dataset id from huggingface.co/dataset\n",
"dataset_id = \"dair-ai/emotion\"\n",
"\n",
"# Load raw dataset\n",
"raw_dataset = load_dataset(dataset_id)\n",
"\n",
"print(f\"Train dataset size: {len(raw_dataset['train'])}\")\n",
"print(f\"Test dataset size: {len(raw_dataset['test'])}\")\n",
"\n",
"# Train dataset size: 16000\n",
"# Test dataset size: 2000"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "f3269156",
"metadata": {},
"source": [
"Let’s check out an example of the dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b87eab11",
"metadata": {},
"outputs": [],
"source": [
"from random import randrange\n",
"\n",
"\n",
"random_id = randrange(len(raw_dataset['train']))\n",
"raw_dataset['train'][random_id]\n",
"# {'text': 'i also like to listen to jazz whilst painting it makes me feel more artistic and ambitious actually look to the rainbow', 'label': 1}"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "1dd8f764",
"metadata": {},
"source": [
"We must convert our \"Natural Language\" to token IDs to train our model. This is done by a Tokenizer, which tokenizes the inputs (including converting the tokens to their corresponding IDs in the pre-trained vocabulary). if you want to learn more about this, out [chapter 6](https://huggingface.co/course/chapter6/1?fw=pt) of the [Hugging Face Course](https://huggingface.co/course/chapter1/1).\n",
"\n",
"In order to avoid graph recompilation, inputs should have a fixed shape. We need to truncate or pad all samples to the same length."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6c7e8ffd",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"from transformers import AutoTokenizer\n",
"\n",
"\n",
"# Model id to load the tokenizer\n",
"model_id = \"bert-base-uncased\"\n",
"\n",
"# Load Tokenizer\n",
"tokenizer = AutoTokenizer.from_pretrained(model_id)\n",
"\n",
"# Tokenize helper function\n",
"def tokenize(batch):\n",
" return tokenizer(batch['text'], padding='max_length', truncation=True,return_tensors=\"pt\")\n",
"def tokenize_function(example):\n",
" return tokenizer(\n",
" example[\"text\"],\n",
" padding=\"max_length\",\n",
" truncation=True,\n",
" )\n",
"\n",
"# Tokenize dataset\n",
"tokenized_emotions = raw_dataset.map(tokenize, batched=True, remove_columns=[\"text\"])"
]
},
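{
"attachments": {},
"cell_type": "markdown",
"id": "9a1b2c3d",
"metadata": {},
"source": [
"Because we did not pass an explicit `max_length`, `padding=\"max_length\"` pads (and `truncation=True` truncates) every sample to the tokenizer's `model_max_length`, which is 512 for `bert-base-uncased`. As an optional sanity check, we can verify that all examples now share the same fixed shape:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9b2c3d4e",
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: every tokenized sample should now have the same fixed length,\n",
"# which is what allows the Neuron compiler to reuse a single compiled graph.\n",
"print({len(ids) for ids in tokenized_emotions[\"train\"][\"input_ids\"]})\n",
"# {512}"
]
},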
{
"attachments": {},
"cell_type": "markdown",
"id": "6b47509d",
"metadata": {},
"source": [
"## 3. Fine-tune BERT using Hugging Face Transformers\n",
"\n",
"We can use the **[Trainer](https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.Trainer)** and **[TrainingArguments](https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.TrainingArguments)** to fine-tune PyTorch-based transformer models.\n",
"\n",
"We prepared a simple [train.py](https://github.com/huggingface/optimum-neuron/blob/main/notebooks/text-classification/scripts/train.py) training script to perform training and evaluation on the dataset. Below is an excerpt:\n",
"\n",
"\n",
"```python\n",
"from transformers import Trainer, TrainingArguments\n",
"\n",
"def parse_args():\n",
"\t...\n",
"\n",
"def training_function(args):\n",
"\n",
" ...\n",
"\n",
" # Download the model from huggingface.co/models\n",
" model = AutoModelForSequenceClassification.from_pretrained(\n",
" args.model_id, num_labels=num_labels, label2id=label2id, id2label=id2label\n",
" )\n",
"\n",
" training_args = TrainingArguments(\n",
"\t\t\t...\n",
" )\n",
"\n",
" # Create Trainer instance\n",
" trainer = Trainer(\n",
" model=model,\n",
" args=training_args,\n",
" train_dataset=tokenized_emotions[\"train\"],\n",
" eval_dataset=tokenized_emotions[\"validation\"],\n",
" processing_class=tokenizer,\n",
" )\n",
"\n",
"\n",
" # Start training\n",
" trainer.train()\n",
"```"
]
},
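{
"attachments": {},
"cell_type": "markdown",
"id": "9c3d4e5f",
"metadata": {},
"source": [
"The elided `TrainingArguments` block is where the hyperparameters we pass on the command line end up. As a rough, illustrative sketch (the exact argument names and defaults used in `train.py` may differ), the mapping could look like this:\n",
"\n",
"```python\n",
"# Illustrative sketch only: how the CLI hyperparameters (--lr, --per_device_train_batch_size,\n",
"# --bf16, --epochs) might be forwarded into TrainingArguments. See train.py for the actual\n",
"# configuration; the output_dir name here is just a placeholder.\n",
"training_args = TrainingArguments(\n",
"    output_dir=\"bert-emotion\",  # placeholder output directory\n",
"    learning_rate=args.lr,  # e.g. 5e-5\n",
"    per_device_train_batch_size=args.per_device_train_batch_size,  # e.g. 8\n",
"    num_train_epochs=args.epochs,  # e.g. 3\n",
"    bf16=args.bf16,  # train in bfloat16 on Trainium\n",
")\n",
"```"
]
},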
{
"attachments": {},
"cell_type": "markdown",
"id": "147f5f0c",
"metadata": {},
"source": [
"We can load the training script into our environment using the `wget` command or manually copy it into the notebook from [here](https://github.com/huggingface/optimum-neuron/blob/notebooks/text-classification/scripts/train.py)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bdecdef2",
"metadata": {},
"outputs": [],
"source": [
"!wget https://raw.githubusercontent.com/huggingface/optimum-neuron/main/notebooks/text-classification/scripts/train.py"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "f3ba7064",
"metadata": {},
"source": [
"We will use `torchrun` to launch our training script on both neuron cores for distributed training, thus allowing data parallelism. `torchrun` is a tool that automatically distributes a PyTorch model across multiple accelerators. We can pass the number of accelerators as `nproc_per_node` arguments alongside our hyperparameters.\n",
"\n",
"We'll use the following command to launch training:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1dd5cc6f",
"metadata": {},
"outputs": [],
"source": [
"!torchrun --nproc_per_node=2 train.py \\\n",
" --model_id bert-base-uncased \\\n",
" --lr 5e-5 \\\n",
" --per_device_train_batch_size 8 \\\n",
" --bf16 True \\\n",
" --epochs 3"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "6701f479",
"metadata": {},
"source": [
"After compilation, it will only take few minutes to complete the training.\n",
"\n",
"```python\n",
"***** train metrics *****\n",
" epoch = 3.0\n",
" eval_loss = 0.1761\n",
" eval_runtime = 0:00:03.73\n",
" eval_samples_per_second = 267.956\n",
" eval_steps_per_second = 16.881\n",
" total_flos = 1470300GF\n",
" train_loss = 0.2024\n",
" train_runtime = 0:07:27.14\n",
" train_samples_per_second = 53.674\n",
" train_steps_per_second = 6.709\n",
"\n",
"```\n",
"\n",
"\n",
"\n",
"Last but not least, terminate the EC2 instance to avoid unnecessary charges. Looking at the price-performance, our training only costs **`20ct`** (**`1.34$/h * 0.13h = 0.18$`**)"
]
},
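{
"attachments": {},
"cell_type": "markdown",
"id": "9d4e5f6a",
"metadata": {},
"source": [
"For reference, the same back-of-the-envelope cost calculation as code, using the on-demand `trn1.2xlarge` price from the table above and roughly 7.5 minutes of training plus evaluation (your exact runtime may differ slightly):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9e5f6a7b",
"metadata": {},
"outputs": [],
"source": [
"# Back-of-the-envelope training cost on trn1.2xlarge (on-demand price from the table above)\n",
"price_per_hour = 1.34  # $/h for trn1.2xlarge\n",
"runtime_hours = 0.13  # ~7.5 min of training + evaluation, rounded up\n",
"print(f\"approx. cost: ${price_per_hour * runtime_hours:.2f}\")\n",
"# approx. cost: $0.17"
]
},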
{
"attachments": {},
"cell_type": "markdown",
"id": "28d93f05",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "aws_neuronx_venv_pytorch_2_1",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}