# Introduction to 🤗 Optimum Graphcore: BERT-Large Fine-tuning on IPU

<p align="center">
    <img src="https://github.com/huggingface/optimum-graphcore/blob/main/readme_logo.png?raw=true" />
</p>

## 🤗 Optimum Graphcore

🤗 Optimum Graphcore is the interface between the [🤗 Transformers library](https://huggingface.co/docs/transformers/index) and [Graphcore IPUs](https://www.graphcore.ai/products/ipu).
It provides a set of tools for loading and parallelizing models on IPUs, and for training and fine-tuning on all the tasks already supported by 🤗 Transformers, while remaining compatible with the 🤗 Hub and every model available on it out of the box.

🤗 Optimum Graphcore was designed with one goal in mind: make training and evaluation straightforward for any 🤗 Transformers user while leveraging the complete power of IPUs.


## What is an Intelligence Processing Unit (IPU)?
Quote from the Hugging Face [blog post](https://huggingface.co/blog/graphcore#what-is-an-intelligence-processing-unit):
> IPUs are the processors that power Graphcore’s IPU-POD data center compute systems. This new type of processor is designed to support the very specific computational requirements of AI and machine learning. Characteristics such as fine-grained parallelism, low-precision arithmetic, and the ability to handle sparsity have been built into the silicon.

> Instead of adopting a SIMD/SIMT architecture like GPUs, Graphcore’s IPU uses a massively parallel, MIMD architecture, with ultra-high bandwidth memory placed adjacent to the processor cores, right on the silicon die.

> This design delivers high performance and new levels of efficiency, whether running today’s most popular models, such as BERT and EfficientNet, or exploring next-generation AI applications.

## About this notebook 

This notebook will demonstrate how to fine-tune a pre-trained BERT model with PyTorch on the Graphcore IPU-POD4 system using Optimum Graphcore. We will use a BERT-Large model and fine-tune on the SQuADv1 Question/Answering task.

We will show how to take a BERT model written in PyTorch from the Hugging Face Transformers library and run it on Graphcore IPUs using Optimum Graphcore.

| Domain | Tasks | Model | Datasets | Workflow | Number of IPUs | Execution time |
|--------|-------|-------|----------|----------|----------------|----------------|
| Natural language processing | Question answering | bert-large-uncased | SQuADv1 | Fine-tuning | | |

[![Join our Slack Community](https://img.shields.io/badge/Slack-Join%20Graphcore's%20Community-blue?style=flat-square&logo=slack)](https://www.graphcore.ai/join-community)

## Background


### BERT

BERT fine-tuning is when you train a BERT model on a supervised learning task on a relatively small amount of data, by using starting weights obtained from pre-training on a large, generic text corpus. Pre-training of BERT requires a lot of unlabelled data (for instance all of Wikipedia + thousands of books) and a lot of compute. It is expensive and time-consuming, but after pre-training, BERT will have learned an extremely good language model that can be fine-tuned on downstream tasks with a small amount of labelled data, achieving great results.


![bert.png](images/bert.png)


In this notebook, we will fine-tune BERT (pre-trained on IPUs with the Wikipedia dataset) on a question answering task called SQuAD. Then we will perform inference on the accompanying validation dataset.

### What is SQuAD?

The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

From https://rajpurkar.github.io/SQuAD-explorer/

In other words, the model takes a question and a passage of text and predicts the start and end positions of the answer within the passage. The image below shows an example from the dataset:

(Source: [Rajpurkar GitHub](https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/Normans.html))

For the case of SQuADv1, there are no unanswerable questions in the dataset.
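To make the span format concrete, here is a minimal sketch using a made-up record shaped like a SQuAD row (the text is invented for illustration and is not taken from the dataset). It assumes, as in SQuAD, that the answer is stored as its text plus a character offset into the context.

```python
# Hypothetical record shaped like a SQuAD row; the text is invented for illustration.
context = "The IPU was designed by Graphcore for machine learning workloads."
record = {
    "context": context,
    "question": "Who designed the IPU?",
    "answers": {"text": ["Graphcore"], "answer_start": [context.index("Graphcore")]},
}


def answer_char_span(row):
    """Return (start, end) character offsets of the first answer in the context."""
    start = row["answers"]["answer_start"][0]
    end = start + len(row["answers"]["text"][0])
    return start, end


start, end = answer_char_span(record)
# Slicing the context with the span recovers the answer text.
print(record["context"][start:end])
```

The model itself predicts token positions rather than character offsets, so the preprocessing step has to map between the two.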

## Environment setup

The best way to run this demo is on Paperspace Gradient's cloud IPUs because everything is already set up for you.

[![Run on Gradient](https://assets.paperspace.io/img/gradient-badge.svg)](https://ipu.dev/3CExwVy)

To run the demo using other IPU hardware, you need to have the Poplar SDK enabled. Refer to the [Getting Started guide](https://docs.graphcore.ai/en/latest/getting-started.html#getting-started) for your system for details on how to enable the Poplar SDK. Also refer to the [Jupyter Quick Start guide](https://docs.graphcore.ai/projects/jupyter-notebook-quick-start/en/latest/index.html) for how to set up Jupyter to be able to run this notebook on a remote IPU machine.

## Dependencies and configuration

In order to improve usability and support for future users, Graphcore would like to collect information about the
applications and code being run in this notebook. The following information will be anonymised before being sent to Graphcore:

- User progression through the notebook
- Notebook details: number of cells, code being run and the output of the cells
- Environment details

You can disable logging at any time by running `%unload_ext graphcore_cloud_tools.notebook_logging.gc_logger` from any cell.

Install the dependencies for this notebook.

In [None]:
%pip install "optimum-graphcore==0.7"
%pip install graphcore-cloud-tools[logger]@git+https://github.com/graphcore/graphcore-cloud-tools
%load_ext graphcore_cloud_tools.notebook_logging.gc_logger

Values for machine size and cache directories can be configured through environment variables or directly in the notebook:

In [None]:
import os

n_ipu = int(os.getenv("NUM_AVAILABLE_IPU", 4))
executable_cache_dir = os.path.join(os.getenv("POPLAR_EXECUTABLE_CACHE_DIR", "/tmp/exe_cache/"), "introduction_to_optimum_graphcore")

In [None]:
# Import standard packages
import transformers
import torch
import torch.nn as nn
import numpy as np
from tqdm.notebook import trange, tqdm
from datasets import load_dataset, load_metric
import time
from pathlib import Path

# To run on IPUs we import the Optimum Graphcore classes
from optimum.graphcore import IPUConfig, IPUTrainer, IPUTrainingArguments

In [None]:
import warnings
warnings.filterwarnings("ignore")

## Get the data


We use the 🤗 `datasets` package to automatically download the SQuAD dataset:

In [None]:
datasets = load_dataset("squad")

The SQuAD dataset consists of pre-defined training and validation splits.

In [None]:
datasets

Each row in the data consists of a passage of text (`context`), a question about the passage (`question`) and the answer(s) to the question (`answers`). Each answer is given as its text in the passage and its start position (a character offset) in the passage.

Here is an example row:

In [None]:
datasets["train"][10016]

**How do we preprocess this data to train it with a deep learning model?**

We need to `tokenize` the text to turn it from words into numbers. This is done using `transformers.BertTokenizer`. Let's use this to tokenize a shortened version of the example above:

In [None]:
from squad_preprocessing import tokenizer

In [None]:
example = {"context": "Institutes of technology in Venezuela were developed in the 1950s",
           "question": "When were Institutes of technology developed?"}
tokenized_example = tokenizer(
        example["question"],
        example["context"],
        truncation="only_second",
        max_length=32,
        stride=16,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

In [None]:
tokenized_example.keys()

Let's look at the `input_ids`:

In [None]:
tokenized_example.input_ids[0]

In [None]:
tokenizer.decode(tokenized_example.input_ids[0])

As you can see in the decoded version, the question is placed at the start followed by a `[SEP]` token, then the context, followed by padding if required.
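The layout described above can be sketched in plain Python. This is an illustration only: whitespace splitting stands in for the real WordPiece tokenizer, and `build_pair_input` is a hypothetical helper, not part of `transformers` or `squad_preprocessing`.

```python
def build_pair_input(question_tokens, context_tokens, max_length):
    """Lay out a question/context pair as [CLS] question [SEP] context [SEP] plus padding."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    if len(tokens) > max_length:
        # A real tokenizer would truncate or window the context instead.
        raise ValueError("pair is longer than max_length")
    # Segment ids: 0 for [CLS] + question + first [SEP], 1 for context + final [SEP].
    token_type_ids = [0] * (len(question_tokens) + 2) + [1] * (len(context_tokens) + 1)
    attention_mask = [1] * len(tokens)
    pad = max_length - len(tokens)
    return (
        tokens + ["[PAD]"] * pad,
        token_type_ids + [0] * pad,
        attention_mask + [0] * pad,  # padding positions are masked out
    )


# Whitespace splitting stands in for WordPiece tokenization here.
question = "when were institutes developed ?".split()
context = "institutes were developed in the 1950s".split()
tokens, token_type_ids, attention_mask = build_pair_input(question, context, max_length=16)
print(tokens)
```

The three returned lists play the roles of the tokenizer's `input_ids` (here kept as strings for readability), `token_type_ids` and `attention_mask`.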

In [None]:
from squad_preprocessing import prepare_train_features, prepare_validation_features, tokenizer

In [None]:
train_dataset = datasets["train"].map(
    prepare_train_features,
    batched=True,
    num_proc=1,
    remove_columns=datasets["train"].column_names,
    load_from_cache_file=True,
)

# Create validation features from dataset
validation_features = datasets["validation"].map(
    prepare_validation_features,
    batched=True,
    num_proc=1,
    remove_columns=datasets["validation"].column_names,
    load_from_cache_file=True,
)

## Get the BERT model from `transformers`

Create the model on the host. We can use `from_pretrained` to load pre-trained checkpoints from the Hugging Face Hub.

In [None]:
model = transformers.BertForQuestionAnswering.from_pretrained("Graphcore/bert-large-uncased")

**Now we are ready to use Optimum!**

We can now set up our pipelined execution by specifying which layers to put on each IPU. In this notebook we do not call a `parallelize` method ourselves: the layer placement is described by the `IPUConfig` and applied by the `IPUTrainer`, both introduced in the sections below.

For reference, the `.half()` method casts all the model weights to half precision (FP16), and `.train()` sets the PyTorch model to training mode.

If you are unfamiliar with training in half precision on IPUs, then our tutorial on [Half and Mixed Precision in PopTorch](https://github.com/graphcore/examples/tree/master/tutorials/pytorch/mixed_precision) can serve as a quick introduction.

## How `optimum-graphcore` runs models on IPUs

`optimum-graphcore` will run the model on IPUs using both **pipelining** and **data parallelism** in order to maximise hardware use.

### Parallelism through pipelining

The model layers are split over 4 IPUs. We then use [*pipeline parallelism*](https://docs.graphcore.ai/projects/tf-model-parallelism/en/latest/pipelining.html) over the IPUs with gradient accumulation. We subdivide the compute batch into micro-batches that pass through the pipeline in the forward pass and then come back again in the backwards pass, accumulating gradients for the parameters as they go.

A complete pipeline step has a ramp-up phase at the start and a ramp-down phase at the end. Increasing the gradient accumulation factor increases the total batch size and also increases the pipeline efficiency, and therefore throughput, because the proportion of time in ramp-up/down phases will be reduced. 

![pipelining.png](images/pipelining.png)
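The effect of the gradient accumulation factor on pipeline efficiency can be estimated with a simplified model. The sketch below assumes an idealised schedule in which every stage takes one time slot per micro-batch, so a pass over `num_micro_batches` micro-batches through `num_stages` stages takes `num_micro_batches + num_stages - 1` slots. This is a textbook approximation, not a description of the exact PopTorch schedule, and `pipeline_utilisation` is a hypothetical helper.

```python
def pipeline_utilisation(num_stages, num_micro_batches):
    """Fraction of time slots in which a stage does useful work (idealised model)."""
    useful_slots = num_micro_batches
    total_slots = num_micro_batches + num_stages - 1  # includes ramp-up and ramp-down
    return useful_slots / total_slots


# A 4-stage pipeline with increasing gradient accumulation factors.
for micro_batches in (4, 16, 64, 256):
    print(micro_batches, round(pipeline_utilisation(4, micro_batches), 3))
```

Under this model the utilisation approaches 1 as the number of micro-batches grows, which is why a large gradient accumulation factor helps throughput.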

### Partitioning the model

BERT-Large has 24 transformer layers, which we will split over our 4 IPUs. The position and word embeddings and the first three encoder layers sit on IPU0; each of the following three IPUs holds seven transformer layers. This partition is specified in `IPUConfig` with the `layers_per_ipu` parameter.

![bert-pipelining.png](images/bert-pipelining.png)
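As a quick check of the partition described in this section, the sketch below expands a `layers_per_ipu` list into a layer-to-IPU mapping. The values `[3, 7, 7, 7]` come from the text above; `layer_to_ipu` is a hypothetical helper written for illustration.

```python
layers_per_ipu = [3, 7, 7, 7]  # partition described in this section


def layer_to_ipu(partition):
    """Return a list where entry i is the IPU that holds transformer layer i."""
    mapping = []
    for ipu, count in enumerate(partition):
        mapping.extend([ipu] * count)
    return mapping


mapping = layer_to_ipu(layers_per_ipu)
print(len(mapping))           # total number of transformer layers covered
print(mapping[2], mapping[3])  # last layer on IPU0, first layer on IPU1
```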


### Data parallelism

An IPU-POD4 contains 4 IPUs and our pipeline is 4 IPUs long, therefore we cannot replicate the pipeline. If we were running on an IPU-POD16, then we could utilise replication by feeding four different micro-batches to the device, which quadruples the effective mini-batch size. We call this configuration a "4x4 pipeline".
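The arithmetic behind replication can be sketched as follows. The helper names are hypothetical; the global batch size of 256 and micro-batch size of 1 match the values used later in this notebook.

```python
def replication_factor(total_ipus, ipus_per_replica):
    """Number of copies of the pipeline that fit on the available IPUs."""
    if total_ipus % ipus_per_replica:
        raise ValueError("total_ipus must be a multiple of ipus_per_replica")
    return total_ipus // ipus_per_replica


def gradient_accumulation_steps(global_batch_size, micro_batch_size, replicas):
    """Accumulation steps needed to keep the global batch size fixed."""
    return global_batch_size // (micro_batch_size * replicas)


for pod_size in (4, 16):
    replicas = replication_factor(pod_size, ipus_per_replica=4)
    steps = gradient_accumulation_steps(256, micro_batch_size=1, replicas=replicas)
    print(f"IPU-POD{pod_size}: {replicas} replica(s), {steps} accumulation steps")
```

With more replicas, each replica needs fewer accumulation steps to reach the same global batch size.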


### Recomputation checkpoints

We can make more efficient use of the valuable In-Processor-Memory by saving only selected activation inputs and recomputing the rest. This lets us trade memory savings (by not storing all activations) against extra compute (by recomputing only the activations that were not stored).

<img src="images/recomputation.png" width="800" />

Source: [TensorFlow Model Parallelism: Recomputation](https://docs.graphcore.ai/projects/tf-model-parallelism/en/latest/pipelining.html#recomputation)

Checkpoints are automatically placed between each pipeline stage. In addition to these automatic checkpoints, we are adding one at the end of every transformer layer, which leads to better performance.
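A toy count gives a feel for the memory saving. The sketch assumes every transformer layer produces the same number of activation tensors and that, with recomputation, only the layer outputs are kept as checkpoints while the intermediates of one layer at a time are recomputed during the backward pass. The figure of 10 tensors per layer is invented for illustration.

```python
def stored_activations(num_layers, tensors_per_layer, recompute):
    """Count activation tensors held for the backward pass in a toy model."""
    if not recompute:
        # Every intermediate tensor of every layer is kept.
        return num_layers * tensors_per_layer
    # One checkpoint (the layer output) per layer, plus the intermediates
    # of the single layer currently being recomputed.
    return num_layers + (tensors_per_layer - 1)


print(stored_activations(24, 10, recompute=False))
print(stored_activations(24, 10, recompute=True))
```

The price of the smaller footprint is an extra forward computation per layer during the backward pass.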

### Replicated tensor sharding of optimizer state

If we were using multiple replicas, we could also distribute our optimizer state across them to reduce local memory usage, a method called [on-chip replicated tensor sharding](https://docs.graphcore.ai/projects/graphcore-glossary/en/latest/index.html#term-Replicated-tensor-sharding). To use this feature, you must be on an IPU-POD16 system.

> To further improve memory availability we also have the option to store tensors in the IPU-POD16 Streaming Memory at the cost of increased communications.

![rts.png](images/rts.png)
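The idea of sharding a tensor across replicas can be sketched with plain lists. This is a conceptual illustration only: `shard` and `gather` are hypothetical helpers, and the real mechanism operates on device tensors with collective communication between IPUs.

```python
def shard(values, num_replicas):
    """Split a flat list into num_replicas contiguous shards of near-equal size."""
    size = -(-len(values) // num_replicas)  # ceiling division
    return [values[i * size:(i + 1) * size] for i in range(num_replicas)]


def gather(shards):
    """Reassemble the full list from the per-replica shards."""
    return [value for part in shards for value in part]


optimizer_state = list(range(10))  # stand-in for a flattened optimizer state tensor
shards = shard(optimizer_state, num_replicas=4)
print([len(part) for part in shards])  # elements held by each replica
```

Each replica stores only its shard, and the full tensor is reassembled when it is needed.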

## Running with Optimum Graphcore

To use `optimum-graphcore`, there are three main classes you need to know about:
- `IPUTrainer`: the trainer class that takes care of compiling the model to run on IPUs. It also takes care of performing training and evaluation.
- `IPUTrainingArguments`: the parameters for how the model will be trained by the trainer.
- `IPUConfig`: the class that specifies attributes and configuration parameters to compile and put the model on the device.

The `IPUTrainer` class is very similar to the [🤗 Transformers Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) class, and adapting a script that currently uses the `Trainer` class to make it work with IPUs will mostly consist of simply swapping the `Trainer` class with the `IPUTrainer` class.

The `IPUTrainingArguments` class is also very similar to the [🤗 Transformers TrainingArguments](https://huggingface.co/docs/transformers/v4.20.1/en/main_classes/trainer#transformers.TrainingArguments) class with a few extra arguments for IPUs.

In [None]:
ipu_config = IPUConfig.from_pretrained("Graphcore/bert-large-ipu",
                                       executable_cache_dir = executable_cache_dir)

In [None]:
ipu_config

`device_iterations` is the number of batches the device should run before returning to the user. Increasing `device_iterations` can be more efficient because the loop runs on the IPU directly, reducing overhead costs. Please see the [PopTorch documentation](https://docs.graphcore.ai/projects/poptorch-user-guide/en/latest/batching.html?highlight=device%20iterations#poptorch-options-deviceiterations) for more information on this parameter.
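To see why a larger `device_iterations` reduces host overhead, the sketch below counts how many host-to-device calls are needed to consume a dataset. It assumes the PopTorch batching rule that one call consumes micro-batch size × gradient accumulation × replication factor × device iterations samples. The helper names and the dataset size of 10,000 samples are made up for illustration.

```python
import math


def samples_per_host_call(micro_batch_size, gradient_accumulation,
                          replication_factor, device_iterations):
    """Samples consumed by a single call from the host to the IPUs."""
    return (micro_batch_size * gradient_accumulation
            * replication_factor * device_iterations)


def host_calls_per_epoch(num_samples, **batching):
    """Number of host calls needed to consume num_samples once."""
    return math.ceil(num_samples / samples_per_host_call(**batching))


batching = dict(micro_batch_size=1, gradient_accumulation=256, replication_factor=1)
print(host_calls_per_epoch(10_000, device_iterations=1, **batching))
print(host_calls_per_epoch(10_000, device_iterations=4, **batching))
```

Fewer host calls means the training loop spends more of its time running on the IPU.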

In [None]:
global_batch_size = 256
per_device_train_batch_size = 1
per_device_eval_batch_size = 2
replication_factor = 1
gradient_accumulation = global_batch_size // (per_device_train_batch_size * replication_factor)

In [None]:
training_args = IPUTrainingArguments(output_dir="/tmp/outputs",
                                     do_train=True,
                                     do_eval=True,
                                     per_device_train_batch_size=per_device_train_batch_size,
                                     per_device_eval_batch_size=per_device_eval_batch_size,
                                     gradient_accumulation_steps=gradient_accumulation,
                                     learning_rate=2e-4,
                                     num_train_epochs=2,
                                     logging_steps=25,
                                     dataloader_num_workers=32,
                                     resume_from_checkpoint=True,
                                     pad_on_batch_axis=True,
                                     n_ipu=n_ipu,
                                     save_strategy="epoch",
                                     report_to="none",
                                    )

## Training loop

In [None]:
from squad_preprocessing import PadCollate

Now we create the `IPUTrainer` from `optimum-graphcore` to train our model on the IPU:

In [None]:
trainer = IPUTrainer(model=model,
                     ipu_config=ipu_config,
                     args=training_args, 
                     train_dataset=train_dataset,
                     eval_dataset=validation_features,
                    )

In [None]:
trainer.train(resume_from_checkpoint=False)

After training, we save the model weights to disk.

In [None]:
trainer.save_model()

## Validation

We will now take the model we just trained on the training data and run validation on the SQuAD validation dataset. The model will run on a 2-IPU pipeline that we will replicate eight times.

We loop over all the validation data examples and get the `raw_predictions` for the start and end positions of where the answer to the question lies in the text passage for each one.

In [None]:
eval_output = trainer.predict(validation_features)

In [None]:
from datasets import load_metric
from squad_preprocessing import postprocess_qa_predictions

In [None]:
raw_predictions = []
raw_predictions.append(eval_output.predictions[0].astype(float))
raw_predictions.append(eval_output.predictions[1].astype(float))

In [None]:
raw_predictions[0].shape

In [None]:
validation_features

We now post-process the raw predictions to get, for each question, the best prediction that is valid.

In [None]:
final_predictions = postprocess_qa_predictions(datasets["validation"],
                                               validation_features,
                                               raw_predictions)
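The core of this kind of post-processing is choosing the highest-scoring span whose start does not come after its end. The sketch below shows one common way to do this. It is not the implementation of `postprocess_qa_predictions`, which is not shown in this notebook, and the parameters `n_best` and `max_answer_length` are assumptions.

```python
def best_span(start_logits, end_logits, n_best=20, max_answer_length=30):
    """Return (score, start, end) for the best span with start <= end, or None."""
    def top_indices(logits):
        return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:n_best]

    best = None
    for start in top_indices(start_logits):
        for end in top_indices(end_logits):
            # Skip spans that end before they start or that are too long.
            if end < start or end - start + 1 > max_answer_length:
                continue
            score = start_logits[start] + end_logits[end]
            if best is None or score > best[0]:
                best = (score, start, end)
    return best


# The independent argmax values (start=2, end=1) would form an invalid span.
start_logits = [0.1, 0.2, 5.0, 0.3, 0.1]
end_logits = [0.0, 6.0, 0.2, 4.0, 0.1]
print(best_span(start_logits, end_logits))
```

Taking the argmax of the start and end logits independently can produce an end position before the start, which is why the candidates are scored jointly.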

In [None]:
metric = load_metric("squad")
formatted_predictions = [{"id": k, "prediction_text": v}
                         for k, v in final_predictions.items()]
references = [{"id": ex["id"], "answers": ex["answers"]}
              for ex in datasets["validation"]]
metrics = metric.compute(predictions=formatted_predictions, references=references)
print(metrics)

We obtained a good validation score for SQuADv1.

| BERT-Large                             | Exact Match | F1 Score |
|----------------------------------------|:-----------:|:--------:|
| Reference (Devlin et al. 2018)         | 84.1        | 90.9     |
| IPU-POD16 with IPU pre-trained weights | 84.5        | 91.0     |
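For intuition about the two metrics in the table, here is a simplified sketch of exact match and token-level F1 for a single prediction. The official SQuAD metric also normalises punctuation and articles and takes the maximum over all reference answers; this version only lowercases and splits on whitespace, so it will not reproduce the official scores exactly.

```python
from collections import Counter


def exact_match(prediction, reference):
    """1.0 if the strings match after lowercasing and trimming, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between two strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("in the 1950s", "the 1950s"))
print(round(token_f1("in the 1950s", "the 1950s"), 3))
```

F1 gives partial credit when the predicted span overlaps the reference, which is why it is always at least as high as exact match.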

## Inference

We can now use our fine-tuned model to answer questions. Let's start by defining a task:

In [None]:
# Define task
question = "What speed-up can one expect from using sequence packing for training BERT on IPU?"
answer_text = ("We find that at sequence length 512 padding tokens represent in excess of 50% of the Wikipedia "
               "dataset used for pretraining BERT (Bidirectional Encoder Representations from Transformers). "
               "Therefore by removing all padding we achieve a 2x speed-up in terms of sequences/sec. "
               "To exploit this characteristic of the dataset, "
               "we develop and contrast two deterministic packing algorithms.")

Let's get the model inputs ready and create our model. We'll import the weights from the pre-trained, fine-tuned BERT model from the previous sections:

In [None]:
# Apply the tokenizer to the input text, treating them as a text-pair.
input_encoding = tokenizer.encode_plus(question, answer_text)

# Extract inputs, add batch dimension
input_tensor = torch.tensor(input_encoding["input_ids"]).unsqueeze(0)
attention_tensor = torch.tensor(input_encoding["attention_mask"]).unsqueeze(0)
token_types = torch.tensor(input_encoding["token_type_ids"]).unsqueeze(0)

# Get model and load the fine-tuned weights
model = transformers.BertForQuestionAnswering.from_pretrained("/tmp/outputs")

Optionally, instead of using the fine-tuned weights we saved in the previous section, you can download fine-tuned weights from the [Graphcore organisation on the Hugging Face Model Hub](https://huggingface.co/Graphcore). 

In [None]:
# model = transformers.BertForQuestionAnswering.from_pretrained("Graphcore/bert-large-uncased-squad11")

We can now solve the task and print the answer to the question:

In [None]:
# Solve task
outputs = model(input_tensor, attention_tensor, token_types)

# Extract answer
answer_start, answer_stop = outputs.start_logits.argmax(), outputs.end_logits.argmax()
answer_ids = input_tensor.squeeze()[answer_start:answer_stop + 1]
answer_tokens = tokenizer.convert_ids_to_tokens(answer_ids, skip_special_tokens=True)
answer = tokenizer.convert_tokens_to_string(answer_tokens)

# Print results
print(f"Question: {question}")
print(f"Answer: {answer}")

## Sharing your model with the Hugging Face community

We can share our model on the 🤗 Model Hub and leverage the 🤗 inference API for downstream tasks.

In [None]:
# Make sure you have git-lfs and huggingface-hub
!apt-get update && apt-get upgrade -y && apt-get install -y git git-lfs
# !pip install huggingface-hub

You can share your model with the 🤗 community. You do this by completing the following steps:

1. Store your authentication token from the 🤗 website. [Sign up to 🤗](https://huggingface.co/join) if you haven't already.
2. Execute the following cell and input your username and authentication token.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

Now you can upload your model to the Hugging Face Hub. Uncomment the code in the cell below, and specify an identifier made up of your 🤗 username and a name for your model:

In [None]:
# Upload the checkpoint to Hugging Face Model Hub.

# model.push_to_hub("<hf-username>/<name-of-model>")
# tokenizer.push_to_hub("<hf-username>/<name-of-model>")

Once the model is uploaded, other users can load it with the identifier `<hf-username>/<name-of-model>`, for instance:

```python
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained("<hf-username>/<name-of-model>")
```

## Next steps

Try out the other [IPU-powered Jupyter Notebooks](https://www.graphcore.ai/ipu-jupyter-notebooks) to see how IPUs perform on other tasks.