# Language Modelling on IPUs - Fine-tuning 

In this notebook, we'll see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) models on language modelling tasks. We will cover two types of language modelling task:

- Causal language modelling: The model has to predict the next token in the sentence (so the label at each position is the token that follows it in the input). To make sure the model does not cheat, it gets an attention mask that prevents it from accessing the tokens after token `i` when it predicts token `i+1` in the sentence.

![Widget inference representing the causal language modelling task](images/causal_language_modeling.png)

- Masked language modelling: The model has to predict some tokens that are masked in the input. It still has access to the whole sentence, so it can use the tokens before and after the masked tokens to predict their value.

![Widget inference representing the masked language modelling task](images/masked_language_modeling.png)
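
As a minimal sketch of the causal attention mask described above, the plain-Python snippet below builds a lower-triangular mask for a toy sentence. The function name `causal_mask` and the example tokens are illustrative assumptions, not part of the notebook or of any library; real models build an equivalent mask internally as a tensor.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len matrix where mask[i][j] is 1 if position i may attend to position j."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


# Toy sentence, already split into tokens (illustrative only).
tokens = ["The", "cat", "sat", "down"]
mask = causal_mask(len(tokens))

for token, row in zip(tokens, mask):
    # List the tokens this position is allowed to look at.
    visible = [t for t, allowed in zip(tokens, row) if allowed]
    print(f"{token!r} can see {visible}")
```

For masked language modelling the equivalent matrix would contain only ones, because every position may attend to the whole sentence.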

We will see how to easily load and preprocess the dataset for each of those tasks, and how to use the `IPUTrainer` API to fine-tune a model on it.

| Domain | Tasks | Model | Datasets | Workflow | Number of IPUs | Execution time |
|--------|-------|-------|----------|----------|----------------|----------------|
| Natural language processing | Text generation | Multiple | Wikitext 2 | Fine-tuning | 4 | ~30 min |

[![Join our Slack Community](https://img.shields.io/badge/Slack-Join%20Graphcore's%20Community-blue?style=flat-square&logo=slack)](https://www.graphcore.ai/join-community)

## Environment setup

The best way to run this demo is on Paperspace Gradient's cloud IPUs because everything is already set up for you.

To run the demo using other IPU hardware, you need to have the Poplar SDK enabled. Refer to the [Getting Started guide](https://docs.graphcore.ai/en/latest/getting-started.html#getting-started) for your system for details on how to enable the Poplar SDK. Also refer to the [Jupyter Quick Start guide](https://docs.graphcore.ai/projects/jupyter-notebook-quick-start/en/latest/index.html) for how to set up Jupyter to be able to run this notebook on a remote IPU machine.

## Dependencies and configuration

Install the dependencies for this notebook.

In [None]:
%pip install "optimum-graphcore>=0.5, <0.6"

Values for machine size and cache directories can be configured through environment variables or directly in the notebook:

In [None]:
import os

n_ipu = int(os.getenv("NUM_AVAILABLE_IPU", 4))
executable_cache_dir = os.getenv("POPLAR_EXECUTABLE_CACHE_DIR", "/tmp/exe_cache/") + "/language_modeling"

### Sharing your model with the community

You can share your model with the 🤗 community and generate results like those shown in the figures above via the inference API. You do this by completing the following steps:

1. Store your authentication token from the 🤗 website. [Sign up to 🤗](https://huggingface.co/join) if you haven't already.
2. Execute the following cell and enter your access token when prompted.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

Then you need to install Git-LFS to manage large files:

In [None]:
!apt install git-lfs

## Preparing the dataset

For each of the two language modelling tasks, we will use the Wikitext 2 dataset. You can easily load this dataset with the 🤗 Datasets library.

In [None]:
from datasets import load_dataset
datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')

You can use any dataset hosted on the [🤗 Datasets Hub](https://huggingface.co/datasets) using the `load_dataset` function.

You can also use your own data. Just uncomment the following cell and replace the paths shown with the paths to your files:

In [None]:
# datasets = load_dataset("text", data_files={"train": "path_to_train.txt", "validation": "path_to_validation.txt"})

You can also load datasets from a CSV or a JSON file. Refer to the 🤗 documentation on [loading datasets from local files](https://huggingface.co/docs/datasets/loading_datasets.html#from-local-files) for more information.

To access an actual element, you need to select a split ("train" in the example), and then specify an index:

In [None]:
datasets["train"][10]

We want to get a sense of what the data looks like, so we define the `show_random_elements` function to display some samples picked randomly from the dataset.

In [None]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [None]:
show_random_elements(datasets["train"])

As we can see, some of the text samples are a full paragraph of a Wikipedia article while others are just titles or empty lines.

## Causal language modelling

For causal language modelling (CLM) we are going to take all the text samples in our dataset and concatenate them after they have been tokenized. Then we will split the result into samples of a certain sequence length. This means the model will receive chunks of contiguous text that may look like:
```
part of text 1
```
or 
```
end of text 1 [BOS_TOKEN] beginning of text 2
```
depending on whether they span several of the original text samples in the dataset or not. The labels will be the same as the inputs, offset by one position so that each token is predicted from the tokens before it.

We will use the [`gpt2`](https://huggingface.co/gpt2) model for this example. You can pick any of the [🤗 models for causal language modelling](https://huggingface.co/models?filter=causal-lm) as long as that model is supported by Optimum Graphcore:

In [None]:
model_checkpoint = "gpt2"

To tokenize all our text samples with the same vocabulary that was used when training the model, we have to download a pre-trained tokenizer. This is all done by the `AutoTokenizer` class:

In [None]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

We can now call the tokenizer on all our text samples. This is very simple using the [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) method from the Datasets library. First we define a function that calls the tokenizer on our text samples:

In [None]:
def tokenize_function(examples):
    return tokenizer(examples["text"])

Then we apply it to all the splits in our `datasets` object, using `batched=True` and 4 processes to speed up the preprocessing. We won't need the `text` column for this example, so we discard it.

In [None]:
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

If we now look at an element of our datasets, we will see the text samples have been replaced with the `input_ids` the model will need:

In [None]:
tokenized_datasets["train"][1]

Now we need to concatenate all our text samples together and then split the result into small chunks of a certain block size. To do this, we will use the `map` method again, with the option `batched=True`. This option lets us change the number of samples in the datasets by returning a different number of samples than we originally had. This means we can create our new samples from a batch of the original samples.
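
To illustrate the contract of a batched function, here is a minimal plain-Python sketch. A batched function receives a dictionary that maps each column name to a list of rows, and it may return a different number of rows. The function `split_long_rows` and its `max_len` parameter are hypothetical and used only for illustration; with 🤗 Datasets, a function of this shape would be passed to `map` with `batched=True`.

```python
def split_long_rows(batch, max_len=3):
    """Batched function: takes a dict of column -> list of rows, returns the same structure with more rows."""
    out = {"input_ids": []}
    for ids in batch["input_ids"]:
        # Cut each row into pieces of at most max_len tokens.
        for start in range(0, len(ids), max_len):
            out["input_ids"].append(ids[start : start + max_len])
    return out


# Two input rows of token ids (toy values).
batch = {"input_ids": [[1, 2, 3, 4, 5], [6, 7]]}
result = split_long_rows(batch)

print(len(batch["input_ids"]), "rows in,", len(result["input_ids"]), "rows out")
print(result["input_ids"])
```

Every column in the returned dictionary must contain the same number of rows, which is why the function we write below rebuilds all columns together.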

We could use the maximum length our model was pre-trained with (`tokenizer.model_max_length`), but since that value might be too large to fit in IPU memory, we set the block size to 128.

In [None]:
# block_size = tokenizer.model_max_length
block_size = 128

Then we write the preprocessing function that will group our text samples into batches:

In [None]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder. We could pad instead if the model supported it;
    # customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

Note that we have duplicated the inputs for our labels. This is because the models in the 🤗 Transformers library shift the labels by one position internally, so we don't need to do it manually.
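
The effect of that internal shift can be sketched in plain Python. The helper `next_token_pairs` is a hypothetical name used only for illustration; it mimics how a GPT-2 style model compares the prediction at each position with the label one position later, which is why identical `input_ids` and `labels` are enough.

```python
def next_token_pairs(input_ids, labels):
    """Pair the context available at each position with the token the model is scored on there."""
    # The last input position has no following token, and the first label has no preceding context.
    contexts = [input_ids[: i + 1] for i in range(len(input_ids) - 1)]
    targets = labels[1:]
    return list(zip(contexts, targets))


input_ids = [10, 11, 12, 13]  # toy token ids
labels = input_ids.copy()     # identical to the inputs, as in group_texts

for context, target in next_token_pairs(input_ids, labels):
    print(f"context={context} -> target={target}")
```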

Also note that by default, the `map` method sends batches of 1,000 samples to be preprocessed. So here, we drop the remainder of each batch of 1,000 samples so that its concatenated tokens are a multiple of `block_size`. You can reduce the amount dropped by passing a larger batch size (which will also take longer to process). You can speed up the preprocessing by using multiprocessing:

In [None]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)

When we now look at our datasets, we see that they have changed. Now, the samples contain chunks of `block_size` contiguous tokens, potentially spanning over several of our original text samples.

In [None]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])
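
To make the chunking and the per-batch remainder drop concrete, here is a small self-contained sketch on toy data. `group_texts_demo` is a hypothetical copy of the grouping logic that takes `block_size` as an argument so it can run on its own; the token ids and sizes are made up for illustration.

```python
def group_texts_demo(examples, block_size):
    """Same logic as group_texts, with block_size passed explicitly."""
    # Concatenate all rows of every column into one long list.
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated["input_ids"])
    # Drop the remainder so the total is a multiple of block_size.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


# Four toy "texts" of 5 tokens each, 20 tokens in total.
rows = [list(range(start, start + 5)) for start in (0, 5, 10, 15)]

# One batch holding all four rows: 20 tokens give 5 chunks of 4, nothing is dropped.
one_batch = group_texts_demo({"input_ids": rows}, block_size=4)

# Two batches of two rows: each has 10 tokens, giving 2 chunks of 4 and 2 dropped tokens per batch.
first = group_texts_demo({"input_ids": rows[:2]}, block_size=4)
second = group_texts_demo({"input_ids": rows[2:]}, block_size=4)

print("chunks with one batch:", len(one_batch["input_ids"]))
print("chunks with two batches:", len(first["input_ids"]) + len(second["input_ids"]))
```

The smaller batches lose tokens at each batch boundary, which is the trade-off controlled by `batch_size` in the `map` call.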

Now that the data has been preprocessed, we are ready to set up our `IPUTrainer` class. First, we create a model:

In [None]:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

Then we need to define `IPUConfig`, which is a class that specifies attributes and configuration parameters to compile and put the model on the device. We initialize `IPUConfig` from a config name or path, here the `Graphcore/gpt2-small-ipu` configuration, and pass it the executable cache directory we set earlier:

In [None]:
from optimum.graphcore import IPUConfig
ipu_config_name = "Graphcore/gpt2-small-ipu"
ipu_config = IPUConfig.from_pretrained(
    ipu_config_name, executable_cache_dir=executable_cache_dir
)

Finally we define `IPUTrainingArguments`, which is a class that contains all the attributes to customize the training. `IPUTrainingArguments` requires one folder name, which will be used to save the checkpoints of the model. All other arguments are optional:

In [None]:
from optimum.graphcore import IPUConfig, IPUTrainer, IPUTrainingArguments

In [None]:
micro_batch_size = 1
gradient_accumulation_steps = 16

model_name = model_checkpoint.split("/")[-1]
training_args = IPUTrainingArguments(
    f"{model_name}-finetuned-wikitext2",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=micro_batch_size,
    per_device_eval_batch_size=micro_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    n_ipu=n_ipu,
    dataloader_drop_last=True,
    logging_steps=10,
    push_to_hub=False,
)

Set `push_to_hub=True` in `IPUTrainingArguments` if you want to push the model to the [🤗 Models Hub](https://huggingface.co/models) regularly during training. Leave it set to `False` if you didn't follow the installation steps at the beginning of this notebook. If you want to save your model locally under a name that is different from the name of the repository it will be pushed to, or if you want to push your model under an organization and not your namespace, use the `hub_model_id` argument to set the repo name (it needs to be the full name, including your namespace: for instance "sgugger/gpt-finetuned-wikitext2" or "huggingface/gpt-finetuned-wikitext2").

We pass all of these to the `IPUTrainer` class:

In [None]:
trainer = IPUTrainer(
    model=model,
    ipu_config=ipu_config,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
)

Now, we can train our model:

In [None]:
trainer.train()

Once the training is completed, we can evaluate our model and get its perplexity on the validation set like this:

In [None]:
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
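
Perplexity is the exponential of the mean cross-entropy loss, which is why the cell above applies `math.exp` to `eval_loss`. The sketch below works this through on made-up numbers; the function `perplexity` and the probabilities are illustrative assumptions, not values produced by the model.

```python
import math


def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood of the correct tokens)."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# Hypothetical probabilities a model assigned to the correct next token.
# Assigning 0.25 everywhere is as uncertain as choosing uniformly among 4 tokens.
uniform_over_four = [0.25, 0.25, 0.25, 0.25]
print(f"Perplexity: {perplexity(uniform_over_four):.2f}")
```

A lower perplexity means the model assigns higher probability to the correct tokens on average.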

You can now upload the result of the training to the 🤗 Hub:

In [None]:
# trainer.push_to_hub()

You can now share this model. Other users can load it with the identifier "your-username/the-name-you-picked", for instance:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("sgugger/my-awesome-model")
```

## Masked language modelling

For masked language modelling (MLM) we are going to use the same preprocessing as before for our dataset with one additional step: we will randomly mask some tokens (by replacing them with the tokenizer's mask token, such as `[MASK]`) and the labels will be adjusted to only include the masked tokens (we don't have to predict the non-masked tokens).

We will use the [`roberta-base`](https://huggingface.co/roberta-base) model for this example. You can pick any of the [🤗 models for masked language modelling](https://huggingface.co/models?filter=masked-lm) as long as that model is supported by Optimum Graphcore:

In [None]:
model_checkpoint = "roberta-base"

We can apply the same tokenization function as before, but we need to update our tokenizer to use the model checkpoint we just picked:

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

As with the causal language modelling example, we group the text samples together and create chunks of length `block_size`. You can skip this step if your dataset is composed of individual sentences.

In [None]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)

The rest is very similar to what we did for the causal language modelling example, with two exceptions:
* We need a model suitable for masked language modelling.
* We need a special data collator.

First, we use a model suitable for masked language modelling:

In [None]:
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

We redefine `IPUConfig`:

In [None]:
ipu_config_name = "Graphcore/roberta-base-ipu"
ipu_config = IPUConfig.from_pretrained(
    ipu_config_name, executable_cache_dir=executable_cache_dir
)

We redefine `IPUTrainingArguments`:

In [None]:
micro_batch_size = 1
gradient_accumulation_steps = 16

model_name = model_checkpoint.split("/")[-1]
training_args = IPUTrainingArguments(
    f"{model_name}-finetuned-wikitext2",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=micro_batch_size,
    per_device_eval_batch_size=micro_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    n_ipu=n_ipu,
    dataloader_drop_last=True,
    logging_steps=10,
    push_to_hub=False,
)

Set `push_to_hub=True` in `IPUTrainingArguments` if you want to push the model to the [🤗 Models Hub](https://huggingface.co/models) regularly during training. Leave it set to `False` if you didn't follow the installation steps at the beginning of this notebook. If you want to save your model locally under a name that is different from the name of the repository it will be pushed to, or if you want to push your model under an organization and not your namespace, use the `hub_model_id` argument to set the repo name (it needs to be the full name, including your namespace: for instance "sgugger/gpt-finetuned-wikitext2" or "huggingface/gpt-finetuned-wikitext2").

Finally, we use a special data collator. The `data_collator` function is responsible for taking the samples and batching them into tensors. In the causal language modelling example, we didn't need anything special, so we just used the default data collator. Here we want to randomly mask the data. We could do it as a pre-processing step (like with the tokenization) but then the tokens would always be masked the same way at each epoch. By doing this step inside the data collator, we ensure this random masking is done in a new way each time we go over the data.

To do this masking, we use `DataCollatorForLanguageModeling` which lets us adjust the probability of the masking:

In [None]:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
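
The following plain-Python sketch shows the idea behind dynamic masking. It is a simplified stand-in, not the library's implementation: `mask_tokens`, `mask_id=0` and `vocab_size=1000` are hypothetical, and special tokens are not excluded here. It follows the usual scheme in which selected tokens are replaced by the mask token 80% of the time, by a random token 10% of the time, and left unchanged otherwise, while all other positions get the label `-100` so that the loss ignores them.

```python
import random


def mask_tokens(input_ids, mask_id, vocab_size, mlm_probability=0.15, rng=random):
    """Return (masked_inputs, labels) for one sequence, in the style of dynamic MLM masking."""
    masked, labels = [], []
    for token in input_ids:
        if rng.random() < mlm_probability:
            labels.append(token)  # this position contributes to the loss
            roll = rng.random()
            if roll < 0.8:
                masked.append(mask_id)  # 80%: replace with the mask token
            elif roll < 0.9:
                masked.append(rng.randrange(vocab_size))  # 10%: replace with a random token
            else:
                masked.append(token)  # 10%: keep the original token
        else:
            labels.append(-100)  # ignored by the loss
            masked.append(token)
    return masked, labels


rng = random.Random(0)            # seeded only to make the sketch reproducible
sequence = list(range(100, 120))  # toy token ids

# Calling the function once per epoch selects a fresh set of positions each time.
for epoch in range(2):
    masked, labels = mask_tokens(sequence, mask_id=0, vocab_size=1000, rng=rng)
    selected = [i for i, label in enumerate(labels) if label != -100]
    print(f"epoch {epoch}: positions selected for prediction = {selected}")
```

Because the masking runs every time a batch is built, the same sequence is masked differently from one epoch to the next.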

Then we simply pass everything to `IPUTrainer` and begin training:

In [None]:
trainer = IPUTrainer(
    model=model,
    ipu_config=ipu_config,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
    data_collator=data_collator,
)

In [None]:
trainer.train()

Like before, we can evaluate our model on the validation set. The perplexity is much lower than for the CLM objective because for the MLM objective, we only have to make predictions for the masked tokens (which represent 15% of the total here) while having access to the rest of the tokens. It's thus an easier task for the model.

In [None]:
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

You can now upload the result of the training to the 🤗 Hub:

In [None]:
# trainer.push_to_hub()

You can also share this model. Other users can load it with the identifier "your-username/the-name-you-picked", for instance:

```python
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("sgugger/my-awesome-model")
```

## Next steps

Try out the other [IPU-powered Jupyter Notebooks](https://www.graphcore.ai/ipu-jupyter-notebooks) to see how IPUs perform on other tasks.