![](../readme_logo.png)

# Fine-tuning GPT2-XL with 🤗 Optimum Habana

This notebook shows how to fine-tune GPT2-XL for causal language modeling with Optimum Habana. You can find more information in the [documentation](https://huggingface.co/docs/optimum/habana/index) and in the [package repository](https://github.com/huggingface/optimum-habana).

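Other models that have been validated for language modeling (see [here](https://huggingface.co/docs/optimum/habana/index)) can be used in a similar way. Note that models like BERT or RoBERTa are trained with a masked rather than a causal language modeling objective.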

## What is Causal Language Modeling?

Causal language modeling is the task of predicting the next token given the tokens that precede it. In this setting, the model **only attends to the left context** (the tokens to the left of the position being predicted). This training objective is particularly well suited to generation tasks.

Here is an example of inputs that could be used for causal language modeling:

> This live AI webinar is organized by Habana Labs and Hugging Face and
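
To make the "left context only" idea concrete, here is a minimal sketch in plain Python. It uses whitespace-separated words as stand-ins for tokens, which is a simplifying assumption (GPT-2 actually works on subword tokens), and the helper names `causal_mask` and `next_token_examples` are hypothetical.

```python
def causal_mask(seq_len):
    """Return a lower-triangular mask: position i may attend to positions 0..i only."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


def next_token_examples(tokens):
    """Pair each left context with the token the model should predict next."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Whitespace "tokens" are an illustration only, not the real GPT-2 tokenization
words = "This live AI webinar is organized by".split()

for context, target in next_token_examples(words):
    print(" ".join(context), "->", target)

for row in causal_mask(4):
    print(row)
```

Each row of the mask corresponds to one position in the sequence: a `1` marks a token that position is allowed to see, so nothing to the right of the current position is ever visible.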

## Training Script

We are going to use the `run_clm.py` example script that you can find [here](https://github.com/huggingface/optimum-habana/blob/main/examples/language-modeling/run_clm.py). It performs the following steps:
- download and preprocess the dataset,
- instantiate the model by downloading a pre-trained checkpoint or initializing a new one,
- download a tokenizer,
- train the model,
- evaluate the model.

It can be used to either **fine-tune** or **pre-train** a model.
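
As a rough illustration of the preprocessing step, the sketch below concatenates tokenized texts and cuts them into fixed-size blocks, which is the general approach used for causal language modeling data. It is a simplified stand-in, not the script's exact code: the function name, the plain-list inputs, and the toy `block_size` are assumptions made for this example.

```python
def group_texts(tokenized_texts, block_size):
    """Concatenate token id lists and split them into blocks of block_size tokens."""
    concatenated = [token for text in tokenized_texts for token in text]
    # Drop the remainder so that every block has exactly block_size tokens
    total_length = (len(concatenated) // block_size) * block_size
    blocks = [concatenated[i : i + block_size] for i in range(0, total_length, block_size)]
    # For causal language modeling the labels are a copy of the inputs;
    # the one-position shift is applied inside the model when computing the loss
    return {"input_ids": blocks, "labels": [list(block) for block in blocks]}


# Toy token ids standing in for three tokenized documents
batch = group_texts([[1, 2, 3], [4, 5], [6, 7, 8, 9]], block_size=4)
print(batch["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```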

> The only difference from the `run_clm.py` example script of Transformers is that the `Trainer` and `TrainingArguments` classes have been replaced by `GaudiTrainer` and `GaudiTrainingArguments`, respectively.
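
The class swap described in the note could look roughly like the sketch below. The helper `build_gaudi_trainer` is hypothetical and is not taken from the script; the argument values mirror the ones used later in this notebook, and the sketch assumes that `optimum-habana` is installed and that the caller supplies a model and datasets.

```python
def build_gaudi_trainer(model, train_dataset, eval_dataset=None, output_dir="/tmp/clm_gpt2_xl"):
    """Create a GaudiTrainer where a Transformers script would create a Trainer."""
    # Imported lazily so that defining this helper does not require optimum-habana
    from optimum.habana import GaudiTrainer, GaudiTrainingArguments

    args = GaudiTrainingArguments(
        output_dir=output_dir,
        use_habana=True,  # run on HPUs
        use_lazy_mode=True,  # lazy mode instead of eager mode
        gaudi_config_name="Habana/gpt2",  # Gaudi configuration from the Hub
    )
    return GaudiTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )
```

Calling `build_gaudi_trainer(model, train_dataset).train()` would then play the same role as `Trainer(...).train()` in a Transformers script.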

## Dataset

The **WikiText** language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia.

It is available on the Hugging Face Hub and you can find more information about it [here](https://huggingface.co/datasets/wikitext).

## 1. Install Dependencies

We first install the latest version of Optimum Habana:

In [None]:
!pip install optimum-habana

Let's also install the required libraries to run this example:

In [None]:
!pip install datasets sentencepiece protobuf scikit-learn evaluate

## 2. Fine-tuning GPT2-XL on 8 HPUs

### Training Arguments

Let's specify the training arguments the same way as in Transformers.

In [None]:
training_args = {
    "output_dir": "/tmp/clm_gpt2_xl",
    "dataset_name": "wikitext",
    "dataset_config_name": "wikitext-2-raw-v1",
    "num_train_epochs": 1,
    "per_device_train_batch_size": 4,
    "per_device_eval_batch_size": 4,
    "gradient_checkpointing": True,
    "do_train": True,
    "do_eval": True,
    "overwrite_output_dir": True,
    "use_cache": False,
}

Decide below whether you want to run pre-training or fine-tuning:

In [None]:
pretraining = False
model_name = "gpt2-xl"

if pretraining:
    training_args["config_name"] = model_name
    training_args["tokenizer_name"] = model_name
else:
    training_args["model_name_or_path"] = model_name

And finally the Gaudi-related arguments:

In [None]:
training_args["use_habana"] = True  # Whether to use HPUs or not
training_args["use_lazy_mode"] = True  # Whether to use lazy or eager mode
training_args["gaudi_config_name"] = "Habana/gpt2"  # Gaudi configuration to use
training_args["throughput_warmup_steps"] = 3  # Remove the first N training iterations from throughput computation

All the existing Gaudi configurations are [here](https://huggingface.co/habana). You can also create your own Gaudi configuration and upload it to the Hugging Face Hub!

### Running the Script

We are going to use the `DistributedRunner` class to launch distributed training. This could also be done with the [`gaudi_spawn.py`](https://github.com/huggingface/optimum-habana/blob/main/examples/gaudi_spawn.py) script. More information is available [here](https://huggingface.co/docs/optimum/habana/usage_guides/distributed).

To be initialized, an instance of this class requires the command to execute and the number of devices to use. Since one Gaudi server has 8 HPUs, we are going to use all of them.
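
The command is a single string built from the dictionary of training arguments. A minimal sketch of such a builder is given here; the helper name `build_command` is hypothetical, and quoting each value with `shlex.quote` is an extra precaution that only matters if a value contains spaces or shell metacharacters and the command ends up being interpreted by a shell.

```python
import shlex


def build_command(script_path, args):
    """Turn a dict of arguments into a 'script --key value ...' command string."""
    parts = [script_path]
    for key, value in args.items():
        parts.append(f"--{key}")
        # Quote the value so that spaces or special characters survive a shell
        parts.append(shlex.quote(str(value)))
    return " ".join(parts)


example_args = {"output_dir": "/tmp/clm_gpt2_xl", "do_train": True, "num_train_epochs": 1}
print(build_command("run_clm.py", example_args))
# run_clm.py --output_dir /tmp/clm_gpt2_xl --do_train True --num_train_epochs 1
```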

> **Disclaimer: the run below will fail!**

In [None]:
from optimum.habana.distributed import DistributedRunner


# Build the command to execute
training_args_command_line = " ".join(f"--{key} {value}" for key, value in training_args.items())
command = f"../examples/language-modeling/run_clm.py {training_args_command_line}"

# # Instantiate a distributed runner
# distributed_runner = DistributedRunner(
#     command_list=[command],  # The command(s) to execute
#     world_size=8,            # The number of HPUs
#     use_mpi=True,            # OpenMPI is used for multi-processing
# )

# # Launch training
# ret_code = distributed_runner.run()

This run failed because the model was too big to fit in HPU memory. Let's use DeepSpeed to solve this!

## 3. DeepSpeed for HPUs

It is possible to use DeepSpeed with HPUs to train larger models! DeepSpeed spreads the optimizer states and gradients across processes so that each device uses less memory.
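
A back-of-the-envelope estimate shows why this helps. The sketch below assumes the usual accounting for mixed-precision training with Adam: 2 bytes per parameter for the weights, 2 for the gradients and 12 for the optimizer states. These figures are assumptions about the setup, not measurements from this notebook, and activations and temporary buffers are ignored entirely.

```python
def model_state_gb_per_device(num_params, world_size, zero_stage):
    """Rough per-device memory (in GB) for weights, gradients and optimizer states."""
    weights = 2 * num_params  # 16-bit weights
    gradients = 2 * num_params  # 16-bit gradients
    optimizer = 12 * num_params  # fp32 weights copy + Adam momentum and variance
    if zero_stage >= 1:
        optimizer /= world_size  # ZeRO stage 1 shards the optimizer states
    if zero_stage >= 2:
        gradients /= world_size  # ZeRO stage 2 also shards the gradients
    if zero_stage >= 3:
        weights /= world_size  # ZeRO stage 3 also shards the weights
    return (weights + gradients + optimizer) / 1e9


num_params = 1.6e9  # GPT2-XL
for stage in (0, 2):
    gb = model_state_gb_per_device(num_params, world_size=8, zero_stage=stage)
    print(f"ZeRO stage {stage}: about {gb:.1f} GB per device")
```

Under these assumptions, the model states shrink from about 25.6 GB to about 6.0 GB per device when 8 processes share them with ZeRO stage 2.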

How to switch to distributed training with DeepSpeed:
1. Install Habana DeepSpeed.
2. Add one training argument to specify the DeepSpeed configuration to use.
3. Instantiate a new distributed runner.

Let's install Habana DeepSpeed:

In [None]:
!pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.21.0

We need a DeepSpeed configuration. We are going to use [this one](https://github.com/huggingface/optimum-habana/tree/main/notebooks/configs/deepspeed_zero_2.json).

In [None]:
training_args["deepspeed"] = "configs/deepspeed_zero_2.json"
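
For orientation, a minimal ZeRO stage 2 configuration might look like the sketch below. This is an illustration assembled from standard DeepSpeed configuration keys, not a copy of the linked file, which remains the reference and may contain additional or different fields. The `"auto"` values are meant to be filled in from the training arguments.

```python
import json


# Illustrative ZeRO stage 2 configuration (an assumption, not the linked file)
zero2_config = {
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {"enabled": True},  # train in bfloat16
    "zero_optimization": {"stage": 2},  # shard optimizer states and gradients
}

print(json.dumps(zero2_config, indent=4))
```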

We now have to instantiate a new distributed runner and run it:

In [None]:
# Build the command to execute
training_args_command_line = " ".join(f"--{key} {value}" for key, value in training_args.items())
command = f"../examples/language-modeling/run_clm.py {training_args_command_line}"

# Instantiate a distributed runner
distributed_runner = DistributedRunner(
    command_list=[command],  # The command(s) to execute
    world_size=8,  # The number of HPUs
    use_deepspeed=True,  # Enable DeepSpeed
)

# Launch training
ret_code = distributed_runner.run()

Let's try the model we just fine-tuned!

In [None]:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer


# The sequence to complete
prompt_text = "This live AI webinar is organized by Habana Labs and Hugging Face and"

path_to_model = training_args["output_dir"]  # the folder where everything related to our run was saved

device = torch.device("hpu")

# Load the tokenizer and the model
tokenizer = GPT2Tokenizer.from_pretrained(path_to_model)
model = GPT2LMHeadModel.from_pretrained(path_to_model)
model.to(device)

# Encode the prompt
encoded_prompt = tokenizer.encode(prompt_text, add_special_tokens=False, return_tensors="pt")
encoded_prompt = encoded_prompt.to(device)

# Generate continuations of the prompt
output_sequences = model.generate(
    input_ids=encoded_prompt,
    max_length=16 + len(encoded_prompt[0]),
    do_sample=True,
    num_return_sequences=3,
)

# Remove the batch dimension when returning multiple sequences
if len(output_sequences.shape) > 2:
    output_sequences.squeeze_()

generated_sequences = []

for generated_sequence_idx, generated_sequence in enumerate(output_sequences):
    print(f"=== GENERATED SEQUENCE {generated_sequence_idx + 1} ===")
    generated_sequence = generated_sequence.tolist()

    # Decode text
    text = tokenizer.decode(generated_sequence, clean_up_tokenization_spaces=True)

    # Remove all text after the stop token (the first period), if there is one
    stop_idx = text.find(".")
    if stop_idx != -1:
        text = text[:stop_idx]

    # Add the prompt at the beginning of the sequence. Remove the excess text that was used for pre-processing
    total_sequence = prompt_text + text[len(tokenizer.decode(encoded_prompt[0], clean_up_tokenization_spaces=True)) :]

    generated_sequences.append(total_sequence)
    print(total_sequence)
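
The three sequences differ because `do_sample=True` draws each next token from the model's probability distribution instead of always taking the most likely one. The following sketch illustrates that difference in plain Python on made-up scores; the helper names and the logits are hypothetical and unrelated to the internals of `generate`.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_next_token(logits, do_sample, rng):
    """Return a token index: sampled if do_sample is True, otherwise the argmax."""
    probs = softmax(logits)
    if do_sample:
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return max(range(len(probs)), key=probs.__getitem__)


rng = random.Random(0)
logits = [2.0, 1.0, 0.1]  # made-up scores for a 3-token vocabulary

print(pick_next_token(logits, do_sample=False, rng=rng))  # greedy: always index 0
print([pick_next_token(logits, do_sample=True, rng=rng) for _ in range(5)])
```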

Here are the training costs for 3 epochs on Gaudi and on NVIDIA V100 GPUs:

In [None]:
import numpy as np


gaudi_price_per_hour = 13.10904
v100_price_per_hour = 12.24

print(
    f"Gaudi    (dl1.24xlarge): training time = 630s, cost = {np.round(630 * gaudi_price_per_hour / 3600, 2)}$ ({gaudi_price_per_hour}$/hr)"
)
print(
    f"4 x V100 (p3.8xlarge)  : training time = 858s, cost = {np.round(858 * v100_price_per_hour / 3600, 2)}$ ({v100_price_per_hour}$/hr)"
)
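
The same figures can be turned into a relative comparison. The short sketch below reuses the training times and hourly prices from the cell above; the helper name `training_cost` is introduced here for illustration.

```python
def training_cost(seconds, price_per_hour):
    """Cost in dollars of a run lasting `seconds` at the given hourly price."""
    return seconds * price_per_hour / 3600


gaudi_cost = training_cost(630, 13.10904)  # dl1.24xlarge
v100_cost = training_cost(858, 12.24)  # p3.8xlarge

speedup = 858 / 630
savings = 1 - gaudi_cost / v100_cost

print(f"Gaudi is {speedup:.2f}x faster and {savings:.0%} cheaper for this run")
# Gaudi is 1.36x faster and 21% cheaper for this run
```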

We successfully trained GPT2-XL, which has 1.6 billion parameters.
You can train even bigger models with Gaudi and DeepSpeed. Try it now! More information is available in [the documentation of Optimum Habana](https://huggingface.co/docs/optimum/habana/usage_guides/deepspeed).