# 🧠 Mastering LLM Fine-Tuning with TRL – Supervised Fine-Tuning

Welcome! This notebook is part of a tutorial series where you'll learn how to fine-tune Large Language Models (LLMs) using 🤗 TRL.
We introduce key concepts, set up the required tools, and use techniques like Supervised Fine-Tuning (SFT) and Group-Relative Policy Optimization (GRPO).

## 📋 Prerequisites

Before you begin, make sure you have the following:

* A working knowledge of Python and PyTorch
* A basic understanding of machine learning and deep learning concepts
* Access to a GPU accelerator. This notebook is designed to run with **at least 16GB of GPU memory**, such as the T4 available for free on [Google Colab](https://colab.research.google.com). In Colab, go to Runtime -> Change runtime type -> T4 GPU.
* The `trl` library installed. This tutorial has been tested with **TRL version 0.17**.
  If you don't have `trl` installed yet, you can install it by running the following code block:

In [None]:
%pip install trl

* A [Hugging Face account](https://huggingface.co) with a configured access token. If needed, run the following code.
  This will prompt you to enter your Hugging Face access token. You can generate one from your Hugging Face account settings under [Access Tokens](https://huggingface.co/settings/tokens). The token must have the permission `Write access to contents/settings of all repos under your personal namespace`.

In [None]:
from huggingface_hub import notebook_login
notebook_login()

## 🔄 Quick Recap of the Last Session

In the previous session, we explored the foundational concepts behind training and fine-tuning Large Language Models (LLMs). Here's a brief summary:

* LLMs operate on sequences of integers known as *tokens*. Given a sequence, they predict the probability distribution of the next token in the sequence.
* The first phase of training an LLM is called **pretraining**. This involves training the model on a massive corpus of unlabeled text data.
* The output of pretraining is a **base model**, which has learned general language patterns but isn't specialized for specific tasks.
* To adapt the base model for a particular use case, like building a chatbot, we need to **fine-tune** (or post-train) it on a dataset of conversations.
* Many high-quality conversational datasets are available publicly on the [Hugging Face Hub](https://huggingface.co/datasets).
* These datasets are often not in a format that's directly usable for training, so **data preprocessing** is usually required.

Now that we're on the same page, let's dive into the next session! First, let's load our base model and tokenizer.


We'll be using a different model from the first session: `SmolLM2-360M`.


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-360M", device_map="auto")

As before, let's define a chat template and make the necessary modifications to the template and tokenizer.

In [None]:
tokenizer.chat_template = """{{- '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n' }}
{%- for message in messages %}
    {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
{%- endif %}"""
tokenizer.eos_token = "<|im_end|>"
model.config.eos_token_id = tokenizer.eos_token_id
model.generation_config.eos_token_id = tokenizer.eos_token_id

Remember, configuring a chat template doesn't make the model capable of chatting.
It only lets us format inputs as a dialogue; the model still needs to be fine-tuned on conversational data to respond like a chatbot.
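
To make the formatting step concrete, here is a minimal plain-Python sketch of what the chat template defined above produces. The helper name `render_chat` is hypothetical and exists only for illustration; in practice you would call `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` and let the Jinja template do the work.

```python
def render_chat(messages, add_generation_prompt=False):
    """Plain-Python equivalent of the Jinja chat template (illustration only)."""
    # The template always starts with a fixed system message.
    text = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    # Each turn is wrapped in <|im_start|>{role} ... <|im_end|> markers.
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    # Optionally open an assistant turn so the model continues from there.
    if add_generation_prompt:
        text += "<|im_start|>assistant\n"
    return text


messages = [{"role": "user", "content": "What does it mean for a matrix to be invertible?"}]
rendered = render_chat(messages, add_generation_prompt=True)
print(rendered)
```

The result is just a string with special markers. The next cell sends this kind of formatted prompt to the base model, so you can judge its reply for yourself.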

In [None]:
from transformers import pipeline

pipeline = pipeline(task="text-generation", model=model, tokenizer=tokenizer)

question = "What does it mean for a matrix to be invertible?"
prompt = [{"role": "user", "content": question}]
pipeline(prompt, max_new_tokens=50)[0]["generated_text"]

To make the exercise more engaging, we'll use a custom dataset I've created especially for you. Instead of the usual back-and-forth between a lazy user and a helpful assistant, I've spiced things up with dialogues between Rick and Morty. If this reference doesn't ring a bell, too bad for you!

So what we're going to do is basically train a Rick to respond to Morty.

![](https://cdn-uploads.huggingface.co/production/uploads/631ce4b244503b72277fc89f/m9fHggYpFjil8L55a3UNt.png)

In [None]:
from datasets import load_dataset

dataset = load_dataset("qgallouedec/rick-science", split="train")
dataset[0]

Similarly, you can see that the dataset needs to be reformatted into conversations. So, as before, we write a formatting function and apply it to the dataset.

In [None]:
def to_conversation(example):
    return {
        "messages": [
            {"role": "user", "content": example["question"]},
            {"role": "assistant", "content": f"<think>{example['reasoning']}</think> {example['answer']}"},
        ]
    }

dataset = dataset.map(to_conversation, remove_columns=dataset.column_names)
dataset[0]

At this point, we have a model ready for training and a dataset ready to go. This is where TRL comes in: it provides the `SFTTrainer`, a trainer designed to fine-tune the model using the dataset.

**But what is SFT?**


## 🕵️ Supervised Fine-Tuning (SFT)

SFT stands for **Supervised Fine-Tuning**. There's nothing particularly revolutionary about the SFT method. It's simply training the model to predict the next token in a supervised way. Just like in pretraining, we minimize the cross-entropy loss between the model's predicted distribution and the actual next token. The key difference is that, in SFT, the model is trained on **curated**, labeled data, often conversational or instruction-following examples.
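
As a rough illustration of that objective, the sketch below computes the next-token cross-entropy by hand on a toy vocabulary of 4 tokens. The function names and the toy logits are assumptions made for this example; a real trainer computes the same quantity on GPU tensors.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(logits_per_position, token_ids):
    """Mean cross-entropy of predicting token t+1 from position t."""
    losses = []
    # Position t is scored against the token that actually follows it.
    for logits, target in zip(logits_per_position[:-1], token_ids[1:]):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


# Toy setup: vocabulary of 4 tokens, sequence of 3 tokens.
token_ids = [2, 0, 3]
uniform = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]    # model that knows nothing
confident = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
confident[0][0] = 5.0  # after token 2, favors token 0 (correct)
confident[1][3] = 5.0  # after token 0, favors token 3 (correct)

loss_uniform = next_token_loss(uniform, token_ids)
loss_confident = next_token_loss(confident, token_ids)
print(loss_uniform)    # equals log(4) for a uniform guess over 4 tokens
print(loss_confident)  # much lower, because the right tokens get most of the probability
```

Training pushes the model from the first situation toward the second on the examples in the dataset.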

I'm going to break from the usual approach and show you the training process *before* diving into the explanation. Why?

1. You'll get an immediate feel for what the training actually does.
2. The training can run in the background while I walk you through what's happening.

Without further explanation, here's the training code block.

In [None]:
from trl import SFTTrainer, SFTConfig

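args = SFTConfig(
    output_dir="data/SmolLM2-360M-Rickified",
    gradient_checkpointing=True,  # trade extra compute for lower activation memory
    bf16=True,  # mixed precision with bfloat16
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size on one GPU: 4 x 4 = 16
    max_length=650,  # truncate sequences longer than 650 tokens
    logging_steps=10,  # log the training loss every 10 steps
    run_name="SmolLM2-360M-Rickified",
)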

trainer = SFTTrainer(
    model=model,
    args=args,
    processing_class=tokenizer,
    train_dataset=dataset,
)

In [None]:
trainer.train()

Now that we've trained our model, it's time to share it with the world! To do so, simply push it to the Hugging Face Hub using `push_to_hub`.

In [None]:
trainer.push_to_hub(dataset_name="qgallouedec/rick-science")

It's important to remember that when training LLMs, VRAM is the main bottleneck. This is especially true in our case, as we're working with limited compute resources. I designed this notebook to be runnable on the free version of Colab, which only provides a GPU with 16GB of memory. As a result, we must manage GPU memory carefully to avoid running out. This constraint is also a valuable opportunity to learn about GPU memory usage and optimization, skills that are essential when training large language models (LLMs).

### 🥛 What consumes GPU memory?

When you profile GPU memory usage during training, you get a chart like this one:

![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/train_memory/colorized_training_profile.png)

As shown in the graph, several components occupy GPU memory:

* **Model**: The parameters of the neural network itself, loaded into memory.
* **Optimizer state**: Values the optimizer keeps for every parameter between steps. With AdamW, this usually takes up twice as much memory as the model.
* **Activations**: These are the intermediate outputs from each layer during the forward pass.
* **Gradients**: The derivatives of the loss with respect to the model parameters, computed during backpropagation.
* **Optimizer intermediates**: Temporary variables needed by the optimizer while it applies an update.
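
To get a feel for the orders of magnitude, here is a back-of-the-envelope sketch. It assumes 32-bit parameters, an AdamW-style optimizer that keeps two values per parameter, and a nominal 360 million parameters for `SmolLM2-360M`. It deliberately ignores activations and temporary buffers, so treat the result as a lower bound rather than a measurement. The function name is hypothetical.

```python
def training_memory_gb(num_params, bytes_per_param=4):
    """Rough lower bound on training memory in GB, ignoring activations."""
    model = num_params * bytes_per_param
    gradients = model            # one gradient value per parameter
    optimizer_state = 2 * model  # two moment estimates per parameter (AdamW-style)

    def to_gb(n_bytes):
        return n_bytes / 1e9

    return {
        "model": to_gb(model),
        "gradients": to_gb(gradients),
        "optimizer_state": to_gb(optimizer_state),
        "total": to_gb(model + gradients + optimizer_state),
    }


estimate = training_memory_gb(360_000_000)  # nominal parameter count (assumption)
for name, gigabytes in estimate.items():
    print(f"{name}: {gigabytes:.2f} GB")
```

Under these assumptions, close to 6 GB is taken before a single activation is stored, which is why the rest of the 16GB budget has to be spent carefully.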

### 🤏 Use a smaller model

You might have noticed that base models are often released in multiple sizes. Larger models tend to perform better, but they also consume significantly more memory. Choosing a smaller model is the most impactful optimization you can make: it reduces not just the model size, but also the memory used by the gradients, the optimizer, and the activations.

For this notebook, we'll be using `SmolLM2-360M`.

### 📐 Handle the sequence length carefully

The memory required for activations is directly proportional to the number of tokens in a batch. The number of tokens in a batch, in turn, depends on two factors: the *batch size* and the *sequence length*. Doubling the sequence length will double the memory needed for activations, and similarly, doubling the batch size will also double the memory required for activations.

![](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/why_you_should_truncate.png)

And remember, whether you hit an Out Of Memory (OOM) error depends only on the largest batch you ever process, which is driven by the longest sequences in the entire dataset. If, at some point during training, you encounter a sequence longer than what the GPU can handle, you'll get an OOM error, and the training will crash and need to be restarted, at best from the last saved checkpoint. So, it's important to control this maximum sequence length, and truncate any sequences that exceed it.
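
The effect of a single long sequence can be sketched with a few lines of arithmetic. The example assumes each batch is padded to the length of its longest sequence; the helper name and the lengths are made up for illustration.

```python
def padded_tokens(lengths, max_length=None):
    """Tokens processed for one batch padded to its longest sequence."""
    if max_length is not None:
        lengths = [min(n, max_length) for n in lengths]  # truncate long sequences
    return len(lengths) * max(lengths)


batch = [100, 120, 110, 2000]  # hypothetical sequence lengths with one outlier
tokens_raw = padded_tokens(batch)
tokens_truncated = padded_tokens(batch, max_length=650)
print(tokens_raw)        # 8000
print(tokens_truncated)  # 2600
```

One outlier more than triples the number of tokens in this batch, and activation memory grows with it.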

To get a better sense of what could be a reasonable value for the maximum sequence length, you can check the distribution of sequence lengths in your dataset. You can do this by running the following code:

In [None]:
import matplotlib.pyplot as plt

input_ids = [tokenizer.apply_chat_template(example["messages"]) for example in dataset]
lens = [len(ids) for ids in input_ids]

plt.hist(lens, bins=50)
plt.xlabel("Length of input_ids")
plt.ylabel("Number of examples")
plt.title("Distribution of input_ids length")
plt.show()
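
Once you have the lengths, one possible way to turn the histogram into a number is to pick a percentile as the cutoff and check how many examples would be truncated. This is only a sketch: the `percentile` helper and the example lengths are hypothetical, and in the notebook you would pass the real `lens` list instead.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile: smallest value with at least q% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical token counts; replace with `lens` from the cell above.
example_lens = [200, 250, 300, 310, 320, 350, 400, 420, 600, 1800]

cutoff = percentile(example_lens, 90)
share_truncated = sum(n > cutoff for n in example_lens) / len(example_lens)
print(cutoff, share_truncated)  # 600 0.1
```

A cutoff chosen this way keeps most examples intact while protecting the run from rare, very long sequences.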

### Handle the batch size carefully

As explained earlier, batch size directly impacts activation memory. But there's a tradeoff: larger batch sizes generally lead to more stable training.

To balance memory constraints with training quality, you can use gradient accumulation, a technique that simulates a larger batch size by accumulating gradients over multiple smaller batches. For example, all of the following configurations result in the same effective batch size (16):

```python
from trl import SFTConfig

training_args = SFTConfig(per_device_train_batch_size=16, gradient_accumulation_steps=1, ...)  # fast but memory intensive
training_args = SFTConfig(per_device_train_batch_size=8, gradient_accumulation_steps=2, ...)
training_args = SFTConfig(per_device_train_batch_size=4, gradient_accumulation_steps=4, ...)
training_args = SFTConfig(per_device_train_batch_size=2, gradient_accumulation_steps=8, ...)
training_args = SFTConfig(per_device_train_batch_size=1, gradient_accumulation_steps=16, ...)  # slow but memory efficient
```

Just note: with more accumulation steps, each optimizer update is split into more sequential forward and backward passes, which increases training time. When memory allows, prefer fewer accumulation steps with a larger batch size.
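
To see why accumulation gives the same update, here is a toy check on a one-parameter model `y = w * x` with a mean squared error loss. Everything in it (the data, the model, the function name) is an assumption made for illustration. It relies on equally sized micro-batches, with each micro-batch gradient scaled by the number of accumulation steps.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [float(i) for i in range(16)]
ys = [3.0 * x + 1.0 for x in xs]
w = 0.5

# One big batch of 16 samples.
full_grad = mse_grad(w, xs, ys)

# Four micro-batches of 4 samples, accumulated before the update.
accumulation_steps = 4
micro_batch_size = len(xs) // accumulation_steps
accumulated_grad = 0.0
for step in range(accumulation_steps):
    start = step * micro_batch_size
    micro_x = xs[start:start + micro_batch_size]
    micro_y = ys[start:start + micro_batch_size]
    # Scale each micro-batch gradient so the sum averages over all 16 samples.
    accumulated_grad += mse_grad(w, micro_x, micro_y) / accumulation_steps

print(full_grad, accumulated_grad)
```

Both numbers match, while only 4 samples need to be held in memory at a time.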



### 🕰 Use gradient checkpointing

Gradient checkpointing is a memory-saving technique that reduces GPU usage during training by selectively storing only some intermediate activations and recomputing the others during backpropagation.

![](https://github.com/cybertronai/gradient-checkpointing/raw/master/img/output.gif)

![](https://github.com/cybertronai/gradient-checkpointing/raw/master/img/output2.gif)

We won't go into the details of how it works, but remember that it makes training a bit slower than the standard approach, since some activations are computed twice. However, it's a great way to save memory, especially when training large models. And it's super easy to enable in the `SFTTrainer`: you just have to set the `gradient_checkpointing` argument to `True` when creating the config:

```python
from trl import SFTConfig

training_args = SFTConfig(gradient_checkpointing=True, ...)
```
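
For the curious, the idea of "store some activations, recompute the rest" can be mimicked in a few lines of plain Python. This is a toy model of the bookkeeping only, not how the library implements it: the `layer` function, the checkpoint spacing, and all names are assumptions for illustration.

```python
def layer(i, x):
    """Stand-in for layer i of a network (any deterministic function works)."""
    return 0.5 * x + i


def forward_all(x, num_layers):
    """Standard forward pass: keep every intermediate activation."""
    activations = [x]
    for i in range(num_layers):
        activations.append(layer(i, activations[-1]))
    return activations


def forward_checkpointed(x, num_layers, every):
    """Keep only the input and the activation after every `every`-th layer."""
    checkpoints = {0: x}
    for i in range(num_layers):
        x = layer(i, x)
        if (i + 1) % every == 0:
            checkpoints[i + 1] = x
    return checkpoints


def recompute(checkpoints, index, every):
    """Rebuild activation `index` from the nearest earlier checkpoint."""
    start = (index // every) * every
    x = checkpoints[start]
    for i in range(start, index):
        x = layer(i, x)
    return x


num_layers, every = 12, 4
full = forward_all(1.0, num_layers)
ckpt = forward_checkpointed(1.0, num_layers, every)
print(len(full), len(ckpt))  # 13 4
print(recompute(ckpt, 7, every) == full[7])  # True
```

The checkpointed pass stores 4 values instead of 13, and any missing activation can be rebuilt with at most `every - 1` extra layer calls. That extra work is the slowdown mentioned above.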

### 🧊 Use mixed precision

Mixed precision training speeds up training and reduces memory usage by combining 16-bit (`float16` or `bfloat16`) and 32-bit (`float32`) floating-point arithmetic. It keeps most operations in 16-bit to save memory and compute time, while using 32-bit where higher precision is needed (like the weight updates applied by the optimizer). This technique is especially helpful when training large models, as it allows you to fit larger batches or longer sequences in memory, often without a noticeable drop in model performance.

```python
from trl import SFTConfig

training_args = SFTConfig(bf16=True, ...)
```
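
To illustrate why some values are kept in higher precision, the sketch below emulates `bfloat16` by keeping only the top 16 bits of a `float32` encoding. This is a simplification: real hardware rounds rather than truncates, and the helper name is hypothetical. The Python float used for the master copy stands in for a 32-bit weight.

```python
import struct


def to_bfloat16(x):
    """Emulate bfloat16 by truncating the float32 encoding to its top 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


update = 0.001  # a small weight update, e.g. learning rate times gradient

# Weight stored only in bfloat16: the update is smaller than the spacing
# between representable values near 1.0, so it is lost every time.
w_bf16 = 1.0
for _ in range(10):
    w_bf16 = to_bfloat16(w_bf16 + update)

# Higher-precision master copy: updates accumulate, and a 16-bit copy
# can be derived from it whenever it is needed for computation.
w_master = 1.0
for _ in range(10):
    w_master += update

print(w_bf16)                 # 1.0
print(to_bfloat16(w_master))  # 1.0078125
```

The 16-bit-only weight never moves, while the master copy accumulates the ten updates and its 16-bit view eventually changes.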

## 🛸 Did rickification work?

Let's see with a basic question!

In [None]:
from transformers import pipeline

pipeline = pipeline(task="text-generation", model="qgallouedec/SmolLM2-360M-Rickified")

question = "What would happen to time for an astronaut traveling near the speed of light?"
prompt = [{"role": "user", "content": question}]
pipeline(prompt, max_new_tokens=400)[0]["generated_text"]

It sounds very much like Rick!
Let's try with another one, a bit out-of-distribution, to see how the model reacts.

In [None]:
question = "What is Edmonton's average temperature in January?"
prompt = [{"role": "user", "content": question}]
pipeline(prompt, max_new_tokens=400)[0]["generated_text"]

Before we close this chapter, let's generate a final one, just for fun!

In [None]:
question = "A ball is thrown vertically upward with an initial speed of 12 m/s. What is its maximum height?"
prompt = [{"role": "user", "content": question}]
pipeline(prompt, max_new_tokens=400)[0]["generated_text"]