# 🧠 Mastering LLM Fine-Tuning with TRL – Introduction and Prerequisites

Welcome! This notebook is part of a tutorial series where you'll learn how to fine-tune Large Language Models (LLMs) using 🤗 TRL.
We introduce key concepts, set up the required tools, and use techniques like Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO).

## 📋 Prerequisites

Before you begin, make sure you have the following:

* A working knowledge of Python and PyTorch
* A basic understanding of machine learning and deep learning concepts
* Access to a GPU accelerator: this notebook is designed to run with **at least 16GB of GPU memory**, such as what is available for free on [Google Colab](https://colab.research.google.com). In Colab, select Runtime > Change runtime type > T4 GPU.

* The `trl` library installed: this tutorial has been tested with **TRL version 0.17**.
  If you don’t have `trl` installed yet, you can install it by running the following code block:

In [None]:
%pip install trl

* A [Hugging Face account](https://huggingface.co) with a configured access token. If needed, run the following code.
  This will prompt you to enter your Hugging Face access token. You can generate one from your Hugging Face account settings under [Access Tokens](https://huggingface.co/settings/tokens). The token must have `Write access to contents/settings of all repos under your personal namespace`.

In [None]:
from huggingface_hub import notebook_login
notebook_login()

## 🤔 Do you remember how LLMs work?

LLMs are essentially highly advanced autocomplete systems.
You provide them with a bit of text, and they predict what comes next.

In [None]:
from transformers import pipeline

pipeline = pipeline(task="text-generation", model="Qwen/Qwen2.5-1.5B")
prompt = "Octopuses have three"
pipeline(prompt, max_new_tokens=2)

That's right, octopuses have three hearts 🫀! To be honest, I didn't know that before writing this notebook.

The problem with `pipeline` is that it hides the underlying details that are essential to understand before proceeding. Let's break down the pipeline to examine what is happening behind the scenes.


### 🪙 Tokenization

The first step is making sure the model can understand the text. This is done by transforming the text into tokens. Tokens are small units of text that the model can interpret. 
The tokenizer is responsible for encoding the text into token ids and decoding the token ids back into text.
Let's use the same example as before:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B")

prompt = "Octopuses have three"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
inputs

Here is the token mapping:

| Text      | `Oct`   | `op`  | `uses` | `␣have` | `␣three` |
|-----------|---------|-------|--------|---------|----------|
| Token IDs | `18053` | `453` | `4776` | `614`   | `2326`   |
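
To make the encode/decode round trip concrete, here is a minimal sketch of a toy tokenizer. It assumes a tiny hypothetical vocabulary containing only the five pieces from the table above (with the leading-space marker written as a real space) and uses greedy longest-match splitting, which is a simplification and not the actual algorithm behind the Qwen tokenizer. The names `VOCAB`, `encode`, and `decode` are illustrative.

```python
# Toy vocabulary: only the five pieces and IDs from the table above.
VOCAB = {"Oct": 18053, "op": 453, "uses": 4776, " have": 614, " three": 2326}
ID_TO_PIECE = {token_id: piece for piece, token_id in VOCAB.items()}


def encode(text):
    """Greedy longest-match encoding (a simplification, not real BPE)."""
    ids = []
    position = 0
    while position < len(text):
        # Try the longest candidate piece first, then shorter ones.
        for end in range(len(text), position, -1):
            piece = text[position:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                position = end
                break
        else:
            raise ValueError(f"No token covers the text at position {position}")
    return ids


def decode(ids):
    """Map each ID back to its piece and join the pieces."""
    return "".join(ID_TO_PIECE[token_id] for token_id in ids)


ids = encode("Octopuses have three")
print(ids)          # [18053, 453, 4776, 614, 2326]
print(decode(ids))  # Octopuses have three
```

Decoding is the exact inverse of encoding here, which is the property that lets us turn the model's predicted IDs back into readable text later on.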

### ⏩ The Forward Pass

Now that we have a list of integers, we can pass them to the model. Let's see what happens when we do that.

In [None]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B").to("cuda")
output = model(**inputs)
output

It doesn't seem to output text 🥺. Let's see what it returns.

The output is a `CausalLMOutputWithPast` object, which contains several attributes. The only one we are concerned with for now is `logits`. To understand what this is, let's first check its shape.

In [None]:
output.logits.shape

The logits tensor is 3-dimensional:
- `batch_size`: 1 (since we only have one sequence)
- `sequence_length`: 5 (because the text was tokenized into 5 tokens, as seen earlier)
- `vocab_size`: 151936 (the size of the model's output vocabulary, i.e. one score per possible token ID; this value depends on the model and its tokenizer).

**But what are these logits?**

Logits are the scores that the model assigns to each token in the vocabulary for the next position in the sequence. They reflect how strongly the model favors each token as the next one.
Logits are not probabilities themselves: they are unnormalized scores, and applying a softmax turns them into a probability distribution over the vocabulary. The model is saying, "Given the input sequence, here are my scores for each possible next token."
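
As a small worked example of how scores become probabilities, the sketch below applies a softmax to made-up logits for a hypothetical four-token vocabulary. It uses plain Python rather than `torch` so that it runs on its own; the tokens and numbers are assumptions for illustration only.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    # Subtracting the max does not change the result but avoids overflow in exp().
    highest = max(logits)
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical logits for a 4-token vocabulary (made-up numbers).
toy_vocab = [" hearts", " legs", " arms", " brains"]
toy_logits = [4.0, 1.0, 2.0, 0.5]

toy_probs = softmax(toy_logits)
for token, logit, prob in zip(toy_vocab, toy_logits, toy_probs):
    print(f"{token!r:>10}  logit={logit:4.1f}  prob={prob:.3f}")

print(f"sum of probabilities = {sum(toy_probs):.3f}")
```

The ranking of tokens is the same before and after the softmax; only the scale changes, so that the values can be read as probabilities.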

Consequently, the logits at the last position of the sequence correspond to the model's prediction for the next token. In other words, what comes after `"Octopuses have three"`? To get a better understanding, let's look at the distribution the model predicts at that last position.

In [None]:
import torch

# Take only the logits of the last token
last_logits = output.logits[0, -1, :]  # shape = (151936,)
last_probs = torch.softmax(last_logits, dim=-1)  # turn logits into probabilities

# Let's consider only the 10 most probable tokens
top_last_probs, top_last_ids = torch.topk(last_probs, k=10)

print(f"The most probable next token ids are: {top_last_ids.tolist()}")

Models may understand token IDs, but I don't. So let's decode these IDs and see what they mean.

In [None]:
top_last_tokens = tokenizer.batch_decode(top_last_ids)
print(f"The most probable next tokens are: {top_last_tokens}")

Let's plot the distribution to better understand what the model is predicting.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.bar(top_last_tokens, top_last_probs.tolist())

plt.xlabel('Next token')
plt.xticks(rotation=45)
plt.ylabel('Probability')  # the bars show softmax probabilities, not raw logits
plt.title('Octopuses have three...')
plt.show()

We've done it for the last token in the sequence, but we can do it for all tokens in the sequence. Let's see what the model is predicting for each token in the sequence.

In [None]:
probs = torch.softmax(output.logits[0], dim=-1)  # turn logits into probabilities, shape = (5, 151936)
# Let's consider only the 10 most probable tokens
top_probs, top_ids = torch.topk(probs, k=10)
top_tokens = [tokenizer.batch_decode(ids) for ids in top_ids]
# Replace " " with "␣" for better visualization
top_tokens = [[token.replace(" ", "␣") for token in tokens] for tokens in top_tokens]

fig, ax = plt.subplots(1, 5, figsize=(16, 3))
for i in range(5):
    ax[i].bar(tokenizer.batch_decode(top_ids[i]), top_probs[i].tolist())
    ax[i].set_xticks(range(10))
    ax[i].set_xticklabels(top_tokens[i], rotation=45)
    ax[i].set_xlabel('Next token')
    ax[i].set_ylabel('Probability')
    partial_seq = tokenizer.decode(inputs['input_ids'][0][:i + 1])
    ax[i].set_title(f"{partial_seq}...")
plt.tight_layout()

Pretty interesting, right? At each step, we can see what the model considers the most likely next token.

![image.png](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/tuto_forward_pass.png)
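
Generating text amounts to repeating this forward pass: pick a next token from the predicted distribution, append it, and run the model again. The sketch below illustrates the simplest strategy, greedy decoding, with a hypothetical lookup table (`TOY_NEXT_LOGITS`) standing in for the model. The scores are made up, and a real LLM conditions on the whole prefix rather than only on the last token.

```python
# Hypothetical stand-in for a language model: for each current token, a made-up
# score for every possible next token.
TOY_VOCAB = ["Octopuses", " have", " three", " hearts", "."]
TOY_NEXT_LOGITS = {
    "Octopuses": [0.1, 3.0, 0.2, 0.3, 0.1],
    " have":     [0.1, 0.1, 2.5, 1.0, 0.2],
    " three":    [0.1, 0.2, 0.1, 3.5, 0.3],
    " hearts":   [0.1, 0.2, 0.1, 0.1, 2.0],
    ".":         [1.0, 0.1, 0.1, 0.1, 0.5],
}


def toy_forward(tokens):
    """Return next-token logits for the sequence so far (toy: last token only)."""
    return TOY_NEXT_LOGITS[tokens[-1]]


def greedy_generate(prompt_tokens, max_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = toy_forward(tokens)
        # Greedy decoding: always pick the highest-scoring token.
        next_index = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(TOY_VOCAB[next_index])
    return tokens


generated = greedy_generate(["Octopuses", " have", " three"], max_new_tokens=2)
print("".join(generated))  # Octopuses have three hearts.
```

Greedy decoding is only one option; generation can also sample from the distribution instead of always taking the top token.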

## 🍔 How do we end up with a model capable of outputting such distributions?

LLMs don’t start out smart, far from it. The impressive ability to generate coherent, relevant text comes from a process called **pretraining**.

### 🏗️ What is Pretraining?

During pretraining, a model starts with **random weights** and learns by trying to **predict the next token** in massive amounts of text, often hundreds of billions to trillions of tokens, much of it scraped from the internet. Llama 3, for example, was pretrained on about 15 trillion tokens. This process teaches the model general language patterns, grammar, and world knowledge.
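
The next-token objective is usually expressed as a cross-entropy loss: the model is penalized by the negative log of the probability it assigned to the token that actually came next. Here is a minimal sketch in plain Python, with made-up distributions over a hypothetical three-token vocabulary; `next_token_loss` is an illustrative helper, not a library function.

```python
import math


def next_token_loss(probs_per_position, token_ids):
    """Average negative log-likelihood of each token given the tokens before it.

    probs_per_position[i] is the distribution predicted after reading tokens 0..i,
    so it is scored against token_ids[i + 1] (targets are the inputs shifted by one).
    """
    losses = []
    for position in range(len(token_ids) - 1):
        target = token_ids[position + 1]
        losses.append(-math.log(probs_per_position[position][target]))
    return sum(losses) / len(losses)


# Toy setup (made-up numbers): a vocabulary of 3 token IDs and a 3-token sequence.
token_ids = [0, 2, 1]
# The distribution at the last position is unused: there is no next token to score.
confident = [[0.1, 0.1, 0.8], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 3

print(f"confident model loss: {next_token_loss(confident, token_ids):.3f}")
print(f"uniform model loss:   {next_token_loss(uniform, token_ids):.3f}")
```

A model that guesses uniformly gets a loss of ln(3) here, while the one that puts most of its probability on the correct next tokens gets a lower loss. Training nudges the weights toward lower loss.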

Pretraining is:

* **Massive** in scale (weeks or months on hundreds of GPUs)
* **Costly** (millions of dollars)
* **Foundational**: it's what makes an LLM even remotely useful

Once this phase is complete, we get what’s called a **base model**.

### 🧪 What is a Base Model?

A base model is pretrained, but it hasn't been taught *how* to behave.

It doesn't follow instructions well.  
It doesn’t know how to have a conversation or answer questions directly.  
It simply continues text based on patterns it has seen.

So if you give it a prompt like:

> *“What is the capital of Germany?”*

In [None]:
prompt = "What is the capital of Germany?\n"
print(pipeline(prompt, max_new_tokens=100)[0]["generated_text"])

The model generates something that *looks* like a multiple-choice question, because it has seen many of those during pretraining, but it won’t actually answer the question.
Why? Because it hasn’t been trained to respond helpfully. No one has told it: *"This is how you should respond."*

### 🧠 What About ChatGPT, Claude, and Others?

When you use popular models like GPT-4o, Claude, DeepSeek-R1, or o3, you’re *not* using a base model.

You’re using a model that’s been **fine-tuned**, and often further aligned with **reinforcement learning**, to be helpful, safe, and responsive.

For example:

- **DeepSeek-R1** is fine-tuned from a model called **DeepSeek-V3-Base**.  
- **OpenAI o4-mini** is a fine-tuned version of an unknown base model.  
- **Llama 4 Scout** (officially: *Llama-4-Scout-17B-16E-Instruct*) is a fine-tuned version of *Llama-4-Scout-17B-16E*.

### 🎯 Why Does This Matter?

In this tutorial, we’re starting with a **base model**: one that can generate text, but isn’t yet useful on its own.

It won’t follow instructions well, and it may not be helpful or safe by default.

Your job is to **fine-tune** it into something smarter, more helpful, or more aligned to your specific goals.

That’s the magic of **post-training**, and that’s where **TRL** and your creativity come in.

![image.png](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/tuto_pretraining_posttraining.png)

☕ At this point, I think it's a good time to take a break. Let's grab a coffee and come back in 10 minutes.

## 🎛️ Fine-tuning

In the previous section, we discussed what pretraining is and how it gives us a base model. To recap, a base model is one that has been pretrained on a huge dataset, but hasn’t yet been adapted for specific tasks. It can generate text, but by itself it isn’t particularly useful.

### 🗣️ Chat template

To move beyond simple text completions and start building something remotely helpful, like a chatbot, the first step is to use a conversation template. When you interact with a chatbot, your message isn't passed to the model as plain text. Instead, it's formatted in a structured way that tells the model it's part of a conversation: who's speaking, what has been said before, and so on. This structure is defined using what's called a **chat template**.

Here is an example of such templated input:

In [None]:
prompt = """<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
How many hearts do octopuses have?<|im_end|>
<|im_start|>assistant
"""

It's important to remember that just using a chat template doesn't automatically make the model behave like a chatbot. This format is likely unfamiliar to the model; it probably hasn't seen much data like this before. But let's try it out and see what happens:

In [None]:
print(pipeline(prompt, max_new_tokens=20)[0]["generated_text"])

I don’t even know how to describe that output. But what’s clear is that it’s not satisfactory.

You may have noticed that I used a special format to represent the conversation, with custom tokens like `<|im_start|>` and `<|im_end|>` marking where each turn begins and ends. This kind of formatting is convenient and easy to parse later; it is the *chat template* in action. Tokenizers can also apply such a template for you, as long as you specify the one you want in Jinja2 format:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B")
tokenizer.chat_template = """{{- '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n' }}
{%- for message in messages %}
    {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
{%- endif %}"""


Now we can use it to format conversations properly. For example:

In [None]:
messages = [
    {"role": "user", "content": "How many hearts do octopuses have?"},
    {"role": "assistant", "content": "Octopuses have three hearts."},
]
print(tokenizer.apply_chat_template(messages, tokenize=False))

This doesn’t solve our problem yet, but at least now we have a tool to format conversations properly.
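
To make the template logic explicit, here is a rough plain-Python equivalent of the Jinja2 template above. `render_chat` is a hypothetical helper written for illustration; it also shows the effect of `add_generation_prompt`, which opens an assistant turn so that the model's continuation becomes the reply.

```python
def render_chat(messages, add_generation_prompt=False):
    """Plain-Python mirror of the Jinja2 template (illustrative, not a library function)."""
    text = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant turn so the model's continuation becomes the reply.
        text += "<|im_start|>assistant\n"
    return text


question = [{"role": "user", "content": "How many hearts do octopuses have?"}]
print(render_chat(question, add_generation_prompt=True))
```

The printed text matches the hand-written prompt used earlier. With the tokenizer itself, the same effect should be obtained by passing `add_generation_prompt=True` to `apply_chat_template`.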

### 🗂️ Conversational Data

Let’s recap. At this point, we have a model that can generate text, but it’s not yet capable of holding a conversation. We also have a tool to format dialogues using a chat template. So, what’s still missing?

You guessed it: just look at the section title. What we’re missing is *data*.

Hugging Face tools have already been incredibly helpful, even if you didn’t notice. First, you loaded the model using the `transformers` library. Then, you loaded the tokenizer the same way. Both the model and tokenizer were automatically downloaded for you from the 🤗 Hugging Face Hub.

Here’s the great part: the Hub doesn’t just host models; it also offers a huge collection of datasets. As of writing, there are 381,735! Let’s go check one out.

Let’s pick a conversational dataset, like [OpenThoughts-114k](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k). To load it, we’ll use another fantastic library in the Hugging Face ecosystem: `datasets`.

In [None]:
from datasets import load_dataset

dataset = load_dataset("open-thoughts/OpenThoughts-114k")

Let’s take a peek at what this dataset contains, starting with the first example.

In [None]:
example = dataset["train"][0]
example  # display the raw example

The raw output might not be very readable, but you can always explore it visually on the Hugging Face Hub. What’s most important for us is that this dataset contains conversations, in the `conversations` column. So let’s try formatting one of them using our chat template.

In [None]:
tokenizer.apply_chat_template(example["conversations"], tokenize=False)

Uh oh! `UndefinedError: 'dict object' has no attribute 'role'`. Looks like the dataset isn’t in the format we expected. Yep, that happens, and it’s actually pretty common.

Whenever you're training models, you’ll almost always have to go through a data preprocessing step. So let’s tackle that now.
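
Before converting anything, it can help to confirm which keys are actually missing. The following sketch checks each message of a conversation against the keys the template expects; `find_schema_problems` and the sample data are hypothetical and only illustrate the idea.

```python
def find_schema_problems(conversation, required_keys=("role", "content")):
    """Return (index, missing_keys) for every message lacking a required key."""
    problems = []
    for index, message in enumerate(conversation):
        missing = [key for key in required_keys if key not in message]
        if missing:
            problems.append((index, missing))
    return problems


# Hypothetical sample: the first message uses different key names.
sample = [
    {"from": "user", "value": "How many hearts do octopuses have?"},
    {"role": "assistant", "content": "Octopuses have three hearts."},
]
print(find_schema_problems(sample))  # [(0, ['role', 'content'])]
```

An empty list means every message has the keys the chat template reads.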

### 🧹 Data Preparation

What we want is for each conversation to look like this:

```python
{
    "messages": [
        {"role": "user", "content": "How many hearts do octopuses have?"},
        {"role": "assistant", "content": "Octopuses have three hearts."},
    ]
}
```

But the dataset actually looks like this:

```python
{
    "conversations": [
        {"from": "human", "value": "How many hearts do octopuses have?"},
        {"from": "assistant", "value": "Octopuses have three hearts."}
    ]
}
```

Let’s write a function to convert from the second format to the one we need.

In [None]:
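# Map source-specific role names to the ones the chat template expects.
# Assumption: "human" is the only alias; any other role name passes through unchanged.
ROLE_ALIASES = {"human": "user"}

def format_example(example):
    messages = []
    for turn in example["conversations"]:
        role = ROLE_ALIASES.get(turn["from"], turn["from"])
        messages.append({"role": role, "content": turn["value"]})
    return {"messages": messages}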

format_example(example)

Perfect! Now, how do we apply this function to the entire dataset? Simple: just use `dataset.map`.

In [None]:
dataset = dataset.map(format_example, remove_columns="conversations")

That was quick. Let’s now try formatting the first example using our chat template again.

In [None]:
example = dataset["train"][0]
print(tokenizer.apply_chat_template(example["messages"], tokenize=False))

Voilà! We now have a dataset of properly formatted conversations. It's ready to be used for training our model.

> Wait a second! We forgot to apply the chat template to the entire dataset!

No worries, the trainer will take care of that for us. 😉