# 🧠 Mastering LLM Fine-Tuning with TRL: Group Relative Policy Optimization

Welcome! This notebook is part of a tutorial series where you'll learn how to fine-tune Large Language Models (LLMs) using 🤗 TRL.
We introduce key concepts, set up the required tools, and use techniques like Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO).

## 📋 Prerequisites

Before you begin, make sure you have the following:

* A working knowledge of Python and PyTorch
* A basic understanding of machine learning and deep learning concepts
* Access to a GPU accelerator: this notebook is designed to run with **at least 16GB of GPU memory**, such as what is available for free on [Google Colab](https://colab.research.google.com) (Runtime tab -> Change runtime type -> T4 GPU).
* The `trl` library installed: this tutorial has been tested with **TRL version 0.17**.
  If you don't have `trl` installed yet, you can install it by running the following code block:

In [None]:
%pip install trl

* A [Hugging Face account](https://huggingface.co) with a configured access token. If needed, run the following code.
This will prompt you to enter your Hugging Face access token. You can generate one from your Hugging Face account settings under [Access Tokens](https://huggingface.co/settings/tokens). The token must have `Write access to contents/settings of all repos under your personal namespace`.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

## 🔄 Quick Recap of the Last Session

In the previous session, we explored Supervised Fine-Tuning (SFT) and how to use it to post-train a language model on a custom dataset.

* SFT is a technique used to adapt a pre-trained (base) language model to a specific task or domain by training it on a labeled dataset.
* We used the `trl` library to load a pre-trained model, and then fine-tuned it on a custom dataset.
* We also discussed the importance of data preprocessing and how to prepare your dataset for training.
* We discussed how to manage memory, which is crucial when working with large models.
* We pushed the fine-tuned model to the Hugging Face Hub, making it accessible for others to use.
* We showed that, even though the model is now capable of being conversational, it is still not very good at scientific tasks.

In [None]:
import transformers
import textwrap

pipeline = transformers.pipeline(task="text-generation", model="qgallouedec/SmolLM2-360M-Rickified")

question = "During a race, a runner maintains a constant velocity of 8 meters per second for 15 seconds. What is the total displacement of the runner?"
prompt = [{"role": "user", "content": question}]

generated_text = pipeline(prompt, max_new_tokens=512)[0]["generated_text"]  # [{'role': 'user', 'content': "How do ..."}, {'role': 'assistant', 'content': "<think>Alright, ..."}]
completion = generated_text[1:] # [{'role': 'assistant', 'content': "<think>Alright, ..."}]
print(textwrap.fill(completion[0]["content"], width=120))

#### 🤮 Fun finding

It seems that Rick can burp continuously when asked certain questions!

In [None]:
question = "A 3 kg block of copper decreases in temperature from 150°C to 50°C. If copper's specific heat capacity is 0.39 J/g°C, calculate the energy released by the block."
prompt = [{"role": "user", "content": question}]

generated_text = pipeline(prompt, max_new_tokens=512)[0]["generated_text"]  # [{'role': 'user', 'content': "How do ..."}, {'role': 'assistant', 'content': "<think>Alright, ..."}]
completion = generated_text[1:] # [{'role': 'assistant', 'content': "<think>Alright, ..."}]
print(textwrap.fill(completion[0]["content"], width=120))

![](https://cdn-uploads.huggingface.co/production/uploads/631ce4b244503b72277fc89f/GTv6M_T1MreorkLlNtvKA.png)

Our goal here is to continue fine-tuning our model to improve its performance on scientific tasks. To do this, we'll use a technique that has recently proven highly effective and led to the development of some of the best-performing reasoning models, such as DeepSeek-R1 and Qwen3: RLVR (**Reinforcement Learning with Verifiable Rewards**).

## ✅ What is RLVR?

RLVR applies to problems where an answer can be checked for correctness or, more generally, where a score can be assigned to it objectively.

In our case, we want two things from our model (Rick): it should first lay out its reasoning, enclosed in `<think></think>` tags, and then provide the final answer, and that final answer should be correct.

These two requirements can be implemented as functions.

In [None]:
# During a race, a runner maintains a constant velocity of 8 meters per second for 15 seconds. What is the total displacement of the runner?

# Correct answer, correct format
completion_1 = [
    {
        "role": "assistant",
        "content": "<think>Alright, let's break this down. The total displacement of the runner can be calculated using the formula: displacement = velocity * time. In this case, the velocity is 8 m/s and the time is 15 seconds. So, the total displacement is 8 m/s * 15 s = 120 meters.</think> The total displacement of the runner is 120 meters."
    }
]
# Wrong format, correct answer
completion_2 = [
    {
        "role": "assistant",
        "content": "The total displacement of the runner is 120 meters. To calculate this, we use the formula: displacement = velocity * time. In this case, the velocity is 8 m/s and the time is 15 seconds. So, the total displacement is 8 m/s * 15 s = 120 meters."
    }
]
# Wrong answer, correct format
completion_3 = [
    {
        "role": "assistant",
        "content": "<think>Okay, let's analyze the problem. The formula for displacement is: displacement = velocity * time. Here, the velocity is 8 m/s and the time is 15 seconds. So, the total displacement would be 8 m/s * 15 s = 80 meters.</think> The total displacement of the runner is 80 meters."
    }
]

Let's check the format first:

In [None]:
import re

def format_reward(completions, **kwargs):
    pattern = r"^<think>(?!.*<think>)(.*?)</think>.*$"
    completion_contents = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, content, re.DOTALL | re.MULTILINE) for content in completion_contents]
    return [1.0 if match else 0.0 for match in matches]

format_reward([completion_1, completion_2, completion_3])
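
To get a feel for what this pattern accepts and rejects, here is a small self-contained sketch that runs the same regular expression on a few hand-written strings. The `check_format` helper and the test strings are made up for illustration; they are not part of TRL. One thing worth noticing is that the pattern does not require any text after `</think>`, so a completion with an empty final answer still passes the format check.

```python
import re

# Same pattern as in format_reward above.
PATTERN = r"^<think>(?!.*<think>)(.*?)</think>.*$"

def check_format(text):
    """Hypothetical helper: True if `text` starts with exactly one <think>...</think> block."""
    return re.match(PATTERN, text, re.DOTALL | re.MULTILINE) is not None

examples = {
    "well formed": "<think>8 * 15 = 120</think> 120 meters",
    "no think block": "120 meters",
    "text before think": "Sure! <think>8 * 15 = 120</think> 120 meters",
    "second think tag": "<think>a<think>b</think> 120 meters",
    "missing closing tag": "<think>8 * 15 = 120",
    "empty answer": "<think>8 * 15 = 120</think>",
}

results = {name: check_format(text) for name, text in examples.items()}
for name, ok in results.items():
    print(f"{name:20s} -> {ok}")
```

Only the first and the last strings pass, which suggests that a stricter pattern may be worth considering if empty answers turn out to be a problem during training.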

Now let's check the correctness of the answer. We can do this by checking whether the expected answer appears in the completion. If it does, we return a reward of 1.0; otherwise we return 0.0. It's pretty basic and not very robust, but it works for our simple use case.

In [None]:
def correctness_reward(completions, solution, **kwargs):
    rewards = []
    for completion, ground_truth in zip(completions, solution):
        content = completion[0]["content"]
        reward = 1.0 if ground_truth in content else 0.0
        rewards.append(reward)
    return rewards

correctness_reward([completion_1, completion_2, completion_3], solution=["120 meters", "120 meters", "120 meters"])

The function above provides a working example, but as you can see, it's not very robust. For instance, if the model returns `"120 m"` instead of `"120 meters"`, the function will output `0.0`, even though the answer is actually correct.

Parsing a model's response may seem like a minor detail, but it's actually very important. There are advanced methods to handle this, but today we'll keep it simple. We'll just make our reward function a bit more robust by comparing the model's output not to a single correct answer, but to a list of acceptable answers.


In [None]:
def correctness_reward(completions, solutions, **kwargs):
    rewards = []
    for completion, ground_truths in zip(completions, solutions):
        content = completion[0]["content"]
        matches = [ground_truth in content for ground_truth in ground_truths]
        reward = 1.0 if any(matches) else 0.0
        rewards.append(reward)
    return rewards

And we will be using it like this:

In [None]:
correctness_reward([completion_1], solutions=[[ "120 m", "120.0 m", "120 meters", "120m", "120.0 meters"]])
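
The substring check still has blind spots: it also looks inside the `<think>` block, and a longer number such as "1120 meters" contains "120 meters". As one possible refinement (not the approach used in the rest of this notebook), the sketch below only inspects the text after `</think>` and compares the first number it finds with an expected value. The names `final_answer`, `numeric_correctness_reward`, `expected_values` and the tolerance are assumptions made for this illustration.

```python
import math
import re

NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

def final_answer(content):
    """Return the text after the last </think> tag (or the whole text if there is none)."""
    return content.rsplit("</think>", 1)[-1]

def numeric_correctness_reward(completions, expected_values, rel_tol=1e-3, **kwargs):
    """Illustrative reward: 1.0 if the first number in the final answer matches the expected value."""
    rewards = []
    for completion, expected in zip(completions, expected_values):
        match = NUMBER.search(final_answer(completion[0]["content"]))
        ok = match is not None and math.isclose(float(match.group()), expected, rel_tol=rel_tol)
        rewards.append(1.0 if ok else 0.0)
    return rewards

demo = [
    [{"role": "assistant", "content": "<think>8 * 15 = 120</think> The displacement is 120 m."}],
    [{"role": "assistant", "content": "<think>Maybe 120 meters?</think> The displacement is 80 meters."}],
    [{"role": "assistant", "content": "<think>Hmm.</think> The displacement is 1120 meters."}],
]
print(numeric_correctness_reward(demo, expected_values=[120.0, 120.0, 120.0]))  # [1.0, 0.0, 0.0]
```

This sketch ignores units and simply takes the first number in the answer, so it trades one kind of fragility for another; it is only meant to show the direction a more careful parser could take.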

Once again, the rewards above are deterministic, which means there is no need for a reward model here. You may be familiar with RLHF (Reinforcement Learning from Human Feedback), where a reward model is trained to predict the reward for a given input. **This is not the case here.** That's why we call this approach **Verifiable Rewards**.
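
With two reward functions, each completion ends up with two scores that have to be merged into one. To my understanding, TRL's `GRPOTrainer` does this with a weighted sum, using equal weights unless configured otherwise. The sketch below mimics that in plain Python; `combine_rewards` is a hypothetical helper, and the scores are the ones the two functions give for `completion_1`, `completion_2` and `completion_3` when the expected answer is "120 meters".

```python
def combine_rewards(rewards_per_func, weights=None):
    """Weighted sum of several reward lists (one list per reward function)."""
    if weights is None:
        weights = [1.0] * len(rewards_per_func)
    return [
        sum(w * r for w, r in zip(weights, per_completion))
        for per_completion in zip(*rewards_per_func)
    ]

# (good format, right answer), (bad format, right answer), (good format, wrong answer)
format_scores = [1.0, 0.0, 1.0]
correctness_scores = [1.0, 1.0, 0.0]

total = combine_rewards([format_scores, correctness_scores])
print(total)  # [2.0, 1.0, 1.0]
```

With equal weights, a correct answer in the wrong format is worth as much as a wrong answer in the right format; changing the weights is one way to express a preference between the two.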

## üë®‚Äçüë®‚Äçüë¶‚Äçüë¶ Group Relative Policy Optimization (GRPO)

Now the goal is to train a model that maximizes both rewards. To do this, we will be using Group Relative Policy Optimization (GRPO), a technique that has been shown to be effective in training models with verifiable rewards.

The following diagram illustrates how GRPO works:

![](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/grpo_visual.png)

We can break the process down into several steps. The goal here isn't to dive deep into the math or underlying theory, but rather to build a minimal understanding of how the method works in practice.

#### 1. Sample a batch of prompts

```python
prompts = ["The capital of Canada is", "The sky is"]
```

#### 2. For each prompt, generate a list of completions

```python
completions = [["Ottawa", "Toronto", "Edmonton"], ["red", "blue", "yellow"]]
```

#### 3. For each completion, compute the reward

Assume we have a reward function that returns `1.0` if the completion is correct and `0.0` otherwise:

```python
rewards = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
```

#### 4. Compute the advantage by normalizing the rewards within each group

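For a given prompt, let $\mathbf{r} = (r_1, \dots, r_G)$ be the rewards of its $G$ completions. The advantage of completion $i$ is its reward normalized within the group, and it is shared by every token $t$ of that completion:

$$
\hat{A}_{i,t} = \frac{r_i - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r})}
$$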

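For the first group, the mean is 0.33 and the sample standard deviation is about 0.58, so subtracting the mean gives `[0.67, -0.33, -0.33]` and dividing by the standard deviation yields:

```python
advantages = [[1.15, -0.58, -0.58], [-0.58, 1.15, -0.58]]
```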
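
The normalization step can be sketched in a few lines of plain Python. This is a minimal illustration rather than TRL's implementation: the `group_advantages` name and the small `eps` added to the denominator (to avoid dividing by zero when all rewards in a group are equal) are assumptions of this sketch, as is the use of the sample standard deviation.

```python
from statistics import mean, stdev

def group_advantages(group_rewards, eps=1e-4):
    """Normalize the rewards of one group (all completions of the same prompt)."""
    mu = mean(group_rewards)
    sigma = stdev(group_rewards)  # sample standard deviation (n - 1 in the denominator)
    return [(r - mu) / (sigma + eps) for r in group_rewards]

rewards = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# Step 1 only: subtract the group mean.
centered = [[round(r - mean(group), 2) for r in group] for group in rewards]
# Steps 1 and 2: subtract the mean, then divide by the standard deviation.
advantages = [[round(a, 2) for a in group_advantages(group)] for group in rewards]

print(centered)    # [[0.67, -0.33, -0.33], [-0.33, 0.67, -0.33]]
print(advantages)  # [[1.15, -0.58, -0.58], [-0.58, 1.15, -0.58]]
```

When every completion in a group gets the same reward, all advantages are zero and the group contributes no learning signal, which is why a reward that is neither always 0 nor always 1 matters in practice.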

#### 5. Compute the GRPO loss

$$
\mathcal{L}_{\text{GRPO}}(\theta) = -\frac{1}{\sum_{i=1}^G |o_i|} \sum_{i=1}^G \sum_{t=1}^{|o_i|} \left[ \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\left[\pi_\theta(o_{i,t} \mid q, o_{i,< t})\right]_{\text{no grad}}} \hat{A}_{i,t} - \beta \, \mathbb{D}_{\text{KL}}\left[\pi_\theta \| \pi_{\text{ref}}\right] \right]
$$
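
To make the formula a bit more concrete, here is a sketch that evaluates its forward value for one group using plain Python lists. Several things are assumptions of this sketch: the function name `grpo_loss_value`, the toy log-probabilities, and the per-token KL estimate $e^{x} - x - 1$ with $x = \log \pi_{\text{ref}} - \log \pi_\theta$, which is the estimator used for GRPO as far as I know. In a real training loop the denominator of the ratio is detached from the computation graph, so the ratio equals 1 in value while still carrying gradients through the numerator.

```python
import math

def grpo_loss_value(logps, ref_logps, advantages, beta):
    """Forward value of the GRPO loss for one group.

    logps / ref_logps: per-token log-probabilities, one list per completion.
    advantages: one advantage per completion, shared by all of its tokens.
    """
    total, n_tokens = 0.0, 0
    for lp_seq, ref_seq, adv in zip(logps, ref_logps, advantages):
        for lp, ref_lp in zip(lp_seq, ref_seq):
            # pi_theta / [pi_theta]_no_grad: always 1.0 in value.
            ratio = math.exp(lp - lp)
            # Per-token KL estimate between the policy and the reference (assumed form).
            kl = math.exp(ref_lp - lp) - (ref_lp - lp) - 1.0
            total += ratio * adv - beta * kl
            n_tokens += 1
    # Normalize by the total number of tokens in the group, as in the formula.
    return -total / n_tokens

logps = [[-1.0, -2.0], [-0.5]]   # two completions, 2 and 1 tokens long
advantages = [1.0, -1.0]

# Policy identical to the reference: the KL term vanishes.
print(grpo_loss_value(logps, logps, advantages, beta=0.04))  # about -0.333
# Policy drifted away from the reference on one token: the loss goes up.
print(grpo_loss_value(logps, [[-1.5, -2.0], [-0.5]], advantages, beta=0.04))
```

Because the sum is divided by the total token count of the group, a long completion contributes more terms than a short one.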

#### 6. Update the model parameters

Like in any other deep learning task, we can use the GRPO loss to update the model parameters using backpropagation.

And we go back to step 1!

---

So, let's recap where we are so far:
We understand what RLVR is, we've defined our reward functions, we have a general grasp of how GRPO works, and our model is ready for fine-tuning.

What's missing? The data!

Building a dataset with verifiable data isn't trivial either, but no worries, I've done it for you in this example:
👉 https://huggingface.co/datasets/qgallouedec/rick-physics-grpo

Let's load it and check the first few samples.

In [None]:
from datasets import load_dataset

dataset = load_dataset("qgallouedec/rick-physics-grpo", split="train")
dataset[0]

And now we're good? Not quite yet. Remember, in the previous tutorial, we said that data often needed to be prepared. Well, that's the case here. GRPO expects data to be *prompt-only*. Also, we want our data to be conversational. We'll need to process our dataset:

In [None]:
def format_dataset(example):
    return {"prompt": [{"role": "user", "content": example["question"]}]}

dataset = dataset.map(format_dataset)
dataset[0]
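
To see what this mapping does without downloading anything, the sketch below applies the same function to a hand-written row. The row itself is hypothetical: the `solutions` column is an assumption about the dataset's schema, suggested by the `solutions` argument of `correctness_reward`, and the real values will differ. As far as I understand, TRL forwards the remaining dataset columns to the reward functions as keyword arguments, which is why the column name and the argument name need to match.

```python
def format_dataset(example):
    return {"prompt": [{"role": "user", "content": example["question"]}]}

# Hypothetical row; the real dataset's columns and values may differ.
row = {
    "question": "A car travels at 20 m/s for 5 seconds. How far does it go?",
    "solutions": ["100 m", "100 meters"],
}

# Dataset.map merges the returned dict into the existing row.
formatted = {**row, **format_dataset(row)}

print(formatted["prompt"])
# [{'role': 'user', 'content': 'A car travels at 20 m/s for 5 seconds. How far does it go?'}]
print(sorted(formatted))
# ['prompt', 'question', 'solutions']
```

The prompt is a list with a single user message and no assistant turn: the completions are generated by the model during training.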

At this point, we finally have everything we need to start training.
Let's see how to do that using `trl`!

In [None]:
from trl import GRPOTrainer, GRPOConfig

args = GRPOConfig(
    num_generations=16,
    max_completion_length=512,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=10,
    # Speedup and reduce memory
    gradient_checkpointing=True,
    bf16=True,
    output_dir="data/SmolLM2-360M-Rickified-GRPO",
    # Logging
    run_name="SmolLM2-360M-Rickified-GRPO",
    logging_steps=2,
    log_completions=True,
    num_completions_to_print=1,
)

trainer = GRPOTrainer(
    model="qgallouedec/SmolLM2-360M-Rickified",
    reward_funcs=[format_reward, correctness_reward],
    train_dataset=dataset,
    args=args,
)
trainer.train()
trainer.push_to_hub(dataset_name="qgallouedec/rick-physics-grpo")
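
The batch-related values in this configuration are linked. As far as I understand the TRL version this tutorial targets, `per_device_train_batch_size` counts completions rather than prompts, and the effective batch (devices x batch size x accumulation steps) has to be divisible by `num_generations`. The arithmetic below is a sketch under that assumption, with a single GPU assumed for `num_processes`.

```python
# Values copied from the GRPOConfig above; num_processes is an assumption (single GPU).
num_generations = 16
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_processes = 1

# Number of completions consumed per optimizer step.
completions_per_step = per_device_train_batch_size * gradient_accumulation_steps * num_processes
assert completions_per_step % num_generations == 0, "effective batch must be divisible by num_generations"

# Number of distinct prompts those completions belong to.
prompts_per_step = completions_per_step // num_generations
print(completions_per_step, prompts_per_step)  # 16 1
```

Under these assumptions, each optimizer step uses the 16 completions of a single prompt, which keeps memory usage low at the cost of noisier updates.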

GRPO takes time to train, so I've done it in advance using a few optimizations and some extra compute, while keeping the process equivalent.
The trained model is available here:
👉 https://huggingface.co/qgallouedec/SmolLM2-360M-Rickified-GRPO

Let's try it out!

In [None]:
pipeline = transformers.pipeline(task="text-generation", model="qgallouedec/SmolLM2-360M-Rickified-GRPO")

question = "Sarah is cooking soup on the stove, and the heat is transferred from the hot soup at 90°C to a cooler room at 20°C. If the heat transfer is mainly due to convection and the rate of heat transfer is 250 Joules per minute, how much heat is transferred into the room over 10 minutes?"
# The answer is 2500 J
prompt = [{"role": "user", "content": question}]

generated_text = pipeline(prompt, max_new_tokens=512)[0]["generated_text"]  # [{'role': 'user', 'content': "How do ..."}, {'role': 'assistant', 'content': "<think>Alright, ..."}]
completion = generated_text[1:]  # [{'role': 'assistant', 'content': "<think>Alright, ..."}]
print(textwrap.fill(completion[0]["content"], width=120))

### 🚀 Some recent findings about GRPO

Recently, the scientific community has improved GRPO, so that today it's customary to use a slightly different version in practice. Two of the most important innovations are:

#### 🤙 No need to use KL divergence to regularize training

```python
GRPOConfig(beta=0.0, ...)
```

#### 🙈 Generations that exceed the length budget should be ignored

![](https://pbs.twimg.com/media/GoIZ3grbMAAFBdF?format=jpg&name=4096x4096)
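
My reading of this point, which is an interpretation since the finding is only shown as an image, is that completions cut off at `max_completion_length` before producing an end-of-sequence token are left out of the loss. The sketch below illustrates the idea in plain Python; the token ids, the `EOS` value and both helper functions are made up for the example. If I recall correctly, recent TRL releases expose a `mask_truncated_completions` option for this, but check the documentation of your installed version before relying on it.

```python
EOS = 0  # hypothetical end-of-sequence token id

def keep_mask(completions_ids, eos_token_id=EOS):
    """1 for completions that produced EOS, 0 for those cut off at the length limit."""
    return [1 if eos_token_id in ids else 0 for ids in completions_ids]

def masked_token_mean(per_token_values, mask):
    """Average per-token values over the kept completions only."""
    kept = [v for values, m in zip(per_token_values, mask) if m for v in values]
    return sum(kept) / len(kept) if kept else 0.0

completions_ids = [[5, 8, 3, EOS], [7, 7, 7, 7]]  # the second one never emits EOS
per_token_loss = [[0.25, 0.5, 0.75, 0.5], [1.0, 1.0, 1.0, 1.0]]

mask = keep_mask(completions_ids)
print(mask)                                     # [1, 0]
print(masked_token_mean(per_token_loss, mask))  # 0.5
```

The truncated completion no longer influences the average, so the model is not rewarded or penalized for an answer it never got to finish.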

### 🔮 What's next for GRPO?

Are you tired of hearing about math?

One of today's challenges is to use GRPO for more diverse tasks. This is now possible with TRL, since you can use an arbitrary number of reward functions. What's missing is verifiable data covering other fields.

![](https://pbs.twimg.com/media/GmcKjsgaQAAawqj?format=jpg&name=4096x4096)