# Creating a KTO Preference dataset using Argilla and Spaces

This notebook walks through the steps involved in creating a KTO dataset using Argilla and 🤗 Spaces. It assumes you already have a dataset consisting of prompts and responses.
Using this data as a starting point, we'll set up an Argilla Space that anyone with a Hugging Face account can log in to and provide feedback on the responses generated by one or more models.

In this example we'll focus on a dataset containing prompts and responses focused on generating haiku. The same approach can be applied to any dataset where you want to collect human ratings for a set of prompts and responses. Our end goal is to produce a dataset that can be used with the [`trl`](https://github.com/huggingface/trl) library's `KTOTrainer`.

The steps we'll cover are:
- Setting up an Argilla Space
- Uploading the dataset to the Space
- Labeling the dataset
- Exporting the labeled dataset
- Formatting the labeled dataset for use with `KTOTrainer`
- Sharing the dataset to the Hub



If you are running the notebook on Google Colab, you need to install `argilla` first:

In [None]:
# %pip install argilla 

In [1]:
from huggingface_hub import duplicate_space
from huggingface_hub import hf_hub_download
from huggingface_hub import HfApi
from huggingface_hub import SpaceCard
from rich import print

## 1. Create the Argilla Hugging Face Space

To collect our preference data we'll use Argilla hosted on Hugging Face Spaces. This setup allows anyone with a Hub account to contribute to the dataset via OAuth authentication (you can also restrict access to a specific group of people if you want). The first step is to create a Space on Hugging Face Spaces. Before we do this we'll authenticate with the `huggingface_hub` library to make sure we can programmatically interact with Spaces.

In [None]:
from huggingface_hub import login
login()

### Duplicate a template Space

We'll duplicate an existing Argilla Space template. This will help us get up and running with an Argilla Space quickly. 

In [3]:
from_id = "argilla/argilla-template-space-with-oauth"
to_id = "davanstrien/haiku-preferences"
new_space = duplicate_space(from_id, to_id=to_id)
new_space

RepoUrl('https://huggingface.co/spaces/davanstrien/haiku-preferences', endpoint='https://huggingface.co', repo_type='space', repo_id='davanstrien/haiku-preferences')

We update the title of the Space to reflect the dataset we are creating. Change this to match your own dataset.

In [4]:
card = SpaceCard.load(to_id)
card.data.title = "DIBT haiku preferences"
card.push_to_hub(to_id)

CommitInfo(commit_url='https://huggingface.co/spaces/davanstrien/haiku-preferences/commit/00e3a2dbe0d0dd0845bb8e15ee9c2297330df026', commit_message='Upload README.md with huggingface_hub', commit_description='', oid='00e3a2dbe0d0dd0845bb8e15ee9c2297330df026', pr_url=None, pr_revision=None, pr_num=None)

## 2. Create an application on the Hub

To enable the OAuth integration we need to create an application on the Hub. We can do this via the Hugging Face settings UI.

- Go to this page: [https://huggingface.co/settings/applications/new](https://huggingface.co/settings/applications/new)
- Complete the form to create a new application. You will need to provide the following values:
    - Name of application
    - Homepage URL: Your Argilla Space Direct URL.
    - Logo URL: [Your Argilla Space Direct URL]/favicon.ico
    - Scopes: openid and profile.
    - Redirect URL: [Your Argilla Space Direct URL]/oauth/huggingface/callback

The cell below prints the URLs to use for these values.



In [5]:
homepage_url = f"https://{new_space.repo_id.lower().replace('/', '-')}.hf.space"
favicon_url = f"{homepage_url.lower()}/favicon.ico"
redirect_url = f"{homepage_url.lower()}/oauth/huggingface/callback"
print(f"Homepage URL: {homepage_url.lower()} \n Logo URL: {favicon_url} \n Redirect URL: {redirect_url}")
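As a self-contained illustration of how these values are derived, the sketch below wraps the same rule (lowercase the repo id, replace `/` with `-`, append `.hf.space`) in a small helper. The function name `space_oauth_urls` is made up for this example; it only mirrors the string manipulation in the cell above and does not contact the Hub.

```python
def space_oauth_urls(repo_id: str) -> dict:
    """Build the direct URL of a Space plus the OAuth form values derived from it."""
    # Same rule as the notebook cell: lowercase the repo id and swap "/" for "-".
    subdomain = repo_id.lower().replace("/", "-")
    homepage = f"https://{subdomain}.hf.space"
    return {
        "homepage": homepage,
        "logo": f"{homepage}/favicon.ico",
        "redirect": f"{homepage}/oauth/huggingface/callback",
    }


urls = space_oauth_urls("davanstrien/haiku-preferences")
for name, url in urls.items():
    print(f"{name}: {url}")
# homepage: https://davanstrien-haiku-preferences.hf.space
# logo: https://davanstrien-haiku-preferences.hf.space/favicon.ico
# redirect: https://davanstrien-haiku-preferences.hf.space/oauth/huggingface/callback
```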


## 3. Set up your Space secrets

Once the application is created, the Hub shows you its client ID and app secret. Add these values to your Space secrets:

- `OAUTH2_HUGGINGFACE_CLIENT_ID`: [Your Client ID]
- `OAUTH2_HUGGINGFACE_CLIENT_SECRET` : [Your App Secret]

Additionally, we highly recommend setting a custom API key and password for the owner role (you). The owner is the only role allowed to create, delete, read and update datasets, so it's important to change the defaults:

- `OWNER_API_KEY`: you can put any alphanumeric value
- `OWNER_PASSWORD`: at least 8 digits/characters.
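If you need values for these two secrets, one option is to generate them with Python's standard `secrets` module. This is only a sketch: the helper name and the chosen lengths (32 and 16 characters) are arbitrary choices for this example, not Argilla requirements beyond the constraints listed above.

```python
import secrets
import string

ALPHANUMERIC = string.ascii_letters + string.digits


def random_alphanumeric(length: int) -> str:
    """Return a cryptographically random string of letters and digits."""
    return "".join(secrets.choice(ALPHANUMERIC) for _ in range(length))


owner_api_key = random_alphanumeric(32)
owner_password = random_alphanumeric(16)  # comfortably above the 8-character minimum

# Print only the lengths so the values themselves don't end up in notebook output.
print(len(owner_api_key), len(owner_password))
```

Store the generated values somewhere safe (for example a password manager or a local `.env` file), since you will need the API key again when connecting to the Space from the SDK.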

You can add these secrets via the settings page of your Space. 

![secrets](assets/secrets.png)

The cell below gives the URL of that settings page.

In [8]:
f"{new_space.url}/settings"

'https://huggingface.co/spaces/davanstrien/haiku-preferences/settings'
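If you would rather not click through the UI, the secrets can also be set from code. The sketch below assumes `api` is a `huggingface_hub.HfApi` instance and relies on its `add_space_secret` method; the helper `add_space_secrets` and the `DryRunApi` stand-in are made up for this example so the block runs without touching the Hub, and the secret values shown are placeholders.

```python
def add_space_secrets(api, repo_id: str, space_secrets: dict) -> None:
    """Store each key/value pair as a secret on the given Space.

    `api` is expected to behave like `huggingface_hub.HfApi`.
    """
    for key, value in space_secrets.items():
        api.add_space_secret(repo_id=repo_id, key=key, value=value)


class DryRunApi:
    """Stand-in that records calls instead of contacting the Hub."""

    def __init__(self):
        self.calls = []

    def add_space_secret(self, repo_id, key, value):
        # Record only the key so secret values never end up in logs.
        self.calls.append((repo_id, key))


dry_run = DryRunApi()
add_space_secrets(
    dry_run,
    "davanstrien/haiku-preferences",
    {
        "OAUTH2_HUGGINGFACE_CLIENT_ID": "<your client id>",
        "OAUTH2_HUGGINGFACE_CLIENT_SECRET": "<your app secret>",
        "OWNER_API_KEY": "<your owner api key>",
        "OWNER_PASSWORD": "<your owner password>",
    },
)
print([key for _, key in dry_run.calls])
```

To apply this for real, pass `HfApi()` instead of `DryRunApi()` and replace the placeholders with your own values, ideally loaded from environment variables rather than typed into the notebook.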

## 4. Persistent Storage + Upgrade CPU

To ensure all annotations are safely stored we'll want to enable persistent storage on our Space. This means that if the Space is stopped and restarted, all annotations will still be available. Additionally, we'll upgrade the CPU and disable sleeping to ensure the Space is always available for annotators!

![storage](assets/storage.png)
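The same settings can be requested from code instead of the UI. The sketch below assumes `api` is a `huggingface_hub.HfApi` instance and uses its `request_space_hardware`, `request_space_storage` and `set_space_sleep_time` methods. The helper name, the `DryRunApi` stand-in and the default values are choices made for this example: `cpu-upgrade` and `small` match the runtime output shown later in this notebook, and a `sleep_time` of `-1` is the value documented by `huggingface_hub` for "never sleep". Upgraded hardware and persistent storage are paid features, so check the pricing before running this against a real Space.

```python
def configure_space_runtime(
    api,
    repo_id: str,
    hardware: str = "cpu-upgrade",
    storage: str = "small",
    sleep_time: int = -1,
) -> None:
    """Request upgraded hardware, persistent storage and a sleep policy for a Space.

    `api` is expected to behave like `huggingface_hub.HfApi`.
    """
    api.request_space_hardware(repo_id=repo_id, hardware=hardware)
    api.request_space_storage(repo_id=repo_id, storage=storage)
    api.set_space_sleep_time(repo_id=repo_id, sleep_time=sleep_time)


class DryRunApi:
    """Stand-in that records the requested settings instead of contacting the Hub."""

    def __init__(self):
        self.calls = []

    def request_space_hardware(self, repo_id, hardware):
        self.calls.append(("hardware", hardware))

    def request_space_storage(self, repo_id, storage):
        self.calls.append(("storage", storage))

    def set_space_sleep_time(self, repo_id, sleep_time):
        self.calls.append(("sleep_time", sleep_time))


dry_run = DryRunApi()
configure_space_runtime(dry_run, "davanstrien/haiku-preferences")
print(dry_run.calls)
# [('hardware', 'cpu-upgrade'), ('storage', 'small'), ('sleep_time', -1)]
```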

We now need to factory reboot the Space so that all of the above changes take effect.

In [9]:
from huggingface_hub import restart_space

restart_space(to_id, factory_reboot=True)

SpaceRuntime(stage='RUNNING_BUILDING', hardware='cpu-basic', requested_hardware='cpu-upgrade', sleep_time=172800, storage='small', raw={'stage': 'RUNNING_BUILDING', 'hardware': {'current': 'cpu-basic', 'requested': 'cpu-upgrade'}, 'storage': 'small', 'gcTimeout': 172800, 'replicas': {'current': 1, 'requested': 1}, 'devMode': False})

## 5. Testing your Space

At this point you are ready to verify the installation. Go to the following Space URL:

In [11]:
f"https://huggingface.co/spaces/{to_id}"

'https://huggingface.co/spaces/davanstrien/haiku-preferences'

You should see something like this:
![](assets/space.png)

If you don't see the Sign in with Hugging Face button, go back to Steps 2 and 3 and make sure the OAuth app is correctly set up (in particular, that the callback URL is correct) and that the secrets are correct.

The next step is to test the sign-in. You should see something like this:


![Access page](assets/access.png)

If you see an error after authorizing, please double-check the callback URL in your OAuth application settings at https://huggingface.co/settings/connected-applications

If you are still having issues feel free to reach out on Discord. 

## 6. Loading our data into the Argilla Space

First we need to set up the Argilla SDK client with the URL and owner credentials for our Space. I'm using the `python-dotenv` library to load the secrets from a `.env` file, but you can also add these directly to the notebook.


In [13]:
import argilla as rg
from dotenv import load_dotenv
import os

load_dotenv()
OWNER_API_KEY = os.getenv("ARGILLA_KEY")

assert (
    OWNER_API_KEY is not None
), "Please set ARGILLA_KEY (e.g. in your .env file) to the OWNER_API_KEY you set in the Space settings"

rg.init(api_url=homepage_url, api_key=OWNER_API_KEY, workspace="admin")
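An alternative to the `assert` above is a small helper that raises a descriptive error when a variable is missing. This is a sketch only: `require_env` is a made-up name, and the variable used in the demo is deliberately one that should not exist. Unlike `assert`, an explicit exception is still raised when Python runs with the `-O` flag, which strips assertions.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; "
            "add it to your .env file or export it in your shell."
        )
    return value


# Demonstrate the failure path with a variable that should not exist.
try:
    require_env("SOME_UNSET_VARIABLE_FOR_DEMO")
except RuntimeError as err:
    print(err)
```

In the notebook this could be used as `OWNER_API_KEY = require_env("ARGILLA_KEY")` after calling `load_dotenv()`.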

Finally, we're ready to create our dataset in the admin workspace. At this point we need to grab the data we want to collect human preferences for. The steps below will vary depending on the data you're working with, so we give some pointers on things you may want to consider.

We already have a dataset which contains a prompt and three completions per prompt. We will use this dataset to get human preferences.

In [14]:
from datasets import load_dataset

# If the dataset is gated/private, make sure you have run huggingface-cli login
dataset = load_dataset("davanstrien/haiku_dpo", "aesthetic-preference", split='train')

Let's take a look at what a row looks like

In [15]:
dataset[0]

{'input': 'Can you compose a haiku about the serenity of mountain peaks?',
 'generation_model': ['mistralai/Mistral-7B-Instruct-v0.2',
  'meta-llama/Llama-2-70b-chat-hf',
  'NousResearch/Nous-Hermes-2-Yi-34B'],
 'generation_prompt': ['<s>[INST] <<SYS>>\nYou are a poet specialising in creating Haiku. \nYour haiku consist of three lines, with five syllables in the first line, seven in the second, and five in the third.\nBeyond being technically correct, your haiku should also be beautiful and meaningful. \nYou respond only with a haiku. You do not add anything else to your responses. \n\n<</SYS>>\n\nCan you compose a haiku about the serenity of mountain peaks? [/INST]',
  '<s>[INST] <<SYS>>\nYou are a poet specialising in creating Haiku. \nYour haiku consist of three lines, with five syllables in the first line, seven in the second, and five in the third.\nBeyond being technically correct, your haiku should also be beautiful and meaningful. \nYou respond only with a haiku. You do not add

As you can see we have one input prompt, some metadata about the models used for each generation and the three completions. We will use this data to get human preferences. 

### Defining the task

We'll use the Argilla SDK to define the task and set up our annotations and dataset. We'll use Argilla's [`Feedback Dataset`](https://docs.argilla.io/en/latest/practical_guides/create_update_dataset/create_dataset.html#feedback-dataset). A `Feedback Dataset` comes with different [task templates](https://docs.argilla.io/en/latest/practical_guides/create_update_dataset/create_dataset.html#task-templates), which give you a starting point for different tasks you might want to gather data for. In this case we'll use the `for_text_classification` task template. It is designed for text classification tasks, which is very close to what we're doing when collecting KTO data, so it's a good starting point.

We'll create some very short guidelines for the annotators to follow. If you are collecting a KTO dataset for a task with a lot of nuance, you might want to extend these guidelines to be more detailed.

In [16]:
guidelines = """
Do you like this haiku? 
Yes or no? 
A vibes only assessment is fine!"""

When using the `for_text_classification` template we need to provide the labels we're using, in our case we use `Yes` or `No` to indicate our binary preference. This will be converted to a `bool` value once we parse the dataset later. 


In [17]:
argilla_ds = rg.FeedbackDataset.for_text_classification(
    labels=["Yes", "No"],
    use_markdown=True,
    guidelines=guidelines,
    metadata_properties=None,
    vectors_settings=None,
)


We get back a `FeedbackDataset` object, to which we can add our data. We can also continue to modify the formatting of our task.

In [18]:
argilla_ds

FeedbackDataset(
   fields=[TextField(name='text', title='Text', required=True, type='text', use_markdown=True)]
   questions=[LabelQuestion(name='label', title='Label', description='Classify the text by selecting the correct label from the given list of labels.', required=True, type='label_selection', labels=['Yes', 'No'], visible_labels=None)]
   guidelines=
   Do you like this haiku? 
   Yes or no? 
   A vibes only assessment is fine!)
   metadata_properties=[])
   vectors_settings=[])
)

One thing we might want to change is the title of the question, to make it clearer to the annotators what they are being asked.

In [19]:
argilla_ds.questions[0].title = "Do you like this haiku?"

The `fields` are shown in the UI to the annotators. Again we can change the title (what's shown to the annotators) and the name (how the field is tracked in the dataset) to make things easier for us later.

In [20]:
argilla_ds.fields[0].title = "Haiku"
argilla_ds.fields[0].name = "completion"

While most text classification tasks have a single text field that is classified, in our case we probably want to show the prompt to the annotators so they can rate the completion in the context of the prompt. For a `FeedbackDataset` the fields are shown in the order in which they appear in the `fields` attribute. To show the prompt first, we insert it as a `TextField` at the start of the `fields` list.

In [21]:
argilla_ds.fields.insert(
    0,
    rg.TextField(name="prompt", title="Haiku prompt", required=True, use_markdown=True),
)

In [22]:
argilla_ds

FeedbackDataset(
   fields=[TextField(name='prompt', title='Haiku prompt', required=True, type='text', use_markdown=True), TextField(name='completion', title='Haiku', required=True, type='text', use_markdown=True)]
   questions=[LabelQuestion(name='label', title='Do you like this haiku?', description='Classify the text by selecting the correct label from the given list of labels.', required=True, type='label_selection', labels=['Yes', 'No'], visible_labels=None)]
   guidelines=
   Do you like this haiku? 
   Yes or no? 
   A vibes only assessment is fine!)
   metadata_properties=[])
   vectors_settings=[])
)

### Loading the data

We can now load our data into the `FeedbackDataset`. We do this by creating a list of all the records (data points) we want to add. Each item in this list will be a `rg.FeedbackRecord` object. We need to pass in the expected fields (as defined above). We can also add some metadata to each record. This metadata won't be shown to the annotators, but will be stored with the record. This can be particularly helpful for tracking the source of the generations, i.e. which model was used to generate a completion. We may later want to use this metadata to filter the data or to compare the performance of different models.

#### Filtering the data

Often we want to show all of the data to the annotators, but sometimes we might want to filter the data. In our case, since we expect haiku to be three lines long we can define a simple filter so we don't show annotators any completions that are not three lines long.


In [23]:
def is_three_lines(haiku):
    return len(haiku.split("\n")) == 3
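One thing to be aware of: because this check splits on `"\n"` before any stripping, a haiku with a trailing newline counts as four "lines" and is filtered out. The sketch below contrasts the notebook's check with a more lenient variant; `is_three_lines_lenient` is a made-up helper for illustration, and whether you want the stricter or the more lenient behaviour depends on your data. Note that the lenient version also accepts haiku with blank lines between verses.

```python
def is_three_lines(haiku: str) -> bool:
    """The notebook's check: exactly three newline-separated parts."""
    return len(haiku.split("\n")) == 3


def is_three_lines_lenient(haiku: str) -> bool:
    """Count only non-empty lines, ignoring leading/trailing whitespace."""
    lines = [line for line in haiku.strip().splitlines() if line.strip()]
    return len(lines) == 3


sample = "Softly falls the snow\nBlanketing all in white peace\nWinter's gentle hush\n"

print(is_three_lines(sample))          # False: the trailing newline adds an empty fourth part
print(is_three_lines_lenient(sample))  # True
```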

We can now create our records. We loop through all the rows in our dataset and, for each row, through all of its generations (remember that in this example we have three generations per prompt). For each generation that passes the filter we create a `FeedbackRecord` containing the prompt and the completion, along with some metadata about the model used to generate the completion.

In [24]:
# create records
records = []
for row in dataset:
    for generation_model, generation in zip(
        row["generation_model"], row["generations"]
    ):
        if is_three_lines(generation):
            prompt = row["input"]
            metadata = {"prompt": prompt, "generation_model": generation_model}
            record = rg.FeedbackRecord(
                fields={"prompt": prompt, "completion": generation.strip()},
                metadata=metadata,
            )
            records.append(record)
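To see what this loop does without needing Argilla or the real dataset, here is the same flattening logic applied to a single made-up row, using plain dictionaries in place of `rg.FeedbackRecord`. The row contents and model names are invented for illustration.

```python
def is_three_lines(haiku):
    return len(haiku.split("\n")) == 3


def flatten_row(row):
    """Turn one row with several generations into one dict per valid generation."""
    flat = []
    for model, generation in zip(row["generation_model"], row["generations"]):
        if is_three_lines(generation):
            flat.append(
                {
                    "fields": {"prompt": row["input"], "completion": generation.strip()},
                    "metadata": {"prompt": row["input"], "generation_model": model},
                }
            )
    return flat


toy_row = {
    "input": "Write a haiku about rain",
    "generation_model": ["model-a", "model-b", "model-c"],
    "generations": ["one\ntwo\nthree", "only one line", "a\nb\nc"],
}

flat = flatten_row(toy_row)
print(len(flat))  # 2: the single-line generation from model-b is dropped
print([item["metadata"]["generation_model"] for item in flat])  # ['model-a', 'model-c']
```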

If we look at one of the records we can see the prompt and the completion. We can also see the metadata we added to the record.

In [25]:
print(records[0])

Since there will be up to three generations per prompt, we shuffle the records so that annotators don't see too many generations for the same prompt in a row (you can skip this step if you only have one generation per prompt).

In [26]:
import random

random.shuffle(records)
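If you want the order to be reproducible between runs (for example, to re-create the same upload later), one option is to shuffle with a dedicated, seeded random generator. This is a sketch on made-up placeholder records; the helper name and the seed value `42` are arbitrary.

```python
import random


def seeded_shuffle(items, seed=42):
    """Return a shuffled copy of `items`; the same seed gives the same order."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Placeholder records: three prompts with three generations each.
toy_records = [f"prompt-{p}/generation-{g}" for p in range(3) for g in range(3)]

first = seeded_shuffle(toy_records)
second = seeded_shuffle(toy_records)
print(first == second)                        # True: same seed, same order
print(sorted(first) == sorted(toy_records))   # True: nothing lost or duplicated
```

Using a separate `random.Random` instance also avoids changing the global random state used elsewhere in the notebook.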


We now add the records to our `FeedbackDataset` using the `add_records` method.

In [27]:
argilla_ds.add_records(records)

We can now use the `push_to_argilla` method to push the dataset to the Argilla Space. This will make the dataset available to the annotators. We need to give a name to our task in the `push_to_argilla` method. This name will be used to identify the task in the Argilla Space.

In [None]:
# push the dataset to Argilla
argilla_ds.push_to_argilla("haiku-preference", workspace="admin")

When you are logged in to your Argilla Space you should see the dataset available

![dataset](assets/datasets.png)

Clicking on the dataset will show you the annotation UI

![task](assets/task.png)

### Gather a community and start collecting preferences!

You can now share the link to your Space with your community and start collecting preferences! We're excited to see what kinds of dataset people choose to build, so please feel free to share your Space with us on Discord. If you share on Twitter or other social media, please tag us so we can help promote your task!

## 7. Loading our annotated data

Once we have collected our preferences we can load the data back into the notebook. We can then use this data to train a model using the `KTOTrainer` from the `trl` library. If you come back to this notebook in a later session, you may need to uncomment and run the cell below to authenticate with your Argilla Space again.

In [None]:
# import argilla as rg
# from dotenv import load_dotenv
# import os

# load_dotenv()
# OWNER_API_KEY = os.getenv("ARGILLA_KEY")
# homepage_url = None
# assert homepage_url is not None, "Please set homepage_url to the URL of the Space you created"
# assert (
#     OWNER_API_KEY is not None
# ), "Please set OWNER_API_KEY to the API token you just set in the Space settings"

# rg.init(api_url=homepage_url, api_key=OWNER_API_KEY, workspace="admin")

We can grab data back from our Argilla Space by using the `FeedbackDataset`'s `from_argilla` method. We need to pass in the name of the dataset we want to load as well as the workspace. 

In [30]:
argilla_ds = rg.FeedbackDataset.from_argilla("haiku-preference", workspace="admin")
argilla_ds

RemoteFeedbackDataset(
   id=ded71479-9170-4b6e-8de6-5bb1d27e49ac
   name=haiku-preference
   workspace=Workspace(id=b39093b2-d11e-4794-b7e2-5f6547ff2dc9, name=admin, inserted_at=2024-03-14 14:58:19.503243, updated_at=2024-03-14 14:58:19.503243)
   url=https://davanstrien-haiku-preferences.hf.space/dataset/ded71479-9170-4b6e-8de6-5bb1d27e49ac/annotation-mode
   fields=[RemoteTextField(id=UUID('b37cab6c-350d-4fe4-aa08-54e4c84b673f'), client=None, name='prompt', title='Haiku prompt', required=True, type='text', use_markdown=True), RemoteTextField(id=UUID('5eeff5ab-fdb6-4f04-b0ce-c154b137024b'), client=None, name='completion', title='Haiku', required=True, type='text', use_markdown=True)]
   questions=[RemoteLabelQuestion(id=UUID('802abc40-2dbd-48f4-80c9-47d1af685280'), client=None, name='label', title='Do you like this haiku?', description=None, required=True, type='label_selection', labels=['Yes', 'No'], visible_labels=None)]
   guidelines=
   Do you like this haiku? 
   Yes or no? 
   

We can push the raw annotations from our notebook to the Hugging Face Hub as a dataset. We'll put this in a `raw-argilla` dataset. This will allow us to share the raw annotations with others.

In [None]:
argilla_ds.push_to_huggingface("davanstrien/haiku-kto-raw-argilla")

You'll see when we push the dataset to the Hub that Argilla autogenerates a nice dataset card for us! 

At the moment our dataset contains all of the data including rows without any annotations. We also want to format things a bit differently for use with the `KTOTrainer`. We'll do this in the next section.

## 8. Formatting the labeled dataset for use with `KTOTrainer`

We can format our `RemoteFeedbackDataset` as a Hugging Face dataset. 

In [53]:
dataset = argilla_ds.format_as("datasets")

We'll see this is the same number of rows as the records we uploaded. We'll also see that we have the columns we'd expect based on our `fields` definition, as well as some additional columns that track metadata for our data. 

In [54]:
dataset

Dataset({
    features: ['prompt', 'completion', 'label', 'label-suggestion', 'label-suggestion-metadata', 'external_id', 'metadata'],
    num_rows: 3952
})

If we look at a single example, we can get a better sense of our data. 

In [56]:
dataset[0]

{'prompt': 'Can you write a haiku that describes the danger of an iceberg?',
 'completion': 'Iceberg, silent threat\nDeceptive beauty, hidden\nSinking ships, cold death',
 'label': [],
 'label-suggestion': None,
 'label-suggestion-metadata': {'type': None, 'score': None, 'agent': None},
 'external_id': None,
 'metadata': '{"prompt": "Can you write a haiku that describes the danger of an iceberg?", "generation_model": "NousResearch/Nous-Hermes-2-Yi-34B"}'}

Since we want to make sure we have a preference for each completion, we filter out any rows that don't have any labels.

In [None]:
dataset = dataset.filter(lambda x: len(x['label']) > 0)

In [58]:
dataset

Dataset({
    features: ['prompt', 'completion', 'label', 'label-suggestion', 'label-suggestion-metadata', 'external_id', 'metadata'],
    num_rows: 11
})

With the way we've set up our task most rows will have a single annotation, but we may sometimes have overlap. There are different ways of dealing with this. If we were collecting ratings we could take an average, but since KTO expects a binary preference this doesn't really work. One approach when we have more than one label is to take a majority vote (this works most cleanly with an odd number of annotators for each row).

However, intuitively we probably want fairly "strong" preferences in our dataset. If we have a generation where annotators disagree, this might not be a good data point to use for preference training. Another approach is therefore to keep only the rows where all annotators agree. This is the approach we'll show here, but there is also a code snippet to take a majority vote if you want to try that approach.

To ensure good compatibility with the `KTOTrainer` we'll use boolean values for our labels.

In [48]:
def is_perfect_agreement(row):
    labels = row.get("label")
    values = (label["value"] for label in labels)
    return len(set(values)) == 1

dataset = dataset.filter(is_perfect_agreement)
dataset

Dataset({
    features: ['prompt', 'completion', 'label', 'label-suggestion', 'label-suggestion-metadata', 'external_id', 'metadata'],
    num_rows: 10
})
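To make the effect of this filter concrete, here is the same agreement check applied to a few made-up rows, using plain dictionaries instead of a `datasets.Dataset`. Only the `value` key of each annotation is used; any other keys in the real export are ignored by the check and omitted here.

```python
def is_perfect_agreement(row):
    """True when the row has at least one label and all labels share one value."""
    values = {label["value"] for label in row["label"]}
    return len(values) == 1


toy_rows = [
    {"completion": "haiku A", "label": [{"value": "Yes"}]},
    {"completion": "haiku B", "label": [{"value": "Yes"}, {"value": "No"}]},
    {"completion": "haiku C", "label": [{"value": "No"}, {"value": "No"}]},
    {"completion": "haiku D", "label": []},
]

kept = [row["completion"] for row in toy_rows if is_perfect_agreement(row)]
print(kept)  # ['haiku A', 'haiku C']
```

Rows with conflicting labels (haiku B) and rows without labels (haiku D) are both dropped, since neither yields exactly one distinct value.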

In [49]:
def format_label(row):
    label = row.get("label", None)
    return {"label": label[0].get("value") == "Yes"}

In [80]:
dataset = dataset.map(format_label)
dataset[0]

{'prompt': "Can you compose a haiku about the beauty of winter's first snow?",
 'completion': "Softly falls the snow\nBlanketing all in white peace\nWinter's gentle hush",
 'label': True,
 'label-suggestion': None,
 'label-suggestion-metadata': {'type': None, 'score': None, 'agent': None},
 'external_id': None,
 'metadata': '{"prompt": "Can you compose a haiku about the beauty of winter\'s first snow?", "generation_model": "meta-llama/Llama-2-70b-chat-hf"}'}
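Once the labels are booleans, it can be useful to check how many liked and disliked examples you have, since a heavily skewed dataset may call for adjusting how the two classes are weighted during training (the `trl` KTO configuration has options for this; check the current `trl` documentation for details). The sketch below counts labels on a made-up list; `label_balance` is a hypothetical helper name.

```python
from collections import Counter


def label_balance(labels):
    """Count how many labels are True (liked) and False (disliked)."""
    counts = Counter(bool(label) for label in labels)
    return {"desirable": counts[True], "undesirable": counts[False]}


toy_labels = [True, True, False, True, False]
print(label_balance(toy_labels))  # {'desirable': 3, 'undesirable': 2}
```

With the dataset from this notebook, the equivalent call would be `label_balance(dataset["label"])`.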

If you want to play around with other approaches you can modify the code below.

In [78]:
# from collections import Counter

# def get_majority_label_and_discard_no_majority(row):
#     labels = row.get("label")
#     values = [label["value"] for label in labels]
#     # rank the labels from most to least common
#     counts = Counter(values).most_common()
#     # a tie between the top two labels means there is no majority;
#     # rows with a label of None can be filtered out afterwards
#     if len(counts) > 1 and counts[0][1] == counts[1][1]:
#         return {"label": None}
#     # otherwise the most common label wins (this also covers a single annotation)
#     return {"label": counts[0][0] == "Yes"}
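As a runnable illustration of the majority-vote idea, here is a sketch that works on a plain list of `"Yes"`/`"No"` values rather than on a dataset row. The function name `majority_label` is made up for this example; it returns `None` for ties and for empty input so that those cases can be filtered out afterwards.

```python
from collections import Counter


def majority_label(values):
    """Return True/False for a clear Yes/No majority, or None for a tie or no labels."""
    if not values:
        return None
    ranked = Counter(values).most_common()
    # A tie between the two most common values means there is no majority.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0] == "Yes"


print(majority_label(["Yes"]))               # True
print(majority_label(["Yes", "No", "Yes"]))  # True
print(majority_label(["Yes", "No"]))         # None
print(majority_label(["No", "No", "No"]))    # False
```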


### Format as messages

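We'll also store each prompt/completion pair as a list of chat messages, alongside the `prompt`, `completion` and `label` columns. This conversational format makes it easy to apply a model's chat template when preparing the data for the `KTOTrainer`.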


In [81]:
def formatted_as_messages(row):
    prompt = row["prompt"]
    completion = row["completion"]
    return [{"role": "user", "content": prompt}, {"role": "assistant", "content": completion}]


In [82]:
def create_messages_column(row):
    return {"messages": formatted_as_messages(row)}

In [None]:
dataset = dataset.map(create_messages_column)

In [86]:
dataset[0]

{'prompt': "Can you compose a haiku about the beauty of winter's first snow?",
 'completion': "Softly falls the snow\nBlanketing all in white peace\nWinter's gentle hush",
 'label': True,
 'label-suggestion': None,
 'label-suggestion-metadata': {'type': None, 'score': None, 'agent': None},
 'external_id': None,
 'metadata': '{"prompt": "Can you compose a haiku about the beauty of winter\'s first snow?", "generation_model": "meta-llama/Llama-2-70b-chat-hf"}',
 'messages': [{'content': "Can you compose a haiku about the beauty of winter's first snow?",
   'role': 'user'},
  {'content': "Softly falls the snow\nBlanketing all in white peace\nWinter's gentle hush",
   'role': 'assistant'}]}
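As the example row shows, the `metadata` column comes back as a JSON string. If you want to compare models, as suggested when the metadata was added, you can parse it with `json.loads`. The sketch below computes the share of liked completions per model on made-up rows; `approval_by_model` and the model names are hypothetical.

```python
import json
from collections import defaultdict


def approval_by_model(rows):
    """Return the fraction of liked completions for each generation model."""
    totals = defaultdict(lambda: [0, 0])  # model -> [liked, total]
    for row in rows:
        model = json.loads(row["metadata"])["generation_model"]
        totals[model][0] += int(row["label"])
        totals[model][1] += 1
    return {model: liked / total for model, (liked, total) in totals.items()}


toy_rows = [
    {"label": True, "metadata": '{"generation_model": "model-a"}'},
    {"label": False, "metadata": '{"generation_model": "model-a"}'},
    {"label": True, "metadata": '{"generation_model": "model-b"}'},
]

print(approval_by_model(toy_rows))  # {'model-a': 0.5, 'model-b': 1.0}
```

Because iterating over a Hugging Face dataset yields one dictionary per row, the same function should also work when called as `approval_by_model(dataset)`. With only a handful of annotations per model the numbers are not meaningful, so treat them as a sanity check rather than an evaluation.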

We can now push this dataset to the Hub using the `push_to_hub` method! 

In [None]:
dataset.push_to_hub("davanstrien/haiku_kto")