# 01. Creating our subsample of Aya to prepare for creating a DPO dataset

This notebook walks through creating a sample of the full Aya dataset for the language you want to work in.
In this and the subsequent notebooks we'll use Dutch as the example, but the process should be similar for other languages.

In [1]:
from collections import Counter
from datasets import Dataset
from datasets import load_dataset
from statistics import mean, median

Let's start by loading the Aya dataset!

In [2]:
aya_ds = load_dataset("CohereForAI/aya_dataset",split='train')

In [3]:
aya_ds

Dataset({
    features: ['inputs', 'targets', 'language', 'language_code', 'annotation_type', 'user_id'],
    num_rows: 202362
})

We only want the data for the language we are interested in, so we filter the dataset to keep just the Dutch rows.

In [4]:
dutch_only = aya_ds.filter(lambda x: x['language'] == 'Dutch')
dutch_only

Dataset({
    features: ['inputs', 'targets', 'language', 'language_code', 'annotation_type', 'user_id'],
    num_rows: 1733
})
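If you are adapting this notebook to another language, it may help to first check how that language is spelled in the `language` column and how many rows it has. The sketch below uses a few made-up rows standing in for `aya_ds`, and `count_languages` is a hypothetical helper; with the real dataset, `Counter(aya_ds["language"])` should give the same kind of summary.

```python
from collections import Counter

# Toy rows standing in for aya_ds; the values are illustrative only.
toy_rows = [
    {"inputs": "Hoe gaat het?", "language": "Dutch"},
    {"inputs": "How are you?", "language": "English"},
    {"inputs": "Wat is de hoofdstad van Nederland?", "language": "Dutch"},
]


def count_languages(rows):
    """Count how many rows exist for each value of the 'language' field."""
    return Counter(row["language"] for row in rows)


language_counts = count_languages(toy_rows)

# Print languages from most to least frequent.
for language, n_rows in language_counts.most_common():
    print(f"{language}: {n_rows}")
```

The exact string printed here is the value to use in the `filter` call above.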

### Getting some statistics about the data

To help with the next stages of this process we'll get some statistics about the data. 

In [5]:
def get_stats(ds: Dataset):
    """Return character-length and annotator statistics for a dataset."""
    input_lengths = []
    output_lengths = []
    annotator_counts: Counter = Counter()
    for row in ds:
        input_lengths.append(len(row["inputs"]))
        output_lengths.append(len(row["targets"]))
    annotator_counts.update(ds["user_id"])
    mean_input_length = mean(input_lengths)
    median_input_length = median(input_lengths)
    mean_output_length = mean(output_lengths)
    median_output_length = median(output_lengths)
    max_input_length = max(input_lengths)
    max_output_length = max(output_lengths)
    return {
        "number_of_unique_annotators": len(annotator_counts),
        "input_lengths": input_lengths,
        "output_lengths": output_lengths,
        "annotator_counts": dict(annotator_counts),
        "mean_input_length": mean_input_length,
        "median_input_length": median_input_length,
        "mean_output_length": mean_output_length,
        "median_output_length": median_output_length,
        "max_input_length": max_input_length,
        "max_output_length": max_output_length,
    }

In [6]:
stats = get_stats(dutch_only)

There are various things we might be interested in from these stats, but some of the most relevant are the lengths of the inputs and outputs. Note that these lengths are measured in characters (`get_stats` calls `len` on each string), not tokens. They may help us decide which LLMs to use in the next stage of the process.

In [7]:
print(f"Max input length: {stats['max_input_length']}")
print(f"Max output length: {stats['max_output_length']}")
print(f"Mean input length: {stats['mean_input_length']}")
print(f"Mean output length: {stats['mean_output_length']}")

Max input length: 3030
Max output length: 21707
Mean input length: 223.67109059434506
Mean output length: 352.1806116560877
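The maximum output length (21707) is far above the mean (about 352), which suggests that a small number of very long rows dominate the maximum. Percentiles can give a better picture of typical lengths. The sketch below uses `statistics.quantiles` on a toy list; `length_percentiles` is a hypothetical helper, and with the real data you would pass `stats["output_lengths"]` instead of `toy_lengths`.

```python
from statistics import quantiles


def length_percentiles(lengths, points=(50, 90, 95, 99)):
    """Return the requested percentiles of a list of lengths.

    quantiles(..., n=100) returns 99 cut points; cut point p - 1 is the
    p-th percentile.
    """
    cuts = quantiles(lengths, n=100, method="inclusive")
    return {p: cuts[p - 1] for p in points}


# Toy data standing in for stats["output_lengths"].
toy_lengths = list(range(1, 101))
percentiles = length_percentiles(toy_lengths)

for p, value in percentiles.items():
    print(f"p{p}: {value}")
```

If the 95th or 99th percentile is much lower than the maximum, a model with a shorter context window may still cover most of the rows.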


## Push the subset to the Hub 

To make testing our pipelines easier, we'll create a small test split (100 samples) that we can use while trying them out.

In [8]:
dutch_only = dutch_only.train_test_split(test_size=100)

We'll now push this subset to the Hub so that we can use it in the next stage of the process. Don't forget to update the repository ID to point to your own Hub workspace. If you are not already authenticated on the Hub, uncomment the cell below and run it.


In [9]:
# from huggingface_hub import login 
# login()

In [None]:
dutch_only.push_to_hub("data-is-better-together/aya_dataset_dutch_example")