cookbook-efforts/dpo-orpo-preference/01_data_prep.ipynb

{ "cells": [
{ "cell_type": "markdown", "metadata": {}, "source": [ "# 01. Creating our subsample of Aya to prepare for creating a DPO dataset" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "This notebook walks through the steps required to create a sample from the full Aya dataset for the language you are interested in working with.\n", "In this notebook and the subsequent notebooks we'll focus on Dutch as an example, but the process will be very similar for other languages." ] },
{ "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from collections import Counter\n", "from datasets import Dataset\n", "from datasets import load_dataset\n", "from statistics import mean, median" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's start by loading the Aya dataset!" ] },
{ "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "aya_ds = load_dataset(\"CohereForAI/aya_dataset\", split=\"train\")" ] },
{ "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Dataset({\n", " features: ['inputs', 'targets', 'language', 'language_code', 'annotation_type', 'user_id'],\n", " num_rows: 202362\n", "})" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "aya_ds" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We only want to include the data that is relevant to the language we are interested in, so we filter out the rows that are not in Dutch." ] },
{ "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Dataset({\n", " features: ['inputs', 'targets', 'language', 'language_code', 'annotation_type', 'user_id'],\n", " num_rows: 1733\n", "})" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dutch_only = aya_ds.filter(lambda x: x['language'] == 'Dutch')\n", "dutch_only" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Getting some statistics about the data\n", "\n", "To help with the next stages of this process, we'll gather some statistics about the data." ] },
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def get_stats(ds: Dataset):\n", "    input_lengths = []\n", "    output_lengths = []\n", "    annotator_counts: Counter = Counter()\n", "    for row in ds:\n", "        input_lengths.append(len(row[\"inputs\"]))\n", "        output_lengths.append(len(row[\"targets\"]))\n", "        annotator_counts[row[\"user_id\"]] += 1\n", "    mean_input_length = mean(input_lengths)\n", "    median_input_length = median(input_lengths)\n", "    mean_output_length = mean(output_lengths)\n", "    median_output_length = median(output_lengths)\n", "    max_input_length = max(input_lengths)\n", "    max_output_length = max(output_lengths)\n", "    return {\n", "        \"number_of_unique_annotators\": len(annotator_counts),\n", "        \"input_lengths\": input_lengths,\n", "        \"output_lengths\": output_lengths,\n", "        \"annotator_counts\": dict(annotator_counts),\n", "        \"mean_input_length\": mean_input_length,\n", "        \"median_input_length\": median_input_length,\n", "        \"mean_output_length\": mean_output_length,\n", "        \"median_output_length\": median_output_length,\n", "        \"max_input_length\": max_input_length,\n", "        \"max_output_length\": max_output_length,\n", "    }" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "stats = get_stats(dutch_only)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "There are various things we might be interested in from these stats, but some of the most relevant are the lengths of the inputs and outputs. These may help us decide which LLMs to use in the next stage of the process." ] },
{ "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Max input length: 3030\n", "Max output length: 21707\n", "Mean input length: 223.67109059434506\n", "Mean output length: 352.1806116560877\n" ] } ], "source": [ "print(f\"Max input length: {stats['max_input_length']}\")\n", "print(f\"Max output length: {stats['max_output_length']}\")\n", "print(f\"Mean input length: {stats['mean_input_length']}\")\n", "print(f\"Mean output length: {stats['mean_output_length']}\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Push the subset to the Hub\n", "\n", "To make testing our pipelines easier, we'll also create a small test split (100 samples, via `test_size=100` below) that we can use while trying out our pipelines." ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "dutch_only = dutch_only.train_test_split(test_size=100)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We'll now push this subset to the Hub so that we can use it in the next stage of the process. Don't forget to update the repository ID to point to your own Hub workspace. If you are not already authenticated on the Hub, uncomment the cell below and run it." ] },
{ "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# from huggingface_hub import login\n", "# login()" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dutch_only.push_to_hub(\"data-is-better-together/aya_dataset_dutch_example\")" ] }
], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.1" } }, "nbformat": 4, "nbformat_minor": 2 }