
{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Clean an Existing Preference Dataset with LLMs as Judges\n", "\n", "_Authored by: [David Berenstein](https://huggingface.co/davidberenstein1957) and [Sara Han Díaz](https://huggingface.co/sdiazlor)_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- **Libraries**: [argilla](https://github.com/argilla-io/argilla), [hf-inference-endpoints](https://github.com/huggingface/huggingface_hub)\n", "- **Components**: [LoadDataFromDicts](https://distilabel.argilla.io/dev/components-gallery/steps/loaddatafromdicts/), [UltraFeedback](https://distilabel.argilla.io/latest/components-gallery/tasks/ultrafeedback/), [KeepColumns](https://distilabel.argilla.io/latest/components-gallery/steps/groupcolumns/), [PreferenceToArgilla](https://distilabel.argilla.io/latest/components-gallery/steps/textgenerationtoargilla/), [InferenceEndpointsLLM](https://distilabel.argilla.io/latest/components-gallery/llms/inferenceendpointsllm/), [GlobalStep](https://distilabel.argilla.io/latest/sections/how_to_guides/basic/step/global_step/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial, we'll use distilabel to clean a dataset using the LLMs as judges by providing AI feedback on the quality of the data. [distilabel](https://github.com/argilla-io/distilabel) is a synthetic data and AI feedback framework for engineers who need fast, reliable and scalable pipelines based on verified research papers. Check the documentation [here](https://distilabel.argilla.io/latest/).\n", "\n", "To evaluate the responses, we will use the [serverless HF Inference API](https://huggingface.co/docs/api-inference/index) integrated with distilabel. This is free but rate-limited, allowing you to test and evaluate over 150,000 public models, or your own private models, via simple HTTP requests, with fast inference hosted on Hugging Face shared infrastructure. If you need more compute power, you can deploy your own inference endpoint with [Hugging Face Inference Endpoints](https://huggingface.co/docs/inference-endpoints/guides/create_endpoint).\n", "\n", "Finally, to further curate the data, we will use [Argilla](https://github.com/argilla-io/argilla), which allows us to provide human feedback on the data quality. Argilla is a collaboration tool for AI engineers and domain experts who need to build high-quality datasets for their projects. Check the documentation [here](https://docs.argilla.io/latest/).\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting Started" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Install the dependencies\n", "\n", "To complete this tutorial, you need to install the distilabel SDK and a few third-party libraries via pip." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install \"distilabel[hf-inference-endpoints]\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install \"transformers~=4.0\" \"torch~=2.0\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's make the required imports:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import random\n", "\n", "from datasets import load_dataset\n", "\n", "from distilabel.llms import InferenceEndpointsLLM\n", "from distilabel.pipeline import Pipeline\n", "from distilabel.steps import (\n", " KeepColumns,\n", " LoadDataFromDicts,\n", " PreferenceToArgilla,\n", ")\n", "from distilabel.steps.tasks import UltraFeedback" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You'll need an `HF_TOKEN` to use the HF Inference Endpoints. Login to use it directly within this notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "from huggingface_hub import login\n", "\n", "login(token=os.getenv(\"HF_TOKEN\"), add_to_git_credential=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### (optional) Deploy Argilla\n", "\n", "You can skip this step or replace it with any other data evaluation tool, but the quality of your model will suffer from a lack of data quality, so we do recommend looking at your data. If you already deployed Argilla, you can skip this step. Otherwise, you can quickly deploy Argilla following [this guide](https://docs.argilla.io/latest/getting_started/quickstart/). \n", "\n", "Along with that, you will need to install Argilla as a distilabel extra." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install \"distilabel[argilla, hf-inference-endpoints]\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, we will clean a preference dataset, so we will use the [`Intel/orca_dpo_pairs`](https://huggingface.co/datasets/Intel/orca_dpo_pairs) dataset from the Hugging Face Hub." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<iframe\n", " src=\"https://huggingface.co/datasets/Intel/orca_dpo_pairs/embed/viewer/default/train\"\n", " frameborder=\"0\"\n", " width=\"100%\"\n", " height=\"560px\"\n", "></iframe>" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "dataset = load_dataset(\"Intel/orca_dpo_pairs\", split=\"train[:20]\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we will shuffle the `chosen` and `rejected` columns to avoid any bias in the dataset." 
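, "\n\nShuffling uses Python's `random` module, so if you want the shuffled order (and everything that depends on it) to be reproducible, you can optionally seed the generator first:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional: fix the seed so the shuffled chosen/rejected order is reproducible across runs\n", "random.seed(42)"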
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def shuffle_and_track(chosen, rejected):\n", " pair = [chosen, rejected]\n", " random.shuffle(pair)\n", " order = [\"chosen\" if x == chosen else \"rejected\" for x in pair]\n", " return {\"generations\": pair, \"order\": order}\n", "\n", "dataset = dataset.map(lambda x: shuffle_and_track(x[\"chosen\"], x[\"rejected\"]))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "dataset = dataset.to_list()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### (optional) Create a custom step\n", "\n", "A step is a block in a distilabel pipeline used to manipulate, generate, or evaluate data, among other tasks. A set of predefined steps is provided, but you can also create your [own custom steps](https://distilabel.argilla.io/latest/sections/how_to_guides/basic/step/#defining-custom-steps). Instead of preprocessing the data as in the previous section, it is possible to use a custom step to shuffle the columns. This step should be in a separate module to be imported and used in the pipeline. In this case, the pipeline would start by loading the `orca_dpo_pairs` dataset using the `LoadDataFromHub` step and then applying the `ShuffleStep`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# \"shuffle_step.py\"\n", "from typing import TYPE_CHECKING, List\n", "from distilabel.steps import GlobalStep, StepInput\n", "\n", "if TYPE_CHECKING:\n", " from distilabel.steps.typing import StepOutput\n", " \n", "import random\n", "\n", "class ShuffleStep(GlobalStep):\n", " @property\n", " def inputs(self) -> List[str]:\n", " return [\"instruction\", \"chosen\", \"rejected\"]\n", "\n", " @property\n", " def outputs(self) -> List[str]:\n", " return [\"instruction\", \"generations\", \"order\"]\n", "\n", " def process(self, inputs: StepInput) -> \"StepOutput\":\n", " outputs = []\n", "\n", " for input in inputs:\n", " chosen = input[\"chosen\"]\n", " rejected = input[\"rejected\"]\n", " pair = [chosen, rejected]\n", " random.shuffle(pair)\n", " order = [\"chosen\" if x == chosen else \"rejected\" for x in pair]\n", " \n", " outputs.append({\"instruction\": input[\"instruction\"], \"generations\": pair, \"order\": order})\n", "\n", " yield outputs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from shuffle_step import ShuffleStep" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define the pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To clean an existing preference dataset, we will need to define a `Pipeline` with all the necessary steps. However, a similar workflow can be used to clean an SFT dataset. Below, we will go over each step in detail." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load the dataset\n", "We will use the dataset we just shuffled as source data.\n", "\n", "- Component: `LoadDataFromDicts`\n", "- Input columns: `system`, `question`, `chosen`, `rejected`, `generations` and `order`, the same keys as in the loaded list of dictionaries.\n", "- Output columns: `system`, `instruction`, `chosen`, `rejected`, `generations` and `order`. We will use `output_mappings` to rename the columns." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "([{'system': '',\n", " 'question': \"You will be given a definition of a task first, then some input of the task.\\nThis task is about using the specified sentence and converting the sentence to Resource Description Framework (RDF) triplets of the form (subject, predicate object). The RDF triplets generated must be such that the triplets accurately capture the structure and semantics of the input sentence. The input is a sentence and the output is a list of triplets of the form [subject, predicate, object] that capture the relationships present in the sentence. When a sentence has more than 1 RDF triplet possible, the output must contain all of them.\\n\\nAFC Ajax (amateurs)'s ground is Sportpark De Toekomst where Ajax Youth Academy also play.\\nOutput:\",\n", " 'chosen': '[\\n [\"AFC Ajax (amateurs)\", \"has ground\", \"Sportpark De Toekomst\"],\\n [\"Ajax Youth Academy\", \"plays at\", \"Sportpark De Toekomst\"]\\n]',\n", " 'rejected': \" Sure, I'd be happy to help! Here are the RDF triplets for the input sentence:\\n\\n[AFC Ajax (amateurs), hasGround, Sportpark De Toekomst]\\n[Ajax Youth Academy, playsAt, Sportpark De Toekomst]\\n\\nExplanation:\\n\\n* AFC Ajax (amateurs) is the subject of the first triplet, and hasGround is the predicate that describes the relationship between AFC Ajax (amateurs) and Sportpark De Toekomst.\\n* Ajax Youth Academy is the subject of the second triplet, and playsAt is the predicate that describes the relationship between Ajax Youth Academy and Sportpark De Toekomst.\\n\\nNote that there may be other possible RDF triplets that could be derived from the input sentence, but the above triplets capture the main relationships present in the sentence.\",\n", " 'generations': [\" Sure, I'd be happy to help! Here are the RDF triplets for the input sentence:\\n\\n[AFC Ajax (amateurs), hasGround, Sportpark De Toekomst]\\n[Ajax Youth Academy, playsAt, Sportpark De Toekomst]\\n\\nExplanation:\\n\\n* AFC Ajax (amateurs) is the subject of the first triplet, and hasGround is the predicate that describes the relationship between AFC Ajax (amateurs) and Sportpark De Toekomst.\\n* Ajax Youth Academy is the subject of the second triplet, and playsAt is the predicate that describes the relationship between Ajax Youth Academy and Sportpark De Toekomst.\\n\\nNote that there may be other possible RDF triplets that could be derived from the input sentence, but the above triplets capture the main relationships present in the sentence.\",\n", " '[\\n [\"AFC Ajax (amateurs)\", \"has ground\", \"Sportpark De Toekomst\"],\\n [\"Ajax Youth Academy\", \"plays at\", \"Sportpark De Toekomst\"]\\n]'],\n", " 'order': ['rejected', 'chosen']}],\n", " True)" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "load_dataset = LoadDataFromDicts(\n", " data=dataset[:1],\n", " output_mappings={\"question\": \"instruction\"},\n", " pipeline=Pipeline(name=\"showcase-pipeline\"),\n", ")\n", "load_dataset.load()\n", "next(load_dataset.process())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluate the responses\n", "\n", "To evaluate the quality of the responses, we will use [`meta-llama/Meta-Llama-3.1-70B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct), applying the `UltraFeedback` task that judges the responses according to different dimensions (helpfulness, honesty, instruction-following, truthfulness). 
For an SFT dataset, you can use [`PrometheusEval`](https://distilabel.argilla.io/latest/components-gallery/tasks/prometheuseval/) instead.\n", "\n", "- Component: `UltraFeedback` task with an LLM served via `InferenceEndpointsLLM`\n", "- Input columns: `instruction`, `generations`\n", "- Output columns: `ratings`, `rationales`, `distilabel_metadata`, `model_name`\n", "\n", "For your use case and to improve the results, you can use any [other LLM of your choice](https://distilabel.argilla.io/latest/components-gallery/llms/)." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'instruction': \"What's the capital of Spain?\",\n", " 'generations': ['Madrid', 'Barcelona'],\n", " 'ratings': [5, 1],\n", " 'rationales': [\"The answer is correct, directly addressing the question, and is free of hallucinations or unnecessary details. It confidently provides the accurate information, aligning perfectly with the user's intent.\",\n", " \"The answer is incorrect as Barcelona is not the capital of Spain. This introduces a significant inaccuracy, failing to provide helpful information and deviating entirely from the user's intent.\"],\n", " 'distilabel_metadata': {'raw_output_ultra_feedback_0': \"#### Output for Text 1\\nRating: 5 (Excellent)\\nRationale: The answer is correct, directly addressing the question, and is free of hallucinations or unnecessary details. It confidently provides the accurate information, aligning perfectly with the user's intent.\\n\\n#### Output for Text 2\\nRating: 1 (Low Quality)\\nRationale: The answer is incorrect as Barcelona is not the capital of Spain. This introduces a significant inaccuracy, failing to provide helpful information and deviating entirely from the user's intent.\"},\n", " 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "evaluate_responses = UltraFeedback(\n", " aspect=\"overall-rating\",\n", " llm=InferenceEndpointsLLM(\n", " model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n", " tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n", " generation_kwargs={\"max_new_tokens\": 512, \"temperature\": 0.7},\n", " ),\n", " pipeline=Pipeline(name=\"showcase-pipeline\"),\n", ")\n", "evaluate_responses.load()\n", "next(\n", " evaluate_responses.process(\n", " [\n", " {\n", " \"instruction\": \"What's the capital of Spain?\",\n", " \"generations\": [\"Madrid\", \"Barcelona\"],\n", " }\n", " ]\n", " )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Keep only the required columns\n", "\n", "We will get rid of the unneeded columns.\n", "\n", "- Component: `KeepColumns`\n", "- Input columns: `system`, `instruction`, `chosen`, `rejected`, `generations`, `order`, `ratings`, `rationales`, `distilabel_metadata` and `model_name`\n", "- Output columns: `instruction`, `generations`, `order`, `ratings`, `rationales` and `model_name`" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'instruction': \"What's the capital of Spain?\",\n", " 'generations': ['Madrid', 'Barcelona'],\n", " 'order': ['chosen', 'rejected'],\n", " 'ratings': [5, 1],\n", " 'rationales': ['', ''],\n", " 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "keep_columns = KeepColumns(\n", " columns=[\n", " \"instruction\",\n", " \"generations\",\n", " \"order\",\n", " \"ratings\",\n", " \"rationales\",\n", " \"model_name\",\n", " ],\n", "
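# NOTE: passing `pipeline` is only needed when running a step standalone, as in these showcase cells;\n", " # inside a `with Pipeline(...)` block, steps are attached to the pipeline automatically\n", "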
pipeline=Pipeline(name=\"showcase-pipeline\"),\n", ")\n", "keep_columns.load()\n", "next(\n", " keep_columns.process(\n", " [\n", " {\n", " \"system\": \"\",\n", " \"instruction\": \"What's the capital of Spain?\",\n", " \"chosen\": \"Madrid\",\n", " \"rejected\": \"Barcelona\",\n", " \"generations\": [\"Madrid\", \"Barcelona\"],\n", " \"order\": [\"chosen\", \"rejected\"],\n", " \"ratings\": [5, 1],\n", " \"rationales\": [\"\", \"\"],\n", " \"model_name\": \"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n", " }\n", " ]\n", " )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### (Optional) Further data curation\n", "\n", "You can use Argilla to further curate your data.\n", "\n", "- Component: `PreferenceToArgilla` step\n", "- Input columns: `instruction`, `generations`, `generation_models`, `ratings`\n", "- Output columns: `instruction`, `generations`, `generation_models`, `ratings`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "to_argilla = PreferenceToArgilla(\n", " dataset_name=\"cleaned-dataset\",\n", " dataset_workspace=\"argilla\",\n", " api_url=\"https://[your-owner-name]-[your-space-name].hf.space\",\n", " api_key=\"[your-api-key]\",\n", " num_generations=2\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run the pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below, you can see the full pipeline definition:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "with Pipeline(name=\"clean-dataset\") as pipeline:\n", "\n", " load_dataset = LoadDataFromDicts(\n", " data=dataset, output_mappings={\"question\": \"instruction\"}\n", " )\n", "\n", " evaluate_responses = UltraFeedback(\n", " aspect=\"overall-rating\",\n", " llm=InferenceEndpointsLLM(\n", " model_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n", " tokenizer_id=\"meta-llama/Meta-Llama-3.1-70B-Instruct\",\n", " generation_kwargs={\"max_new_tokens\": 512, \"temperature\": 0.7},\n", " ),\n", " )\n", "\n", " keep_columns = KeepColumns(\n", " columns=[\n", " \"instruction\",\n", " \"generations\",\n", " \"order\",\n", " \"ratings\",\n", " \"rationales\",\n", " \"model_name\",\n", " ]\n", " )\n", "\n", " to_argilla = PreferenceToArgilla(\n", " dataset_name=\"cleaned-dataset\",\n", " dataset_workspace=\"argilla\",\n", " api_url=\"https://[your-owner-name]-[your-space-name].hf.space\",\n", " api_key=\"[your-api-key]\",\n", " num_generations=2,\n", " )\n", "\n", " load_dataset.connect(evaluate_responses)\n", " evaluate_responses.connect(keep_columns)\n", " keep_columns.connect(to_argilla)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now run the pipeline and clean our preference dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "distiset = pipeline.run()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check it! If you have loaded the data to Argilla, you can [start annotating in the Argilla UI](https://docs.argilla.io/latest/how_to_guides/annotate/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can push the dataset to the Hub for sharing with the community and [embed it to explore the data](https://huggingface.co/docs/hub/datasets-viewer-embed)." 
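, "\n\nBefore pushing, you can optionally inspect the generated `Distiset` locally. It is a dictionary-like container with one entry per leaf step of the pipeline, so printing it shows the available subsets and splits:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional sanity check: print the Distiset to see its subsets and splits before pushing\n", "print(distiset)"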
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "distiset.push_to_hub(\"[your-owner-name]/example-cleaned-preference-dataset\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<iframe\n", " src=\"https://huggingface.co/datasets/distilabel-internal-testing/example-cleaned-preference-dataset/embed/viewer/default/train\"\n", " frameborder=\"0\"\n", " width=\"100%\"\n", " height=\"560px\"\n", "></iframe>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial, we showcased the detailed steps to build a pipeline for cleaning a preference dataset using distilabel. However, you can customize this pipeline for your own use cases, such as cleaning an SFT dataset or adding custom steps.\n", "\n", "We used a preference dataset as our starting point and shuffled the data to avoid any bias. Next, we evaluated the responses using a model through the serverless Hugging Face Inference API, following the UltraFeedback standards. Finally, we kept the needed columns and used Argilla for further curation." ] } ], "metadata": { "kernelspec": { "display_name": "distilabel-tutorials", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 2 }