6_synthetic_datasets/notebooks/instruction_sft

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Generate a dataset for instruction tuning\n", "\n", "This notebook will guide you through the process of generating a dataset for instruction tuning. We'll use the `distilabel` package to generate a dataset for instruction tuning.\n", "\n", "So let's dig in to some instruction tuning datasets.\n", "\n", "<div style='background-color: lightblue; padding: 10px; border-radius: 5px; margin-bottom: 20px; color:black'>\n", " <h2 style='margin: 0;color:blue'>Exercise: Generate a dataset for instruction tuning</h2>\n", " <p>Now that you've seen how to generate a dataset for instruction tuning, try generating a dataset for instruction tuning.</p>\n", " <p><b>Difficulty Levels</b></p>\n", " <p>🐢 Generate an instruction tuning dataset</p>\n", " <p>🐕 Generate a dataset for instruction tuning with seed data</p>\n", " <p>🦁 Generate a dataset for instruction tuning with seed data and with instruction evolution</p>\n", "</div>" ] }, { "cell_type": "markdown", "metadata": { "vscode": { "languageId": "plaintext" } }, "source": [ "## Install dependencies\n", "\n", "Instead of transformers, you can also install `vllm` or `hf-inference-endpoints`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install \"distilabel[hf-transformers,outlines,instructor]\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Start synthesizing\n", "\n", "As we've seen in the previous course content, we can create a distilabel pipelines for instruction dataset generation. The bare minimum pipline is already provided. Make sure to scale up this pipeline to generate a large dataset for instruction tuning. Swap out models, model providers and generation arguments to see how they affect the quality of the dataset. Experiment small, scale up later.\n", "\n", "Check out the [distilabel components gallery](https://distilabel.argilla.io/latest/components-gallery/) for information about the processing classes and how to use them. \n", "\n", "An example of loading data from the Hub instead of dictionaries is provided below.\n", "\n", "```python\n", "from datasets import load_dataset\n", "\n", "with Pipeline(...) as pipeline:\n", " ...\n", "\n", "if __name__ == \"__main__:\n", " dataset = load_dataset(\"my-dataset\", split=\"train\")\n", " distiset = pipeline.run(dataset=dataset)\n", "```\n", "\n", "Don't forget to push your dataset to the Hub after running the pipeline!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from distilabel.llms import TransformersLLM\n", "from distilabel.pipeline import Pipeline\n", "from distilabel.steps import LoadDataFromDicts\n", "from distilabel.steps.tasks import TextGeneration\n", "\n", "with Pipeline() as pipeline:\n", " data = LoadDataFromDicts(data=[{\"instruction\": \"Generate a short question about the Hugging Face Smol-Course.\"}])\n", " llm = TransformersLLM(model=\"HuggingFaceTB/SmolLM2-1.7B-Instruct\")\n", " gen_a = TextGeneration(llm=llm, output_mappings={\"generation\": \"instruction\"})\n", " gen_b = TextGeneration(llm=llm, output_mappings={\"generation\": \"response\"})\n", " data >> gen_a >> gen_b\n", "\n", "if __name__ == \"__main__\":\n", " distiset = pipeline.run(use_cache=False)\n", " distiset.push_to_hub(\"huggingface-smol-course-instruction-tuning-dataset\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 🌯 That's a wrap\n", "\n", "You've now seen how to generate a dataset for instruction tuning. You could use this to:\n", "\n", "- Generate a dataset for instruction tuning.\n", "- Create evaluation datasets for instruction tuning.\n", "\n", "Next\n", "\n", "🧑‍🏫 Learn - About [generating preference datasets](./preference_datasets.md)\n", "🏋️‍♂️ Fine-tune a model for instruction tuning with a synthetic dataset based on the [instruction tuning chapter](../../1_instruction_tuning/README.md)\n" ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 2 }

6_synthetic_datasets/notebooks/instruction_sft_dataset.ipynb (119 lines of code) (raw):