3_parameter_efficient_finetuning/notebooks/finetune_sft_peft.ipynb (516 lines of code) (raw):

{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "z-6LLOPZouLg" }, "source": [ "# How to Fine-Tune LLMs with LoRA Adapters using Hugging Face TRL\n", "\n", "This notebook demonstrates how to efficiently fine-tune large language models using LoRA (Low-Rank Adaptation) adapters. LoRA is a parameter-efficient fine-tuning technique that:\n", "- Freezes the pre-trained model weights\n", "- Adds small trainable rank decomposition matrices to attention layers\n", "- Typically reduces trainable parameters by ~90%\n", "- Maintains model performance while being memory efficient\n", "\n", "We'll cover:\n", "1. Setup development environment and LoRA configuration\n", "2. Create and prepare the dataset for adapter training\n", "3. Fine-tune using `trl` and `SFTTrainer` with LoRA adapters\n", "4. Test the model and merge adapters (optional)\n" ] }, { "cell_type": "markdown", "metadata": { "id": "fXqd9BXgouLi" }, "source": [ "## 1. Setup development environment\n", "\n", "Our first step is to install Hugging Face Libraries and Pytorch, including trl, transformers and datasets. If you haven't heard of trl yet, don't worry. It is a new library on top of transformers and datasets, which makes it easier to fine-tune, rlhf, align open LLMs.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "tKvGVxImouLi" }, "outputs": [], "source": [ "# Install the requirements in Google Colab\n", "# !pip install transformers datasets trl huggingface_hub\n", "\n", "# Authenticate to Hugging Face\n", "\n", "from huggingface_hub import login\n", "\n", "login()\n", "\n", "# for convenience you can create an environment variable containing your hub token as HF_TOKEN" ] }, { "cell_type": "markdown", "metadata": { "id": "XHUzfwpKouLk" }, "source": [ "## 2. Load the dataset" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "id": "z4p6Bvo7ouLk" }, "outputs": [ { "data": { "text/plain": [ "DatasetDict({\n", " train: Dataset({\n", " features: ['full_topic', 'messages'],\n", " num_rows: 2260\n", " })\n", " test: Dataset({\n", " features: ['full_topic', 'messages'],\n", " num_rows: 119\n", " })\n", "})" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Load a sample dataset\n", "from datasets import load_dataset\n", "\n", "# TODO: define your dataset and config using the path and name parameters\n", "dataset = load_dataset(path=\"HuggingFaceTB/smoltalk\", name=\"everyday-conversations\")\n", "dataset" ] }, { "cell_type": "markdown", "metadata": { "id": "9TOhJdtsouLk" }, "source": [ "## 3. Fine-tune LLM using `trl` and the `SFTTrainer` with LoRA\n", "\n", "The [SFTTrainer](https://huggingface.co/docs/trl/sft_trainer) from `trl` provides integration with LoRA adapters through the [PEFT](https://huggingface.co/docs/peft/en/index) library. Key advantages of this setup include:\n", "\n", "1. **Memory Efficiency**: \n", " - Only adapter parameters are stored in GPU memory\n", " - Base model weights remain frozen and can be loaded in lower precision\n", " - Enables fine-tuning of large models on consumer GPUs\n", "\n", "2. **Training Features**:\n", " - Native PEFT/LoRA integration with minimal setup\n", " - Support for QLoRA (Quantized LoRA) for even better memory efficiency\n", "\n", "3. **Adapter Management**:\n", " - Adapter weight saving during checkpoints\n", " - Features to merge adapters back into base model\n", "\n", "We'll use LoRA in our example, which combines LoRA with 4-bit quantization to further reduce memory usage without sacrificing performance. The setup requires just a few configuration steps:\n", "1. Define the LoRA configuration (rank, alpha, dropout)\n", "2. Create the SFTTrainer with PEFT config\n", "3. Train and save the adapter weights\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import necessary libraries\n", "from transformers import AutoModelForCausalLM, AutoTokenizer\n", "from datasets import load_dataset\n", "from trl import SFTConfig, SFTTrainer, setup_chat_format\n", "import torch\n", "\n", "device = (\n", " \"cuda\"\n", " if torch.cuda.is_available()\n", " else \"mps\" if torch.backends.mps.is_available() else \"cpu\"\n", ")\n", "\n", "# Load the model and tokenizer\n", "model_name = \"HuggingFaceTB/SmolLM2-135M\"\n", "\n", "model = AutoModelForCausalLM.from_pretrained(\n", " pretrained_model_name_or_path=model_name\n", ").to(device)\n", "tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_name)\n", "\n", "# Set up the chat format\n", "model, tokenizer = setup_chat_format(model=model, tokenizer=tokenizer)\n", "\n", "# Set our name for the finetune to be saved &/ uploaded to\n", "finetune_name = \"SmolLM2-FT-MyDataset\"\n", "finetune_tags = [\"smol-course\", \"module_1\"]" ] }, { "cell_type": "markdown", "metadata": { "id": "ZbuVArTHouLk" }, "source": [ "The `SFTTrainer`  supports a native integration with `peft`, which makes it super easy to efficiently tune LLMs using, e.g. LoRA. We only need to create our `LoraConfig` and provide it to the trainer.\n", "\n", "<div style='background-color: lightblue; padding: 10px; border-radius: 5px; margin-bottom: 20px; color:black'>\n", " <h2 style='margin: 0;color:blue'>Exercise: Define LoRA parameters for finetuning</h2>\n", " <p>Take a dataset from the Hugging Face hub and finetune a model on it. </p> \n", " <p><b>Difficulty Levels</b></p>\n", " <p>🐢 Use the general parameters for an abitrary finetune</p>\n", " <p>🐕 Adjust the parameters and review in weights & biases.</p>\n", " <p>🦁 Adjust the parameters and show change in inference results.</p>\n", "</div>" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "blDSs9swouLk" }, "outputs": [], "source": [ "from peft import LoraConfig\n", "\n", "# TODO: Configure LoRA parameters\n", "# r: rank dimension for LoRA update matrices (smaller = more compression)\n", "rank_dimension = 6\n", "# lora_alpha: scaling factor for LoRA layers (higher = stronger adaptation)\n", "lora_alpha = 8\n", "# lora_dropout: dropout probability for LoRA layers (helps prevent overfitting)\n", "lora_dropout = 0.05\n", "\n", "peft_config = LoraConfig(\n", " r=rank_dimension, # Rank dimension - typically between 4-32\n", " lora_alpha=lora_alpha, # LoRA scaling factor - typically 2x rank\n", " lora_dropout=lora_dropout, # Dropout probability for LoRA layers\n", " bias=\"none\", # Bias type for LoRA. the corresponding biases will be updated during training.\n", " target_modules=\"all-linear\", # Which modules to apply LoRA to\n", " task_type=\"CAUSAL_LM\", # Task type for model architecture\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "l5NUDPcaouLl" }, "source": [ "Before we can start our training we need to define the hyperparameters (`TrainingArguments`) we want to use." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "NqT28VZlouLl" }, "outputs": [], "source": [ "# Training configuration\n", "# Hyperparameters based on QLoRA paper recommendations\n", "args = SFTConfig(\n", " # Output settings\n", " output_dir=finetune_name, # Directory to save model checkpoints\n", " # Training duration\n", " num_train_epochs=1, # Number of training epochs\n", " # Batch size settings\n", " per_device_train_batch_size=2, # Batch size per GPU\n", " gradient_accumulation_steps=2, # Accumulate gradients for larger effective batch\n", " # Memory optimization\n", " gradient_checkpointing=True, # Trade compute for memory savings\n", " # Optimizer settings\n", " optim=\"adamw_torch_fused\", # Use fused AdamW for efficiency\n", " learning_rate=2e-4, # Learning rate (QLoRA paper)\n", " max_grad_norm=0.3, # Gradient clipping threshold\n", " # Learning rate schedule\n", " warmup_ratio=0.03, # Portion of steps for warmup\n", " lr_scheduler_type=\"constant\", # Keep learning rate constant after warmup\n", " # Logging and saving\n", " logging_steps=10, # Log metrics every N steps\n", " save_strategy=\"epoch\", # Save checkpoint every epoch\n", " # Precision settings\n", " bf16=True, # Use bfloat16 precision\n", " # Integration settings\n", " push_to_hub=False, # Don't push to HuggingFace Hub\n", " report_to=\"none\", # Disable external logging\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "cGhR7uFBouLl" }, "source": [ "We now have every building block we need to create our `SFTTrainer` to start then training our model." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "M00Har2douLl" }, "outputs": [], "source": [ "max_seq_length = 1512 # max sequence length for model and packing of the dataset\n", "\n", "# Create SFTTrainer with LoRA configuration\n", "trainer = SFTTrainer(\n", " model=model,\n", " args=args,\n", " train_dataset=dataset[\"train\"],\n", " peft_config=peft_config, # LoRA configuration\n", " max_seq_length=max_seq_length, # Maximum sequence length\n", " tokenizer=tokenizer,\n", " packing=True, # Enable input packing for efficiency\n", " dataset_kwargs={\n", " \"add_special_tokens\": False, # Special tokens handled by template\n", " \"append_concat_token\": False, # No additional separator needed\n", " },\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "zQ_kRN24ouLl" }, "source": [ "Start training our model by calling the `train()` method on our `Trainer` instance. This will start the training loop and train our model for 3 epochs. Since we are using a PEFT method, we will only save the adapted model weights and not the full model." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Tq4nIYqKouLl" }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "300e5dfbb4b54750b77324345c7591f9", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/72 [00:00<?, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "TrainOutput(global_step=72, training_loss=1.6402628521124523, metrics={'train_runtime': 195.2398, 'train_samples_per_second': 1.485, 'train_steps_per_second': 0.369, 'total_flos': 282267289092096.0, 'train_loss': 1.6402628521124523, 'epoch': 0.993103448275862})" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# start training, the model will be automatically saved to the hub and the output directory\n", "trainer.train()\n", "\n", "# save model\n", "trainer.save_model()" ] }, { "cell_type": "markdown", "metadata": { "id": "y4HHSYYzouLl" }, "source": [ "The training with Flash Attention for 3 epochs with a dataset of 15k samples took 4:14:36 on a `g5.2xlarge`. The instance costs `1.21$/h` which brings us to a total cost of only ~`5.3$`.\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "C309KsXjouLl" }, "source": [ "### Merge LoRA Adapter into the Original Model\n", "\n", "When using LoRA, we only train adapter weights while keeping the base model frozen. During training, we save only these lightweight adapter weights (~2-10MB) rather than a full model copy. However, for deployment, you might want to merge the adapters back into the base model for:\n", "\n", "1. **Simplified Deployment**: Single model file instead of base model + adapters\n", "2. **Inference Speed**: No adapter computation overhead\n", "3. **Framework Compatibility**: Better compatibility with serving frameworks\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from peft import AutoPeftModelForCausalLM\n", "\n", "\n", "# Load PEFT model on CPU\n", "model = AutoPeftModelForCausalLM.from_pretrained(\n", " pretrained_model_name_or_path=args.output_dir,\n", " torch_dtype=torch.float16,\n", " low_cpu_mem_usage=True,\n", ")\n", "\n", "# Merge LoRA and base model and save\n", "merged_model = model.merge_and_unload()\n", "merged_model.save_pretrained(\n", " args.output_dir, safe_serialization=True, max_shard_size=\"2GB\"\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "-yO6E9quouLl" }, "source": [ "## 3. Test Model and run Inference\n", "\n", "After the training is done we want to test our model. We will load different samples from the original dataset and evaluate the model on those samples, using a simple loop and accuracy as our metric.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<div style='background-color: lightblue; padding: 10px; border-radius: 5px; margin-bottom: 20px; color:black'>\n", " <h2 style='margin: 0;color:blue'>Bonus Exercise: Load LoRA Adapter</h2>\n", " <p>Use what you learnt from the ecample note book to load your trained LoRA adapter for inference.</p> \n", "</div>" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "id": "I5B494OdouLl" }, "outputs": [], "source": [ "# free the memory again\n", "del model\n", "del trainer\n", "torch.cuda.empty_cache()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "P1UhohVdouLl" }, "outputs": [], "source": [ "import torch\n", "from peft import AutoPeftModelForCausalLM\n", "from transformers import AutoTokenizer, pipeline\n", "\n", "# Load Model with PEFT adapter\n", "tokenizer = AutoTokenizer.from_pretrained(finetune_name)\n", "model = AutoPeftModelForCausalLM.from_pretrained(\n", " finetune_name, device_map=\"auto\", torch_dtype=torch.float16\n", ")\n", "pipe = pipeline(\n", " \"text-generation\", model=merged_model, tokenizer=tokenizer, device=device\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "99uFDAuuouLl" }, "source": [ "Lets test some prompt samples and see how the model performs." ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "id": "-shSmUbvouLl", "outputId": "16d97c61-3b31-4040-c780-3c4de75c3824" }, "outputs": [], "source": [ "prompts = [\n", " \"What is the capital of Germany? Explain why thats the case and if it was different in the past?\",\n", " \"Write a Python function to calculate the factorial of a number.\",\n", " \"A rectangular garden has a length of 25 feet and a width of 15 feet. If you want to build a fence around the entire garden, how many feet of fencing will you need?\",\n", " \"What is the difference between a fruit and a vegetable? Give examples of each.\",\n", "]\n", "\n", "\n", "def test_inference(prompt):\n", " prompt = pipe.tokenizer.apply_chat_template(\n", " [{\"role\": \"user\", \"content\": prompt}],\n", " tokenize=False,\n", " add_generation_prompt=True,\n", " )\n", " outputs = pipe(\n", " prompt,\n", " )\n", " return outputs[0][\"generated_text\"][len(prompt) :].strip()\n", "\n", "\n", "for prompt in prompts:\n", " print(f\" prompt:\\n{prompt}\")\n", " print(f\" response:\\n{test_inference(prompt)}\")\n", " print(\"-\" * 50)" ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.10" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 0 }