{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Agentic RAG: turbocharge your RAG with query reformulation and self-query! 🚀\n",
"_Authored by: [Aymeric Roucher](https://huggingface.co/m-ric)_\n",
"\n",
"> This tutorial is advanced. You should have notions from [this other cookbook](advanced_rag) first!\n",
"\n",
"> Reminder: Retrieval-Augmented-Generation (RAG) is “using an LLM to answer a user query, but basing the answer on information retrieved from a knowledge base”. It has many advantages over using a vanilla or fine-tuned LLM: to name a few, it allows to ground the answer on true facts and reduce confabulations, it allows to provide the LLM with domain-specific knowledge, and it allows fine-grained control of access to information from the knowledge base.\n",
"\n",
"But vanilla RAG has limitations, most importantly these two:\n",
"- It **performs only one retrieval step**: if the results are bad, the generation in turn will be bad.\n",
"- __Semantic similarity is computed with the *user query* as a reference__, which might be suboptimal: for instance, the user query will often be a question and the document containing the true answer will be in affirmative voice, so its similarity score will be downgraded compared to other source documents in the interrogative form, leading to a risk of missing the relevant information.\n",
"\n",
"But we can alleviate these problems by making a **RAG agent: very simply, an agent armed with a retriever tool!**\n",
"\n",
"This agent will: ✅ Formulate the query itself and ✅ Critique to re-retrieve if needed.\n",
"\n",
"So it should naively recover some advanced RAG techniques!\n",
"- Instead of directly using the user query as the reference in semantic search, the agent formulates itself a reference sentence that can be closer to the targeted documents, as in [HyDE](https://huggingface.co/papers/2212.10496)\n",
"- The agent can the generated snippets and re-retrieve if needed, as in [Self-Query](https://docs.llamaindex.ai/en/stable/examples/evaluation/RetryQuery/)\n",
"\n",
"Let's build this system. 🛠️\n",
"\n",
"Run the line below to install required dependencies:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install pandas langchain langchain-community sentence-transformers faiss-cpu smolagents --upgrade -q"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's login in order to call the HF Inference API:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from huggingface_hub import notebook_login\n",
"\n",
"notebook_login()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We first load a knowledge base on which we want to perform RAG: this dataset is a compilation of the documentation pages for many `huggingface` packages, stored as markdown."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import datasets\n",
"\n",
"knowledge_base = datasets.load_dataset(\"m-ric/huggingface_doc\", split=\"train\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we prepare the knowledge base by processing the dataset and storing it into a vector database to be used by the retriever.\n",
"\n",
"We use [LangChain](https://python.langchain.com/) for its excellent vector database utilities.\n",
"For the embedding model, we use [thenlper/gte-small](https://huggingface.co/thenlper/gte-small) since it performed well in our `RAG_evaluation` cookbook."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Splitting documents...\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 2647/2647 [00:52<00:00, 50.28it/s]\n",
"/var/folders/6m/9b1tts6d5w960j80wbw9tx3m0000gn/T/ipykernel_26437/2798339493.py:37: LangChainDeprecationWarning: The class `HuggingFaceEmbeddings` was deprecated in LangChain 0.2.2 and will be removed in 1.0. An updated version of the class exists in the :class:`~langchain-huggingface package and should be used instead. To use it run `pip install -U :class:`~langchain-huggingface` and import as `from :class:`~langchain_huggingface import HuggingFaceEmbeddings``.\n",
" embedding_model = HuggingFaceEmbeddings(model_name=\"thenlper/gte-small\")\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Embedding documents... This should take a few minutes (5 minutes on MacBook with M1 Pro)\n"
]
}
],
"source": [
"from tqdm import tqdm\n",
"from transformers import AutoTokenizer\n",
"from langchain.docstore.document import Document\n",
"from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
"from langchain.vectorstores import FAISS\n",
"from langchain_community.embeddings import HuggingFaceEmbeddings\n",
"from langchain_community.vectorstores.utils import DistanceStrategy\n",
"\n",
"source_docs = [\n",
" Document(page_content=doc[\"text\"], metadata={\"source\": doc[\"source\"].split(\"/\")[1]})\n",
" for doc in knowledge_base\n",
"]\n",
"\n",
"text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(\n",
" AutoTokenizer.from_pretrained(\"thenlper/gte-small\"),\n",
" chunk_size=200,\n",
" chunk_overlap=20,\n",
" add_start_index=True,\n",
" strip_whitespace=True,\n",
" separators=[\"\\n\\n\", \"\\n\", \".\", \" \", \"\"],\n",
")\n",
"\n",
"# Split docs and keep only unique ones\n",
"print(\"Splitting documents...\")\n",
"docs_processed = []\n",
"unique_texts = {}\n",
"for doc in tqdm(source_docs):\n",
" new_docs = text_splitter.split_documents([doc])\n",
" for new_doc in new_docs:\n",
" if new_doc.page_content not in unique_texts:\n",
" unique_texts[new_doc.page_content] = True\n",
" docs_processed.append(new_doc)\n",
"\n",
"print(\n",
" \"Embedding documents... This should take a few minutes (5 minutes on MacBook with M1 Pro)\"\n",
")\n",
"embedding_model = HuggingFaceEmbeddings(model_name=\"thenlper/gte-small\")\n",
"vectordb = FAISS.from_documents(\n",
" documents=docs_processed,\n",
" embedding=embedding_model,\n",
" distance_strategy=DistanceStrategy.COSINE,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now the database is ready: let’s build our agentic RAG system!\n",
"\n",
"👉 We only need a `RetrieverTool` that our agent can leverage to retrieve information from the knowledge base.\n",
"\n",
"Since we need to add a vectordb as an attribute of the tool, we cannot simply use the [simple tool constructor](https://huggingface.co/docs/transformers/main/en/agents#create-a-new-tool) with a `@tool` decorator: so we will follow the advanced setup highlighted in the [advanced agents documentation](https://huggingface.co/docs/transformers/main/en/agents_advanced#directly-define-a-tool-by-subclassing-tool-and-share-it-to-the-hub)."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"from smolagents import Tool\n",
"from langchain_core.vectorstores import VectorStore\n",
"\n",
"\n",
"class RetrieverTool(Tool):\n",
" name = \"retriever\"\n",
" description = \"Using semantic similarity, retrieves some documents from the knowledge base that have the closest embeddings to the input query.\"\n",
" inputs = {\n",
" \"query\": {\n",
" \"type\": \"string\",\n",
" \"description\": \"The query to perform. This should be semantically close to your target documents. Use the affirmative form rather than a question.\",\n",
" }\n",
" }\n",
" output_type = \"string\"\n",
"\n",
" def __init__(self, vectordb: VectorStore, **kwargs):\n",
" super().__init__(**kwargs)\n",
" self.vectordb = vectordb\n",
"\n",
" def forward(self, query: str) -> str:\n",
" assert isinstance(query, str), \"Your search query must be a string\"\n",
"\n",
" docs = self.vectordb.similarity_search(\n",
" query,\n",
" k=7,\n",
" )\n",
"\n",
" return \"\\nRetrieved documents:\\n\" + \"\".join(\n",
" [\n",
" f\"===== Document {str(i)} =====\\n\" + doc.page_content\n",
" for i, doc in enumerate(docs)\n",
" ]\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now it’s straightforward to create an agent that leverages this tool!\n",
"\n",
"The agent will need these arguments upon initialization:\n",
"- *`tools`*: a list of tools that the agent will be able to call.\n",
"- *`model`*: the LLM that powers the agent.\n",
"\n",
"Our `model` must be a callable that takes as input a list of [messages](https://huggingface.co/docs/transformers/main/chat_templating) and returns text. It also needs to accept a `stop_sequences` argument that indicates when to stop its generation. For convenience, we directly use the `InferenceClientModel` class provided in the package to get a LLM engine that calls our [Inference API](https://huggingface.co/docs/api-inference/en/index).\n",
"\n",
"And we use [meta-llama/Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct), served for free on Hugging Face's Inference API!\n",
"\n",
"_Note:_ The Inference API hosts models based on various criteria, and deployed models may be updated or replaced without prior notice. Learn more about it [here](https://huggingface.co/docs/api-inference/supported-models)."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"from smolagents import InferenceClientModel, ToolCallingAgent\n",
"\n",
"model = InferenceClientModel(\"meta-llama/Llama-3.1-70B-Instruct\")\n",
"\n",
"retriever_tool = RetrieverTool(vectordb)\n",
"agent = ToolCallingAgent(\n",
" tools=[retriever_tool], model=model\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we initialized the agent as a `ReactJsonAgent`, it has been automatically given a default system prompt that tells the LLM engine to process step-by-step and generate tool calls as JSON blobs (you could replace this prompt template with your own as needed).\n",
"\n",
"Then when its `.run()` method is launched, the agent takes care of calling the LLM engine, parsing the tool call JSON blobs and executing these tool calls, all in a loop that ends only when the final answer is provided."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #d4b702; text-decoration-color: #d4b702\">╭──────────────────────────────────────────────────── </span><span style=\"color: #d4b702; text-decoration-color: #d4b702; font-weight: bold\">New run</span><span style=\"color: #d4b702; text-decoration-color: #d4b702\"> ────────────────────────────────────────────────────╮</span>\n",
"<span style=\"color: #d4b702; text-decoration-color: #d4b702\">│</span> <span style=\"color: #d4b702; text-decoration-color: #d4b702\">│</span>\n",
"<span style=\"color: #d4b702; text-decoration-color: #d4b702\">│</span> <span style=\"font-weight: bold\">How can I push a model to the Hub?</span> <span style=\"color: #d4b702; text-decoration-color: #d4b702\">│</span>\n",
"<span style=\"color: #d4b702; text-decoration-color: #d4b702\">│</span> <span style=\"color: #d4b702; text-decoration-color: #d4b702\">│</span>\n",
"<span style=\"color: #d4b702; text-decoration-color: #d4b702\">╰─ InferenceClientModel - meta-llama/Llama-3.1-70B-Instruct ────────────────────────────────────────────────────────────────╯</span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[38;2;212;183;2m╭─\u001b[0m\u001b[38;2;212;183;2m───────────────────────────────────────────────────\u001b[0m\u001b[38;2;212;183;2m \u001b[0m\u001b[1;38;2;212;183;2mNew run\u001b[0m\u001b[38;2;212;183;2m \u001b[0m\u001b[38;2;212;183;2m───────────────────────────────────────────────────\u001b[0m\u001b[38;2;212;183;2m─╮\u001b[0m\n",
"\u001b[38;2;212;183;2m│\u001b[0m \u001b[38;2;212;183;2m│\u001b[0m\n",
"\u001b[38;2;212;183;2m│\u001b[0m \u001b[1mHow can I push a model to the Hub?\u001b[0m \u001b[38;2;212;183;2m│\u001b[0m\n",
"\u001b[38;2;212;183;2m│\u001b[0m \u001b[38;2;212;183;2m│\u001b[0m\n",
"\u001b[38;2;212;183;2m╰─\u001b[0m\u001b[38;2;212;183;2m InferenceClientModel - meta-llama/Llama-3.1-70B-Instruct \u001b[0m\u001b[38;2;212;183;2m───────────────────────────────────────────────────────────────\u001b[0m\u001b[38;2;212;183;2m─╯\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #d4b702; text-decoration-color: #d4b702\">━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ </span><span style=\"font-weight: bold\">Step </span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0</span><span style=\"color: #d4b702; text-decoration-color: #d4b702\"> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━</span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[38;2;212;183;2m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ \u001b[0m\u001b[1mStep \u001b[0m\u001b[1;36m0\u001b[0m\u001b[38;2;212;183;2m ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮\n",
"│ Calling tool: 'retriever' with arguments: {'query': 'Push a model'} │\n",
"╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
"</pre>\n"
],
"text/plain": [
"╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮\n",
"│ Calling tool: 'retriever' with arguments: {'query': 'Push a model'} │\n",
"╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">Observations: Retrieved documents:\n",
"===== Document <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0</span> =====\n",
". The second way to upload a model, though, is to call <span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">model.push_to_hub</span><span style=\"font-weight: bold\">()</span>. So this is more of a once-off method - \n",
"it's not called regularly during training. You can just call this manually whenever you want to upload a model to \n",
"the hub. So we recommend running this after the end of training, just to make sure that you have a commit message \n",
"just to guarantee that this was the final version of the model at the end of training. And it just makes sure that \n",
"you're working with the definitive end-of-training model and not accidentally using a model that's from a \n",
"checkpoint somewhere along the <span style=\"color: #808000; text-decoration-color: #808000\">way</span>===== Document <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span> =====\n",
"<span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">model.fit</span><span style=\"font-weight: bold\">(</span>my_data, my_labels<span style=\"font-weight: bold\">)</span>\n",
"\n",
"<span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">model.push_to_hub</span><span style=\"font-weight: bold\">(</span><span style=\"color: #008000; text-decoration-color: #008000\">\"my-new-model\"</span><span style=\"font-weight: bold\">)</span>\n",
"```\n",
"\n",
"You can also use the \n",
"<span style=\"font-weight: bold\">[</span>PushToHubCallback<span style=\"font-weight: bold\">](</span><span style=\"color: #0000ff; text-decoration-color: #0000ff; text-decoration: underline\">https://huggingface.co/docs/transformers/main_classes/keras_callbacks#transformers.PushToHubCal</span>\n",
"<span style=\"color: #0000ff; text-decoration-color: #0000ff; text-decoration: underline\">lback)</span> to upload checkpoints regularly during a longer training run! Either way, you’ll get a model page and an \n",
"autogenerated model card, and most importantly of all, anyone else can use your model to get predictions, or as a \n",
"starting point for further training, using exactly the same API as they use to load any existing model:===== \n",
"Document <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">2</span> =====\n",
"# Pushing all things after training\n",
"<span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">trainer.push_to_hub</span><span style=\"font-weight: bold\">()</span>\n",
"```\n",
"\n",
"There is much more you can do, so we suggest to review the <span style=\"font-weight: bold\">[</span>Share a \n",
"model<span style=\"font-weight: bold\">](</span><span style=\"color: #0000ff; text-decoration-color: #0000ff; text-decoration: underline\">https://huggingface.co/docs/transformers/model_sharing)</span> guide.\n",
"\n",
"## Additional resources\n",
"\n",
"* Transformers <span style=\"font-weight: bold\">(</span><span style=\"color: #0000ff; text-decoration-color: #0000ff; text-decoration: underline\">https://github.com/huggingface/transformers).</span>\n",
"* Transformers <span style=\"font-weight: bold\">(</span><span style=\"color: #0000ff; text-decoration-color: #0000ff; text-decoration: underline\">https://huggingface.co/docs/transformers/index).</span>\n",
"* Share a model <span style=\"font-weight: bold\">(</span><span style=\"color: #0000ff; text-decoration-color: #0000ff; text-decoration: underline\">https://huggingface.co/docs/transformers/model_sharing).=====</span> Document <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">3</span> =====\n",
"Finally, if you want, you can push your model up to the hub. Here, we'll push it up if you specified \n",
"`<span style=\"color: #808000; text-decoration-color: #808000\">push_to_hub</span>=<span style=\"color: #00ff00; text-decoration-color: #00ff00; font-style: italic\">True</span>` in the training configuration. Note that in order to push to hub, you'll have to have git-lfs \n",
"installed and be logged into your Hugging Face account <span style=\"font-weight: bold\">(</span>which can be done via `huggingface-cli login`<span style=\"font-weight: bold\">)</span>.\n",
"\n",
"```python\n",
"kwargs = <span style=\"font-weight: bold\">{</span>\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">\"finetuned_from\"</span>: model.config._name_or_path,\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">\"tasks\"</span>: <span style=\"color: #008000; text-decoration-color: #008000\">\"image-classification\"</span>,\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">\"dataset\"</span>: <span style=\"color: #008000; text-decoration-color: #008000\">'beans'</span>,\n",
" <span style=\"color: #008000; text-decoration-color: #008000\">\"tags\"</span>: <span style=\"font-weight: bold\">[</span><span style=\"color: #008000; text-decoration-color: #008000\">'image-classification'</span><span style=\"font-weight: bold\">]</span>,\n",
"<span style=\"font-weight: bold\">}</span>===== Document <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">4</span> =====\n",
"You've seen most of those before: we set some hyperparameters <span style=\"font-weight: bold\">(</span>like the learning rate, the number of epochs to \n",
"train for, and the weight decay<span style=\"font-weight: bold\">)</span>, and we specify `<span style=\"color: #808000; text-decoration-color: #808000\">push_to_hub</span>=<span style=\"color: #00ff00; text-decoration-color: #00ff00; font-style: italic\">True</span>` to indicate that we want to save the model and \n",
"evaluate it at the end of every epoch, and that we want to upload our results to the Model Hub. Note that you can \n",
"specify the name of the repository you want to push to with the `hub_model_id` argument <span style=\"font-weight: bold\">(</span>in particular, you will \n",
"have to use this argument to push to an organization<span style=\"font-weight: bold\">)</span>. For instance, when we pushed the model to the \n",
"<span style=\"font-weight: bold\">[</span>`huggingface-course` organization<span style=\"font-weight: bold\">](</span><span style=\"color: #0000ff; text-decoration-color: #0000ff; text-decoration: underline\">https://huggingface=====</span> Document <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">5</span> =====\n",
"# Step <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">7</span>. Push everything to the Hub\n",
" <span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">api.upload_folder</span><span style=\"font-weight: bold\">(</span>\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">repo_id</span>=<span style=\"color: #800080; text-decoration-color: #800080\">repo_id</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">folder_path</span>=<span style=\"color: #800080; text-decoration-color: #800080\">repo_local_path</span>,\n",
" <span style=\"color: #808000; text-decoration-color: #808000\">path_in_repo</span>=<span style=\"color: #008000; text-decoration-color: #008000\">\".\"</span>,\n",
" <span style=\"font-weight: bold\">)</span>\n",
"\n",
" <span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">print</span><span style=\"font-weight: bold\">(</span><span style=\"color: #008000; text-decoration-color: #008000\">\"Your model is pushed to the Hub. You can view your model here: \"</span>, repo_url<span style=\"font-weight: bold\">)</span>\n",
"```\n",
"\n",
"### .\n",
"\n",
"By using `push_to_hub` **you evaluate, record a replay, generate a model card of your agent and push it to the \n",
"Hub**.===== Document <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">6</span> =====\n",
"Once you have a trained topic model, you can push it to the Hugging Face Hub in one line. Pushing your model to the\n",
"Hub will automatically create an initial model card for your model, including an overview of the topics created. \n",
"Below you can see an example of the topics resulting from a <span style=\"font-weight: bold\">(</span><span style=\"color: #0000ff; text-decoration-color: #0000ff; text-decoration: underline\">https://huggingface.co/MaartenGr/BERTopic_ArXiv).</span>\n",
"</pre>\n"
],
"text/plain": [
"Observations: Retrieved documents:\n",
"===== Document \u001b[1;36m0\u001b[0m =====\n",
". The second way to upload a model, though, is to call \u001b[1;35mmodel.push_to_hub\u001b[0m\u001b[1m(\u001b[0m\u001b[1m)\u001b[0m. So this is more of a once-off method - \n",
"it's not called regularly during training. You can just call this manually whenever you want to upload a model to \n",
"the hub. So we recommend running this after the end of training, just to make sure that you have a commit message \n",
"just to guarantee that this was the final version of the model at the end of training. And it just makes sure that \n",
"you're working with the definitive end-of-training model and not accidentally using a model that's from a \n",
"checkpoint somewhere along the \u001b[33mway\u001b[0m===== Document \u001b[1;36m1\u001b[0m =====\n",
"\u001b[1;35mmodel.fit\u001b[0m\u001b[1m(\u001b[0mmy_data, my_labels\u001b[1m)\u001b[0m\n",
"\n",
"\u001b[1;35mmodel.push_to_hub\u001b[0m\u001b[1m(\u001b[0m\u001b[32m\"my-new-model\"\u001b[0m\u001b[1m)\u001b[0m\n",
"```\n",
"\n",
"You can also use the \n",
"\u001b[1m[\u001b[0mPushToHubCallback\u001b[1m]\u001b[0m\u001b[1m(\u001b[0m\u001b[4;94mhttps://huggingface.co/docs/transformers/main_classes/keras_callbacks#transformers.PushToHubCal\u001b[0m\n",
"\u001b[4;94mlback\u001b[0m\u001b[4;94m)\u001b[0m to upload checkpoints regularly during a longer training run! Either way, you’ll get a model page and an \n",
"autogenerated model card, and most importantly of all, anyone else can use your model to get predictions, or as a \n",
"starting point for further training, using exactly the same API as they use to load any existing model:===== \n",
"Document \u001b[1;36m2\u001b[0m =====\n",
"# Pushing all things after training\n",
"\u001b[1;35mtrainer.push_to_hub\u001b[0m\u001b[1m(\u001b[0m\u001b[1m)\u001b[0m\n",
"```\n",
"\n",
"There is much more you can do, so we suggest to review the \u001b[1m[\u001b[0mShare a \n",
"model\u001b[1m]\u001b[0m\u001b[1m(\u001b[0m\u001b[4;94mhttps://huggingface.co/docs/transformers/model_sharing\u001b[0m\u001b[4;94m)\u001b[0m guide.\n",
"\n",
"## Additional resources\n",
"\n",
"* Transformers \u001b[1m(\u001b[0m\u001b[4;94mhttps://github.com/huggingface/transformers\u001b[0m\u001b[4;94m)\u001b[0m\u001b[4;94m.\u001b[0m\n",
"* Transformers \u001b[1m(\u001b[0m\u001b[4;94mhttps://huggingface.co/docs/transformers/index\u001b[0m\u001b[4;94m)\u001b[0m\u001b[4;94m.\u001b[0m\n",
"* Share a model \u001b[1m(\u001b[0m\u001b[4;94mhttps://huggingface.co/docs/transformers/model_sharing\u001b[0m\u001b[4;94m)\u001b[0m\u001b[4;94m.=====\u001b[0m Document \u001b[1;36m3\u001b[0m =====\n",
"Finally, if you want, you can push your model up to the hub. Here, we'll push it up if you specified \n",
"`\u001b[33mpush_to_hub\u001b[0m=\u001b[3;92mTrue\u001b[0m` in the training configuration. Note that in order to push to hub, you'll have to have git-lfs \n",
"installed and be logged into your Hugging Face account \u001b[1m(\u001b[0mwhich can be done via `huggingface-cli login`\u001b[1m)\u001b[0m.\n",
"\n",
"```python\n",
"kwargs = \u001b[1m{\u001b[0m\n",
" \u001b[32m\"finetuned_from\"\u001b[0m: model.config._name_or_path,\n",
" \u001b[32m\"tasks\"\u001b[0m: \u001b[32m\"image-classification\"\u001b[0m,\n",
" \u001b[32m\"dataset\"\u001b[0m: \u001b[32m'beans'\u001b[0m,\n",
" \u001b[32m\"tags\"\u001b[0m: \u001b[1m[\u001b[0m\u001b[32m'image-classification'\u001b[0m\u001b[1m]\u001b[0m,\n",
"\u001b[1m}\u001b[0m===== Document \u001b[1;36m4\u001b[0m =====\n",
"You've seen most of those before: we set some hyperparameters \u001b[1m(\u001b[0mlike the learning rate, the number of epochs to \n",
"train for, and the weight decay\u001b[1m)\u001b[0m, and we specify `\u001b[33mpush_to_hub\u001b[0m=\u001b[3;92mTrue\u001b[0m` to indicate that we want to save the model and \n",
"evaluate it at the end of every epoch, and that we want to upload our results to the Model Hub. Note that you can \n",
"specify the name of the repository you want to push to with the `hub_model_id` argument \u001b[1m(\u001b[0min particular, you will \n",
"have to use this argument to push to an organization\u001b[1m)\u001b[0m. For instance, when we pushed the model to the \n",
"\u001b[1m[\u001b[0m`huggingface-course` organization\u001b[1m]\u001b[0m\u001b[1m(\u001b[0m\u001b[4;94mhttps://\u001b[0m\u001b[4;94mhuggingface\u001b[0m\u001b[4;94m=====\u001b[0m Document \u001b[1;36m5\u001b[0m =====\n",
"# Step \u001b[1;36m7\u001b[0m. Push everything to the Hub\n",
" \u001b[1;35mapi.upload_folder\u001b[0m\u001b[1m(\u001b[0m\n",
" \u001b[33mrepo_id\u001b[0m=\u001b[35mrepo_id\u001b[0m,\n",
" \u001b[33mfolder_path\u001b[0m=\u001b[35mrepo_local_path\u001b[0m,\n",
" \u001b[33mpath_in_repo\u001b[0m=\u001b[32m\".\"\u001b[0m,\n",
" \u001b[1m)\u001b[0m\n",
"\n",
" \u001b[1;35mprint\u001b[0m\u001b[1m(\u001b[0m\u001b[32m\"Your model is pushed to the Hub. You can view your model here: \"\u001b[0m, repo_url\u001b[1m)\u001b[0m\n",
"```\n",
"\n",
"### .\n",
"\n",
"By using `push_to_hub` **you evaluate, record a replay, generate a model card of your agent and push it to the \n",
"Hub**.===== Document \u001b[1;36m6\u001b[0m =====\n",
"Once you have a trained topic model, you can push it to the Hugging Face Hub in one line. Pushing your model to the\n",
"Hub will automatically create an initial model card for your model, including an overview of the topics created. \n",
"Below you can see an example of the topics resulting from a \u001b[1m(\u001b[0m\u001b[4;94mhttps://huggingface.co/MaartenGr/BERTopic_ArXiv\u001b[0m\u001b[4;94m)\u001b[0m\u001b[4;94m.\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">[Step 0: Duration 8.79 seconds| Input tokens: 1,388 | Output tokens: 20]</span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[2m[Step 0: Duration 8.79 seconds| Input tokens: 1,388 | Output tokens: 20]\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #d4b702; text-decoration-color: #d4b702\">━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ </span><span style=\"font-weight: bold\">Step </span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span><span style=\"color: #d4b702; text-decoration-color: #d4b702\"> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━</span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[38;2;212;183;2m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ \u001b[0m\u001b[1mStep \u001b[0m\u001b[1;36m1\u001b[0m\u001b[38;2;212;183;2m ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮\n",
"│ Calling tool: 'final_answer' with arguments: {'answer': 'To push a model to the Hub, you can use the │\n",
"│ push_to_hub() method after training. You can also use the PushToHubCallback to upload checkpoints regularly │\n",
"│ during a longer training run. Additionally, you can push the model up to the hub using the api.upload_folder() │\n",
"│ method.'} │\n",
"╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
"</pre>\n"
],
"text/plain": [
"╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮\n",
"│ Calling tool: 'final_answer' with arguments: {'answer': 'To push a model to the Hub, you can use the │\n",
"│ push_to_hub() method after training. You can also use the PushToHubCallback to upload checkpoints regularly │\n",
"│ during a longer training run. Additionally, you can push the model up to the hub using the api.upload_folder() │\n",
"│ method.'} │\n",
"╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #d4b702; text-decoration-color: #d4b702; font-weight: bold\">Final answer: To push a model to the Hub, you can use the push_to_hub() method after training. You can also use the</span>\n",
"<span style=\"color: #d4b702; text-decoration-color: #d4b702; font-weight: bold\">PushToHubCallback to upload checkpoints regularly during a longer training run. Additionally, you can push the </span>\n",
"<span style=\"color: #d4b702; text-decoration-color: #d4b702; font-weight: bold\">model up to the hub using the api.upload_folder() method.</span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[1;38;2;212;183;2mFinal answer: To push a model to the Hub, you can use the push_to_hub() method after training. You can also use the\u001b[0m\n",
"\u001b[1;38;2;212;183;2mPushToHubCallback to upload checkpoints regularly during a longer training run. Additionally, you can push the \u001b[0m\n",
"\u001b[1;38;2;212;183;2mmodel up to the hub using the api.upload_folder() method.\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #7f7f7f; text-decoration-color: #7f7f7f\">[Step 1: Duration 7.79 seconds| Input tokens: 3,668 | Output tokens: 94]</span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[2m[Step 1: Duration 7.79 seconds| Input tokens: 3,668 | Output tokens: 94]\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Final output:\n",
"To push a model to the Hub, you can use the push_to_hub() method after training. You can also use the PushToHubCallback to upload checkpoints regularly during a longer training run. Additionally, you can push the model up to the hub using the api.upload_folder() method.\n"
]
}
],
"source": [
"agent_output = agent.run(\"How can I push a model to the Hub?\")\n",
"\n",
"print(\"Final output:\")\n",
"print(agent_output)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Agentic RAG vs. standard RAG\n",
"\n",
"Does the agent setup make a better RAG system? Well, let's compare it to a standard RAG system using LLM Judge!\n",
"\n",
"We will use [meta-llama/Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) for evaluation since it's one of the strongest OS models we tested for LLM judge use cases."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"eval_dataset = datasets.load_dataset(\"m-ric/huggingface_doc_qa_eval\", split=\"train\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before running the test let's make the agent less verbose."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"import logging\n",
"\n",
"agent.logger.setLevel(logging.WARNING) # Let's reduce the agent's verbosity level\n",
"\n",
"eval_dataset = datasets.load_dataset(\"m-ric/huggingface_doc_qa_eval\", split=\"train\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"outputs_agentic_rag = []\n",
"\n",
"for example in tqdm(eval_dataset):\n",
" question = example[\"question\"]\n",
"\n",
" enhanced_question = f\"\"\"Using the information contained in your knowledge base, which you can access with the 'retriever' tool,\n",
"give a comprehensive answer to the question below.\n",
"Respond only to the question asked, response should be concise and relevant to the question.\n",
"If you cannot find information, do not give up and try calling your retriever again with different arguments!\n",
"Make sure to have covered the question completely by calling the retriever tool several times with semantically different queries.\n",
"Your queries should not be questions but affirmative form sentences: e.g. rather than \"How do I load a model from the Hub in bf16?\", query should be \"load a model from the Hub bf16 weights\".\n",
"\n",
"Question:\n",
"{question}\"\"\"\n",
" answer = agent.run(enhanced_question)\n",
" print(\"=======================================================\")\n",
" print(f\"Question: {question}\")\n",
" print(f\"Answer: {answer}\")\n",
" print(f'True answer: {example[\"answer\"]}')\n",
"\n",
" results_agentic = {\n",
" \"question\": question,\n",
" \"true_answer\": example[\"answer\"],\n",
" \"source_doc\": example[\"source_doc\"],\n",
" \"generated_answer\": answer,\n",
" }\n",
" outputs_agentic_rag.append(results_agentic)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from huggingface_hub import InferenceClient\n",
"\n",
"reader_llm = InferenceClient(\"Qwen/Qwen2.5-72B-Instruct\")\n",
"\n",
"outputs_standard_rag = []\n",
"\n",
"for example in tqdm(eval_dataset):\n",
" question = example[\"question\"]\n",
" context = retriever_tool(question)\n",
"\n",
" prompt = f\"\"\"Given the question and supporting documents below, give a comprehensive answer to the question.\n",
"Respond only to the question asked, response should be concise and relevant to the question.\n",
"Provide the number of the source document when relevant.\n",
"If you cannot find information, do not give up and try calling your retriever again with different arguments!\n",
"\n",
"Question:\n",
"{question}\n",
"\n",
"{context}\n",
"\"\"\"\n",
" messages = [{\"role\": \"user\", \"content\": prompt}]\n",
" answer = reader_llm.chat_completion(messages).choices[0].message.content\n",
"\n",
" print(\"=======================================================\")\n",
" print(f\"Question: {question}\")\n",
" print(f\"Answer: {answer}\")\n",
" print(f'True answer: {example[\"answer\"]}')\n",
"\n",
" results_agentic = {\n",
" \"question\": question,\n",
" \"true_answer\": example[\"answer\"],\n",
" \"source_doc\": example[\"source_doc\"],\n",
" \"generated_answer\": answer,\n",
" }\n",
" outputs_standard_rag.append(results_agentic)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The evaluation prompt follows some of the best principles shown in [our llm_judge cookbook](llm_judge): it follows a small integer Likert scale, has clear criteria, and a description for each score."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"EVALUATION_PROMPT = \"\"\"You are a fair evaluator language model.\n",
"\n",
"You will be given an instruction, a response to evaluate, a reference answer that gets a score of 3, and a score rubric representing a evaluation criteria are given.\n",
"1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.\n",
"2. After writing a feedback, write a score that is an integer between 1 and 3. You should refer to the score rubric.\n",
"3. The output format should look as follows: \\\"Feedback: {{write a feedback for criteria}} [RESULT] {{an integer number between 1 and 3}}\\\"\n",
"4. Please do not generate any other opening, closing, and explanations. Be sure to include [RESULT] in your output.\n",
"5. Do not score conciseness: a correct answer that covers the question should receive max score, even if it contains additional useless information.\n",
"\n",
"The instruction to evaluate:\n",
"{instruction}\n",
"\n",
"Response to evaluate:\n",
"{response}\n",
"\n",
"Reference Answer (Score 3):\n",
"{reference_answer}\n",
"\n",
"Score Rubrics:\n",
"[Is the response complete, accurate, and factual based on the reference answer?]\n",
"Score 1: The response is completely incomplete, inaccurate, and/or not factual.\n",
"Score 2: The response is somewhat complete, accurate, and/or factual.\n",
"Score 3: The response is completely complete, accurate, and/or factual.\n",
"\n",
"Feedback:\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"from huggingface_hub import InferenceClient\n",
"\n",
"evaluation_client = InferenceClient(\"meta-llama/Llama-3.1-70B-Instruct\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"results = {}\n",
"for system_type, outputs in [\n",
" (\"agentic\", outputs_agentic_rag),\n",
" (\"standard\", outputs_standard_rag),\n",
"]:\n",
" for experiment in tqdm(outputs):\n",
" eval_prompt = EVALUATION_PROMPT.format(\n",
" instruction=experiment[\"question\"],\n",
" response=experiment[\"generated_answer\"],\n",
" reference_answer=experiment[\"true_answer\"],\n",
" )\n",
" messages = [\n",
" {\"role\": \"system\", \"content\": \"You are a fair evaluator language model.\"},\n",
" {\"role\": \"user\", \"content\": eval_prompt},\n",
" ]\n",
"\n",
" eval_result = evaluation_client.text_generation(\n",
" eval_prompt, max_new_tokens=1000\n",
" )\n",
" try:\n",
" feedback, score = [item.strip() for item in eval_result.split(\"[RESULT]\")]\n",
" experiment[\"eval_score_LLM_judge\"] = score\n",
" experiment[\"eval_feedback_LLM_judge\"] = feedback\n",
" except:\n",
" print(f\"Parsing failed - output was: {eval_result}\")\n",
"\n",
" results[system_type] = pd.DataFrame.from_dict(outputs)\n",
" results[system_type] = results[system_type].loc[~results[system_type][\"generated_answer\"].str.contains(\"Error\")]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Average score for agentic RAG: 86.9%\n",
"Average score for standard RAG: 73.1%\n"
]
}
],
"source": [
"DEFAULT_SCORE = 2 # Give average score whenever scoring fails\n",
"def fill_score(x):\n",
" try:\n",
" return int(x)\n",
" except:\n",
" return DEFAULT_SCORE\n",
"\n",
"for system_type, outputs in [\n",
" (\"agentic\", outputs_agentic_rag),\n",
" (\"standard\", outputs_standard_rag),\n",
"]:\n",
"\n",
" results[system_type][\"eval_score_LLM_judge_int\"] = (\n",
" results[system_type][\"eval_score_LLM_judge\"].fillna(DEFAULT_SCORE).apply(fill_score)\n",
" )\n",
" results[system_type][\"eval_score_LLM_judge_int\"] = (results[system_type][\"eval_score_LLM_judge_int\"] - 1) / 2\n",
"\n",
" print(\n",
" f\"Average score for {system_type} RAG: {results[system_type]['eval_score_LLM_judge_int'].mean()*100:.1f}%\"\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Let us recap: the Agent setup improves scores by 14% compared to a standard RAG!** (from 73.1% to 86.9%)\n",
"\n",
"This is a great improvement, with a very simple setup 🚀\n",
"\n",
"(For a baseline, using Llama-3-70B without the knowledge base got 36%)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "test2",
"language": "python",
"name": "test2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}