supporting-blog-content/optimizing-retrieval-with-deepeval/optimizing-retrieval-with-deepeval.ipynb

{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "s49gpkvZ7q53" }, "source": [ "# Optimizing Retrieval with DeepEval\n", "\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/evaluation/optimizing-retrieval-with-deepeval.ipynb)\n", "\n", "In this tutorial, we'll use [DeepEval](https://docs.confident-ai.com/) to evaluate a RAG pipeline's retriever built with Elasticsearch in order to select the best hyperparameters—such as top-K, embedding model, and chunk size—to optimize retrieval performance.\n", "\n", "More specifically, we will:\n", "\n", "1. Define **DeepEval [metrics](https://docs.confident-ai.com/docs/metrics-contextual-precision)** to measure retrieval quality\n", "2. Build a simple RAG pipeline with Elasticsearch\n", "3. Run evaluations on the Elastic retriever using DeepEval metrics\n", "4. Optimize the hyperparameters based on evaluation results\n", "\n", "DeepEval metrics work out of the box without any additional configuration. This example demonstrates the basics of using DeepEval. For more details on advanced usage, please visit the [docs](https://docs.confident-ai.com/).\n" ] }, { "cell_type": "markdown", "metadata": { "id": "h7fdas5wmIqC" }, "source": [ "# 1. Install packages and dependencies" ] }, { "cell_type": "markdown", "metadata": { "id": "ySlCIWZvm4XU" }, "source": [ "Begin by installing the necessary libraries." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "T-eMnl3EoMer" }, "outputs": [], "source": [ "!pip install -qU deepeval elasticsearch sentence-transformers==2.7.0" ] }, { "cell_type": "markdown", "metadata": { "id": "bqPhCyNFqmYE" }, "source": [ "# 2. Define Retrieval Metrics" ] }, { "cell_type": "markdown", "metadata": { "id": "V21CBHMpuEPZ" }, "source": [ "To optimize the Elasticsearch retriever, we need a way to assess retrieval quality. In this tutorial, we introduce **3 key metrics** from DeepEval:\n", "\n", "* [**Contextual Precision**](https://docs.confident-ai.com/metrics/contextual-precision): Ensures that the most relevant information is ranked higher than irrelevant information.\n", "* [**Contextual Recall**](https://docs.confident-ai.com/metrics/contextual-recall): Measures how well the retrieved information aligns with the expected LLM output.\n", "* [**Contextual Relevancy**](https://docs.confident-ai.com/metrics/contextual-relevancy): Checks how well the retrieved context aligns with the query." ] }, { "cell_type": "markdown", "metadata": { "id": "ep3XPEVgrZBg" }, "source": [ "DeepEval metrics are powered by LLMs (LLM judge metrics). You can use any custom LLM for evaluation, but for this tutorial we'll be using `gpt-4o`. Begin by setting your `OPENAI_API_KEY`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_E9HWwi5s7bA" }, "outputs": [], "source": [ "# Set the API key as an environment variable\n", "openai_api_key = \"Your OpenAI API key\"\n", "\n", "import os\n", "\n", "os.environ[\"OPENAI_API_KEY\"] = openai_api_key" ] }, { "cell_type": "markdown", "metadata": { "id": "7NIX7AOJuH--" }, "source": [ "After setting your `OPENAI_API_KEY`, DeepEval will automatically use `gpt-4o` as the default model for running these metrics. Now, let's define the metrics."
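] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each metric constructor also accepts optional arguments such as `threshold`, `model`, and `include_reason` if you want to pin the passing threshold or swap in a different judge model. The cell below is an illustrative sketch with example values; the rest of the tutorial simply uses the defaults.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional: configure a metric explicitly (example values, not required)\n", "from deepeval.metrics import ContextualPrecisionMetric\n", "\n", "configured_precision = ContextualPrecisionMetric(\n", "    threshold=0.5,  # minimum score required for the test case to pass\n", "    model=\"gpt-4o\",  # judge model used to compute the metric\n", "    include_reason=True,  # include a short explanation alongside the score\n", ")"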
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "AlxcdlhVql8Q", "outputId": "1042b410-d728-49cf-8d39-20bc48ff5e20" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DeepEval metrics initialized successfully! 🚀\n" ] } ], "source": [ "from deepeval.metrics import (\n", " ContextualPrecisionMetric,\n", " ContextualRecallMetric,\n", " ContextualRelevancyMetric,\n", ")\n", "\n", "# Initialize the metrics\n", "contextual_precision = ContextualPrecisionMetric()\n", "contextual_recall = ContextualRecallMetric()\n", "contextual_relevancy = ContextualRelevancyMetric()\n", "\n", "print(\"DeepEval metrics initialized successfully! 🚀\")" ] }, { "cell_type": "markdown", "metadata": { "id": "gaTFHLJC-Mgi" }, "source": [ "# 3. Define the Elastic Retriever\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "47fprfgM1SJs" }, "source": [ "With the metrics defined, we can start building our RAG pipeline. In this tutorial, we'll construct and evaluate a QA RAG system designed to answer questions about Elasticsearch. First, let's create our Elastic retriever by setting up an index and populating it with knowledge about Elastic.\n", "\n", "We'll use `all-MiniLM-L6-v2` from the `sentence_transformers` library to embed our text chunks. You can learn more about this model on [Hugging Face](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "K9Q1p2C9-wce" }, "outputs": [], "source": [ "from sentence_transformers import SentenceTransformer\n", "\n", "model = SentenceTransformer(\"all-MiniLM-L6-v2\")" ] }, { "cell_type": "markdown", "metadata": { "id": "gEzq2Z1wBs3M" }, "source": [ "### Initializing the Elasticsearch retriever\n", "\n", "Instantiate the [Elasticsearch Python client](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/index.html), providing the Cloud ID and API key for your deployment." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "uP_GTVRi-d96" }, "outputs": [], "source": [ "from elasticsearch import Elasticsearch\n", "from getpass import getpass\n", "\n", "# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id\n", "ELASTIC_CLOUD_ID = getpass(\"Elastic Cloud ID: \")\n", "\n", "# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key\n", "ELASTIC_API_KEY = getpass(\"Elastic API Key: \")\n", "\n", "# Create the client instance\n", "client = Elasticsearch(\n", " # For local development\n", " # hosts=[\"http://localhost:9200\"]\n", " cloud_id=ELASTIC_CLOUD_ID,\n", " api_key=ELASTIC_API_KEY,\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "bRHbecNeEDL3" }, "source": [ "Before you continue, confirm that the client is connected by running this test." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "rdiUKqZbEKfF" }, "outputs": [], "source": [ "print(client.info())" ] }, { "cell_type": "markdown", "metadata": { "id": "LC4ZO7Oe0Rr1" }, "source": [ "To store and retrieve embeddings efficiently, we need to create an index with the correct mappings. This index will store both the text data and its corresponding dense vector embeddings for semantic search."
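] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the `dims` value in the index mapping must match the size of the vectors produced by the embedding model. As an optional sanity check, you can ask the model for its output dimension; for `all-MiniLM-L6-v2` this should print `384`, matching the mapping.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional sanity check: the mapping's dims value must equal this number\n", "print(model.get_sentence_embedding_dimension())"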
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "O7_rj18jz_fg" }, "outputs": [], "source": [ "if not client.indices.exists(index=\"knowledge_base\"):\n", " client.indices.create(\n", " index=\"knowledge_base\",\n", " mappings={\n", " \"properties\": {\n", " \"text\": {\"type\": \"text\"},\n", " \"embedding\": {\n", " \"type\": \"dense_vector\",\n", " \"dims\": 384,\n", " \"index\": True,\n", " \"similarity\": \"cosine\",\n", " },\n", " }\n", " },\n", " )" ] }, { "cell_type": "markdown", "metadata": { "id": "FsY_unxz0I4H" }, "source": [ "Finally, use the following command to index the knowledge base documents about Elastic. The `model.encode` function encodes each text chunk into a vector using the model we initialized earlier.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "6tdNiVE3yhf3" }, "outputs": [], "source": [ "# Example document chunks\n", "document_chunks = [\n", " \"Elasticsearch is a distributed search engine.\",\n", " \"RAG improves AI-generated responses with retrieved context.\",\n", " \"Vector search enables high-precision semantic retrieval.\",\n", " \"Elasticsearch uses dense vector and sparse vector similarity for semantic search.\",\n", " \"Scalable architecture allows Elasticsearch to handle massive volumes of data.\",\n", " \"Document chunking can help improve retrieval performance.\",\n", " \"Elasticsearch supports a wide range of search features.\",\n", " # Add more document chunks as needed...\n", "]\n", "operations = []\n", "for i, chunk in enumerate(document_chunks):\n", " operations.append({\"index\": {\"_index\": \"knowledge_base\", \"_id\": i}})\n", " # Convert the document chunk to an embedding vector\n", " operations.append({\"text\": chunk, \"embedding\": model.encode(chunk).tolist()})\n", "\n", "client.bulk(index=\"knowledge_base\", operations=operations, refresh=True)" ] }, { "cell_type": "markdown", "metadata": { "id": "CAQVcvqAxoiu" }, "source": [ "# 4. Define the RAG Pipeline" ] }, { "cell_type": "markdown", "metadata": { "id": "35MuQpj166of" }, "source": [ "\n", "With the Elasticsearch index already created and populated, we can **build our RAG pipeline**.\n", "\n", "Let's first define the `search` function, which serves as the Elastic retriever in our RAG pipeline. The search function:\n", "- Encodes the input query using `all-MiniLM-L6-v2`\n", "- Performs a kNN search on the Elasticsearch index to find semantically similar documents\n", "- Returns the most relevant text chunks from the index" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "umGje7VGyINl" }, "outputs": [], "source": [ "def search(input, top_k=3):\n", " # Encode the query using the model\n", " input_embedding = model.encode(input).tolist()\n", "\n", " # Search the Elasticsearch index using kNN on the \"embedding\" field\n", " res = client.search(\n", " index=\"knowledge_base\",\n", " body={\n", " \"knn\": {\n", " \"field\": \"embedding\",\n", " \"query_vector\": input_embedding,\n", " \"k\": top_k, # Retrieve the top k matches\n", " \"num_candidates\": 10, # Controls search speed vs accuracy\n", " }\n", " },\n", " )\n", "\n", " # Return a list of texts from the hits if available, otherwise an empty list\n", " return (\n", " [hit[\"_source\"][\"text\"] for hit in res[\"hits\"][\"hits\"]]\n", " if res[\"hits\"][\"hits\"]\n", " else []\n", " )" ] }, { "cell_type": "markdown", "metadata": { "id": "y_b-coeU9kww" }, "source": [ "Next, let's incorporate the `search` function into our overall RAG function. 
This RAG function:\n", "- Calls the `search` function to retrieve the most relevant context from the Elasticsearch index\n", "- Passes this context to the prompt for generating an answer with an LLM" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7Rx3254lxtHu" }, "outputs": [], "source": [ "from openai import OpenAI\n", "\n", "# Instantiate the OpenAI client\n", "openai_client = OpenAI()\n", "\n", "\n", "def RAG_generate(input, top_k=3):\n", " retrieval_context = search(input, top_k)\n", " messages = [\n", " {\n", " \"role\": \"system\",\n", " \"content\": \"Answer the user question ONLY based on the supporting context.\",\n", " },\n", " {\n", " \"role\": \"user\",\n", " \"content\": f\"User Question:\\n{input}\\n\\nSupporting Context:\\n{retrieval_context}\",\n", " },\n", " ]\n", " chat_completion = openai_client.chat.completions.create(\n", " model=\"gpt-3.5-turbo-0125\",\n", " messages=messages,\n", " temperature=0.7,\n", " max_tokens=150,\n", " )\n", " return chat_completion.choices[0].message.content.strip(), retrieval_context" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "j5plUily2882" }, "outputs": [], "source": [ "# Example usage\n", "input = \"How does Elasticsearch work?\"\n", "answer, _ = RAG_generate(input)\n", "print(answer)" ] }, { "cell_type": "markdown", "metadata": { "id": "VhtaO3qC_Pw6" }, "source": [ "# 5. Evaluating the Retriever" ] }, { "cell_type": "markdown", "metadata": { "id": "GVS4rgHY_UaT" }, "source": [ "\n", "With the RAG pipeline in place, we can begin evaluating the retriever. Evaluation consists of two main steps:\n", "\n", "1. **Test Case Preparation:** \n", " Prepare an input query along with the expected LLM response. Then, use the input to generate a response from your RAG pipeline, creating an `LLMTestCase` that contains:\n", " - `input`\n", " - `actual_output`\n", " - `expected_output`\n", " - `retrieval_context`\n", "\n", "2. **Test Case Evaluation:** \n", " Evaluate the test case using the retrieval metrics we previously defined." ] }, { "cell_type": "markdown", "metadata": { "id": "etObIacvBS3x" }, "source": [ "### Test Case Preparation\n", "\n", "Let's begin by revisiting the `input` we had earlier and preparing an `expected_output` for it." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ShpEd9-y_Ph0" }, "outputs": [], "source": [ "input = \"How does Elasticsearch work?\"\n", "expected_output = (\n", " \"Elasticsearch uses dense vector and sparse vector similarity for semantic search.\"\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "SuJUr5FZAhmI" }, "source": [ "Next, retrieve the `actual_output` and `retrieval_context` for this input and create an `LLMTestCase` from it." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "0-xp0vUhAkFy" }, "outputs": [], "source": [ "from deepeval.test_case import LLMTestCase\n", "\n", "# Example usage\n", "answer, retrieval_context = RAG_generate(input, top_k=3)\n", "\n", "test_case = LLMTestCase(\n", " input=input,\n", " actual_output=answer,\n", " expected_output=expected_output,\n", " retrieval_context=retrieval_context,\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "2n4MhhENCeXl" }, "source": [ "### Run Evaluations" ] }, { "cell_type": "markdown", "metadata": { "id": "F9RdzKcsIiq2" }, "source": [ "To run evaluations, simply pass the test case and metrics into DeepEval's `evaluate` function."
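] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a side note, if you'd rather score a single metric on its own before running a full evaluation, each DeepEval metric also exposes a `measure` method that populates its `score` and `reason` attributes. A minimal sketch using the test case from above:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional: run one metric standalone and inspect its score and reasoning\n", "contextual_precision.measure(test_case)\n", "print(contextual_precision.score)\n", "print(contextual_precision.reason)"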
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "vWHI3aLDCeJQ" }, "outputs": [], "source": [ "from deepeval import evaluate\n", "\n", "evaluate([test_case], [contextual_precision, contextual_recall, contextual_relevancy])" ] }, { "cell_type": "markdown", "metadata": { "id": "RvHHaFpMDnju" }, "source": [ "# 6. Optimizing the Retriever" ] }, { "cell_type": "markdown", "metadata": { "id": "_20m8zPoJmY8" }, "source": [ "Finally, although we defined several hyperparameters, such as the embedding model and the number of candidates, let's iterate over top-K to find the value that performs best across these metrics. This is as simple as a `for` loop in DeepEval.\n", "\n", "To optimize every hyperparameter, iterate over each of them in the same way and compare the metric scores to find the best combination for your use case." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "84OKKFkHDye-" }, "outputs": [], "source": [ "# Example usage\n", "for top_k in [1, 3, 5, 7]:\n", " answer, retrieval_context = RAG_generate(input, top_k)\n", "\n", " test_case = LLMTestCase(\n", " input=input,\n", " actual_output=answer,\n", " expected_output=expected_output,\n", " retrieval_context=retrieval_context,\n", " )\n", "\n", " evaluate(\n", " [test_case], [contextual_precision, contextual_recall, contextual_relevancy]\n", " )" ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" }, "vscode": { "interpreter": { "hash": "b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e" } } }, "nbformat": 4, "nbformat_minor": 0 }