# Optimizing Retrieval with DeepEval

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/evaluation/optimizing-retrieval-with-deepeval.ipynb)

In this tutorial, we'll use [DeepEval](https://docs.confident-ai.com/) to evaluate a RAG pipeline's retriever built with Elasticsearch in order to select the best hyperparameters (such as top-K, embedding model, and chunk size) to optimize retrieval performance.

More specifically, we will:


1. Define **DeepEval [metrics](https://docs.confident-ai.com/docs/metrics-contextual-precision)** to measure retrieval quality
2. Build a simple RAG pipeline with Elasticsearch  
3. Run evaluations on the Elastic retriever using DeepEval metrics
4. Optimize the hyperparameters based on evaluation results  

DeepEval metrics work out of the box without any additional configuration. This example demonstrates the basics of using DeepEval. For more details on advanced usage, please visit the [docs](https://docs.confident-ai.com/).


# 1. Install packages and dependencies

Begin by installing the necessary libraries.

In [None]:
!pip install -qU deepeval elasticsearch sentence-transformers==2.7.0

# 2. Define Retrieval Metrics

To optimize your Elasticsearch retriever, we'll need a way to assess retrieval quality. In this tutorial, we introduce **3 key metrics** from DeepEval:

* [**Contextual Precision**](https://docs.confident-ai.com/metrics/contextual-precision): Checks that the most relevant information is ranked higher than irrelevant information.
* [**Contextual Recall**](https://docs.confident-ai.com/metrics/contextual-recall): Measures how well the retrieved information aligns with the expected LLM output.
* [**Contextual Relevancy**](https://docs.confident-ai.com/metrics/contextual-relevancy): Checks how well the retrieved context aligns with the query.
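
To build intuition for what these three scores capture, the sketch below computes them from hand-labeled binary verdicts. It is a simplification under stated assumptions: in DeepEval the verdicts come from an LLM judge, and the helper names `contextual_precision_score` and `ratio_score` are hypothetical, not part of the DeepEval API. The formulas follow the weighted cumulative precision and simple ratio definitions described in the DeepEval docs.

```python
from typing import Sequence


def contextual_precision_score(relevance: Sequence[bool]) -> float:
    """Weighted cumulative precision over a ranked list of retrieved chunks.

    relevance[i] is True when the chunk at rank i + 1 is judged relevant.
    """
    relevant_so_far = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_so_far += 1
            # Precision at this rank only contributes when the chunk is relevant
            weighted_sum += relevant_so_far / rank
    return weighted_sum / relevant_so_far if relevant_so_far else 0.0


def ratio_score(verdicts: Sequence[bool]) -> float:
    """Share of positive verdicts.

    For recall: one verdict per statement in the expected output
    (can it be attributed to the retrieved context?).
    For relevancy: one verdict per statement in the retrieved context
    (is it relevant to the input?).
    """
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


# The same two relevant chunks, ranked first versus ranked last
print(contextual_precision_score([True, True, False]))  # 1.0
print(contextual_precision_score([False, True, True]))  # about 0.58
print(ratio_score([True, False, True, True]))  # 0.75
```

Note that precision depends on ranking order, while the two ratio scores do not.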

DeepEval metrics are powered by LLMs (they are LLM-as-a-judge metrics). You can use any custom LLM for evaluation, but for this tutorial we'll be using `gpt-4o`. Begin by setting your `OPENAI_API_KEY`:

In [None]:
# Export the API key to an environment variable
openai_api_key = "Your OpenAI API key"

import os

os.environ["OPENAI_API_KEY"] = openai_api_key
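
Hardcoding a key in a notebook cell makes it easy to commit by accident. As an alternative, here is a minimal sketch that reuses a key already present in the environment and otherwise prompts for it without echoing, in the same way the Elastic credentials are collected later in this tutorial. The helper name `ensure_env_var` is hypothetical.

```python
import os
from getpass import getpass
from typing import Callable, MutableMapping


def ensure_env_var(
    name: str,
    env: MutableMapping[str, str] = os.environ,
    prompt: Callable[[str], str] = getpass,
) -> str:
    """Return env[name], prompting for a value only when it is unset or empty."""
    value = env.get(name)
    if not value:
        value = prompt(f"{name}: ")
        env[name] = value
    return value


# In the notebook this would be called as:
# ensure_env_var("OPENAI_API_KEY")
```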

After setting your `OPENAI_API_KEY`, DeepEval will automatically use `gpt-4o` as the default model for running these metrics. Now, let's define the metrics.

In [None]:
from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
)

# Initialize the metrics
contextual_precision = ContextualPrecisionMetric()
contextual_recall = ContextualRecallMetric()
contextual_relevancy = ContextualRelevancyMetric()

print("DeepEval metrics initialized successfully! 🚀")

DeepEval metrics initialized successfully! 🚀


# 3. Defining the Elastic Retriever



With the metrics defined, we can start building our RAG pipeline. In this tutorial, we'll construct and evaluate a QA RAG system designed to answer questions about Elasticsearch. First, let's create our Elastic retriever by setting up an index and populating it with knowledge about Elastic.

We'll use `all-MiniLM-L6-v2` from the `sentence_transformers` library to embed our text chunks. You can learn more about this model on [Hugging Face](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).


In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

### Initializing the Elasticsearch retriever

Instantiate the [Elasticsearch Python client](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/index.html), providing the Cloud ID and API key for your deployment.

In [None]:
from elasticsearch import Elasticsearch
from getpass import getpass

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id
ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key
ELASTIC_API_KEY = getpass("Elastic Api Key: ")

# Create the client instance
client = Elasticsearch(
    # For local development
    # hosts=["http://localhost:9200"]
    cloud_id=ELASTIC_CLOUD_ID,
    api_key=ELASTIC_API_KEY,
)

Before you continue, confirm that the client has connected with this test.

In [None]:
print(client.info())

To store and retrieve embeddings efficiently, we need to create an index with the correct mappings. This index will store both the text data and its corresponding dense vector embeddings for semantic search.

In [None]:
if not client.indices.exists(index="knowledge_base"):
    client.indices.create(
        index="knowledge_base",
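        mappings={
            "properties": {
                # Raw chunk text, returned to the RAG pipeline as context
                "text": {"type": "text"},
                # 384 matches the output size of all-MiniLM-L6-v2
                "embedding": {
                    "type": "dense_vector",
                    "dims": 384,
                    "index": True,
                    "similarity": "cosine",
                },
            }
        },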
    )
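
Because the embedding model is one of the hyperparameters under study, the `dims` value has to change whenever the model does. The following sketch parameterizes the mapping; `build_mappings` is a hypothetical helper, and the commented lines assume the `model` and `client` objects defined earlier.

```python
def build_mappings(dims: int, similarity: str = "cosine") -> dict:
    """Build index mappings for a text field plus an indexed dense vector."""
    if dims <= 0:
        raise ValueError("dims must be a positive integer")
    return {
        "properties": {
            "text": {"type": "text"},
            "embedding": {
                "type": "dense_vector",
                "dims": dims,
                "index": True,
                "similarity": similarity,
            },
        }
    }


mappings = build_mappings(dims=384)  # 384 matches all-MiniLM-L6-v2

# With the notebook's objects, this could be used as:
# dims = model.get_sentence_embedding_dimension()
# client.indices.create(index="knowledge_base", mappings=build_mappings(dims))
```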

Finally, use the following code to upload the knowledge base information about Elastic. The `model.encode` function encodes each text into a vector using the model we initialized earlier.


In [None]:
# Example document chunks
document_chunks = [
    "Elasticsearch is a distributed search engine.",
    "RAG improves AI-generated responses with retrieved context.",
    "Vector search enables high-precision semantic retrieval.",
    "Elasticsearch uses dense vector and sparse vector similarity for semantic search.",
    "Scalable architecture allows Elasticsearch to handle massive volumes of data.",
    "Document chunking can help improve retrieval performance.",
    "Elasticsearch supports a wide range of search features.",
    # Add more document chunks as needed...
]
operations = []
for i, chunk in enumerate(document_chunks):
    operations.append({"index": {"_index": "knowledge_base", "_id": i}})
    # Convert the document chunk to an embedding vector
    operations.append({"text": chunk, "embedding": model.encode(chunk).tolist()})

client.bulk(index="knowledge_base", operations=operations, refresh=True)
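
A bulk request can succeed as a whole while individual documents fail, so it may be worth inspecting the response. The sketch below wraps the operation-building loop in a function and adds a check on the `errors` flag and per-item `error` entries of the bulk response. Both helper names are hypothetical, and a stand-in encoder replaces `model.encode` so the example runs without downloading a model.

```python
from typing import Callable, Iterable, List, Sequence


def build_bulk_operations(
    chunks: Iterable[str],
    encode: Callable[[str], Sequence[float]],
    index: str = "knowledge_base",
) -> List[dict]:
    """Interleave action lines and document bodies, as the bulk API expects."""
    operations = []
    for i, chunk in enumerate(chunks):
        operations.append({"index": {"_index": index, "_id": i}})
        operations.append({"text": chunk, "embedding": list(encode(chunk))})
    return operations


def failed_items(bulk_response: dict) -> List[dict]:
    """Collect the per-document errors reported in a bulk response body."""
    if not bulk_response.get("errors"):
        return []
    failures = []
    for item in bulk_response.get("items", []):
        # Each item is keyed by its action type, e.g. {"index": {...}}
        for action, result in item.items():
            if "error" in result:
                failures.append(
                    {"action": action, "_id": result.get("_id"), "error": result["error"]}
                )
    return failures


# Stand-in encoder (an assumption for this sketch); use model.encode in the notebook
def fake_encode(text: str) -> List[float]:
    return [float(len(text)), 0.0]


ops = build_bulk_operations(["a", "bb"], fake_encode)
print(len(ops))  # 4

# In the notebook this could be used as:
# response = client.bulk(operations=ops, refresh=True)
# print(failed_items(response.body))
```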

# 4. Define the RAG Pipeline


With the Elasticsearch database already initialized and populated, we can **build our RAG pipeline**.

Let's first define the `search` function, which serves as the Elastic retriever in our RAG pipeline. The search function:
- Encodes the input query using `all-MiniLM-L6-v2`
- Performs a kNN search on the Elasticsearch index to find semantically similar documents
- Returns the most relevant knowledge from the data

In [None]:
def search(input, top_k=3):
    # Encode the query using the model
    input_embedding = model.encode(input).tolist()

    # Search the Elasticsearch index using kNN on the "embedding" field
    res = client.search(
        index="knowledge_base",
        body={
            "knn": {
                "field": "embedding",
                "query_vector": input_embedding,
                "k": top_k,  # Retrieve the top k matches
                "num_candidates": 10,  # Controls search speed vs accuracy
            }
        },
    )

    # Return the text of each hit; an empty hit list yields an empty list
    return [hit["_source"]["text"] for hit in res["hits"]["hits"]]
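
For intuition about what the kNN query ranks by, here is a brute-force equivalent in plain Python: score every stored vector by cosine similarity to the query and keep the top `k`. This is only an illustration with hypothetical helper names. Elasticsearch performs an approximate search over an HNSW graph rather than an exhaustive scan, which is why `num_candidates` trades speed against accuracy.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def brute_force_knn(
    query: Sequence[float], vectors: Sequence[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Return (index, similarity) pairs for the k most similar vectors."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
top = brute_force_knn([1.0, 0.1], vectors, k=2)
print([i for i, _ in top])  # [0, 2]
```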

Next, let's incorporate the `search` function into our overall RAG function. This RAG function:
- Calls the `search` function to retrieve the most relevant context from the Elasticsearch database
- Passes this context to the prompt for generating an answer with an LLM

In [None]:
import os
from openai import OpenAI

# Instantiate the OpenAI client
openai_client = OpenAI()


def RAG_generate(input, top_k=3):
    retrieval_context = search(input, top_k)
    messages = [
        {
            "role": "system",
            "content": "Answer the user question ONLY based on the supporting context.",
        },
        {
            "role": "user",
            "content": f"User Question:\n{input}\n\nSupporting Context:\n{retrieval_context}",
        },
    ]
    chat_completion = openai_client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        messages=messages,
        temperature=0.7,
        max_tokens=150,
    )
    return chat_completion.choices[0].message.content.strip(), retrieval_context

In [None]:
# Example usage
input = "How does Elasticsearch work?"
answer, _ = RAG_generate(input)
print(answer)
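
In `RAG_generate`, the f-string interpolates `retrieval_context` directly, so the model sees the Python list representation, brackets and quotes included. One possible variation, sketched below with a hypothetical `build_messages` helper, joins the chunks into numbered lines before building the prompt. Whether this improves answers for a given model is something to verify rather than assume.

```python
from typing import Dict, List


def build_messages(question: str, retrieval_context: List[str]) -> List[Dict[str, str]]:
    """Build chat messages with the retrieved chunks rendered as a numbered list."""
    context_block = "\n".join(
        f"{i}. {chunk}" for i, chunk in enumerate(retrieval_context, start=1)
    )
    return [
        {
            "role": "system",
            "content": "Answer the user question ONLY based on the supporting context.",
        },
        {
            "role": "user",
            "content": f"User Question:\n{question}\n\nSupporting Context:\n{context_block}",
        },
    ]


messages = build_messages(
    "How does Elasticsearch work?",
    [
        "Elasticsearch is a distributed search engine.",
        "Vector search enables high-precision semantic retrieval.",
    ],
)
print(messages[1]["content"])
```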

# 5. Evaluating the Retriever


With the RAG pipeline in place, we can begin evaluating the retriever. Evaluation consists of two main steps:

1. **Test Case Preparation:**  
   Prepare an input query along with the expected LLM response. Then, use the input to generate a response from your RAG pipeline, creating an `LLMTestCase` that contains:
   - `input`
   - `actual_output`
   - `expected_output`
   - `retrieval_context`

2. **Test Case Evaluation:**  
   Evaluate the test case using the selection of retrieval metrics we previously defined.

### Test Case preparation

Let's begin by revisiting the `input` we had earlier and preparing an `expected_output` for it.

In [None]:
input = "How does Elasticsearch work?"
expected_output = (
    "Elasticsearch uses dense vector and sparse vector similarity for semantic search."
)

Next, retrieve the `actual_output` and `retrieval_context` for this input and create an `LLMTestCase` from it.

In [None]:
from deepeval.test_case import LLMTestCase

# Example usage
answer, retrieval_context = RAG_generate(input, top_k=3)

test_case = LLMTestCase(
    input=input,
    actual_output=answer,
    expected_output=expected_output,
    retrieval_context=retrieval_context,
)
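
A single test case gives a noisy picture of retrieval quality, so in practice you would likely evaluate several questions. The sketch below builds the four test case fields for a list of question and expected-answer pairs. The helper `build_test_case_fields` is hypothetical, the second pair is an illustrative example based on the sample chunks above, and a stand-in generator replaces `RAG_generate` so the code runs offline.

```python
from typing import Callable, Dict, List, Sequence, Tuple

GenerateFn = Callable[[str, int], Tuple[str, List[str]]]


def build_test_case_fields(
    qa_pairs: Sequence[Tuple[str, str]], generate: GenerateFn, top_k: int = 3
) -> List[Dict[str, object]]:
    """Run the pipeline for each question and collect the test case fields."""
    cases = []
    for question, expected in qa_pairs:
        answer, context = generate(question, top_k)
        cases.append(
            {
                "input": question,
                "actual_output": answer,
                "expected_output": expected,
                "retrieval_context": context,
            }
        )
    return cases


# Stand-in generator (an assumption for this sketch); pass RAG_generate in the notebook
def fake_generate(question: str, top_k: int) -> Tuple[str, List[str]]:
    return f"answer to: {question}", [f"chunk {i}" for i in range(top_k)]


qa_pairs = [
    (
        "How does Elasticsearch work?",
        "Elasticsearch uses dense vector and sparse vector similarity for semantic search.",
    ),
    (
        "What does RAG improve?",
        "RAG improves AI-generated responses with retrieved context.",
    ),
]
fields = build_test_case_fields(qa_pairs, fake_generate, top_k=2)
print(len(fields))  # 2

# Each dict can then be turned into a test case:
# test_cases = [LLMTestCase(**f) for f in fields]
```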

### Run Evaluations

To run evaluations, simply pass the test case and metrics into DeepEval's `evaluate` function.

In [None]:
from deepeval import evaluate

evaluate([test_case], [contextual_precision, contextual_recall, contextual_relevancy])
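
`evaluate` prints a report, but comparing configurations is easier with the raw numbers. DeepEval metrics can also be run one at a time with `measure`, after which the metric's `score` and `reason` attributes are populated. The sketch below relies only on that surface; `collect_scores` is a hypothetical helper, and a stub metric stands in for the real ones so the example runs without an LLM call.

```python
from typing import Dict, Iterable


def collect_scores(metrics: Iterable[object], test_case: object) -> Dict[str, dict]:
    """Run each metric on the test case and gather its score and reason."""
    results = {}
    for metric in metrics:
        metric.measure(test_case)  # calls the LLM judge for real DeepEval metrics
        results[type(metric).__name__] = {
            "score": metric.score,
            "reason": getattr(metric, "reason", None),
        }
    return results


class StubMetric:
    """Offline stand-in that mimics the measure/score/reason surface."""

    def __init__(self, fixed_score: float):
        self.fixed_score = fixed_score
        self.score = None
        self.reason = None

    def measure(self, test_case: object) -> float:
        self.score = self.fixed_score
        self.reason = "stubbed"
        return self.score


print(collect_scores([StubMetric(0.5)], test_case=None))
# {'StubMetric': {'score': 0.5, 'reason': 'stubbed'}}

# In the notebook this could be used as:
# scores = collect_scores(
#     [contextual_precision, contextual_recall, contextual_relevancy], test_case
# )
```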

# 6. Optimizing the Retriever

Finally, the retriever has several hyperparameters, such as the embedding model, the number of candidates, and top-K. Here we iterate over top-K only, to find the value that performs best across these metrics. With DeepEval, this takes nothing more than a `for` loop.

To optimize all hyperparameters, you'll want to iterate over all of them along with the metrics to find the best hyperparameter combination for your use case!

In [None]:
# Example usage
for top_k in [1, 3, 5, 7]:
    answer, retrieval_context = RAG_generate(input, top_k)

    test_case = LLMTestCase(
        input=input,
        actual_output=answer,
        expected_output=expected_output,
        retrieval_context=retrieval_context,
    )

    evaluate(
        [test_case], [contextual_precision, contextual_recall, contextual_relevancy]
    )
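
To extend the loop to several hyperparameters and pick a winner, one option is to enumerate a grid of configurations and rank them by the mean of their metric scores. The sketch below shows only the bookkeeping. The helper names are hypothetical, averaging the three metrics equally is an assumption you may want to change, and the scores in `example_results` are made-up placeholders, not real evaluation output.

```python
from itertools import product
from statistics import mean
from typing import Dict, Iterator, List, Sequence, Tuple


def iter_grid(grid: Dict[str, Sequence]) -> Iterator[dict]:
    """Yield one config dict per combination of hyperparameter values."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def best_config(results: List[Tuple[dict, Dict[str, float]]]) -> dict:
    """Return the config whose metric scores have the highest mean."""
    return max(results, key=lambda item: mean(item[1].values()))[0]


grid = {"top_k": [1, 3, 5, 7], "num_candidates": [10, 50]}
configs = list(iter_grid(grid))
print(len(configs))  # 8

# Placeholder scores purely to demonstrate the selection step
example_results = [
    ({"top_k": 1}, {"precision": 1.0, "recall": 0.4, "relevancy": 0.9}),
    ({"top_k": 3}, {"precision": 0.8, "recall": 0.9, "relevancy": 0.7}),
]
print(best_config(example_results))  # {'top_k': 3}
```

In the notebook, each config would drive one run of the pipeline and the metrics, with the resulting scores appended to the results list.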