In [None]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Advanced RAG Techniques - Vertex RAG Engine Retrieval Quality Evaluation and Hyperparameters Tuning

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/rag-engine/rag_engine_evaluation.ipynb">
      <img width="32px" src="https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Frag-engine%2Frag_engine_evaluation.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/rag-engine/rag_engine_evaluation.ipynb">
      <img src="https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/rag-engine/rag_engine_evaluation.ipynb">
      <img width="32px" src="https://www.svgrepo.com/download/217753/github.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

<div style="clear: both;"></div>


|           |                                         |
|-----------|---------------------------------------- |
| Author(s) | [Ed Tsoi](https://github.com/edtsoi430) |

## Overview

Retrieval Quality is arguably the most important component of a Retrieval Augmented Generation (RAG) application. Not only does it directly impact the quality of the generated response, in some cases poor retrieval could also lead to irrelevant, incomplete or hallucinated output.

> **You'll learn how to:**
> * Evaluate retrieval quality using the [BEIR-fiqa 2018 dataset](https://arxiv.org/abs/2104.08663) (or your own!).
> * Understand the impact of key parameters on retrieval performance. (e.g. embedding model, chunk size)
> * Tune hyperparameters to improve accuracy of the RAG system.

**Note:** This notebook assumes that you already understand how to implement a RAG system with Vertex AI RAG Engine. For more general instructions on how to use Vertex AI RAG Engine, please refer to the [RAG Engine API Documentation](https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/rag-api).

We'll explore how these hyperparameters influence retrieval:

| Parameter                 | Description                                                                         |
|---------------------------|-------------------------------------------------------------------------------------|
| Chunk Size                | Size of each chunk (in tokens). Affects granularity of retrieval.                   |
| Chunk Overlap             | Overlap between chunks. Helps capture relevant information across chunk boundaries. |
| Top K                     | Maximum number of retrieved contexts.  Balance recall and precision.                |
| Vector Distance threshold | Filters contexts based on similarity.  A stricter threshold prioritizes precision.  |
| Embedding model           | Model used to convert text to embeddings. Significantly impacts retrieval accuracy. |
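
To make the Top K and vector distance threshold rows concrete, here is a minimal sketch on made-up data. The `select_contexts` helper and the distance values are hypothetical and are not part of the RAG Engine API; they only illustrate how a distance cutoff and a result cap interact.

```python
# Toy retrieval results as (doc_id, distance) pairs; a smaller distance is a closer match.
# These values are invented for illustration.
candidates = [
    ("doc_a", 0.21),
    ("doc_b", 0.34),
    ("doc_c", 0.48),
    ("doc_d", 0.57),
    ("doc_e", 0.73),
]


def select_contexts(candidates, top_k, distance_threshold):
    """Keep results closer than the threshold, then cap the list at top_k."""
    ranked = sorted(candidates, key=lambda pair: pair[1])
    within_threshold = [pair for pair in ranked if pair[1] < distance_threshold]
    return within_threshold[:top_k]


loose = select_contexts(candidates, top_k=10, distance_threshold=0.8)
strict = select_contexts(candidates, top_k=10, distance_threshold=0.5)
capped = select_contexts(candidates, top_k=2, distance_threshold=0.8)

print(len(loose), len(strict), len(capped))
```

In this toy case the stricter threshold drops the two most distant results, while the smaller top k keeps only the two closest ones regardless of the threshold.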

### How exactly could we use this notebook to improve the RAG system?

* **Hyperparameters Tuning:** Several hyperparameters can affect retrieval quality:

| Parameter | Description |
|------------|----------------------|
| Chunk Size | When documents are ingested into an index, they are split into chunks. The `chunk_size` parameter (in tokens) specifies the size of each chunk. |
| Chunk Overlap |  By default, documents are split into chunks with a certain amount of overlap to improve relevance and retrieval quality. |
| Top K | Controls the maximum number of contexts that are retrieved. |
| Vector Distance threshold | Only contexts with a distance smaller than the threshold are considered. |
| Embedding model | The embedding model used to convert input text into embeddings for retrieval.|

You can use this notebook to evaluate your retrieval quality and to see how changing certain parameters (such as top k or chunk size) affects the retrieval metrics (`recall@k`, `precision@k`, `ndcg@k`).

* **Response Quality Evaluation:** Once you have optimized the retrieval metrics, you can examine how they impact response quality using the [Evaluation Service API Notebook](https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/evaluation/evaluate_rag_gen_ai_evaluation_service_sdk.ipynb).


## Get started

### Install Vertex AI SDK and other required packages


In [None]:
%pip install --upgrade --user --quiet google-cloud-aiplatform beir

### Restart runtime

To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which restarts the current kernel.

The restart might take a minute or longer. After it's restarted, continue to the next step.

In [2]:
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

<div class="alert alert-block alert-warning">
<b>⚠️ The kernel is going to restart. Wait until it's finished before continuing to the next step. ⚠️</b>
</div>


### Authenticate your notebook environment (Colab only)

If you're running this notebook on Google Colab, run the cell below to authenticate your environment.

In [3]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

### Set Google Cloud project information and initialize the Vertex AI SDK

To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

In [1]:
# Use the environment variable if the user doesn't provide Project ID.
import os

import vertexai

PROJECT_ID = "[your-project-id]"  # @param {type: "string", placeholder: "[your-project-id]", isTemplate: true}

if not PROJECT_ID or PROJECT_ID == "[your-project-id]":
    PROJECT_ID = str(os.environ.get("GOOGLE_CLOUD_PROJECT"))

LOCATION = os.environ.get("GOOGLE_CLOUD_REGION", "us-central1")

vertexai.init(project=PROJECT_ID, location=LOCATION)

In [None]:
!gcloud auth application-default login
!gcloud auth application-default set-quota-project {PROJECT_ID}
!gcloud config set project {PROJECT_ID}

### Import libraries

In [43]:
from collections.abc import MutableSequence
import math
import os
import re
import time

from beir import util
from beir.datasets.data_loader import GenericDataLoader
from google.cloud import storage
from google.cloud.aiplatform_v1beta1.types import Context, RetrieveContextsResponse
import numpy as np
from tqdm import tqdm
from vertexai.preview import rag

### Define helper function for processing dataset.

In [49]:
def convert_beir_to_rag_corpus(
    corpus: dict[str, dict[str, str]], output_dir: str
) -> None:
    """
    Convert a BEIR corpus to Vertex RAG corpus format with a maximum of 10,000
    files per subdirectory.

    For each document in the BEIR corpus, we will create a new txt where:
      * doc_id will be the file name
      * doc_content will be the document text prepended by title (if any).

    Args:
      corpus: BEIR corpus
      output_dir (str): Directory where the converted corpus will be saved

    Returns:
      None (will write output to disk)
    """
    # Create the output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)

    file_count, subdir_count = 0, 0
    current_subdir = os.path.join(output_dir, f"{subdir_count}")
    os.makedirs(current_subdir, exist_ok=True)

    # Convert each file in the corpus
    for doc_id, doc_content in corpus.items():
        # Combine title and text (if title exists)
        full_text = doc_content.get("title", "")
        if full_text:
            full_text += "\n\n"
        full_text += doc_content["text"]

        # Create a new file for each file.
        file_path = os.path.join(current_subdir, f"{doc_id}.txt")
        with open(file_path, "w", encoding="utf-8") as file:
            file.write(full_text)

        file_count += 1

        # Create a new subdirectory if the current one has reached the limit
        if file_count >= 10000:
            subdir_count += 1
            current_subdir = os.path.join(output_dir, f"{subdir_count}")
            os.makedirs(current_subdir, exist_ok=True)
            file_count = 0

    print(f"Conversion complete. {len(corpus)} files saved in {output_dir}")


def count_files_in_gcs_bucket(gcs_path: str) -> int:
    """
    Counts the number of files in a Google Cloud Storage path,
    excluding directories and hidden files.

    Args:
      gcs_path: The full GCS path, including the bucket name and any prefix.
       * Example: 'gs://my-bucket/my-folder'

    Returns:
      The number of files in the GCS path.
    """

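    # Split the GCS path into bucket name and prefix.
    # partition() also handles a bare bucket path such as 'gs://my-bucket',
    # in which case the prefix is empty and every object in the bucket is listed.
    bucket_name, _, prefix = gcs_path.removeprefix("gs://").partition("/")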

    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)

    count = 0
    blobs = bucket.list_blobs(prefix=prefix)
    for blob in blobs:
        if not blob.name.endswith("/") and not any(
            part.startswith(".") for part in blob.name.split("/")
        ):  # Exclude directories and hidden files
            count += 1

    return count


def count_directories_after_split(gcs_path: str) -> int:
    """
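    Estimates the number of subdirectories under a Google Cloud Storage path,
    assuming the files were split into subdirectories of at most 10,000 files
    each (as done by `convert_beir_to_rag_corpus`).

    Args:
      gcs_path: The full GCS path, including the bucket name and any prefix.

    Returns:
      The number of subdirectories, computed as ceil(file count / 10000).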
    """
    num_files_in_path = count_files_in_gcs_bucket(gcs_path)
    num_directories = math.ceil(num_files_in_path / 10000)
    return num_directories


def import_rag_files_from_gcs(
    paths: list[str], chunk_size: int, chunk_overlap: int, corpus_name: str
) -> None:
    """Imports files from Google Cloud Storage to a RAG corpus.

    Args:
      paths: A list of GCS paths to import files from.
      chunk_size: The size of each chunk to import.
      chunk_overlap: The overlap between consecutive chunks.
      corpus_name: The name of the RAG corpus to import files into.

    Returns:
      None
    """
    total_imported, total_num_of_files = 0, 0

    for path in paths:
        num_files_to_be_imported = count_files_in_gcs_bucket(path)
        total_num_of_files += num_files_to_be_imported
        max_retries, attempt, imported = 10, 0, 0
        while attempt < max_retries and imported < num_files_to_be_imported:
            response = rag.import_files(
                corpus_name,
                [path],
                chunk_size=chunk_size,
                chunk_overlap=chunk_overlap,
                timeout=20000,
                max_embedding_requests_per_min=1400,
            )
            imported += response.imported_rag_files_count or 0
            attempt += 1
        total_imported += imported

    print(f"{total_imported} files out of {total_num_of_files} imported!")

# For step 1, please choose only one of the following options:
- **1.1 (Option A, Recommended):** Create RagCorpus and perform data ingestion using the provided public GCS bucket (BEIR-fiqa dataset only).

- **1.2 (Option B):** Create RAG Corpus, choose a custom beir dataset and upload/ingest data into the RagCorpus on your own.

- **1.3 (Option C):** Bring your own existing `RagCorpus` (insert `RAG_CORPUS_ID` here).

**Do not run all these cells together.**

# 1.1 - Option A (Recommended): Create RagCorpus and perform data ingestion using the provided public GCS bucket (BEIR-fiqa dataset only).
* This option is recommended because it saves you from having to upload the evaluation dataset to GCS yourself before importing it into the `RagCorpus`.
* However, if you would like more flexibility on which BEIR dataset to use, you could go with option B below to upload data to your desired GCS location.
* If you would like to bring your own rag corpus, simply skip to Option C and specify the rag corpus id.

### Create a `RagCorpus` with the specified configuration (for evaluation)

In [None]:
# See the list of current supported embedding models here: https://cloud.google.com/vertex-ai/generative-ai/docs/rag-overview#supported-embedding-models
# Select embedding model as desired.
embedding_model_config = rag.EmbeddingModelConfig(
    publisher_model="publishers/google/models/text-embedding-005"  # @param {type:"string", isTemplate: true},
)

rag_corpus = rag.create_corpus(
    display_name="test-corpus",
    description="A test corpus where we import the BEIR-FiQA-2018 dataset",
    embedding_model_config=embedding_model_config,
)

print(rag_corpus)

### Copy beir-fiqa dataset from the public path to a storage bucket in your project.

In [None]:
CURRENT_BUCKET_PATH = "gs://<INSERT_GCS_PATH_HERE>"  # @param {type:"string"},

PUBLIC_BEIR_FIQA_GCS_PATH = (
    "gs://github-repo/generative-ai/gemini/rag-engine/rag_engine_evaluation/beir-fiqa"
)

!gsutil -m rsync -r -d $PUBLIC_BEIR_FIQA_GCS_PATH $CURRENT_BUCKET_PATH

### Import evaluation dataset files into `RagCorpus` (configure chunk size, chunk overlap etc as desired)

In [40]:
num_subdirectories = count_directories_after_split(CURRENT_BUCKET_PATH)
paths = [CURRENT_BUCKET_PATH + f"/{i}/" for i in range(num_subdirectories)]

chunk_size = 512  # @param {type:"integer"}
chunk_overlap = 102  # @param {type:"integer"}

import_rag_files_from_gcs(
    paths=paths,
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    corpus_name=rag_corpus.name,
)

57638 files out of 57638 imported!


# 1.2 - Option B: Create RAG Corpus, choose a custom beir dataset and upload/ingest data into the RagCorpus on your own.

* Choose this option if you would like to have more flexibility on which dataset to use. The public, uploaded data in option 1.1 is for `BEIR-fiqa` only.
* If you would like to bring your own existing `RagCorpus` (with imported files), skip to Option C below.

### Create a `RagCorpus` with the specified configuration.

In [None]:
# See the list of current supported embedding models here: https://cloud.google.com/vertex-ai/generative-ai/docs/rag-overview#supported-embedding-models
# You may adjust the embedding model here if you would like.
embedding_model_config = rag.EmbeddingModelConfig(
    publisher_model="publishers/google/models/text-embedding-005"  # @param {type:"string", isTemplate: true},
)

rag_corpus = rag.create_corpus(
    display_name="test-corpus",
    description="A test corpus where we import the BEIR-FiQA-2018 dataset",
    embedding_model_config=embedding_model_config,
)

print(rag_corpus)

### Load BEIR Fiqa dataset (test split).
- Configure dataset of choice.
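
For orientation, the BEIR loader returns three dictionaries: `corpus`, `queries`, and `qrels`. The sketch below mimics their shape with invented IDs and texts (the `toy_` names are hypothetical), so you can see how a query maps to its relevant documents before loading the real dataset.

```python
# Toy data in the same shape as the three objects returned by BEIR's
# GenericDataLoader.load(); the IDs and texts are invented.
toy_corpus = {
    "doc1": {"title": "Index funds", "text": "Index funds track a market index."},
    "doc2": {"title": "", "text": "A Roth IRA is funded with after-tax income."},
}
toy_queries = {"q1": "What is an index fund?"}
toy_qrels = {"q1": {"doc1": 1}}  # query ID -> {relevant doc ID: relevance grade}

for query_id, query_text in toy_queries.items():
    relevant_ids = list(toy_qrels[query_id])
    print(query_text, "->", relevant_ids)
    for doc_id in relevant_ids:
        doc = toy_corpus[doc_id]
        # Title, blank line, then text when a title exists; text only otherwise.
        full_text = f"{doc['title']}\n\n{doc['text']}" if doc["title"] else doc["text"]
        print(full_text)
```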

In [None]:
# Download and load a BEIR dataset
dataset = "fiqa"  # @param ["arguana", "climate-fever", "cqadupstack", "dbpedia-entity", "fever", "fiqa", "germanquad", "hotpotqa", "mmarco", "mrtydi", "msmarco-v2", "msmarco", "nfcorpus", "nq-train", "nq", "quora", "scidocs", "scifact", "trec-covid-beir", "trec-covid-v2", "trec-covid", "vihealthqa", "webis-touche2020"]
url = (
    f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
)
out_dir = "datasets"
data_path = util.download_and_unzip(url, out_dir)

# Load the dataset
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")
print(
    f"Successfully loaded the {dataset} dataset with {len(corpus)} files and {len(queries)} queries!"
)

datasets/fiqa.zip:   0%|          | 0.00/17.1M [00:00<?, ?iB/s]

  0%|          | 0/57638 [00:00<?, ?it/s]

Successfully loaded the fiqa dataset with 57638 files and 648 queries!


### Convert BEIR corpus to `RagCorpus` format and upload to GCS bucket.

In [None]:
CONVERTED_DATASET_PATH = f"/converted_dataset_{dataset}"
# Convert BEIR corpus to RAG format.
convert_beir_to_rag_corpus(corpus, CONVERTED_DATASET_PATH)

#### Create a test bucket for uploading BEIR evaluation dataset to (or use an existing bucket of your choice).

In [None]:
# Optionally rename bucket name here.
BUCKET_NAME = "beir-test-bucket"  # @param {type: "string"}
!gsutil mb gs://{BUCKET_NAME}

#### Upload to specified GCS bucket (Modify the GCS bucket path to desired location)

In [None]:
# Modify the path below if you want a different location in the bucket.
# This must be an f-string so that BUCKET_NAME is substituted into the path.
GCS_BUCKET_PATH = f"gs://{BUCKET_NAME}/beir-fiqa"

!echo "Uploading files from {CONVERTED_DATASET_PATH} to {GCS_BUCKET_PATH}"
# Upload RAG format dataset to GCS bucket.
!gsutil -m rsync -r -d $CONVERTED_DATASET_PATH $GCS_BUCKET_PATH

### Import evaluation dataset files into `RagCorpus`.

In [None]:
num_subdirectories = count_directories_after_split(GCS_BUCKET_PATH)
paths = [GCS_BUCKET_PATH + f"/{i}/" for i in range(num_subdirectories)]

chunk_size = 512  # @param {type:"integer"}
chunk_overlap = 102  # @param {type:"integer"}

import_rag_files_from_gcs(
    paths=paths,
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    corpus_name=rag_corpus.name,
)

# 1.3 - Option C: Bring your own existing `RagCorpus` (insert `RAG_CORPUS_ID` here).

In [None]:
# Specify your rag corpus ID here that you want to use.
RAG_CORPUS_ID = ""  # @param {type: "string"}

rag_corpus = rag.get_corpus(
    name=f"projects/{PROJECT_ID}/locations/{LOCATION}/ragCorpora/{RAG_CORPUS_ID}"
)

print(rag_corpus)

# 2. Run Retrieval Quality Evaluation

For Retrieval Quality Evaluation, we focus on the following metrics:

- **Recall@k:**
  - Measures how many of the relevant documents/chunks are successfully retrieved within the top k results
  - Helps evaluate the retrieval component's ability to find ALL relevant information
- **Precision@k:**
  - Measures the proportion of retrieved documents that are actually relevant within the top k results
  - Helps evaluate how "focused" your retrieval is
- **nDCG@K:**
  - Measures both relevance AND ranking quality
  - Takes into account the position of relevant documents

Run the cells below to compute these metrics for your configuration, then use the results to tune your settings.
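
As a small worked example of the first two metrics, the sketch below computes document-level recall and precision at two values of k on invented data. The function name and the IDs are hypothetical. Recall counts each relevant document once, while precision counts every retrieved chunk, which mirrors how the helper functions in this notebook treat several chunks from the same document.

```python
# Hypothetical ranked retrieval output: one source document ID per retrieved chunk.
retrieved_doc_ids = ["d3", "d1", "d3", "d7", "d2"]
# Hypothetical ground truth in BEIR qrels form: doc ID -> relevance grade.
relevant = {"d1": 1, "d2": 1, "d9": 1}


def recall_precision_at_k(retrieved, relevant, k):
    """Return (recall, precision) over the first k retrieved chunks."""
    top = retrieved[:k]
    hits = [doc_id for doc_id in top if doc_id in relevant]
    recall = len(set(hits)) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(top) if top else 0.0
    return recall, precision


for k in (2, 5):
    recall, precision = recall_precision_at_k(retrieved_doc_ids, relevant, k)
    print(f"k={k}: recall={recall:.2f}, precision={precision:.2f}")
```

Here recall rises from 0.33 at k=2 to 0.67 at k=5, while precision falls from 0.50 to 0.40, which is the usual trade-off when top k increases.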

### Define evaluation helper function.

In [None]:
def extract_doc_id(file_path: str) -> str | None:
    """Extracts the document ID (filename without extension) from a file path.

    Handles various potential file name formats and extensions
    like .txt, .pdf, .html, etc.

    Args:
      file_path: The path to the file.

    Returns:
      The document ID (filename without extension), or None if `file_path`
      is not a string.
    """
    if not isinstance(file_path, str):
        return None
    # Get the filename (last path component)
    filename = file_path.split("/")[-1]
    # Remove the extension (if any), e.g. .txt, .pdf, .html
    return re.sub(r"\.\w+$", "", filename)


# RAG Engine helper function to extract doc_id, snippet, and score.


def extract_retrieval_details(
    context: Context,
) -> tuple[str | None, str, float]:
    """Extracts the document ID, snippet, and distance from a retrieved context.

    Args:
      context: A single retrieved context from the retrieval response.

    Returns:
      A tuple containing the document ID, retrieved snippet, and distance score.
    """
    doc_id = extract_doc_id(context.source_uri)
    retrieved_snippet = context.text
    distance = context.distance
    return (doc_id, retrieved_snippet, distance)


# RAG Engine helper function for retrieval.


def rag_api_retrieve(
    query: str, corpus_name: str, top_k: int
) -> MutableSequence[Context]:
    """Retrieves relevant contexts from a RAG corpus using the RAG API.

    Args:
      query: The query text.
      corpus_name: The name of the RAG corpus, in the format of "projects/{PROJECT_ID}/locations/{LOCATION}/ragCorpora/{CORPUS_ID}".
      top_k: The number of top results to retrieve.

    Returns:
      A list of retrieved contexts.
    """
    return rag.retrieval_query(
        rag_resources=[rag.RagResource(rag_corpus=corpus_name)],
        text=query,
        similarity_top_k=top_k,
        vector_distance_threshold=0.5,
    ).contexts.contexts


def calculate_document_level_recall_precision(
    retrieved_response: MutableSequence[Context], cur_qrel: dict[str, int]
) -> tuple[float, float]:
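    """Calculates the recall and precision for a list of retrieved contexts.

    Recall is computed over unique relevant documents, so several chunks from
    the same document count once. Precision is computed over retrieved chunks,
    so every chunk that comes from a relevant document counts as a hit.

    Args:
      retrieved_response: A list of retrieved contexts.
      cur_qrel: A dictionary of ground truth relevant documents for the current query.

    Returns:
      A tuple containing the recall and precision scores.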
    """
    if not retrieved_response:
        return (0, 0)

    relevant_retrieved_unique = set()
    num_relevant_retrieved_snippet = 0
    for res in retrieved_response:
        doc_id, text, score = extract_retrieval_details(res)
        if doc_id in cur_qrel:
            relevant_retrieved_unique.add(doc_id)
            num_relevant_retrieved_snippet += 1
    recall = (
        len(relevant_retrieved_unique) / len(cur_qrel.keys())
        if len(cur_qrel.keys()) > 0
        else 0
    )
    precision = (
        num_relevant_retrieved_snippet / len(retrieved_response)
        if len(retrieved_response) > 0
        else 0
    )
    return (recall, precision)


def calculate_document_level_metrics(
    queries: dict[str, str],
    qrels: dict[str, dict[str, int]],
    k_values: list[int],
    corpus_name: str,
) -> None:
    """Calculates and prints the average recall, precision, and NDCG for a set of queries at different top_k values.

    Args:
      queries: A dictionary of queries with query IDs as keys and query text as values.
      qrels: A dictionary of ground truth relevant documents for each query.
      k_values: A list of top_k values to evaluate.
      corpus_name: The name of the RAG corpus, in the format of "projects/{PROJECT_ID}/locations/{LOCATION}/ragCorpora/{CORPUS_ID}".

    Returns:
      None
    """

    for top_k in k_values:
        start_time = time.time()
        total_recall, total_precision, total_ndcg = 0, 0, 0
        print(f"Computing metrics for top_k value: {top_k}")
        print(f"Total number of queries: {len(queries)}")
        for query_id, query in tqdm(
            queries.items(),
            total=len(queries),
            desc=f"Processing Queries (top_k={top_k})",
        ):
            response = rag_api_retrieve(query, corpus_name, top_k)

            recall, precision = calculate_document_level_recall_precision(
                response, qrels[query_id]
            )
            ndcg = ndcg_at_k(response, qrels[query_id], top_k)

            total_recall += recall
            total_precision += precision
            total_ndcg += ndcg

        end_time = time.time()
        execution_time = end_time - start_time
        num_queries = len(queries)
        average_recall, average_precision, average_ndcg = (
            total_recall / num_queries,
            total_precision / num_queries,
            total_ndcg / num_queries,
        )
        print(f"\nAverage Recall@{top_k}: {average_recall:.4f}")
        print(f"Average Precision@{top_k}: {average_precision:.4f}")
        print(f"Average nDCG@{top_k}: {average_ndcg:.4f}")
        print(f"Execution time: {execution_time} seconds.")
        print("=============================================")


def dcg_at_k_with_zero_padding_if_needed(r: list[int], k: int) -> float:
    """Calculates the Discounted Cumulative Gain (DCG) at a given rank k.

    Args:
      r: A list of relevance scores.
      k: The rank at which to calculate DCG.

    Returns:
      The DCG at rank k.
    """
    r = np.asarray(r)[:k]
    if r.size:
        # Pad with zeros if r is shorter than k
        if r.size < k:
            r = np.pad(r, (0, k - r.size))
        return np.sum(np.subtract(np.power(2, r), 1) / np.log2(np.arange(2, k + 2)))
    return 0.0


def ndcg_at_k(
    retriever_results: MutableSequence[Context],
    ground_truth_relevances: dict[str, int],
    k: int,
) -> float:
    """Calculates the Normalized Discounted Cumulative Gain (NDCG) at a given rank k.

    Args:
      retriever_results: A list of retrieved results.
      ground_truth_relevances: A dictionary of ground truth relevance scores for each document.
      k: The rank at which to calculate NDCG.

    Returns:
      The NDCG at rank k.
    """
    if not retriever_results:
        return 0

    # Prepare retriever results
    retrieved_relevances = []
    for res in retriever_results[:k]:
        doc_id, text, score = extract_retrieval_details(res)
        if doc_id in ground_truth_relevances:
            retrieved_relevances.append(ground_truth_relevances[doc_id])
        else:
            retrieved_relevances.append(0)  # Assume irrelevant if not in ground truth

    # Calculate DCG
    dcg = dcg_at_k_with_zero_padding_if_needed(retrieved_relevances, k)
    # Calculate IDCG
    ideal_relevances = sorted(ground_truth_relevances.values(), reverse=True)
    idcg = dcg_at_k_with_zero_padding_if_needed(ideal_relevances, k)

    return dcg / idcg if idcg > 0 else 0.0

### Run Retrieval Quality Evaluation.

In [None]:
calculate_document_level_metrics(
    queries, qrels, [5, 10, 100], corpus_name=rag_corpus.name
)

Computing metrics for top_k value: 5
Total number of queries: 648


Processing Queries (top_k=5): 100%|██████████| 648/648 [44:47<00:00,  4.15s/it]



Average Recall@5: 0.5608
Average Precision@5: 0.2713
Average nDCG@5: 0.4450
Execution time: 2687.608230829239 seconds.
Computing metrics for top_k value: 10
Total number of queries: 648


Processing Queries (top_k=10): 100%|██████████| 648/648 [37:31<00:00,  3.48s/it]



Average Recall@10: 0.6571
Average Precision@10: 0.1679
Average nDCG@10: 0.4039
Execution time: 2251.886693954468 seconds.
Computing metrics for top_k value: 100
Total number of queries: 648


Processing Queries (top_k=100): 100%|██████████| 648/648 [38:48<00:00,  3.59s/it]


Average Recall@100: 0.8801
Average Precision@100: 0.0253
Average nDCG@100: 0.2592
Execution time: 2328.4095141887665 seconds.





# 3. Next steps
* Once the evaluation is done, carefully examine the metric numbers and tune the hyperparameters. Below are some suggestions on how to adjust the hyperparameters for the best retrieval quality.

### How to optimize Recall:
* If your recall is too low, consider the following steps:
  * **Reducing chunk size:** Sometimes important information might be buried within large chunks, making it more difficult to retrieve relevant context. Try reducing the chunk size.
  * **Increasing chunk overlap:** If the chunk overlap is too small, some relevant information at the chunk edges might be lost. Consider increasing the chunk overlap (an overlap of 20% of the chunk size is generally a good starting point).
  * **Increasing top-K:** If your top k is too small, the retriever might miss some relevant information because too few contexts are returned.
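
One way to organize such a sweep is to enumerate candidate configurations up front. The sketch below is only an illustration: the candidate values and the `configs` structure are assumptions, and the overlap is derived from the 20% starting point mentioned above.

```python
from itertools import product

# Candidate values are illustrative, not recommendations.
chunk_sizes = [256, 512, 1024]
top_k_values = [5, 10]
OVERLAP_RATIO = 0.2  # the "20% of chunk size" starting point

configs = [
    {
        "chunk_size": chunk_size,
        "chunk_overlap": int(chunk_size * OVERLAP_RATIO),
        "top_k": top_k,
    }
    for chunk_size, top_k in product(chunk_sizes, top_k_values)
]

for config in configs:
    print(config)
```

Because documents are split into chunks at ingestion time, each distinct pair of chunk size and chunk overlap needs its own import into a corpus, whereas top k can be varied at query time against the same corpus.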

### How to optimize Precision:
* If your precision number is low, consider:
  * **Reducing top-K:** Your top k might be too large, adding a lot of unwanted noise to the retrieved contexts.
  * **Reducing chunk overlap:** Too large a chunk overlap can result in duplicate information across retrieved chunks.
  * **Increasing chunk size:** If your chunk size is too small, it might lack sufficient context resulting in a low precision score.
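
To check whether repeated chunks from the same document are taking up top-k slots, you could count the duplicates among the retrieved document IDs. This is a minimal sketch with a hypothetical helper and invented IDs; it is not part of the evaluation code in this notebook.

```python
def dedupe_by_document(retrieved_doc_ids):
    """Return document IDs in first-seen order, dropping repeats."""
    return list(dict.fromkeys(retrieved_doc_ids))


# Hypothetical source document ID for each retrieved chunk, in ranked order.
retrieved_doc_ids = ["d3", "d3", "d1", "d3", "d7"]

unique_ids = dedupe_by_document(retrieved_doc_ids)
duplicate_slots = len(retrieved_doc_ids) - len(unique_ids)

print(unique_ids, duplicate_slots)
```

A high share of duplicate slots suggests that several chunks from the same document are competing for the same top-k positions.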

### How to optimize nDCG:
* If your nDCG number is low, consider:
  * **Changing your embedding model:** Your embedding model might not be capturing relevance well. Consider using a different embedding model (e.g. if your documents are multilingual, consider using a multilingual embedding model). For more information on the currently supported embedding models, see the documentation [here](https://cloud.google.com/vertex-ai/generative-ai/docs/rag-overview#supported-embedding-models).
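
nDCG depends on where the relevant results appear, not only on whether they appear. The sketch below uses the same gain and discount as the notebook's DCG helper, `(2^rel - 1) / log2(rank + 1)`, on two invented rankings that contain the same relevant items in different positions. The function names and relevance lists are hypothetical.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with ranks starting at 1."""
    return sum(
        (2**rel - 1) / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances, ideal_relevances):
    """Normalize DCG by the DCG of the best possible ordering at the same depth."""
    ideal = sorted(ideal_relevances, reverse=True)[: len(relevances)]
    ideal_dcg = dcg(ideal)
    return dcg(relevances) / ideal_dcg if ideal_dcg > 0 else 0.0


ground_truth = [1, 1]  # two relevant documents, both with grade 1
relevant_first = [1, 1, 0, 0]  # relevant results ranked at the top
relevant_last = [0, 0, 1, 1]  # same results, ranked at the bottom

print(f"relevant first: {ndcg(relevant_first, ground_truth):.4f}")
print(f"relevant last:  {ndcg(relevant_last, ground_truth):.4f}")
```

Both rankings have identical recall and precision at k=4, but the second one scores a lower nDCG because its relevant results are discounted more heavily.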

### Evaluate Response Quality
* If you want to evaluate response quality (generated answers) on top of retrieval quality, please refer to the [Gen AI Evaluation Service - RAG Evaluation Notebook](https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/evaluation/evaluate_rag_gen_ai_evaluation_service_sdk.ipynb)


# 4. Cleaning up (Delete `RagCorpus`)

Once we are done with evaluation, we should clean up the `RagCorpus` to free up resources since we don't need it anymore.

In [None]:
rag.delete_corpus(rag_corpus.name)