# Search tuning in Vertex AI Search

<table align="center">
  <td style="text-align: center" width="25%">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/search/tuning/vertexai-search-tuning.ipynb">
      <img width="32" src="https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center" width="25%">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fsearch%2Ftuning%2Fvertexai-search-tuning.ipynb">
      <img width="32" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center" width="25%">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/search/tuning/vertexai-search-tuning.ipynb">
      <img width="32" src="https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
  <td style="text-align: center" width="25%">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/search/tuning/vertexai-search-tuning.ipynb">
      <img width="32" src="https://upload.wikimedia.org/wikipedia/commons/9/91/Octicons-mark-github.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

<div style="clear: both;"></div>
 

| Author |
| --- |
| [Jincheol Kim](https://github.com/JincheolKim) |

When you build a search service over archived documents and data, search quality may not always meet expectations. The quality of Vertex AI Search can be measured in two ways: the accuracy and relevance of the search results, and the correctness of the summarized responses generated from those results, including their annotations and references to the source documents. The accuracy and relevance of the search results can be improved by generating embedding vectors that are semantically closer to the content, using document chunking and other document processing methods. The correctness of the summarized responses generated by the backend LLM (Gemini) behind the Vertex AI Search endpoint can be improved by tuning the backend LLM with additional domain-specific data. That tuning process is what Vertex AI Search tuning is for.
  
Before we tune the backend LLM behind Vertex AI Search, we need to prepare the raw text data in a specific JSONL format, together with a query-answer mapping file in tab-separated values (TSV) format. We will use FAQ documents from an open source project (Kubernetes) to tune the backend LLM so that it answers questions about Kubernetes better. After we learn how to prepare the tuning data in JSONL and TSV formats, we will learn how to configure a search tuning job and submit it to Vertex AI.
     
To learn more about the search tuning process, please refer to the following documents in the Google Cloud Documentation.
     
- [Improve search results with search tuning](https://cloud.google.com/generative-ai-app-builder/docs/tune-search)
- [Create a search data store](https://cloud.google.com/generative-ai-app-builder/docs/create-data-store-es)
- [Create a search app](https://cloud.google.com/generative-ai-app-builder/docs/create-engine-es)

## Overview

![Key user journey of the search tuning in Vertex AI Search](https://storage.googleapis.com/github-repo/generative-ai/search/tuning/images/key_user_journey_search_tuning.png)

* Prepare your data for tuning
    - The datasets should be prepared in JSONL format as identifier-text pairs.
    - The mapping between query and answer texts should be described in tab-separated values (TSV) format.
* Update the datastore with the additional documents
    - Before updating the datastore attached to the search app, upload the additional documents and data for tuning to the bucket in Cloud Storage.
    - After the new documents and data are uploaded to the bucket, refresh the datastore by creating it again with the same configuration used for the previous creation. During the refresh, the console shows that only the newly added files are used to generate new search indexes.
* Rebuild the search app with the updated datastore
    - After the datastore refresh is complete, the search app must be rebuilt so that it is connected to the updated datastore.

To obtain correct results with the additional documents and data, users must rebuild the search app after they rebuild the datastore.
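
To make the file formats above concrete, here is a minimal sketch that builds one corpus record, one query record, and the matching TSV row. The `_id` and `text` field names and the `query-id`, `corpus-id`, `score` header follow the helper functions later in this notebook; the sample question and answer are made up for illustration.

```python
import json

# Illustrative sample data (not taken from the Kubernetes FAQ files).
corpus_record = {"_id": "ans0001", "text": "A Pod is the smallest deployable unit."}
query_record = {"_id": "que0001", "text": "What is a Pod?"}

# JSONL: one JSON object per line.
corpus_line = json.dumps(corpus_record)
query_line = json.dumps(query_record)

# TSV mapping: a header row, then one row per (query, answer) pair with a score.
tsv_header = "\t".join(["query-id", "corpus-id", "score"])
tsv_row = "\t".join([query_record["_id"], corpus_record["_id"], "1"])

print(corpus_line)
print(query_line)
print(tsv_header)
print(tsv_row)
```

Building each line with `json.dumps` keeps quotes and backslashes in the text correctly escaped, which matters for FAQ answers that contain code snippets.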

## Get started

### Install Vertex AI SDK and other required packages

We will install some dependencies to run the cells in this notebook. 


In [None]:
%pip install --upgrade --user --quiet google-cloud-aiplatform google-cloud-discoveryengine langchain langchain-google-vertexai "langchain-google-community[vertexaisearch]" shortuuid

## Restart runtime

To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which restarts the current kernel.

The restart might take a minute or longer. After it has restarted, continue to the next step.

In [None]:
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

## Authenticate your notebook environment (Colab only)

If you're running this notebook on Google Colab, run the cell below to authenticate your environment.

In [None]:
import os
import sys

if "google.colab" in sys.modules:

    from google.colab import auth

    auth.authenticate_user()

## Set Google Cloud project information and initialize Vertex AI SDK for Python

To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com). Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

In [None]:
import json
import logging

# Imports common packages
import os
import platform
import re
import sys

from google.api_core.client_options import ClientOptions
from google.api_core.operation import Operation
from google.cloud import discoveryengine
import shortuuid
import vertexai

In [None]:
!gcloud auth login

In [None]:
# Use the environment variable if the user doesn't provide Project ID.
PROJECT_ID = "[your-project-id]"  # @param {type: "string", placeholder: "[your-project-id]", isTemplate: true}
if not PROJECT_ID or PROJECT_ID == "[your-project-id]":
    PROJECT_ID = str(os.environ.get("GOOGLE_CLOUD_PROJECT"))

LOCATION = "global"
STORAGE_LOCATION = "us"

vertexai.init(project=PROJECT_ID, location=LOCATION)

## Create a Cloud Storage bucket

Create a storage bucket to store intermediate artifacts such as datasets.

In [None]:
PUBLIC_DATA_SOURCE_URI = "gs://github-repo/generative-ai/search/tuning"
BASE_DATA_SOURCE_URI = "gs://github-repo/generative-ai/search/tuning/awesome_rlhf"
BUCKET_URI = f"gs://sample-search-tuning-{PROJECT_ID}"  # @param {type:"string"}
TUNING_DATA_PATH_SOURCE = "gs://github-repo/generative-ai/search/tuning/tuning_data"
TUNING_DATA_PATH_LOCAL = "./tuning_data"
TUNING_DATA_PATH_REMOTE = f"{BUCKET_URI}/tuning_data"
SEARCH_DATASTORE_PATH_REMOTE = f"{BUCKET_URI}/rlhf-datastore"
SEARCH_DATASTORE_ID = f"search-datastore-{PROJECT_ID}-{shortuuid.uuid().lower()}"
SEARCH_DATASTORE_NAME = "RLHF-ARTICLE-DATASTORE"

**If your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gcloud storage buckets create --location={STORAGE_LOCATION} --project={PROJECT_ID} --enable-hierarchical-namespace --uniform-bucket-level-access -b {BUCKET_URI}
! mkdir -p $TUNING_DATA_PATH_LOCAL
! gcloud storage cp $TUNING_DATA_PATH_SOURCE/* $TUNING_DATA_PATH_LOCAL

## Prepare the data

We will use the following datasets for this notebook. 

(1) FAQ data from the open source projects Kubernetes and Kubernetes Client. This is a short list of questions and answers, which makes it possible to test this notebook end to end in a short period of time.

(2) BEIR ([Benchmarking IR datasets](https://github.com/beir-cellar/beir)): BEIR is a heterogeneous benchmark containing diverse IR tasks. It also provides a common and easy framework for evaluating NLP-based retrieval models within the benchmark. This public dataset is hosted in Google BigQuery and is included in BigQuery's free tier of 1 TB of processing per month, so each user can run queries on this public dataset within that monthly allowance.
    - For an overview, check out the wiki page: https://github.com/beir-cellar/beir/wiki.
    - For models and datasets, check out the Hugging Face (HF) page: https://huggingface.co/BeIR.

(3) SciFact ([SciFact](https://huggingface.co/datasets/allenai/scifact)): a dataset of 1.4K expert-written scientific claims paired with evidence-containing abstracts and annotated with labels and rationales.

The BEIR and SciFact datasets are already prepared in JSONL and TSV formats, so you can use them to test the search tuning feature without any data preprocessing. However, both datasets are large, which makes the tuning job run for a long time. For that reason, we first generate a small dataset from the Kubernetes FAQ data to check that the search tuning feature works correctly.

The ```generate_source_dataset``` function reads the raw FAQ data from the FAQ and README documents of the Kubernetes project and generates ```corpus_file.jsonl``` and ```query_file.jsonl``` for the tuning job.

In [None]:
# Helper function to generate JSONL-format datasets for search tuning.
#
# Args:
#      source_file (string): Path to the source file from which the JSONL
#                            files will be generated.
#      corpus_filepath (string): Path to which the JSONL-formatted corpus
#                                dataset file will be written.
#      query_filepath (string):  Path to which the JSONL-formatted query
#                                dataset file will be written.
#      cleanup_at_start (bool, default=True): When True, removes the corpus
#                                and query files generated by a previous run
#                                before generating new datasets.
#
# Returns:
#         None. Writes two JSONL files (corpus file and query file).


def generate_source_dataset(
    source_file: str,
    corpus_filepath: str,
    query_filepath: str,
    cleanup_at_start: bool = True,
):
    questions = []
    answers = []

    # If cleanup_at_start is True, this section deletes the corpus files and
    # the query files before generating new corpus and query files in JSONL
    if cleanup_at_start:
        if os.path.isfile(corpus_filepath):
            print(f"Removing previous file: {corpus_filepath}")
            os.remove(corpus_filepath)
        if os.path.isfile(query_filepath):
            print(f"Removing previous file: {query_filepath}")
            os.remove(query_filepath)

    # This section generates a corpus file dataset in JSONL format
    # from the source files.
    logging.info(f"{generate_source_dataset.__name__}: {1}")
    with open(source_file) as f:
        answer = ""
        answer_flag = False
        for line_str in f:
            match = re.match(r"^#{3}\s+(.+)$", line_str)
            if match:
                # A new "### question" heading closes the previous answer,
                # so each question stays aligned with its own answer.
                if answer_flag:
                    answers.append(answer.strip())
                questions.append(match.group(1))
                answer_flag = True
                answer = ""
            elif answer_flag:
                answer += line_str
        # Flush the answer that belongs to the last question in the file.
        if answer_flag:
            answers.append(answer.strip())

    logging.info(f"{generate_source_dataset.__name__}: {2}")
    # Continue numbering after any entries already in the corpus file.
    corpus_idx_start = 0
    try:
        with open(corpus_filepath) as cf:
            corpus_idx_start = sum(1 for _ in cf)
    except FileNotFoundError:
        corpus_idx_start = 0

    with open(corpus_filepath, "a") as cf:
        idx = corpus_idx_start
        print(f"start idx: {idx}")
        for answer in answers:
            idx += 1
            # json.dumps escapes quotes, backslashes, and control characters,
            # so every line is valid JSON.
            record = {
                "_id": f"ans{idx:04d}",
                "text": answer.strip().replace("\n", " "),
            }
            cf.write(json.dumps(record) + "\n")

    # This section generates a query file dataset in JSONL format
    # from the source files.
    logging.info(f"{generate_source_dataset.__name__}: {3}")
    # Continue numbering after any entries already in the query file.
    query_idx_start = 0
    try:
        with open(query_filepath) as qf:
            query_idx_start = sum(1 for _ in qf)
    except FileNotFoundError:
        query_idx_start = 0

    with open(query_filepath, "a") as qf:
        idx = query_idx_start
        print(f"start idx: {idx}")
        for question in questions:
            idx += 1
            # json.dumps escapes quotes, backslashes, and control characters,
            # so every line is valid JSON.
            record = {
                "_id": f"que{idx:04d}",
                "text": question.strip().replace("\n", " "),
            }
            qf.write(json.dumps(record) + "\n")
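
The heading-based extraction can be illustrated on an in-memory string. This is a simplified sketch rather than the notebook's function: `extract_qa_pairs` is a hypothetical helper, and the sample FAQ text is made up. It treats every `###` heading as a question and the non-empty lines up to the next `###` heading as its answer.

```python
import re


def extract_qa_pairs(markdown_text: str) -> list[tuple[str, str]]:
    """Pairs each '### question' heading with the text that follows it."""
    pairs = []
    question = None
    answer_lines = []
    for line in markdown_text.splitlines():
        match = re.match(r"^#{3}\s+(.+)$", line)
        if match:
            # A new heading closes the previous question, if there was one.
            if question is not None:
                pairs.append((question, " ".join(answer_lines)))
            question = match.group(1).strip()
            answer_lines = []
        elif question is not None and line.strip():
            answer_lines.append(line.strip())
    # Flush the last question, which has no heading after it.
    if question is not None:
        pairs.append((question, " ".join(answer_lines)))
    return pairs


sample = """# FAQ

### What is a Pod?
A Pod is a group of containers.

### What is a Node?
A Node is a worker machine.
"""

for question, answer in extract_qa_pairs(sample):
    print(f"Q: {question}\nA: {answer}")
```

Flushing the pending answer both when a new heading appears and at the end of the input is what keeps the n-th question aligned with the n-th answer.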

The ```generate_training_test_dataset``` function generates the query-answer mapping in tab-separated values (TSV) format, which tells the tuning job which answer text from the FAQ corresponds to each query.

In [None]:
# Helper function to generate TSV(tab separated values)-format datasets
# to map the corpus file and the query file for search tuning.
#
# Args:
#    corpus_filepath (string): Path to the JSONL-formatted corpus dataset
#                              file to read.
#    query_filepath (string):  Path to the JSONL-formatted query dataset
#                              file to read.
#    training_filepath (string): Path to the TSV mapping file between the
#                              entries in the corpus file and the query file
#                              that defines the training dataset.
#    test_filepath (string): Path to the TSV mapping file between the
#                              entries in the corpus file and the query file
#                              that defines the test dataset.
#    cleanup_at_start (bool, default=True): When True, removes the training
#                              and test dataset files generated by a previous
#                              run before generating new datasets.
#
# Returns:
#    None. Writes two TSV files (training dataset file and test dataset file).


def generate_training_test_dataset(
    corpus_filepath: str,
    query_filepath: str,
    training_filepath: str,
    test_filepath: str,
    cleanup_at_start: bool = True,
):
    questions = []
    answers = []

    # If cleanup_at_start is True, delete the previous training and test
    # dataset files before generating new TSV dataset files.
    if cleanup_at_start:
        for path in (training_filepath, test_filepath):
            if os.path.isfile(path):
                print(f"Removing previous file: {path}")
                os.remove(path)

    # Read the corpus (answer) entries.
    with open(corpus_filepath) as corpus_file:
        for line_str in corpus_file:
            answers.append(json.loads(line_str, strict=False)["text"])

    logging.info(f"{generate_training_test_dataset.__name__}: {1}")

    # Read the query (question) entries.
    with open(query_filepath) as query_file:
        for line_str in query_file:
            questions.append(json.loads(line_str, strict=False)["text"])

    logging.info(f"{generate_training_test_dataset.__name__}: {2}")

    # The n-th query maps to the n-th corpus entry. The first 85% of the
    # pairs form the training set and the remaining pairs form the test set.
    num_pairs = len(answers)
    split_idx = int(0.85 * num_pairs)
    header = "query-id\tcorpus-id\tscore\n"

    # Write the training mapping.
    with open(training_filepath, "a") as trf:
        trf.write(header)
        for idx in range(1, split_idx + 1):
            trf.write(f"que{idx:04d}\tans{idx:04d}\t1\n")

    logging.info(f"{generate_training_test_dataset.__name__}: {3}")

    # Write the test mapping.
    with open(test_filepath, "a") as tef:
        tef.write(header)
        for idx in range(split_idx + 1, num_pairs + 1):
            tef.write(f"que{idx:04d}\tans{idx:04d}\t1\n")

    logging.info(f"{generate_training_test_dataset.__name__}: {4}")
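
The 85/15 split used for the mapping files can be sketched in isolation. `split_mapping_rows` is a hypothetical helper written for this illustration; it only assumes the `que####`/`ans####` identifier scheme and the score of `1` used in this notebook.

```python
def split_mapping_rows(
    num_pairs: int, train_fraction: float = 0.85
) -> tuple[list[str], list[str]]:
    """Builds the TSV rows for num_pairs pairs and splits them into train/test."""
    split_idx = int(train_fraction * num_pairs)
    rows = [f"que{i:04d}\tans{i:04d}\t1" for i in range(1, num_pairs + 1)]
    return rows[:split_idx], rows[split_idx:]


train_rows, test_rows = split_mapping_rows(20)
print(f"training rows: {len(train_rows)}, test rows: {len(test_rows)}")
print(train_rows[0])
print(test_rows[0])
```

Because both lists are slices of the same row list, every pair lands in exactly one of the two files.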

In [None]:
# Generates the JSONL and TSV tuning datasets from the Kubernetes FAQ and
# README documents.


if __name__ == "__main__":
    datasets = [
        "./tuning_data/FAQ.md",
        "./tuning_data/FAQ-Kubernetes-Client.md",
        "./tuning_data/README.md",
    ]
    if os.path.isfile("./tuning_data/corpus_file.jsonl"):
        os.remove("./tuning_data/corpus_file.jsonl")
    if os.path.isfile("./tuning_data/query_file.jsonl"):
        os.remove("./tuning_data/query_file.jsonl")

    for file in datasets:
        print(file)
        generate_source_dataset(
            file,
            "./tuning_data/corpus_file.jsonl",
            "./tuning_data/query_file.jsonl",
            cleanup_at_start=False,
        )
    generate_training_test_dataset(
        "./tuning_data/corpus_file.jsonl",
        "./tuning_data/query_file.jsonl",
        "./tuning_data/training_data.tsv",
        "./tuning_data/test_data.tsv",
    )

We create PDF files from the FAQ documents so that they can be imported into the Vertex AI Search datastore.

In [None]:
# Check if the system is macOS
if platform.system() == "Darwin":
    # Install pandoc and a TeX distribution using Homebrew.
    # xelatex is used for PDF document creation on macOS; after installing
    # BasicTeX you may need to restart the shell so that xelatex is on PATH.
    !brew install pandoc
    !brew install --cask basictex
    !pandoc --pdf-engine=xelatex ./tuning_data/FAQ-Kubernetes-Client.md -o ./tuning_data/FAQ-Kubernetes-Client.pdf
    !pandoc --pdf-engine=xelatex ./tuning_data/FAQ.md -o ./tuning_data/FAQ.pdf
    !pandoc --pdf-engine=xelatex ./tuning_data/README.md -o ./tuning_data/README.pdf
elif platform.system() == "Linux":
    # Install pandoc and pdflatex using apt-get for Ubuntu Linux.
    # pdflatex is used for PDF document creation on Linux.
    !sudo apt-get install -y pandoc texlive-latex-base texlive-latex-recommended texlive-fonts-recommended
    !pandoc --pdf-engine=pdflatex ./tuning_data/FAQ-Kubernetes-Client.md -o ./tuning_data/FAQ-Kubernetes-Client.pdf
    !pandoc --pdf-engine=pdflatex ./tuning_data/FAQ.md -o ./tuning_data/FAQ.pdf
    !pandoc --pdf-engine=pdflatex ./tuning_data/README.md -o ./tuning_data/README.pdf
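
The repeated `pandoc` invocations can also be generated from a list of files. The sketch below only builds and prints the commands; `build_pandoc_command` is a hypothetical helper, and actually running the commands (for example with `subprocess.run(command, check=True)`) assumes that `pandoc` and the chosen PDF engine are installed.

```python
import os
import platform


def build_pandoc_command(markdown_path: str, pdf_engine: str) -> list[str]:
    """Returns the pandoc argument list that converts a Markdown file to PDF."""
    pdf_path = os.path.splitext(markdown_path)[0] + ".pdf"
    return ["pandoc", f"--pdf-engine={pdf_engine}", markdown_path, "-o", pdf_path]


# Same engine choice as the cell above: xelatex on macOS, pdflatex otherwise.
engine = "xelatex" if platform.system() == "Darwin" else "pdflatex"

for name in ["FAQ-Kubernetes-Client.md", "FAQ.md", "README.md"]:
    command = build_pandoc_command(f"./tuning_data/{name}", engine)
    print(" ".join(command))
```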

After generating the tuning datasets, we upload them to the Cloud Storage bucket that serves as the data store for search tuning.

In [None]:
# Uploading the preprocessed data with the PDF files for reindexing to the search app data store
!echo "Preprocessed tuning data: {TUNING_DATA_PATH_LOCAL}"
!echo "Destination path: {TUNING_DATA_PATH_REMOTE}"
!gcloud storage folders create "{TUNING_DATA_PATH_REMOTE}"
!gcloud storage cp $TUNING_DATA_PATH_LOCAL/* $TUNING_DATA_PATH_REMOTE

## Uploading data for a search app datastore (papers on RLHF)

To create a Vertex AI Search app, we upload PDF files on Reinforcement Learning from Human Feedback (RLHF) from the [Awesome RLHF](https://github.com/opendilab/awesome-RLHF.git) GitHub repository to a bucket in Cloud Storage, which will be used as the search datastore. The PDF files are available at [Awesome RLHF - PDF Files](https://gitlab.com/jincheolkim/awesome-rlhf).

In [None]:
!echo {SEARCH_DATASTORE_PATH_REMOTE}
!gcloud storage folders create "{SEARCH_DATASTORE_PATH_REMOTE}"
!gcloud storage cp --recursive $BASE_DATA_SOURCE_URI/* $SEARCH_DATASTORE_PATH_REMOTE

## Helper functions to facilitate the following steps

The following helper functions let you carry out the search tuning steps without being distracted by other details.

* ```create_data_store```: creates a datastore for a search app, identified by the given ```data_store_id``` and ```data_store_name```
* ```import_documents```: imports documents from Cloud Storage and generates the search indexes
* ```create_search_engine```: creates a search app
* ```search```: runs a search with the query passed as an argument
* ```train_custom_model```: tunes the backend LLM for the search app
* ```delete_engine```: deletes the search app
* ```purge_documents```: deletes the index and the documents indexed for the search app
* ```delete_data_store```: deletes the data store
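
These helpers address resources by their full resource names. As a quick orientation, the sketch below builds the three name formats that appear in the comments and code of the following cells. The function names and the sample IDs are hypothetical; in the notebook the first two names are produced by the client's `collection_path` and `branch_path` methods.

```python
def collection_name(project_id: str, location: str) -> str:
    """Parent resource used when creating data stores and engines."""
    return f"projects/{project_id}/locations/{location}/collections/default_collection"


def branch_name(
    project_id: str, location: str, data_store_id: str, branch: str = "default_branch"
) -> str:
    """Parent resource used when importing documents into a data store."""
    return (
        f"projects/{project_id}/locations/{location}"
        f"/dataStores/{data_store_id}/branches/{branch}"
    )


def serving_config_name(project_id: str, location: str, engine_id: str) -> str:
    """Serving config that search requests are sent to."""
    return (
        f"{collection_name(project_id, location)}"
        f"/engines/{engine_id}/servingConfigs/default_config"
    )


# Hypothetical IDs, for illustration only.
print(collection_name("my-project", "global"))
print(branch_name("my-project", "global", "my-datastore"))
print(serving_config_name("my-project", "global", "my-engine"))
```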


In [None]:
# For more information, refer to:
# https://cloud.google.com/generative-ai-app-builder/docs/locations#specify_a_multi-region_for_your_data_store
search_client_options = (
    ClientOptions(api_endpoint=f"{LOCATION}-discoveryengine.googleapis.com")
    if LOCATION != "global"
    else None
)
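
The endpoint selection above follows a simple rule, sketched here as a plain function so that it can be checked without any Google Cloud client. `discoveryengine_endpoint` is a hypothetical helper; the `{location}-discoveryengine.googleapis.com` pattern is the one used in the cell above.

```python
from typing import Optional


def discoveryengine_endpoint(location: str) -> Optional[str]:
    """Returns the location-specific endpoint, or None for the default endpoint."""
    if location == "global":
        return None
    return f"{location}-discoveryengine.googleapis.com"


for loc in ["global", "us", "eu"]:
    print(loc, "->", discoveryengine_endpoint(loc))
```

Returning `None` for `global` mirrors the cell above, where no `ClientOptions` are passed and the client falls back to its default endpoint.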

In [None]:
def create_data_store(
    project_id: str,
    location: str,
    data_store_id: str,
    data_store_name: str,
    client_options: ClientOptions = search_client_options,
) -> str:
    client = discoveryengine.DataStoreServiceClient(client_options=client_options)

    # The full resource name of the collection
    # e.g. projects/{project}/locations/{location}/collections/default_collection
    parent = client.collection_path(
        project=project_id,
        location=location,
        collection="default_collection",
    )

    data_store = discoveryengine.DataStore(
        display_name=data_store_name,
        # Options: GENERIC, MEDIA, HEALTHCARE_FHIR
        industry_vertical=discoveryengine.IndustryVertical.GENERIC,
        # Options: SOLUTION_TYPE_RECOMMENDATION, SOLUTION_TYPE_SEARCH, SOLUTION_TYPE_CHAT, SOLUTION_TYPE_GENERATIVE_CHAT
        solution_types=[discoveryengine.SolutionType.SOLUTION_TYPE_SEARCH],
        # Options: NO_CONTENT, CONTENT_REQUIRED, PUBLIC_WEBSITE
        content_config=discoveryengine.DataStore.ContentConfig.CONTENT_REQUIRED,
    )

    request = discoveryengine.CreateDataStoreRequest(
        parent=parent,
        data_store_id=data_store_id,
        data_store=data_store,
        # Optional: For Advanced Site Search Only
        # create_advanced_site_search=True,
    )

    # Make the request
    operation = client.create_data_store(request=request)

    print(f"Waiting for operation to complete: {operation.operation.name}")
    response = operation.result()

    # After the operation is complete,
    # get information from operation metadata
    metadata = discoveryengine.CreateDataStoreMetadata(operation.metadata)

    # Handle the response
    print(response)
    print(metadata)

    return operation.operation.name

In [None]:
def import_documents(
    project_id: str,
    location: str,
    data_store_id: str,
    client_options: ClientOptions = search_client_options,
) -> discoveryengine.ImportDocumentsMetadata:
    client = discoveryengine.DocumentServiceClient(client_options=client_options)

    # The full resource name of the search engine branch.
    # e.g. projects/{project}/locations/{location}/dataStores/{data_store_id}/branches/{branch}
    parent = client.branch_path(
        project=project_id,
        location=location,
        data_store=data_store_id,
        branch="default_branch",
    )

    # With the new datastore, we import the documents and build an index over them.
    # `ImportDocumentsRequest` describes the import (source URIs, data schema,
    # and reconciliation mode), and the `import_documents` method of
    # `DocumentServiceClient` imports the documents and indexes the document set
    # based on that request.
    document_import_request = discoveryengine.ImportDocumentsRequest(
        parent=parent,
        gcs_source=discoveryengine.GcsSource(
            # Multiple URIs are supported
            input_uris=[f"{SEARCH_DATASTORE_PATH_REMOTE}/*"],
            # Options:
            # - `content` - Unstructured documents (PDF, HTML, DOC, TXT, PPTX)
            # - `custom` - Unstructured documents with custom JSONL metadata
            # - `document` - Structured documents in the discoveryengine.Document format.
            # - `csv` - Unstructured documents with CSV metadata
            data_schema="content",
        ),
        # Options: `FULL`, `INCREMENTAL`
        reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.INCREMENTAL,
    )

    # Make the request
    operation = client.import_documents(request=document_import_request)

    print(f"Waiting for operation to complete: {operation.operation.name}")

    # After the operation is complete,
    # get information from operation metadata
    response = operation.result()
    metadata = discoveryengine.ImportDocumentsMetadata(operation.metadata)

    # Handle the response
    print(response)
    print(metadata)

    return metadata

In [None]:
def create_search_engine(
    project_id: str,
    location: str,
    engine_name: str,
    engine_id: str,
    data_store_ids: list[str],
    client_options: ClientOptions = search_client_options,
) -> str:
    client = discoveryengine.EngineServiceClient(client_options=client_options)

    # The full resource name of the collection
    # e.g. projects/{project}/locations/{location}/collections/default_collection
    parent = client.collection_path(
        project=project_id,
        location=location,
        collection="default_collection",
    )

    engine = discoveryengine.Engine(
        display_name=engine_name,
        # Options: GENERIC, MEDIA, HEALTHCARE_FHIR
        industry_vertical=discoveryengine.IndustryVertical.GENERIC,
        # Options: SOLUTION_TYPE_RECOMMENDATION, SOLUTION_TYPE_SEARCH, SOLUTION_TYPE_CHAT, SOLUTION_TYPE_GENERATIVE_CHAT
        solution_type=discoveryengine.SolutionType.SOLUTION_TYPE_SEARCH,
        # For search apps only
        search_engine_config=discoveryengine.Engine.SearchEngineConfig(
            # Options: SEARCH_TIER_STANDARD, SEARCH_TIER_ENTERPRISE
            search_tier=discoveryengine.SearchTier.SEARCH_TIER_ENTERPRISE,
            # Options: SEARCH_ADD_ON_LLM, SEARCH_ADD_ON_UNSPECIFIED
            search_add_ons=[discoveryengine.SearchAddOn.SEARCH_ADD_ON_LLM],
        ),
        # For generic recommendation apps only
        # similar_documents_config=discoveryengine.Engine.SimilarDocumentsEngineConfig,
        data_store_ids=data_store_ids,
    )

    request = discoveryengine.CreateEngineRequest(
        parent=parent,
        engine=engine,
        engine_id=engine_id,
    )

    # Make the request
    operation = client.create_engine(request=request)

    print(f"Waiting for operation to complete: {operation.operation.name}")
    response = operation.result()

    # After the operation is complete,
    # get information from operation metadata
    metadata = discoveryengine.CreateEngineMetadata(operation.metadata)

    # Handle the response
    print(response)
    print(metadata)

    return operation.operation.name

In [None]:
def search(
    project_id: str,
    location: str,
    engine_id: str,
    search_query: str,
    client_options: ClientOptions = search_client_options,
) -> list[discoveryengine.SearchResponse]:
    client = discoveryengine.SearchServiceClient(client_options=client_options)

    # The full resource name of the search app serving config
    serving_config = f"projects/{project_id}/locations/{location}/collections/default_collection/engines/{engine_id}/servingConfigs/default_config"

    # Optional - only supported for unstructured data: Configuration options for search.
    # Refer to the `ContentSearchSpec` reference for all supported fields:
    # https://cloud.google.com/python/docs/reference/discoveryengine/latest/google.cloud.discoveryengine_v1.types.SearchRequest.ContentSearchSpec
    content_search_spec = discoveryengine.SearchRequest.ContentSearchSpec(
        # For information about snippets, refer to:
        # https://cloud.google.com/generative-ai-app-builder/docs/snippets
        snippet_spec=discoveryengine.SearchRequest.ContentSearchSpec.SnippetSpec(
            return_snippet=True
        ),
        # For information about search summaries, refer to:
        # https://cloud.google.com/generative-ai-app-builder/docs/get-search-summaries
        summary_spec=discoveryengine.SearchRequest.ContentSearchSpec.SummarySpec(
            summary_result_count=5,
            include_citations=True,
            ignore_adversarial_query=True,
            ignore_non_summary_seeking_query=True,
            model_prompt_spec=discoveryengine.SearchRequest.ContentSearchSpec.SummarySpec.ModelPromptSpec(
                preamble="YOUR_CUSTOM_PROMPT"
            ),
            model_spec=discoveryengine.SearchRequest.ContentSearchSpec.SummarySpec.ModelSpec(
                version="stable",
            ),
        ),
    )

    # Refer to the `SearchRequest` reference for all supported fields:
    # https://cloud.google.com/python/docs/reference/discoveryengine/latest/google.cloud.discoveryengine_v1.types.SearchRequest
    request = discoveryengine.SearchRequest(
        serving_config=serving_config,
        query=search_query,
        page_size=10,
        content_search_spec=content_search_spec,
        query_expansion_spec=discoveryengine.SearchRequest.QueryExpansionSpec(
            condition=discoveryengine.SearchRequest.QueryExpansionSpec.Condition.AUTO,
        ),
        spell_correction_spec=discoveryengine.SearchRequest.SpellCorrectionSpec(
            mode=discoveryengine.SearchRequest.SpellCorrectionSpec.Mode.AUTO
        ),
    )

    response = client.search(request)
    print(response)

    return response

In [None]:
def train_custom_model(
    project_id: str,
    location: str,
    data_store_id: str,
    corpus_data_path: str,
    query_data_path: str,
    train_data_path: str,
    test_data_path: str,
    client_options: ClientOptions = search_client_options,
) -> Operation:
    client = discoveryengine.SearchTuningServiceClient(client_options=client_options)

    # The full resource name of the data store
    data_store = f"projects/{project_id}/locations/{location}/collections/default_collection/dataStores/{data_store_id}"

    # Make the request
    operation = client.train_custom_model(
        request=discoveryengine.TrainCustomModelRequest(
            gcs_training_input=discoveryengine.TrainCustomModelRequest.GcsTrainingInput(
                corpus_data_path=corpus_data_path,
                query_data_path=query_data_path,
                train_data_path=train_data_path,
                test_data_path=test_data_path,
            ),
            data_store=data_store,
            model_type="search-tuning",
        )
    )

    # Optional: Wait for training to complete
    print(f"Waiting for operation to complete: {operation.operation.name}")
    response = operation.result()

    # After the operation is complete,
    # get information from operation metadata
    metadata = discoveryengine.TrainCustomModelMetadata(operation.metadata)

    # Handle the response
    print(response)
    print(metadata)
    print(operation)

    return operation

In [None]:
def delete_engine(
    project_id: str,
    location: str,
    engine_id: str,
    client_options: ClientOptions = search_client_options,
) -> str:
    client = discoveryengine.EngineServiceClient(client_options=client_options)

    # The full resource name of the engine
    # e.g. projects/{project}/locations/{location}/collections/default_collection/engines/{engine_id}
    name = client.engine_path(
        project=project_id,
        location=location,
        collection="default_collection",
        engine=engine_id,
    )

    # Make the request
    operation = client.delete_engine(name=name)

    print(f"Operation: {operation.operation.name}")

    return operation.operation.name

In [None]:
def purge_documents(
    project_id: str,
    location: str,
    data_store_id: str,
    client_options: ClientOptions = search_client_options,
) -> discoveryengine.PurgeDocumentsMetadata:
    client = discoveryengine.DocumentServiceClient(client_options=client_options)

    operation = client.purge_documents(
        request=discoveryengine.PurgeDocumentsRequest(
            # The full resource name of the search engine branch.
            # e.g. projects/{project}/locations/{location}/dataStores/{data_store_id}/branches/{branch}
            parent=client.branch_path(
                project=project_id,
                location=location,
                data_store=data_store_id,
                branch="default_branch",
            ),
            filter="*",
            # If force is set to `False`, return the expected purge count without deleting any documents.
            force=True,
        )
    )

    print(f"Waiting for operation to complete: {operation.operation.name}")
    response = operation.result()

    # After the operation is complete,
    # get information from operation metadata
    metadata = discoveryengine.PurgeDocumentsMetadata(operation.metadata)

    # Handle the response
    print(response)
    print(metadata)

    return metadata

In [None]:
def delete_data_store(
    project_id: str,
    location: str,
    data_store_id: str,
    client_options: ClientOptions = search_client_options,
) -> str:
    client = discoveryengine.DataStoreServiceClient(client_options=client_options)

    request = discoveryengine.DeleteDataStoreRequest(
        # The full resource name of the data store
        name=client.data_store_path(project_id, location, data_store_id)
    )

    # Make the request
    operation = client.delete_data_store(request=request)

    print(f"Operation: {operation.operation.name}")

    return operation.operation.name

## Creating a data store for a search app from a Cloud Storage bucket of PDF documents

We create a datastore backed by the Cloud Storage bucket that holds the PDF files on RLHF, then import those files so that they are indexed for search.

In [None]:
!echo "Datastore ID: {SEARCH_DATASTORE_ID}"
!echo "Datastore Name: {SEARCH_DATASTORE_NAME}"
create_datastore_op_name = create_data_store(
    PROJECT_ID, LOCATION, SEARCH_DATASTORE_ID, SEARCH_DATASTORE_NAME
)

In [None]:
metadata = import_documents(PROJECT_ID, LOCATION, SEARCH_DATASTORE_ID)

## Creating a search app using the Vertex AI Search SDK

Now that the datastore exists and its documents are indexed, we create a search app on top of it.


In [None]:
SEARCH_DATASTORE_REF_ID = f"projects/{PROJECT_ID}/locations/{LOCATION}/collections/default_collection/dataStores/{SEARCH_DATASTORE_ID}"
SEARCH_APP_ID = f"search-app-{PROJECT_ID}-{shortuuid.uuid().lower()}"
SEARCH_APP_NAME = "RLHF_SEARCH_APP"
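
Because `SEARCH_APP_ID` joins a prefix, the project ID, and a generated suffix, it can become long. The sketch below checks a candidate ID before it is sent to the API. The 63-character limit and the allowed character set are assumptions based on my reading of the `engine_id` field description, and `check_resource_id` is a hypothetical helper, not part of the SDK. Confirm the actual constraints in the `CreateEngineRequest` reference.

```python
import re

# Assumed limits; check the `CreateEngineRequest.engine_id` reference.
MAX_ID_LENGTH = 63
ID_PATTERN = re.compile(r"[a-z0-9][a-z0-9-]*")


def check_resource_id(resource_id, max_length=MAX_ID_LENGTH):
    """Return a list of problems found with a candidate ID (empty if none)."""
    problems = []
    if len(resource_id) > max_length:
        problems.append(f"too long: {len(resource_id)} > {max_length}")
    if not ID_PATTERN.fullmatch(resource_id):
        problems.append("contains characters outside a-z, 0-9, and '-'")
    return problems


# Made-up IDs for illustration only.
print(check_resource_id("search-app-my-project-abc123"))
print(check_resource_id("Search_App"))
```

If the check reports a length problem, shortening the generated suffix is one option.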

In [None]:
create_search_app_op_name = create_search_engine(
    PROJECT_ID, LOCATION, SEARCH_APP_NAME, SEARCH_APP_ID, [SEARCH_DATASTORE_ID]
)

### Test the search app with a test prompt

We test the search app we just created with a question about a world model for autonomous driving, which is described in one of the papers in the datastore.

In [None]:
QUERY_PROMPT = """
    What is the name of the world model for autonomous driving developed recently?
"""

The search app returns a list of relevant documents with references to the related documents in the datastore. Keep the result in mind so that you can compare it with the results after search tuning.

In [None]:
search_response = search(PROJECT_ID, LOCATION, SEARCH_APP_ID, QUERY_PROMPT)
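
Rather than relying on memory, the ranked document IDs can be saved now and compared with a later run of the same query. The following is a minimal sketch: `rank_changes` is a hypothetical helper, the IDs are made up, and the commented line assumes each search result exposes `document.id`.

```python
def rank_changes(before_ids, after_ids):
    """Map each document ID to (rank_before, rank_after); None means absent."""
    before = {doc_id: rank for rank, doc_id in enumerate(before_ids, start=1)}
    after = {doc_id: rank for rank, doc_id in enumerate(after_ids, start=1)}
    # Keep IDs from the newer list first, then the ones that dropped out.
    ordered = list(after_ids) + [d for d in before_ids if d not in after]
    return {doc_id: (before.get(doc_id), after.get(doc_id)) for doc_id in ordered}


# With the real response, the IDs could be collected as:
#   baseline_ids = [result.document.id for result in search_response.results]
baseline_ids = ["doc-a", "doc-b", "doc-c"]  # made-up IDs
tuned_ids = ["doc-b", "doc-a", "doc-d"]  # made-up IDs

for doc_id, (old, new) in rank_changes(baseline_ids, tuned_ids).items():
    print(f"{doc_id}: {old} -> {new}")
```

A `None` on either side marks a document that appears in only one of the two result lists.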

## Configuring and submitting a search tuning job

With the search app ready, we tune search using a sample tuning dataset about Kubernetes.

First, we upload FAQ documents about Kubernetes and the Kubernetes Client API. The original documents were in Markdown, but we converted them to PDF because Vertex AI Search does not accept Markdown files. It accepts HTML, PDF (including PDF with embedded text), TXT, JSON, XHTML, and XML; PPTX, DOCX, and XLSX are available in Preview. The PDF files are uploaded to the bucket behind the search app's datastore.
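
As a rough pre-flight check, the format list above can be turned into a small filter that flags local files before they are copied to the bucket. This is only a sketch: the mapping from formats to file extensions and the helper name `split_by_support` are assumptions made for illustration, not part of the SDK.

```python
from pathlib import Path

# Extensions derived from the formats listed above (assumed mapping).
SUPPORTED = {".html", ".pdf", ".txt", ".json", ".xhtml", ".xml"}
PREVIEW = {".pptx", ".docx", ".xlsx"}


def split_by_support(paths):
    """Group file paths into supported, preview, and unsupported buckets."""
    groups = {"supported": [], "preview": [], "unsupported": []}
    for path in paths:
        suffix = Path(path).suffix.lower()
        if suffix in SUPPORTED:
            groups["supported"].append(path)
        elif suffix in PREVIEW:
            groups["preview"].append(path)
        else:
            groups["unsupported"].append(path)
    return groups


groups = split_by_support(["FAQ.pdf", "README.md", "notes.docx"])
print(groups["unsupported"])  # files that need converting before upload
```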

#### Uploading the tuning dataset files to Cloud Storage

In [None]:
!gcloud storage cp {TUNING_DATA_PATH_LOCAL}/*.jsonl "{TUNING_DATA_PATH_REMOTE}"
!gcloud storage cp {TUNING_DATA_PATH_LOCAL}/*.tsv "{TUNING_DATA_PATH_REMOTE}"
!gcloud storage ls "{TUNING_DATA_PATH_REMOTE}"

#### Setting the tuning dataset paths and performing the tuning

These are the tuning dataset files used to tune the backend LLM behind the search app. For details, refer to [Prepare data for ingesting](https://cloud.google.com/generative-ai-app-builder/docs/prepare-data#website) in the Google Cloud documentation.
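
To make the roles of the files concrete, here is a small sketch that checks whether every label row points at a known query and document. The field names (`_id`, `title`, `text`) and the TSV header (`query-id`, `corpus-id`, `score`) reflect my understanding of the search tuning data format and should be verified against the documentation. The sample rows are made up, and `load_ids` and `dangling_labels` are hypothetical helpers.

```python
import csv
import io
import json

# Tiny in-memory stand-ins for the dataset files (contents are made up).
corpus_jsonl = '{"_id": "doc1", "title": "FAQ", "text": "Rollout status shows progress."}\n'
query_jsonl = '{"_id": "q1", "text": "How do I check a deployment?"}\n'
train_tsv = "query-id\tcorpus-id\tscore\nq1\tdoc1\t1\n"


def load_ids(jsonl_text):
    """Return the set of `_id` values in a JSONL string."""
    return {json.loads(line)["_id"] for line in jsonl_text.splitlines() if line.strip()}


def dangling_labels(tsv_text, query_ids, corpus_ids):
    """Return label rows that reference an unknown query or document."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row
        for row in reader
        if row["query-id"] not in query_ids or row["corpus-id"] not in corpus_ids
    ]


bad = dangling_labels(train_tsv, load_ids(query_jsonl), load_ids(corpus_jsonl))
print(f"{len(bad)} label rows reference missing IDs")
```

Running a check like this on the local files before uploading can surface ID mismatches without waiting for the tuning job to fail.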

In [None]:
data_store_id = f"{SEARCH_DATASTORE_ID}"
corpus_data_path = f"{TUNING_DATA_PATH_REMOTE}/corpus_file.jsonl"
query_data_path = f"{TUNING_DATA_PATH_REMOTE}/query_file.jsonl"
train_data_path = f"{TUNING_DATA_PATH_REMOTE}/training_data.tsv"
test_data_path = f"{TUNING_DATA_PATH_REMOTE}/test_data.tsv"

The ```train_custom_model``` function submits a search tuning job with the datasets we just prepared.

In [None]:
tuning_op = train_custom_model(
    PROJECT_ID,
    LOCATION,
    data_store_id,
    corpus_data_path,
    query_data_path,
    train_data_path,
    test_data_path,
)

Next, we upload three additional documents related to the tuning task to the datastore bucket: ```FAQ-Kubernetes-Client.pdf```, ```FAQ.pdf```, and ```README.pdf```. Because the bucket now holds new documents, we index it again by calling the ```import_documents``` function.

In [None]:
!gcloud storage cp "{TUNING_DATA_PATH_LOCAL}/*.pdf" "{SEARCH_DATASTORE_PATH_REMOTE}"
!gcloud storage ls "{SEARCH_DATASTORE_PATH_REMOTE}"

In [None]:
metadata = import_documents(PROJECT_ID, LOCATION, SEARCH_DATASTORE_ID)

#### Testing the tuned search app endpoint with a question on Kubernetes

The tuning job takes about 30 to 60 minutes. After it completes, we test the search app with a query about Kubernetes, a topic covered by the documents that were indexed alongside the tuning.
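
The `train_custom_model` helper blocks on `operation.result()`. If you would rather poll with your own interval and timeout, a loop like the one below is one option. It is a sketch that assumes only that the operation object offers `done()` and `result()`. `wait_until_done` and `FakeOperation` are hypothetical names, and the fake stands in for a real long-running operation so the example runs offline.

```python
import time


def wait_until_done(operation, poll_seconds=60.0, timeout_seconds=2 * 60 * 60):
    """Poll `operation.done()` until it finishes, then return `operation.result()`."""
    deadline = time.monotonic() + timeout_seconds
    while not operation.done():
        if time.monotonic() >= deadline:
            raise TimeoutError("tuning operation did not finish in time")
        time.sleep(poll_seconds)
    return operation.result()


class FakeOperation:
    """Stand-in that reports completion on the third `done()` call."""

    def __init__(self):
        self.calls = 0

    def done(self):
        self.calls += 1
        return self.calls >= 3

    def result(self):
        return "tuned"


fake = FakeOperation()
outcome = wait_until_done(fake, poll_seconds=0.01)
print(outcome)
```

The two-hour default timeout is an arbitrary choice that leaves headroom over the estimate above; adjust it to your own tolerance.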

In [None]:
QUERY_PROMPT = """
    How do I determine the status of a deployment of Kubernetes?
"""

We can see that the information on Kubernetes deployments, which is described in the FAQ documents, is returned correctly from the newly indexed documents.

In [None]:
search_response = search(PROJECT_ID, LOCATION, SEARCH_APP_ID, QUERY_PROMPT)

## Clean up

We clean up the deployed resources and data to avoid unnecessary costs.

#### Deleting the search app

In [None]:
delete_search_app_op_name = delete_engine(PROJECT_ID, LOCATION, SEARCH_APP_ID)

#### Deleting the documents in the datastore

In [None]:
purge_document_metadata = purge_documents(PROJECT_ID, LOCATION, SEARCH_DATASTORE_ID)

#### Deleting the datastore

In [None]:
delete_datastore_op_name = delete_data_store(PROJECT_ID, LOCATION, SEARCH_DATASTORE_ID)

In [None]:
!gcloud storage rm -r "{BUCKET_URI}"