search/create_datastore_and_search.ipynb (533 lines of code) (raw):

{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "829cc6c6-f8c3-4fc6-942b-f698be5fa1a2" }, "source": [ "# Create a Vertex AI Datastore and Search Engine\n", "\n", "<table align=\"left\">\n", "\n", " <td>\n", " <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/search/create_datastore_and_search.ipynb\">\n", " <img src=\"https://cloud.google.com/ml-engine/images/colab-logo-32px.png\" alt=\"Colab logo\"> Run in Colab\n", " </a>\n", " </td>\n", " <td>\n", " <a href=\"https://github.com/GoogleCloudPlatform/generative-ai/blob/main/search/create_datastore_and_search.ipynb\">\n", " <img src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" alt=\"GitHub logo\">\n", " View on GitHub\n", " </a>\n", " </td>\n", " <td>\n", " <a href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/search/create_datastore_and_search.ipynb\">\n", " <img src=\"https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32\" alt=\"Vertex AI logo\">\n", " Open in Vertex AI Workbench\n", " </a>\n", " </td>\n", "</table>\n", "\n", "<div style=\"clear: both;\"></div>\n", "\n", "<b>Share to:</b>\n", "\n", "<a href=\"https://www.linkedin.com/sharing/share-offsite/?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/search/create_datastore_and_search.ipynb\" target=\"_blank\">\n", " <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/8/81/LinkedIn_icon.svg\" alt=\"LinkedIn logo\">\n", "</a>\n", "\n", "<a href=\"https://bsky.app/intent/compose?text=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/search/create_datastore_and_search.ipynb\" target=\"_blank\">\n", " <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/7/7a/Bluesky_Logo.svg\" alt=\"Bluesky logo\">\n", "</a>\n", "\n", "<a href=\"https://twitter.com/intent/tweet?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/search/create_datastore_and_search.ipynb\" target=\"_blank\">\n", " <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/5a/X_icon_2.svg\" alt=\"X logo\">\n", "</a>\n", "\n", "<a href=\"https://reddit.com/submit?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/search/create_datastore_and_search.ipynb\" target=\"_blank\">\n", " <img width=\"20px\" src=\"https://redditinc.com/hubfs/Reddit%20Inc/Brand/Reddit_Logo.png\" alt=\"Reddit logo\">\n", "</a>\n", "\n", "<a href=\"https://www.facebook.com/sharer/sharer.php?u=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/search/create_datastore_and_search.ipynb\" target=\"_blank\">\n", " <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/51/Facebook_f_logo_%282019%29.svg\" alt=\"Facebook logo\">\n", "</a> " ] }, { "cell_type": "markdown", "metadata": { "id": "0460bdbe-05db-42f8-822c-82195de8329a" }, "source": [ "---\n", "\n", "* Author(s): [Kara Greenfield](https://github.com/kgreenfield2)\n", "* Created: 22 Nov 2023\n", "* Updated: 31 Oct 2024\n", "\n", "---\n", "\n", "## Objective\n", "\n", "This notebook shows how to create and populate a Vertex AI Search Datastore, how to create a search app connected to that datastore, and how to submit queries through the search engine.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "0b4b1050-7113-487e-aeaf-55690c831a1d" }, "source": [ "Services used in the notebook:\n", "\n", "- ✅ Vertex AI Search for document search and retrieval" ] }, { "cell_type": "markdown", "metadata": { "id": "43828625-a130-449c-ba5f-6a948220f559" }, "source": [ "## Install pre-requisites\n", "\n", "If running in Colab install the pre-requisites into the runtime. Otherwise it is assumed that the notebook is running in Vertex AI Workbench." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "8ea5db8a-dccc-4442-b5d7-7088d5ffb5ac" }, "outputs": [], "source": [ "%pip install --upgrade --user -q google-cloud-discoveryengine" ] }, { "cell_type": "markdown", "metadata": { "id": "10f9e321" }, "source": [ "### Restart current runtime\n", "\n", "To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which will restart the current kernel." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "da755736" }, "outputs": [], "source": [ "# Restart kernel after installs so that your environment can access the new packages\n", "\n", "import IPython\n", "\n", "app = IPython.Application.instance()\n", "app.kernel.do_shutdown(True)" ] }, { "cell_type": "markdown", "metadata": { "id": "9c31dbe0" }, "source": [ "<div class=\"alert alert-block alert-warning\">\n", "<b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. ⚠️</b>\n", "</div>\n" ] }, { "cell_type": "markdown", "metadata": { "id": "5f6a5de5-156e-4f14-99e6-3ab33e076c81" }, "source": [ "## Authenticate\n", "\n", "If running in Colab authenticate with `google.colab.google.auth` otherwise assume that running on Vertex AI Workbench." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "df3b7309-c51c-44a0-a466-b0b2733e0c28" }, "outputs": [], "source": [ "import sys\n", "\n", "if \"google.colab\" in sys.modules:\n", " from google.colab import auth\n", "\n", " auth.authenticate_user()" ] }, { "cell_type": "markdown", "metadata": { "id": "ae7a4925-145a-40f3-9fa1-3b69a42d488d" }, "source": [ "## Configure notebook environment" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "389b9e51-1ce4-4bdf-80b5-fcdc1882f853" }, "outputs": [], "source": [ "from google.api_core.client_options import ClientOptions\n", "from google.cloud import discoveryengine\n", "\n", "PROJECT_ID = \"YOUR_PROJECT_ID\" # @param {type:\"string\"}\n", "LOCATION = \"global\"" ] }, { "cell_type": "markdown", "metadata": { "id": "HjoerqoksBdx" }, "source": [ "Set [Application Default Credentials](https://cloud.google.com/docs/authentication/application-default-credentials)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "s-jH2yG1rtxn" }, "outputs": [], "source": [ "!gcloud auth application-default login --project {PROJECT_ID}" ] }, { "cell_type": "markdown", "metadata": { "id": "07f1aecd-4633-4451-b5a7-1f26e4cb2631" }, "source": [ "## Create and Populate a Datastore" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "id": "9a1d0021-e299-49d4-a657-2e101ae49eb6" }, "outputs": [], "source": [ "def create_data_store(\n", " project_id: str, location: str, data_store_name: str, data_store_id: str\n", "):\n", " # Create a client\n", " client_options = (\n", " ClientOptions(api_endpoint=f\"{location}-discoveryengine.googleapis.com\")\n", " if location != \"global\"\n", " else None\n", " )\n", " client = discoveryengine.DataStoreServiceClient(client_options=client_options)\n", "\n", " # Initialize request argument(s)\n", " data_store = discoveryengine.DataStore(\n", " display_name=data_store_name,\n", " industry_vertical=discoveryengine.IndustryVertical.GENERIC,\n", " content_config=discoveryengine.DataStore.ContentConfig.CONTENT_REQUIRED,\n", " )\n", "\n", " operation = client.create_data_store(\n", " request=discoveryengine.CreateDataStoreRequest(\n", " parent=client.collection_path(project_id, location, \"default_collection\"),\n", " data_store=data_store,\n", " data_store_id=data_store_id,\n", " )\n", " )\n", "\n", " # Make the request\n", " # The try block is necessary to prevent execution from halting due to an error being thrown when the datastore takes a while to instantiate\n", " try:\n", " response = operation.result(timeout=90)\n", " except:\n", " print(\"long-running operation error.\")" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "id": "8a4db726-8da1-4c76-8934-944aaf5f9b53" }, "outputs": [], "source": [ "# The datastore name can only contain lowercase letters, numbers, and hyphens\n", "DATASTORE_NAME = \"alphabet-contracts\"\n", "DATASTORE_ID = f\"{DATASTORE_NAME}-id\"\n", "\n", "create_data_store(PROJECT_ID, LOCATION, DATASTORE_NAME, DATASTORE_ID)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "id": "03121270-5d2f-403b-81ea-c1d241357bd1" }, "outputs": [], "source": [ "def import_documents(\n", " project_id: str,\n", " location: str,\n", " data_store_id: str,\n", " gcs_uri: str,\n", "):\n", " # Create a client\n", " client_options = (\n", " ClientOptions(api_endpoint=f\"{location}-discoveryengine.googleapis.com\")\n", " if location != \"global\"\n", " else None\n", " )\n", " client = discoveryengine.DocumentServiceClient(client_options=client_options)\n", "\n", " # The full resource name of the search engine branch.\n", " # e.g. projects/{project}/locations/{location}/dataStores/{data_store_id}/branches/{branch}\n", " parent = client.branch_path(\n", " project=project_id,\n", " location=location,\n", " data_store=data_store_id,\n", " branch=\"default_branch\",\n", " )\n", "\n", " source_documents = [f\"{gcs_uri}/*\"]\n", "\n", " request = discoveryengine.ImportDocumentsRequest(\n", " parent=parent,\n", " gcs_source=discoveryengine.GcsSource(\n", " input_uris=source_documents, data_schema=\"content\"\n", " ),\n", " # Options: `FULL`, `INCREMENTAL`\n", " reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.INCREMENTAL,\n", " )\n", "\n", " # Make the request\n", " operation = client.import_documents(request=request)\n", "\n", " response = operation.result()\n", "\n", " # Once the operation is complete,\n", " # get information from operation metadata\n", " metadata = discoveryengine.ImportDocumentsMetadata(operation.metadata)\n", "\n", " # Handle the response\n", " return operation.operation.name" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "id": "1ddfe66a-acb4-4fdb-b9ed-76a332bb0f0c" }, "outputs": [], "source": [ "source_documents_gs_uri = (\n", " \"gs://cloud-samples-data/gen-app-builder/search/alphabet-investor-pdfs\"\n", ")\n", "\n", "import_documents(PROJECT_ID, LOCATION, DATASTORE_ID, source_documents_gs_uri)" ] }, { "cell_type": "markdown", "metadata": { "id": "7a957202-b67e-47ca-84c3-b8a62cfbe405" }, "source": [ "## Create a Search Engine\n", "\n", "This is used to set the `search_tier` to enterprise and to enable advanced LLM features.\n", "\n", "Enterprise tier is required to get extractive answers from a search query and advanced LLM features are required to summarize search results." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "id": "0e39e9bb-f381-44f6-a45e-3322669a171f" }, "outputs": [], "source": [ "def create_engine(\n", " project_id: str, location: str, engine_name: str, engine_id: str, data_store_id: str\n", "):\n", " # Create a client\n", " client_options = (\n", " ClientOptions(api_endpoint=f\"{location}-discoveryengine.googleapis.com\")\n", " if location != \"global\"\n", " else None\n", " )\n", " client = discoveryengine.EngineServiceClient(client_options=client_options)\n", "\n", " # Initialize request argument(s)\n", " engine = discoveryengine.Engine(\n", " display_name=engine_name,\n", " solution_type=discoveryengine.SolutionType.SOLUTION_TYPE_SEARCH,\n", " industry_vertical=discoveryengine.IndustryVertical.GENERIC,\n", " data_store_ids=[data_store_id],\n", " search_engine_config=discoveryengine.Engine.SearchEngineConfig(\n", " search_tier=discoveryengine.SearchTier.SEARCH_TIER_ENTERPRISE,\n", " search_add_ons=[discoveryengine.SearchAddOn.SEARCH_ADD_ON_LLM],\n", " ),\n", " )\n", "\n", " request = discoveryengine.CreateEngineRequest(\n", " parent=client.collection_path(project_id, location, \"default_collection\"),\n", " engine=engine,\n", " engine_id=engine.display_name,\n", " )\n", "\n", " # Make the request\n", " operation = client.create_engine(request=request)\n", " response = operation.result(timeout=90)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "id": "4a853982-8c2e-402a-a808-5364bf932619" }, "outputs": [], "source": [ "ENGINE_NAME = DATASTORE_NAME\n", "ENGINE_ID = DATASTORE_ID\n", "create_engine(PROJECT_ID, LOCATION, ENGINE_NAME, ENGINE_ID, DATASTORE_ID)" ] }, { "cell_type": "markdown", "metadata": { "id": "e9f4d978-9164-4de3-b01a-179051706313" }, "source": [ "## Query your Search Engine\n", "\n", "Note: The Engine will take some time to be ready to query.\n", "\n", "If you recently created an engine and you receive an error similar to:\n", "\n", "`404 Engine {ENGINE_NAME} is not found`\n", "\n", "Then wait a few minutes and try your query again." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "id": "1c4dfb62-7846-43d0-9cba-fd8886ce5546" }, "outputs": [], "source": [ "def search_sample(\n", " project_id: str,\n", " location: str,\n", " engine_id: str,\n", " search_query: str,\n", ") -> list[discoveryengine.SearchResponse]:\n", " # For more information, refer to:\n", " # https://cloud.google.com/generative-ai-app-builder/docs/locations#specify_a_multi-region_for_your_data_store\n", " client_options = (\n", " ClientOptions(api_endpoint=f\"{location}-discoveryengine.googleapis.com\")\n", " if LOCATION != \"global\"\n", " else None\n", " )\n", "\n", " # Create a client\n", " client = discoveryengine.SearchServiceClient(client_options=client_options)\n", "\n", " # The full resource name of the search engine serving config\n", " # e.g. projects/{project_id}/locations/{location}/dataStores/{data_store_id}/servingConfigs/{serving_config_id}\n", " serving_config = f\"projects/{project_id}/locations/{location}/collections/default_collection/engines/{engine_id}/servingConfigs/default_search\"\n", "\n", " # Optional: Configuration options for search\n", " # Refer to the `ContentSearchSpec` reference for all supported fields:\n", " # https://cloud.google.com/python/docs/reference/discoveryengine/latest/google.cloud.discoveryengine_v1.types.SearchRequest.ContentSearchSpec\n", " content_search_spec = discoveryengine.SearchRequest.ContentSearchSpec(\n", " # For information about snippets, refer to:\n", " # https://cloud.google.com/generative-ai-app-builder/docs/snippets\n", " snippet_spec=discoveryengine.SearchRequest.ContentSearchSpec.SnippetSpec(\n", " return_snippet=True\n", " ),\n", " # For information about search summaries, refer to:\n", " # https://cloud.google.com/generative-ai-app-builder/docs/get-search-summaries\n", " summary_spec=discoveryengine.SearchRequest.ContentSearchSpec.SummarySpec(\n", " summary_result_count=5,\n", " include_citations=True,\n", " ignore_adversarial_query=True,\n", " ignore_non_summary_seeking_query=True,\n", " ),\n", " )\n", "\n", " # Refer to the `SearchRequest` reference for all supported fields:\n", " # https://cloud.google.com/python/docs/reference/discoveryengine/latest/google.cloud.discoveryengine_v1.types.SearchRequest\n", " request = discoveryengine.SearchRequest(\n", " serving_config=serving_config,\n", " query=search_query,\n", " page_size=10,\n", " content_search_spec=content_search_spec,\n", " query_expansion_spec=discoveryengine.SearchRequest.QueryExpansionSpec(\n", " condition=discoveryengine.SearchRequest.QueryExpansionSpec.Condition.AUTO,\n", " ),\n", " spell_correction_spec=discoveryengine.SearchRequest.SpellCorrectionSpec(\n", " mode=discoveryengine.SearchRequest.SpellCorrectionSpec.Mode.AUTO\n", " ),\n", " )\n", "\n", " response = client.search(request)\n", " return response" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "id": "ad41e18b-38d2-4f4c-98ae-df14eda900ae" }, "outputs": [], "source": [ "query = \"Who is the CEO of Google?\"\n", "\n", "response = search_sample(PROJECT_ID, LOCATION, ENGINE_ID, query)\n", "print(response.summary.summary_text)" ] } ], "metadata": { "colab": { "name": "create_datastore_and_search.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }