search/search_data_blending_with_gemini_summarization.ipynb (534 lines of code) (raw):
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "2eec5cc39a59"
},
"outputs": [],
"source": [
"# Copyright 2024 Google LLC\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8629a3a859ee"
},
"source": [
"# Vertex AI Search - Querying Blended Data Apps and Summarization with Gemini\n",
"\n",
"<table align=\"left\">\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/search/search_data_blending_with_gemini_summarization.ipynb\">\n",
" <img src=\"https://cloud.google.com/ml-engine/images/colab-logo-32px.png\" alt=\"Google Colaboratory logo\"><br> Run in Colab\n",
" </a>\n",
" </td>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fsearch%2Fsearch_data_blending_with_gemini_summarization.ipynb\">\n",
" <img width=\"32px\" src=\"https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN\" alt=\"Google Cloud Colab Enterprise logo\"><br> Run in Colab Enterprise\n",
" </a>\n",
" </td> \n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://github.com/GoogleCloudPlatform/generative-ai/blob/main/search/search_data_blending_with_gemini_summarization.ipynb\">\n",
" <img src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" alt=\"GitHub logo\"><br> View on GitHub\n",
" </a>\n",
" </td>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/search/search_data_blending_with_gemini_summarization.ipynb\">\n",
" <img src=\"https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32\" alt=\"Vertex AI logo\"><br> Open in Vertex AI Workbench\n",
" </a>\n",
" </td>\n",
"</table>\n",
"\n",
"<div style=\"clear: both;\"></div>\n",
"\n",
"<b>Share to:</b>\n",
"\n",
"<a href=\"https://www.linkedin.com/sharing/share-offsite/?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/search/search_data_blending_with_gemini_summarization.ipynb\" target=\"_blank\">\n",
" <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/8/81/LinkedIn_icon.svg\" alt=\"LinkedIn logo\">\n",
"</a>\n",
"\n",
"<a href=\"https://bsky.app/intent/compose?text=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/search/search_data_blending_with_gemini_summarization.ipynb\" target=\"_blank\">\n",
" <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/7/7a/Bluesky_Logo.svg\" alt=\"Bluesky logo\">\n",
"</a>\n",
"\n",
"<a href=\"https://twitter.com/intent/tweet?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/search/search_data_blending_with_gemini_summarization.ipynb\" target=\"_blank\">\n",
" <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/5a/X_icon_2.svg\" alt=\"X logo\">\n",
"</a>\n",
"\n",
"<a href=\"https://reddit.com/submit?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/search/search_data_blending_with_gemini_summarization.ipynb\" target=\"_blank\">\n",
" <img width=\"20px\" src=\"https://redditinc.com/hubfs/Reddit%20Inc/Brand/Reddit_Logo.png\" alt=\"Reddit logo\">\n",
"</a>\n",
"\n",
"<a href=\"https://www.facebook.com/sharer/sharer.php?u=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/search/search_data_blending_with_gemini_summarization.ipynb\" target=\"_blank\">\n",
" <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/51/Facebook_f_logo_%282019%29.svg\" alt=\"Facebook logo\">\n",
"</a> \n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "868a314524de"
},
"source": [
"| | |\n",
"|-|-|\n",
"|Author(s) | [Shantam Gupta](https://github.com/ShantamGupta)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "80a01ba339cd"
},
"source": [
"## Overview\n",
"\n",
"### Search\n",
"Vertex AI Search brings together the power of deep information retrieval, state-of-the-art natural language processing, and the latest in large language processing to understand user intent and return the most relevant results for the user.\n",
"\n",
"With Vertex AI Search, you can create apps for searching and for making recommendations. Vertex AI Search also has special capabilities for some industries, such as media, healthcare, and retail.\n",
"\n",
"\n",
"\n",
"### Gemini\n",
"\n",
"Gemini is a family of generative AI models developed by Google DeepMind that is designed for multimodal use cases. The Gemini API gives you access to the Gemini models.\n",
"\n",
"### Gemini API in Vertex AI\n",
"\n",
"The Gemini API in Vertex AI provides a unified interface for interacting with Gemini models.\n",
"\n",
"You can interact with the Gemini API using the following methods:\n",
"\n",
"- Use [Vertex AI Studio](https://cloud.google.com/generative-ai-studio) for quick testing and command generation\n",
"- Use cURL commands\n",
"- Use the Vertex AI SDK\n",
"\n",
"For more information, see the [Generative AI on Vertex AI](https://cloud.google.com/vertex-ai/docs/generative-ai/learn/overview) documentation.\n",
"\n",
"This tutorial explains how to call a search app with mixed datastore, get search snippets and summarize the response using Gemini. \n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "979f94dd918d"
},
"source": [
"### Create a Search App with Mixed Datastores"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0e9cb170ef48"
},
"source": [
"1. Follow the steps listed here to create a Search App\n",
" - https://cloud.google.com/generative-ai-app-builder/docs/create-engine-es \n",
"2. Create the relevant data stores (GCS, BQ, Website)\n",
" - https://cloud.google.com/generative-ai-app-builder/docs/create-data-store-es\n",
"3. Link the data stores to the Search App\n",
" - https://cloud.google.com/generative-ai-app-builder/docs/create-data-store-es#multi-data-stores\n",
"\n",
"The example query and results are based on the data used in this tutorial:\n",
" - https://cloud.google.com/generative-ai-app-builder/docs/try-enterprise-search"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "00915a917846"
},
"source": [
"### Install the Relevant packages"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "33fd35fef7df"
},
"outputs": [],
"source": [
"%pip install --upgrade --user -q google-cloud-aiplatform google-cloud-discoveryengine"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "17c740aa336a"
},
"source": [
"### Restart current runtime\n",
"\n",
"To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which will restart the current kernel."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "9b74e6a74dcf"
},
"outputs": [],
"source": [
"# Restart kernel after installs so that your environment can access the new packages\n",
"import IPython\n",
"\n",
"app = IPython.Application.instance()\n",
"app.kernel.do_shutdown(True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ea71b80ced17"
},
"source": [
"### Authenticate your notebook environment (Colab only)\n",
"\n",
"If you are running this notebook on Google Colab, run the following cell to authenticate your environment. This step is not required if you are using [Vertex AI Workbench](https://cloud.google.com/vertex-ai-workbench).\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "9a9d9cb1eed5"
},
"outputs": [],
"source": [
"import sys\n",
"\n",
"# Additional authentication is required for Google Colab\n",
"if \"google.colab\" in sys.modules:\n",
" # Authenticate user to Google Cloud\n",
" from google.colab import auth\n",
"\n",
" auth.authenticate_user()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8dab929b650e"
},
"source": [
"### Define Google Cloud project information"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"id": "813f4938f474"
},
"outputs": [],
"source": [
"PROJECT_ID = \"PROJECT_ID\" # @param {type:\"string\"}\n",
"SEARCH_APP_LOCATION = \"global\" # @param {type:\"string\"}\n",
"SEARCH_ENGINE_ID = \"VERTEX_SEARCH_ENGINE_ID\" # @param {type:\"string\"}\n",
"LOCATION_GEMINI_MODEL = \"us-central1\" # @param {type:\"string\"}"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "76740b1f1948"
},
"source": [
"### Initialize the Vertex AI SDK"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"id": "3aa18dc28939"
},
"outputs": [],
"source": [
"import vertexai\n",
"\n",
"vertexai.init(project=PROJECT_ID, location=LOCATION_GEMINI_MODEL)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "b3c01cac62b6"
},
"source": [
"### Import the Relevant packages"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"id": "ca3c04d1fe5c"
},
"outputs": [],
"source": [
"import re\n",
"\n",
"from google.api_core.client_options import ClientOptions\n",
"from google.cloud import discoveryengine_v1alpha as discoveryengine\n",
"from vertexai.generative_models import (\n",
" GenerationConfig,\n",
" GenerativeModel,\n",
" HarmBlockThreshold,\n",
" HarmCategory,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "a108d569054e"
},
"source": [
"### Send a Request to Vertex AI Search App with Data Blending (Mixed Datastore) \n",
"\n",
"- https://cloud.google.com/generative-ai-app-builder/docs/create-data-store-es#multi-data-stores"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {
"id": "3eaf984cdea5"
},
"outputs": [],
"source": [
"search_query = \"What was Google's revenue in Q4 2020?\""
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {
"id": "90306aad568a"
},
"outputs": [],
"source": [
"# Create a client using a regional endpoint\n",
"client = discoveryengine.SearchServiceClient(\n",
" client_options=(\n",
" ClientOptions(\n",
" api_endpoint=f\"{SEARCH_APP_LOCATION}-discoveryengine.googleapis.com\"\n",
" )\n",
" if SEARCH_APP_LOCATION != \"global\"\n",
" else None\n",
" )\n",
")\n",
"\n",
"# The full resource name of the search app serving config\n",
"serving_config = f\"projects/{PROJECT_ID}/locations/{SEARCH_APP_LOCATION}/collections/default_collection/engines/{SEARCH_ENGINE_ID}/servingConfigs/default_config\"\n",
"\n",
"response = client.search(\n",
" discoveryengine.SearchRequest(\n",
" serving_config=serving_config,\n",
" query=search_query,\n",
" page_size=10,\n",
" )\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4108b8c326aa"
},
"source": [
"### Extract & clean up snippets from search results\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "0907519d28ba"
},
"outputs": [],
"source": [
"retrieved_data: list[str] = []\n",
"\n",
"for result in response.results:\n",
" data = result.document.derived_struct_data\n",
" if not data:\n",
" continue\n",
"\n",
" snippets: list[str] = [\n",
" re.sub(\"<[^>]*>\", \"\", snippet_item.get(\"snippet\", \"\"))\n",
" for snippet_item in data.get(\"snippets\", [])\n",
" if snippet_item.get(\"snippet\")\n",
" ]\n",
"\n",
" extractive_answers: list[str] = [\n",
" re.sub(\"<[^>]*>\", \"\", snippet_item.get(\"content\", \"\"))\n",
" for snippet_item in data.get(\"extractive_answers\", [])\n",
" if snippet_item.get(\"content\")\n",
" ]\n",
"\n",
" if snippets:\n",
" title = data.get(\"title\", \"Unknown Title\")\n",
" retrieved_data.append(\n",
" f\"--- Snippets from Document {title} ---\\n{''.join(snippets)}\\n\"\n",
" )\n",
" elif extractive_answers:\n",
" title = data.get(\"link\", \"Unknown\")\n",
" retrieved_data.append(\n",
" f\"--- Snippets from Document {title} ---\\n{''.join(extractive_answers)}\\n\"\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4e573b68270b"
},
"source": [
"### Feed the Search result snippets to Gemini model and formulate a summary/response based on your original prompt"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9a86f795306e"
},
"source": [
"#### Model parameters\n",
"\n",
"Every prompt you send to the model includes parameter values that control how the model generates a response.\n",
"\n",
"The model can generate different results for different parameter values.\n",
"\n",
"You can experiment with different model parameters to see how the results change.\n"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {
"id": "3287617a4732"
},
"outputs": [],
"source": [
"generation_config = GenerationConfig(\n",
" temperature=0,\n",
" top_p=1.0,\n",
" max_output_tokens=2048,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {
"id": "fa4ef98d51d9"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"PROMPT:\n",
"Provide an answer to the question based on the information in the Document snippets provided with citations.\n",
"Question: What was Google's revenue in Q4 2020?\n",
"--- Snippets from Document GOOG Exhibit 99.1 Q4'20 ---\n",
"Google Cloud revenues were $13.1 billion for 2020, with significant ongoing ... Quarter Q4 2019 Q4 2020 Fiscal Year 2019 2018 2020 Revenues: $ 43,198 $ 2,614 ...\n",
"--- Snippets from Document GOOG 10-K Q4 2020 ---\n",
"... Google Maps, Google Play, Search, and YouTube. Google Services generates revenues primarily from advertising; sales of apps, in-app purchases, digital ...\n",
"--- Snippets from Document GOOG 10-K Q4 2020 ---\n",
"Table of Contents Alphabet Inc. Google other revenues increased $4,697 million from 2019 to 2020. The growth was primarily driven by Google Play and YouTube ...\n",
"--- Snippets from Document GOOG 10-K Q4 2020 ---\n",
"The TAC rate on Google properties revenues and the TAC rate on Google Network revenues were both substantially consistent from 2019 to 2020. Over time, cost ...\n",
"--- Snippets from Document GOOG 10-K Q4 2020 ---\n",
"Our advertising revenue growth rate has been affected over ... Google Search & other Google Search & other revenues increased $5,947 million from 2019 to 2020.\n",
"--- Snippets from Document GOOG 10-K Q4 2020 ---\n",
"... revenues disaggregated by type (in millions). Year Ended December 31, 2018 2019 2020 Google Search & other YouTube ads Google Network Members' properties ...\n",
"--- Snippets from Document GOOG 10-K Q4 2020 ---\n",
"Sales and other similar taxes are excluded from revenues. Advertising Revenues We generate advertising revenues primarily by delivering advertising on Google ...\n",
"--- Snippets from Document GOOG 10-K Q4 2020 ---\n",
"Other Revenues Google other revenues and Other Bets revenues consist primarily of revenues from: • Google Play, which includes revenues from sale of apps ...\n",
"--- Snippets from Document 2021_Q4_Earnings_Transcript ---\n",
"Google search and other advertising revenues of $43.3 billion in the quarter ... fourth quarter of 2020. Network advertising revenues of $9.3 billion, were up 26 ...\n",
"--- Snippets from Document GOOG 10-K Q4 2020 ---\n",
"... 2020 Revenues: Google Services Google Cloud Other Bets $ 130,524 $ 151,825 $ 168,635 5,838 8,918 13,059 595 659 455 657 Hedging gains (losses) Total revenues ...\n",
"\n",
"\n",
"Gemini Response:\n",
"\n",
"Google's revenue in Q4 2020 was $168,635 million. (Document GOOG 10-K Q4 2020)"
]
}
],
"source": [
"# Prompt for Gemini model\n",
"PROMPT_GEMINI = f\"\"\"Provide an answer to the question based on the information in the Document snippets provided with citations.\n",
"Question: {search_query}\n",
"{''.join(retrieved_data)}\n",
"\"\"\"\n",
"\n",
"model = GenerativeModel(\"gemini-2.0-flash\") # specify the Gemini model version\n",
"\n",
"\n",
"def generate(PROMPT_GEMINI: str):\n",
" \"\"\"\n",
" Given the prompt\n",
" output the summarized response to user's original query\n",
" \"\"\"\n",
" responses = model.generate_content(\n",
" PROMPT_GEMINI,\n",
" generation_config=generation_config,\n",
" safety_settings={\n",
" HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,\n",
" HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,\n",
" HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,\n",
" HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,\n",
" },\n",
" stream=True,\n",
" )\n",
"\n",
" for response in responses:\n",
" print(response.text, end=\"\")\n",
"\n",
"\n",
"print(f\"PROMPT:\\n{PROMPT_GEMINI}\")\n",
"\n",
"print(\"Gemini Response:\\n\")\n",
"generate(PROMPT_GEMINI)"
]
}
],
"metadata": {
"colab": {
"name": "search_data_blending_with_gemini_summarization.ipynb",
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}