search/vais-building-blocks/custom_attributes_by_url_pattern.ipynb (918 lines of code) (raw):
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "5XNYlDkDLpqU"
},
"outputs": [],
"source": [
"# Copyright 2024 Google LLC\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5tR528hOD4Dx"
},
"source": [
"# Defining custom attributes based on URL patterns in Vertex AI Search Website Datastores\n",
"\n",
"<table align=\"left\">\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://art-analytics.appspot.com/r.html?uaid=G-FHXEFWTT4E&utm_source=aRT-vais-building-blocks&utm_medium=aRT-clicks&utm_campaign=vais-building-blocks&destination=vais-building-blocks&url=https%3A%2F%2Fcolab.research.google.com%2Fgithub%2FGoogleCloudPlatform%2Fapplied-ai-engineering-samples%2Fblob%2Fmain%2Fsearch%2Fvais-building-blocks%2Fcustom_attributes_by_url_pattern.ipynb\">\n",
" <img width=\"32px\" src=\"https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg\" alt=\"Google Colaboratory logo\"><br> Open in Colab\n",
" </a>\n",
" </td>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://art-analytics.appspot.com/r.html?uaid=G-FHXEFWTT4E&utm_source=aRT-vais-building-blocks&utm_medium=aRT-clicks&utm_campaign=vais-building-blocks&destination=vais-building-blocks&url=https%3A%2F%2Fconsole.cloud.google.com%2Fvertex-ai%2Fcolab%2Fimport%2Fhttps%3A%252F%252Fraw.githubusercontent.com%252FGoogleCloudPlatform%252Fapplied-ai-engineering-samples%252Fmain%252Fsearch%252Fvais-building-blocks%252Fcustom_attributes_by_url_pattern.ipynb\">\n",
" <img width=\"32px\" src=\"https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN\" alt=\"Google Cloud Colab Enterprise logo\"><br> Open in Colab Enterprise\n",
" </a>\n",
" </td>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://art-analytics.appspot.com/r.html?uaid=G-FHXEFWTT4E&utm_source=aRT-vais-building-blocks&utm_medium=aRT-clicks&utm_campaign=vais-building-blocks&destination=vais-building-blocks&url=https%3A%2F%2Fconsole.cloud.google.com%2Fvertex-ai%2Fworkbench%2Fdeploy-notebook%3Fdownload_url%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fapplied-ai-engineering-samples%2Fmain%2Fsearch%2Fvais-building-blocks%2Fcustom_attributes_by_url_pattern.ipynb\">\n",
" <img src=\"https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg\" alt=\"Vertex AI logo\"><br> Open in Vertex AI Workbench\n",
" </a>\n",
" </td>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://github.com/GoogleCloudPlatform/applied-ai-engineering-samples/blob/main/search/vais-building-blocks/custom_attributes_by_url_pattern.ipynb\">\n",
" <img width=\"32px\" src=\"https://upload.wikimedia.org/wikipedia/commons/9/91/Octicons-mark-github.svg\" alt=\"GitHub logo\"><br> View on GitHub\n",
" </a>\n",
" </td>\n",
"</table>\n",
"\n",
"<div style=\"clear: both;\"></div>\n",
"\n",
"<b>Share to:</b>\n",
"\n",
"<a href=\"https://www.linkedin.com/sharing/share-offsite/?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/search/vais-building-blocks/custom_attributes_by_url_pattern.ipynb\" target=\"_blank\">\n",
" <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/8/81/LinkedIn_icon.svg\" alt=\"LinkedIn logo\">\n",
"</a>\n",
"\n",
"<a href=\"https://bsky.app/intent/compose?text=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/search/vais-building-blocks/custom_attributes_by_url_pattern.ipynb\" target=\"_blank\">\n",
" <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/7/7a/Bluesky_Logo.svg\" alt=\"Bluesky logo\">\n",
"</a>\n",
"\n",
"<a href=\"https://twitter.com/intent/tweet?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/search/vais-building-blocks/custom_attributes_by_url_pattern.ipynb\" target=\"_blank\">\n",
" <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/53/X_logo_2023_original.svg\" alt=\"X logo\">\n",
"</a>\n",
"\n",
"<a href=\"https://reddit.com/submit?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/search/vais-building-blocks/custom_attributes_by_url_pattern.ipynb\" target=\"_blank\">\n",
" <img width=\"20px\" src=\"https://redditinc.com/hubfs/Reddit%20Inc/Brand/Reddit_Logo.png\" alt=\"Reddit logo\">\n",
"</a>\n",
"\n",
"<a href=\"https://www.facebook.com/sharer/sharer.php?u=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/search/vais-building-blocks/custom_attributes_by_url_pattern.ipynb\" target=\"_blank\">\n",
" <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/51/Facebook_f_logo_%282019%29.svg\" alt=\"Facebook logo\">\n",
"</a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "pkd93iDpEBWx"
},
"source": [
"| | |\n",
"|----------|-------------|\n",
"| Author(s) | Hossein Mansour|\n",
"| Reviewers(s) | Ismail Najim, Rajesh Thallam|\n",
"| Last updated | 2024-08-09: The first draft |"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "yAnTektvEQjb"
},
"source": [
"# Overview\n",
"\n",
"In this notebook, we demonstrate how to create [custom attributes](https://cloud.google.com/generative-ai-app-builder/docs/reference/rest/v1alpha/projects.locations.collections.dataStores.siteSearchEngine/getUriPatternDocumentData) based on URL patterns in [Vertex AI Search](https://cloud.google.com/generative-ai-app-builder/docs/introduction) Website datastores.\n",
"\n",
"These custom attributes will act similarly to metadata from page source and can be used for different purposes such as improving recall and precision, influencing results via boosting and filtering, and including additional context to be retrieved together with the documents.\n",
"\n",
"You can find more information about different types of metadata [here](https://cloud.google.com/generative-ai-app-builder/docs/provide-schema#about_providing_your_own_schema_as_a_json_object).\n",
"\n",
"Custom attributes based on URL patterns are particularly helpful in cases where adjusting page source to include relevant information is not feasible due to a need to keep that information private or when organizational complexities make it difficult to influence the page source content (e.g., content being managed by a third party).\n",
"\n",
"Custom attributes can be used, in lieu of page source metadata, in conjunction with page source metadata, or to override poor quality page content via post-processing (e.g., a Title_Override custom attribute to override the actual page title for certain URLs).\n",
"\n",
"Note that basic URL-based [boosting](https://cloud.google.com/generative-ai-app-builder/docs/boost-search-results) and [filtering](https://cloud.google.com/generative-ai-app-builder/docs/filter-website-search#examples-advanced-indexing) can be done directly. Custom Attributes are intended for more advanced use cases.\n",
"\n",
"If the custom attribute is made searchable, it can be used to implicitly influence retrieval and ranking of the page by providing additional information such as tags and related topics.\n",
"\n",
"We will perform the following steps:\n",
"\n",
"- [Prerequisite] Creating a Vertex AI Search Website Datastore and Search App\n",
"- Setting Schema and URL mapping for Customer Attributes\n",
"- Getting Schema and URL mapping to confirm this is what we want\n",
"- Searching the Datastore and demonstrating how custom attributes can be used for filtering\n",
"- Clean up\n",
"\n",
"\n",
"Please refer to the [official documentation](https://cloud.google.com/generative-ai-app-builder/docs/create-datastore-ingest) for the definition of Datastores and Apps and their relationships to one another\n",
"\n",
"REST API is used throughout this notebook. Please consult the [official documentation](https://cloud.google.com/generative-ai-app-builder/docs/apis) for alternative ways to achieve the same goal, namely Client libraries and RPC.\n",
"\n",
"\n",
"# Vertex AI Search\n",
"Vertex AI Search (VAIS) is a fully-managed platform, powered by large language models, that lets you build AI-enabled search and recommendation experiences for your public or private websites or mobile applications\n",
"\n",
"VAIS can handle a diverse set of data sources including structured, unstructured, and website data, as well as data from third-party applications such as Jira, Salesforce, and Confluence.\n",
"\n",
"VAIS also has built-in integration with LLMs which enables you to provide answers to complex questions, grounded in your data\n",
"\n",
"# Using this Notebook\n",
"If you're running outside of Colab, depending on your environment you may need to install pip packages that are included in the Colab environment by default but are not part of the Python Standard Library. Outside of Colab you'll also notice comments in code cells that look like #@something, these trigger special Colab functionality but don't change the behavior of the notebook.\n",
"\n",
"This tutorial uses the following Google Cloud services and resources:\n",
"\n",
"- Service Usage API\n",
"- Discovery Engine API\n",
"\n",
"This notebook has been tested in the following environment:\n",
"\n",
"- Python version = 3.10.12\n",
"- google.cloud.storage = 2.8.0\n",
"- google.auth = 2.27.0\n",
"\n",
"# Getting Started\n",
"\n",
"The following steps are necessary to run this notebook, no matter what notebook environment you're using.\n",
"\n",
"If you're entirely new to Google Cloud, [get started here](https://cloud.google.com/docs/get-started)\n",
"\n",
"## Google Cloud Project Setup\n",
"\n",
"1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs\n",
"2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project)\n",
"3. [Enable the Service Usage API](https://console.cloud.google.com/apis/library/serviceusage.googleapis.com)\n",
"4. [Enable the Cloud Storage API](https://console.cloud.google.com/flows/enableapi?apiid=storage.googleapis.com)\n",
"5. [Enable the Discovery Engine API for your project](https://console.cloud.google.com/marketplace/product/google/discoveryengine.googleapis.com)\n",
"\n",
"## Google Cloud Permissions\n",
"\n",
"Ideally you should have [Owner role](https://cloud.google.com/iam/docs/understanding-roles) for your project to run this notebook. If that is not an option, you need at least the following [roles](https://cloud.google.com/iam/docs/granting-changing-revoking-access)\n",
"- **`roles/serviceusage.serviceUsageAdmin`** to enable APIs\n",
"- **`roles/iam.serviceAccountAdmin`** to modify service agent permissions\n",
"- **`roles/discoveryengine.admin`** to modify discoveryengine assets"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "49x_J4vWOuNg"
},
"source": [
"#Setup Environment"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kMYYfGpyOl5G"
},
"source": [
"## Authentication\n",
"\n",
" If you're using Colab, run the code in the next cell. Follow the pop-ups and authenticate with an account that has access to your Google Cloud [project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#identifying_projects).\n",
"\n",
"If you're running this notebook somewhere besides Colab, make sure your environment has the right Google Cloud access. If that's a new concept to you, consider looking into [Application Default Credentials for your local environment](https://cloud.google.com/docs/authentication/provide-credentials-adc#local-dev) and [initializing the Google Cloud CLI](https://cloud.google.com/docs/authentication/gcloud). In many cases, running `gcloud auth application-default login` in a shell on the machine running the notebook kernel is sufficient.\n",
"\n",
"More authentication options are discussed [here](https://cloud.google.com/docs/authentication)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "DZjtfEDG7Sr3"
},
"outputs": [],
"source": [
"# Colab authentication.\n",
"import sys\n",
"\n",
"if \"google.colab\" in sys.modules:\n",
" from google.colab import auth\n",
"\n",
" auth.authenticate_user()\n",
" print(\"Authenticated\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "kT3Eda7_mlTP"
},
"outputs": [],
"source": [
"from google.auth import default\n",
"from google.auth.transport.requests import AuthorizedSession\n",
"\n",
"creds, _ = default()\n",
"authed_session = AuthorizedSession(creds)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "otijhCIjOzk-"
},
"source": [
"## Import Libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "DlIp4zv3cdA7"
},
"outputs": [],
"source": [
"import json\n",
"import time"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "N51y_mPgPHsj"
},
"source": [
"## Configure environment\n",
"\n",
"The Location of a Datastore is set at the time of creation and it should be called appropriately to query the Datastore. `global` is typically recommended unless you have a particular reason to use a regional Datastore.\n",
"\n",
"You can find more information regarding the `Location` of datastores and associated limitations [here](https://cloud.google.com/generative-ai-app-builder/docs/locations#specify_a_multi-region_for_your_data_store).\n",
"\n",
"`VAIS_BRANCH` is the branch of VAIS to use. At the time of writing this notebook, URL mapping for Custom Attributes is only available in [v1alpha](https://cloud.google.com/generative-ai-app-builder/docs/reference/rest/v1alpha/projects.locations.collections.dataStores.siteSearchEngine/getUriPatternDocumentData) of Discovery Engine API.\n",
"\n",
"\n",
"`INCLUDE_URL_PATTERN` is the pattern of a website to be included in the datastore, e.g. `www.example.com/*`, `www.example.com/abc/*`.\n",
"\n",
"Note that you need to [verify the ownership of a domain](https://cloud.google.com/generative-ai-app-builder/docs/domain-verification) to be able to index it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "hKLBf1GqROW7"
},
"outputs": [],
"source": [
"PROJECT_ID = \"\" # @param {type: 'string'}\n",
"DATASTORE_ID = \"\" # @param {type: 'string'}\n",
"APP_ID = \"\" # @param {type: 'string'}\n",
"LOCATION = \"global\" # @param [\"global\", \"us\", \"eu\"]\n",
"VAIS_BRANCH = \"v1alpha\" # @param {type: 'string'}\n",
"INCLUDE_URL_PATTERN = \"\" # @param {type: 'string'}"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Akk3C5vK8oG6"
},
"source": [
"# Step 1. [Prerequisite] Create a Website Search Datastore and APP\n",
"In this section we will programmatically create a VAIS [Advanced Website Datastore and APP](https://cloud.google.com/generative-ai-app-builder/docs/about-advanced-features#advanced-website-indexing). You can achieve the same goal with a [few clicks](https://cloud.google.com/generative-ai-app-builder/docs/website-search-checklist?indexing=advanced) in the UI.\n",
"\n",
"If you already have an Advanced Website Datastore available, you can skip this section.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "C2hXlewDINDg"
},
"source": [
"## Helper functions to issue basic search on a Datastore or an App"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "v-XHQIOooshe"
},
"outputs": [],
"source": [
"def search_by_datastore(project_id: str, location: str, datastore_id: str, query: str):\n",
" \"\"\"Searches a datastore using the provided query.\"\"\"\n",
" response = authed_session.post(\n",
" f\"https://discoveryengine.googleapis.com/{VAIS_BRANCH}/projects/{project_id}/locations/{location}/collections/default_collection/dataStores/{datastore_id}/servingConfigs/default_search:search\",\n",
" headers={\n",
" \"Content-Type\": \"application/json\",\n",
" },\n",
" json={\"query\": query, \"pageSize\": 1},\n",
" )\n",
" return response\n",
"\n",
"\n",
"def search_by_app(project_id: str, location: str, app_id: str, query: str):\n",
" \"\"\"Searches an app using the provided query.\"\"\"\n",
" response = authed_session.post(\n",
" f\"https://discoveryengine.googleapis.com/v1/projects/{project_id}/locations/{location}/collections/default_collection/engines/{app_id}/servingConfigs/default_config:search\",\n",
" headers={\n",
" \"Content-Type\": \"application/json\",\n",
" },\n",
" json={\"query\": query, \"pageSize\": 1},\n",
" )\n",
" return response"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "eAigF6KHkMZ2"
},
"source": [
"## Helper functions to check whether or not a Datastore or an App already exist"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "IO1AxLZckXYK"
},
"outputs": [],
"source": [
"def datastore_exists(project_id: str, location: str, datastore_id: str) -> bool:\n",
" \"\"\"Check if a datastore exists.\"\"\"\n",
" response = search_by_datastore(project_id, location, datastore_id, \"test\")\n",
" status_code = response.status_code\n",
" # A 400 response is expected as the URL pattern needs to be set first\n",
" if status_code == 200 or status_code == 400:\n",
" return True\n",
" if status_code == 404:\n",
" return False\n",
" raise Exception(f\"Error: {status_code}\")\n",
"\n",
"\n",
"def app_exists(project_id: str, location: str, app_id: str) -> bool:\n",
" \"\"\"Check if an App exists.\"\"\"\n",
" response = search_by_app(project_id, location, app_id, \"test\")\n",
" status_code = response.status_code\n",
" if status_code == 200:\n",
" return True\n",
" if status_code == 404:\n",
" return False\n",
" raise Exception(f\"Error: {status_code}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "DYArgsiAiVfs"
},
"source": [
"## Helper functions to create a Datastore or an App"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "_0uAQTKD78_k"
},
"outputs": [],
"source": [
"def create_website_datastore(\n",
" vais_branch: str, project_id: str, location: str, datastore_id: str\n",
") -> int:\n",
" \"\"\"Create a website datastore\"\"\"\n",
" payload = {\n",
" \"displayName\": datastore_id,\n",
" \"industryVertical\": \"GENERIC\",\n",
" \"solutionTypes\": [\"SOLUTION_TYPE_SEARCH\"],\n",
" \"contentConfig\": \"PUBLIC_WEBSITE\",\n",
" }\n",
" header = {\"X-Goog-User-Project\": project_id, \"Content-Type\": \"application/json\"}\n",
" es_endpoint = f\"https://discoveryengine.googleapis.com/{vais_branch}/projects/{project_id}/locations/{location}/collections/default_collection/dataStores?dataStoreId={datastore_id}\"\n",
" response = authed_session.post(\n",
" es_endpoint, data=json.dumps(payload), headers=header\n",
" )\n",
" if response.status_code == 200:\n",
" print(f\"The creation of Datastore {datastore_id} is initiated.\")\n",
" print(\"It may take a few minutes for the Datastore to become available\")\n",
" else:\n",
" print(f\"Failed to create Datastore {datastore_id}\")\n",
" print(response.text())\n",
" return response.status_code\n",
"\n",
"\n",
"def create_app(\n",
" vais_branch: str, project_id: str, location: str, datastore_id: str, app_id: str\n",
") -> int:\n",
" \"\"\"Create a search app.\"\"\"\n",
" payload = {\n",
" \"displayName\": app_id,\n",
" \"dataStoreIds\": [datastore_id],\n",
" \"solutionType\": \"SOLUTION_TYPE_SEARCH\",\n",
" \"searchEngineConfig\": {\n",
" \"searchTier\": \"SEARCH_TIER_ENTERPRISE\",\n",
" \"searchAddOns\": [\"SEARCH_ADD_ON_LLM\"],\n",
" },\n",
" }\n",
" header = {\"X-Goog-User-Project\": project_id, \"Content-Type\": \"application/json\"}\n",
" es_endpoint = f\"https://discoveryengine.googleapis.com/{vais_branch}/projects/{project_id}/locations/{location}/collections/default_collection/engines?engineId={app_id}\"\n",
" response = authed_session.post(\n",
" es_endpoint, data=json.dumps(payload), headers=header\n",
" )\n",
" if response.status_code == 200:\n",
" print(f\"The creation of App {app_id} is initiated.\")\n",
" print(\"It may take a few minutes for the App to become available\")\n",
" else:\n",
" print(f\"Failed to create App {app_id}\")\n",
" print(response.json())\n",
" return response.status_code"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "1hAp5cBnIYxJ"
},
"source": [
"## Create a Datastores with the provided ID if it doesn't exist\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "hBUwJxxeAazj"
},
"outputs": [],
"source": [
"if datastore_exists(PROJECT_ID, LOCATION, DATASTORE_ID):\n",
" print(f\"Datastore {DATASTORE_ID} already exists.\")\n",
"else:\n",
" create_website_datastore(VAIS_BRANCH, PROJECT_ID, LOCATION, DATASTORE_ID)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "C1d-pd2WLJZI"
},
"source": [
"## [Optional] Check if the Datastore is created successfully\n",
"\n",
"\n",
"The Datastore is polled to track when it becomes available.\n",
"\n",
"This may take a few minutes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "EZGzOCnTLOwf"
},
"outputs": [],
"source": [
"while not datastore_exists(PROJECT_ID, LOCATION, DATASTORE_ID):\n",
" print(f\"Datastore {DATASTORE_ID} is still being created.\")\n",
" time.sleep(30)\n",
"print(f\"Datastore {DATASTORE_ID} is created successfully.\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "vSzz2AzmI5kx"
},
"source": [
"## Create an App with the provided ID if it doesn't exist\n",
"The App will be connected to a Datastore with the ID provided earlier in this notebook"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "4lp4kPXNm9sE"
},
"outputs": [],
"source": [
"if app_exists(PROJECT_ID, LOCATION, APP_ID):\n",
" print(f\"App {APP_ID} already exists.\")\n",
"else:\n",
" create_app(VAIS_BRANCH, PROJECT_ID, LOCATION, DATASTORE_ID, APP_ID)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fxlTn7dVK-Q2"
},
"source": [
"## [Optional] Check if the App is created successfully\n",
"\n",
"\n",
"The App is polled to track when it becomes available.\n",
"\n",
"This may take a few minutes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ZuQQ2HCGK4BA"
},
"outputs": [],
"source": [
"while not app_exists(PROJECT_ID, LOCATION, APP_ID):\n",
" print(f\"App {APP_ID} is still being created.\")\n",
" time.sleep(30)\n",
"print(f\"App {APP_ID} is created successfully.\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "A38IfFRD83UG"
},
"source": [
"## Upgrade an existing Website Datastore to [Advanced Website](https://cloud.google.com/generative-ai-app-builder/docs/about-advanced-features#advanced-website-indexing) DataStore"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "BYXR-yQ38vdd"
},
"outputs": [],
"source": [
"def upgrade_to_advanced(\n",
" vais_branch: str, project_id: str, location: str, datastore_id: str\n",
") -> int:\n",
" \"\"\"Upgrade the website search datastore to advanced\"\"\"\n",
" header = {\"X-Goog-User-Project\": project_id}\n",
" es_endpoint = f\"https://discoveryengine.googleapis.com/{vais_branch}/projects/{project_id}/locations/{location}/collections/default_collection/dataStores/{datastore_id}/siteSearchEngine:enableAdvancedSiteSearch\"\n",
" response = authed_session.post(es_endpoint, headers=header)\n",
" if response.status_code == 200:\n",
" print(f\"Datastore {datastore_id} upgraded to Advanced Website Search\")\n",
" else:\n",
" print(f\"Failed to upgrade Datastore {datastore_id}\")\n",
" print(response.text())\n",
" return response.status_code\n",
"\n",
"\n",
"upgrade_to_advanced(VAIS_BRANCH, PROJECT_ID, LOCATION, DATASTORE_ID)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "NlUq4ADT8975"
},
"source": [
"## Set the URLs to Include/Exclude in the Index\n",
"\n",
"You can set up to 500 Include and Exclude URL patterns for Advanced website search Datastores.\n",
"\n",
"This function sets a single URL pattern to be included every time it gets executed.\n",
"\n",
"The field `type` in the payload is used to indicate if the provided Uri pattern should be included or excluded. Here we only use `INCLUDE`.\n",
"\n",
"The `INCLUDE` and `EXCLUDE` URL patters specified with this function are incremental. You also have options to [Delete](https://cloud.google.com/generative-ai-app-builder/docs/reference/rest/v1alpha/projects.locations.collections.dataStores.siteSearchEngine.targetSites/delete), [List](https://cloud.google.com/generative-ai-app-builder/docs/reference/rest/v1alpha/projects.locations.collections.dataStores.siteSearchEngine.targetSites/list), [Batch Create](https://cloud.google.com/generative-ai-app-builder/docs/reference/rest/v1alpha/projects.locations.collections.dataStores.siteSearchEngine.targetSites/batchCreate), etc \n",
"\n",
"For this example, we index http://cloud.google.com/generative-ai-app-builder/*\n",
"\n",
"Note that you need to [verify the ownership of a domain](https://cloud.google.com/generative-ai-app-builder/docs/domain-verification) to be able to index it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "yc2OWvFd9Tvu"
},
"outputs": [],
"source": [
"def include_url_patterns(\n",
" vais_branch: str,\n",
" project_id: str,\n",
" location: str,\n",
" datastore_id: str,\n",
" include_url_patterns,\n",
") -> int:\n",
" \"\"\"Set include and exclude URL patterns for the Datastore\"\"\"\n",
" payload = {\n",
" \"providedUriPattern\": include_url_patterns,\n",
" \"type\": \"INCLUDE\",\n",
" }\n",
" header = {\"X-Goog-User-Project\": project_id, \"Content-Type\": \"application/json\"}\n",
" es_endpoint = f\"https://discoveryengine.googleapis.com/{vais_branch}/projects/{project_id}/locations/{location}/dataStores/{datastore_id}/siteSearchEngine/targetSites\"\n",
" response = authed_session.post(\n",
" es_endpoint, data=json.dumps(payload), headers=header\n",
" )\n",
" if response.status_code == 200:\n",
" print(\"URL patterns successfully set\")\n",
" print(\n",
" \"Depending on the size of your domain, the initial indexing may take from minutes to hours\"\n",
" )\n",
" else:\n",
" print(f\"Failed to set URL patterns for the Datastore {datastore_id}\")\n",
" print(response.text())\n",
" return response.status_code\n",
"\n",
"\n",
"include_url_patterns(\n",
" VAIS_BRANCH, PROJECT_ID, LOCATION, DATASTORE_ID, INCLUDE_URL_PATTERN\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "rSAbsrg8Pkc2"
},
"source": [
"# Step 2. Schema and URL mapping for Custom Attributes"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Dc-gaZ6rP6mC"
},
"source": [
"## Set the Schema and URL mapping\n",
"\n",
"In this example we use [VAIS REST API documentation](https://cloud.google.com/generative-ai-app-builder/docs/reference/rest) as the source for the datastore. For the mapping we add \"REST\" tags to all branches of REST documentation. We also add an additional tag to identify each branch (i.e. V1, V1alpha, V1beta). The schema and URL mapping should follow [this](https://cloud.google.com/generative-ai-app-builder/docs/reference/rest/v1alpha/projects.locations.collections.dataStores.siteSearchEngine/setUriPatternDocumentData#request-body) formatting.\n",
"\n",
"Separately, we identify pages under Samples with a corresponding tag.\n",
"\n",
"As mentioned above, you can only index a website you own, as a result your mapping will be different from the ones used in this example.\n",
"\n",
"Note that each successful mapping request overrides the previous ones (i.e. mappings are not incremental)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "TDpAyVAUbxXM"
},
"outputs": [],
"source": [
"header = {\"X-Goog-User-Project\": PROJECT_ID}\n",
"es_endpoint = f\"https://discoveryengine.googleapis.com/{VAIS_BRANCH}/projects/{PROJECT_ID}/locations/{LOCATION}/collections/default_collection/dataStores/{DATASTORE_ID}/siteSearchEngine:setUriPatternDocumentData\"\n",
"json_data = {\n",
" \"documentDataMap\": {\n",
" \"https://cloud.google.com/generative-ai-app-builder/docs/reference/rest/v1/*\": {\n",
" \"Topic\": [\"Rest\", \"V1\"]\n",
" },\n",
" \"https://cloud.google.com/generative-ai-app-builder/docs/reference/rest/v1alpha/*\": {\n",
" \"Topic\": [\"Rest\", \"V1alpha\"]\n",
" },\n",
" \"https://cloud.google.com/generative-ai-app-builder/docs/reference/rest/v1beta/*\": {\n",
" \"Topic\": [\"Rest\", \"V1beta\"]\n",
" },\n",
" \"https://cloud.google.com/generative-ai-app-builder/docs/samples*\": {\n",
" \"Topic\": [\"Samples\"]\n",
" },\n",
" },\n",
" \"schema\": {\n",
" \"$schema\": \"https://json-schema.org/draft/2020-12/schema\",\n",
" \"properties\": {\n",
" \"Topic\": {\n",
" \"items\": {\n",
" \"indexable\": True,\n",
" \"retrievable\": True,\n",
" \"searchable\": True,\n",
" \"type\": \"string\",\n",
" },\n",
" \"type\": \"array\",\n",
" }\n",
" },\n",
" \"type\": \"object\",\n",
" },\n",
"}\n",
"\n",
"set_schema_response = authed_session.post(es_endpoint, headers=header, json=json_data)\n",
"\n",
"print(json.dumps(set_schema_response.json(), indent=1))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "xRCbxXEG2XmF"
},
"source": [
"## Get the Schema and URL mapping\n",
"\n",
"Get the Schema and URL mapping to ensure it is updated according to your expectations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "7TXWY_6nTH3u"
},
"outputs": [],
"source": [
"header = {\"X-Goog-User-Project\": PROJECT_ID}\n",
"es_endpoint = f\"https://discoveryengine.googleapis.com/{VAIS_BRANCH}/projects/{PROJECT_ID}/locations/{LOCATION}/collections/default_collection/dataStores/{DATASTORE_ID}/siteSearchEngine:getUriPatternDocumentData\"\n",
"get_schema_response = authed_session.get(es_endpoint, headers=header)\n",
"\n",
"print(json.dumps(get_schema_response.json(), indent=1))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "vESfZZ_QDLc8"
},
"source": [
"# Step 3. Run queries w/wo Metadata filter"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "dce2yLIoZz0f"
},
"source": [
"## Search Parameters\n",
"`QUERY`: Used to query VAIS.\n",
"\n",
"`PAGE_SIZE`: The maximum number of results retrieved from VAIS.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ZObBBqVdZZUV"
},
"outputs": [],
"source": [
"QUERY = \"\" # @param {type: 'string'}\n",
"PAGE_SIZE = 5 # @param {type: 'integer'}"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nnnxTO8CC9Mz"
},
"source": [
"## Search Without Filter\n",
"Given that the `Topic` custom attribute is made `retrievable` in the Schema, You will get it back in the response, when applicable.\n",
"\n",
"Custom attributes are included in the `structData` field of the `result`)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Usr8OMTu5EUk"
},
"outputs": [],
"source": [
"search_response = authed_session.post(\n",
" f\"https://discoveryengine.googleapis.com/{VAIS_BRANCH}/projects/{PROJECT_ID}/locations/{LOCATION}/collections/default_collection/dataStores/{DATASTORE_ID}/servingConfigs/default_search:search\",\n",
" headers={\"Content-Type\": \"application/json\"},\n",
" json={\"query\": QUERY, \"pageSize\": PAGE_SIZE},\n",
")\n",
"\n",
"print(json.dumps(search_response.json(), indent=1))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "uCRfkzGyC53w"
},
"source": [
"## Search with Filter\n",
"Now we apply a filter so that a search only returns results from the V1alpha branch of the REST documentation. The filter and expected results will be different based on the domain included in your website datastore. \n",
"\n",
"We could also use this indexable field for other purposes such as Boosting, if desired."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "qatUukazC4oH"
},
"outputs": [],
"source": [
"search_response = authed_session.post(\n",
" f\"https://discoveryengine.googleapis.com/{VAIS_BRANCH}/projects/{PROJECT_ID}/locations/{LOCATION}/collections/default_collection/dataStores/{DATASTORE_ID}/servingConfigs/default_search:search\",\n",
" headers={\"Content-Type\": \"application/json\"},\n",
" json={\"query\": QUERY, \"filter\": 'Topic: ANY(\"V1alpha\")', \"pageSize\": PAGE_SIZE},\n",
")\n",
"\n",
"print(json.dumps(search_response.json(), indent=1))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "e1kgs_XdDlHL"
},
"source": [
"# Clean up"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "tGuk4ZnJk0S7"
},
"source": [
"## Delete the Search App\n",
"\n",
"Delete the App if you no longer need it\n",
"\n",
"Alternatively you can follow [these instructions](https://console.cloud.google.com/gen-app-builder/data-stores) to delete an App from the UI\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "QEfxXtzfk0rx"
},
"outputs": [],
"source": [
"response = authed_session.delete(\n",
" f\"https://discoveryengine.googleapis.com/{VAIS_BRANCH}/projects/{PROJECT_ID}/locations/{LOCATION}/collections/default_collection/engines/{APP_ID}\",\n",
" headers={\"X-Goog-User-Project\": PROJECT_ID},\n",
")\n",
"\n",
"print(response.text)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Tgm5idL4DjjU"
},
"source": [
"## Delete the Datastores\n",
"Delete the Datastore if you no longer need it\n",
"\n",
"Alternatively you can follow [these instructions](https://console.cloud.google.com/gen-app-builder/data-stores) to delete a Datastore from the UI"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "vj8BpuS62tgt"
},
"outputs": [],
"source": [
"response = authed_session.delete(\n",
" f\"https://discoveryengine.googleapis.com/{VAIS_BRANCH}/projects/{PROJECT_ID}/locations/{LOCATION}/collections/default_collection/dataStores/{DATASTORE_ID}\",\n",
" headers={\"X-Goog-User-Project\": PROJECT_ID},\n",
")\n",
"\n",
"print(response.text)"
]
}
],
"metadata": {
"colab": {
"name": "custom_attributes_by_url_pattern.ipynb",
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}