{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "5XNYlDkDLpqU"
},
"outputs": [],
"source": [
"# Copyright 2025 Google LLC\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5tR528hOD4Dx"
},
"source": [
"# Event-based Triggering of Manual Recrawl for Vertex AI Search Advanced Website Datastores\n",
"\n",
"<table align=\"left\">\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://art-analytics.appspot.com/r.html?uaid=G-FHXEFWTT4E&utm_source=aRT-vais-building-blocks&utm_medium=aRT-clicks&utm_campaign=vais-building-blocks&destination=vais-building-blocks&url=https%3A%2F%2Fcolab.research.google.com%2Fgithub%2FGoogleCloudPlatform%2Fapplied-ai-engineering-samples%2Fblob%2Fmain%2Fsearch%2Fvais-building-blocks%2Fmanual_recrawl_urls_with_trigger.ipynb\">\n",
" <img width=\"32px\" src=\"https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg\" alt=\"Google Colaboratory logo\"><br> Open in Colab\n",
" </a>\n",
" </td>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://art-analytics.appspot.com/r.html?uaid=G-FHXEFWTT4E&utm_source=aRT-vais-building-blocks&utm_medium=aRT-clicks&utm_campaign=vais-building-blocks&destination=vais-building-blocks&url=https%3A%2F%2Fconsole.cloud.google.com%2Fvertex-ai%2Fcolab%2Fimport%2Fhttps%3A%252F%252Fraw.githubusercontent.com%252FGoogleCloudPlatform%252Fapplied-ai-engineering-samples%252Fmain%252Fsearch%252Fvais-building-blocks%252Fmanual_recrawl_urls_with_trigger.ipynb\">\n",
" <img width=\"32px\" src=\"https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN\" alt=\"Google Cloud Colab Enterprise logo\"><br> Open in Colab Enterprise\n",
" </a>\n",
" </td>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://art-analytics.appspot.com/r.html?uaid=G-FHXEFWTT4E&utm_source=aRT-vais-building-blocks&utm_medium=aRT-clicks&utm_campaign=vais-building-blocks&destination=vais-building-blocks&url=https%3A%2F%2Fconsole.cloud.google.com%2Fvertex-ai%2Fworkbench%2Fdeploy-notebook%3Fdownload_url%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fapplied-ai-engineering-samples%2Fmain%2Fsearch%2Fvais-building-blocks%2Fmanual_recrawl_urls_with_trigger.ipynb\">\n",
" <img src=\"https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg\" alt=\"Vertex AI logo\"><br> Open in Vertex AI Workbench\n",
" </a>\n",
" </td>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://github.com/GoogleCloudPlatform/applied-ai-engineering-samples/blob/main/search/vais-building-blocks/manual_recrawl_urls_with_trigger.ipynb\">\n",
" <img width=\"32px\" src=\"https://upload.wikimedia.org/wikipedia/commons/9/91/Octicons-mark-github.svg\" alt=\"GitHub logo\"><br> View on GitHub\n",
" </a>\n",
" </td>\n",
"</table>\n",
"\n",
"<div style=\"clear: both;\"></div>\n",
"\n",
"<b>Share to:</b>\n",
"\n",
"<a href=\"https://www.linkedin.com/sharing/share-offsite/?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/search/vais-building-blocks/manual_recrawl_urls_with_trigger.ipynb\" target=\"_blank\">\n",
" <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/8/81/LinkedIn_icon.svg\" alt=\"LinkedIn logo\">\n",
"</a>\n",
"\n",
"<a href=\"https://bsky.app/intent/compose?text=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/search/vais-building-blocks/manual_recrawl_urls_with_trigger.ipynb\" target=\"_blank\">\n",
" <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/7/7a/Bluesky_Logo.svg\" alt=\"Bluesky logo\">\n",
"</a>\n",
"\n",
"<a href=\"https://twitter.com/intent/tweet?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/search/vais-building-blocks/manual_recrawl_urls_with_trigger.ipynb\" target=\"_blank\">\n",
" <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/53/X_logo_2023_original.svg\" alt=\"X logo\">\n",
"</a>\n",
"\n",
"<a href=\"https://reddit.com/submit?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/search/vais-building-blocks/manual_recrawl_urls_with_trigger.ipynb\" target=\"_blank\">\n",
" <img width=\"20px\" src=\"https://redditinc.com/hubfs/Reddit%20Inc/Brand/Reddit_Logo.png\" alt=\"Reddit logo\">\n",
"</a>\n",
"\n",
"<a href=\"https://www.facebook.com/sharer/sharer.php?u=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/search/vais-building-blocks/manual_recrawl_urls_with_trigger.ipynb\" target=\"_blank\">\n",
" <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/51/Facebook_f_logo_%282019%29.svg\" alt=\"Facebook logo\">\n",
"</a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "pkd93iDpEBWx"
},
"source": [
"| | |\n",
"|----------|-------------|\n",
"| Author(s) | Hossein Mansour|\n",
"| Reviewer(s) | Abhishek Bhagwat|\n",
"| Last updated | 2025-01-03: Initial commit |"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "yAnTektvEQjb"
},
"source": [
"# Overview\n",
"\n",
"In this notebook, we demonstrate how to automate a [manual recrawl](https://cloud.google.com/generative-ai-app-builder/docs/recrawl-websites#manual_refresh) to keep an [Advanced Website Search](https://cloud.google.com/generative-ai-app-builder/docs/about-advanced-features#advanced-website-indexing) datastore within [Vertex AI Search](https://cloud.google.com/generative-ai-app-builder/docs/introduction) up to date.\n",
"\n",
"We focus on a particular workflow in which the list of URLs to be crawled lands as a JSON file in a GCS bucket. This event triggers a Cloud Run function that parses the URLs and passes them to the manual recrawl API. Similar automations can be built for other sources, for instance when the list of new URLs is appended to a BigQuery table.\n",
"\n",
"The pattern demonstrated here generalizes to other workflows as well, for example triggering a document import when a new PDF file lands in a GCS bucket.\n",
"\n",
"You can find more information about the advanced website search datastore and how it differs from the basic website search datastore [here](https://cloud.google.com/generative-ai-app-builder/docs/about-advanced-features#advanced-website-indexing).\n",
"\n",
"The web pages in an advanced website search datastore are refreshed in the following ways:\n",
"\n",
"* **Automatic refresh**: Discovers added, deleted, and updated pages and reindexes them on a best-effort basis. The expected indexing latency is on the order of two weeks at the time of preparing this notebook.\n",
"* [**Manual refresh**](https://cloud.google.com/generative-ai-app-builder/docs/recrawl-websites): Customers can initiate a manual recrawl of an explicit list of URLs (not URL patterns) within certain [limits](https://cloud.google.com/generative-ai-app-builder/docs/recrawl-websites#limits_on_recrawling). While there is no SLO around manual recrawls, they typically complete within minutes to hours depending on the number of URLs.\n",
"* [**Sitemap-based refresh**](https://cloud.google.com/generative-ai-app-builder/docs/index-refresh-sitemap): Customers can submit sitemaps to index and refresh the web pages in their data store. This feature supports only XML sitemaps and sitemap indexes. The indexing latency with this approach is on the order of hours.\n",
"\n",
"The automation demonstrated in this notebook applies to the manual refresh method described above.\n",
"\n",
"As opposed to other notebooks in this repository, this notebook is not self-contained and requires an existing datastore. We provide a step-by-step guide on how to add the trigger and the subsequent automation from within the UI, as well as the Cloud Run function source to automate part of the process.\n",
"\n",
"We will perform the following steps:\n",
"\n",
"- Creating a GCS bucket for staging\n",
"- Creating a Cloud Run function and trigger via the UI\n",
"- Deploying the function\n",
"- Testing with a sample JSON\n",
"\n",
"\n",
"# Vertex AI Search\n",
"Vertex AI Search (VAIS) is a fully-managed platform, powered by large language models, that lets you build AI-enabled search and recommendation experiences for your public or private websites or mobile applications.\n",
"\n",
"VAIS can handle a diverse set of data sources including structured, unstructured, and website data, as well as data from third-party applications such as Jira, Salesforce, and Confluence.\n",
"\n",
"VAIS also has built-in integration with LLMs, which enables you to provide answers to complex questions, grounded in your data.\n",
"\n",
"# Using this Notebook\n",
"This notebook cannot be run as-is; it is a step-by-step guide on how to achieve the goal via the UI.\n",
"\n",
"## Google Cloud Project Setup\n",
"\n",
"1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs\n",
"2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project)\n",
"3. [Enable the Service Usage API](https://console.cloud.google.com/apis/library/serviceusage.googleapis.com)\n",
"4. [Enable the Cloud Storage API](https://console.cloud.google.com/flows/enableapi?apiid=storage.googleapis.com)\n",
"5. [Enable the Discovery Engine API for your project](https://console.cloud.google.com/marketplace/product/google/discoveryengine.googleapis.com)\n",
"\n",
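"The APIs above can also be enabled from the command line. This is a sketch, assuming a shell authenticated against your project:\n",
"\n",
"```bash\n",
"gcloud services enable serviceusage.googleapis.com \\\n",
"  storage.googleapis.com \\\n",
"  discoveryengine.googleapis.com\n",
"```\n",
"\n",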
"## Google Cloud Permissions\n",
"\n",
"Ideally you should have [Owner role](https://cloud.google.com/iam/docs/understanding-roles) for your project to run this notebook. If that is not an option, you need at least the following [roles](https://cloud.google.com/iam/docs/granting-changing-revoking-access)\n",
"- **`roles/serviceusage.serviceUsageAdmin`** to enable APIs\n",
"- **`roles/iam.serviceAccountAdmin`** to modify service agent permissions\n",
"- **`roles/discoveryengine.admin`** to modify discoveryengine assets"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "49x_J4vWOuNg"
},
"source": [
"# Create GCS bucket\n",
"\n",
"As a prerequisite, we need a GCS bucket to use as a staging area; the JSON files containing the URLs to be crawled will land there.\n",
"\n",
"We call this bucket `recrawl_test` and use `us-central1` for the location.\n",
"\n",
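"If you prefer the command line over the Cloud console, the bucket can be created with `gcloud`. This sketch assumes the bucket name and location above:\n",
"\n",
"```bash\n",
"gcloud storage buckets create gs://recrawl_test --location=us-central1\n",
"```\n",
"\n",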
""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TZ4lJ9bDRzDi"
},
"source": [
"# Create the Cloud Run function\n",
"\n",
"As the next step, we create a Cloud Run function from within the bucket's page in the Cloud console.\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "vVpjKl0SSxCR"
},
"source": [
"# Basic settings of the Cloud Run function\n",
"\n",
"Here we apply the basic settings to our Cloud Run function. Specifically, we use the same region as our bucket (`us-central1` in this example), set the event type to `google.cloud.storage.object.v1.finalized`, and give our function a meaningful name.\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "cWUrF50JTE4b"
},
"source": [
"# Add the function source\n",
"\n",
"In this step we add the function source. Python 3.12 is used for this example. The entry point is the function in `main.py` (`recrawl_uris` in the code below) that runs when the associated trigger fires.\n",
"\n",
"While the trigger could be wired up more manually, we use the [Python Functions Framework](https://github.com/GoogleCloudPlatform/functions-framework-python), which is the recommended approach at the time of preparing this notebook.\n",
"\n",
"You can see the full source, including authentication, in the code block below.\n",
"\n",
"Note that you need to update the source with your own `PROJECT_ID` and `DATA_STORE_ID`. Using environment variables for these values is also recommended.\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "toPqo9SDWNBl"
},
"outputs": [],
"source": [
"import json\n",
"import time\n",
"\n",
"import functions_framework\n",
"from google.auth import default\n",
"from google.auth.transport.requests import Request as GoogleAuthRequest\n",
"from google.cloud import storage\n",
"import requests\n",
"\n",
"# Replace with your actual values\n",
"PROJECT_ID = \"your_project_id\"\n",
"DATA_STORE_ID = \"your_datastore_id\"\n",
"\n",
"\n",
"@functions_framework.cloud_event\n",
"def recrawl_uris(cloud_event):\n",
" \"\"\"\n",
" Cloud Function triggered by Cloud Storage events (file creation).\n",
"\n",
" Args:\n",
" cloud_event (functions_framework.cloud_event.CloudEvent): The CloudEvent that triggered this function.\n",
" \"\"\"\n",
"\n",
" data = cloud_event.data\n",
" bucket_name = data[\"bucket\"]\n",
" file_name = data[\"name\"]\n",
" print(f\"File {file_name} created in bucket {bucket_name}.\")\n",
"\n",
" # Process only for finalized objects. An object is in finalized state if it is written to the bucket or an existing object is overwritten\n",
"\n",
" if (\n",
" file_name.endswith(\".json\")\n",
" and cloud_event[\"type\"] == \"google.cloud.storage.object.v1.finalized\"\n",
" ):\n",
" try:\n",
" # Read the URIs from the JSON file\n",
" uris = read_uris_from_gcs(bucket_name, file_name)\n",
"\n",
" if not uris:\n",
" print(\"No URIs found in the JSON file.\")\n",
" return\n",
"\n",
" print(\"URIs to recrawl:\", uris)\n",
"\n",
" # Recrawl the URIs using the Discovery Engine API\n",
" recrawl_uris_with_api(uris)\n",
"\n",
" except Exception as e:\n",
" print(f\"Error processing file: {e}\")\n",
"\n",
"\n",
"def read_uris_from_gcs(bucket_name, file_name):\n",
" \"\"\"\n",
" Reads the URIs from a JSON file in a GCS bucket.\n",
"\n",
" Args:\n",
" bucket_name (str): Name of the GCS bucket.\n",
" file_name (str): Name of the JSON file.\n",
"\n",
" Returns:\n",
" list: List of URIs, or None if an error occurs.\n",
" \"\"\"\n",
"\n",
" storage_client = storage.Client()\n",
" bucket = storage_client.bucket(bucket_name)\n",
" blob = bucket.blob(file_name)\n",
"\n",
" try:\n",
" file_content = blob.download_as_text()\n",
" data = json.loads(file_content)\n",
" return data.get(\"uris\", [])\n",
" except Exception as e:\n",
" print(f\"Error reading URIs from {bucket_name}/{file_name}: {e}\")\n",
" return None\n",
"\n",
"\n",
"def recrawl_uris_with_api(uris):\n",
" \"\"\"\n",
" Re-crawls the specified URIs using the Discovery Engine API.\n",
"\n",
" Args:\n",
" uris (list): List of URIs to recrawl.\n",
" \"\"\"\n",
"\n",
" # fetch bearer token to make REST API call\n",
" creds, _ = default()\n",
" auth_req = GoogleAuthRequest()\n",
" creds.refresh(auth_req)\n",
" access_token = creds.token\n",
"\n",
" # recrawl API endpoint to be invoked\n",
" url = f\"https://discoveryengine.googleapis.com/v1alpha/projects/{PROJECT_ID}/locations/global/collections/default_collection/dataStores/{DATA_STORE_ID}/siteSearchEngine:recrawlUris\"\n",
"\n",
" headers = {\n",
" \"Authorization\": f\"Bearer {access_token}\",\n",
" \"Content-Type\": \"application/json\",\n",
" \"X-Goog-User-Project\": PROJECT_ID,\n",
" }\n",
" data = {\"uris\": uris}\n",
"\n",
" for attempt in range(3): # Retry up to 3 times\n",
" try:\n",
" response = requests.post(\n",
" url, headers=headers, json=data, timeout=10\n",
" ) # Added timeout\n",
" response.raise_for_status() # Raise an exception for bad status codes\n",
" print(f\"Recrawl request successful. Response: {response.json()}\")\n",
" return\n",
"        except requests.exceptions.RequestException as e:\n",
"            print(f\"Error during recrawl request (attempt {attempt + 1}): {e}\")\n",
"            # Retry only for transient connection and timeout errors;\n",
"            # give up immediately on other errors (e.g. bad status codes)\n",
"            if attempt < 2 and isinstance(\n",
"                e, (requests.exceptions.ConnectionError, requests.exceptions.Timeout)\n",
"            ):\n",
"                print(\"Retrying in 5 seconds...\")\n",
"                time.sleep(5)\n",
"            else:\n",
"                break\n",
"\n",
"    print(\"Failed to recrawl URIs after multiple attempts.\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "VOefNBmqXNvI"
},
"source": [
"# Include dependencies to run the function\n",
"\n",
"Finally, we need to add the dependencies to the `requirements.txt` file so the function above can run.\n",
"\n",
"You can copy the requirements from the following code block.\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "bqu-z4AGXjJY"
},
"outputs": [],
"source": [
"functions-framework==3.*\n",
"google-cloud-storage\n",
"google-auth\n",
"requests"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "A4Sf7AHpYBeP"
},
"source": [
"# Deploy the function\n",
"\n",
"Once the source and requirements are added, the function can be deployed with a single click.\n",
"\n",
"You need to ensure that the service accounts used by your function and its trigger have the necessary permissions:\n",
"* Storage Object Viewer on your trigger bucket to read the uploaded files.\n",
"* Discovery Engine Admin to call the Discovery Engine recrawl API.\n",
"* Eventarc Event Receiver to receive events from the Cloud Storage trigger.\n",
"* Cloud Run Invoker to let the trigger invoke the function's underlying service.\n",
"* Service Account User to make secure requests.\n",
"* Logs Writer to write logs to Cloud Logging.\n",
"\n",
"You will typically get notified to provide necessary permissions as you try to deploy the function.\n",
"\n",
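"Alternatively, the function can be deployed from the command line. The following is a sketch, assuming the function name, region, runtime, entry point, and bucket used in this guide, with `main.py` and `requirements.txt` in the current directory:\n",
"\n",
"```bash\n",
"gcloud functions deploy recrawl_function \\\n",
"  --gen2 \\\n",
"  --region=us-central1 \\\n",
"  --runtime=python312 \\\n",
"  --source=. \\\n",
"  --entry-point=recrawl_uris \\\n",
"  --trigger-event-filters=\"type=google.cloud.storage.object.v1.finalized\" \\\n",
"  --trigger-event-filters=\"bucket=recrawl_test\"\n",
"```\n",
"\n",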
"Once the function is deployed, you can find it (called `recrawl_function` in this example) in the `Cloud Run Functions` section of the cloud console. You will have the ability to edit the function, adjust the trigger, and read logs among other things."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "n7c74l2ZZTr4"
},
"source": [
"# Test the function\n",
"\n",
"Finally, you can test the function you just deployed by adding a new JSON file to your target bucket. Once the file is uploaded to the bucket, you can check the status of the function in the logs.\n",
"\n",
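"For example, assuming a local file named `urls.json` and the `recrawl_test` bucket created earlier, the upload can be done with:\n",
"\n",
"```bash\n",
"gcloud storage cp urls.json gs://recrawl_test/\n",
"```\n",
"\n",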
"A JSON file compatible with the source provided in this example must follow this format:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "QxOdMGF2aOve"
},
"outputs": [],
"source": [
"{\n",
"    \"uris\": [\n",
"        \"https://example.com/page-1\",\n",
"        \"https://example.com/page-2\",\n",
"        \"https://example.com/page-3\"\n",
"    ]\n",
"}"
]
}
],
"metadata": {
"colab": {
"name": "manual_recrawl_urls_with_trigger.ipynb",
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}