{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "KsbFABffnCMA"
},
"outputs": [],
"source": [
"# Copyright 2024 Google LLC\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "q3pW20ECIDpX"
},
"source": [
"# Inline Ingestion of Documents into Vertex AI Search\n",
"<table align=\"left\">\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://art-analytics.appspot.com/r.html?uaid=G-FHXEFWTT4E&utm_source=aRT-vais-building-blocks&utm_medium=aRT-clicks&utm_campaign=vais-building-blocks&destination=vais-building-blocks&url=https%3A%2F%2Fcolab.research.google.com%2Fgithub%2FGoogleCloudPlatform%2Fapplied-ai-engineering-samples%2Fblob%2Fmain%2Fsearch%2Fvais-building-blocks%2Finline_ingestion_of_documents.ipynb\">\n",
" <img width=\"32px\" src=\"https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg\" alt=\"Google Colaboratory logo\"><br> Open in Colab\n",
" </a>\n",
" </td>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://art-analytics.appspot.com/r.html?uaid=G-FHXEFWTT4E&utm_source=aRT-vais-building-blocks&utm_medium=aRT-clicks&utm_campaign=vais-building-blocks&destination=vais-building-blocks&url=https%3A%2F%2Fconsole.cloud.google.com%2Fvertex-ai%2Fcolab%2Fimport%2Fhttps%3A%252F%252Fraw.githubusercontent.com%252FGoogleCloudPlatform%252Fapplied-ai-engineering-samples%252Fmain%252Fsearch%252Fvais-building-blocks%252Finline_ingestion_of_documents.ipynb\">\n",
" <img width=\"32px\" src=\"https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN\" alt=\"Google Cloud Colab Enterprise logo\"><br> Open in Colab Enterprise\n",
" </a>\n",
" </td>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://art-analytics.appspot.com/r.html?uaid=G-FHXEFWTT4E&utm_source=aRT-vais-building-blocks&utm_medium=aRT-clicks&utm_campaign=vais-building-blocks&destination=vais-building-blocks&url=https%3A%2F%2Fconsole.cloud.google.com%2Fvertex-ai%2Fworkbench%2Fdeploy-notebook%3Fdownload_url%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fapplied-ai-engineering-samples%2Fmain%2Fsearch%2Fvais-building-blocks%2Finline_ingestion_of_documents.ipynb\">\n",
" <img src=\"https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg\" alt=\"Vertex AI logo\"><br> Open in Vertex AI Workbench\n",
" </a>\n",
" </td>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://github.com/GoogleCloudPlatform/applied-ai-engineering-samples/blob/main/search/vais-building-blocks/inline_ingestion_of_documents.ipynb\">\n",
" <img width=\"32px\" src=\"https://upload.wikimedia.org/wikipedia/commons/9/91/Octicons-mark-github.svg\" alt=\"GitHub logo\"><br> View on GitHub\n",
" </a>\n",
" </td>\n",
"</table>\n",
"\n",
"<div style=\"clear: both;\"></div>\n",
"\n",
"<b>Share to:</b>\n",
"\n",
"<a href=\"https://www.linkedin.com/sharing/share-offsite/?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/search/vais-building-blocks/inline_ingestion_of_documents.ipynb\" target=\"_blank\">\n",
" <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/8/81/LinkedIn_icon.svg\" alt=\"LinkedIn logo\">\n",
"</a>\n",
"\n",
"<a href=\"https://bsky.app/intent/compose?text=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/search/vais-building-blocks/inline_ingestion_of_documents.ipynb\" target=\"_blank\">\n",
" <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/7/7a/Bluesky_Logo.svg\" alt=\"Bluesky logo\">\n",
"</a>\n",
"\n",
"<a href=\"https://twitter.com/intent/tweet?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/search/vais-building-blocks/inline_ingestion_of_documents.ipynb\" target=\"_blank\">\n",
" <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/53/X_logo_2023_original.svg\" alt=\"X logo\">\n",
"</a>\n",
"\n",
"<a href=\"https://reddit.com/submit?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/search/vais-building-blocks/inline_ingestion_of_documents.ipynb\" target=\"_blank\">\n",
" <img width=\"20px\" src=\"https://redditinc.com/hubfs/Reddit%20Inc/Brand/Reddit_Logo.png\" alt=\"Reddit logo\">\n",
"</a>\n",
"\n",
"<a href=\"https://www.facebook.com/sharer/sharer.php?u=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/search/vais-building-blocks/inline_ingestion_of_documents.ipynb\" target=\"_blank\">\n",
" <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/51/Facebook_f_logo_%282019%29.svg\" alt=\"Facebook logo\">\n",
"</a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "YQpT9IS3K-Fn"
},
"source": [
"| | |\n",
"|----------|-------------|\n",
"| Author(s) | Jaival Desai, Hossein Mansour|\n",
"| Reviewers(s) | Lei Chen, Abhishek Bhagwat|\n",
"| Last updated | 2024-09-11: The first draft |"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zNu-9XmEDF52"
},
"source": [
"# Overview\n",
"\n",
"In this notebook, we will demonstrate how to make an inline ingestion of documents into [Vertex AI Search](https://cloud.google.com/generative-ai-app-builder/docs/introduction) (VAIS) datastores.\n",
"\n",
"VAIS supports a variety of sources and data types. For [structured documents or unstructured documents, with or without metadata](https://cloud.google.com/generative-ai-app-builder/docs/prepare-data), it is advised to initially stage them on a GCS bucket or a BQ table and perform a subsequent [import](https://cloud.google.com/generative-ai-app-builder/docs/reference/rest/v1/projects.locations.collections.dataStores.branches.documents/import) by referring to those documents by their URI. This approach creates a source-of-truth which can be investigated in details and allows for the possibility of `Incremental` import or `Full` import depending on the choice of [ReconciliationMode](https://cloud.google.com/generative-ai-app-builder/docs/reference/rest/v1/ReconciliationMode). The `Full` option is particularly useful to resolve possible conflicts and duplicates.\n",
"\n",
"However in some cases customers may prefer an inline ingestion of documents for its simplicity or to help them stay compliant with some restrictions defined on Org level. Note that inline ingestion comes with some limitations including more strict [limits](https://cloud.google.com/generative-ai-app-builder/docs/reference/rest/v1/projects.locations.collections.dataStores.branches.documents#content) on the file size, and lower visibility on the UI given the fact that the content needs to be encoded into rawBytes.\n",
"\n",
"We will perform the following steps:\n",
"\n",
"- Create a VAIS Datastore\n",
"- Prepare sample documents\n",
"- Import sample documents (and other operations)\n",
"- Query the datastore\n",
"- Cleanup\n",
"\n",
"REST API is used throughout this notebook. Please consult the [official documentation](https://cloud.google.com/generative-ai-app-builder/docs/apis) for alternative ways to achieve the same goal, namely Client libraries and RPC.\n",
"\n",
"\n",
"# Vertex AI Search\n",
"Vertex AI Search (VAIS) is a fully-managed platform, powered by large language models, that lets you build AI-enabled search and recommendation experiences for your public or private websites or mobile applications\n",
"\n",
"VAIS can handle a diverse set of data sources including structured, unstructured, and website data, as well as data from third-party applications such as Jira, Salesforce, and Confluence.\n",
"\n",
"VAIS also has built-in integration with LLMs which enables you to provide answers to complex questions, grounded in your data\n",
"\n",
"#Using this Notebook\n",
"If you're running outside of Colab, depending on your environment you may need to install pip packages that are included in the Colab environment by default but are not part of the Python Standard Library. Outside of Colab you'll also notice comments in code cells that look like #@something, these trigger special Colab functionality but don't change the behavior of the notebook.\n",
"\n",
"This tutorial uses the following Google Cloud services and resources:\n",
"\n",
"- Service Usage API\n",
"- Discovery Engine\n",
"- Google Cloud Storage Client\n",
"\n",
"This notebook has been tested in the following environment:\n",
"\n",
"- Python version = 3.10.12\n",
"- google.cloud.storage = 2.8.0\n",
"- google.auth = 2.27.0\n",
"\n",
"# Getting Started\n",
"\n",
"The following steps are necessary to run this notebook, no matter what notebook environment you're using.\n",
"\n",
"If you're entirely new to Google Cloud, [get started here](https://cloud.google.com/docs/get-started)\n",
"\n",
"## Google Cloud Project Setup\n",
"\n",
"1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs\n",
"2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project)\n",
"3. [Enable the Service Usage API](https://console.cloud.google.com/apis/library/serviceusage.googleapis.com)\n",
"4. [Enable the Cloud Storage API](https://console.cloud.google.com/flows/enableapi?apiid=storage.googleapis.com)\n",
"5. [Enable the Discovery Engine API for your project](https://console.cloud.google.com/marketplace/product/google/discoveryengine.googleapis.com)\n",
"\n",
"## Google Cloud Permissions\n",
"\n",
"Ideally you should have [Owner role](https://cloud.google.com/iam/docs/understanding-roles) for your project to run this notebook. If that is not an option, you need at least the following [roles](https://cloud.google.com/iam/docs/granting-changing-revoking-access)\n",
"- **`roles/serviceusage.serviceUsageAdmin`** to enable APIs\n",
"- **`roles/iam.serviceAccountAdmin`** to modify service agent permissions\n",
"- **`roles/discoveryengine.admin`** to modify discoveryengine assets\n",
"- **`roles/storage.objectAdmin`** to modify and delete GCS buckets"
]
},
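{
"cell_type": "markdown",
"metadata": {},
"source": [
"## [Optional] Install dependencies\n",
"\n",
"If you're running outside of Colab, the cell below is a minimal install sketch based on the imports used in this notebook; adjust the package list to your environment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Uncomment if these packages are missing from your environment.\n",
"# %pip install google-auth requests"
]
},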
{
"cell_type": "markdown",
"metadata": {
"id": "slhopo_NhUrA"
},
"source": [
"#Setup Environment"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "lJFp9LUmrSOf"
},
"source": [
"## Authentication\n",
"\n",
" If you're using Colab, run the code in the next cell. Follow the pop-ups and authenticate with an account that has access to your Google Cloud [project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#identifying_projects).\n",
"\n",
"If you're running this notebook somewhere besides Colab, make sure your environment has the right Google Cloud access. If that's a new concept to you, consider looking into [Application Default Credentials for your local environment](https://cloud.google.com/docs/authentication/provide-credentials-adc#local-dev) and [initializing the Google Cloud CLI](https://cloud.google.com/docs/authentication/gcloud). In many cases, running `gcloud auth application-default login` in a shell on the machine running the notebook kernel is sufficient.\n",
"\n",
"More authentication options are discussed [here](https://cloud.google.com/docs/authentication)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "x_miQy2C3DmT"
},
"outputs": [],
"source": [
"# Colab authentication.\n",
"import sys\n",
"\n",
"if \"google.colab\" in sys.modules:\n",
" from google.colab import auth\n",
"\n",
" auth.authenticate_user()\n",
" print(\"Authenticated\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Fa9DrhVx3HQ0"
},
"outputs": [],
"source": [
"from google.auth import default\n",
"from google.auth.transport.requests import AuthorizedSession\n",
"\n",
"creds, _ = default()\n",
"authed_session = AuthorizedSession(creds)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kdQPp72R11pd"
},
"source": [
"## Import Libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "a2fjJjzn3LdX"
},
"outputs": [],
"source": [
"import base64\n",
"import json\n",
"import os\n",
"import shutil\n",
"import time\n",
"\n",
"import requests"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "MGXEinm3q1ks"
},
"source": [
"## Configure environment\n",
"\n",
"You can enter the ID for an existing Vertex AI Search Datastore to be used in this notebook.\n",
"\n",
"You can find more information regarding the `location` of datastores and associated limitations [here](https://cloud.google.com/generative-ai-app-builder/docs/locations#specify_a_multi-region_for_your_data_store). `global` is preferred unless there is a certain data residency requirement you have to comply with.\n",
"\n",
"The location of a Datastore is set at the time of creation and it should be called appropriately to query the Datastore.\n",
"\n",
"`LOCAL_DIRECTORY_DOCS` is used to store the sample files locally."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "RVGPkT132xno"
},
"outputs": [],
"source": [
"PROJECT_ID = \"\" # @param {type:\"string\"}\n",
"\n",
"# Vertex AI Search Parameters\n",
"DATASTORE_ID = \"\" # @param {type:\"string\"}\n",
"LOCATION = \"global\" # @param [\"global\", \"us\", \"eu\"]\n",
"LOCAL_DIRECTORY_DOCS = \"./sample_docs\" # @param {type:\"string\"}"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "hVJt8gfLhmwX"
},
"source": [
"# STEP 1. Create VAIS Datastore\n",
"\n",
"You can skip this section if you already have a datastore set up."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "hEOMvaNomc4q"
},
"source": [
"## Helper functions to [create a Datastore](https://cloud.google.com/generative-ai-app-builder/docs/reference/rest/v1/projects.locations.collections.dataStores/create)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "lIBkqEZYNNMA"
},
"outputs": [],
"source": [
"def create_datastore(project_id: str, location: str, datastore_id: str) -> int:\n",
" \"\"\"Create a datastore with doc mode and the basic digital parser\"\"\"\n",
" payload = {\n",
" \"displayName\": datastore_id,\n",
" \"industryVertical\": \"GENERIC\",\n",
" \"solutionTypes\": [\"SOLUTION_TYPE_SEARCH\"],\n",
" \"contentConfig\": \"CONTENT_REQUIRED\",\n",
" \"documentProcessingConfig\": {\n",
" \"defaultParsingConfig\": {\"digitalParsingConfig\": {}}\n",
" },\n",
" }\n",
" header = {\"X-Goog-User-Project\": project_id, \"Content-Type\": \"application/json\"}\n",
" es_endpoint = f\"https://discoveryengine.googleapis.com/v1/projects/{project_id}/locations/{location}/collections/default_collection/dataStores?dataStoreId={datastore_id}\"\n",
" response = authed_session.post(\n",
" es_endpoint, data=json.dumps(payload), headers=header\n",
" )\n",
" if response.status_code == 200:\n",
" print(f\"The creation of Datastore {datastore_id} is initiated.\")\n",
" print(\"It may take a few minutes for the Datastore to become available\")\n",
" else:\n",
" print(f\"Failed to create Datastore {datastore_id}\")\n",
" print(response.json())\n",
" return response.status_code"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "w2q8UZxwRd1m"
},
"source": [
"## Helper functions to issue [basic search on a Datastore](https://cloud.google.com/generative-ai-app-builder/docs/reference/rest/v1/projects.locations.collections.dataStores.servingConfigs/search)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "98bc0zgCWhqP"
},
"outputs": [],
"source": [
"def search_by_datastore(\n",
" project_id: str, location: str, datastore_id: str, query: str\n",
") -> requests.Response:\n",
" \"\"\"Searches a datastore using the provided query.\"\"\"\n",
" response = authed_session.post(\n",
" f\"https://discoveryengine.googleapis.com/v1/projects/{project_id}/locations/{location}/collections/default_collection/dataStores/{datastore_id}/servingConfigs/default_search:search\",\n",
" headers={\n",
" \"Content-Type\": \"application/json\",\n",
" },\n",
" json={\"query\": query, \"pageSize\": 1},\n",
" )\n",
" return response"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fJ3ULZOeR-E2"
},
"source": [
"## Helper functions to check whether or not a Datastore already exists"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ShppuvcxWBut"
},
"outputs": [],
"source": [
"def datastore_exists(project_id: str, location: str, datastore_id: str) -> bool:\n",
" \"\"\"Check if a datastore exists.\"\"\"\n",
" response = search_by_datastore(project_id, location, datastore_id, \"test\")\n",
" status_code = response.status_code\n",
" if status_code == 200:\n",
" return True\n",
" if status_code == 404:\n",
" return False\n",
" raise Exception(f\"Error: {status_code}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "NnCzxjV9SVO_"
},
"source": [
"## Create a Datastore with the provided ID if it doesn't exist"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "nBnAc1NIV59z"
},
"outputs": [],
"source": [
"# Create Chunk mode Datastore if it doesn't exist\n",
"if datastore_exists(PROJECT_ID, LOCATION, DATASTORE_ID):\n",
" print(f\"Datastore {DATASTORE_ID} already exists.\")\n",
"else:\n",
" create_datastore(PROJECT_ID, LOCATION, DATASTORE_ID)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "MF07QdwxW_1n"
},
"source": [
"## [Optional] Check if the Datastore is created successfully\n",
"\n",
"\n",
"The Datastore is polled to track when it becomes available.\n",
"\n",
"This may take a few minutes after the datastore creation is initiated"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "S7xR3uX8XEnH"
},
"outputs": [],
"source": [
"while not datastore_exists(PROJECT_ID, LOCATION, DATASTORE_ID):\n",
" print(f\"Datastore {DATASTORE_ID} is still being created.\")\n",
" time.sleep(30)\n",
"print(f\"Datastore {DATASTORE_ID} is created successfully.\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "FSgKPm-nnB81"
},
"source": [
"# STEP 2. Prepare sample documents"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9o-wr_8_7g0D"
},
"source": [
"## Create a folder to store the files locally"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "RgcYQVxu7mQO"
},
"outputs": [],
"source": [
"# Check if the folder already exists\n",
"if not os.path.exists(LOCAL_DIRECTORY_DOCS):\n",
" # Create the folder\n",
" os.makedirs(LOCAL_DIRECTORY_DOCS)\n",
" print(f\"Folder '{LOCAL_DIRECTORY_DOCS}' created successfully!\")\n",
"else:\n",
" print(f\"Folder '{LOCAL_DIRECTORY_DOCS}' already exists.\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6ekzngh0nbdY"
},
"source": [
"## Helper function to download pdf files and store them locally"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "SpDDcZ9Bmtsl"
},
"outputs": [],
"source": [
"def download_pdfs(\n",
" url_list: list[str], save_directory: str = LOCAL_DIRECTORY_DOCS\n",
") -> list[str]:\n",
" \"\"\"Downloads PDFs from a list of URLs and saves them to a specified directory.\n",
"\n",
" Args:\n",
" url_list: A list of URLs pointing to PDF files.\n",
" save_directory: The directory where the PDFs will be saved. Defaults to LOCAL_DIRECTORY_DOCS.\n",
"\n",
" Returns:\n",
" A list of file paths where the PDFs were saved.\n",
" \"\"\"\n",
"\n",
" pdf_file_paths = []\n",
"\n",
" # Create the save directory if it doesn't exist\n",
" if not os.path.exists(save_directory):\n",
" os.makedirs(save_directory)\n",
"\n",
" for i, url in enumerate(url_list):\n",
" try:\n",
" response = requests.get(url)\n",
" response.raise_for_status()\n",
"\n",
" # Construct the full file path within the save directory\n",
" file_name = f\"downloaded_pdf_{i+1}.pdf\"\n",
" file_path = os.path.join(save_directory, file_name)\n",
"\n",
" with open(file_path, \"wb\") as f:\n",
" f.write(response.content)\n",
"\n",
" pdf_file_paths.append(file_path)\n",
" print(f\"Downloaded PDF from {url} and saved to {file_path}\")\n",
" except requests.exceptions.RequestException as e:\n",
" print(f\"Error downloading PDF from {url}: {e}\")\n",
"\n",
" return pdf_file_paths"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "qO-MlhvrnkiZ"
},
"source": [
"## Download sample PDF files"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "wsw0wSOnm0LG"
},
"outputs": [],
"source": [
"file_urls = [\n",
" \"https://abc.xyz/assets/91/b3/3f9213d14ce3ae27e1038e01a0e0/2024q1-alphabet-earnings-release-pdf.pdf\",\n",
" \"https://abc.xyz/assets/19/e4/3dc1d4d6439c81206370167db1bd/2024q2-alphabet-earnings-release.pdf\",\n",
"]\n",
"\n",
"pdf_variables = download_pdfs(file_urls)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "aoVBLH8rnrnx"
},
"source": [
"## Create a sample text file and store locally"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "5J1h_sAym3kJ"
},
"outputs": [],
"source": [
"sample_text = \"\"\"\n",
"MOUNTAIN VIEW, Calif. - January 30, 2024 - Alphabet Inc. (NASDAQ: GOOG, GOOGL) today announced\n",
"financial results for the quarter and fiscal year ended December 31, 2023.\n",
"Sundar Pichai, CEO, said: \"We are pleased with the ongoing strength in Search and the growing contribution from\n",
"YouTube and Cloud. Each of these is already benefiting from our AI investments and innovation. As we enter the\n",
"Gemini era, the best is yet to come.\"\n",
"Ruth Porat, President and Chief Investment Officer; CFO said: \"We ended 2023 with very strong fourth quarter\n",
"financial results, with Q4 consolidated revenues of $86 billion, up 13% year over year. We remain committed to our\n",
"work to durably re-engineer our cost base as we invest to support our growth opportunities.\"\n",
"\"\"\"\n",
"\n",
"\n",
"def save_string_to_file(\n",
" string_to_save, filename=\"doc_3.txt\", save_directory=LOCAL_DIRECTORY_DOCS\n",
"):\n",
" \"\"\"Saves a string to a text file within a specified directory.\n",
"\n",
" Args:\n",
" string_to_save: The string content to be saved.\n",
" filename: The desired name for the output file (default: \"doc_3.txt\").\n",
" save_directory: The directory where the file will be saved (default: LOCAL_DIRECTORY_DOCS).\n",
"\n",
" Returns:\n",
" None\n",
" \"\"\"\n",
"\n",
" # Create the save directory if it doesn't exist\n",
" if not os.path.exists(save_directory):\n",
" os.makedirs(save_directory)\n",
"\n",
" # Construct the full file path within the save directory\n",
" file_path = os.path.join(save_directory, filename)\n",
"\n",
" try:\n",
" with open(file_path, \"w\", encoding=\"utf-8\") as file:\n",
" file.write(string_to_save)\n",
" print(f\"String successfully saved to {file_path}\")\n",
" except OSError as e:\n",
" print(f\"An error occurred while saving the file: {e}\")\n",
"\n",
"\n",
"save_string_to_file(sample_text)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "UlpkwFdSoCxz"
},
"source": [
"## Helper function to convert the content of a file to Base64 encoding"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "R06xrOx3Mdvf"
},
"outputs": [],
"source": [
"def file_to_base64(file_path):\n",
" \"\"\"Converts the content of a file to Base64 encoding.\n",
"\n",
" Args:\n",
" file_path: The path to the file.\n",
"\n",
" Returns:\n",
" The Base64 encoded string representing the file's content.\n",
" \"\"\"\n",
"\n",
" with open(file_path, \"rb\") as file:\n",
" file_data = file.read()\n",
" base64_encoded_data = base64.b64encode(file_data).decode(\"utf-8\")\n",
" return base64_encoded_data"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Dxko2yr7oOWN"
},
"source": [
"## Convert sample files to Base64 encoding"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "wLWaNs7wnDlu"
},
"outputs": [],
"source": [
"content_doc_1 = file_to_base64(LOCAL_DIRECTORY_DOCS + \"/downloaded_pdf_1.pdf\")\n",
"content_doc_2 = file_to_base64(LOCAL_DIRECTORY_DOCS + \"/downloaded_pdf_2.pdf\")\n",
"content_doc_3 = file_to_base64(LOCAL_DIRECTORY_DOCS + \"/doc_3.txt\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9AAYFQAjogMP"
},
"source": [
"## Create JSON documents from sample contents\n",
"\n",
"Here we create [`Documents`](https://cloud.google.com/generative-ai-app-builder/docs/reference/rest/v1/projects.locations.collections.dataStores.branches.documents#Document) in VAIS terminology based on contents from sample files created earlier.\n",
"\n",
"Note that the field [`content`](https://cloud.google.com/generative-ai-app-builder/docs/reference/rest/v1/projects.locations.collections.dataStores.branches.documents#Document.Content) in the document references rawBytes as opposed to `uri` that is used when the file is staged elsewhere.\n",
"\n",
"mimeType should be consistent with the format of the files to be ingested (e.g. application/pdf). See a list of supported mimeTypes [here](https://cloud.google.com/generative-ai-app-builder/docs/reference/rest/v1/projects.locations.collections.dataStores.branches.documents#Document.Content)\n",
"\n",
"We add some metadata to each document as well to demonstrate this more advanced functionality. This is optional and you can ingest the content with no metadata as well."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "cRU3hZPw_MUn"
},
"outputs": [],
"source": [
"my_document_1 = {\n",
" \"id\": \"doc-1\",\n",
" \"structData\": {\"title\": \"test_doc_1\", \"color_theme\": \"blue\"},\n",
" \"content\": {\"mimeType\": \"application/pdf\", \"rawBytes\": content_doc_1},\n",
"}\n",
"my_document_2 = {\n",
" \"id\": \"doc-2\",\n",
" \"structData\": {\"title\": \"test_doc_2\", \"color_theme\": \"red\"},\n",
" \"content\": {\"mimeType\": \"application/pdf\", \"rawBytes\": content_doc_2},\n",
"}\n",
"my_document_3 = {\n",
" \"id\": \"doc-3\",\n",
" \"structData\": {\"title\": \"test_doc_3\", \"color_theme\": \"green\"},\n",
" \"content\": {\"mimeType\": \"text/plain\", \"rawBytes\": content_doc_3},\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "cKrqND3Trt1Q"
},
"source": [
"# STEP 3. Import, List, Get, and Delete documents\n",
"\n",
"In this section we demonstrate some common operations on documents. You can find a more complete list [here](https://cloud.google.com/generative-ai-app-builder/docs/reference/rest/v1/projects.locations.collections.dataStores.branches.documents)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "xw3DcHRFqJyG"
},
"source": [
"## Inline [import](https://cloud.google.com/generative-ai-app-builder/docs/reference/rest/v1/projects.locations.collections.dataStores.branches.documents/import) of documents\n",
"\n",
"This block contains the main logic to be demonstrated in this notebook that is an inline ingestion of documents."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "NHsyjWqb3065"
},
"outputs": [],
"source": [
"def import_documents_rawbytes(project_id: str, location: str, datastore_id: str) -> str:\n",
" \"\"\"Imports unstructured documents Inline.\"\"\"\n",
" payload = {\n",
" \"reconciliationMode\": \"INCREMENTAL\",\n",
" \"inlineSource\": {\"documents\": [my_document_1, my_document_2, my_document_3]},\n",
" }\n",
" header = {\"Content-Type\": \"application/json\"}\n",
" es_endpoint = f\"https://discoveryengine.googleapis.com/v1/projects/{project_id}/locations/{location}/collections/default_collection/dataStores/{datastore_id}/branches/default_branch/documents:import\"\n",
" response = authed_session.post(\n",
" es_endpoint, data=json.dumps(payload), headers=header\n",
" )\n",
" print(f\"--{response.json()}\")\n",
" return response.json()\n",
"\n",
"\n",
"import_documents_rawbytes(PROJECT_ID, LOCATION, DATASTORE_ID)"
]
},
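{
"cell_type": "markdown",
"metadata": {},
"source": [
"## [Optional] Poll the import operation\n",
"\n",
"The import call above returns a long-running operation. Below is a minimal sketch to poll it until it reports done, assuming the `name` field of the `operation` dict returned by the previous cell."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def wait_for_import(operation_name: str, poll_seconds: int = 10) -> dict:\n",
"    \"\"\"Polls a long-running operation until it reports done.\"\"\"\n",
"    url = f\"https://discoveryengine.googleapis.com/v1/{operation_name}\"\n",
"    while True:\n",
"        lro = authed_session.get(url).json()\n",
"        if lro.get(\"done\"):\n",
"            return lro\n",
"        print(f\"Operation {operation_name} is still running...\")\n",
"        time.sleep(poll_seconds)\n",
"\n",
"\n",
"wait_for_import(operation[\"name\"])"
]
},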
{
"cell_type": "markdown",
"metadata": {
"id": "7jXQ3hD1a5Eo"
},
"source": [
"## List all documents\n",
"\n",
"[List](https://cloud.google.com/generative-ai-app-builder/docs/reference/rest/v1/projects.locations.collections.dataStores.branches.documents/list) all documents and their contents for a datastore. A maximum of 1000 documents are retrieved together with a page token to retrieve the next batch of documents (i.e. pagination)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "TMMnEwg7bJQy"
},
"outputs": [],
"source": [
"def list_documents_datastore(\n",
" project_id: str, location: str, data_store_id: str\n",
") -> list[dict[str, str]] | None:\n",
" \"\"\"Lists documents in a specified data store using the REST API.\n",
"\n",
" Args:\n",
" project_id: The ID of your Google Cloud project.\n",
" location: The location of your data store.\n",
" Values: \"global\", \"us\", \"eu\"\n",
" data_store_id: The ID of the datastore.\n",
"\n",
" Returns:\n",
" The JSON response containing the list of documents, or None if an error occurs.\n",
" \"\"\"\n",
"\n",
" base_url = (\n",
" f\"{location}-discoveryengine.googleapis.com\"\n",
" if location != \"global\"\n",
" else \"discoveryengine.googleapis.com\"\n",
" )\n",
" url = f\"https://{base_url}/v1alpha/projects/{project_id}/locations/{location}/collections/default_collection/dataStores/{data_store_id}/branches/default_branch/documents\"\n",
"\n",
" try:\n",
" # Assuming 'authed_session' is available and properly configured for authentication\n",
" response = authed_session.get(url)\n",
" response.raise_for_status() # Raise an exception for bad status codes\n",
" documents = response.json()\n",
" print(\n",
" f\"Successfully retrieved {len(documents.get('documents', []))} document(s).\\n\"\n",
" )\n",
" return [document for document in documents.get(\"documents\", [])]\n",
"\n",
" except requests.exceptions.RequestException as e:\n",
" print(f\"Error listing documents: {e}\")\n",
" return None\n",
"\n",
"\n",
"list_documents_datastore(PROJECT_ID, LOCATION, DATASTORE_ID)"
]
},
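{
"cell_type": "markdown",
"metadata": {},
"source": [
"## [Optional] List documents with pagination\n",
"\n",
"The helper above only retrieves the first page of results. Below is a minimal pagination sketch, assuming the `pageToken` request parameter and the `nextPageToken` response field of the List API."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def list_all_documents_datastore(\n",
"    project_id: str, location: str, data_store_id: str\n",
") -> list[dict]:\n",
"    \"\"\"Lists every document in a data store, following nextPageToken pagination.\"\"\"\n",
"    base_url = (\n",
"        f\"{location}-discoveryengine.googleapis.com\"\n",
"        if location != \"global\"\n",
"        else \"discoveryengine.googleapis.com\"\n",
"    )\n",
"    url = f\"https://{base_url}/v1alpha/projects/{project_id}/locations/{location}/collections/default_collection/dataStores/{data_store_id}/branches/default_branch/documents\"\n",
"    documents: list[dict] = []\n",
"    page_token = \"\"\n",
"    while True:\n",
"        response = authed_session.get(\n",
"            url, params={\"pageSize\": 100, \"pageToken\": page_token}\n",
"        )\n",
"        response.raise_for_status()\n",
"        payload = response.json()\n",
"        documents.extend(payload.get(\"documents\", []))\n",
"        page_token = payload.get(\"nextPageToken\", \"\")\n",
"        if not page_token:\n",
"            return documents\n",
"\n",
"\n",
"print(f\"Total documents: {len(list_all_documents_datastore(PROJECT_ID, LOCATION, DATASTORE_ID))}\")"
]
},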
{
"cell_type": "markdown",
"metadata": {
"id": "apGqssFpayOd"
},
"source": [
"## Get a specific document\n",
"\n",
"[Get](https://cloud.google.com/generative-ai-app-builder/docs/reference/rest/v1/projects.locations.collections.dataStores.branches.documents/get) a document and some of its details (regarding indexing status) by referencing the document ID."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "06wLHBmMcHIy"
},
"outputs": [],
"source": [
"DOCUMENT_ID = \"doc-1\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "1Y5HopVnb_xq"
},
"outputs": [],
"source": [
"def get_document_datastore(\n",
" project_id: str, location: str, data_store_id: str, document_id: str\n",
") -> dict[str, str] | None:\n",
" \"\"\"Gets a specific document from a data store using the REST API.\n",
"\n",
" Args:\n",
" project_id: The ID of your Google Cloud project.\n",
" location: The location of your data store.\n",
" Values: \"global\", \"us\", \"eu\"\n",
" data_store_id: The ID of the datastore.\n",
" document_id: The ID of the document to retrieve.\n",
"\n",
" Returns:\n",
" The JSON response containing the document data, or None if an error occurs.\n",
" \"\"\"\n",
"\n",
" base_url = (\n",
" f\"{location}-discoveryengine.googleapis.com\"\n",
" if location != \"global\"\n",
" else \"discoveryengine.googleapis.com\"\n",
" )\n",
" url = f\"https://{base_url}/v1alpha/projects/{project_id}/locations/{location}/collections/default_collection/dataStores/{data_store_id}/branches/default_branch/documents/{document_id}\"\n",
"\n",
" try:\n",
" # Assuming 'authed_session' is available and properly configured for authentication\n",
" response = authed_session.get(url)\n",
" response.raise_for_status() # Raise an exception for bad status codes\n",
" document = response.json()\n",
" print(f\"Successfully retrieved document with ID: {document_id}\\n\")\n",
" return document\n",
"\n",
" except requests.exceptions.RequestException as e:\n",
" print(f\"Error getting document: {e}\")\n",
" return None\n",
"\n",
"\n",
"get_document_datastore(PROJECT_ID, LOCATION, DATASTORE_ID, DOCUMENT_ID)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "bUqTVMaMa7X9"
},
"source": [
"## Delete a document\n",
"\n",
"[Delete](https://cloud.google.com/generative-ai-app-builder/docs/reference/rest/v1/projects.locations.collections.dataStores.branches.documents/delete) a particular document from a datastore by referencing its ID.\n",
"\n",
"The line that actually deletes the document is commented out here as we need all documents in a subsequent section.\n",
"\n",
"Note that if you are leveraging GCS/BQ staging approach for importing, a Full import from the source will make the document reappear in the datastore. Same goes with a page within an advanced website datastore which may reappear by subsequent re-crawls."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "9-fC8pNjdH5C"
},
"outputs": [],
"source": [
"def delete_document_datastore(\n",
" project_id: str, location: str, data_store_id: str, document_id: str\n",
") -> bool:\n",
" \"\"\"Deletes a specific document from a data store using the REST API.\n",
"\n",
" Args:\n",
" project_id: The ID of your Google Cloud project.\n",
" location: The location of your data store.\n",
" Values: \"global\", \"us\", \"eu\"\n",
" data_store_id: The ID of the datastore.\n",
" document_id: The ID of the document to delete.\n",
"\n",
" Returns:\n",
" True if the document was deleted successfully, False otherwise.\n",
" \"\"\"\n",
"\n",
" base_url = (\n",
" f\"{location}-discoveryengine.googleapis.com\"\n",
" if location != \"global\"\n",
" else \"discoveryengine.googleapis.com\"\n",
" )\n",
" url = f\"https://{base_url}/v1alpha/projects/{project_id}/locations/{location}/collections/default_collection/dataStores/{data_store_id}/branches/default_branch/documents/{document_id}\"\n",
"\n",
" try:\n",
" # Assuming 'authed_session' is available and properly configured for authentication\n",
" response = authed_session.delete(url)\n",
" response.raise_for_status() # Raise an exception for bad status codes\n",
" print(f\"Successfully deleted document with ID: {document_id}\\n\")\n",
" return True\n",
"\n",
" except requests.exceptions.RequestException as e:\n",
" print(f\"Error deleting document: {e}\")\n",
" return False\n",
"\n",
"\n",
"# delete_document_datastore(PROJECT_ID, LOCATION, DATASTORE_ID, DOCUMENT_ID)"
]
},
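{
"cell_type": "markdown",
"metadata": {},
"source": [
"## [Optional] Purge all documents\n",
"\n",
"Besides per-document deletes, the API also offers a [purge](https://cloud.google.com/generative-ai-app-builder/docs/reference/rest/v1/projects.locations.collections.dataStores.branches.documents/purge) method that deletes all documents matching a filter. The call below is a minimal sketch and is commented out, since the remaining sections need the documents."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def purge_documents_datastore(\n",
"    project_id: str, location: str, data_store_id: str\n",
") -> dict:\n",
"    \"\"\"Purges all documents in a datastore and returns the long-running operation.\"\"\"\n",
"    base_url = (\n",
"        f\"{location}-discoveryengine.googleapis.com\"\n",
"        if location != \"global\"\n",
"        else \"discoveryengine.googleapis.com\"\n",
"    )\n",
"    url = f\"https://{base_url}/v1/projects/{project_id}/locations/{location}/collections/default_collection/dataStores/{data_store_id}/branches/default_branch/documents:purge\"\n",
"    # With force=False the API only returns the count of documents that would be purged\n",
"    response = authed_session.post(url, json={\"filter\": \"*\", \"force\": True})\n",
"    response.raise_for_status()\n",
"    return response.json()\n",
"\n",
"\n",
"# purge_documents_datastore(PROJECT_ID, LOCATION, DATASTORE_ID)"
]
},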
{
"cell_type": "markdown",
"metadata": {
"id": "C2JPuO53q42c"
},
"source": [
"# STEP 4. Run queries with and without Metadata filter"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0wAa9WFOq8jd"
},
"source": [
"## Sample search without filter\n",
"A basic search request issued to the Datastore\n",
"\n",
"We get relevant results from all three documents in the datastore"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "m9GZazgQRTzL"
},
"outputs": [],
"source": [
"test_query = \"Google revenue\"\n",
"\n",
"response = authed_session.post(\n",
" f\"https://discoveryengine.googleapis.com/v1alpha/projects/{PROJECT_ID}/locations/{LOCATION}/collections/default_collection/dataStores/{DATASTORE_ID}/servingConfigs/default_search:search\",\n",
" headers={\n",
" \"Content-Type\": \"application/json\",\n",
" },\n",
" json={\n",
" \"query\": test_query,\n",
" },\n",
")\n",
"response.json()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "JZwPBsOHrGt2"
},
"source": [
"## Sample search with filter\n",
"\n",
"Now let's apply a filter to showcase how metadata can be used to influence the results.\n",
"\n",
"We issue the same query as above, but limit the results to color_theme \"red\". A expected we only get one result back\n",
"\n",
"Note that this block shows a very basic way of querying a Datastore. You can find more information [here](https://cloud.google.com/generative-ai-app-builder/docs/preview-search-results)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "y9-N7M--RYuJ"
},
"outputs": [],
"source": [
"test_query = \"Google revenue\"\n",
"\n",
"response = authed_session.post(\n",
" f\"https://discoveryengine.googleapis.com/v1alpha/projects/{PROJECT_ID}/locations/{LOCATION}/collections/default_collection/dataStores/{DATASTORE_ID}/servingConfigs/default_search:search\",\n",
" headers={\n",
" \"Content-Type\": \"application/json\",\n",
" },\n",
" json={\n",
" \"query\": test_query,\n",
" \"filter\": 'color_theme: ANY(\"red\")',\n",
" },\n",
")\n",
"response.json()"
]
},
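{
"cell_type": "markdown",
"metadata": {},
"source": [
"## [Optional] Parse the search response\n",
"\n",
"A minimal sketch to pull just the document IDs and metadata out of the raw response above, assuming the `results[].document` shape returned by the Search API."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for result in response.json().get(\"results\", []):\n",
"    document = result.get(\"document\", {})\n",
"    # structData carries the metadata (title, color_theme) attached at import time\n",
"    print(document.get(\"id\"), document.get(\"structData\", {}))"
]
},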
{
"cell_type": "markdown",
"metadata": {
"id": "b_9s1JT6AS7o"
},
"source": [
"#Cleanup\n",
"Clean up resources created in this notebook.\n",
"\n",
"Set `DELETE_RESOURCES` flag to `True` to delete resources."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "9yeinaBzeok9"
},
"outputs": [],
"source": [
"DELETE_RESOURCES = False"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4cZ8lvPQ3OnY"
},
"source": [
"## Delete local files"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "OqoAKBai3A4i"
},
"outputs": [],
"source": [
"if DELETE_RESOURCES:\n",
" shutil.rmtree(LOCAL_DIRECTORY_DOCS)\n",
"\n",
" print(\"Local files deleted successfully.\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "L0aU2DdTckUo"
},
"source": [
"## Delete the Datastore\n",
"Delete the Datastore if you no longer need it\n",
"\n",
"Alternatively you can follow [these instructions](https://console.cloud.google.com/gen-app-builder/data-stores) to delete a Datastore from the UI"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "pBKcL_oicjxL"
},
"outputs": [],
"source": [
"if DELETE_RESOURCES:\n",
" response = authed_session.delete(\n",
" f\"https://discoveryengine.googleapis.com/v1alpha/projects/{PROJECT_ID}/locations/{LOCATION}/collections/default_collection/dataStores/{DATASTORE_ID}\",\n",
" headers={\"X-Goog-User-Project\": PROJECT_ID},\n",
" )\n",
"\n",
" print(response.json())"
]
}
],
"metadata": {
"colab": {
"name": "inline_ingestion_of_documents.ipynb",
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}