colab-enterprise/BigQuery-Analytics-with-GenAI.ipynb

{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "mzg8CwTmWxrK" }, "source": [ "### <font color='#4285f4'>Overview</font>" ] }, { "cell_type": "markdown", "metadata": { "id": "S5CENdyoWxrL" }, "source": [ "This demo shows how to use the RAG (retrieval-augmented generation) pattern directly within BigQuery on a set of PDFs (unstructured data).\n", "\n", "Process Flow:\n", "1. Create a Cloud Storage bucket and copy seed data\n", "2. Populate BigQuery\n", "3. Create the external connections (Gemini, Vertex AI Embeddings, Vertex AI Document Processors, BigLake)\n", "4. Set the IAM permissions\n", "5. Create an object table\n", "6. Extract the text from each PDF using Vertex AI Document Processor\n", "7. Create embeddings using Vertex AI Text Embeddings\n", "8. Use Gemini by searching the embeddings (Vector Search) and injecting them into the context window\n", "9. Use Gemini with results grounded in Google Search\n", "10. Use Gemini with “Response Schema”, which returns the response in a structured format (typically JSON, but in BigQuery a table is returned)\n", "\n", "\n", "\n", "Cost:\n", "* Approximate cost: Less than $1\n", "\n", "Author:\n", "* Adam Paternostro" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "FfGRnQwGWxrL" }, "outputs": [], "source": [ "# Architecture Diagram\n", "from IPython.display import Image\n", "Image(url='https://storage.googleapis.com/data-analytics-golden-demo/colab-diagrams/BigQuery-Analytics-using-GenAI.png', width=1200)" ] }, { "cell_type": "markdown", "metadata": { "id": "B2aqAu1sWxrL" }, "source": [ "### <font color='#4285f4'>Video Walkthrough</font>" ] }, { "cell_type": "markdown", "metadata": { "id": "zfvmLaGfWxrL" }, "source": [ "[![Video](https://storage.googleapis.com/data-analytics-golden-demo/colab-videos/BigQuery-Analytics-with-GenAI.png)](https://storage.googleapis.com/data-analytics-golden-demo/colab-videos/BigQuery-Analytics-with-GenAI.mp4)\n", "\n" ] }, {
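"cell_type": "markdown", "metadata": {}, "source": [ "As a preview of steps 7 and 8 in the Overview, the embedding + vector-search step has roughly the following shape. This is a hypothetical sketch only: the dataset, table, column, and model names here are illustrative placeholders, not the ones this notebook creates.\n", "\n", "```python\n", "sql = f\"\"\"\n", "SELECT base.content\n", " FROM VECTOR_SEARCH(TABLE `mydataset.pdf_chunk_embeddings`, 'embedding_column',\n", " (SELECT ml_generate_embedding_result\n", " FROM ML.GENERATE_EMBEDDING(MODEL `mydataset.textembedding_model`,\n", " (SELECT 'my question' AS content))),\n", " top_k => 5)\n", "\"\"\"\n", "RunQuery(sql)\n", "```" ] }, {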
"cell_type": "code", "execution_count": null, "metadata": { "id": "KCEokJy1WxrL" }, "outputs": [], "source": [ "from IPython.display import HTML\n", "\n", "HTML(\"\"\"\n", "<video width=\"800\" height=\"600\" controls>\n", " <source src=\"https://storage.googleapis.com/data-analytics-golden-demo/colab-videos/BigQuery-Analytics-with-GenAI.mp4\" type=\"video/mp4\">\n", " Your browser does not support the video tag.\n", "</video>\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": { "id": "HMsUvoF4BP7Y" }, "source": [ "### <font color='#4285f4'>License</font>" ] }, { "cell_type": "markdown", "metadata": { "id": "jQgQkbOvj55d" }, "source": [ "```\n", "# Copyright 2024 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License.\n", "```" ] }, { "cell_type": "markdown", "metadata": { "id": "UmyL-Rg4Dr_f" }, "source": [ "### <font color='#4285f4'>Initialize</font>" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "xOYsEVSXp6IP" }, "outputs": [], "source": [ "from PIL import Image\n", "from IPython.display import HTML\n", "import IPython.display\n", "import google.auth\n", "import requests\n", "import json\n", "import uuid\n", "import base64\n", "import os\n", "import cv2\n", "import random\n", "import time\n", "import datetime\n", "import logging" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "wMlHl3bnkFPZ" }, "outputs": [],
"source": [ "# Set these (run this cell to verify the output)\n", "\n", "# change to \"us\", \"eu\", etc.\n", "location = \"us\"\n", "\n", "# Get the current date and time\n", "now = datetime.datetime.now()\n", "\n", "# Format the date and time as desired\n", "formatted_date = now.strftime(\"%Y-%m-%d-%H-%M\")\n", "\n", "# Get some values using gcloud\n", "project_id = os.environ[\"GOOGLE_CLOUD_PROJECT\"]\n", "user = !(gcloud auth list --filter=status:ACTIVE --format=\"value(account)\")\n", "\n", "if len(user) != 1:\n", " raise RuntimeError(f\"user is not set: {user}\")\n", "user = user[0]\n", "\n", "print(f\"project_id = {project_id}\")\n", "print(f\"user = {user}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "d5Bx-skioi0s" }, "source": [ "### <font color='#4285f4'>Prerequisite to run this notebook</font>" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "yPykWt2GxZFK" }, "outputs": [], "source": [ "!gcloud services enable compute.googleapis.com \\\n", " bigquery.googleapis.com \\\n", " aiplatform.googleapis.com \\\n", " dataform.googleapis.com \\\n", " documentai.googleapis.com \\\n", " --project \"{project_id}\"" ] }, { "cell_type": "markdown", "metadata": { "id": "sZ6m_wGrK0YG" }, "source": [ "### <font color='#4285f4'>Helper Methods</font>" ] }, { "cell_type": "markdown", "metadata": { "id": "JbOjdSP1kN9T" }, "source": [ "##### restAPIHelper\n", "Calls the Google Cloud REST API using the current user's credentials."
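, "\n", "\n", "A hypothetical usage sketch (the URL and variables are illustrative; the helper itself is defined in the next cell):\n", "\n", "```python\n", "# List the BigQuery connections in the current project and region\n", "url = f\"https://bigqueryconnection.googleapis.com/v1/projects/{project_id}/locations/{location}/connections\"\n", "connections = restAPIHelper(url, \"GET\", None)\n", "```"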
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "40wlwnY4kM11" }, "outputs": [], "source": [ "def restAPIHelper(url: str, http_verb: str, request_body: dict) -> dict:\n", " \"\"\"Calls the Google Cloud REST API passing in the current user's credentials\"\"\"\n", "\n", " import requests\n", " import google.auth\n", " import google.auth.transport.requests\n", " import json\n", "\n", " # Get an access token based upon the current user\n", " creds, project = google.auth.default()\n", " auth_req = google.auth.transport.requests.Request()\n", " creds.refresh(auth_req)\n", " access_token = creds.token\n", "\n", " headers = {\n", " \"Content-Type\" : \"application/json\",\n", " \"Authorization\" : \"Bearer \" + access_token\n", " }\n", "\n", " if http_verb == \"GET\":\n", " response = requests.get(url, headers=headers)\n", " elif http_verb == \"POST\":\n", " response = requests.post(url, json=request_body, headers=headers)\n", " elif http_verb == \"PUT\":\n", " response = requests.put(url, json=request_body, headers=headers)\n", " elif http_verb == \"PATCH\":\n", " response = requests.patch(url, json=request_body, headers=headers)\n", " elif http_verb == \"DELETE\":\n", " response = requests.delete(url, headers=headers)\n", " else:\n", " raise RuntimeError(f\"Unknown HTTP verb: {http_verb}\")\n", "\n", " if response.status_code == 200:\n", " return json.loads(response.content)\n", " else:\n", " error = f\"Error restAPIHelper -> ' Status: '{response.status_code}' Text: '{response.text}'\"\n", " raise RuntimeError(error)" ] }, { "cell_type": "markdown", "metadata": { "id": "bI-KJELZ1jgt" }, "source": [ "##### RunQuery (on BigQuery)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "pmnCwYvA1kZv" }, "outputs": [], "source": [ "def RunQuery(sql):\n", " import time\n", " from google.cloud import bigquery\n", " client = bigquery.Client()\n", "\n", " if (sql.startswith(\"SELECT\") or 
sql.startswith(\"WITH\")):\n", " df_result = client.query(sql).to_dataframe()\n", " return df_result\n", " else:\n", " job_config = bigquery.QueryJobConfig(priority=bigquery.QueryPriority.INTERACTIVE)\n", " query_job = client.query(sql, job_config=job_config)\n", "\n", " # Check on the progress by getting the job's updated state.\n", " query_job = client.get_job(\n", " query_job.job_id, location=query_job.location\n", " )\n", " print(\"Job {} is currently in state {} with error result of {}\".format(query_job.job_id, query_job.state, query_job.error_result))\n", "\n", " while query_job.state != \"DONE\":\n", " time.sleep(2)\n", " query_job = client.get_job(\n", " query_job.job_id, location=query_job.location\n", " )\n", " print(\"Job {} is currently in state {} with error result of {}\".format(query_job.job_id, query_job.state, query_job.error_result))\n", "\n", " if query_job.error_result is None:\n", " return True\n", " else:\n", " raise Exception(query_job.error_result)" ] }, { "cell_type": "markdown", "metadata": { "id": "7ihI3oydf5OV" }, "source": [ "##### Create Vertex AI connection" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "XBqHmWbMftik" }, "outputs": [], "source": [ "def createExternal_BigLake_VertexAI_RemoteFunctions_Connection(project_id, location, connection_name):\n", " \"\"\"Creates a BigLake, Vertex AI, Remote Function connection.\"\"\"\n", "\n", " # First find the connection\n", " # https://cloud.google.com/bigquery/docs/reference/bigqueryconnection/rest/v1/projects.locations.connections/list\n", " url = f\"https://bigqueryconnection.googleapis.com/v1/projects/{project_id}/locations/{location}/connections\"\n", "\n", " # Gather existing connections\n", " json_result = restAPIHelper(url, \"GET\", None)\n", " print(f\"createExternal_BigLake_VertexAI_RemoteFunctions_Connection (GET) json_result: {json_result}\")\n", "\n", " # Test to see if connection exists, if so return\n", " if \"connections\" in json_result:\n", " for item in 
json_result[\"connections\"]:\n", " print(f\"BigLake Connection: {item['name']}\")\n", " # \"projects/756740881369/locations/us/connections/biglake-notebook-connection\"\n", " # NOTE: We cannot test the complete name since it contains the project number and not id\n", " if item[\"name\"].endswith(f\"/locations/{location}/connections/{connection_name}\"):\n", " print(\"Connection already exists\")\n", " serviceAccountId = item[\"cloudResource\"][\"serviceAccountId\"]\n", " return serviceAccountId\n", "\n", " # Create the connection\n", " # https://cloud.google.com/bigquery/docs/reference/bigqueryconnection/rest/v1/projects.locations.connections/create\n", " print(\"Creating Vertex AI Connection\")\n", "\n", " url = f\"https://bigqueryconnection.googleapis.com/v1/projects/{project_id}/locations/{location}/connections?connectionId={connection_name}\"\n", "\n", " request_body = {\n", " \"friendlyName\": connection_name,\n", " \"description\": \"BigLake, Vertex AI, Remote Function connection\",\n", " \"cloudResource\": {}\n", " }\n", "\n", " json_result = restAPIHelper(url, \"POST\", request_body)\n", "\n", " serviceAccountId = json_result[\"cloudResource\"][\"serviceAccountId\"]\n", " print(\"BigLake Connection created: \", serviceAccountId)\n", " return serviceAccountId\n" ] }, { "cell_type": "markdown", "metadata": { "id": "W8E-HwcahSRu" }, "source": [ "##### Create a GCS bucket" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "GfUqF2ElhS7z" }, "outputs": [], "source": [ "def createGoogleCloudStorageBucket(project_id, bucket_name, location):\n", " \"\"\"Creates a Google Cloud Storage bucket.\"\"\"\n", "\n", " # First find the bucket\n", " # https://cloud.google.com/storage/docs/json_api/v1/buckets/list\n", " url = f\"https://storage.googleapis.com/storage/v1/b?project={project_id}\"\n", "\n", " # Gather existing buckets\n", " json_result = restAPIHelper(url, \"GET\", None)\n", " print(f\"createGoogleCloudStorageBucket (GET) json_result: 
{json_result}\")\n", "\n", " # Test to see if the bucket exists, if so return\n", " if \"items\" in json_result:\n", " for item in json_result[\"items\"]:\n", " print(f\"Bucket Id / Name: {item['id']} / {item['name']}\")\n", " if item[\"id\"] == bucket_name:\n", " print(\"Bucket already exists\")\n", " return\n", "\n", " # Create the bucket\n", " # https://cloud.google.com/storage/docs/json_api/v1/buckets/insert\n", " print(\"Creating Google Cloud Storage Bucket\")\n", "\n", " url = f\"https://storage.googleapis.com/storage/v1/b?project={project_id}&predefinedAcl=private&predefinedDefaultObjectAcl=private&projection=noAcl\"\n", "\n", " request_body = {\n", " \"name\": bucket_name,\n", " \"location\": location\n", " }\n", "\n", " json_result = restAPIHelper(url, \"POST\", request_body)\n", " print()\n", " print(f\"json_result: {json_result}\")\n", " print()\n", " print(\"Bucket created: \", bucket_name)" ] }, { "cell_type": "markdown", "metadata": { "id": "Fnw_9stBkgNP" }, "source": [ "##### Set IAM permissions on bucket" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "mNx3WadTkgkA" }, "outputs": [], "source": [ "def setBucketIamPolicy(bucket_name, accountWithPrefix, role):\n", " \"\"\"Sets the bucket IAM policy.\"\"\"\n", "\n", " # Get the current bindings (if the account has access then skip)\n", " # https://cloud.google.com/storage/docs/json_api/v1/buckets/getIamPolicy\n", "\n", " url = f\"https://storage.googleapis.com/storage/v1/b/{bucket_name}/iam\"\n", " json_result = restAPIHelper(url, \"GET\", None)\n", " print(f\"setBucketIamPolicy (GET) json_result: {json_result}\")\n", "\n", " # Test to see if permissions exist\n", " if \"bindings\" in json_result:\n", " for item in json_result[\"bindings\"]:\n", " members = item[\"members\"]\n", " for member in members:\n", " if member == accountWithPrefix:\n", " print(\"Permissions exist\")\n", " return\n", "\n", " # Take the existing bindings and append the new permission\n", " # Otherwise we lose the existing permissions\n", "\n", " if \"bindings\" in json_result:\n", " bindings = json_result[\"bindings\"]\n", " else:\n", " bindings = []\n", "\n", " new_permission = {\n", " \"role\": role,\n", " \"members\": [ accountWithPrefix ]\n", " }\n", "\n", " bindings.append(new_permission)\n", "\n", " # https://cloud.google.com/storage/docs/json_api/v1/buckets/setIamPolicy\n", " url = f\"https://storage.googleapis.com/storage/v1/b/{bucket_name}/iam\"\n", "\n", " request_body = { \"bindings\" : bindings }\n", "\n", " print(f\"Permission bindings: {bindings}\")\n", "\n", "\n", " json_result = restAPIHelper(url, \"PUT\", request_body)\n", " print()\n", " print(f\"json_result: {json_result}\")\n", " print()\n", " print(f\"Bucket IAM Permissions set for {accountWithPrefix} {role}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "yu2yg46hn4e0" }, "source": [ "##### Set Project Level IAM Permissions" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "yCFYgowGn406" }, "outputs": [], "source": [ "def setProjectLevelIamPolicy(project_id, accountWithPrefix, role):\n", " \"\"\"Sets the Project Level IAM policy.\"\"\"\n", "\n", " # Get the current bindings (if the account has access then skip)\n", " # https://cloud.google.com/resource-manager/reference/rest/v1/projects/getIamPolicy\n", " url = f\"https://cloudresourcemanager.googleapis.com/v1/projects/{project_id}:getIamPolicy\"\n", "\n", " request_body = { }\n", " json_result = restAPIHelper(url, \"POST\", request_body)\n", " print(f\"setProjectLevelIamPolicy (getIamPolicy) json_result: {json_result}\")\n", "\n", " # Test to see if permissions exist\n", " if \"bindings\" in json_result:\n", " for item in json_result[\"bindings\"]:\n", " if item[\"role\"] == role:\n", " members = item[\"members\"]\n", " for member in members:\n", " if member == accountWithPrefix:\n", " print(\"Permissions exist\")\n", " return\n", "\n", " # Take the existing bindings and append the new permission\n", " # Otherwise we lose the existing permissions\n", " if \"bindings\" 
in json_result:\n", " bindings = json_result[\"bindings\"]\n", " else:\n", " bindings = []\n", "\n", " new_permission = {\n", " \"role\": role,\n", " \"members\": [ accountWithPrefix ]\n", " }\n", "\n", " bindings.append(new_permission)\n", "\n", " # https://cloud.google.com/resource-manager/reference/rest/v1/projects/setIamPolicy\n", " url = f\"https://cloudresourcemanager.googleapis.com/v1/projects/{project_id}:setIamPolicy\"\n", "\n", " request_body = { \"policy\" : {\n", " \"bindings\" : bindings\n", " }\n", " }\n", "\n", " print(f\"Permission bindings: {bindings}\")\n", "\n", " json_result = restAPIHelper(url, \"POST\", request_body)\n", " print()\n", " print(f\"json_result: {json_result}\")\n", " print()\n", " print(f\"Project Level IAM Permissions set for {accountWithPrefix} {role}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "19PtFeA3cEk4" }, "source": [ "##### Delete Document Processor" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "10rpecegcFHd" }, "outputs": [], "source": [ "def deleteDocumentProcessor(project_id, location, processor_name, processor_type):\n", " \"\"\"Deletes a Vertex AI document processor if it exists.\"\"\"\n", "\n", " # First find the item\n", " # https://cloud.google.com/document-ai/docs/reference/rest/v1/projects.locations.processors/list\n", " url = f\"https://{location}-documentai.googleapis.com/v1/projects/{project_id}/locations/{location}/processors\"\n", "\n", " # Gather existing items\n", " json_result = restAPIHelper(url, \"GET\", None)\n", " print(f\"deleteDocumentProcessor (GET) json_result: {json_result}\")\n", "\n", " # Test to see if the processor exists, if so delete it\n", " if \"processors\" in json_result:\n", " for item in json_result[\"processors\"]:\n", " print(f\"Processor Name: {item['name']}\")\n", " # \"projects/530963301545/locations/us/processors/b7e8a9fe78cf7e9c\"\n", " # NOTE: We do not know the random number at the end so test the type and display name\n", " if item[\"type\"] == processor_type and item[\"displayName\"] == processor_name:\n", " print(\"Found processor\")\n", " name = item[\"name\"]\n", " url = f\"https://{location}-documentai.googleapis.com/v1/{name}\"\n", " json_result = restAPIHelper(url, \"DELETE\", None)\n", " print(\"Document Processor Deleted\")\n", " return\n", "\n", " print(\"Document Processor not found to delete\")" ] }, { "cell_type": "markdown", "metadata": { "id": "wnqFezxacs8R" }, "source": [ "##### Create a Document Processor (to process the PDFs)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "2Ekj0R1RcwA4" }, "outputs": [], "source": [ "def createDocumentProcessor(project_id, location, processor_name, processor_type):\n", " \"\"\"Creates a Vertex AI document processor if it does not exist.\"\"\"\n", "\n", " # First find the item\n", " # https://cloud.google.com/document-ai/docs/reference/rest/v1/projects.locations.processors/list\n", " url = f\"https://{location}-documentai.googleapis.com/v1/projects/{project_id}/locations/{location}/processors\"\n", "\n", " # Gather existing items\n", " json_result = restAPIHelper(url, \"GET\", None)\n", " print(f\"createDocumentProcessor (GET) json_result: {json_result}\")\n", "\n", " # Test to see if processor exists, if so return\n", " if \"processors\" in json_result:\n", " for item in json_result[\"processors\"]:\n", " print(f\"Processor Name: {item['name']}\")\n", " # \"projects/530963301545/locations/us/processors/b7e8a9fe78cf7e9c\"\n", " # NOTE: We do not know the random number at the end so test the type and display name\n", " if item[\"type\"] == processor_type and item[\"displayName\"] == processor_name:\n", " print(\"Processor already exists\")\n", " defaultProcessorVersion = item[\"defaultProcessorVersion\"]\n", " return defaultProcessorVersion\n", "\n", " # Create the processor\n", " # https://cloud.google.com/document-ai/docs/reference/rest/v1/projects.locations.processors/create\n", " print(\"Creating Document Processor\")\n",
"\n", " url = f\"https://{location}-documentai.googleapis.com/v1/projects/{project_id}/locations/{location}/processors\"\n", "\n", " request_body = {\n", " \"type\": processor_type,\n", " \"displayName\": processor_name\n", " }\n", "\n", " \"\"\"\n", " INVALID_ARGUMENT: Document pages in non-imageless mode exceed the limit: 15 got 58.\n", " Try using imageless mode to increase the limit to 100. [type.googleapis.com/util.MessageSetPayload='[google.rpc.error_details_ext]\n", " { message: \"Document pages in non-imageless mode exceed the limit: 15 got 58. Try using imageless mode to increase the limit to 100.\"\n", " details { [type.googleapis.com/google.rpc.ErrorInfo] { reason: \"PAGE_LIMIT_EXCEEDED\" domain: \"documentai.googleapis.com\" metadata\n", " { key: \"page_limit\" value: \"100\" } metadata { key: \"pages\" value: \"58\" } } } }']\n", " \"\"\"\n", "\n", " json_result = restAPIHelper(url, \"POST\", request_body)\n", "\n", " \"\"\"\n", " {'name': 'projects/530963301545/locations/us/processors/b7e8a9fe78cf7e9c',\n", " 'type': 'OCR_PROCESSOR',\n", " 'displayName': 'vertex_ai_ocr_processor',\n", " 'state': 'ENABLED',\n", " 'processEndpoint': 'https://us-documentai.googleapis.com/v1/projects/530963301545/locations/us/processors/b7e8a9fe78cf7e9c:process',\n", " 'createTime': '2025-01-21T19:25:07.980401Z',\n", " 'defaultProcessorVersion': 'projects/530963301545/locations/us/processors/b7e8a9fe78cf7e9c/processorVersions/pretrained-ocr-v2.0-2023-06-02',\n", " 'processorVersionAliases': [{'alias': 'projects/530963301545/locations/us/processors/b7e8a9fe78cf7e9c/processorVersions/pretrained',\n", " 'processorVersion': 'projects/530963301545/locations/us/processors/b7e8a9fe78cf7e9c/processorVersions/pretrained-ocr-v1.0-2020-09-23'},\n", " {'alias': 'projects/530963301545/locations/us/processors/b7e8a9fe78cf7e9c/processorVersions/pretrained-next',\n", " 'processorVersion': 
'projects/530963301545/locations/us/processors/b7e8a9fe78cf7e9c/processorVersions/pretrained-ocr-v1.1-2022-09-12'},\n", " {'alias': 'projects/530963301545/locations/us/processors/b7e8a9fe78cf7e9c/processorVersions/rc',\n", " 'processorVersion': 'projects/530963301545/locations/us/processors/b7e8a9fe78cf7e9c/processorVersions/pretrained-ocr-v2.1-2024-08-07'},\n", " {'alias': 'projects/530963301545/locations/us/processors/b7e8a9fe78cf7e9c/processorVersions/stable',\n", " 'processorVersion': 'projects/530963301545/locations/us/processors/b7e8a9fe78cf7e9c/processorVersions/pretrained-ocr-v2.0-2023-06-02'}]}\n", " \"\"\"\n", "\n", " defaultProcessorVersion = json_result[\"defaultProcessorVersion\"]\n", " print(\"Document Processor created: \", defaultProcessorVersion)\n", " return defaultProcessorVersion\n" ] }, { "cell_type": "markdown", "metadata": { "id": "c51M89g0Ejmz" }, "source": [ "### <font color='#4285f4'>MAIN CODE - Setup the Environment</font>" ] }, { "cell_type": "markdown", "metadata": { "id": "9nR11clrXJKN" }, "source": [ "- Create the bucket and copy data from a shared location\n", "- Create our BigLake / Vertex AI connection in BigQuery\n", "- The connection creates a service principal, so we will grant this principal access to our bucket and permission to call Vertex AI endpoints\n", "- Create our tables and load with data\n", "- Create our models in BigQuery" ] }, { "cell_type": "markdown", "metadata": { "id": "ET78odM2czZ7" }, "source": [ "##### Create our bucket and copy data for BigQuery / Storage" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "NtoSNrn7iK8y" }, "outputs": [], "source": [ "# Create a bucket\n", "bucket_name = project_id\n", "createGoogleCloudStorageBucket(project_id, bucket_name, location)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "VroWIGqvdREK" }, "outputs": [], "source": [ "# Copy our data (CSV files). 
We want the files in our own bucket, in our chosen location.\n", "source_path = \"gs://data-analytics-golden-demo/cymbal-consumer-finance/*\"\n", "dest_path = f\"gs://{bucket_name}/cymbal-consumer-finance/\"\n", "print(f\"Copying data from {source_path} to {dest_path}\")\n", "print(\"This may take a few minutes...\")\n", "!gsutil -m -q cp -r {source_path} {dest_path}\n", "print(\"Copy [data] is complete\")\n", "\n", "\n", "# Copy our data (PDF files). We want the files in our own bucket, in our chosen location.\n", "source_path = \"gs://data-analytics-golden-demo/cymbal-consumer-finance-pdfs/*\"\n", "dest_path = f\"gs://{bucket_name}/pdfs/\"\n", "print(f\"Copying data from {source_path} to {dest_path}\")\n", "print(\"This may take a few minutes...\")\n", "!gsutil -m -q cp -r {source_path} {dest_path}\n", "print(\"Copy [pdfs] is complete\")\n", "\n", "print(f\"To view the files: https://console.cloud.google.com/storage/browser/{bucket_name}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "gOEoFFTLc3Ey" }, "source": [ "##### Create the external BigQuery connection for BigLake / Vertex" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "U4A_7n4SEPNO" }, "outputs": [], "source": [ "# Create our connection for BigLake / Vertex AI\n", "\n", "connection_name = \"biglake_vertexai_connection\"\n", "biglake_vertexai_connection_serviceAccountId = createExternal_BigLake_VertexAI_RemoteFunctions_Connection(project_id, location, connection_name)\n", "print(f\"biglake_vertexai_connection_serviceAccountId: {biglake_vertexai_connection_serviceAccountId}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "7Lpnc37wc84V" }, "source": [ "##### Grant the service account created by the external connection IAM permissions" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "q7Fx6dC4nsEr" }, "outputs": [], "source": [ "# Grant the BigLake / Vertex AI External connection Service Principal permissions to call Vertex Models / 
Endpoints\n", "\n", "################################################################################################################################################\n", "# NOTE: You might need to wait a minute or two before running this. It can fail if you run this too quickly after creating the connection\n", "################################################################################################################################################\n", "\n", "# To call the Document AI API\n", "# Exception: {'reason': 'invalidQuery', 'location': 'query', 'message': \"Permission denied for document processor 'projects/530963301545/locations/us/\n", "#processors/processor_id'. Please ensure that (1) The processor 'processor_id' exists in project 530963301545, region 'us', and is active. (2)\n", "# The connection's service account bqcx-530963301545-z8r3@gcp-sa-bigquery-condel.iam.gserviceaccount.com has roles/documentai.viewer role\n", "#in the project 530963301545. More details: Permission 'documentai.processors.get' denied on resource '//documentai.googleapis.com/projects/\n", "# 530963301545/locations/us/processors/processor_id' (or it may not exist).\"}\n", "setProjectLevelIamPolicy(project_id, f\"serviceAccount:{biglake_vertexai_connection_serviceAccountId}\", \"roles/documentai.viewer\")\n", "\n", "\n", "# To call Vision API\n", "setProjectLevelIamPolicy(project_id, f\"serviceAccount:{biglake_vertexai_connection_serviceAccountId}\", \"roles/serviceusage.serviceUsageConsumer\")\n", "\n", "# To call GENERATE TEXT\n", "setProjectLevelIamPolicy(project_id, f\"serviceAccount:{biglake_vertexai_connection_serviceAccountId}\", \"roles/aiplatform.user\")\n", "\n", "# Grant the current user and the service principal for our BigLake / Vertex AI connection access to the bucket\n", "# We want our BigLake / Vertex AI 
connections to have access to the files in our bucket\n", "\n", "setBucketIamPolicy(bucket_name, f\"serviceAccount:{biglake_vertexai_connection_serviceAccountId}\", \"roles/storage.objectAdmin\")\n", "setBucketIamPolicy(bucket_name, f\"user:{user}\", \"roles/storage.admin\")" ] }, { "cell_type": "markdown", "metadata": { "id": "R-UtSlnudT6f" }, "source": [ "##### Create our BigQuery Dataset and populate it with data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "tiV5OHVDiZj9" }, "outputs": [], "source": [ "dataset_name = \"cymbal_consumer_finance\"\n", "\n", "sql = f\"\"\"\n", "CREATE SCHEMA IF NOT EXISTS {dataset_name} OPTIONS(location = '{location}');\n", "\"\"\"\n", "RunQuery(sql)\n", "\n", "sql = f\"\"\"LOAD DATA OVERWRITE `{dataset_name}.customers`\n", "(\n", " customer_id STRING,\n", " first_name STRING,\n", " last_name STRING,\n", " date_of_birth DATE,\n", " email STRING,\n", " phone_number STRING,\n", " creation_date DATE,\n", " life_event STRING\n", ")\n", "FROM FILES (format = 'CSV', skip_leading_rows = 1, uris = ['gs://{bucket_name}/cymbal-consumer-finance/ccf_csv_tables_customers.csv']);\n", "\"\"\"\n", "RunQuery(sql)\n", "\n", "sql = f\"\"\"\n", "LOAD DATA OVERWRITE `{dataset_name}.loan_applications`\n", "(\n", " application_id STRING,\n", " customer_id STRING,\n", " application_date DATE,\n", " product_type STRING,\n", " sub_product STRING,\n", " loan_amount FLOAT64,\n", " description STRING,\n", " application_status STRING,\n", " approval_date DATE,\n", " disbursement_date DATE,\n", " application_channel STRING,\n", " marketing_cost FLOAT64,\n", ")\n", "FROM FILES (format = 'CSV', skip_leading_rows = 1, uris = ['gs://{bucket_name}/cymbal-consumer-finance/ccf_csv_tables_loan_applications.csv']);\n", "\"\"\"\n", "RunQuery(sql)\n", "\n", "sql = f\"\"\"\n", "LOAD DATA OVERWRITE `{dataset_name}.loan_repayments`\n", "(\n", " repayment_id STRING,\n", " loan_id STRING,\n", " repayment_date DATE,\n", " amount_due FLOAT64,\n", " 
amount_paid FLOAT64,\n", " payment_status STRING,\n", " days_past_due INT64\n", ")\n", "FROM FILES (format = 'CSV', skip_leading_rows = 1, uris = ['gs://{bucket_name}/cymbal-consumer-finance/ccf_csv_tables_loan_repayments.csv']);\n", "\"\"\"\n", "RunQuery(sql)\n", "\n", "sql = f\"\"\"\n", "LOAD DATA OVERWRITE `{dataset_name}.marketing_costs`\n", "(\n", " cost_id STRING,\n", " channel STRING,\n", " product_type STRING,\n", " cost_per_lead FLOAT64,\n", ")\n", "FROM FILES (format = 'CSV', skip_leading_rows = 1, uris = ['gs://{bucket_name}/cymbal-consumer-finance/ccf_csv_tables_marketing_costs.csv']);\n", "\"\"\"\n", "RunQuery(sql)\n", "\n", "print(f\"You should now see a new dataset in BigQuery with several tables loaded with data.\")" ] }, { "cell_type": "markdown", "metadata": { "id": "kGB98KjgdZL2" }, "source": [ "##### Create our OCR and Layout Processors (to parse our PDF)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "EMXABUb8XYq-" }, "outputs": [], "source": [ "processor_name = \"vertex_ai_ocr_processor\"\n", "processor_type = \"OCR_PROCESSOR\"\n", "\n", "# In case you need to change options (you should delete and then recreate)\n", "# deleteDocumentProcessor(project_id, location, processor_name, processor_type)\n", "\n", "vertex_processor_name = createDocumentProcessor(project_id, location, processor_name, processor_type)\n", "print(f\"vertex_processor_name: {vertex_processor_name}\")\n", "\n", "# Layout Parser (for chunks)\n", "layout_processor_name = \"vertex_ai_layout_processor\"\n", "layout_processor_type = \"LAYOUT_PARSER_PROCESSOR\"\n", "\n", "layout_processor_name = createDocumentProcessor(project_id, location, layout_processor_name, layout_processor_type)\n", "print(f\"layout_processor_name: {layout_processor_name}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "gJiq8z2Bdmug" }, "source": [ "##### Create our BQML Models in our BigQuery Dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": 
"6b0Almk0l3We" }, "outputs": [], "source": [ "# Create the remote connection to each Vertex AI service. (Vision, Gemini Pro, Embeddings, etc.)\n", "# The models use the external connection\n", "\n", "#####################################################################################################################\n", "# NOTE: You might get an error that the processor does not exist (you should re-run this cell)\n", "#####################################################################################################################\n", "\n", "sql = f\"\"\"CREATE MODEL IF NOT EXISTS `{project_id}.{dataset_name}.layout-connection`\n", "REMOTE WITH CONNECTION `{project_id}.{location}.{connection_name}`\n", "OPTIONS (\n", " remote_service_type = 'cloud_ai_document_v1',\n", " document_processor='{layout_processor_name}');\n", "\"\"\"\n", "\n", "RunQuery(sql)\n", "\n", "sql = f\"\"\"CREATE MODEL IF NOT EXISTS `{project_id}.{dataset_name}.document-connection`\n", "REMOTE WITH CONNECTION `{project_id}.{location}.{connection_name}`\n", "OPTIONS (\n", " remote_service_type = 'cloud_ai_document_v1',\n", " document_processor='{vertex_processor_name}');\n", "\"\"\"\n", "\n", "RunQuery(sql)\n", "\n", "sql = f\"\"\"CREATE MODEL IF NOT EXISTS `{project_id}.{dataset_name}.vision-connection`\n", "REMOTE WITH CONNECTION `{project_id}.{location}.{connection_name}`\n", "OPTIONS (remote_service_type = 'cloud_ai_vision_v1');\"\"\"\n", "\n", "RunQuery(sql)\n", "\n", "print(f\"Created cloud_ai_vision_v1: {sql}\")\n", "\n", "sql = f\"\"\"CREATE MODEL IF NOT EXISTS `{project_id}.{dataset_name}.gemini_model`\n", "REMOTE WITH CONNECTION `{project_id}.{location}.{connection_name}`\n", "OPTIONS (endpoint = 'gemini-2.0-flash-001');\"\"\"\n", "\n", "RunQuery(sql)\n", "\n", "print(f\"Created gemini-2.0-flash-001: {sql}\")\n", "\n", "\n", "sql = f\"\"\"CREATE MODEL IF NOT EXISTS `{project_id}.{dataset_name}.textembedding_model`\n", "REMOTE WITH CONNECTION 
`{project_id}.{location}.{connection_name}`\n", "OPTIONS (endpoint = 'text-embedding-005');\"\"\"\n", "\n", "RunQuery(sql)\n", "\n", "print(f\"Created text-embedding-005: {sql}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "bcQvWgrUtHKi" }, "source": [ "### <font color='#4285f4'>MAIN CODE - Demo</font>" ] }, { "cell_type": "markdown", "metadata": { "id": "TntZXM82tqds" }, "source": [ "1. Create an Object table over a set of PDFs\n", "2. Force a refresh of the object table\n", "3. Process the PDFs with the OCR processor which creates one large text extract\n", "4. Process the PDFs with the Layout processor which will chunk our PDF text\n", "5. Create embeddings on the chunked PDF text\n", "6. Search the embeddings\n", "7. Use Gemini and the RAG pattern to use our embedding search results to answer a question" ] }, { "cell_type": "markdown", "metadata": { "id": "ji_BIY8tbvRn" }, "source": [ "##### Create an Object table over a set of PDFs" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "wrO6C18nBYfU" }, "outputs": [], "source": [ "# Create the object table over the files\n", "\n", "object_table_name = \"object_table_pdfs\"\n", "\n", "sql = f\"\"\"\n", "CREATE OR REPLACE EXTERNAL TABLE `{project_id}.{dataset_name}.{object_table_name}`\n", "WITH CONNECTION `{project_id}.{location}.{connection_name}`\n", "OPTIONS (\n", " object_metadata=\"DIRECTORY\",\n", " uris = ['gs://{bucket_name}/pdfs/*.pdf'],\n", " max_staleness=INTERVAL 30 MINUTE,\n", " metadata_cache_mode=\"MANUAL\"\n", " );\n", "\"\"\"\n", "\n", "RunQuery(sql)" ] }, { "cell_type": "markdown", "metadata": { "id": "mGOMbQ8TbmN3" }, "source": [ "##### Call the Refresh on the Object table so it picks up the files in storage" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8SrLe_b8Bz_h" }, "outputs": [], "source": [ "# Since the table is set to MANUAL refresh, refresh the table so we see the files\n", "\n", "sql = f\"CALL 
BQ.REFRESH_EXTERNAL_METADATA_CACHE('{project_id}.{dataset_name}.{object_table_name}')\"\n", "RunQuery(sql)\n", "\n", "# Show the data\n", "sql=f\"\"\"SELECT *\n", " FROM `{project_id}.{dataset_name}.{object_table_name}`\n", " LIMIT 20;\"\"\"\n", "\n", "result=RunQuery(sql)\n", "\n", "result" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "za8XkVcyDHgr" }, "outputs": [], "source": [ "# Get a signed url so we can show in this notebook\n", "\n", "sql=f\"\"\"SELECT *\n", " FROM EXTERNAL_OBJECT_TRANSFORM(TABLE `{project_id}.{dataset_name}.{object_table_name}`,['SIGNED_URL'])\n", " WHERE uri LIKE '%loan_application_0d2e87d5-6337-4fc6-b6ed-5e6f35df596b.pdf%'\"\"\"\n", "\n", "df=RunQuery(sql)\n", "\n", "for row in df.itertuples():\n", " uri = row.uri\n", " signed_url = row.signed_url\n", "\n", "print(f\"uri: {uri}\")\n", "print(f\"signed_url: {signed_url}\")\n", "\n", "print()\n", "print()\n", "print(\"Notice that we have handwriting and different types of fonts.\")\n", "print()\n", "print()\n", "\n", "# Chrome shows a warning\n", "#iframe = IPython.display.IFrame(src=signed_url, width=800, height=600)\n", "#display(iframe)\n", "\n", "html = f\"\"\"\n", "<object data=\"{signed_url}\" type=\"application/pdf\" width=\"800\" height=\"600\">\n", " <p>Your browser does not support inline PDFs.</p>\n", " alt : <a href=\"{signed_url}\" target=\"_blank\">Sample Loan PDF</a>\n", "</object>\n", "\"\"\"\n", "IPython.display.HTML(html)\n" ] }, { "cell_type": "markdown", "metadata": { "id": "JYbvXMHAbboJ" }, "source": [ "##### We want to process each PDF document in our object table (OCR Processor)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Wgu3sD2fDhpb" }, "outputs": [], "source": [ "# This is 1 big pdf text block using the OCR processor\n", "\n", "pdfs_document_ocr_processor_table_name = \"pdfs_document_ocr_processor\"\n", "\n", "sql = f\"\"\"\n", "CREATE OR REPLACE TABLE 
`{project_id}.{dataset_name}.{pdfs_document_ocr_processor_table_name}` AS (\n", " SELECT *\n", " FROM ML.PROCESS_DOCUMENT(MODEL `{project_id}.{dataset_name}.document-connection`,\n", " TABLE `{project_id}.{dataset_name}.{object_table_name}`)\n", ");\"\"\"\n", "\n", "RunQuery(sql)\n", "\n", "# Show the data\n", "sql=f\"\"\"SELECT *\n", " FROM `{project_id}.{dataset_name}.{pdfs_document_ocr_processor_table_name}`\n", " LIMIT 10;\"\"\"\n", "\n", "result=RunQuery(sql)\n", "\n", "result" ] }, { "cell_type": "markdown", "metadata": { "id": "PuetJH1-bRQf" }, "source": [ "##### We want to process each PDF document in our object table (Layout Processor)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ab4CWW0uoiVP" }, "outputs": [], "source": [ "# This produces chunks of text blocks using the layout processor\n", "# Without its built-in chunking, we would need to chunk the text \"manually\" (or semantically) before creating embeddings\n", "\n", "pdfs_document_layout_processor_table_name = \"pdfs_document_layout_processor\"\n", "process_options = '{\"layout_config\": {\"chunking_config\": {\"chunk_size\": 100}}}'\n", "\n", "sql = f\"\"\"\n", "CREATE OR REPLACE TABLE `{project_id}.{dataset_name}.{pdfs_document_layout_processor_table_name}` AS (\n", " SELECT *\n", " FROM ML.PROCESS_DOCUMENT(MODEL `{project_id}.{dataset_name}.layout-connection`,\n", " TABLE `{project_id}.{dataset_name}.{object_table_name}`,\n", " PROCESS_OPTIONS => (JSON '{process_options}')\n", " )\n", ");\"\"\"\n", "\n", "RunQuery(sql)\n", "\n", "# Show the data\n", "sql=f\"\"\"SELECT uri, ml_process_document_result\n", " FROM `{project_id}.{dataset_name}.{pdfs_document_layout_processor_table_name}`\n", " LIMIT 5;\"\"\"\n", "\n", "result=RunQuery(sql)\n", "\n", "result" ] }, { "cell_type": "markdown", "metadata": { "id": "_SO10nWmbF4X" }, "source": [ "##### Parse the JSON from the PDF extraction. Use BQ JSON functions."
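] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each element of the `chunkedDocument.chunks` array is a JSON object. As a rough sketch (the field names are the ones extracted in the next cell; the values are invented for illustration), one chunk looks like:\n", "\n", "```json\n", "{\n", "  \"chunkId\": \"c1\",\n", "  \"content\": \"I have always wanted to own a home ...\",\n", "  \"pageFooters\": [ { \"text\": \"Page 1 of 2\" } ],\n", "  \"pageSpan\": { \"pageStart\": 1, \"pageEnd\": 1 }\n", "}\n", "```\n", "\n", "The `JSON_EXTRACT_SCALAR` paths in the next cell flatten these fields into table columns."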
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "th6irG2wpZ4r" }, "outputs": [], "source": [ "pdfs_document_layout_processor_chunks_table_name = \"pdfs_document_layout_processor_chunks\"\n", "\n", "sql = f\"\"\"CREATE OR REPLACE TABLE `{project_id}.{dataset_name}.{pdfs_document_layout_processor_chunks_table_name}` AS\n", "SELECT uri,\n", " JSON_EXTRACT_SCALAR(json , '$.chunkId') AS chunk_id,\n", " JSON_EXTRACT_SCALAR(json , '$.content') AS content,\n", " JSON_EXTRACT_SCALAR(json , '$.pageFooters[0].text') AS page_footers_text,\n", " JSON_EXTRACT_SCALAR(json , '$.pageSpan.pageStart') AS page_span_start,\n", " JSON_EXTRACT_SCALAR(json , '$.pageSpan.pageEnd') AS page_span_end\n", " FROM `{project_id}.{dataset_name}.{pdfs_document_layout_processor_table_name}`,\n", " UNNEST(JSON_EXTRACT_ARRAY(ml_process_document_result.chunkedDocument.chunks, '$')) json\n", "\"\"\"\n", "\n", "RunQuery(sql)\n", "\n", "# Show the data\n", "sql=f\"\"\"SELECT *\n", " FROM `{project_id}.{dataset_name}.{pdfs_document_layout_processor_chunks_table_name}`\n", " ORDER BY uri, chunk_id\n", " LIMIT 10\"\"\"\n", "\n", "result=RunQuery(sql)\n", "\n", "result" ] }, { "cell_type": "markdown", "metadata": { "id": "uvnVPJmxadKq" }, "source": [ "##### Use ML.GENERATE_EMBEDDING to create embeddings for our extracted PDF text" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7cV8KmOUsfhx" }, "outputs": [], "source": [ "# In a real-life scenario, you would not pass in TABLE `{project_id}.{dataset_name}.{pdfs_document_layout_processor_chunks_table_name}`\n", "# You would pass in a query for items that do not already EXIST in the embedding table. 
The TABLE parameter could also be a SQL statement.\n", "\n", "# The \"content\" field will automatically be passed to the model as the column to embed\n", "\n", "\n", "pdfs_document_layout_processor_chunks_table_name = \"pdfs_document_layout_processor_chunks\"\n", "pdfs_document_embeddings_table_name = \"pdfs_document_embeddings\"\n", "\n", "sql = f\"\"\"\n", "CREATE OR REPLACE TABLE `{project_id}.{dataset_name}.{pdfs_document_embeddings_table_name}` AS\n", "SELECT uri,\n", " chunk_id,\n", " content,\n", " ml_generate_embedding_result as vector_embedding,\n", " ml_generate_embedding_statistics,\n", " ml_generate_embedding_status,\n", " FROM ML.GENERATE_EMBEDDING(MODEL `{project_id}.{dataset_name}.textembedding_model`,\n", " TABLE `{project_id}.{dataset_name}.{pdfs_document_layout_processor_chunks_table_name}`,\n", " STRUCT(\n", " TRUE AS flatten_json_output,\n", " 'SEMANTIC_SIMILARITY' as task_type,\n", " 768 AS output_dimensionality\n", " ))\n", "\"\"\"\n", "\n", "RunQuery(sql)" ] }, { "cell_type": "markdown", "metadata": { "id": "-SGNPMkFaU4i" }, "source": [ "##### Show our embedded data (first two elements of the embedding array)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "us8DiQ1O_fyw" }, "outputs": [], "source": [ "# Show the data\n", "\n", "sql=f\"\"\"SELECT uri, chunk_id, content, [vector_embedding[0],vector_embedding[1]] as vector_embedding, ml_generate_embedding_statistics, ml_generate_embedding_status\n", " FROM `{project_id}.{dataset_name}.{pdfs_document_embeddings_table_name}`\n", " ORDER BY uri, chunk_id\n", " LIMIT 10;\"\"\"\n", "\n", "result=RunQuery(sql)\n", "\n", "result" ] }, { "cell_type": "markdown", "metadata": { "id": "dYkX5N8QZ5U7" }, "source": [ "##### Search our embeddings table for a search string" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "iW9IjccE-E3J" }, "outputs": [], "source": [ "# The search results will contain all the data that is semantically close to our search string\n", "# 
The shorter the distance, the more pertinent the result\n", "\n", "vector_search_string = 'always wanted to own a home'\n", "\n", "options = '{\"fraction_lists_to_search\": 0.01}'\n", "\n", "sql = f\"\"\"SELECT base.uri as uri,\n", " base.chunk_id as chunk_id,\n", " base.content as content,\n", " distance\n", " FROM VECTOR_SEARCH(TABLE `{project_id}.{dataset_name}.{pdfs_document_embeddings_table_name}`,\n", " 'vector_embedding', -- column in table to search\n", " (SELECT ml_generate_embedding_result,\n", " content AS query\n", " FROM ML.GENERATE_EMBEDDING(MODEL `{project_id}.{dataset_name}.textembedding_model`,\n", " (SELECT '{vector_search_string}' AS content),\n", " STRUCT(TRUE AS flatten_json_output,\n", " 'SEMANTIC_SIMILARITY' as task_type,\n", " 768 AS output_dimensionality) -- struct\n", " ) -- question embedding\n", " ), -- vector search\n", " top_k => 10,\n", " OPTIONS => '{options}')\n", "ORDER BY distance;\n", "\"\"\"\n", "\n", "result = RunQuery(sql)\n", "\n", "result" ] }, { "cell_type": "markdown", "metadata": { "id": "juUG-XyPosrR" }, "source": [ "**Learning Item**\n", "- Change the search string for some different items" ] }, { "cell_type": "markdown", "metadata": { "id": "fNn9ZLKaZh-y" }, "source": [ "##### Search our PDFs (embeddings) and return the results as JSON" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8j6ZzTQQFDTN" }, "outputs": [], "source": [ "# Now let's return the results as a JSON string that we will inject into the context of Gemini\n", "# LLMs understand JSON, so we pair each piece of text with its source\n", "\n", "vector_search_string = 'always wanted to own a home'\n", "\n", "options = '{\"fraction_lists_to_search\": 0.01}'\n", "\n", "sql = f\"\"\"SELECT TO_JSON_STRING(STRUCT(base.uri as uri,\n", " base.chunk_id as chunk_id,\n", " base.content as content,\n", " distance as vector_search_distance)) as rag_json\n", " FROM VECTOR_SEARCH(TABLE 
`{project_id}.{dataset_name}.{pdfs_document_embeddings_table_name}`,\n", " 'vector_embedding', -- column in table to search\n", " (SELECT ml_generate_embedding_result,\n", " content AS query\n", " FROM ML.GENERATE_EMBEDDING(MODEL `{project_id}.{dataset_name}.textembedding_model`,\n", " (SELECT '{vector_search_string}' AS content),\n", " STRUCT(TRUE AS flatten_json_output,\n", " 'SEMANTIC_SIMILARITY' as task_type,\n", " 768 AS output_dimensionality) -- struct\n", " ) -- question embedding\n", " ), -- vector search\n", " top_k => 10,\n", " OPTIONS => '{options}')\n", "ORDER BY distance;\n", "\"\"\"\n", "\n", "result = RunQuery(sql)\n", "\n", "result" ] }, { "cell_type": "markdown", "metadata": { "id": "BuUaHRhoY14I" }, "source": [ "##### Call Gemini directly in BigQuery using ML.GENERATE_TEXT" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "eoJFefAPY0ZJ" }, "outputs": [], "source": [ "llm_prompt = 'What type of LLM are you?'\n", "\n", "\n", "sql = f\"\"\"SELECT *\n", " FROM ML.GENERATE_TEXT(MODEL`{project_id}.{dataset_name}.gemini_model`,\n", " (SELECT '{llm_prompt}' AS prompt),\n", " STRUCT(\n", " 0.8 AS temperature,\n", " 1024 AS max_output_tokens,\n", " 0.95 AS top_p,\n", " 40 AS top_k)\n", " )\n", "\"\"\"\n", "\n", "result = RunQuery(sql)\n", "\n", "result" ] }, { "cell_type": "markdown", "metadata": { "id": "6ak_Hi0PmyTU" }, "source": [ "**Learning Item**\n", "- Parse the returned JSON from Gemini\n", "- Tip: Create a UDF to parse the JSON, that way if Gemini changes the response JSON you only have 1 place to update your code." 
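] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# One possible sketch of the UDF tip above (the function name is illustrative)\n", "# Centralizing the JSON path means a Gemini response-format change is a one-line fix\n", "\n", "sql = f\"\"\"CREATE OR REPLACE FUNCTION `{project_id}.{dataset_name}.parse_gemini_text`(gemini_json JSON)\n", "RETURNS STRING\n", "AS (\n", "  JSON_VALUE(gemini_json, '$.candidates[0].content.parts[0].text')\n", ");\n", "\"\"\"\n", "\n", "RunQuery(sql)\n", "\n", "# Example usage (uncomment to try):\n", "# sql = f\"\"\"SELECT `{project_id}.{dataset_name}.parse_gemini_text`(ml_generate_text_result) AS llm_text\n", "#   FROM ML.GENERATE_TEXT(MODEL `{project_id}.{dataset_name}.gemini_model`, (SELECT 'Hello' AS prompt))\"\"\"\n", "# RunQuery(sql)"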
] }, { "cell_type": "markdown", "metadata": { "id": "APfwcaFMZVfv" }, "source": [ "##### Use Gemini (BQML) to process the data we retrieve from our Embedding Search to create a summary" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_Gona7f0CtuN" }, "outputs": [], "source": [ "# RAG Pattern\n", "\n", "vector_search_string = 'own a home'\n", "\n", "llm_prompt = \"\"\"Which people have wanted a house during the loan process?\n", "Cite the sources by explaining your results using the uri and the chunk_id.\n", "Quote the original text from the content.\n", "<context>\n", "REPLACE-ME-WITH-EMBEDDING-SEARCH-RESULTS\n", "</context>\n", "\"\"\"\n", "\n", "options = '{\"fraction_lists_to_search\": 0.01}'\n", "\n", "sql = f\"\"\"WITH embeddings_data AS\n", "(\n", "SELECT TO_JSON_STRING(STRUCT(base.uri as uri,\n", " base.chunk_id as chunk_id,\n", " base.content as content,\n", " distance as vector_search_distance)) as embeddings_json\n", " FROM VECTOR_SEARCH(TABLE `{project_id}.{dataset_name}.{pdfs_document_embeddings_table_name}`,\n", " 'vector_embedding', -- column in table to search\n", " (SELECT ml_generate_embedding_result,\n", " content AS query\n", " FROM ML.GENERATE_EMBEDDING(MODEL `{project_id}.{dataset_name}.textembedding_model`,\n", " (SELECT '{vector_search_string}' AS content),\n", " STRUCT(TRUE AS flatten_json_output,\n", " 'SEMANTIC_SIMILARITY' as task_type,\n", " 768 AS output_dimensionality) -- struct\n", " ) -- question embedding\n", " ), -- vector search\n", " top_k => 10,\n", " OPTIONS => '{options}')\n", "ORDER BY distance\n", ")\n", ", embeddings_array AS\n", "(\n", "SELECT ARRAY_AGG(embeddings_json) AS embeddings_json_array\n", " FROM embeddings_data\n", ")\n", "SELECT ml_generate_text_result.candidates[0].content.parts[0].text as llm_result\n", " FROM ML.GENERATE_TEXT(MODEL`{project_id}.{dataset_name}.gemini_model`,\n", " (SELECT REPLACE(\\\"\\\"\\\"{llm_prompt}\\\"\\\"\\\",\n", " 'REPLACE-ME-WITH-EMBEDDING-SEARCH-RESULTS',\n", 
" ARRAY_TO_STRING(embeddings_json_array, '\\\\n')) AS prompt\n", " FROM embeddings_array),\n", " STRUCT(\n", " 0.8 AS temperature,\n", " 2048 AS max_output_tokens,\n", " 0.95 AS top_p,\n", " 40 AS top_k)\n", " )\n", "\"\"\"\n", "\n", "#print(sql)\n", "\n", "result = RunQuery(sql)\n", "\n", "result" ] }, { "cell_type": "markdown", "metadata": { "id": "RVZ3NdwXnFrW" }, "source": [ "**Learning Item**\n", "- Change the search string and prompt\n", "- Change the temperature and other parameters" ] }, { "cell_type": "markdown", "metadata": { "id": "uqwZol_Qj7kM" }, "source": [ "#### Now let's ground our result using Google Search" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "R4Cyi3Vej77h" }, "outputs": [], "source": [ "# Without Grounding\n", "\n", "llm_prompt = 'What is the weather today in Los Angeles, California and what is the current date?'\n", "\n", "sql = f\"\"\"SELECT *\n", " FROM ML.GENERATE_TEXT(MODEL`{project_id}.{dataset_name}.gemini_model`,\n", " (SELECT '{llm_prompt}' AS prompt),\n", " STRUCT(\n", " 0.8 AS temperature,\n", " 1024 AS max_output_tokens,\n", " 0.95 AS top_p,\n", " 40 AS top_k)\n", " )\n", "\"\"\"\n", "\n", "result = RunQuery(sql)\n", "result" ] }, { "cell_type": "markdown", "metadata": { "id": "3nC0OdKSnf-H" }, "source": [ "**Learning Item**\n", "- Change the prompt so you get something false back."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "CcxMoVQckK2P" }, "outputs": [], "source": [ "# Grounded\n", "\n", "llm_prompt = 'What is the weather today in Los Angeles, California and what is the current date?'\n", "\n", "# NOTE: A breaking change has occurred and GENERATE_TEXT currently does not support grounding.\n", "\n", "sql = f\"\"\"SELECT *\n", " FROM ML.GENERATE_TEXT(MODEL`{project_id}.{dataset_name}.gemini_model`,\n", " (SELECT '{llm_prompt}' AS prompt),\n", " STRUCT(\n", " 0.8 AS temperature,\n", " 1024 AS max_output_tokens,\n", " 0.95 AS top_p,\n", " 40 AS top_k,\n", " TRUE AS ground_with_google_search)\n", " )\n", "\"\"\"\n", "\n", "result = RunQuery(sql)\n", "result" ] }, { "cell_type": "markdown", "metadata": { "id": "jyDNOlCinmEv" }, "source": [ "**Learning Item**\n", "- Change the prompt so you get something that is only accurate when grounded" ] }, { "cell_type": "markdown", "metadata": { "id": "lpUk4VvVgo8C" }, "source": [ "#### Now let's return our response in our own JSON schema" ] }, { "cell_type": "markdown", "metadata": { "id": "R9tkVRF-jdzP" }, "source": [ "##### Now let's output the result as formatted JSON" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "zmRsUIdejhMb" }, "outputs": [], "source": [ "# Grounded and return the results as formatted JSON\n", "\n", "# Previously you would tell the LLM in the prompt how to output JSON\n", "# Now we can pass in the schema in which we want our output\n", "# This means we can then place the results into a table or pass them back to an application in a structured format.\n", "\n", "llm_prompt = 'What is the weather today in Los Angeles, California and what is the current date?'\n", "llm_prompt = 'Who created this large language model? Tell me the company name, the model name (e.g. gemini-flash-?) 
and the model version.'\n", "\n", "response_schema = \"city STRING, state STRING, weather_result STRUCT< weather STRING, temperature FLOAT64>\"\n", "response_schema = \"company_name STRING, version_information STRUCT< version_number STRING, model_name STRING>\"\n", "\n", "# NOTE: This has changed since the recording \"SELECT ml_generate_text_result.candidates[0].content.parts[0] as json_response_schema\"\n", "# has become SELECT *\n", "\n", "# https://cloud.google.com/bigquery/docs/generate-table\n", "\n", "# NOTE: A breaking change has occurred and GENERATE_TABLE currently does not support grounding.\n", "\n", "sql = f\"\"\"SELECT *\n", " FROM AI.GENERATE_TABLE(MODEL`{project_id}.{dataset_name}.gemini_model`,\n", " (SELECT '{llm_prompt}' AS prompt),\n", " STRUCT(\n", " 0.8 AS temperature,\n", " 1024 AS max_output_tokens,\n", " 0.95 AS top_p,\n", " '{response_schema}' AS output_schema))\n", "\"\"\"\n", "\n", "result = RunQuery(sql)\n", "result" ] }, { "cell_type": "markdown", "metadata": { "id": "b3VXWg5TZj5-" }, "source": [ "**Learning Item**\n", "- Copy the SQL from the cell \"# RAG Pattern\" and have it output the data using AI.GENERATE_TABLE so we get the person's name, explanation, uri and chunk_id as separate fields." ] }, { "cell_type": "markdown", "metadata": { "id": "7TpCGFYBbSfa" }, "source": [ "**Final Thoughts**\n", "- The PDFs used in this demo are basically a \"form\" and you could use the Forms parser to parse the items. You can update the above code to use it instead of the layout parser."
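] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch of the Forms parser idea above (the processor name is illustrative; FORM_PARSER_PROCESSOR is the Document AI processor type)\n", "# You would then create a BQML remote model over this processor and call ML.PROCESS_DOCUMENT,\n", "# exactly as we did for the OCR and layout processors earlier in this notebook\n", "\n", "# form_processor_name = createDocumentProcessor(project_id, location, \"vertex_ai_form_processor\", \"FORM_PARSER_PROCESSOR\")\n", "# print(f\"form_processor_name: {form_processor_name}\")"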
] } ], "metadata": { "colab": { "collapsed_sections": [ "mzg8CwTmWxrK", "B2aqAu1sWxrL", "HMsUvoF4BP7Y", "d5Bx-skioi0s", "UmyL-Rg4Dr_f", "sZ6m_wGrK0YG", "JbOjdSP1kN9T", "bI-KJELZ1jgt", "W8E-HwcahSRu", "Fnw_9stBkgNP", "yu2yg46hn4e0", "ET78odM2czZ7", "gOEoFFTLc3Ey", "7Lpnc37wc84V", "R-UtSlnudT6f", "kGB98KjgdZL2", "gJiq8z2Bdmug", "ji_BIY8tbvRn", "mGOMbQ8TbmN3", "JYbvXMHAbboJ", "PuetJH1-bRQf", "_SO10nWmbF4X", "uvnVPJmxadKq", "-SGNPMkFaU4i", "dYkX5N8QZ5U7", "fNn9ZLKaZh-y", "BuUaHRhoY14I", "APfwcaFMZVfv", "R9tkVRF-jdzP" ], "name": "BQ Analytics with GenAI", "private_outputs": true, "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }