{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "2eJaZU5YAQUf"
},
"source": [
"## <font color='#4285f4'>Overview</font>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Spanner’s new Graph capabilities let us model our customer reviews and order history, along with a social graph of our social media followers, as a property graph, and derive new analytical insights from those complex relationships via efficient graph node traversal. This notebook also demonstrates simple data movement from BigQuery to Spanner via Reverse ETL, joining real-time ordering data with historical data via External Datasets, and performing vector similarity search with Gemini explanations.\n",
"\n",
"Process Flow: \n",
"1. Data Preparation:\n",
" - Spanner Setup: Creates a Spanner instance and database, defining node tables (e.g., customers, orders) and edge tables (e.g., customer_reviews_menu_item) based on schemas from BigQuery. A property graph encompassing these tables is defined.\n",
" - Data Loading: Data from BigQuery is transferred to Spanner using Reverse ETL, and social graph data is generated. \n",
"2. Recommendation and Analysis:\n",
" - Collaborative Filtering: Leverages the graph to find personalized product recommendations, considering preferences of similar customers, even when they dislike the same items.\n",
" - Brand Partner Discovery: Identifies influential customers followed by \"at-risk\" customers for potential marketing partnerships.\n",
"3. Social Network Visualization: Visually represents the social graph to understand customer connections.\n",
"4. Spanner & BigQuery Integration:\n",
" - External Dataset: Allows querying Spanner data directly from BigQuery using an external schema. Demonstrates analyzing order trends across both real-time data in Spanner and historical data in BigQuery.\n",
"5. AI Enhancements:\n",
" - Vertex AI integration: Creates remote AI models in Spanner for generating text embeddings and using a large language model (LLM).\n",
" - Vector Search: Demonstrates finding customers similar to a search phrase using both exact nearest neighbor (ENN) and approximate nearest neighbor (ANN) searches.\n",
"6. Gemini Explanations: Utilizes the LLM to provide insights into the relevance of vector search results.\n",
"\n",
"Cost:\n",
"* The Spanner instance configured with 100 processing units costs ~$3.00 a day. \n",
"* If experimenting with ANN vector search, Spanner requires 1000 processing units (1 node) minimum. Be sure to use the scale up and scale down functions provided in the notebook to optimize cost. \n",
"\n",
"Author:\n",
"* Paul Ramsey"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## <font color='#4285f4'>Video Overview</font>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Watch the video overview on YouTube](https://www.youtube.com/watch?v=SepF_1T133k)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HMsUvoF4BP7Y"
},
"source": [
"## <font color='#4285f4'>License</font>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jQgQkbOvj55d"
},
"source": [
"```\n",
"# Copyright 2024 Google LLC\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License.\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "UmyL-Rg4Dr_f"
},
"source": [
"## <font color='#4285f4'>Initialize</font>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "1bwZxbB-l8p7"
},
"source": [
"### Pip installs"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Qbr1rlyxl3P8"
},
"outputs": [],
"source": [
"! pip -q install pyvis"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Rj0rVQrQn-fk"
},
"source": [
"### Imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "xOYsEVSXp6IP"
},
"outputs": [],
"source": [
"from IPython.display import HTML\n",
"import google.auth\n",
"import random\n",
"import time\n",
"from datetime import datetime, timedelta, timezone\n",
"import pandas as pd\n",
"import re\n",
"from pyvis.network import Network\n",
"import networkx as nx\n",
"\n",
"from tenacity import retry, wait_exponential, stop_after_attempt, before_sleep_log, retry_if_exception\n",
"from google.cloud import bigquery\n",
"client = bigquery.Client()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "1NH22rCkqXqD"
},
"source": [
"### Variables"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "wMlHl3bnkFPZ"
},
"outputs": [],
"source": [
"# Update these variables to match your environment\n",
"location=\"us-central1\" # Your region\n",
"bigquery_location = \"${bigquery_location}\" # Must be \"us\" or \"eu\"\n",
"\n",
"### Do not change the values in this cell below this line ###\n",
"project_id = !(gcloud config get-value project)\n",
"user = !(gcloud auth list --filter=status:ACTIVE --format=\"value(account)\")\n",
"\n",
"if len(project_id) != 1:\n",
" raise RuntimeError(f\"project_id is not set: {project_id}\")\n",
"project_id = project_id[0]\n",
"\n",
"if len(user) == 0:\n",
" raise RuntimeError(f\"user is not set: {user}\")\n",
"user = user[0]\n",
"\n",
"project_number = !gcloud projects describe {project_id} --format=\"value(projectNumber)\"\n",
"project_number = str(project_number[0])\n",
"bucket_name = f\"bucket-{project_id}\"\n",
"\n",
"# Define Spanner and BQ variables\n",
"instance_id = f\"chocolate-ai-{project_id}\"\n",
"database_id = \"chocolate-ai\"\n",
"bq_dataset = \"${project_id}.${bigquery_chocolate_ai_dataset}\"\n",
"session = None\n",
"spanner_external_schema_name = 'chocolate_ai_spanner_external_schema'\n",
"\n",
"print(f\"project_id = {project_id}\")\n",
"print(f\"user = {user}\")\n",
"print(f\"location = {location}\")\n",
"print(f\"bigquery_location = {bigquery_location}\")\n",
"print(f\"bucket_name = {bucket_name}\")\n",
"print(f\"project_number = {project_number}\")\n",
"print(f\"instance_id = {instance_id}\")\n",
"print(f\"database_id = {database_id}\")\n",
"print(f\"bq_dataset = {bq_dataset}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "sZ6m_wGrK0YG"
},
"source": [
"### Define Helper Methods"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "yHzsHPzRwuI1"
},
"source": [
"#### RetryCondition(error)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "NX_kSNO5xH3e"
},
"outputs": [],
"source": [
"def RetryCondition(error):\n",
" error_string = str(error)\n",
" print(error_string)\n",
"\n",
" retry_errors = [\n",
" \"RESOURCE_EXHAUSTED\",\n",
" \"No content in candidate\",\n",
" # Add more error messages here as needed\n",
" ]\n",
"\n",
" for retry_error in retry_errors:\n",
" if retry_error in error_string:\n",
" print(\"Retrying...\")\n",
" return True\n",
"\n",
" return False"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "JbOjdSP1kN9T"
},
"source": [
"#### restAPIHelper\n",
"Calls the Google Cloud REST API using the current user's credentials."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "40wlwnY4kM11"
},
"outputs": [],
"source": [
"def restAPIHelper(url: str, http_verb: str, request_body: dict = {}, params: dict = None) -> dict:\n",
" \"\"\"Calls the Google Cloud REST API passing in the current user's credentials\"\"\"\n",
"\n",
" import requests\n",
" import google.auth\n",
" import json\n",
"\n",
" # Get an access token based upon the current user\n",
" creds, project = google.auth.default()\n",
" auth_req = google.auth.transport.requests.Request()\n",
" creds.refresh(auth_req)\n",
" access_token=creds.token\n",
"\n",
" # Get user information from credentials\n",
" user_info = getattr(creds, 'id_token', None)\n",
" if user_info:\n",
" #print(f\"Request made by user: {user_info['email']}\")\n",
" pass\n",
" else:\n",
" # If no user info, it's likely a service account\n",
" #print(f\"Request made by service account: {creds.service_account_email}\")\n",
" pass\n",
"\n",
"\n",
" headers = {\n",
" \"Authorization\" : \"Bearer \" + access_token,\n",
" \"Content-Type\": \"application/json\",\n",
" \"x-goog-user-project\": project_id # Required to workaround quota project bug\n",
" }\n",
"\n",
" if http_verb == \"GET\":\n",
" response = requests.get(url, headers=headers, params=params)\n",
" elif http_verb == \"POST\":\n",
" response = requests.post(url, headers=headers, data=json.dumps(request_body), params=params)\n",
" elif http_verb == \"PUT\":\n",
" response = requests.put(url, headers=headers, data=json.dumps(request_body), params=params)\n",
" elif http_verb == \"PATCH\":\n",
" response = requests.patch(url, headers=headers, data=json.dumps(request_body), params=params)\n",
" elif http_verb == \"DELETE\":\n",
" response = requests.delete(url, headers=headers, params=params)\n",
" else:\n",
" raise RuntimeError(f\"Unknown HTTP verb: {http_verb}\")\n",
"\n",
" if response.status_code == 200:\n",
" return json.loads(response.content)\n",
" else:\n",
" print(\"Request URL:\", response.request.url)\n",
" print(\"Request Headers:\", response.request.headers)\n",
" print(\"Request Body:\", response.request.body)\n",
" error = f\"Error restAPIHelper -> Status: '{response.status_code}' Text: '{response.text}'\"\n",
" raise RuntimeError(error)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Wp6zIq-3M86P"
},
"source": [
"#### RunBQQuery()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "BbDTEwYWM4Vv"
},
"outputs": [],
"source": [
"def RunBQQuery(sql, job_config = None):\n",
" import time\n",
"\n",
" if sql.strip().upper().startswith((\"SELECT\", \"WITH\")):\n",
" df_result = client.query(sql).to_dataframe()\n",
" return df_result\n",
" else:\n",
" if job_config is None:\n",
" job_config = bigquery.QueryJobConfig(priority=bigquery.QueryPriority.INTERACTIVE)\n",
" query_job = client.query(sql, job_config=job_config)\n",
"\n",
" # Check on the progress by getting the job's updated state.\n",
" query_job = client.get_job(\n",
" query_job.job_id, location=query_job.location\n",
" )\n",
" print(\"Job {} is currently in state {} with error result of {}\".format(query_job.job_id, query_job.state, query_job.error_result))\n",
"\n",
" while query_job.state != \"DONE\":\n",
" time.sleep(2)\n",
" query_job = client.get_job(\n",
" query_job.job_id, location=query_job.location\n",
" )\n",
" print(\"Job {} is currently in state {} with error result of {}\".format(query_job.job_id, query_job.state, query_job.error_result))\n",
"\n",
" if query_job.error_result is None:\n",
" return True\n",
" else:\n",
" return False"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "v7oUYvIYuc8s"
},
"source": [
"#### GetSpannerSessions()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "6z3Ny5fHuaqQ"
},
"outputs": [],
"source": [
"def GetSpannerSessions(project_id = project_id, instance_id = instance_id, database_id = \"chocolate-ai\"):\n",
"\n",
" # Get session\n",
" uri = f\"https://spanner.googleapis.com/v1/projects/{project_id}/instances/{instance_id}/databases/{database_id}/sessions\"\n",
" response = restAPIHelper(uri, \"GET\")\n",
" return response"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "P10KeztXxVrk"
},
"source": [
"#### CreateSpannerSession()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "qa4xYix9xVj0"
},
"outputs": [],
"source": [
"# https://cloud.google.com/spanner/docs/reference/rest/v1/projects.instances.databases.sessions/create\n",
"def CreateSpannerSession(project_id = project_id, instance_id = instance_id, database_id = \"chocolate-ai\"):\n",
" print(\"Creating a new Spanner session.\")\n",
"\n",
" # Create a new session\n",
" uri = f\"https://spanner.googleapis.com/v1/projects/{project_id}/instances/{instance_id}/databases/{database_id}/sessions\"\n",
" params = {\n",
" \"database\": f\"projects/{project_id}/instances/{instance_id}/databases/{database_id}\"\n",
" }\n",
" response = restAPIHelper(uri, \"POST\", {}, params)\n",
" return response['name']"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "a6sPgWROvb7x"
},
"source": [
"#### CloseSpannerSession()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "P4ldSF14vb0I"
},
"outputs": [],
"source": [
"# https://cloud.google.com/spanner/docs/reference/rest/v1/projects.instances.databases.sessions/delete\n",
"def CloseSpannerSession(session, project_id = project_id, instance_id = instance_id, database_id = \"chocolate-ai\"):\n",
" \"\"\"\n",
" Example:\n",
" response = GetSpannerSessions()\n",
" for session in response['sessions']:\n",
" CloseSpannerSession(session['name'])\n",
" \"\"\"\n",
"\n",
" uri = f\"https://spanner.googleapis.com/v1/{session}\"\n",
" response = restAPIHelper(uri, \"DELETE\", {}, {\"name\": f\"{session}\"})\n",
" return response"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "7SvEE1s2aiZv"
},
"source": [
"#### RunSpannerQuery()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "lsB-p5KNaiST"
},
"outputs": [],
"source": [
"def RunSpannerQuery(sql, database_id = \"chocolate-ai\", query_options=None, create_new_session=False):\n",
" \"\"\"\n",
" Runs a Spanner query and returns the result.\n",
"\n",
" Args:\n",
" sql: The SQL query to execute.\n",
" query_options: (Optional) A dictionary of advanced query options.\n",
" See https://cloud.google.com/spanner/docs/reference/rest/v1/projects.instances.databases.sessions/executeSql#queryoptions\n",
" for available options.\n",
"\n",
" Returns:\n",
" A dictionary containing the query results.\n",
"\n",
" Ref:\n",
" https://cloud.google.com/spanner/docs/reference/rest/v1/projects.instances.databases.sessions/executeSql\n",
" \"\"\"\n",
" # Ensure a session exists\n",
" # Create session\n",
" global session\n",
" if not session or create_new_session == True:\n",
" session = CreateSpannerSession()\n",
"\n",
" # Initialize response vars\n",
" commit_response = \"\"\n",
" response = \"\"\n",
"\n",
" # Construct the request URL\n",
" uri = f\"https://spanner.googleapis.com/v1/{session}:executeSql\"\n",
"\n",
" # Set transaction type (readOnly/readWrite) and transaction object with commit type (begin/singleUse)\n",
" transaction_type = \"readWrite\" if any(x in sql.lower() for x in [\"insert\", \"update\", \"delete\"]) else \"readOnly\"\n",
" transaction = {\"begin\": {\"readWrite\": {}}} if transaction_type == \"readWrite\" else {\"singleUse\": {\"readOnly\": {}}}\n",
"\n",
" request_body = {\n",
" \"sql\": sql,\n",
" \"transaction\": transaction\n",
" }\n",
" params = {\n",
" \"session\": session\n",
" }\n",
"\n",
" if query_options:\n",
" request_body[\"queryOptions\"] = query_options\n",
"\n",
" try:\n",
" # Make the request\n",
" response = restAPIHelper(uri, \"POST\", request_body=request_body, params = params)\n",
"\n",
" except RuntimeError as e:\n",
" if \"Session not found\" in str(e):\n",
" print(f\"Session not found. Creating a new session and retrying the query...\")\n",
" return RunSpannerQuery(sql, database_id, query_options, create_new_session=True) # Retry with a new session\n",
" else:\n",
" raise # Re-raise the exception if it's not a \"Session not found\" error\n",
"\n",
" # Commit transaction if read/write\n",
" if transaction_type == \"readWrite\":\n",
" uri = f\"https://spanner.googleapis.com/v1/{session}:commit\"\n",
" params = {\n",
" \"session\": session\n",
" }\n",
" commit_response = restAPIHelper(uri, \"POST\", {\"transactionId\": response['metadata']['transaction']['id']}, params)\n",
" print(f\"commit_response: {commit_response}\")\n",
"\n",
" # Return a DataFrame if the statement returns rows\n",
" if (sql.lower().startswith((\"select\", \"with\", \"graph\"))):\n",
" columns = [field.get('name', 'unnamed_column') for field in response['metadata']['rowType']['fields']]\n",
"\n",
" # Create DataFrame from rows\n",
" if 'rows' in response:\n",
" df = pd.DataFrame(response['rows'], columns=columns)\n",
" return df\n",
" else:\n",
" return response\n",
"\n",
" else:\n",
" # Return the query results\n",
" return response"
]
},
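{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal, self-contained sketch of the read/write detection used by `RunSpannerQuery` above (the helper name `classify_transaction` is illustrative and not used elsewhere in this notebook). Because it is a substring check, a SELECT that mentions a column such as `last_update` would be classified as a write, which is harmless here since it only forces a readWrite transaction."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch of RunSpannerQuery's transaction-type detection\n",
"def classify_transaction(sql):\n",
"  # Substring check: any DML keyword anywhere in the statement forces readWrite\n",
"  return \"readWrite\" if any(x in sql.lower() for x in (\"insert\", \"update\", \"delete\")) else \"readOnly\"\n",
"\n",
"print(classify_transaction(\"SELECT * FROM cai_customer\")) # readOnly\n",
"print(classify_transaction(\"INSERT INTO cai_customer (customer_id) VALUES (1)\")) # readWrite"
]
},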
{
"cell_type": "markdown",
"metadata": {
"id": "JAQutwtLVER7"
},
"source": [
"#### RunSpannerDDL()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Z48QpdiMVEHx"
},
"outputs": [],
"source": [
"def RunSpannerDDL(ddl_array, project_id = project_id, instance_id = instance_id, database_id = database_id):\n",
" # Apply DDL statements (e.g. CREATE TABLE) to the Spanner database and wait for the operation to finish\n",
" # https://cloud.google.com/spanner/docs/reference/rest/v1/projects.instances.databases/updateDdl\n",
"\n",
" uri = f\"https://spanner.googleapis.com/v1/projects/{project_id}/instances/{instance_id}/databases/{database_id}/ddl\"\n",
" http_verb = \"PATCH\"\n",
" request_body = {\n",
" \"statements\": ddl_array\n",
" }\n",
"\n",
" response = restAPIHelper(uri, http_verb, request_body)\n",
"\n",
" operation_name = response['name']\n",
" uri = f\"https://spanner.googleapis.com/v1/{operation_name}\"\n",
"\n",
" while True:\n",
" response = restAPIHelper(uri, \"GET\", {})\n",
" if response.get(\"done\", False):\n",
" if response.get(\"error\"):\n",
" print(response.get(\"error\"))\n",
" else:\n",
" print(\"Operation completed successfully.\")\n",
" break\n",
" else:\n",
" print(\"Operation not completed yet.\")\n",
" time.sleep(2)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "E00vdLyeCIz6"
},
"source": [
"#### ScaleSpannerInstance()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "s2d72s0ICIZC"
},
"outputs": [],
"source": [
"def ScaleSpannerInstance(processing_units, instance_id = instance_id, project_id = project_id):\n",
" uri = f\"https://spanner.googleapis.com/v1/projects/{project_id}/instances/{instance_id}\"\n",
" http_verb = \"PATCH\"\n",
" request_body = {\n",
" \"instance\": {\n",
" \"config\": f\"projects/{project_id}/instanceConfigs/regional-us-central1\",\n",
" \"processingUnits\": processing_units\n",
" },\n",
" \"fieldMask\": \"processingUnits\"\n",
" }\n",
"\n",
" response = restAPIHelper(uri, http_verb, request_body)\n",
" print(response)\n",
"\n",
" operation_name = response['name']\n",
" uri = f\"https://spanner.googleapis.com/v1/{operation_name}\"\n",
"\n",
" while True:\n",
" response = restAPIHelper(uri, \"GET\", {})\n",
" if response.get(\"done\", False):\n",
" if response.get(\"error\"):\n",
" print(response.get(\"error\"))\n",
" else:\n",
" print(\"Operation completed successfully.\")\n",
" break\n",
" else:\n",
" print(\"Operation not completed yet.\")\n",
" time.sleep(2)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6u-YtwaEWsue"
},
"source": [
"#### RunReverseETL()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "41PmgH0BWsk0"
},
"outputs": [],
"source": [
"def RunReverseETL(source_table, project_id = project_id, instance_id = instance_id, database_id = database_id):\n",
" print(f\"Running reverse ETL for table {source_table}...\")\n",
"\n",
" spanner_options = f\"{{'table': 'cai_{source_table}'}}\"\n",
" spanner_options = spanner_options.replace(\"'\", '\"')\n",
"\n",
" export_statement = f\"\"\"EXPORT DATA OPTIONS (\n",
" uri='https://spanner.googleapis.com/projects/{project_id}/instances/{instance_id}/databases/{database_id}',\n",
" format='CLOUD_SPANNER',\n",
" spanner_options='''{spanner_options}'''\n",
" )\n",
" AS SELECT * FROM `{bq_dataset}.{source_table}`;\"\"\"\n",
"\n",
" result = RunBQQuery(export_statement)\n",
" return result"
]
},
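{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `spanner_options` value in `RunReverseETL` must be a JSON object with double quotes, which the function builds via a manual quote substitution. An equivalent sketch using `json.dumps` (the helper name `build_spanner_options` is illustrative):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"# Sketch: produce the double-quoted JSON that EXPORT DATA's spanner_options option expects\n",
"def build_spanner_options(source_table, prefix=\"cai_\"):\n",
"  return json.dumps({\"table\": f\"{prefix}{source_table}\"})\n",
"\n",
"print(build_spanner_options(\"order\")) # {\"table\": \"cai_order\"}"
]
},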
{
"cell_type": "markdown",
"metadata": {
"id": "KejAZTVed06m"
},
"source": [
"#### GenerateFollowerData()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "0PJ4MX3pd0t-"
},
"outputs": [],
"source": [
"def GenerateFollowerData(user_ids, follower_ids, influencer_ids,\n",
" min_followers, max_followers,\n",
" min_influencer_followers, max_influencer_followers,\n",
" mutation_limit):\n",
" \"\"\"\n",
" Generates and inserts social media follower data into a database.\n",
"\n",
" Args:\n",
" user_ids: List of user IDs to generate follower data for.\n",
" follower_ids: List of potential follower IDs.\n",
" influencer_ids: List of potential influencer IDs to follow.\n",
" min_followers: Minimum number of regular followers for each user.\n",
" max_followers: Maximum number of regular followers for each user.\n",
" min_influencer_followers: Minimum number of influencer followers for each user.\n",
" max_influencer_followers: Maximum number of influencer followers for each user.\n",
" mutation_limit: Maximum number of mutations allowed per transaction.\n",
" \"\"\"\n",
" def get_random_date(start_date, end_date):\n",
" time_between_dates = end_date - start_date\n",
" days_between_dates = time_between_dates.days\n",
"\n",
" random_number_of_days = random.randrange(days_between_dates)\n",
" random_date = start_date + timedelta(days=random_number_of_days)\n",
"\n",
" formatted_date = random_date.strftime('%Y-%m-%d') # Convert to string\n",
" return formatted_date\n",
"\n",
" for i in range(len(user_ids)):\n",
" print(f\"Processing user: ({i + 1} of {len(user_ids)})\")\n",
"\n",
" # Get number of followers and influencer followers\n",
" num_followers = random.randint(min_followers, max_followers)\n",
" num_influencer_followers = random.randint(min_influencer_followers, max_influencer_followers)\n",
"\n",
" # Get array of random non-influencers who follow users\n",
" random_follower_ids = random.sample(follower_ids, min(num_followers, len(follower_ids)))\n",
"\n",
" # Get array of random influencers who follow users\n",
" random_follower_ids.extend(random.sample(influencer_ids, min(num_influencer_followers, len(influencer_ids))))\n",
"\n",
" # Define random date range\n",
" date_range_start = datetime(2020, 1, 1)\n",
" date_range_end = datetime(2024, 10, 4)\n",
"\n",
" # Track mutations\n",
" mutation_count = 0\n",
" mutations_per_insert = 3\n",
"\n",
" # Build sql\n",
" sql = f\"\"\"INSERT OR UPDATE INTO cai_edge_customer_follows_customer (customer_id, followed_customer_id, follow_date) VALUES \"\"\"\n",
" for j in range(len(random_follower_ids)):\n",
" random_date = get_random_date(date_range_start, date_range_end)\n",
" sql += f\"\"\"({user_ids[i]}, {random_follower_ids[j]}, '{random_date}'),\"\"\"\n",
" mutation_count += mutations_per_insert\n",
"\n",
" # Run the insert if mutation count approaches max mutations per transaction\n",
" if j == len(random_follower_ids) - 1 or mutation_count >= mutation_limit - mutations_per_insert:\n",
" print(f\"Mutation count: {mutation_count}\")\n",
" result = RunSpannerQuery(sql[:-1]) # Trim trailing comma; RunSpannerQuery manages its own session\n",
" print(result)\n",
" sql = f\"\"\"INSERT OR UPDATE INTO cai_edge_customer_follows_customer (customer_id, followed_customer_id, follow_date) VALUES \"\"\"\n",
" mutation_count = 0\n",
"\n",
" print(\"Finished generating follower data.\")"
]
},
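{
"cell_type": "markdown",
"metadata": {},
"source": [
"`GenerateFollowerData` batches its INSERTs because Spanner caps the number of mutations per commit, and each 3-column row counts as 3 mutations. A self-contained sketch of that batching arithmetic (the helper name `batch_rows` is illustrative, and the 80,000 default reflects Spanner's documented cap at the time of writing; verify against the current quota documentation):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch of the mutation batching used in GenerateFollowerData\n",
"def batch_rows(num_rows, mutation_limit=80000, mutations_per_insert=3):\n",
"  \"\"\"Split num_rows inserts into batch sizes that stay under the mutation limit.\"\"\"\n",
"  rows_per_batch = max(1, mutation_limit // mutations_per_insert)\n",
"  batches = []\n",
"  start = 0\n",
"  while start < num_rows:\n",
"    end = min(start + rows_per_batch, num_rows)\n",
"    batches.append(end - start)\n",
"    start = end\n",
"  return batches\n",
"\n",
"print(batch_rows(100000)[:2], len(batch_rows(100000))) # [26666, 26666] 4"
]
},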
{
"cell_type": "markdown",
"metadata": {
"id": "cSWQmpaWm5pw"
},
"source": [
"## <font color='#4285f4'>Setup Spanner Graph</font>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "uVFcNTLIoCwV"
},
"source": [
"### Create Spanner Instance and Database"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "CbaQ0vRFEJga"
},
"outputs": [],
"source": [
"# Enable the Spanner API\n",
"uri = f\"https://serviceusage.googleapis.com/v1/projects/{project_number}/services/spanner.googleapis.com:enable\"\n",
"\n",
"response = restAPIHelper(uri, \"POST\", {})\n",
"\n",
"try:\n",
" print(response['response']['service']['state'])\n",
"except:\n",
" print(response)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "HCwfIu1-9S0N"
},
"outputs": [],
"source": [
"# Create the Spanner instance\n",
"# This notebook creates a paid instance. 90-day Free trial available once per project lifecycle\n",
"# https://cloud.google.com/spanner/docs/reference/rest/v1/projects.instances/create\n",
"uri = f\"https://spanner.googleapis.com/v1/projects/{project_id}/instances\"\n",
"http_verb = \"POST\"\n",
"request_body = {\n",
" \"instance\": {\n",
" \"config\": f\"projects/{project_id}/instanceConfigs/regional-us-central1\",\n",
" \"displayName\": \"Chocolate AI Spanner\",\n",
" \"edition\": \"ENTERPRISE\",\n",
" \"processingUnits\": 100,\n",
"\n",
" # OPTIONAL: Define nodeCount instead of processingUnits or autoscalingConfig.\n",
" #\"nodeCount\": 1,\n",
"\n",
" # OPTIONAL: Define autoscalingConfig instead of nodeCount or processingUnits.\n",
" #\"autoscalingConfig\": {\n",
" # \"autoscalingLimits\": {\n",
" # \"minProcessingUnits\": 1000,\n",
" # \"maxProcessingUnits\": 2000\n",
" # },\n",
" # \"autoscalingTargets\": {\n",
" # \"highPriorityCpuUtilizationPercent\": 80,\n",
" # \"storageUtilizationPercent\": 80\n",
" # }\n",
" #}\n",
" },\n",
" \"instanceId\": f\"{instance_id}\"\n",
"}\n",
"\n",
"response = restAPIHelper(uri, http_verb, request_body)\n",
"response"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "E2qL60lgY_Ma"
},
"outputs": [],
"source": [
"# Create a database\n",
"# https://cloud.google.com/spanner/docs/reference/rest/v1/projects.instances.databases/create#try-it\n",
"\n",
"uri = f\"https://spanner.googleapis.com/v1/projects/{project_id}/instances/{instance_id}/databases\"\n",
"http_verb = \"POST\"\n",
"request_body = {\n",
" \"createStatement\": f\"CREATE DATABASE `{database_id}`\"\n",
"}\n",
"\n",
"response = restAPIHelper(uri, http_verb, request_body)\n",
"response"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create AI Models in Spanner"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create an Embeddings Model and an LLM Model\n",
"# Ref: https://codelabs.developers.google.com/codelabs/spanner-getting-started-vector-search#3\n",
"# https://cloud.google.com/spanner/docs/ml-tutorial-embeddings\n",
"ddl_array = []\n",
"ddl_array.append(f\"\"\"CREATE MODEL IF NOT EXISTS EmbeddingsModel INPUT(\n",
" content STRING(MAX),\n",
" ) OUTPUT(\n",
" embeddings STRUCT<statistics STRUCT<truncated BOOL, token_count FLOAT64>, values ARRAY<FLOAT64>>,\n",
" ) REMOTE OPTIONS (\n",
" endpoint = '//aiplatform.googleapis.com/projects/{project_id}/locations/us-central1/publishers/google/models/text-embedding-005'\n",
" )\n",
"\"\"\")\n",
"\n",
"ddl_array.append(f\"\"\"CREATE MODEL IF NOT EXISTS LLMModel INPUT(\n",
"prompt STRING(MAX),\n",
") OUTPUT(\n",
"content STRING(MAX),\n",
") REMOTE OPTIONS (\n",
"endpoint = '//aiplatform.googleapis.com/projects/{project_id}/locations/us-central1/publishers/google/models/gemini-2.0-flash',\n",
"default_batch_size = 1\n",
")\"\"\")\n",
"\n",
"RunSpannerDDL(ddl_array)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "UJdsLcg9oHU2"
},
"source": [
"### Create Node Tables"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "UDQb4Hj8c1LD"
},
"outputs": [],
"source": [
"# Get source table definitions from BQ and translate to Spanner syntax\n",
"\n",
"## Get DDL from BQ\n",
"sql = f\"\"\"SELECT ddl\n",
" FROM `{bq_dataset}.INFORMATION_SCHEMA.TABLES`\n",
" WHERE table_type = 'BASE TABLE'\n",
" AND table_name IN (\n",
" 'customer',\n",
" 'customer_marketing_profile',\n",
" 'customer_review',\n",
" 'menu',\n",
" 'order',\n",
" 'order_item',\n",
" 'store'\n",
" );\"\"\"\n",
"result = RunBQQuery(sql)\n",
"\n",
"# Transform BQ DDL into Spanner dialect.\n",
"ddl_array = []\n",
"for row in result.itertuples():\n",
" ddl = row.ddl\n",
" if \"CLUSTER BY\" in ddl:\n",
" ddl = ddl.replace(\"CLUSTER BY \", \"PRIMARY KEY(\")\n",
" ddl = ddl.replace(\";\", \");\")\n",
"\n",
" ddl = ddl.replace(f\"{bq_dataset}.\",f\"cai_\")\n",
" ddl = re.sub(r\"[\\r\\n]+\", \"\", ddl)\n",
" ddl = ddl.replace(\" \", \" \")\n",
" ddl = ddl.replace(\"( \", \"(\")\n",
" ddl = ddl.replace(\") \", \")\")\n",
" ddl = ddl.replace(\"STRING\", \"STRING(MAX)\")\n",
" ddl = ddl.replace(\";\", \"\")\n",
" ddl = ddl.replace(\"DEFAULT GENERATE_UUID()\", \"DEFAULT (GENERATE_UUID())\")\n",
" ddl = ddl.replace(\"DEFAULT CURRENT_TIMESTAMP()\", \"DEFAULT (CURRENT_TIMESTAMP())\")\n",
" ddl = ddl.replace(\"DATETIME\", \"TIMESTAMP\")\n",
"\n",
" #ddl = ddl.replace(\"ARRAY<FLOAT64>\",\"ARRAY<FLOAT64>(vector_length=>768)\")\n",
"\n",
" # Replace those with escaped quotes, newlines, and nested parentheses\n",
" ddl = re.sub(r\"OPTIONS\\s*\\((?:[^()]|\\([^()]*\\))*\\)\", \"\", ddl)\n",
"\n",
" if \"PRIMARY KEY\" not in ddl:\n",
" # Extract the first column name using a regular expression\n",
" first_column_match = re.search(r\"CREATE TABLE `.*`\\((\\S+)\", ddl)\n",
" if first_column_match:\n",
" first_column = first_column_match.group(1)\n",
" print(f\"Adding primary key: {first_column}\")\n",
" # Add PRIMARY KEY constraint to the first column\n",
" ddl = ddl + f\" PRIMARY KEY({first_column})\" # Append a PRIMARY KEY clause on the first column\n",
"\n",
" ddl = ddl.replace(\"CREATE TABLE\",\"CREATE TABLE IF NOT EXISTS\")\n",
" ddl_array.append(ddl)\n",
"\n",
" # Print the transformed DDL\n",
" print(ddl)\n",
"\n",
"# Define the customer_360 table\n",
"# This is a special case, because customer_360 is a view in BQ with\n",
"# incompatible data types, so it's easier to create this as a one off.\n",
"ddl_array.append(f\"\"\"CREATE TABLE IF NOT EXISTS cai_customer_360 (\n",
" customer_id INT64 NOT NULL,\n",
" customer_marketing_insights STRING(MAX),\n",
" benefits_sought STRING(MAX),\n",
" browsing_behavior STRING(MAX),\n",
" loyalty_status STRING(MAX),\n",
" occasion_timing STRING(MAX),\n",
" purchase_history STRING(MAX),\n",
" spending_habits STRING(MAX),\n",
" usage_frequency STRING(MAX),\n",
" user_status STRING(MAX),\n",
" at_risk_customers STRING(MAX),\n",
" first_time_customers STRING(MAX),\n",
" former_customers STRING(MAX),\n",
" inactive_customers STRING(MAX),\n",
" loyal_advocates STRING(MAX),\n",
" new_leads STRING(MAX),\n",
" potential_customers STRING(MAX),\n",
" repeat_customers STRING(MAX),\n",
" age STRING(MAX),\n",
" education STRING(MAX),\n",
" ethnicity STRING(MAX),\n",
" family_size STRING(MAX),\n",
" gender STRING(MAX),\n",
" generation STRING(MAX),\n",
" income STRING(MAX),\n",
" language STRING(MAX),\n",
" marital_status STRING(MAX),\n",
" occupation STRING(MAX),\n",
" city STRING(MAX),\n",
" climate STRING(MAX),\n",
" country STRING(MAX),\n",
" population_density STRING(MAX),\n",
" region STRING(MAX),\n",
" time_zone STRING(MAX),\n",
" urban_rural STRING(MAX),\n",
" challenges STRING(MAX),\n",
" goals STRING(MAX),\n",
" pain_points STRING(MAX),\n",
" priorities STRING(MAX),\n",
" specific_needs STRING(MAX),\n",
" attitudes STRING(MAX),\n",
" hobbies STRING(MAX),\n",
" interests STRING(MAX),\n",
" lifestyle STRING(MAX),\n",
" motivations STRING(MAX),\n",
" personality STRING(MAX),\n",
" social_class STRING(MAX),\n",
" customer_values STRING(MAX),\n",
" adoption_rate STRING(MAX),\n",
" browsers STRING(MAX),\n",
" devices STRING(MAX),\n",
" internet_connectivity STRING(MAX),\n",
" operating_systems STRING(MAX),\n",
" social_media_platforms STRING(MAX),\n",
" software STRING(MAX),\n",
" tech_savviness STRING(MAX),\n",
" cost_benefit_analysis STRING(MAX),\n",
" perceived_value STRING(MAX),\n",
" price_sensitivity STRING(MAX),\n",
" willingness_to_pay STRING(MAX),\n",
" children STRING(MAX),\n",
" chocolate_preferences STRING(MAX),\n",
" content_interaction STRING(MAX),\n",
" customer_age INT64,\n",
" facebook_bio STRING(MAX),\n",
" facebook_engagement STRING(MAX),\n",
" facebook_handle STRING(MAX),\n",
" instagram_bio STRING(MAX),\n",
" instagram_engagement STRING(MAX),\n",
" instagram_handle STRING(MAX),\n",
" linkedin_bio STRING(MAX),\n",
" linkedin_engagement STRING(MAX),\n",
" linkedin_handle STRING(MAX),\n",
" martial_status STRING(MAX),\n",
" solicated_buying_habits STRING(MAX),\n",
" sports STRING(MAX),\n",
" tiktok_bio STRING(MAX),\n",
" tiktok_handle STRING(MAX),\n",
" twitter_bio STRING(MAX),\n",
" twitter_engagement STRING(MAX),\n",
" twitter_handle STRING(MAX),\n",
" youtube_bio STRING(MAX),\n",
" youtube_handle STRING(MAX),\n",
" customer_service_interactions STRING(MAX),\n",
" average_amount_spent_per_order NUMERIC,\n",
" last_order_date TIMESTAMP,\n",
" latest_review_sentiment STRING(MAX),\n",
" most_frequent_purchase_location INT64,\n",
" negative_review_percentage NUMERIC,\n",
" neutral_review_percentage NUMERIC,\n",
" positive_review_percentage NUMERIC,\n",
" purchase_locations STRING(MAX),\n",
" top_3_favorite_menu_items STRING(MAX),\n",
" total_amount_spent NUMERIC,\n",
" total_orders INT64,\n",
" total_reviews INT64\n",
") PRIMARY KEY(customer_id)\n",
"\"\"\")\n",
"\n",
"# Create the tables in Spanner\n",
"RunSpannerDDL(ddl_array)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "LbIqVjBWqqVA"
},
"source": [
"### Create Edge Tables"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Ud3BN30qw5KM"
},
"outputs": [],
"source": [
"# Add edge table DDL into an array\n",
"ddl_array = []\n",
"\n",
"ddl_array.append(f\"\"\"CREATE TABLE IF NOT EXISTS cai_edge_customer_places_order\n",
"( customer_id INT64,\n",
" order_id INT64,\n",
" store_id INT64,\n",
" order_datetime TIMESTAMP \n",
") PRIMARY KEY(customer_id, order_id, store_id),\n",
" INTERLEAVE IN PARENT cai_customer ON DELETE CASCADE\n",
"\"\"\")\n",
"\n",
"ddl_array.append(f\"\"\"CREATE TABLE IF NOT EXISTS cai_edge_customer_writes_review\n",
"( customer_id INT64,\n",
" customer_review_id INT64,\n",
" review_datetime TIMESTAMP \n",
") PRIMARY KEY(customer_id, customer_review_id),\n",
" INTERLEAVE IN PARENT cai_customer ON DELETE CASCADE\n",
"\"\"\")\n",
"\n",
"ddl_array.append(f\"\"\"CREATE TABLE IF NOT EXISTS cai_edge_customer_follows_customer\n",
"( customer_id INT64,\n",
" followed_customer_id INT64,\n",
" follow_date DATE \n",
") PRIMARY KEY(customer_id, followed_customer_id),\n",
" INTERLEAVE IN PARENT cai_customer ON DELETE CASCADE\n",
"\"\"\")\n",
"\n",
"ddl_array.append(f\"\"\"CREATE TABLE IF NOT EXISTS cai_edge_customer_rates_menu_item\n",
"( customer_id INT64,\n",
" menu_id INT64,\n",
" rating INT64,\n",
" rating_datetime TIMESTAMP \n",
") PRIMARY KEY(customer_id, menu_id),\n",
" INTERLEAVE IN PARENT cai_customer ON DELETE CASCADE\n",
"\"\"\")\n",
"\n",
"ddl_array.append(f\"\"\"CREATE TABLE IF NOT EXISTS cai_edge_order_contains_order_item\n",
"( order_id INT64,\n",
" store_id INT64,\n",
" order_item_id INT64 \n",
") PRIMARY KEY(order_id, store_id, order_item_id),\n",
" INTERLEAVE IN PARENT cai_order ON DELETE CASCADE\n",
"\"\"\")\n",
"\n",
"ddl_array.append(f\"\"\"CREATE TABLE IF NOT EXISTS cai_edge_order_placed_at_store\n",
"( order_id INT64,\n",
" store_id INT64,\n",
" order_datetime TIMESTAMP \n",
") PRIMARY KEY(order_id, store_id),\n",
" INTERLEAVE IN PARENT cai_order ON DELETE CASCADE\n",
"\"\"\")\n",
"\n",
"ddl_array.append(f\"\"\"CREATE TABLE cai_edge_customer_has_marketing_profile (\n",
" customer_id INT64,\n",
" marketing_profile_id INT64,\n",
") PRIMARY KEY(customer_id, marketing_profile_id),\n",
" INTERLEAVE IN PARENT cai_customer ON DELETE CASCADE\"\"\")\n",
"\n",
"RunSpannerDDL(ddl_array)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### <font color='#4285f4'>Add Full-text Search Index</font>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ddl_array = []\n",
"\n",
"# Add FTS token column\n",
"ddl_array.append(\"\"\"\n",
"ALTER TABLE cai_menu\n",
"ADD COLUMN menu_name_token TOKENLIST\n",
"AS (TOKENIZE_NGRAMS(menu_name))\n",
"STORED HIDDEN\n",
"\"\"\")\n",
"\n",
"# Create search index on tokenized column\n",
"ddl_array.append(\"\"\"\n",
"CREATE SEARCH INDEX menu_name_fts_idx\n",
"ON cai_menu(menu_name_token)\n",
"OPTIONS (sort_order_sharding = true)\n",
"\"\"\")\n",
"\n",
"result = RunSpannerDDL(ddl_array)"
]
},
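{
"cell_type": "markdown",
"metadata": {},
"source": [
"The n-gram token column and search index above enable fuzzy matching on `menu_name`. Below is a minimal sketch of a search query; `RunSpannerQuery` is a hypothetical helper standing in for whatever Spanner query client this notebook defines."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: fuzzy full-text search against the n-gram index created above.\n",
"# SEARCH_NGRAMS matches n-grams of the search term, so small typos still hit;\n",
"# SCORE_NGRAMS ranks candidates by n-gram overlap.\n",
"fts_sql = \"\"\"SELECT menu_id, menu_name\n",
"FROM cai_menu\n",
"WHERE SEARCH_NGRAMS(menu_name_token, 'lavender trufle')\n",
"ORDER BY SCORE_NGRAMS(menu_name_token, 'lavender trufle') DESC\n",
"LIMIT 5\"\"\"\n",
"\n",
"# RunSpannerQuery(fts_sql)  # hypothetical helper; adapt to this notebook's Spanner client"
]
},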
{
"cell_type": "markdown",
"metadata": {
"id": "GKrTUV2mTw2Y"
},
"source": [
"### Create Spanner Graph"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "L3K_4XpCDdft"
},
"outputs": [],
"source": [
"# Create graph\n",
"ddl_array = []\n",
"\n",
"ddl_array.append(f\"\"\"CREATE OR REPLACE PROPERTY GRAPH chocolate_ai_graph\n",
" NODE TABLES (\n",
" cai_customer,\n",
" cai_customer_marketing_profile,\n",
" cai_customer_review,\n",
" cai_menu,\n",
" cai_order,\n",
" cai_order_item,\n",
" cai_store,\n",
" cai_customer_360\n",
" )\n",
" EDGE TABLES (\n",
" cai_edge_customer_places_order\n",
" SOURCE KEY(customer_id) REFERENCES cai_customer\n",
" DESTINATION KEY(order_id, store_id) REFERENCES cai_order\n",
" LABEL Places,\n",
" cai_edge_customer_writes_review\n",
" SOURCE KEY(customer_id) REFERENCES cai_customer\n",
" DESTINATION KEY(customer_review_id) REFERENCES cai_customer_review\n",
" LABEL Writes,\n",
" cai_edge_customer_follows_customer\n",
" SOURCE KEY(customer_id) REFERENCES cai_customer\n",
" DESTINATION KEY(followed_customer_id) REFERENCES cai_customer\n",
" LABEL Follows,\n",
" cai_edge_customer_rates_menu_item\n",
" SOURCE KEY(customer_id) REFERENCES cai_customer\n",
" DESTINATION KEY(menu_id) REFERENCES cai_menu\n",
" LABEL Rates,\n",
" cai_edge_order_contains_order_item\n",
" SOURCE KEY(order_id, store_id) REFERENCES cai_order\n",
" DESTINATION KEY(order_item_id, order_id) REFERENCES cai_order_item\n",
" LABEL IsIn,\n",
" cai_edge_order_placed_at_store\n",
" SOURCE KEY(order_id, store_id) REFERENCES cai_order\n",
" DESTINATION KEY(store_id) REFERENCES cai_store\n",
" LABEL PlacedAt,\n",
" cai_edge_customer_has_marketing_profile\n",
" KEY(customer_id, marketing_profile_id)\n",
" SOURCE KEY(customer_id) REFERENCES cai_customer(customer_id)\n",
" DESTINATION KEY(marketing_profile_id) REFERENCES cai_customer_360(customer_id)\n",
" LABEL HasMarketingProfile PROPERTIES(customer_id, marketing_profile_id)\n",
" )\n",
"\"\"\")\n",
"\n",
"RunSpannerDDL(ddl_array)"
]
},
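{
"cell_type": "markdown",
"metadata": {},
"source": [
"With `chocolate_ai_graph` defined, relationships can be traversed with Spanner Graph GQL instead of multi-way joins. Below is a minimal sketch: node labels default to the table names, the edge labels are the ones declared above, and `RunSpannerQuery` is a hypothetical helper."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: walk customer -> order -> store through the property graph.\n",
"graph_sql = \"\"\"GRAPH chocolate_ai_graph\n",
"MATCH (c:cai_customer)-[:Places]->(o:cai_order)-[:PlacedAt]->(s:cai_store)\n",
"RETURN c.customer_id, o.order_id, s.store_id\n",
"LIMIT 10\"\"\"\n",
"\n",
"# RunSpannerQuery(graph_sql)  # hypothetical helper; adapt to this notebook's Spanner client"
]
},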
{
"cell_type": "markdown",
"metadata": {
"id": "BH5daDK8oLpS"
},
"source": [
"### <font color='#4285f4'>Create JSON Views</font>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "707bymC-oTO8"
},
"outputs": [],
"source": [
"# Define views to simplify querying JSON data\n",
"ddl_array = []\n",
"ddl_array.append(f\"\"\"CREATE OR REPLACE VIEW `cai_customer_marketing_profile_segments` SQL SECURITY INVOKER\n",
"AS SELECT\n",
" cai_customer_marketing_profile.customer_id,\n",
" REPLACE(JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.behavioral_segmentation.`Benefits Sought`), \" \", \"\")AS benefits_sought,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.behavioral_segmentation.`Browsing Behavior`)AS browsing_behavior,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.behavioral_segmentation.`Loyalty Status`)AS loyalty_status,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.behavioral_segmentation.`Occasion/Timing`)AS occasion_timing,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.behavioral_segmentation.`Purchase History`)AS purchase_history,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.behavioral_segmentation.`Spending Habits`)AS spending_habits,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.behavioral_segmentation.`Usage Frequency`)AS usage_frequency,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.behavioral_segmentation.`User Status`)AS user_status,\n",
" ARRAY_TO_STRING(JSON_VALUE_ARRAY(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.customer_lifecycle_segmentation.`At-Risk Customers`), \",\")AS at_risk_customers,\n",
" ARRAY_TO_STRING(JSON_VALUE_ARRAY(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.customer_lifecycle_segmentation.`First-Time Customers`), \",\")AS first_time_customers,\n",
" ARRAY_TO_STRING(JSON_VALUE_ARRAY(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.customer_lifecycle_segmentation.`Former Customers`), \",\")AS former_customers,\n",
" ARRAY_TO_STRING(JSON_VALUE_ARRAY(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.customer_lifecycle_segmentation.`Inactive Customers`), \",\")AS inactive_customers,\n",
" ARRAY_TO_STRING(JSON_VALUE_ARRAY(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.customer_lifecycle_segmentation.`Loyal Advocates`), \",\")AS loyal_advocates,\n",
" ARRAY_TO_STRING(JSON_VALUE_ARRAY(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.customer_lifecycle_segmentation.`New Leads`), \",\")AS new_leads,\n",
" ARRAY_TO_STRING(JSON_VALUE_ARRAY(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.customer_lifecycle_segmentation.`Potential Customers`), \",\")AS potential_customers,\n",
" ARRAY_TO_STRING(JSON_VALUE_ARRAY(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.customer_lifecycle_segmentation.`Repeat Customers`), \",\")AS repeat_customers,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.demographic_segmentation.`Age`)AS age,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.demographic_segmentation.`Education`)AS education,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.demographic_segmentation.`Ethnicity`)AS ethnicity,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.demographic_segmentation.`Family Size`)AS family_size,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.demographic_segmentation.`Gender`)AS gender,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.demographic_segmentation.`Generation`)AS generation,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.demographic_segmentation.`Income`)AS income,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.demographic_segmentation.`Language`)AS language,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.demographic_segmentation.`Marital Status`)AS marital_status,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.demographic_segmentation.`Occupation`)AS occupation,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.geographic_segmentation.`City`)AS city,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.geographic_segmentation.`Climate`)AS climate,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.geographic_segmentation.`Country`)AS country,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.geographic_segmentation.`Population Density`)AS population_density,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.geographic_segmentation.`Region`)AS region,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.geographic_segmentation.`Time Zone`)AS time_zone,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.geographic_segmentation.`Urban/Rural`)AS urban_rural,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.needs_based_segmentation.`Challenges`)AS challenges,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.needs_based_segmentation.`Goals`)AS goals,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.needs_based_segmentation.`Pain Points`)AS pain_points,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.needs_based_segmentation.`Priorities`)AS priorities,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.needs_based_segmentation.`Specific Needs`)AS specific_needs,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.psychographic_segmentation.`Attitudes`)AS attitudes,\n",
" ARRAY_TO_STRING(JSON_VALUE_ARRAY(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.psychographic_segmentation.`Hobbies`), \",\")AS hobbies,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.psychographic_segmentation.`Interests`)AS interests,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.psychographic_segmentation.`Lifestyle`)AS lifestyle,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.psychographic_segmentation.`Motivations`)AS motivations,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.psychographic_segmentation.`Personality`)AS personality,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.psychographic_segmentation.`Social Class`)AS social_class,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.psychographic_segmentation.`Values`)AS customer_values,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.technographic_segmentation.`Adoption Rate`)AS adoption_rate,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.technographic_segmentation.`Browsers`)AS browsers,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.technographic_segmentation.`Devices`)AS devices,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.technographic_segmentation.`Internet Connectivity`)AS internet_connectivity,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.technographic_segmentation.`Operating Systems`)AS operating_systems,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.technographic_segmentation.`Social Media Platforms`)AS social_media_platforms,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.technographic_segmentation.`Software`)AS software,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.technographic_segmentation.`Tech Savviness`)AS tech_savviness,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.value_based_segmentation.`Cost-Benefit Analysis`)AS cost_benefit_analysis,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.value_based_segmentation.`Perceived Value`)AS perceived_value,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.value_based_segmentation.`Price Sensitivity`)AS price_sensitivity,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_segmentation_data.customer_segments.value_based_segmentation.`Willingness to Pay`)AS willingness_to_pay\n",
"FROM `cai_customer_marketing_profile`\n",
"\"\"\")\n",
"\n",
"ddl_array.append(f\"\"\"CREATE OR REPLACE VIEW `cai_customer_marketing_profile_data` SQL SECURITY INVOKER\n",
"AS SELECT\n",
" cai_customer_marketing_profile.customer_id,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_profile_data.children)AS children,\n",
" ARRAY_TO_STRING(JSON_VALUE_ARRAY(cai_customer_marketing_profile.customer_profile_data.chocolate_preferences), \",\")AS chocolate_preferences,\n",
" ARRAY_TO_STRING(JSON_VALUE_ARRAY(cai_customer_marketing_profile.customer_profile_data.content_interaction), \",\")AS content_interaction,\n",
" CAST(JSON_VALUE(cai_customer_marketing_profile.customer_profile_data.customer_age)AS INT64)AS customer_age,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_profile_data.education)AS education,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_profile_data.facebook_bio)AS facebook_bio,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_profile_data.facebook_engagement)AS facebook_engagement,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_profile_data.facebook_handle)AS facebook_handle,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_profile_data.instagram_bio)AS instagram_bio,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_profile_data.instagram_engagement)AS instagram_engagement,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_profile_data.instagram_handle)AS instagram_handle,\n",
" ARRAY_TO_STRING(JSON_VALUE_ARRAY(cai_customer_marketing_profile.customer_profile_data.interests), \",\")AS interests,\n",
" ARRAY_TO_STRING(JSON_VALUE_ARRAY(cai_customer_marketing_profile.customer_profile_data.lifestyle), \",\")AS lifestyle,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_profile_data.linkedin_bio)AS linkedin_bio,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_profile_data.linkedin_engagement)AS linkedin_engagement,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_profile_data.linkedin_handle)AS linkedin_handle,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_profile_data.martial_status)AS martial_status,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_profile_data.occupation)AS occupation,\n",
" ARRAY_TO_STRING(JSON_VALUE_ARRAY(cai_customer_marketing_profile.customer_profile_data.solicated_buying_habits), \",\")AS solicated_buying_habits,\n",
" ARRAY_TO_STRING(JSON_VALUE_ARRAY(cai_customer_marketing_profile.customer_profile_data.sports), \",\")AS sports,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_profile_data.tiktok_bio)AS tiktok_bio,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_profile_data.tiktok_handle)AS tiktok_handle,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_profile_data.twitter_bio)AS twitter_bio,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_profile_data.twitter_engagement)AS twitter_engagement,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_profile_data.twitter_handle)AS twitter_handle,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_profile_data.youtube_bio)AS youtube_bio,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_profile_data.youtube_handle)AS youtube_handle,\n",
" (\n",
" SELECT STRING_AGG(CONCAT(\n",
" 'contact_reason:', JSON_VALUE(interaction, '$.contact_reason'), ' ',\n",
" 'resolution_time:', JSON_VALUE(interaction, '$.resolution_time'), ' ',\n",
" 'sentiment_analysis:', JSON_VALUE(interaction, '$.sentiment_analysis')\n",
" ), ' | ')\n",
" FROM UNNEST(JSON_QUERY_ARRAY(cai_customer_marketing_profile.customer_profile_data, '$.customer_service_interactions'))AS interaction\n",
" )AS customer_service_interactions\n",
" FROM\n",
" `cai_customer_marketing_profile`\n",
"\"\"\")\n",
"\n",
"ddl_array.append(f\"\"\"CREATE OR REPLACE VIEW `cai_customer_marketing_profile_loyalty` SQL SECURITY INVOKER\n",
"AS SELECT\n",
" cai_customer_marketing_profile.customer_id,\n",
" CAST(JSON_VALUE(cai_customer_marketing_profile.customer_loyalty_data.average_amount_spent_per_order)AS NUMERIC)AS average_amount_spent_per_order,\n",
" CAST(JSON_VALUE(cai_customer_marketing_profile.customer_loyalty_data.last_order_date)AS TIMESTAMP)AS last_order_date,\n",
" JSON_VALUE(cai_customer_marketing_profile.customer_loyalty_data.latest_review_sentiment)AS latest_review_sentiment,\n",
" CAST(JSON_VALUE(cai_customer_marketing_profile.customer_loyalty_data.most_frequent_purchase_location)AS INT64)AS most_frequent_purchase_location,\n",
" CAST(JSON_VALUE(cai_customer_marketing_profile.customer_loyalty_data.negative_review_percentage)AS NUMERIC)AS negative_review_percentage,\n",
" CAST(JSON_VALUE(cai_customer_marketing_profile.customer_loyalty_data.neutral_review_percentage)AS NUMERIC)AS neutral_review_percentage,\n",
" CAST(JSON_VALUE(cai_customer_marketing_profile.customer_loyalty_data.positive_review_percentage)AS NUMERIC)AS positive_review_percentage,\n",
" ARRAY_TO_STRING(JSON_VALUE_ARRAY(cai_customer_marketing_profile.customer_loyalty_data.purchase_locations), \",\")AS purchase_locations,\n",
" ARRAY_TO_STRING(JSON_VALUE_ARRAY(cai_customer_marketing_profile.customer_loyalty_data.top_3_favorite_menu_items), \",\")AS top_3_favorite_menu_items,\n",
" CAST(JSON_VALUE(cai_customer_marketing_profile.customer_loyalty_data.total_amount_spent)AS NUMERIC)AS total_amount_spent,\n",
" CAST(JSON_VALUE(cai_customer_marketing_profile.customer_loyalty_data.total_orders)AS INT64)AS total_orders,\n",
" CAST(JSON_VALUE(cai_customer_marketing_profile.customer_loyalty_data.total_reviews)AS INT64)AS total_reviews\n",
" FROM\n",
" `cai_customer_marketing_profile`\n",
"\"\"\")\n",
"\n",
"RunSpannerDDL(ddl_array)"
]
},
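{
"cell_type": "markdown",
"metadata": {},
"source": [
"The views above flatten the nested JSON so segment and loyalty attributes can be queried as ordinary columns. A minimal sketch (`RunSpannerQuery` is a hypothetical helper):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: segment attributes read like plain columns through the view.\n",
"segments_sql = \"\"\"SELECT customer_id, loyalty_status, price_sensitivity\n",
"FROM cai_customer_marketing_profile_segments\n",
"WHERE loyalty_status IS NOT NULL\n",
"LIMIT 10\"\"\"\n",
"\n",
"# RunSpannerQuery(segments_sql)  # hypothetical helper; adapt to this notebook's Spanner client"
]
},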
{
"cell_type": "markdown",
"metadata": {
"id": "hXsA185xw3Ym"
},
"source": [
"### <font color='#4285f4'>Copy Source Data from BigQuery</font>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Es_dXSC5aGUq"
},
"source": [
"#### Create Temporary BQ Views"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "VJzhxokiywG6"
},
"outputs": [],
"source": [
"# Create temporary edge table views in BigQuery to be used for ETL into Spanner because\n",
"# Spanner limits us to 80,000 mutations per read/write query, making this approach\n",
"# faster and more straightforward.\n",
"\n",
"bq_temp_view_name_array = [\n",
" \"edge_customer_places_order\",\n",
" \"edge_customer_writes_review\",\n",
" \"edge_order_contains_order_item\",\n",
" \"edge_order_placed_at_store\",\n",
" \"edge_customer_rates_menu_item\",\n",
" \"edge_customer_has_marketing_profile\"\n",
"]\n",
"\n",
"bq_temp_view_ddl_array = []\n",
"\n",
"bq_temp_view_ddl_array.append(f\"\"\"CREATE OR REPLACE VIEW `{bq_dataset}.edge_customer_places_order` (customer_id, order_id, store_id, order_datetime) AS\n",
"SELECT c.customer_id, o.order_id, o.store_id, o.order_datetime\n",
"FROM `{bq_dataset}.customer` c\n",
"JOIN `{bq_dataset}.order` o ON c.customer_id = o.customer_id;\"\"\")\n",
"\n",
"bq_temp_view_ddl_array.append(f\"\"\"CREATE OR REPLACE VIEW `{bq_dataset}.edge_customer_writes_review` (customer_id, customer_review_id, review_datetime) AS\n",
"SELECT c.customer_id, cr.customer_review_id, cr.review_datetime\n",
"FROM `{bq_dataset}.customer` c\n",
"JOIN `{bq_dataset}.customer_review` cr ON c.customer_id = cr.customer_id;\"\"\")\n",
"\n",
"bq_temp_view_ddl_array.append(f\"\"\"CREATE OR REPLACE VIEW `{bq_dataset}.edge_order_contains_order_item` (order_id, store_id, order_item_id) AS\n",
"SELECT o.order_id, o.store_id, oi.order_item_id\n",
"FROM `{bq_dataset}.order` o\n",
"JOIN `{bq_dataset}.order_item` oi ON o.order_id = oi.order_id;\"\"\")\n",
"\n",
"bq_temp_view_ddl_array.append(f\"\"\"CREATE OR REPLACE VIEW `{bq_dataset}.edge_order_placed_at_store` (order_id, store_id, order_datetime) AS\n",
"SELECT o.order_id, o.store_id, o.order_datetime\n",
"FROM `{bq_dataset}.order` o;\"\"\")\n",
"\n",
"bq_temp_view_ddl_array.append(f\"\"\"CREATE OR REPLACE VIEW `{bq_dataset}.edge_customer_has_marketing_profile` (customer_id, marketing_profile_id) AS\n",
"SELECT customer_id, customer_id\n",
"FROM `{bq_dataset}.customer_marketing_profile`;\"\"\")\n",
"\n",
"# Complex case statement below was generated with the following query:\n",
"'''\n",
" sql = f\"\"\"SELECT CONCAT(' WHEN review_text LIKE \\\\'%', menu_name, '%\\\\' THEN ', menu_id, ',') AS case_statement,\n",
" menu_name,\n",
" menu_id\n",
" FROM {bq_dataset}.menu\n",
" ORDER BY menu_id;\n",
" \"\"\"\n",
"'''\n",
"\n",
"bq_temp_view_ddl_array.append(f\"\"\"CREATE OR REPLACE VIEW `{bq_dataset}.edge_customer_rates_menu_item` AS\n",
" SELECT customer_id,\n",
" CASE\n",
" WHEN review_text LIKE '%Chocolate Decadence Coffee%' THEN 1\n",
" WHEN review_text LIKE '%The Golden Trio%' THEN 2\n",
" WHEN review_text LIKE '%Spice Trade Chocolate Flight%' THEN 3\n",
" WHEN review_text LIKE '%Enchanted Forest Dream%' THEN 4\n",
" WHEN review_text LIKE '%Spice Odyssey%' THEN 5\n",
" WHEN review_text LIKE '%Cocoa Cream Delight%' THEN 6\n",
" WHEN review_text LIKE '%Triple Chocolate Hazelnut Delice%' THEN 7\n",
" WHEN review_text LIKE '%Chocolate Swan Cake%' THEN 8\n",
" WHEN review_text LIKE '%Lavender Bloom Truffle%' THEN 9\n",
" WHEN review_text LIKE '%Chocolate & Cheese Flight%' THEN 10\n",
" WHEN review_text LIKE '%Spiced Chocolate Trio%' THEN 11\n",
" WHEN review_text LIKE '%Saffron-Infused Chocolate Sphere%' THEN 12\n",
" WHEN review_text LIKE '%Lavender White Chocolate Cream Puffs%' THEN 13\n",
" WHEN review_text LIKE '%Earl Grey & Bergamot Chocolate Symphony%' THEN 14\n",
" WHEN review_text LIKE '%Lavender Sea Salt Chocolate%' THEN 15\n",
" WHEN review_text LIKE '%Red Velvet Surprise%' THEN 16\n",
" WHEN review_text LIKE '%Chocolate Discovery%' THEN 17\n",
" WHEN review_text LIKE '%Triple Chocolate Cake%' THEN 18\n",
" WHEN review_text LIKE '%Earl Grey & Gold Truffles%' THEN 19\n",
" WHEN review_text LIKE '%Lavender Honey Truffle%' THEN 20\n",
" WHEN review_text LIKE '%Molten Caramel Surprise%' THEN 21\n",
" WHEN review_text LIKE '%Lavender Mocha%' THEN 22\n",
" WHEN review_text LIKE '%Coffee Bean Dream%' THEN 23\n",
" WHEN review_text LIKE '%Aztec Spice%' THEN 24\n",
" WHEN review_text LIKE '%Midnight in Paris Truffle%' THEN 25\n",
" WHEN review_text LIKE '%Spicy Chocolate Six Pack%' THEN 26\n",
" WHEN review_text LIKE '%Triple Layer Bar%' THEN 27\n",
" WHEN review_text LIKE '%Midnight In Paris Cake%' THEN 28\n",
" WHEN review_text LIKE '%Triple Chocolate Symphony Cake%' THEN 29\n",
" WHEN review_text LIKE '%Salted Caramel Coffee Dreams%' THEN 30\n",
" WHEN review_text LIKE '%Edible Gems%' THEN 31\n",
" WHEN review_text LIKE '%Churro Tower%' THEN 32\n",
" WHEN review_text LIKE '%Blooming Caramel Truffle%' THEN 33\n",
" WHEN review_text LIKE '%Triple Chocolate Mousse Delight%' THEN 34\n",
" WHEN review_text LIKE '%Dark & Zesty%' THEN 35\n",
" WHEN review_text LIKE '%Spice Route Chocolate Flight%' THEN 36\n",
" WHEN review_text LIKE '%Cosmic Cacao%' THEN 37\n",
" WHEN review_text LIKE '%Eiffel Tower Profiterole Tower%' THEN 38\n",
" WHEN review_text LIKE '%Parisian Ganache Trio%' THEN 39\n",
" WHEN review_text LIKE '%Edible Masterpieces%' THEN 40\n",
" WHEN review_text LIKE '%Lavender White Chocolate Lava Cake%' THEN 41\n",
" WHEN review_text LIKE '%Midnight in Paris Mousse%' THEN 42\n",
" WHEN review_text LIKE '%Cosmic Chocolate Sphere%' THEN 43\n",
" WHEN review_text LIKE '%Golden Orb Surprise%' THEN 44\n",
" WHEN review_text LIKE '%Parisian Spice%' THEN 45\n",
" WHEN review_text LIKE '%Spiced Chocolate Faberge Egg%' THEN 46\n",
" WHEN review_text LIKE '%Trio of Chocolate Bars%' THEN 47\n",
" WHEN review_text LIKE '%Parisian Truffle Collection%' THEN 48\n",
" WHEN review_text LIKE '%Espresso Chocolate Dream Cake%' THEN 49\n",
" WHEN review_text LIKE '%Spice Bazaar Chocolate Sticks%' THEN 50\n",
" WHEN review_text LIKE '%Eiffel Tower Delight%' THEN 51\n",
" WHEN review_text LIKE '%Lavender Honey Dusk%' THEN 52\n",
" WHEN review_text LIKE '%Toasted Marshmallow Chocolate Mousse Peak%' THEN 53\n",
" WHEN review_text LIKE '%Chocolate Balloon Ride for Two%' THEN 54\n",
" WHEN review_text LIKE '%Parisian Cafe Mocha%' THEN 55\n",
" WHEN review_text LIKE '%Chocolate Dream Trio%' THEN 56\n",
" WHEN review_text LIKE '%Parisian Nights%' THEN 57\n",
" WHEN review_text LIKE '%Parisian Hot Chocolate%' THEN 58\n",
" WHEN review_text LIKE '%Triple Chocolate Ganache%' THEN 59\n",
" WHEN review_text LIKE '%Parisian Dreamscape Collection%' THEN 60\n",
" WHEN review_text LIKE '%Chocolate Affogato%' THEN 61\n",
" WHEN review_text LIKE '%Melting Chocolate Sphere%' THEN 62\n",
" WHEN review_text LIKE '%Spice Route Selection%' THEN 63\n",
" WHEN review_text LIKE '%Pour-Over Coffee%' THEN 64\n",
" WHEN review_text LIKE '%Avant-Garde Bonbons%' THEN 65\n",
" WHEN review_text LIKE '%Persian Dreams%' THEN 66\n",
" WHEN review_text LIKE '%Midnight in Paris%' THEN 67\n",
" WHEN review_text LIKE '%Triple Chocolate Hazelnut Dream%' THEN 68\n",
" WHEN review_text LIKE '%Eiffel Tower%' THEN 69\n",
" WHEN review_text LIKE '%Parisian Chocolate Tower Espresso%' THEN 70\n",
" WHEN review_text LIKE '%Melting Chocolate Surprise%' THEN 71\n",
" WHEN review_text LIKE '%Dark Chocolate Almond Bark%' THEN 72\n",
" WHEN review_text LIKE '%Melting Chocolate Sphere Surprise%' THEN 73\n",
" WHEN review_text LIKE '%Spiced Chocolate Symphony%' THEN 74\n",
" WHEN review_text LIKE '%Choco-Espresso%' THEN 75\n",
" WHEN review_text LIKE '%Parisian Cafe Au Lait%' THEN 76\n",
" WHEN review_text LIKE '%The Artist\\\\'s Palette%' THEN 77\n",
" WHEN review_text LIKE '%Chocolate Rose%' THEN 78\n",
" WHEN review_text LIKE '%Chocolate Covered Ice Cream Bon Bon%' THEN 79\n",
" WHEN review_text LIKE '%Chocolate Espresso Cup%' THEN 80\n",
" WHEN review_text LIKE '%Black Pepper Chocolate Dream%' THEN 81\n",
" WHEN review_text LIKE '%Parisian Bonbon Collection%' THEN 82\n",
" WHEN review_text LIKE '%Signature Chocolate Cloud Cappuccino%' THEN 83\n",
" WHEN review_text LIKE '%Golden Orb Surprise%' THEN 84\n",
" WHEN review_text LIKE '%Golden Egg Surprise%' THEN 85\n",
" WHEN review_text LIKE '%Lavender Latte%' THEN 86\n",
" WHEN review_text LIKE '%Parisian Spice Rhapsody%' THEN 87\n",
" WHEN review_text LIKE '%Ethereal Hearts Desire%' THEN 88\n",
" WHEN review_text LIKE '%Triple Chocolate Decadence%' THEN 89\n",
" WHEN review_text LIKE '%Decadent Chocolate Dream%' THEN 90\n",
" WHEN review_text LIKE '%Spice Route Tasting Set%' THEN 91\n",
" WHEN review_text LIKE '%Parisian Chocolate Caramel Dream%' THEN 92\n",
" WHEN review_text LIKE '%Chocolate Sphere Surprise%' THEN 93\n",
" WHEN review_text LIKE '%Molten Espresso Heart%' THEN 94\n",
" WHEN review_text LIKE '%Golden Macaron Tower%' THEN 95\n",
" WHEN review_text LIKE '%Chocolate Chip Cookie Mousse Duo%' THEN 96\n",
" WHEN review_text LIKE '%Churro Trio with Dark Chocolate Dip%' THEN 97\n",
" WHEN review_text LIKE '%Ethereal Chocolate Sphere%' THEN 98\n",
" WHEN review_text LIKE '%Salted Caramel Pretzel Bark%' THEN 99\n",
" WHEN review_text LIKE '%Parisian Noir Orange Dream%' THEN 100\n",
" WHEN review_text LIKE '%Coco Chanel%' THEN 101\n",
" WHEN review_text LIKE '%Chocolate World Tour%' THEN 102\n",
" WHEN review_text LIKE '%Spiced Ganache Treasures%' THEN 103\n",
" WHEN review_text LIKE '%Parisian Vanilla Bean Dream%' THEN 104\n",
" WHEN review_text LIKE '%Chocolate Tasting Flight%' THEN 105\n",
" WHEN review_text LIKE '%Orange Zest Coffee%' THEN 106\n",
" WHEN review_text LIKE '%Deconstructed Chocolate Mousse%' THEN 107\n",
" WHEN review_text LIKE '%Chocolate Hot Air Balloon Ride%' THEN 108\n",
" WHEN review_text LIKE '%Parisian Coffee Pairing%' THEN 109\n",
" WHEN review_text LIKE '%Molten Chocolate Lava Cake%' THEN 110\n",
" WHEN review_text LIKE '%Spiced Chocolate Sphere Trio%' THEN 111\n",
" WHEN review_text LIKE '%Chocolate Marshmallow Croissant%' THEN 112\n",
" WHEN review_text LIKE '%Ebony & Ivory%' THEN 113\n",
" WHEN review_text LIKE '%Black Pepper & Gold Sphere%' THEN 114\n",
" WHEN review_text LIKE '%Colorful Confections%' THEN 115\n",
" WHEN review_text LIKE '%Triple Chocolate Mousse Tower%' THEN 116\n",
" WHEN review_text LIKE '%Melting Chocolate Globe Surprise%' THEN 117\n",
" WHEN review_text LIKE '%Eiffel Cloud%' THEN 118\n",
" WHEN review_text LIKE '%Sweet Ascent%' THEN 119\n",
" WHEN review_text LIKE '%Eiffel Tower Mousse%' THEN 120\n",
" WHEN review_text LIKE '%Coffee Ganache Dome%' THEN 121\n",
" WHEN review_text LIKE '%Chocolate Dipped Coffee Mousse Cones%' THEN 122\n",
" WHEN review_text LIKE '%Donut Brioche Skewers%' THEN 123\n",
" WHEN review_text LIKE '%Parisian Chocolate Easter Egg%' THEN 124\n",
" WHEN review_text LIKE '%Lavender Bloom%' THEN 125\n",
" WHEN review_text LIKE '%Aztec\\\\'s Spicy Chocolate Tower%' THEN 126\n",
" WHEN review_text LIKE '%Smoked Tea Truffle Quintet%' THEN 127\n",
" WHEN review_text LIKE '%Parisian Latte%' THEN 128\n",
" WHEN review_text LIKE '%Triple Chocolate Symphony%' THEN 129\n",
" WHEN review_text LIKE '%Parisian Dream Bonbons%' THEN 130\n",
" WHEN review_text LIKE '%Salted Caramel Nitro Cold Brew%' THEN 131\n",
" WHEN review_text LIKE '%Five Spice Chocolate Celebration Cake%' THEN 132\n",
" WHEN review_text LIKE '%Butterfly Garden%' THEN 133\n",
" WHEN review_text LIKE '%Classic Espresso%' THEN 134\n",
" WHEN review_text LIKE '%Chocolate Dipped Waffle Cone Bites%' THEN 135\n",
" WHEN review_text LIKE '%Golden Night%' THEN 136\n",
" WHEN review_text LIKE '%Parisian Cloud Espresso%' THEN 137\n",
" WHEN review_text LIKE '%Trio of Textures%' THEN 138\n",
" WHEN review_text LIKE '%Parisian Surprise%' THEN 139\n",
" WHEN review_text LIKE '%Doughnut Delight%' THEN 140\n",
" WHEN review_text LIKE '%Rich Indulgence%' THEN 141\n",
" WHEN review_text LIKE '%Midnight in Provence%' THEN 142\n",
" WHEN review_text LIKE '%Chocolate Infused Coffee%' THEN 143\n",
" WHEN review_text LIKE '%Parisian Coffee Truffle Trio%' THEN 144\n",
" WHEN review_text LIKE '%Chocolate Caramel Orb%' THEN 145\n",
" WHEN review_text LIKE '%Classic Affogato%' THEN 146\n",
" WHEN review_text LIKE '%Golden Parisian S\\\\'mores%' THEN 147\n",
" WHEN review_text LIKE '%Pistachio Cardamom Jewels%' THEN 148\n",
" WHEN review_text LIKE '%Coffee Break Brownie%' THEN 149\n",
" WHEN review_text LIKE '%Spice Tower Mousse Trio%' THEN 150\n",
" WHEN review_text LIKE '%Cardamom Coffee Indulgence%' THEN 151\n",
" WHEN review_text LIKE '%Triple Chocolate Mousse Parfait%' THEN 152\n",
" WHEN review_text LIKE '%Cosmic Cups%' THEN 153\n",
" WHEN review_text LIKE '%Parisian Chocolate Dream%' THEN 154\n",
" WHEN review_text LIKE '%Chocolate Fudge Tower%' THEN 155\n",
" WHEN review_text LIKE '%Aromatic Adventures%' THEN 156\n",
" WHEN review_text LIKE '% Parisian Midnight%' THEN 157\n",
" WHEN review_text LIKE '%Chocolate Chip Cookie Ice Cream Tower%' THEN 158\n",
" WHEN review_text LIKE '%Chocolate Macaron Delights%' THEN 159\n",
" WHEN review_text LIKE '%Parisian Coffee%' THEN 160\n",
" WHEN review_text LIKE '%Eiffel Tower of Chocolate Delights%' THEN 161\n",
" WHEN review_text LIKE '%Emerald of the Seine%' THEN 162\n",
" WHEN review_text LIKE '%Parisian Chocolate Ganache Coffee%' THEN 163\n",
" WHEN review_text LIKE '%Molten White Chocolate Lava Cake%' THEN 164\n",
" WHEN review_text LIKE '%Chocolate Balloon Ride%' THEN 165\n",
" WHEN review_text LIKE '%White Chocolate Dream Latte%' THEN 166\n",
" WHEN review_text LIKE '%Chocolate Concerto%' THEN 167\n",
" WHEN review_text LIKE '%Spice Trader\\\\'s Collection%' THEN 168\n",
" WHEN review_text LIKE '%Molten Chocolate Espresso%' THEN 169\n",
" WHEN review_text LIKE '%Parisian Chocolate Symphony%' THEN 170\n",
" WHEN review_text LIKE '%Chocolate Waffle Tower%' THEN 171\n",
" WHEN review_text LIKE '%Lavender Midnight Mocha%' THEN 172\n",
" WHEN review_text LIKE '%Decadent Chocolate Hazelnut Symphony%' THEN 173\n",
" WHEN review_text LIKE '%Parisian Hemisphere Trio%' THEN 174\n",
" WHEN review_text LIKE '%Salted Caramel Affogato%' THEN 175\n",
" WHEN review_text LIKE '%Churro Tower of Dreams%' THEN 176\n",
" WHEN review_text LIKE '%Hot Air Balloon Adventure%' THEN 177\n",
" WHEN review_text LIKE '%Lavender Vanilla Bean Latte%' THEN 178\n",
" WHEN review_text LIKE '%Parisian Chocolate Sphere Trio%' THEN 179\n",
" WHEN review_text LIKE '%Chocolate Hot Air Balloon Flight%' THEN 180\n",
" WHEN review_text LIKE '%Midnight Velvet Espresso%' THEN 181\n",
" WHEN review_text LIKE '%Triple Chocolate Spice Deception%' THEN 182\n",
" WHEN review_text LIKE '%Chocolate Hazelnut Symphony%' THEN 183\n",
" WHEN review_text LIKE '%Lavender Espresso Dream%' THEN 184\n",
" WHEN review_text LIKE '%Chocolate Trio%' THEN 185\n",
" WHEN review_text LIKE '%Parisian Twilight Delight%' THEN 186\n",
" WHEN review_text LIKE '%Parisian Mocha %' THEN 187\n",
" WHEN review_text LIKE '%Chocolate Decadence Tower%' THEN 188\n",
" WHEN review_text LIKE '%Parisian Spice Journey%' THEN 189\n",
" WHEN review_text LIKE '%Parisian Chocolate Espresso%' THEN 190\n",
" WHEN review_text LIKE '%Parisian Twilight Mousse%' THEN 191\n",
" WHEN review_text LIKE '%Spice Trader\\\\'s Chocolate Journey%' THEN 192\n",
" WHEN review_text LIKE '%Lavender Dream%' THEN 193\n",
" WHEN review_text LIKE '%Chocolate Dipped Pretzel Cones with Salted Caramel%' THEN 194\n",
" WHEN review_text LIKE '% Parisian Twilight%' THEN 195\n",
" WHEN review_text LIKE '%Parisian Iced Coffee Dream%' THEN 196\n",
" WHEN review_text LIKE '%Lavender Dreamscape%' THEN 197\n",
" WHEN review_text LIKE '%Landmark Lockets%' THEN 198\n",
" WHEN review_text LIKE '%Midnight in Paris Mocha%' THEN 199\n",
" WHEN review_text LIKE '%Coffee Crunch%' THEN 200\n",
" WHEN review_text LIKE '%Spiced Chocolate Napoleon%' THEN 201\n",
" WHEN review_text LIKE '%Volcano Coffee%' THEN 202\n",
" WHEN review_text LIKE '%Earl Grey Chocolate Tartlet%' THEN 203\n",
" WHEN review_text LIKE '%Edible Chocolate Flames%' THEN 204\n",
" WHEN review_text LIKE '%Cardamom Chocolate Espresso%' THEN 205\n",
" WHEN review_text LIKE '%Chocolate Parisian Hats%' THEN 206\n",
" WHEN review_text LIKE '%Parisian Nights Chocolate Trio%' THEN 207\n",
" WHEN review_text LIKE '%Salted Caramel Hot Chocolate%' THEN 208\n",
" WHEN review_text LIKE '%Triple Chocolate Coffee Dream%' THEN 209\n",
" WHEN review_text LIKE '%Chocolate Hot Air Balloons%' THEN 210\n",
" WHEN review_text LIKE '%Lavender Dreams Hot Chocolate%' THEN 211\n",
" WHEN review_text LIKE '%Chocolate Covered Pretzel Choux au Craquelin%' THEN 212\n",
" WHEN review_text LIKE '%Parisian Promenade%' THEN 213\n",
" WHEN review_text LIKE '%Parisian Hot Chocolate Extravaganza%' THEN 214\n",
" WHEN review_text LIKE '%Layered Chocolate Mousse%' THEN 215\n",
" WHEN review_text LIKE '%Parisian Chocolate Map%' THEN 216\n",
" WHEN review_text LIKE '%Parisian Delight%' THEN 217\n",
" WHEN review_text LIKE '%Chocolate Paris Tower%' THEN 218\n",
" WHEN review_text LIKE '%Parisian Truffle Flight%' THEN 219\n",
" WHEN review_text LIKE '%Parisian Hot Cocoa%' THEN 220\n",
" WHEN review_text LIKE '%Parisian Pâtisserie Perfection%' THEN 221\n",
" WHEN review_text LIKE '%Edible Parisian Garden%' THEN 222\n",
" WHEN review_text LIKE '%Midnight Chocolate Latte%' THEN 223\n",
" WHEN review_text LIKE '%Melting Palette%' THEN 224\n",
" WHEN review_text LIKE '%Parisian Gold Truffles%' THEN 225\n",
" WHEN review_text LIKE '%Lavender White Mocha%' THEN 226\n",
" WHEN review_text LIKE '%Mousse au Chocolat Couvert%' THEN 227\n",
" WHEN review_text LIKE '%Moroccan Spice Journey%' THEN 228\n",
" WHEN review_text LIKE '%Parisian Mocha%' THEN 229\n",
" WHEN review_text LIKE '%Sweet Surrender%' THEN 230\n",
" WHEN review_text LIKE '%Parisian Landmarks Collection%' THEN 231\n",
" WHEN review_text LIKE '%Dark Chocolate Decadence%' THEN 232\n",
" WHEN review_text LIKE '%Trio of Chocolate Lava Cakes%' THEN 233\n",
" WHEN review_text LIKE '%Eastern Indulgence%' THEN 234\n",
" WHEN review_text LIKE '%Black Pepper Ganache Sphere%' THEN 235\n",
" WHEN review_text LIKE '%Dark Chocolate Sea Salt Meringue Pie%' THEN 236\n",
" WHEN review_text LIKE '%Cosmic Trio%' THEN 237\n",
" WHEN review_text LIKE '%Parisian Twilight Latte%' THEN 238\n",
" WHEN review_text LIKE '%Chocolate Churro Dreams%' THEN 239\n",
" WHEN review_text LIKE '%Parisian Dreamscape%' THEN 240\n",
" WHEN review_text LIKE '%Lavender Espresso%' THEN 241\n",
" WHEN review_text LIKE '%Chocolate Decadence%' THEN 242\n",
" WHEN review_text LIKE '%Chai Chocolate Zoo%' THEN 243\n",
" WHEN review_text LIKE '%Parisian Roast%' THEN 244\n",
" WHEN review_text LIKE '%Chocolate Garden of Dreams%' THEN 245\n",
" WHEN review_text LIKE '%Parisian Dreamscape Bonbons%' THEN 246\n",
" WHEN review_text LIKE '%Dark Chocolate Dream Espresso%' THEN 247\n",
" WHEN review_text LIKE '%Parisian Midnight%' THEN 248\n",
" WHEN review_text LIKE '%Celestial Trio%' THEN 249\n",
" WHEN review_text LIKE '%Parisian Chocolate Layered Latte%' THEN 250\n",
" ELSE 999\n",
" END AS menu_id,\n",
" CASE review_sentiment\n",
" WHEN 'Positive' THEN CAST(ROUND(4 + rand() * (5 - 4)) AS INT64)\n",
" WHEN 'Negative' THEN CAST(ROUND(1 + rand() * (2 - 1)) AS INT64)\n",
" WHEN 'Neutral' THEN 3\n",
" ELSE 3\n",
" END AS rating,\n",
" review_datetime AS rating_datetime\n",
" FROM `{bq_dataset}.customer_review`;\n",
"\"\"\")\n",
"\n",
"for view in bq_temp_view_ddl_array:\n",
" result = RunBQQuery(view)\n",
" print(result)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Enable BigQuery Reservation API\n",
"\n",
"Enterprise edition or higher is required to run Reverse ETL."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! gcloud services enable bigqueryreservation.googleapis.com"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Create a Capacity Reservation\n",
"\n",
"The capacity reservation uses a baseline of 0 slots and an autoscaling maximum of 50 slots to keep costs low."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sql = f\"\"\"\n",
"CREATE RESERVATION\n",
" `{project_id}.region-us.reverse-etl-reservation`\n",
"OPTIONS (\n",
" slot_capacity = 0,\n",
" edition = 'ENTERPRISE',\n",
" autoscale_max_slots = 50);\n",
"\"\"\"\n",
"\n",
"result = RunBQQuery(sql)\n",
"result"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Create Assignment\n",
"\n",
"> NOTE: You may need to wait a minute for the capacity reservation to be available before creating an Assignment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sql = f\"\"\"\n",
"CREATE ASSIGNMENT\n",
" `{project_id}.region-us.reverse-etl-reservation.reverse-etl-assignment`\n",
"OPTIONS (\n",
" assignee = 'projects/{project_id}',\n",
" job_type = 'QUERY');\n",
"\"\"\"\n",
"\n",
"result = RunBQQuery(sql)\n",
"result"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5jKiB7XPboUn"
},
"source": [
"#### Run Reverse ETL\n",
"\n",
"> NOTE: You may need to wait a few minutes for the capacity reservation to take effect before running Reverse ETL. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "E1F2Euhsvtde"
},
"outputs": [],
"source": [
"# Setup reverse ETL\n",
"# https://cloud.google.com/bigquery/docs/export-to-spanner\n",
"\n",
"source_tables = [\n",
" \"customer\",\n",
" \"customer_marketing_profile\",\n",
" \"customer_review\",\n",
" \"menu\",\n",
" \"order\",\n",
" \"order_item\",\n",
" \"store\"\n",
"]\n",
"\n",
"# Add tables from bq_temp_view_name_array to the source_tables array\n",
"source_tables.extend(bq_temp_view_name_array)\n",
"\n",
"for table in source_tables:\n",
" RunReverseETL(table)"
]
},
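{
"cell_type": "markdown",
"metadata": {},
"source": [
"Under the hood, Reverse ETL uses BigQuery's `EXPORT DATA` statement with `format='CLOUD_SPANNER'`. As a rough sketch (the actual statement is built by the `RunReverseETL` helper defined earlier in this notebook), the export for each table follows this shape:\n",
"\n",
"```python\n",
"# Sketch only: mirrors the EXPORT DATA pattern used elsewhere in this notebook.\n",
"table = \"customer\"\n",
"spanner_options = '{\"table\": \"cai_customer\"}'\n",
"sql = f\"\"\"EXPORT DATA OPTIONS (\n",
"  uri='https://spanner.googleapis.com/projects/{project_id}/instances/{instance_id}/databases/{database_id}',\n",
"  format='CLOUD_SPANNER',\n",
"  spanner_options='''{spanner_options}'''\n",
")\n",
"AS SELECT * FROM `{bq_dataset}.{table}`;\"\"\"\n",
"```"
]
},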
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Validate Data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "UZhEYX2sDEL6"
},
"outputs": [],
"source": [
"# Validate data copied successfully\n",
"\n",
"for table in source_tables:\n",
" print(f\"\\n\\n\\nValidating table cai_{table}...\\n\")\n",
" sql = f\"\"\"SELECT COUNT(*) AS row_count FROM cai_{table}\"\"\"\n",
" result = RunSpannerQuery(sql)\n",
" print(f\"Row count: {result.iloc[0,0]}\\nSample data:\")\n",
"\n",
" sql = f\"\"\"SELECT * FROM cai_{table} LIMIT 5\"\"\"\n",
" result = RunSpannerQuery(sql)\n",
" print(result)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TvAvkPw0cTTb"
},
"source": [
"#### Drop Temporary BQ Views"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "SvGzr4RE2bIU"
},
"outputs": [],
"source": [
"# Once the data is copied from the BQ views, we can drop the views because we\n",
"# won't be using them in BQ.\n",
"\n",
"for view in bq_temp_view_name_array:\n",
" if view == 'customer_360':\n",
" print(f\"Skipping view {view}\")\n",
" continue\n",
" print(f\"Dropping view {view}...\")\n",
" sql = f\"\"\"DROP VIEW IF EXISTS `{bq_dataset}.{view}`\"\"\"\n",
" result = RunBQQuery(sql)\n",
" print(result)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ppsX90dx4nXu"
},
"source": [
"#### Load Customer 360 Table"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "AcdXgMlSlcNU"
},
"outputs": [],
"source": [
"# The customer_360 table has too many rows to insert in a single statement due\n",
"# to Spanner's limit of 80,000 mutations per transaction, so we insert the\n",
"# rows in batches as a workaround.\n",
"\n",
"# Define batch size\n",
"batch_size = 800\n",
"\n",
"# Get the total number of rows to insert.\n",
"sql = f\"\"\"SELECT count(*) AS row_count\n",
" FROM `cai_customer_marketing_profile_segments` AS mp\n",
" INNER JOIN `cai_customer_marketing_profile_data` AS cp ON mp.customer_id = cp.customer_id\n",
" INNER JOIN `cai_customer_marketing_profile_loyalty` AS cl ON mp.customer_id = cl.customer_id\n",
" INNER JOIN `cai_customer_marketing_profile` AS cmp ON mp.customer_id = cmp.customer_id\n",
"\"\"\"\n",
"result = RunSpannerQuery(sql)\n",
"total_rows = int(result.iloc[0]['row_count'])\n",
"\n",
"# Calculate the number of batches.\n",
"num_batches = (total_rows + batch_size - 1) // batch_size\n",
"\n",
"# Insert the data in batches.\n",
"for i in range(num_batches):\n",
" print(f\"Inserting batch {i + 1} of {num_batches}\")\n",
"\n",
" start_index = i * batch_size\n",
"\n",
" sql = f\"\"\"INSERT INTO cai_customer_360 (\n",
" customer_id,\n",
" customer_marketing_insights,\n",
" benefits_sought,\n",
" browsing_behavior,\n",
" loyalty_status,\n",
" occasion_timing,\n",
" purchase_history,\n",
" spending_habits,\n",
" usage_frequency,\n",
" user_status,\n",
" at_risk_customers,\n",
" first_time_customers,\n",
" former_customers,\n",
" inactive_customers,\n",
" loyal_advocates,\n",
" new_leads,\n",
" potential_customers,\n",
" repeat_customers,\n",
" age,\n",
" education,\n",
" ethnicity,\n",
" family_size,\n",
" gender,\n",
" generation,\n",
" income,\n",
" language,\n",
" marital_status,\n",
" occupation,\n",
" city,\n",
" climate,\n",
" country,\n",
" population_density,\n",
" region,\n",
" time_zone,\n",
" urban_rural,\n",
" challenges,\n",
" goals,\n",
" pain_points,\n",
" priorities,\n",
" specific_needs,\n",
" attitudes,\n",
" hobbies,\n",
" interests,\n",
" lifestyle,\n",
" motivations,\n",
" personality,\n",
" social_class,\n",
" customer_values,\n",
" adoption_rate,\n",
" browsers,\n",
" devices,\n",
" internet_connectivity,\n",
" operating_systems,\n",
" social_media_platforms,\n",
" software,\n",
" tech_savviness,\n",
" cost_benefit_analysis,\n",
" perceived_value,\n",
" price_sensitivity,\n",
" willingness_to_pay,\n",
" children,\n",
" chocolate_preferences,\n",
" content_interaction,\n",
" customer_age,\n",
" facebook_bio,\n",
" facebook_engagement,\n",
" facebook_handle,\n",
" instagram_bio,\n",
" instagram_engagement,\n",
" instagram_handle,\n",
" linkedin_bio,\n",
" linkedin_engagement,\n",
" linkedin_handle,\n",
" martial_status,\n",
" solicated_buying_habits,\n",
" sports,\n",
" tiktok_bio,\n",
" tiktok_handle,\n",
" twitter_bio,\n",
" twitter_engagement,\n",
" twitter_handle,\n",
" youtube_bio,\n",
" youtube_handle,\n",
" customer_service_interactions,\n",
" average_amount_spent_per_order,\n",
" last_order_date,\n",
" latest_review_sentiment,\n",
" most_frequent_purchase_location,\n",
" negative_review_percentage,\n",
" neutral_review_percentage,\n",
" positive_review_percentage,\n",
" purchase_locations,\n",
" top_3_favorite_menu_items,\n",
" total_amount_spent,\n",
" total_orders,\n",
" total_reviews\n",
" )\n",
" SELECT\n",
" mp.customer_id,\n",
" cmp.customer_marketing_insights,\n",
" mp.benefits_sought,\n",
" mp.browsing_behavior,\n",
" mp.loyalty_status,\n",
" mp.occasion_timing,\n",
" mp.purchase_history,\n",
" mp.spending_habits,\n",
" mp.usage_frequency,\n",
" mp.user_status,\n",
" mp.at_risk_customers,\n",
" mp.first_time_customers,\n",
" mp.former_customers,\n",
" mp.inactive_customers,\n",
" mp.loyal_advocates,\n",
" mp.new_leads,\n",
" mp.potential_customers,\n",
" mp.repeat_customers,\n",
" mp.age,\n",
" mp.education,\n",
" mp.ethnicity,\n",
" mp.family_size,\n",
" mp.gender,\n",
" mp.generation,\n",
" mp.income,\n",
" mp.language,\n",
" mp.marital_status,\n",
" mp.occupation,\n",
" mp.city,\n",
" mp.climate,\n",
" mp.country,\n",
" mp.population_density,\n",
" mp.region,\n",
" mp.time_zone,\n",
" mp.urban_rural,\n",
" mp.challenges,\n",
" mp.goals,\n",
" mp.pain_points,\n",
" mp.priorities,\n",
" mp.specific_needs,\n",
" mp.attitudes,\n",
" mp.hobbies,\n",
" mp.interests,\n",
" mp.lifestyle,\n",
" mp.motivations,\n",
" mp.personality,\n",
" mp.social_class,\n",
" mp.customer_values,\n",
" mp.adoption_rate,\n",
" mp.browsers,\n",
" mp.devices,\n",
" mp.internet_connectivity,\n",
" mp.operating_systems,\n",
" mp.social_media_platforms,\n",
" mp.software,\n",
" mp.tech_savviness,\n",
" mp.cost_benefit_analysis,\n",
" mp.perceived_value,\n",
" mp.price_sensitivity,\n",
" mp.willingness_to_pay,\n",
" cp.children,\n",
" cp.chocolate_preferences,\n",
" cp.content_interaction,\n",
" cp.customer_age,\n",
" cp.facebook_bio,\n",
" cp.facebook_engagement,\n",
" cp.facebook_handle,\n",
" cp.instagram_bio,\n",
" cp.instagram_engagement,\n",
" cp.instagram_handle,\n",
" cp.linkedin_bio,\n",
" cp.linkedin_engagement,\n",
" cp.linkedin_handle,\n",
" cp.martial_status,\n",
" cp.solicated_buying_habits,\n",
" cp.sports,\n",
" cp.tiktok_bio,\n",
" cp.tiktok_handle,\n",
" cp.twitter_bio,\n",
" cp.twitter_engagement,\n",
" cp.twitter_handle,\n",
" cp.youtube_bio,\n",
" cp.youtube_handle,\n",
" cp.customer_service_interactions,\n",
" cl.average_amount_spent_per_order,\n",
" cl.last_order_date,\n",
" cl.latest_review_sentiment,\n",
" cl.most_frequent_purchase_location,\n",
" cl.negative_review_percentage,\n",
" cl.neutral_review_percentage,\n",
" cl.positive_review_percentage,\n",
" cl.purchase_locations,\n",
" cl.top_3_favorite_menu_items,\n",
" cl.total_amount_spent,\n",
" cl.total_orders,\n",
" cl.total_reviews\n",
" FROM\n",
" `cai_customer_marketing_profile_segments` AS mp\n",
" INNER JOIN `cai_customer_marketing_profile_data` AS cp ON mp.customer_id = cp.customer_id\n",
" INNER JOIN `cai_customer_marketing_profile_loyalty` AS cl ON mp.customer_id = cl.customer_id\n",
" INNER JOIN `cai_customer_marketing_profile` AS cmp ON mp.customer_id = cmp.customer_id\n",
"    ORDER BY mp.customer_id -- deterministic order so OFFSET batching doesn't skip or repeat rows\n",
"    LIMIT {batch_size} OFFSET {start_index}\n",
" \"\"\"\n",
" result = RunSpannerQuery(sql)"
]
},
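{
"cell_type": "markdown",
"metadata": {},
"source": [
"The batch count above uses integer ceiling division, `(total_rows + batch_size - 1) // batch_size`, so a partial final batch still gets its own iteration. For example:\n",
"\n",
"```python\n",
"batch_size = 800\n",
"total_rows = 1950\n",
"num_batches = (total_rows + batch_size - 1) // batch_size  # 3 batches: 800 + 800 + 350 rows\n",
"```"
]
},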
{
"cell_type": "markdown",
"metadata": {
"id": "VSXa8OjDceAY"
},
"source": [
"### <font color='#4285f4'>Load or Generate Social Data</font>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "h0hb79BWTRl7"
},
"source": [
"The social graph data is generated by the code below. You can re-generate it yourself with this code (which takes a few hours) by setting `regenerate_social_data = True`, or you can load a pre-generated dataset from BigQuery by keeping the default value of `regenerate_social_data = False`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "XXexOBX8T5As"
},
"outputs": [],
"source": [
"# Choose whether to regenerate social data\n",
"regenerate_social_data = False\n",
"\n",
"# Load data from BQ if regenerate_social_data = False\n",
"if not regenerate_social_data:\n",
" spanner_options = f\"{{'table': 'cai_edge_customer_follows_customer'}}\"\n",
" spanner_options = spanner_options.replace(\"'\", '\"')\n",
"\n",
" export_statement = f\"\"\"EXPORT DATA OPTIONS (\n",
" uri='https://spanner.googleapis.com/projects/{project_id}/instances/{instance_id}/databases/{database_id}',\n",
" format='CLOUD_SPANNER',\n",
" spanner_options='''{spanner_options}'''\n",
" )\n",
" AS SELECT * FROM `{bq_dataset}.spanner_social_data`;\"\"\"\n",
"\n",
" result = RunBQQuery(export_statement)\n",
" result\n",
"\n",
"# Generate data if regenerate_social_data = True\n",
"else:\n",
" # Get customer_ids of influencers\n",
" sql = f\"\"\"SELECT customer_id FROM `{bq_dataset}.customer_360`\n",
" WHERE customer_marketing_insights LIKE '%an influencer%'\"\"\"\n",
" influencers = RunBQQuery(sql)\n",
" influencers = list(influencers['customer_id'].values)\n",
"\n",
" # Get customer_ids of non-influencers\n",
" sql = f\"\"\"SELECT customer_id FROM `{bq_dataset}.customer_360`\n",
" WHERE customer_marketing_insights NOT LIKE '%an influencer%'\"\"\"\n",
" followers = RunBQQuery(sql)\n",
" followers = list(followers['customer_id'].values)\n",
"\n",
" # Generate follower data for non-influencers\n",
" GenerateFollowerData(\n",
" user_ids=followers, # Users we're generating social data for\n",
" follower_ids=followers, # Regular users who are not influencers\n",
" influencer_ids=influencers, # Users who are influencers\n",
"        min_followers=5,                                     # Min number of regular (non-influencer) followers per user in this batch\n",
"        max_followers=200,                                   # Max number of regular followers per user in this batch\n",
"        min_influencer_followers=0,                          # Min number of influencer followers per user in this batch\n",
"        max_influencer_followers=10,                         # Max number of influencer followers per user in this batch\n",
" mutation_limit=1000, # Max mutations sent to Spanner as a single transaction\n",
" )\n",
"\n",
" # Generate follower data for influencers\n",
" GenerateFollowerData(\n",
" user_ids=influencers,\n",
" follower_ids=followers,\n",
" influencer_ids=influencers,\n",
" min_followers=1000,\n",
" max_followers=int(len(followers) * .8),\n",
" min_influencer_followers=int(len(influencers) * .25),\n",
" max_influencer_followers=int(len(influencers) * .8),\n",
" mutation_limit=1000\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "NOniR1BaSrZL"
},
"outputs": [],
"source": [
"# Explore the social data\n",
"\n",
"# Count of rows in the cai_edge_customer_follows_customer table\n",
"sql = f\"\"\"SELECT count(*) AS row_count\n",
"FROM cai_edge_customer_follows_customer;\"\"\"\n",
"result = RunSpannerQuery(sql)\n",
"print(result)\n",
"\n",
"# Followers per user\n",
"sql = f\"\"\"SELECT customer_id, count(customer_id) AS follower_count\n",
"FROM cai_edge_customer_follows_customer\n",
"GROUP BY customer_id\n",
"ORDER BY 2 DESC;\"\"\"\n",
"result = RunSpannerQuery(sql)\n",
"print(result)\n",
"\n",
"# Ensure customers all have social data\n",
"sql = f\"\"\"SELECT COUNT(*) as row_count FROM cai_customer c\n",
"WHERE customer_id NOT IN (\n",
" SELECT DISTINCT customer_id FROM\n",
" cai_edge_customer_follows_customer\n",
" )\n",
"\"\"\"\n",
"result = RunSpannerQuery(sql)\n",
"print(result)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "MvGD1dopWQ9_"
},
"source": [
"## <font color='#4285f4'>Query Spanner Graph</font>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "wc7nwDrebEq3"
},
"source": [
"### Get Personalized Menu Recommendations via Collaborative Filtering\n",
"\n",
"> NOTE: If the query below does not return any results, try with a different `customer_id`. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "pu51du_bwNgc"
},
"outputs": [],
"source": [
"# Collaborative filtering\n",
"# Find recommended menu items for a customer (customer_id = 120) based on the\n",
"# preferences of similar customers who have also rated the same item poorly.\n",
"\n",
"sql = \"\"\"GRAPH chocolate_ai_graph\n",
"MATCH (c1:cai_customer)-[r1:Rates {rating: 1}]->(m1:cai_menu)<-[r2:Rates {rating: 1}]-(c2:cai_customer)-[r3:Rates {rating: 5}]->(m2:cai_menu)\n",
"WHERE c1.customer_id <> c2.customer_id\n",
"AND c1.customer_id = 120 -- Check recommendations for different customers by changing this customer_id\n",
"AND m1.menu_name = 'Lavender Sea Salt Chocolate'\n",
"AND m2.menu_name <> 'Lavender Sea Salt Chocolate'\n",
"RETURN c1.customer_id AS current_shopper, r1.rating as disliked_item_rating, m1.menu_name, m2.menu_name AS recommended_menu_item, c2.customer_id AS similar_shopper, r3.rating as recommended_item_rating;\"\"\"\n",
"\n",
"result = RunSpannerQuery(sql)\n",
"result"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "46kAgV4vcuI6"
},
"outputs": [],
"source": [
"# We get lots of recommendations, so let's combine our graph query with traditional SQL\n",
"# to count the number of recommendations for each menu_item and recommend the top 3\n",
"\n",
"sql = \"\"\"SELECT recommended_menu_item, count(*) recommended_count FROM GRAPH_TABLE (\n",
" chocolate_ai_graph\n",
" MATCH (c1:cai_customer)-[r1:Rates {rating: 1}]->(m1:cai_menu)<-[r2:Rates {rating: 1}]-(c2:cai_customer)-[r3:Rates {rating: 5}]->(m2:cai_menu)\n",
" WHERE c1.customer_id <> c2.customer_id\n",
" AND c1.customer_id = 120\n",
" AND m1.menu_name = 'Lavender Sea Salt Chocolate'\n",
" AND m2.menu_name <> 'Lavender Sea Salt Chocolate'\n",
" RETURN c1.customer_id AS current_shopper, r1.rating as disliked_item_rating, m1.menu_name, m2.menu_name AS recommended_menu_item, c2.customer_id AS similar_shopper, r3.rating as recommended_item_rating)\n",
"as gt\n",
"GROUP BY recommended_menu_item\n",
"ORDER BY 2 DESC\n",
"LIMIT 3;\n",
"\"\"\"\n",
"\n",
"result = RunSpannerQuery(sql)\n",
"result"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Get Recommendations for Ambiguous Reviews with Full-text Search"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Show menu item with hard-to-spell name\n",
"sql = \"\"\"SELECT * FROM cai_menu\n",
"WHERE menu_name = 'Lavender Sea Salt Chocolate'\n",
"\"\"\"\n",
"\n",
"result = RunSpannerQuery(sql)\n",
"result"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# There are matches in the graph for \"Lavender Sea Salt Chocolate\" and customer_id 120\n",
"\n",
"sql = \"\"\"GRAPH chocolate_ai_graph\n",
"MATCH (c1:cai_customer)-[r1:Rates {rating: 1}]->(m1:cai_menu)<-[r2:Rates {rating: 1}]-(c2:cai_customer)-[r3:Rates {rating: 5}]->(m2:cai_menu)\n",
"WHERE c1.customer_id <> c2.customer_id\n",
"AND c1.customer_id = 120 -- Check recommendations for different customers by changing this customer_id\n",
"AND m1.menu_name = 'Lavender Sea Salt Chocolate'\n",
"AND m2.menu_name <> 'Lavender Sea Salt Chocolate'\n",
"RETURN c1.customer_id AS current_shopper, r1.rating as disliked_item_rating, m1.menu_name, m2.menu_name AS recommended_menu_item, c2.customer_id AS similar_shopper, r3.rating as recommended_item_rating;\"\"\"\n",
"\n",
"result = RunSpannerQuery(sql)\n",
"result"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# BUT, there are no matches in the graph when the menu item is misspelled (\"Lavinder\" instead of \"Lavender\")\n",
"\n",
"sql = \"\"\"GRAPH chocolate_ai_graph\n",
"MATCH (c1:cai_customer)-[r1:Rates {rating: 1}]->(m1:cai_menu)<-[r2:Rates {rating: 1}]-(c2:cai_customer)-[r3:Rates {rating: 5}]->(m2:cai_menu)\n",
"WHERE c1.customer_id <> c2.customer_id\n",
"AND c1.customer_id = 120 -- Check recommendations for different customers by changing this customer_id\n",
"AND m1.menu_name = 'Lavinder Sea Salt Chocolate'\n",
"AND m2.menu_name <> 'Lavinder Sea Salt Chocolate'\n",
"RETURN c1.customer_id AS current_shopper, r1.rating as disliked_item_rating, m1.menu_name, m2.menu_name AS recommended_menu_item, c2.customer_id AS similar_shopper, r3.rating as recommended_item_rating;\"\"\"\n",
"\n",
"result = RunSpannerQuery(sql)\n",
"result\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Full-text search matches both misspelled and incomplete product names ('lavinder salt' matches 'Lavender Sea Salt Chocolate')\n",
"\n",
"sql = \"\"\"GRAPH chocolate_ai_graph\n",
"MATCH (c1:cai_customer)-[r1:Rates {rating: 1}]->(m1:cai_menu)<-[r2:Rates {rating: 1}]-(c2:cai_customer)-[r3:Rates {rating: 5}]->(m2:cai_menu)\n",
"WHERE c1.customer_id <> c2.customer_id\n",
"AND c1.customer_id = 120 -- Check recommendations for different customers by changing this customer_id\n",
"AND SEARCH_NGRAMS(m1.menu_name_token, 'lavinder salt')\n",
"AND NOT SEARCH_NGRAMS(m2.menu_name_token, 'lavinder salt')\n",
"RETURN c1.customer_id AS current_shopper, r1.rating as disliked_item_rating, m1.menu_name, m2.menu_name AS recommended_menu_item, c2.customer_id AS similar_shopper, r3.rating as recommended_item_rating;\"\"\"\n",
"\n",
"result = RunSpannerQuery(sql)\n",
"result\n"
]
},
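{
"cell_type": "markdown",
"metadata": {},
"source": [
"`SEARCH_NGRAMS` tolerates misspellings because it matches overlapping character n-grams rather than whole words, so 'lavinder' and 'lavender' still share several trigrams. A quick illustration in plain Python (not Spanner's exact tokenizer):\n",
"\n",
"```python\n",
"def trigrams(s):\n",
"    return {s[i:i + 3] for i in range(len(s) - 2)}\n",
"\n",
"# Shared trigrams: 'lav', 'nde', 'der'\n",
"shared = trigrams('lavinder') & trigrams('lavender')\n",
"```"
]
},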
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Get Recommendations When No Product is Mentioned with Vector Search\n",
"\n",
"> NOTE: It may take a minute or two for permissions to be set up after creating the models in the cells above. This happens automatically, so wait a couple of minutes and retry if you get permission errors."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Get recommendations based on preferences from other users with similar profiles\n",
"\n",
"reviewer_id = 120\n",
"\n",
"sql = f\"\"\"GRAPH chocolate_ai_graph\n",
"MATCH (c:cai_customer)-[r:Rates {{rating: 5}}]->(m:cai_menu)\n",
"WHERE c.customer_id IN (\n",
" SELECT customer_id\n",
" FROM cai_customer_marketing_profile\n",
" WHERE customer_id <> {reviewer_id}\n",
" ORDER BY COSINE_DISTANCE(\n",
" customer_segmentation_data_embedding,\n",
" (SELECT customer_segmentation_data_embedding FROM cai_customer_marketing_profile WHERE customer_id = {reviewer_id})\n",
" )\n",
" LIMIT 10\n",
" )\n",
"RETURN m.menu_name AS recommended_menu_item, c.customer_id AS similar_shopper, r.rating as recommended_item_rating;\"\"\"\n",
"\n",
"result = RunSpannerQuery(sql)\n",
"result"
]
},
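{
"cell_type": "markdown",
"metadata": {},
"source": [
"`COSINE_DISTANCE` ranks customers by the angle between their embedding vectors: the distance is `1 - cosine_similarity`, so 0 means the embeddings point in the same direction and 2 means they are opposite. A minimal NumPy equivalent of the distance used in the `ORDER BY` above:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def cosine_distance(a, b):\n",
"    # 0.0 = same direction, 1.0 = orthogonal, 2.0 = opposite\n",
"    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))\n",
"```"
]
},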
{
"cell_type": "markdown",
"metadata": {
"id": "A_c3kFnKa5EJ"
},
"source": [
"### Find Potential Brand Partners via Social Graph"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "exKeuxEsaBQn"
},
"outputs": [],
"source": [
"# Find influencers who are followed by at-risk customers so that we can consider\n",
"# a brand partnership with the influencers and retain the at-risk customers.\n",
"\n",
"sql = \"\"\"GRAPH chocolate_ai_graph\n",
"MATCH (c360_1:cai_customer_360)<-[:HasMarketingProfile]-(c1:cai_customer)<-[:Follows]-(c2:cai_customer)-[:HasMarketingProfile]->(c360_2:cai_customer_360)\n",
"WHERE c360_1.loyalty_status LIKE 'Loyal%'\n",
" AND c360_2.loyalty_status = 'At-Risk'\n",
"RETURN c1.customer_name, c1.customer_id, count(c2) AS num_at_risk_followers\n",
"GROUP BY c1.customer_id, c1.customer_name\n",
"ORDER BY num_at_risk_followers DESC\n",
"LIMIT 10;\"\"\"\n",
"\n",
"result = RunSpannerQuery(sql)\n",
"result"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "9BC6L13CZR2b"
},
"outputs": [],
"source": [
"# Now let's use traditional SQL to enrich the results to determine which\n",
"# influencers would make the best partners\n",
"\n",
"sql = \"\"\"WITH gt AS (\n",
" SELECT * FROM GRAPH_TABLE (chocolate_ai_graph\n",
" MATCH (c360_1:cai_customer_360)<-[:HasMarketingProfile]-(c1:cai_customer)<-[:Follows]-(c2:cai_customer)-[:HasMarketingProfile]->(c360_2:cai_customer_360)\n",
" WHERE c360_1.loyalty_status LIKE 'Loyal%'\n",
" AND c360_2.loyalty_status = 'At-Risk'\n",
" RETURN c1.customer_name, c1.customer_id, count(c2) AS num_at_risk_followers\n",
" GROUP BY c1.customer_id, c1.customer_name\n",
" ORDER BY num_at_risk_followers DESC\n",
" LIMIT 10)\n",
"), follower_count AS (\n",
" SELECT followed_customer_id, COUNT(followed_customer_id) AS total_follower_count\n",
" FROM cai_edge_customer_follows_customer\n",
" WHERE followed_customer_id IN (\n",
" SELECT customer_id FROM gt\n",
" )\n",
" GROUP BY followed_customer_id\n",
" )\n",
" SELECT gt.customer_name AS influencer\n",
" , gt.customer_id AS influencer_id\n",
" , gt.num_at_risk_followers\n",
" , tfc.total_follower_count\n",
" , c360.loyalty_status AS influencer_loyalty_status\n",
" , c360.age\n",
" , c360.education\n",
" , c360.gender\n",
" , c360.marital_status\n",
" , c360.occupation\n",
" , c360.children\n",
" , c360.facebook_bio\n",
" , c360.facebook_engagement\n",
" , c360.facebook_handle\n",
" , c360.instagram_bio\n",
" , c360.instagram_engagement\n",
" , c360.instagram_handle\n",
" , c360.linkedin_bio\n",
" , c360.linkedin_engagement\n",
" , c360.linkedin_handle\n",
" , c360.tiktok_bio\n",
" , c360.tiktok_handle\n",
" , c360.twitter_bio\n",
" , c360.twitter_engagement\n",
" , c360.twitter_handle\n",
" , c360.youtube_bio\n",
" , c360.youtube_handle\n",
" FROM gt\n",
" JOIN cai_customer_360 AS c360 ON gt.customer_id = c360.customer_id\n",
" JOIN follower_count AS tfc ON gt.customer_id = tfc.followed_customer_id\n",
" ORDER BY gt.num_at_risk_followers DESC;\n",
"\"\"\"\n",
"\n",
"result = RunSpannerQuery(sql)\n",
"result"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "KJg4nL3Wayfs"
},
"source": [
"### <font color='#4285f4'>Visualize the Social Graph</font>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "mt23Mjl271In"
},
"outputs": [],
"source": [
"# Helper function for visualization with pyviz\n",
"def visualize_graph(query_result):\n",
" \"\"\"\n",
" Visualizes the results of a Spanner graph query with influencers and followers using pyvis.\n",
"\n",
" Args:\n",
" query_result: A pandas DataFrame containing the query results.\n",
" \"\"\"\n",
"\n",
" net = Network(height='750px', width='100%', bgcolor='#222222', font_color='white', notebook=True, filter_menu=True, cdn_resources='in_line')\n",
"\n",
" # Convert DataFrame to a list of tuples for easier handling\n",
" graph_data = list(query_result[['influencer_id', 'influencer_name', 'follower_id', 'follower_name']].itertuples(index=False, name=None))\n",
"\n",
" # Add nodes and edges\n",
" for influencer_id, influencer_name, follower_id, follower_name in graph_data:\n",
" # Add nodes with labels and titles\n",
" net.add_node(follower_id, label=f\"Follower: {follower_name} ({follower_id})\", title=f\"Follower: {follower_name} ({follower_id})\", color='red')\n",
" net.add_node(influencer_id, label=f\"Influencer: {influencer_name} ({influencer_id})\", title=f\"Influencer: {influencer_name} ({influencer_id})\", color='green')\n",
"\n",
" # Add edge\n",
" net.add_edge(influencer_id, follower_id, title=\"Follows\")\n",
"\n",
" # Customize visualization options (optional)\n",
" net.force_atlas_2based(overlap=1)\n",
" net.show_buttons(filter_=['physics']) # Useful for tuning settings before finalizing graph\n",
" #net.toggle_stabilization(True)\n",
" net.toggle_physics(True)\n",
"\n",
" return net"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Vw68FzXFmx64"
},
"outputs": [],
"source": [
"# Find influencers who are followed by at-risk customers so that we can consider\n",
"# a brand partnership with the influencers and retain the at-risk customers.\n",
"sql = \"\"\"GRAPH chocolate_ai_graph\n",
"MATCH (c360_1:cai_customer_360)<-[:HasMarketingProfile]-(c1:cai_customer)<-[:Follows]-(c2:cai_customer)-[:HasMarketingProfile]->(c360_2:cai_customer_360)\n",
"WHERE c360_1.loyalty_status LIKE 'Loyal%'\n",
" AND c360_2.loyalty_status = 'At-Risk'\n",
"RETURN c1.customer_name, c1.customer_id, count(c2) AS num_at_risk_followers\n",
"GROUP BY c1.customer_id, c1.customer_name\n",
"ORDER BY num_at_risk_followers DESC\n",
"LIMIT 10;\"\"\"\n",
"\n",
"result = RunSpannerQuery(sql)\n",
"influencer_ids = list(result['customer_id'])\n",
"in_string = str(set(influencer_ids)).replace(\"'\", \"\").replace(\"{\", \"(\").replace(\"}\", \")\")\n",
"\n"
]
},
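{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell above turns the Python set of influencer IDs into a SQL `IN` list via string replacement. As a minimal sketch (with hypothetical IDs), an equivalent and more predictable construction joins the sorted, deduplicated IDs directly:\n",
"\n",
"```python\n",
"ids = [204, 17, 92, 17]\n",
"\n",
"# Deduplicate, sort for stable output, and wrap in parentheses for a SQL IN clause\n",
"in_string = '(' + ', '.join(str(i) for i in sorted(set(ids))) + ')'\n",
"print(in_string)  # (17, 92, 204)\n",
"```"
]
},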
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "aVVGC7YjwJFN"
},
"outputs": [],
"source": [
"# Get detailed graph nodes and edges for visualzation\n",
"sql = f\"\"\"SELECT\n",
" ce.followed_customer_id AS influencer_id,\n",
" ic.customer_name AS influencer_name,\n",
" ce.customer_id AS follower_id,\n",
" fc.customer_name AS follower_name,\n",
"FROM cai_edge_customer_follows_customer ce\n",
"JOIN cai_customer_360 c360 ON ce.customer_id = c360.customer_id\n",
"JOIN cai_customer ic ON ce.followed_customer_id = ic.customer_id\n",
"JOIN cai_customer fc ON ce.customer_id = fc.customer_id\n",
"WHERE c360.loyalty_status = 'At-Risk'\n",
"AND followed_customer_id IN {in_string}\n",
"\"\"\"\n",
"\n",
"result = RunSpannerQuery(sql)\n",
"result\n",
"\n",
"# Visualize the results\n",
"# https://pyvis.readthedocs.io/en/latest/tutorial.html#using-pyvis-within-jupyter-notebook\n",
"net = visualize_graph(result)\n",
"net.show(\"customer_graph.html\")\n",
"display(HTML('customer_graph.html'))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "QMzqzqjzuaoN"
},
"source": [
"## <font color='#4285f4'>Query Spanner as an External Dataset in BigQuery</font>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CzuIce5gujhA"
},
"source": [
"An external dataset is a connection between BigQuery and an external data source at the dataset level. It lets you query transactional data in Spanner databases with GoogleSQL without moving data from Spanner to BigQuery storage.\n",
"\n",
"The tables in an external dataset are automatically populated from the tables in the corresponding external data source. You can query these tables directly in BigQuery, but you cannot make modifications, additions, or deletions. However, any updates that you make in the external data source are automatically reflected in BigQuery.\n",
"\n",
"When using external datasets, Spanner's Data Boost feature is always used and you don't have to enable it manually. Data Boost is a fully managed, serverless feature that provides independent compute resources for supported Spanner workloads. Data Boost lets you execute analytics queries and data exports with near-zero impact to existing workloads on the provisioned Spanner instance. Data Boost lets you run federated queries with independent compute capacity separate from your provisioned instances to avoid impacting existing workloads on Spanner. Data Boost is most impactful when you run complex ad hoc queries, or when you want to process large amounts of data without impacting the existing Spanner workload. Running federated queries with Data Boost can lead to significantly lower CPU consumption, and in some cases, lower query latency.\n",
"\n",
"You can read more about external datasets in BigQuery [here](https://cloud.google.com/bigquery/docs/spanner-external-datasets) and Spanner Data Boost [here](https://cloud.google.com/spanner/docs/databoost/databoost-overview)."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "xCWT6kvZyxPz"
},
"source": [
"### Create External Schema"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "NSde3v2nuaJE"
},
"outputs": [],
"source": [
"# Create the external schema\n",
"spanner_external_schema_name = 'chocolate_ai_spanner_external_schema'\n",
"\n",
"# This uses predefined roles for Spanner. To use a customer role, see the doc below.\n",
"# https://cloud.google.com/bigquery/docs/connect-to-spanner#bq:~:text=%22-,database_role,-%22%3A%20(Optional)%20If\n",
"sql = f\"\"\"CREATE EXTERNAL SCHEMA {spanner_external_schema_name}\n",
" OPTIONS (\n",
" external_source = 'google-cloudspanner:///projects/{project_id}/instances/{instance_id}/databases/{database_id}',\n",
" location = '{bigquery_location}');\"\"\"\n",
"\n",
"result = RunBQQuery(sql)\n",
"result"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "T0Rl9b_QzYTm"
},
"outputs": [],
"source": [
"# Confirm the external schema is created\n",
"sql = f\"\"\"SELECT *\n",
"FROM `{project_id}.region-{bigquery_location}.INFORMATION_SCHEMA.SCHEMATA`\n",
"WHERE schema_name = '{spanner_external_schema_name}'\"\"\"\n",
"\n",
"result = RunBQQuery(sql)\n",
"result"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "aZ2lvj8X7a0v"
},
"outputs": [],
"source": [
"# Test querying Spanner data through BigQuery\n",
"sql = f\"\"\"SELECT *\n",
"FROM `{project_id}.{spanner_external_schema_name}.cai_customer` LIMIT 5\n",
"\"\"\"\n",
"\n",
"result = RunBQQuery(sql)\n",
"result"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "xar6hMmb3FX_"
},
"source": [
"### Query Real-time Orders and Historical Orders Separately"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "SIoe8-063fZe"
},
"outputs": [],
"source": [
"# Set timestamps to separate current order in Spanner from historic orders in BigQuery\n",
"current_datetime = datetime.now(timezone.utc)\n",
"\n",
"bq_from_datetime = current_datetime - timedelta(days=365)\n",
"bq_from_datetime = bq_from_datetime.strftime(\"%Y-%m-%d %H:%M:%S%z\")\n",
"\n",
"bq_to_datetime = current_datetime - timedelta(days=30)\n",
"bq_to_datetime = bq_to_datetime.strftime(\"%Y-%m-%d %H:%M:%S%z\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "bw1DdDvq6gkq"
},
"outputs": [],
"source": [
"# Query last 30 days of order data directly from Spanner\n",
"sql = f\"\"\"SELECT DATE_TRUNC(CAST(order_completion_datetime AS DATE), WEEK) AS week,\n",
"COUNT(*)\n",
"FROM cai_order\n",
"WHERE order_completion_datetime BETWEEN '{bq_to_datetime}' AND '{current_datetime}'\n",
"GROUP BY week\n",
"ORDER BY 1\n",
"\"\"\"\n",
"\n",
"result = RunSpannerQuery(sql)\n",
"result"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "EFXEnjNA0e6L"
},
"outputs": [],
"source": [
"# Query preceding 11 months of data from BigQuery\n",
"sql = f\"\"\"SELECT DATE_TRUNC(order_completion_datetime, WEEK) AS week,\n",
" COUNT(*)\n",
"FROM `{bq_dataset}.order`\n",
"WHERE order_completion_datetime BETWEEN '{bq_from_datetime}' AND '{bq_to_datetime}'\n",
"GROUP BY WEEK;\n",
"\"\"\"\n",
"result = RunBQQuery(sql)\n",
"result"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "wso13WRFfMgP"
},
"source": [
"### Query Real-time Orders and Historical Orders Together"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Evcafl6h_9S5"
},
"outputs": [],
"source": [
"# Query Spanner and BigQuery Together with External Dataset\n",
"sql = f\"\"\"WITH union_result AS (\n",
" SELECT DATE_TRUNC(CAST(order_completion_datetime AS DATE), WEEK) AS week,\n",
" COUNT(*) AS order_count\n",
" FROM `{project_id}.{spanner_external_schema_name}.cai_order` span_o\n",
" WHERE span_o.order_completion_datetime BETWEEN '{bq_to_datetime}' AND '{current_datetime}'\n",
" GROUP BY week\n",
" UNION ALL\n",
" SELECT DATE_TRUNC(CAST(order_completion_datetime AS DATE), WEEK) AS week,\n",
" COUNT(*) AS order_count\n",
" FROM `{bq_dataset}.order`\n",
" WHERE order_completion_datetime BETWEEN '{bq_from_datetime}' AND '{bq_to_datetime}'\n",
" GROUP BY week\n",
" ORDER BY week\n",
" )\n",
" SELECT week, SUM(order_count)\n",
" FROM union_result\n",
" GROUP BY week\n",
" ORDER BY week;\n",
"\"\"\"\n",
"\n",
"result = RunBQQuery(sql)\n",
"result"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "QEfR2lXHCZRP"
},
"source": [
"## <font color='#4285f4'>Run Vector Search in Spanner</font>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3yL-DJBsEun8"
},
"source": [
"### Test the Models"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Test the models created earlier in the notebook. \n",
"\n",
"> NOTE: It may take a minute or two for permissions to be setup after creating the models. This should happen automatically, so wait for a couple of minutes and retry if you get permission errors."
]
},
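{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since IAM permissions can take a minute or two to propagate after the models are created, a simple retry loop avoids rerunning cells by hand. This is only a sketch: `run_query` stands in for any query helper (such as `RunSpannerQuery`) that raises on failure, and the delay values are arbitrary.\n",
"\n",
"```python\n",
"import time\n",
"\n",
"def run_with_retry(run_query, sql, attempts=5, base_delay=30):\n",
"    # Retry transient failures (e.g. permission errors while IAM propagates),\n",
"    # waiting progressively longer between attempts.\n",
"    for attempt in range(attempts):\n",
"        try:\n",
"            return run_query(sql)\n",
"        except Exception:\n",
"            if attempt == attempts - 1:\n",
"                raise\n",
"            time.sleep(base_delay * (attempt + 1))\n",
"```"
]
},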
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "hp37uctzDEsJ"
},
"outputs": [],
"source": [
"# Test the embedding model\n",
"sql = \"\"\"SELECT embeddings.values from ML.PREDICT(MODEL EmbeddingsModel,\n",
" (SELECT customer_name as content FROM cai_customer where customer_id = 2555)\n",
")\"\"\"\n",
"\n",
"result = RunSpannerQuery(sql)\n",
"result\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "3mNimy60DlZo"
},
"outputs": [],
"source": [
"# Test the LLM model\n",
"sql = \"\"\"SELECT *\n",
"FROM ML.PREDICT(\n",
"MODEL LLMModel,\n",
"( SELECT\n",
"'What is so great about chocolate?' AS prompt),\n",
"STRUCT(256 AS maxOutputTokens))\"\"\"\n",
"\n",
"result = RunSpannerQuery(sql)\n",
"result"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2YJiq19kGeae"
},
"source": [
"### Run an Exact Nearest Neighbor (ENN) Vector Search Query in Spanner"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "G-AZvP4BEmx7"
},
"outputs": [],
"source": [
"# Search the cai_customer_marketing_profile for 'Young professional'\n",
"# This allows you to perform flexible semantic search on customer\n",
"# marketing insight data to better target specific customers with\n",
"# relevant marketing materials.\n",
"\n",
"# Define your search_phrase\n",
"search_phrase = \"Young professional\"\n",
"\n",
"# Get an embedding for your search_phrase\n",
"sql = f\"\"\"SELECT embeddings.values\n",
" FROM ML.PREDICT(\n",
" MODEL EmbeddingsModel,\n",
" (SELECT '{search_phrase}' as content))\n",
"\"\"\"\n",
"\n",
"result = RunSpannerQuery(sql)\n",
"embedding = result.iloc[0,0]\n",
"\n",
"# Run the ENN search\n",
"sql = f\"\"\"SELECT customer_id,\n",
" COSINE_DISTANCE(\n",
" customer_marketing_insights_embedding,\n",
" {embedding}\n",
" ) as dist,\n",
" customer_marketing_insights\n",
"FROM cai_customer_marketing_profile\n",
"ORDER BY dist\n",
"LIMIT 10;\"\"\"\n",
"\n",
"result = RunSpannerQuery(sql)\n",
"result"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "e7EKtiEKeG-M"
},
"source": [
"### OPTIONAL: Add an ANN Vector Index"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "rZSWcFHOeN_3"
},
"source": [
"The `COSINE_DISTANCE()` function used above does an Exact Nearest Neighbor search, where we compute the distance between the query vector and every vector in the table. This performs fine on our small dataset, but for larger datasets with Millions-Billions of vectors, this approach does not scale.\n",
"\n",
"We can significantly improve the performance of our vector query using approximate nearest neighbor (ANN) search in Spanner with the `APPROX_COSINE_DISTANCE()` function, delivering:\n",
"* Scale and speed: fast, high-recall search scaling to more than 10B vectors\n",
"* Operational simplicity: no need to copy your data to a specialized vector DB\n",
"* Consistency: the results are always fresh, reflecting the latest updates \n",
"\n",
"However, we need to make a few changes to our instance and table before we can create the index:\n",
"* Scale up the instance (ANN indexing is only supported on instances with at least 1 node)\n",
"* Create and load a new embedding column, specifying `vector_length=>768` in the column definition.\n",
"* Re-write our query to use the `APPROX_COSINE_DISTANCE()` function."
]
},
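{
"cell_type": "markdown",
"metadata": {},
"source": [
"For intuition, `COSINE_DISTANCE()` computes one minus the cosine similarity of two vectors; a minimal pure-Python sketch of the same arithmetic:\n",
"\n",
"```python\n",
"import math\n",
"\n",
"def cosine_distance(a, b):\n",
"    # 1 - (a . b) / (|a| * |b|): 0 for identical directions, 1 for orthogonal vectors\n",
"    dot = sum(x * y for x, y in zip(a, b))\n",
"    norm_a = math.sqrt(sum(x * x for x in a))\n",
"    norm_b = math.sqrt(sum(x * x for x in b))\n",
"    return 1.0 - dot / (norm_a * norm_b)\n",
"\n",
"print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # 0.0\n",
"print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0\n",
"```\n",
"\n",
"`APPROX_COSINE_DISTANCE()` trades a small amount of recall for speed by searching only a subset of index leaves rather than comparing against every row."
]
},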
{
"cell_type": "markdown",
"metadata": {
"id": "ndXmvHGkDc8f"
},
"source": [
"#### Scale Up Spanner Instance"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "HB81ENKhDh92"
},
"outputs": [],
"source": [
"# Note: ANN search index feature requires at least 1 full node while in preview,\n",
"# otherwise you'll get the following error:\n",
"# {'code': 12, 'message': 'VECTOR INDEX is not available in granular instances in preview'}\n",
"# Ref: https://cloud.google.com/spanner/docs/find-approximate-nearest-neighbors#:~:text=Spanner%20accelerates%20ANN%20vector%20searches,data%20and%20facilitate%20faster%20searches.\n",
"\n",
"# Scale up the Spanner instance\n",
"result = ScaleSpannerInstance(processing_units = 1000)\n",
"result\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "L5AHrkXLfKSZ"
},
"source": [
"#### Create and Load New Embedding Column"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "eWtkt_K4JKZG"
},
"outputs": [],
"source": [
"ddl_array = []\n",
"\n",
"# Create and load a new embedding column, specifying vector_length=>768 in the column definition.\n",
"ddl_array.append(\"\"\"ALTER TABLE cai_customer_marketing_profile ADD COLUMN IF NOT EXISTS customer_marketing_insights_embedding_span ARRAY<FLOAT64>(vector_length=>768)\n",
"\"\"\")\n",
"\n",
"result = RunSpannerDDL(ddl_array)\n",
"result\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "FQTVsIAnb1w6"
},
"outputs": [],
"source": [
"# There are too many rows in the cai_customer_marketing_profile table to update\n",
"# in one batch due to Spanner mutation limits (80,000) per transaction. We loop\n",
"# through batches of updates here as a workaround.\n",
"\n",
"# Define batch size\n",
"batch_size = 1000\n",
"\n",
"# Get the total number of rows to insert.\n",
"sql = f\"\"\"SELECT count(*) AS row_count\n",
" FROM `cai_customer_marketing_profile`\n",
"\"\"\"\n",
"result = RunSpannerQuery(sql)\n",
"total_rows = int(result.iloc[0]['row_count'])\n",
"\n",
"# Calculate the number of batches.\n",
"num_batches = (total_rows + batch_size - 1) // batch_size\n",
"\n",
"# Insert the data in batches.\n",
"for i in range(num_batches):\n",
" print(f\"Updating batch {i + 1} of {num_batches}\")\n",
"\n",
" start_index = i * batch_size\n",
" end_index = (i + 1) * batch_size\n",
"\n",
" sql = f\"\"\"\n",
" UPDATE cai_customer_marketing_profile\n",
" SET customer_marketing_insights_embedding_span = customer_marketing_insights_embedding\n",
" WHERE customer_id IN (\n",
" SELECT customer_id\n",
" FROM cai_customer_marketing_profile\n",
" ORDER BY customer_id -- Or any other suitable column for ordering\n",
" LIMIT {batch_size} OFFSET {start_index}\n",
" )\n",
" \"\"\"\n",
"\n",
" result = RunSpannerQuery(sql)"
]
},
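{
"cell_type": "markdown",
"metadata": {},
"source": [
"The batching above relies on ceiling division so the final, possibly partial batch is still covered. A small illustration of the same arithmetic:\n",
"\n",
"```python\n",
"def batch_ranges(total_rows, batch_size):\n",
"    # Ceiling division: number of batches needed to cover total_rows\n",
"    num_batches = (total_rows + batch_size - 1) // batch_size\n",
"    return [(i * batch_size, min((i + 1) * batch_size, total_rows))\n",
"            for i in range(num_batches)]\n",
"\n",
"print(batch_ranges(2500, 1000))  # [(0, 1000), (1000, 2000), (2000, 2500)]\n",
"```"
]
},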
{
"cell_type": "markdown",
"metadata": {
"id": "psxLUy--fPEq"
},
"source": [
"#### Create ANN Index\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "59SpbJvDfRFz"
},
"outputs": [],
"source": [
"ddl_array = []\n",
"\n",
"ddl_array.append(f\"\"\"CREATE VECTOR INDEX customer_marketing_insights_embedding_idx\n",
" ON cai_customer_marketing_profile(customer_marketing_insights_embedding_span)\n",
" WHERE customer_marketing_insights_embedding_span IS NOT NULL\n",
" OPTIONS (distance_type = 'COSINE', tree_depth = 2, num_leaves = 1000)\n",
"\"\"\")\n",
"\n",
"result = RunSpannerDDL(ddl_array)\n",
"result"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Ips7VrqY6-DZ"
},
"source": [
"#### Run an Approximate Nearest Neighbor (ANN) Search"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "LFu12L1fJaFR"
},
"outputs": [],
"source": [
"# Define your search_phrase\n",
"search_phrase = \"Young professional\"\n",
"\n",
"# Get an embedding for your search_phrase\n",
"sql = f\"\"\"SELECT embeddings.values\n",
" FROM ML.PREDICT(\n",
" MODEL EmbeddingsModel,\n",
" (SELECT '{search_phrase}' as content))\n",
"\"\"\"\n",
"\n",
"result = RunSpannerQuery(sql)\n",
"embedding = result.iloc[0,0]\n",
"\n",
"# Search the cai_customer_marketing_profile for search_phrase\n",
"sql = f\"\"\"SELECT customer_id,\n",
" APPROX_COSINE_DISTANCE(\n",
" {embedding},\n",
" customer_marketing_insights_embedding_span,\n",
" options=>JSON'{{\"num_leaves_to_search\": 1000}}'\n",
" ) AS distance,\n",
" customer_marketing_insights,\n",
" customer_marketing_insights_embedding_span\n",
"FROM cai_customer_marketing_profile@{{FORCE_INDEX=customer_marketing_insights_embedding_idx}}\n",
"WHERE customer_marketing_insights_embedding_span IS NOT NULL\n",
"ORDER BY distance\n",
"LIMIT 10;\"\"\"\n",
"\n",
"result = RunSpannerQuery(sql)\n",
"result"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "POcl0geG9oYH"
},
"source": [
"#### Drop ANN Index"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "xtn_z-UeJgn_"
},
"outputs": [],
"source": [
"# Drop ANN index\n",
"ddl_array = []\n",
"\n",
"ddl_array.append(\"\"\"DROP INDEX IF EXISTS customer_marketing_insights_embedding_idx\"\"\")\n",
"\n",
"result = RunSpannerDDL(ddl_array)\n",
"result"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "US89bHEH9vVM"
},
"source": [
"#### Scale Spanner Back Down"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "xwxQF-xSJi4V"
},
"outputs": [],
"source": [
"# Scale Spanner instance back down to save on cost\n",
"result = ScaleSpannerInstance(processing_units = 100)\n",
"result"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5Lm_nuAhD99V"
},
"source": [
"### <font color='#4285f4'>Have Gemini Explain Query Results</font>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "-TJq-bc8D9yr"
},
"outputs": [],
"source": [
"# Search the cai_customer_marketing_profile for 'Young professional'\n",
"# This is an exact nearest neighbor search, so this query can take ~60-120 seconds\n",
"\n",
"# Define your search_phrase\n",
"search_phrase = \"Young professional\"\n",
"\n",
"# Get an embedding for your search_phrase\n",
"sql = f\"\"\"SELECT embeddings.values\n",
" FROM ML.PREDICT(\n",
" MODEL EmbeddingsModel,\n",
" (SELECT '{search_phrase}' as content))\n",
"\"\"\"\n",
"\n",
"result = RunSpannerQuery(sql)\n",
"embedding = result.iloc[0,0]\n",
"\n",
"# Run the ENN search with Gemini explanation\n",
"sql = f\"\"\"WITH vector_query AS (\n",
" SELECT customer_id,\n",
" COSINE_DISTANCE(\n",
" customer_marketing_insights_embedding,\n",
" {embedding}\n",
" ) as dist,\n",
" customer_marketing_insights,\n",
" GENERATE_UUID() AS temp_row_id -- Use UUID for unique identifier\n",
" FROM cai_customer_marketing_profile\n",
" ORDER BY dist\n",
" LIMIT 5\n",
")\n",
"SELECT\n",
" content as explanation,\n",
" vq.customer_id,\n",
" vq.dist,\n",
" vq.customer_marketing_insights\n",
"FROM ML.PREDICT(\n",
" MODEL LLMModel,\n",
" (SELECT\n",
" 'Explain how the following vector query result is relevant to this search phrase: \"{search_phrase}\" Vector query result: ' || customer_marketing_insights AS prompt,\n",
" temp_row_id\n",
" FROM vector_query),\n",
" STRUCT(256 AS maxOutputTokens)) as ml\n",
"JOIN vector_query AS vq ON ml.temp_row_id = vq.temp_row_id;\n",
"\"\"\"\n",
"\n",
"result = RunSpannerQuery(sql)\n",
"result"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "42IxhtRRrvR-"
},
"source": [
"## <font color='#4285f4'>Clean Up</font>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "vW_Q2QIC5FCC"
},
"outputs": [],
"source": [
"# Delete external dataset\n",
"sql = f\"\"\"DROP SCHEMA IF EXISTS `chocolate_ai_spanner_external_schema`;\"\"\"\n",
"\n",
"result = RunBQQuery(sql)\n",
"result"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "gviBL9huW_T9"
},
"outputs": [],
"source": [
"# Delete instance\n",
"uri = f\"https://spanner.googleapis.com/v1/projects/{project_id}/instances/chocolate-ai-{project_id}\"\n",
"http_verb = \"DELETE\"\n",
"\n",
"response = restAPIHelper(uri, http_verb, {})\n",
"response"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ASQ2BPisXDA0"
},
"source": [
"## <font color='#4285f4'>Reference Links</font>\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "MNr3FTda-Hgl"
},
"source": [
"- [Spanner Graph Overview](https://cloud.google.com/products/spanner/graph?e=48754805&hl=en)\n",
"- [Codelab: Getting started with Spanner Graph](https://codelabs.developers.google.com/codelabs/spanner-graph-getting-started#5)\n",
"- [Querying Spanner Graph](https://cloud.google.com/spanner/docs/graph/queries-overview)\n",
"- [Spanner Full-text Search](https://cloud.google.com/spanner/docs/full-text-search)\n",
"- [Introducing BigQuery External Datasets for Spanner](https://cloud.google.com/blog/products/data-analytics/introducing-bigquery-external-datasets-for-spanner?e=48754805)\n",
"- [Spanner Data Boost](https://cloud.google.com/spanner/docs/databoost/databoost-overview)\n",
"- [Spanner vector search](https://cloud.google.com/blog/products/databases/how-spanner-vector-search-supports-generative-ai-apps?e=48754805)\n",
"- [Spanner ANN vector search](https://cloud.google.com/blog/products/databases/spanner-now-supports-approximate-nearest-neighbor-search?e=48754805)\n",
"- [Codelab: Getting started with Spanner Vector Search](https://codelabs.developers.google.com/codelabs/spanner-getting-started-vector-search#4)\n",
"- [Invoke Gemini from Spanner with ML.PREDICT()](https://cloud.google.com/spanner/docs/ml-tutorial)\n",
"- [Pyvis for graph visualization](https://pyvis.readthedocs.io/en/latest/)\n",
"- [Force Graph for graph visualization](https://www.npmjs.com/package/force-graph)\n",
"- [Spanner REST API Quickstart](https://cloud.google.com/spanner/docs/getting-started/rest)\n",
"- [Spanner JSON functions](https://cloud.google.com/spanner/docs/reference/standard-sql/json_functions)\n",
"- [Intro to BigQuery Views](https://cloud.google.com/bigquery/docs/views-intro)"
]
}
],
"metadata": {
"colab": {
"name": "Create-Spanner-Graph-Collaborative-Filtering.ipynb",
"private_outputs": true,
"provenance": [],
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}