colab-enterprise/Create-Campaign-Customer-Segmentation.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "SauhRkiN_Rq1"
},
"source": [
"## <font color='#4285f4'>Overview</font>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "FTcqjwUs-pGC"
},
"source": [
"This notebook focuses on customer segmentation. It creates JSON views to simplify querying existing segmentation data, and it generates, explores, and visualizes embeddings for the `customer_marketing_profile` table in BigQuery (as defined below), enabling advanced analytics and customer segmentation for Chocolate AI. The notebook shows how you can easily query complex segment data using JSON, and it demonstrates how to find matches for a net new segment using vector search in BigQuery.\n",
"\n",
"```sql\n",
"CREATE TABLE IF NOT EXISTS `${project_id}.${bigquery_chocolate_ai_dataset}.customer_marketing_profile`\n",
"(\n",
" customer_id INTEGER NOT NULL OPTIONS(description=\"Primary key.\"),\n",
" customer_profile_data JSON OPTIONS(description=\"The raw data we know about a customer.\"),\n",
" customer_profile_data_embedding ARRAY<FLOAT64> OPTIONS(description=\"Vector embedding of the customer_profile_data column.\"),\n",
" customer_loyalty_data JSON OPTIONS(description=\"Data about the customer's purchases and reviews in JSON format.\"),\n",
" customer_loyalty_data_embedding ARRAY<FLOAT64> OPTIONS(description=\"Vector embedding of the customer_loyalty_data column.\"),\n",
" customer_segmentation_data JSON OPTIONS(description=\"The generated customer segmentation data in JSON format.\"),\n",
" customer_segmentation_data_embedding ARRAY<FLOAT64> OPTIONS(description=\"Vector embedding of the customer_segmentation_data column.\"),\n",
" customer_marketing_insights STRING OPTIONS(description=\"Generated text summary of customer_profile_data, customer_loyalty_data, and customer_segmentation_data.\"),\n",
" customer_marketing_insights_embedding ARRAY<FLOAT64> OPTIONS(description=\"Vector embedding of the customer_marketing_insights column.\")\n",
")\n",
"CLUSTER BY customer_id;\n",
"```\n",
"\n",
"Process Flow:\n",
"1. Create JSON Views:\n",
" - Creates three views in BigQuery to flatten nested JSON data from the customer_marketing_profile table:\n",
" - customer_marketing_profile_data: Flattens customer_profile_data.\n",
" - customer_marketing_profile_loyalty: Flattens customer_loyalty_data.\n",
" - customer_marketing_profile_segments: Flattens customer_segmentation_data.\n",
" - Creates a combined view customer_360 that joins these flattened views with the main table to provide a unified view of customer information.\n",
"2. Explore JSON Segmentation Data:\n",
" - Provides example queries demonstrating how to analyze customer segments using the created views.\n",
" - These queries explore segments based on combinations like \"Loyalty Status\" and \"Generation,\" analyze demographics of engaged customers on Twitter, identify high-value customers based on spending, and segment customers based on loyalty and average order value.\n",
"3. Generate Embeddings:\n",
" - Create Embedding Model: Sets up a connection to Vertex AI and creates a BigQuery remote model (chocolate_ai.google-textembedding) for generating text embeddings.\n",
" - Generate Embeddings: Uses the ML.GENERATE_EMBEDDING function to create embeddings for each customer's customer_profile_data, customer_loyalty_data, customer_segmentation_data, and customer_marketing_insights. It uses different task types for different embedding columns based on the intended use case.\n",
" - Create Vector Index (Optional): Provides instructions for creating a vector index on customer_marketing_insights_embedding to improve the performance of vector search queries.\n",
"4. Explore Embeddings:\n",
" - Dynamic Segments with Vector Search: Demonstrates using vector search based on semantic similarity to find customers matching a given description (e.g., \"Young professional\").\n",
" - Visualize Customer Segments with Clustered Embeddings:\n",
" - Dimensionality Reduction: Applies t-SNE to reduce the dimensionality of the embeddings for visualization purposes.\n",
" - Visualization Function: Defines a function VisualizeEmbeddings to create scatter plots of customer segments based on their reduced-dimension embeddings.\n",
" - Explanation Function: Defines a function ExplainEmbeddings that uses Gemini to explain the key differences between customers in a given segment.\n",
" - Visualization and Explanation: Calls the VisualizeEmbeddings and ExplainEmbeddings functions to visualize and explain segments for selected categories like 'loyalty_status'.\n",
"\n",
"Author: Paul Ramsey"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "BKDYfyKvOuLT"
},
"source": [
"## <font color='#4285f4'>License</font>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "tfKslF-JBchh"
},
"source": [
"```\n",
"##################################################################################\n",
"# Copyright 2024 Google LLC\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License.\n",
"###################################################################################\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## <font color='#4285f4'>Video Overview</font>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[](https://www.youtube.com/watch?v=ElRiWDo5Vvg)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CJMI6O0PBrR0"
},
"source": [
"## <font color='#4285f4'>Initialize</font>\n",
"\n",
"Update the values of the variables below to match your environment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "5kEbeA3APzI7"
},
"outputs": [],
"source": [
"# Update these variables to match your environment\n",
"location=\"us-central1\" # Your region\n",
"bigquery_location = \"${bigquery_location}\" # Must be \"us\" or \"eu\"\n",
"\n",
"# Define vertex ai connection name\n",
"vertex_ai_connection_name = \"vertex-ai\" # Can be any valid name.\n",
"\n",
"### Do not change the values in this cell below this line ###\n",
"project_id = !(gcloud config get-value project)\n",
"user = !(gcloud auth list --filter=status:ACTIVE --format=\"value(account)\")\n",
"\n",
"if len(project_id) != 1:\n",
" raise RuntimeError(f\"project_id is not set: {project_id}\")\n",
"project_id = project_id[0]\n",
"\n",
"if len(user) != 1:\n",
" raise RuntimeError(f\"user is not set: {user}\")\n",
"user = user[0]\n",
"\n",
"print(f\"project_id = {project_id}\")\n",
"print(f\"user = {user}\")\n",
"print(f\"location = {location}\")\n",
"print(f\"bigquery_location = {bigquery_location}\")\n",
"print(f\"vertex_ai_connection_name = {vertex_ai_connection_name}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "-sE7DTmMB645"
},
"outputs": [],
"source": [
"import google.auth\n",
"import requests\n",
"import json\n",
"import time\n",
"from tenacity import retry, wait_exponential, stop_after_attempt, before_sleep_log, retry_if_exception\n",
"import logging\n",
"from IPython.display import HTML\n",
"import markdown\n",
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"from sklearn.datasets import fetch_20newsgroups\n",
"from sklearn.manifold import TSNE\n",
"from sklearn.cluster import KMeans\n",
"from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay\n",
"from scipy.spatial.distance import cdist\n",
"\n",
"from google.cloud import bigquery\n",
"client = bigquery.Client()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "dyU1rxRHB80C"
},
"source": [
"## <font color='#4285f4'>Helper Functions</font>\n",
"\n",
"Run the cells below to initialize helper functions which are utilized throughout the notebook."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "WSJPqZosHTh5"
},
"source": [
"#### RetryCondition()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "bocthCiLHSFJ"
},
"outputs": [],
"source": [
"def RetryCondition(error):\n",
" error_string = str(error)\n",
" print(error_string)\n",
"\n",
" retry_errors = [\n",
" \"RESOURCE_EXHAUSTED\",\n",
" \"No content in candidate\",\n",
" # Add more error messages here as needed\n",
" ]\n",
"\n",
" for retry_error in retry_errors:\n",
" if retry_error in error_string:\n",
" print(\"Retrying...\")\n",
" return True\n",
"\n",
" return False"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "yghz5AxB-nRZ"
},
"source": [
"#### GeminiLLM()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "JNdQHp7m-lKw"
},
"outputs": [],
"source": [
"@retry(wait=wait_exponential(multiplier=1, min=1, max=60), stop=stop_after_attempt(10), retry=retry_if_exception(RetryCondition), before_sleep=before_sleep_log(logging.getLogger(), logging.INFO))\n",
"def GeminiLLM(prompt, model = \"gemini-2.0-flash\", response_schema = None,\n",
" temperature = 1, topP = 1, topK = 32):\n",
"\n",
" # https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/inference#supported_models\n",
" # model = \"gemini-2.0-flash\"\n",
"\n",
" llm_response = None\n",
" if temperature < 0:\n",
" temperature = 0\n",
"\n",
" creds, project = google.auth.default()\n",
" auth_req = google.auth.transport.requests.Request() # required to acess access token\n",
" creds.refresh(auth_req)\n",
" access_token=creds.token\n",
"\n",
" headers = {\n",
" \"Content-Type\" : \"application/json\",\n",
" \"Authorization\" : \"Bearer \" + access_token\n",
" }\n",
"\n",
" # https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/inference\n",
" url = f\"https://{location}-aiplatform.googleapis.com/v1/projects/{project_id}/locations/{location}/publishers/google/models/{model}:generateContent\"\n",
"\n",
" generation_config = {\n",
" \"temperature\": temperature,\n",
" \"topP\": topP,\n",
" \"maxOutputTokens\": 8192,\n",
" \"candidateCount\": 1,\n",
" \"responseMimeType\": \"application/json\",\n",
" }\n",
"\n",
" # Add inthe response schema for when it is provided\n",
" if response_schema is not None:\n",
" generation_config[\"responseSchema\"] = response_schema\n",
"\n",
" if model == \"gemini-2.0-flash\":\n",
" generation_config[\"topK\"] = topK\n",
"\n",
" payload = {\n",
" \"contents\": {\n",
" \"role\": \"user\",\n",
" \"parts\": {\n",
" \"text\": prompt\n",
" },\n",
" },\n",
" \"generation_config\": {\n",
" **generation_config\n",
" },\n",
" \"safety_settings\": {\n",
" \"category\": \"HARM_CATEGORY_SEXUALLY_EXPLICIT\",\n",
" \"threshold\": \"BLOCK_LOW_AND_ABOVE\"\n",
" }\n",
" }\n",
"\n",
" response = requests.post(url, json=payload, headers=headers)\n",
"\n",
" if response.status_code == 200:\n",
" try:\n",
" json_response = json.loads(response.content)\n",
" except Exception as error:\n",
" raise RuntimeError(f\"An error occurred parsing the JSON: {error}\")\n",
"\n",
" if \"candidates\" in json_response:\n",
" candidates = json_response[\"candidates\"]\n",
" if len(candidates) > 0:\n",
" candidate = candidates[0]\n",
" if \"content\" in candidate:\n",
" content = candidate[\"content\"]\n",
" if \"parts\" in content:\n",
" parts = content[\"parts\"]\n",
" if len(parts):\n",
" part = parts[0]\n",
" if \"text\" in part:\n",
" text = part[\"text\"]\n",
" llm_response = text\n",
" else:\n",
" raise RuntimeError(\"No text in part: {response.content}\")\n",
" else:\n",
" raise RuntimeError(\"No parts in content: {response.content}\")\n",
" else:\n",
" raise RuntimeError(\"No parts in content: {response.content}\")\n",
" else:\n",
" raise RuntimeError(\"No content in candidate: {response.content}\")\n",
" else:\n",
" raise RuntimeError(\"No candidates: {response.content}\")\n",
" else:\n",
" raise RuntimeError(\"No candidates: {response.content}\")\n",
"\n",
" # Remove some typically response characters (if asking for a JSON reply)\n",
" llm_response = llm_response.replace(\"```json\",\"\")\n",
" llm_response = llm_response.replace(\"```\",\"\")\n",
" llm_response = llm_response.replace(\"\\n\",\"\")\n",
"\n",
" return llm_response\n",
"\n",
" else:\n",
" raise RuntimeError(f\"Error with prompt:'{prompt}' Status:'{response.status_code}' Text:'{response.text}'\")"
]
},
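{
"cell_type": "markdown",
"metadata": {},
"source": [
"`GeminiLLM()` accepts an optional `response_schema` to constrain the model's output via [Controlled Generation](https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/control-generated-output). Below is a minimal usage sketch; the schema and prompt are hypothetical and not part of this notebook's flow.\n",
"\n",
"```python\n",
"# Hypothetical schema: force Gemini to return a strict JSON object.\n",
"response_schema = {\n",
"    \"type\": \"OBJECT\",\n",
"    \"properties\": {\n",
"        \"segment_name\": {\"type\": \"STRING\"},\n",
"        \"rationale\": {\"type\": \"STRING\"}\n",
"    },\n",
"    \"required\": [\"segment_name\"]\n",
"}\n",
"\n",
"reply = GeminiLLM(\"Suggest a marketing segment name for customers who buy dark chocolate weekly.\",\n",
"                  response_schema=response_schema)\n",
"print(json.loads(reply))  # the helper returns the model's JSON reply as a string\n",
"```"
]
},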
{
"cell_type": "markdown",
"metadata": {
"id": "7Yxtui5OF5qf"
},
"source": [
"#### PrettyPrintJson()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "qjfO-y8HF5h7"
},
"outputs": [],
"source": [
"def PrettyPrintJson(json_string):\n",
" json_object = json.loads(json_string)\n",
" json_formatted_str = json.dumps(json_object, indent=2)\n",
" print(json_formatted_str)\n",
" return json.dumps(json_object)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ftgjePVS2wwr"
},
"source": [
"#### RunQuery()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "TNqB5oN62wpH"
},
"outputs": [],
"source": [
"def RunQuery(sql, job_config = None):\n",
" import time\n",
"\n",
" if (sql.startswith(\"SELECT\") or sql.startswith(\"WITH\")):\n",
" df_result = client.query(sql).to_dataframe()\n",
" return df_result\n",
" else:\n",
" if job_config == None:\n",
" job_config = bigquery.QueryJobConfig(priority=bigquery.QueryPriority.INTERACTIVE)\n",
" query_job = client.query(sql, job_config=job_config)\n",
"\n",
" # Check on the progress by getting the job's updated state.\n",
" query_job = client.get_job(\n",
" query_job.job_id, location=query_job.location\n",
" )\n",
" print(\"Job {} is currently in state {} with error result of {}\".format(query_job.job_id, query_job.state, query_job.error_result))\n",
"\n",
" while query_job.state != \"DONE\":\n",
" time.sleep(2)\n",
" query_job = client.get_job(\n",
" query_job.job_id, location=query_job.location\n",
" )\n",
" print(\"Job {} is currently in state {} with error result of {}\".format(query_job.job_id, query_job.state, query_job.error_result))\n",
"\n",
" if query_job.error_result == None:\n",
" return True\n",
" else:\n",
" return False"
]
},
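{
"cell_type": "markdown",
"metadata": {},
"source": [
"`RunQuery()` returns a pandas DataFrame for statements starting with `SELECT` or `WITH`, and polls DDL/DML jobs to completion, returning `True` on success. A quick usage sketch (the table name in the second statement is hypothetical):\n",
"\n",
"```python\n",
"# SELECT statements come back as a DataFrame\n",
"df = RunQuery(\"SELECT 1 AS one\")\n",
"print(df[\"one\"][0])\n",
"\n",
"# DDL/DML statements return True/False once the job finishes\n",
"ok = RunQuery(\"CREATE TABLE IF NOT EXISTS `my_dataset.scratch` (id INT64)\")  # hypothetical table\n",
"print(ok)\n",
"```"
]
},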
{
"cell_type": "markdown",
"metadata": {
"id": "EG21sovNgCIo"
},
"source": [
"#### DisplayMarkdown()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "sqPr2xXUgB-j"
},
"outputs": [],
"source": [
"def DisplayMarkdown(text):\n",
" \"\"\"\n",
" Displays text in markdown/HTML format in a Colab notebook.\n",
"\n",
" Args:\n",
" text: The text to display. Can be plain text or Markdown.\n",
" \"\"\"\n",
"\n",
" formatted_text = markdown.markdown(text) # Convert to HTML if necessary\n",
" display(HTML(formatted_text))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "aMbCLRJlL0xO"
},
"source": [
"#### createVertexAIConnection()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "9Ttl6L4FvsrU"
},
"outputs": [],
"source": [
"# Function to create Vertex AI connection\n",
"def createVertexAIConnection(params):\n",
" \"\"\"Creates a Vertex AI connection.\"\"\"\n",
"\n",
" # First find the connection\n",
" # https://cloud.google.com/bigquery/docs/reference/bigqueryconnection/rest/v1/projects.locations.connections/list\n",
" project_id = params[\"project_id\"]\n",
" bigquery_location = params[\"bigquery_location\"]\n",
" vertex_ai_connection_name = params[\"vertex_ai_connection_name\"]\n",
" url = f\"https://bigqueryconnection.googleapis.com/v1/projects/{project_id}/locations/{bigquery_location}/connections\"\n",
"\n",
" # Gather existing connections\n",
" json_result = restAPIHelper(url, \"GET\", None)\n",
" print(f\"createVertexAIConnection (GET) json_result: {json_result}\")\n",
"\n",
" # Test to see if connection exists, if so return\n",
" if \"connections\" in json_result:\n",
" for item in json_result[\"connections\"]:\n",
" print(f\"BigLake Connection: {item['name']}\")\n",
" # NOTE: We cannot test the complete name since it contains the project number and not id\n",
" if item[\"name\"].endswith(f\"/locations/{bigquery_location}/connections/{vertex_ai_connection_name}\"):\n",
" print(\"Connection already exists\")\n",
" serviceAccountId = item[\"cloudResource\"][\"serviceAccountId\"]\n",
" return serviceAccountId\n",
"\n",
" # Create the connection\n",
" # https://cloud.google.com/bigquery/docs/reference/bigqueryconnection/rest/v1/projects.locations.connections/create\n",
" print(\"Creating Vertex AI Connection\")\n",
"\n",
" url = f\"https://bigqueryconnection.googleapis.com/v1/projects/{project_id}/locations/{bigquery_location}/connections?connectionId={vertex_ai_connection_name}\"\n",
"\n",
" request_body = {\n",
" \"friendlyName\": \"notebook_connection\",\n",
" \"description\": \"Vertex AI Colab Notebooks Connection for Data Analytics Golden Demo\",\n",
" \"cloudResource\": {}\n",
" }\n",
"\n",
" json_result = restAPIHelper(url, \"POST\", request_body)\n",
"\n",
" serviceAccountId = json_result[\"cloudResource\"][\"serviceAccountId\"]\n",
" print(\"Vertex AI Connection created: \", serviceAccountId)\n",
" return serviceAccountId"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "JlzXXYNGL3M9"
},
"source": [
"#### setProjectLevelIamPolicy()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "RBskQRm9v-Rz"
},
"outputs": [],
"source": [
"# Helper function for IAM role bindings\n",
"def setProjectLevelIamPolicy(params, accountWithPrefix, role):\n",
" \"\"\"Sets the Project Level IAM policy.\"\"\"\n",
"\n",
" # Get the current bindings (if the account has access then skip)\n",
" # https://cloud.google.com/resource-manager/reference/rest/v1/projects/getIamPolicy\n",
" project_id = params[\"project_id\"]\n",
"\n",
" url = f\"https://cloudresourcemanager.googleapis.com/v1/projects/{project_id}:getIamPolicy\"\n",
"\n",
" request_body = { }\n",
" json_result = restAPIHelper(url, \"POST\", request_body)\n",
" print(f\"setProjectLevelIamPolicy (GET) json_result: {json_result}\")\n",
"\n",
" # Test to see if permissions exist\n",
" if \"bindings\" in json_result:\n",
" for item in json_result[\"bindings\"]:\n",
" if item[\"role\"] == role:\n",
" members = item[\"members\"]\n",
" for member in members:\n",
" if member == accountWithPrefix:\n",
" print(\"Permissions exist\")\n",
" return\n",
"\n",
" # Take the existing bindings and we need to append the new permission\n",
" # Otherwise we loose the existing permissions\n",
" if \"bindings\" in json_result:\n",
" bindings = json_result[\"bindings\"]\n",
" else:\n",
" bindings = []\n",
"\n",
" new_permission = {\n",
" \"role\": role,\n",
" \"members\": [ accountWithPrefix ]\n",
" }\n",
"\n",
" bindings.append(new_permission)\n",
"\n",
" # https://cloud.google.com/resource-manager/reference/rest/v1/projects/setIamPolicy\n",
" url = f\"https://cloudresourcemanager.googleapis.com/v1/projects/{project_id}:setIamPolicy\"\n",
"\n",
" request_body = { \"policy\" : {\n",
" \"bindings\" : bindings\n",
" }\n",
" }\n",
"\n",
" print(f\"Permission bindings: {bindings}\")\n",
"\n",
" json_result = restAPIHelper(url, \"POST\", request_body)\n",
" print()\n",
" print(f\"json_result: {json_result}\")\n",
" print()\n",
" print(f\"Project Level IAM Permissions set for {accountWithPrefix} {role}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "gOE4wpX1LYWV"
},
"source": [
"#### restAPIHelper()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "U5WwcRAyvkN7"
},
"outputs": [],
"source": [
"# Rest API Helper Function\n",
"def restAPIHelper(url: str, http_verb: str, request_body: str) -> str:\n",
" \"\"\"Calls the Google Cloud REST API passing in the current users credentials\"\"\"\n",
"\n",
" import requests\n",
" import google.auth\n",
" import json\n",
"\n",
" # Get an access token based upon the current user\n",
" creds, project = google.auth.default()\n",
" auth_req = google.auth.transport.requests.Request()\n",
" creds.refresh(auth_req)\n",
" access_token=creds.token\n",
"\n",
" headers = {\n",
" \"Content-Type\" : \"application/json\",\n",
" \"Authorization\" : \"Bearer \" + access_token\n",
" }\n",
"\n",
" if http_verb == \"GET\":\n",
" response = requests.get(url, headers=headers)\n",
" elif http_verb == \"POST\":\n",
" response = requests.post(url, json=request_body, headers=headers)\n",
" elif http_verb == \"PUT\":\n",
" response = requests.put(url, json=request_body, headers=headers)\n",
" elif http_verb == \"PATCH\":\n",
" response = requests.patch(url, json=request_body, headers=headers)\n",
" elif http_verb == \"DELETE\":\n",
" response = requests.delete(url, headers=headers)\n",
" else:\n",
" raise RuntimeError(f\"Unknown HTTP verb: {http_verb}\")\n",
"\n",
" if response.status_code == 200:\n",
" return json.loads(response.content)\n",
" #image_data = json.loads(response.content)[\"predictions\"][0][\"bytesBase64Encoded\"]\n",
" else:\n",
" error = f\"Error restAPIHelper -> ' Status: '{response.status_code}' Text: '{response.text}'\"\n",
" raise RuntimeError(error)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "_At09a6Fgyvh"
},
"source": [
"## <font color='#4285f4'>Create JSON Views</font>\n",
"\n",
"In the notebook `Synthetic-Data-Generation-Customers-Marketing-Profile`, we used functions to generate customer profile and segment data as JSON. Since we leveraged Gemini's [Controlled Generation](https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/control-generated-output#model_behavior_and_response_schema) feature, we can rely on a consistent schema in the JSON objects, which allows us to create views so that we can more easily query the JSON data."
]
},
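{
"cell_type": "markdown",
"metadata": {},
"source": [
"For context, here is what querying the raw JSON looks like without a view: every query has to repeat the JSON extraction functions, which is exactly the boilerplate the views below remove. A minimal sketch using the notebook's `RunQuery` helper (the selected fields mirror the view definition that follows):\n",
"\n",
"```python\n",
"# Extract two fields straight from the JSON column, without a view.\n",
"sql = \"\"\"SELECT customer_id,\n",
"       JSON_VALUE(customer_profile_data.occupation) AS occupation,\n",
"       CAST(JSON_VALUE(customer_profile_data.customer_age) AS INT64) AS customer_age\n",
"FROM `${project_id}.${bigquery_chocolate_ai_dataset}.customer_marketing_profile`\n",
"LIMIT 5\"\"\"\n",
"\n",
"RunQuery(sql).head()\n",
"```"
]
},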
{
"cell_type": "markdown",
"metadata": {
"id": "mFPlglmAorZT"
},
"source": [
"### Create profile data view: `customer_marketing_profile_data`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "hX6Uq1N4owmD"
},
"outputs": [],
"source": [
"%%bigquery\n",
"\n",
"CREATE OR REPLACE VIEW `${project_id}.${bigquery_chocolate_ai_dataset}.customer_marketing_profile_data` AS\n",
"SELECT\n",
" customer_id,\n",
" JSON_VALUE(customer_profile_data.children) AS children,\n",
" ARRAY_TO_STRING(JSON_VALUE_ARRAY(customer_profile_data.chocolate_preferences), \",\") AS chocolate_preferences,\n",
" ARRAY_TO_STRING(JSON_VALUE_ARRAY(customer_profile_data.content_interaction), \",\") AS content_interaction,\n",
" CAST(JSON_VALUE(customer_profile_data.customer_age) AS INT64) AS customer_age,\n",
" JSON_VALUE(customer_profile_data.education) AS education,\n",
" JSON_VALUE(customer_profile_data.facebook_bio) AS facebook_bio,\n",
" JSON_VALUE(customer_profile_data.facebook_engagement) AS facebook_engagement,\n",
" JSON_VALUE(customer_profile_data.facebook_handle) AS facebook_handle,\n",
" JSON_VALUE(customer_profile_data.instagram_bio) AS instagram_bio,\n",
" JSON_VALUE(customer_profile_data.instagram_engagement) AS instagram_engagement,\n",
" JSON_VALUE(customer_profile_data.instagram_handle) AS instagram_handle,\n",
" ARRAY_TO_STRING(JSON_VALUE_ARRAY(customer_profile_data.interests), \",\") AS interests,\n",
" ARRAY_TO_STRING(JSON_VALUE_ARRAY(customer_profile_data.lifestyle), \",\") AS lifestyle,\n",
" JSON_VALUE(customer_profile_data.linkedin_bio) AS linkedin_bio,\n",
" JSON_VALUE(customer_profile_data.linkedin_engagement) AS linkedin_engagement,\n",
" JSON_VALUE(customer_profile_data.linkedin_handle) AS linkedin_handle,\n",
" JSON_VALUE(customer_profile_data.martial_status) AS martial_status,\n",
" JSON_VALUE(customer_profile_data.occupation) AS occupation,\n",
" ARRAY_TO_STRING(JSON_VALUE_ARRAY(customer_profile_data.solicated_buying_habits), \",\") AS solicated_buying_habits,\n",
" ARRAY_TO_STRING(JSON_VALUE_ARRAY(customer_profile_data.sports), \",\") AS sports,\n",
" JSON_VALUE(customer_profile_data.tiktok_bio) AS tiktok_bio,\n",
" JSON_VALUE(customer_profile_data.tiktok_handle) AS tiktok_handle,\n",
" JSON_VALUE(customer_profile_data.twitter_bio) AS twitter_bio,\n",
" JSON_VALUE(customer_profile_data.twitter_engagement) AS twitter_engagement,\n",
" JSON_VALUE(customer_profile_data.twitter_handle) AS twitter_handle,\n",
" JSON_VALUE(customer_profile_data.youtube_bio) AS youtube_bio,\n",
" JSON_VALUE(customer_profile_data.youtube_handle) AS youtube_handle,\n",
" (\n",
" SELECT STRING_AGG(CONCAT(\n",
" 'contact_reason:', JSON_VALUE(interaction, '$.contact_reason'), '; ',\n",
" 'resolution_time:', JSON_VALUE(interaction, '$.resolution_time'), '; ',\n",
" 'sentiment_analysis:', JSON_VALUE(interaction, '$.sentiment_analysis')\n",
" ), ' | ')\n",
" FROM UNNEST(JSON_QUERY_ARRAY(customer_profile_data, '$.customer_service_interactions')) AS interaction\n",
" ) AS customer_service_interactions\n",
" FROM\n",
" `${project_id}.${bigquery_chocolate_ai_dataset}.customer_marketing_profile`;"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "VtayZ6YZpBdU"
},
"source": [
"### Create loyalty view: `customer_marketing_profile_loyalty`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "asG2WYy9pGKg"
},
"outputs": [],
"source": [
"%%bigquery\n",
"\n",
"CREATE OR REPLACE VIEW `${project_id}.${bigquery_chocolate_ai_dataset}.customer_marketing_profile_loyalty` AS\n",
"SELECT\n",
" customer_id,\n",
" CAST(JSON_VALUE(customer_loyalty_data.average_amount_spent_per_order) AS BIGNUMERIC) AS average_amount_spent_per_order,\n",
" CAST(JSON_VALUE(customer_loyalty_data.last_order_date) AS TIMESTAMP) AS last_order_date,\n",
" JSON_VALUE(customer_loyalty_data.latest_review_sentiment) AS latest_review_sentiment,\n",
" CAST(JSON_VALUE(customer_loyalty_data.most_frequent_purchase_location) AS INT64) AS most_frequent_purchase_location,\n",
" CAST(JSON_VALUE(customer_loyalty_data.negative_review_percentage) AS NUMERIC) AS negative_review_percentage,\n",
" CAST(JSON_VALUE(customer_loyalty_data.neutral_review_percentage) AS NUMERIC) AS neutral_review_percentage,\n",
" CAST(JSON_VALUE(customer_loyalty_data.positive_review_percentage) AS NUMERIC) AS positive_review_percentage,\n",
" ARRAY_TO_STRING(JSON_VALUE_ARRAY(customer_loyalty_data.purchase_locations), \",\") AS purchase_locations,\n",
" ARRAY_TO_STRING(JSON_VALUE_ARRAY(customer_loyalty_data.top_3_favorite_menu_items), \",\") AS top_3_favorite_menu_items,\n",
" CAST(JSON_VALUE(customer_loyalty_data.total_amount_spent) AS BIGNUMERIC) AS total_amount_spent,\n",
" CAST(JSON_VALUE(customer_loyalty_data.total_orders) AS INT64) AS total_orders,\n",
" CAST(JSON_VALUE(customer_loyalty_data.total_reviews) AS INT64) AS total_reviews\n",
" FROM\n",
" `${project_id}.${bigquery_chocolate_ai_dataset}.customer_marketing_profile`;"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5_mNz5MMoij2"
},
"source": [
"### Create segments view: `customer_marketing_profile_segments`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ORWTyTo4gykg"
},
"outputs": [],
"source": [
"%%bigquery\n",
"\n",
"CREATE OR REPLACE VIEW `${project_id}.${bigquery_chocolate_ai_dataset}.customer_marketing_profile_segments` AS\n",
"SELECT\n",
" customer_id,\n",
" REPLACE(JSON_VALUE(customer_segmentation_data.customer_segments.behavioral_segmentation.`Benefits Sought`), \" \", \"\") AS benefits_sought,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.behavioral_segmentation.`Browsing Behavior`) AS browsing_behavior,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.behavioral_segmentation.`Loyalty Status`) AS loyalty_status,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.behavioral_segmentation.`Occasion/Timing`) AS occasion_timing,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.behavioral_segmentation.`Purchase History`) AS purchase_history,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.behavioral_segmentation.`Spending Habits`) AS spending_habits,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.behavioral_segmentation.`Usage Frequency`) AS usage_frequency,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.behavioral_segmentation.`User Status`) AS user_status,\n",
" ARRAY_TO_STRING(JSON_VALUE_ARRAY(customer_segmentation_data.customer_segments.customer_lifecycle_segmentation.`At-Risk Customers`), \",\") AS at_risk_customers,\n",
" ARRAY_TO_STRING(JSON_VALUE_ARRAY(customer_segmentation_data.customer_segments.customer_lifecycle_segmentation.`First-Time Customers`), \",\") AS first_time_customers,\n",
" ARRAY_TO_STRING(JSON_VALUE_ARRAY(customer_segmentation_data.customer_segments.customer_lifecycle_segmentation.`Former Customers`), \",\") AS former_customers,\n",
" ARRAY_TO_STRING(JSON_VALUE_ARRAY(customer_segmentation_data.customer_segments.customer_lifecycle_segmentation.`Inactive Customers`), \",\") AS inactive_customers,\n",
" ARRAY_TO_STRING(JSON_VALUE_ARRAY(customer_segmentation_data.customer_segments.customer_lifecycle_segmentation.`Loyal Advocates`), \",\") AS loyal_advocates,\n",
" ARRAY_TO_STRING(JSON_VALUE_ARRAY(customer_segmentation_data.customer_segments.customer_lifecycle_segmentation.`New Leads`), \",\") AS new_leads,\n",
" ARRAY_TO_STRING(JSON_VALUE_ARRAY(customer_segmentation_data.customer_segments.customer_lifecycle_segmentation.`Potential Customers`), \",\") AS potential_customers,\n",
" ARRAY_TO_STRING(JSON_VALUE_ARRAY(customer_segmentation_data.customer_segments.customer_lifecycle_segmentation.`Repeat Customers`), \",\") AS repeat_customers,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.demographic_segmentation.`Age`) AS age,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.demographic_segmentation.`Education`) AS education,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.demographic_segmentation.`Ethnicity`) AS ethnicity,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.demographic_segmentation.`Family Size`) AS family_size,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.demographic_segmentation.`Gender`) AS gender,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.demographic_segmentation.`Generation`) AS generation,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.demographic_segmentation.`Income`) AS income,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.demographic_segmentation.`Language`) AS language,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.demographic_segmentation.`Marital Status`) AS marital_status,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.demographic_segmentation.`Occupation`) AS occupation,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.geographic_segmentation.`City`) AS city,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.geographic_segmentation.`Climate`) AS climate,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.geographic_segmentation.`Country`) AS country,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.geographic_segmentation.`Population Density`) AS population_density,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.geographic_segmentation.`Region`) AS region,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.geographic_segmentation.`Time Zone`) AS time_zone,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.geographic_segmentation.`Urban/Rural`) AS urban_rural,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.needs_based_segmentation.`Challenges`) AS challenges,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.needs_based_segmentation.`Goals`) AS goals,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.needs_based_segmentation.`Pain Points`) AS pain_points,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.needs_based_segmentation.`Priorities`) AS priorities,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.needs_based_segmentation.`Specific Needs`) AS specific_needs,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.psychographic_segmentation.`Attitudes`) AS attitudes,\n",
" ARRAY_TO_STRING(JSON_VALUE_ARRAY(customer_segmentation_data.customer_segments.psychographic_segmentation.`Hobbies`), \",\") AS hobbies,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.psychographic_segmentation.`Interests`) AS interests,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.psychographic_segmentation.`Lifestyle`) AS lifestyle,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.psychographic_segmentation.`Motivations`) AS motivations,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.psychographic_segmentation.`Personality`) AS personality,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.psychographic_segmentation.`Social Class`) AS social_class,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.psychographic_segmentation.`Values`) AS customer_values,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.technographic_segmentation.`Adoption Rate`) AS adoption_rate,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.technographic_segmentation.`Browsers`) AS browsers,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.technographic_segmentation.`Devices`) AS devices,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.technographic_segmentation.`Internet Connectivity`) AS internet_connectivity,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.technographic_segmentation.`Operating Systems`) AS operating_systems,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.technographic_segmentation.`Social Media Platforms`) AS social_media_platforms,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.technographic_segmentation.`Software`) AS software,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.technographic_segmentation.`Tech Savviness`) AS tech_savviness,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.value_based_segmentation.`Cost-Benefit Analysis`) AS cost_benefit_analysis,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.value_based_segmentation.`Perceived Value`) AS perceived_value,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.value_based_segmentation.`Price Sensitivity`) AS price_sensitivity,\n",
" JSON_VALUE(customer_segmentation_data.customer_segments.value_based_segmentation.`Willingness to Pay`) AS willingness_to_pay\n",
"FROM `${project_id}.${bigquery_chocolate_ai_dataset}.customer_marketing_profile`;\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "bKCQcdq1rfYx"
},
"source": [
"### Create combined view: `customer_360`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "mSb7hJrGrkqC"
},
"outputs": [],
"source": [
"%%bigquery\n",
"\n",
"CREATE OR REPLACE VIEW `${project_id}.${bigquery_chocolate_ai_dataset}.customer_360` AS\n",
"SELECT\n",
" mp.customer_id,\n",
" -- Customer Marketing Profile Summary\n",
" cmp.customer_marketing_insights,\n",
" -- Customer Marketing Profile Segments\n",
" mp.benefits_sought,\n",
" mp.browsing_behavior,\n",
" mp.loyalty_status,\n",
" mp.occasion_timing,\n",
" mp.purchase_history,\n",
" mp.spending_habits,\n",
" mp.usage_frequency,\n",
" mp.user_status,\n",
" mp.at_risk_customers,\n",
" mp.first_time_customers,\n",
" mp.former_customers,\n",
" mp.inactive_customers,\n",
" mp.loyal_advocates,\n",
" mp.new_leads,\n",
" mp.potential_customers,\n",
" mp.repeat_customers,\n",
" mp.age,\n",
" mp.education,\n",
" mp.ethnicity,\n",
" mp.family_size,\n",
" mp.gender,\n",
" mp.generation,\n",
" mp.income,\n",
" mp.language,\n",
" mp.marital_status,\n",
" mp.occupation,\n",
" mp.city,\n",
" mp.climate,\n",
" mp.country,\n",
" mp.population_density,\n",
" mp.region,\n",
" mp.time_zone,\n",
" mp.urban_rural,\n",
" mp.challenges,\n",
" mp.goals,\n",
" mp.pain_points,\n",
" mp.priorities,\n",
" mp.specific_needs,\n",
" mp.attitudes,\n",
" mp.hobbies,\n",
" mp.interests,\n",
" mp.lifestyle,\n",
" mp.motivations,\n",
" mp.personality,\n",
" mp.social_class,\n",
" mp.customer_values,\n",
" mp.adoption_rate,\n",
" mp.browsers,\n",
" mp.devices,\n",
" mp.internet_connectivity,\n",
" mp.operating_systems,\n",
" mp.social_media_platforms,\n",
" mp.software,\n",
" mp.tech_savviness,\n",
" mp.cost_benefit_analysis,\n",
" mp.perceived_value,\n",
" mp.price_sensitivity,\n",
" mp.willingness_to_pay,\n",
" -- Customer Profile\n",
" cp.children,\n",
" cp.chocolate_preferences,\n",
" cp.content_interaction,\n",
" cp.customer_age,\n",
" cp.facebook_bio,\n",
" cp.facebook_engagement,\n",
" cp.facebook_handle,\n",
" cp.instagram_bio,\n",
" cp.instagram_engagement,\n",
" cp.instagram_handle,\n",
" cp.linkedin_bio,\n",
" cp.linkedin_engagement,\n",
" cp.linkedin_handle,\n",
" cp.martial_status,\n",
" cp.solicated_buying_habits,\n",
" cp.sports,\n",
" cp.tiktok_bio,\n",
" cp.tiktok_handle,\n",
" cp.twitter_bio,\n",
" cp.twitter_engagement,\n",
" cp.twitter_handle,\n",
" cp.youtube_bio,\n",
" cp.youtube_handle,\n",
" cp.customer_service_interactions,\n",
" -- Customer Loyalty\n",
" cl.average_amount_spent_per_order,\n",
" cl.last_order_date,\n",
" cl.latest_review_sentiment,\n",
" cl.most_frequent_purchase_location,\n",
" cl.negative_review_percentage,\n",
" cl.neutral_review_percentage,\n",
" cl.positive_review_percentage,\n",
" cl.purchase_locations,\n",
" cl.top_3_favorite_menu_items,\n",
" cl.total_amount_spent,\n",
" cl.total_orders,\n",
" cl.total_reviews\n",
" FROM\n",
" `${project_id}.${bigquery_chocolate_ai_dataset}.customer_marketing_profile_segments` AS mp\n",
" INNER JOIN `${project_id}.${bigquery_chocolate_ai_dataset}.customer_marketing_profile_data` AS cp ON mp.customer_id = cp.customer_id\n",
" INNER JOIN `${project_id}.${bigquery_chocolate_ai_dataset}.customer_marketing_profile_loyalty` AS cl ON mp.customer_id = cl.customer_id\n",
" INNER JOIN `${project_id}.${bigquery_chocolate_ai_dataset}.customer_marketing_profile` AS cmp ON mp.customer_id = cmp.customer_id;"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "yEOAspI2WubR"
},
"source": [
"## <font color='#4285f4'>Explore JSON Segmentation Data</font>\n",
"\n",
"This section provides an example of how you can use the data generated and derived in this notebook to gain insights into your customers and prepare for new marketing campaigns."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "yQJpIdpJwEjh"
},
"source": [
"#### Explore customer segments"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "tqJiYUo-wEt5"
},
"outputs": [],
"source": [
"%%bigquery\n",
"\n",
"--This query identifies the most common \"Loyalty Status\" and \"Generation\"\n",
"--combinations among customers. This helps understand which generations we are\n",
"--effectively reaching, allowing for more targeted messaging\n",
"-- and product development.\n",
"\n",
"SELECT\n",
" loyalty_status,\n",
" generation,\n",
" COUNT(*) AS customer_count\n",
"FROM\n",
" `${project_id}.${bigquery_chocolate_ai_dataset}.customer_marketing_profile_segments`\n",
"GROUP BY\n",
" loyalty_status,\n",
" generation\n",
"ORDER BY\n",
" customer_count DESC;"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "iJlUGp0DwNaa"
},
"source": [
"#### Explore customer profiles"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "XK3K2TdYwNTD"
},
"outputs": [],
"source": [
"%%bigquery\n",
"\n",
"-- This query finds the average customer age for those who actively engage with\n",
"-- the brand on Twitter and have mentioned \"chocolate\" in their Twitter bio. This\n",
"-- helps understand the demographics of engaged customers on Twitter and tailor\n",
"-- content accordingly.\n",
"\n",
"SELECT\n",
" AVG(customer_age) AS average_age\n",
"FROM\n",
" `${project_id}.${bigquery_chocolate_ai_dataset}.customer_marketing_profile_data`\n",
"WHERE\n",
" twitter_engagement = 'Active' AND twitter_bio LIKE '%chocolate%';"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "O2vDvsS5wyLG"
},
"source": [
"#### Explore loyalty data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "WCrrYI3twxsJ"
},
"outputs": [],
"source": [
"%%bigquery\n",
"\n",
"-- This query calculates the average total amount spent by customers who have\n",
"-- placed more than 50 orders and have left at least one review. This helps\n",
"-- identify high-value customers and understand their spending patterns, which\n",
"-- can inform loyalty programs and targeted promotions.\n",
"\n",
"\n",
"SELECT\n",
" AVG(total_amount_spent) AS average_spending\n",
"FROM\n",
" `${project_id}.${bigquery_chocolate_ai_dataset}.customer_marketing_profile_loyalty`\n",
"WHERE\n",
" total_orders > 50 AND total_reviews > 0;"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "mfdX56khxmtf"
},
"source": [
"#### Explore customer 360 data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "2cK5plAwxnSk"
},
"outputs": [],
"source": [
"%%bigquery\n",
"\n",
"-- This query segments customers based on their \"Loyalty Status\" from the\n",
"-- customer_marketing_profile_segments view and their \"average_amount_spent_per_order\"\n",
"-- from the customer_loyalty view. This allows you to analyze the relationship\n",
"-- between loyalty status and spending habits, enabling you to tailor marketing\n",
"-- efforts and loyalty programs to different customer segments.\n",
"\n",
"SELECT\n",
" loyalty_status,\n",
" AVG(average_amount_spent_per_order) AS average_order_value,\n",
" COUNT(*) AS customer_count\n",
"FROM\n",
" `${project_id}.${bigquery_chocolate_ai_dataset}.customer_360`\n",
"GROUP BY\n",
" loyalty_status\n",
"ORDER BY\n",
" customer_count DESC;"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "B2VYgNTxuBXQ"
},
"source": [
"## <font color='#4285f4'>Generate embeddings</font>\n",
"\n",
"This section adds vector embeddings for each JSON and STRING column in the `customer_marketing_profile` table, which opens up new capabilities, including semantic search, retrieval augemented generation (RAG), and dynamic customer segmentation."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "eBL2lZ4YwOjl"
},
"source": [
"### Create Embedding Model endpoint"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "FuDL1T6mwW_H"
},
"outputs": [],
"source": [
"# NOTE: You can skip this step if you deployed via Terraform as the endpoint was\n",
"# already created for you. This cell is provided for reference to help you\n",
"# follow this pattern in your own environment.\n",
"\n",
"# Define params as dictionary\n",
"params = { \"project_id\" : project_id,\n",
" \"bigquery_location\" : bigquery_location,\n",
" \"user\" : user,\n",
" \"vertex_ai_connection_name\" : vertex_ai_connection_name\n",
" }\n",
"\n",
"# Create the BigQuery External Connection that will be used to call the Vertex AI\n",
"# Set the required permissions on the external connection's service principal\n",
"vertexAIServiceAccountId = createVertexAIConnection(params)\n",
"params[\"vertexAIServiceAccountId\"] = vertexAIServiceAccountId\n",
"\n",
"# To call Vision API\n",
"setProjectLevelIamPolicy(params, f\"serviceAccount:{vertexAIServiceAccountId}\", \"roles/serviceusage.serviceUsageConsumer\")\n",
"\n",
"# To call GENERATE TEXT\n",
"setProjectLevelIamPolicy(params, f\"serviceAccount:{vertexAIServiceAccountId}\",\"roles/aiplatform.user\")\n",
"\n",
"# Create the text embedding model\n",
"# Working:\n",
"sql = f\"\"\"CREATE MODEL IF NOT EXISTS `${project_id}.${bigquery_chocolate_ai_dataset}.textembedding_model`\n",
"REMOTE WITH CONNECTION `{project_id}.{bigquery_location}.{vertex_ai_connection_name}`\n",
"OPTIONS (endpoint = 'text-embedding-005');\"\"\"\n",
"\n",
"RunQuery(sql)\n"
]
},
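{
"cell_type": "markdown",
"metadata": {},
"source": [
"To sanity-check the remote model, you can embed a single string and inspect the vector length; a minimal sketch (the sample text is arbitrary):\n",
"\n",
"```python\n",
"# Smoke test: generate one embedding and confirm its dimensionality.\n",
"sql = \"\"\"SELECT ml_generate_embedding_result\n",
"FROM ML.GENERATE_EMBEDDING(\n",
"  MODEL `${project_id}.${bigquery_chocolate_ai_dataset}.textembedding_model`,\n",
"  (SELECT 'hello chocolate' AS content),\n",
"  STRUCT(TRUE AS flatten_json_output, 768 AS output_dimensionality))\"\"\"\n",
"\n",
"df = RunQuery(sql)\n",
"print(len(df[\"ml_generate_embedding_result\"][0]))  # expect 768\n",
"```"
]
},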
{
"cell_type": "markdown",
"metadata": {
"id": "zKTmxl4Z4uln"
},
"source": [
"### Generate embeddings"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-JpDtDzBlFnz"
},
"source": [
"> NOTE: The demo data set already has embeddings created for you, so you can just read through this step without executing the cells if desired.\n",
"\n",
"Google's latest embedding models now support task types, which increase the accuracy of vector search results based on the nature of the underlying data and your use case.\n",
"\n",
"The new embeddings models support the following task types:\n",
" \n",
"|Task type|Embeddings optimization criteria|\n",
"|:-|:-|\n",
"|`SEMANTIC_SIMILARITY`|Semantic similarity. Use this task type when retrieving similar texts from the corpus.|\n",
"|`RETRIEVAL_QUERY`|Document search and information retrieval. Use `RETRIEVAL_QUERY` for query texts, and `RETRIEVAL_DOCUMENT` for documents to be retrieved.|\n",
"|`QUESTION_ANSWERING`|Questions and answers applications such as RAG. Use `QUESTION_ANSWERING` for question texts, and `RETRIEVAL_DOCUMENT` for documents to be retrieved.|\n",
"|`FACT_VERIFICATION`|Document search for fact verification. Use `FACT_VERIFICATION` for the target text, and `RETRIEVAL_DOCUMENT` for documents to be retrieved.|\n",
"|`CODE_RETRIEVAL_QUERY`|Code search. Use `CODE_RETRIEVAL_QUERY` for query text, and `RETRIEVAL_DOCUMENT` for code blocks to be retrieved (available on embedding model `text-embedding-preview-0815` and later)|\n",
"|`CLASSIFICATION`|Text classification. Use this task type for training a small classification model with the embedding.|\n",
"|`CLUSTERING`|Text clustering. Use this task type for k-means or other clustering analysis.|\n",
"\n",
"For example, if you are building a RAG system for a question and answering use case, you may specify task type `RETRIEVAL_DOCUMENT` for generating embeddings for building with vector search, and specify `QUESTION_ANSWERING` for embeddings for question texts. Thus you should see improved search quality compared to using `SEMANTIC_SIMILARITY` for both query and document. Likewise, you may use `RETRIEVAL_QUERY` for queries for document search, and `FACT_VERIFICATION` for queries for finding documents for fact checking.\n",
"\n",
"Embeddings with task type `CLASSIFICATION` are useful for classifying texts with its semantics for use cases such as customer and product segmentation.\n",
"\n",
"For our use case, we'll use the `SEMANTIC_SIMILARITY` task type for the `customer_marketing_insights_embedding` so that we can search for target customers using natural language, and we'll use the `CLUSTERING` task type for the other three embedding columns so that we can group customers together based on their similarity to each other."
]
},
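{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the RAG pairing above concrete, here is a hedged sketch of the two calls side by side; the `my_dataset.docs` table and its `doc_text` column are hypothetical:\n",
"\n",
"```python\n",
"# Documents to be retrieved are embedded with RETRIEVAL_DOCUMENT...\n",
"doc_sql = \"\"\"SELECT content, ml_generate_embedding_result\n",
"FROM ML.GENERATE_EMBEDDING(\n",
"  MODEL `${project_id}.${bigquery_chocolate_ai_dataset}.textembedding_model`,\n",
"  (SELECT doc_text AS content FROM `my_dataset.docs`),  -- hypothetical table\n",
"  STRUCT(TRUE AS flatten_json_output, 'RETRIEVAL_DOCUMENT' AS task_type))\"\"\"\n",
"\n",
"# ...while the user's question is embedded with QUESTION_ANSWERING.\n",
"question_sql = \"\"\"SELECT ml_generate_embedding_result\n",
"FROM ML.GENERATE_EMBEDDING(\n",
"  MODEL `${project_id}.${bigquery_chocolate_ai_dataset}.textembedding_model`,\n",
"  (SELECT 'Which chocolates pair well with espresso?' AS content),\n",
"  STRUCT(TRUE AS flatten_json_output, 'QUESTION_ANSWERING' AS task_type))\"\"\"\n",
"```"
]
},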
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "I3NRgPBK4LvC"
},
"outputs": [],
"source": [
"# Populate customer_loyalty_data_embedding column with embeddings\n",
"# Checking for nulls after update and rerunning if nulls are found to work around quota constraints\n",
"\n",
"# Define columns to update and task type to use for embeddings\n",
"# Embeddings will be stored in a column named <column name>_embedding\n",
"columns_to_embed = [\n",
" ['customer_profile_data','CLUSTERING'],\n",
" ['customer_loyalty_data', 'CLUSTERING'],\n",
" ['customer_segmentation_data','CLUSTERING'],\n",
" ['customer_marketing_insights','SEMANTIC_SIMILARITY']\n",
" ]\n",
"\n",
"for column in columns_to_embed:\n",
" print(f\"Populating column {column[0]}_embedding with embeddings using task type {column[1]}\")\n",
"\n",
" sleep_time_seconds = 5\n",
"\n",
" sql = f\"\"\"SELECT COUNT(*) AS null_count FROM `${project_id}.${bigquery_chocolate_ai_dataset}.customer_marketing_profile`\n",
" WHERE ARRAY_LENGTH({column[0]}_embedding) = 0\"\"\"\n",
"\n",
" result = RunQuery(sql)\n",
" null_count = result.iloc[0,0]\n",
" limit = null_count\n",
"\n",
" while null_count > 0:\n",
" last_null_count = null_count\n",
"\n",
" sql = f\"\"\"UPDATE `${project_id}.${bigquery_chocolate_ai_dataset}.customer_marketing_profile` AS t1\n",
" SET {column[0]}_embedding = t2.ml_generate_embedding_result\n",
" FROM ML.GENERATE_EMBEDDING(\n",
" MODEL `${project_id}.${bigquery_chocolate_ai_dataset}.textembedding_model`,\n",
" (SELECT customer_id, TO_JSON_STRING({column[0]}, false) as content\n",
" FROM `${project_id}.${bigquery_chocolate_ai_dataset}.customer_marketing_profile`\n",
" WHERE ARRAY_LENGTH({column[0]}_embedding) = 0\n",
" LIMIT {str(limit)}),\n",
" STRUCT(\n",
" TRUE AS flatten_json_output,\n",
" '{column[1]}' as task_type,\n",
" 768 AS output_dimensionality\n",
" )\n",
" ) AS t2\n",
" WHERE t1.customer_id = t2.customer_id;\"\"\"\n",
"\n",
" result = RunQuery(sql)\n",
"\n",
" sql = f\"\"\"SELECT COUNT(*) AS null_count FROM `${project_id}.${bigquery_chocolate_ai_dataset}.customer_marketing_profile`\n",
" WHERE ARRAY_LENGTH({column[0]}_embedding) = 0\"\"\"\n",
"\n",
" result = RunQuery(sql)\n",
" null_count = result.iloc[0,0]\n",
"\n",
" if null_count > 0:\n",
" print(f\"Found null count: {null_count}. Waiting {sleep_time_seconds} seconds before retry.\")\n",
" time.sleep(sleep_time_seconds)\n",
"\n",
" if null_count >= last_null_count:\n",
" # Increase sleep time and decrease limit if we're not making progress\n",
" sleep_time_seconds = sleep_time_seconds * 1.5 if sleep_time_seconds < 30 else 30\n",
" limit = int(limit / 1.5) if limit > 10 else 10\n",
"\n",
" print(f\"Retrying with with limit set to {limit}.\")\n",
"\n",
" print(f\"Done populating column {column[0]}_embedding with embeddings.\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "oStrtlsnDnvY"
},
"source": [
"### Optional: Create vector index\n",
"\n",
"If you have a table with at least 5000 rows, you can create a vector index to perform approximate nearest neighbor (ANN) search, which is much more performant on large data sets. Tables with less than 5000 rows must use exact nearest neighbor (ENN) search.\n",
"\n",
"This example uses the [IVF algorithm](https://cloud.google.com/bigquery/docs/vector-index#ivf-index). IVF is an inverted file index, which uses a k-means algorithm to cluster the vector data, and then partitions the vector data based on those clusters. When you use the VECTOR_SEARCH function to search the vector data, it can use these partitions to reduce the amount of data it needs to read in order to determine a result.\n",
"\n",
"For large batch search use cases, BigQuery also supports the [TreeAH algorithm](https://cloud.google.com/bigquery/docs/vector-index#tree-ah-index) for vector indexing. TreeAH is a type of vector index that uses Google's ScaNN algorithm. It works as follows:\n",
"- The base table is divided into smaller, more manageable shards.\n",
"- A clustering model is trained, with the number of clusters derived from the leaf_node_embedding_count option in tree_ah_options.\n",
"- The vectors are product quantized and stored in the index tables.\n",
"- During VECTOR_SEARCH, a candidate list for each query vector is efficiently computed using asymmetric hashing, which is hardware-optimized for approximate distance calculations. These candidates are then re-scored and re-ranked using exact embeddings.\n",
"\n",
"The TreeAH algorithm is optimized for batch queries that process hundreds or more query vectors.\n",
"\n",
"BigQuery limits you to one vector index per table, so we have chosen to index the `customer_marketing_insights_embedding` column. You could choose to index one of the other columns in this table instead using the syntax below if desired.\n",
"\n",
"```sql\n",
"CREATE OR REPLACE VECTOR INDEX customer_segmentation_data_embedding_ivf\n",
"ON `${project_id}.${bigquery_chocolate_ai_dataset}.customer_marketing_profile`(customer_segmentation_data_embedding)\n",
"OPTIONS (index_type = 'IVF', distance_type = 'COSINE');\n",
"\n",
"CREATE OR REPLACE VECTOR INDEX customer_loyalty_data_embedding_ivf\n",
"ON `${project_id}.${bigquery_chocolate_ai_dataset}.customer_marketing_profile`(customer_loyalty_data_embedding)\n",
"OPTIONS (index_type = 'IVF', distance_type = 'COSINE');\n",
"\n",
"CREATE OR REPLACE VECTOR INDEX customer_profile_data_embedding_ivf\n",
"ON `${project_id}.${bigquery_chocolate_ai_dataset}.customer_marketing_profile`(customer_profile_data_embedding)\n",
"OPTIONS (index_type = 'IVF', distance_type = 'COSINE');\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "a5OtJsftDsd5"
},
"outputs": [],
"source": [
"%%bigquery\n",
"\n",
"-- Requires minimum of 5000 rows to create ANN index\n",
"DROP VECTOR INDEX IF EXISTS customer_marketing_insights_embedding_ivf\n",
"ON `${project_id}.${bigquery_chocolate_ai_dataset}.customer_marketing_profile`;\n",
"\n",
"CREATE VECTOR INDEX customer_marketing_insights_embedding_ivf\n",
"ON `${project_id}.${bigquery_chocolate_ai_dataset}.customer_marketing_profile`(customer_marketing_insights_embedding)\n",
"OPTIONS (index_type = 'IVF', distance_type = 'COSINE');\n"
]
},
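{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you later need the batch-oriented TreeAH index described above, the statement might look like the sketch below. Because BigQuery allows only one vector index per table, you would drop the IVF index first; the `leaf_node_embedding_count` value is illustrative, so check the TreeAH documentation linked above for supported options.\n",
"\n",
"```python\n",
"# Hedged sketch: replace the IVF index with a TreeAH index for batch search.\n",
"# The leaf_node_embedding_count below is an illustrative value, not a recommendation.\n",
"sql = \"\"\"CREATE OR REPLACE VECTOR INDEX customer_marketing_insights_embedding_tree_ah\n",
"ON `${project_id}.${bigquery_chocolate_ai_dataset}.customer_marketing_profile`(customer_marketing_insights_embedding)\n",
"OPTIONS (\n",
"  index_type = 'TREE_AH',\n",
"  tree_ah_options = '{\"leaf_node_embedding_count\": 1000}')\"\"\"\n",
"\n",
"RunQuery(sql)\n",
"```"
]
},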
{
"cell_type": "markdown",
"metadata": {
"id": "7lu5KKlIRI1_"
},
"source": [
"## <font color='#4285f4'>Explore Embeddings</font>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "FRD2qnjQ5p54"
},
"source": [
"#### Get Dynamic Segments with Vector Search on Semantic Similarity Embeddings\n",
"\n",
"The query below uses vector search to return a list of the top 5 (top_k) matches based on semantic similarity to the search phrase \"Young professional\". This provides additional flexibility to build custom segments of customers that don't align with the pre-defined segments in the `customer_segmentation_data` column. You can use a similar query to search based on vector similarity for the other embedding columns we generated earlier as well."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "N1ossi5e8Pmq"
},
"outputs": [],
"source": [
"%%bigquery\n",
"\n",
"-- Ref: https://cloud.google.com/bigquery/docs/vector-search\n",
"SELECT\n",
" distance,\n",
" base.customer_id,\n",
" base.customer_marketing_insights,\n",
" base.customer_profile_data,\n",
" base.customer_loyalty_data,\n",
" base.customer_segmentation_data\n",
"FROM VECTOR_SEARCH(\n",
" -- base table or subquery\n",
" (\n",
" SELECT * FROM `${project_id}.${bigquery_chocolate_ai_dataset}.customer_marketing_profile`\n",
" ),\n",
"\n",
" -- embedding column to search in base table - must be of type ARRAY<FLOAT64>\n",
" 'customer_marketing_insights_embedding',\n",
"\n",
" -- query table or subquery - this is where you generate the search embedding\n",
" (\n",
" SELECT ml_generate_embedding_result, content AS query\n",
" FROM ML.GENERATE_EMBEDDING(\n",
" MODEL `${project_id}.${bigquery_chocolate_ai_dataset}.textembedding_model`,\n",
" (\n",
" -- Search term\n",
" SELECT \"Young professional\" AS content\n",
" ),\n",
" STRUCT(\n",
" TRUE AS flatten_json_output,\n",
" 'SEMANTIC_SIMILARITY' as task_type,\n",
" 768 AS output_dimensionality\n",
" )\n",
" )\n",
" ),\n",
" top_k => 5,\n",
" distance_type => 'COSINE'\n",
");\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "y47gnjTRfbES"
},
"source": [
"### Visualize Customer Segments with Clustered Embeddings"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0OeZ4IT2htTO"
},
"source": [
"#### Perform Dimensionality Reduction"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zSwUQzDogzNv"
},
"source": [
"Our embeddings contain 768 dimensions, which is difficult to visualize. So, we can use the t-Distributed Stochastic Neighbor Embedding (t-SNE) approach to perform dimensionality reduction, enabling us to visualize the embeddings and spot patterns in our customer base.\n",
"\n",
"The t-SNE technique reduces the number of dimensions, while preserving clusters (points that are close together stay close together). For the original data, the model tries to construct a distribution over which other data points are \"neighbors\" (e.g., they share a similar meaning). It then optimizes an objective function to keep a similar distribution in the visualization."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "pijcvaRr6GlH"
},
"outputs": [],
"source": [
"# Get embeddings and customer_360 data in a dataframe\n",
"\n",
"sql = \"\"\"SELECT cmp.customer_id,\n",
" cmp.customer_segmentation_data_embedding AS Embeddings,\n",
" c360.*\n",
" FROM `${project_id}.${bigquery_chocolate_ai_dataset}.customer_marketing_profile` cmp\n",
" JOIN `${project_id}.${bigquery_chocolate_ai_dataset}.customer_360` c360\n",
" ON c360.customer_id = cmp.customer_id\"\"\"\n",
"\n",
"result = RunQuery(sql)\n",
"result.head()\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Gz24I7nGUjje"
},
"outputs": [],
"source": [
"# Show count of current dimensions in embeddings\n",
"print(f\"Number of dimensions per embedding: {len(result['Embeddings'][0])}\")\n",
"\n",
"# Convert result['Embeddings'] Pandas series to a np.array of float32\n",
"X = np.array(result['Embeddings'].to_list(), dtype=np.float32)\n",
"print(f\"Shape of converted np.array: {X.shape}\")\n",
"\n",
"# Apply t-SNE\n",
"print(\"Performing dimensionality reduction via t-SNE...\")\n",
"tsne = TSNE(random_state=0, max_iter=1000)\n",
"tsne_results = tsne.fit_transform(X)\n",
"\n",
"# **Create df_tsne with customer_id**\n",
"df_tsne = pd.DataFrame({'customer_id': result['customer_id'],\n",
" 'TSNE1': tsne_results[:, 0],\n",
" 'TSNE2': tsne_results[:, 1]})"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "UL8RCDyNh52q"
},
"source": [
"#### Define Visualization and Explanation Functinos"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "x-lATLh6RItm"
},
"outputs": [],
"source": [
"# Visualize clustered embeddings\n",
"def VisualizeEmbeddings(df, segment):\n",
" fig, ax = plt.subplots(figsize=(8,6))\n",
" sns.set_style('darkgrid', {\"grid.color\": \".6\", \"grid.linestyle\": \":\"})\n",
"\n",
" merged_df = pd.merge(df_tsne, result[['customer_id', segment]], on='customer_id')\n",
"\n",
" sns.scatterplot(data=merged_df, x='TSNE1', y='TSNE2', hue=segment, palette='hls')\n",
"\n",
" plt.title(f'Scatter plot of \"{segment}\" segment using t-SNE');\n",
" plt.xlabel('TSNE1');\n",
" plt.ylabel('TSNE2');\n",
" plt.axis('equal')\n",
"\n",
" # Calculate centroids\n",
" centroids = merged_df.groupby(segment).agg({'TSNE1': 'mean', 'TSNE2': 'mean'})\n",
"\n",
" # Label centroids with nearest customer_id and store for legend\n",
" centroid_labels = {}\n",
" centroid_data = {} # Dictionary to store segment and customer_id pairs\n",
" for segment_name, centroid in centroids.iterrows():\n",
" distances = cdist([centroid], merged_df[merged_df[segment] == segment_name][['TSNE1', 'TSNE2']])\n",
" nearest_index = distances.argmin()\n",
" nearest_customer_id = merged_df[merged_df[segment] == segment_name]['customer_id'].iloc[nearest_index]\n",
"\n",
" centroid_labels[segment_name] = f\"{segment_name} (Cust. {nearest_customer_id})\"\n",
" centroid_data[segment_name] = nearest_customer_id # Store only customer_id\n",
"\n",
" ax.annotate(nearest_customer_id, (centroid['TSNE1'], centroid['TSNE2']),\n",
" textcoords=\"offset points\", xytext=(5,5), ha='center', fontsize=8)\n",
"\n",
" # Update legend with centroid labels\n",
" handles, labels = ax.get_legend_handles_labels()\n",
" for i, label in enumerate(labels):\n",
" if label in centroid_labels:\n",
" labels[i] = centroid_labels[label]\n",
" ax.legend(handles, labels, title=segment, loc=\"upper left\", bbox_to_anchor=(1, 1))\n",
"\n",
" plt.show()\n",
" plt.close()\n",
"\n",
" return centroid_data\n",
"\n",
"def ExplainEmbeddings(segment_centroid_pair):\n",
" # Define in_string for query\n",
" in_string = ''\n",
" for k, v in segment_centroid_pair.items():\n",
" in_string += f\"{v},\"\n",
" in_string = f\"({in_string[:-1]})\"\n",
"\n",
" # Get raw segment data from BigQuery for centroids\n",
" sql = f\"\"\"SELECT customer_id, customer_segmentation_data FROM `${project_id}.${bigquery_chocolate_ai_dataset}.customer_marketing_profile`\n",
" WHERE customer_id IN {in_string}\"\"\"\n",
" segment_result = RunQuery(sql)\n",
"\n",
" # Define prompt\n",
" prompt = f\"\"\"The following customers have been selected as representative of a unique customer segment based on a segment category called {segment}:\n",
" {['Customer ' + str(y) + ' is representative of segment \"' + str(x) + '\"' for x, y in segment_centroid_pair.items()]}\n",
"\n",
" Here are is the segmentation data for each customer identified above:\n",
" {['Segment data for customer_id ' + str(row['customer_id']) + ': ' + str(row['customer_segmentation_data']) for index, row in segment_result.iterrows()]}\n",
"\n",
" Explain in summary format 1/ the key differences between each of these customers that is related to segment category \"{segment}\", 2/ the key differences between each of these customers that is NOT related to segment category \"{segment}\", and 3/ the key SIMILARITIES between these customers that is NOT related to segment category \"{segment}\". Think step by step and explain your reasoning.\n",
" \"\"\"\n",
"\n",
" response_schema = {\n",
" \"type\": \"object\",\n",
" \"required\": [\n",
" \"explanation\"\n",
" ],\n",
" \"properties\": {\n",
" \"explanation\": {\n",
" \"type\": \"string\"\n",
" }\n",
" }\n",
" }\n",
"\n",
" # Have Gemini explain the segmentation:\n",
" response = GeminiLLM(prompt, response_schema = response_schema)\n",
" return response\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9D8FvTtZiANi"
},
"source": [
"#### Visualize Embedding-based Segments"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "mkimRR6-hfEc"
},
"outputs": [],
"source": [
"segments_to_visualize = [\n",
" 'age',\n",
" 'family_size',\n",
" 'gender',\n",
" 'generation',\n",
" 'marital_status',\n",
" 'children',\n",
" 'martial_status',\n",
" 'loyalty_status',\n",
" 'spending_habits',\n",
" 'usage_frequency',\n",
"]\n",
"\n",
"for segment in segments_to_visualize:\n",
" segment_centroid_pair = VisualizeEmbeddings(df_tsne, segment)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Create Novel Segments from Relevant Centroids"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Get nearest neighbors to a relevant centroid to find unexpected similar customers\n",
"# Like 6673, the centroid for \"At-risk\" customers. This allows us to target both\n",
"# \"At-risk\" customers, and those who are similar to them but might not be identified yet\n",
"\n",
"%%bigquery\n",
"\n",
"-- Ref: https://cloud.google.com/bigquery/docs/vector-search\n",
"SELECT\n",
" distance,\n",
" base.customer_id,\n",
" base.customer_marketing_insights,\n",
" base.customer_profile_data,\n",
" base.customer_loyalty_data,\n",
" base.customer_segmentation_data\n",
"FROM VECTOR_SEARCH(\n",
" -- base table or subquery\n",
" (\n",
" SELECT * FROM `${project_id}.${bigquery_chocolate_ai_dataset}.customer_marketing_profile`\n",
" ),\n",
"\n",
" -- embedding column to search in base table - must be of type ARRAY<FLOAT64>\n",
" 'customer_marketing_insights_embedding',\n",
"\n",
" -- query table or subquery - this is where you generate the search embedding\n",
" (\n",
"\n",
" SELECT customer_marketing_insights_embedding \n",
" FROM `${project_id}.${bigquery_chocolate_ai_dataset}.customer_marketing_profile`\n",
" WHERE customer_id = 6673\n",
" ),\n",
" top_k => 250,\n",
" distance_type => 'COSINE'\n",
");"
]
},
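{
"cell_type": "markdown",
"metadata": {},
"source": [
"To persist this expanded audience as a reusable segment, you could wrap the same vector search in a `CREATE TABLE ... AS SELECT` statement. A minimal sketch (the table name is illustrative):\n",
"\n",
"```sql\n",
"CREATE OR REPLACE TABLE `${project_id}.${bigquery_chocolate_ai_dataset}.segment_at_risk_lookalikes` AS\n",
"SELECT base.customer_id, distance\n",
"FROM VECTOR_SEARCH(\n",
"  (SELECT * FROM `${project_id}.${bigquery_chocolate_ai_dataset}.customer_marketing_profile`),\n",
"  'customer_marketing_insights_embedding',\n",
"  (\n",
"    SELECT customer_marketing_insights_embedding\n",
"    FROM `${project_id}.${bigquery_chocolate_ai_dataset}.customer_marketing_profile`\n",
"    WHERE customer_id = 6673\n",
"  ),\n",
"  top_k => 250,\n",
"  distance_type => 'COSINE'\n",
");\n",
"```"
]
},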
{
"cell_type": "markdown",
"metadata": {
"id": "ZQAq9RfqiCum"
},
"source": [
"#### Visualize and Explain Embedding-based Segments"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "CIX7xIjvek4B"
},
"outputs": [],
"source": [
"# It can be hard to spot the commonalities and differences between embedding-based\n",
"# segments, so we can ask Gemini to inspect the centroids and explain them.\n",
"\n",
"segments_to_visualize_and_explain = [\n",
" 'loyalty_status'\n",
"]\n",
"\n",
"response = ''\n",
"for segment in segments_to_visualize_and_explain:\n",
" segment_centroid_pair = VisualizeEmbeddings(df_tsne, segment)\n",
" response = json.loads(ExplainEmbeddings(segment_centroid_pair))\n",
" DisplayMarkdown(response['explanation'])\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## <font color='#4285f4'>Reference Links</font>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- [BigQuery JSON Functions](https://cloud.google.com/bigquery/docs/reference/standard-sql/json_functions)\n",
"- [Working with JSON Data in BigQuery](https://cloud.google.com/bigquery/docs/json-data)\n",
"- [Generate Embeddings in BigQuery with ML.GENERATE_EMBEDDING](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-generate-embedding)\n",
"- [BigQuery Vector Indexes](https://cloud.google.com/bigquery/docs/vector-index)\n",
"- [BigQuery VECTOR_SEARCH() Function](https://cloud.google.com/bigquery/docs/reference/standard-sql/search_functions#vector_search)\n",
"- [Visualizing Embeddings with t-SNE](https://ai.google.dev/gemini-api/tutorials/clustering_with_embeddings)\n",
"- [BigQuery Views](https://cloud.google.com/bigquery/docs/views-intro)"
]
}
],
"metadata": {
"colab": {
"name": "Create-Campaign-Customer-Segmentation.ipynb",
"private_outputs": true,
"provenance": [],
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}