data-analytics-demos/data-engineering-agents/Data-Engineering-Agents-Demo.ipynb (916 lines of code) (raw):
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"name": "Next-25-Demo",
"collapsed_sections": [
"meXOXbrf-ckY",
"KqmvoasiwGaI",
"C0TAaNe-ubYZ",
"RI-8BnmHvTv7",
"HAK-uM7pxSyN",
"Eqiq1DH4yIYl",
"XxHBvI9ay8Lj",
"qqzYGndw1iQI",
"_DuECn4V1pso",
"jQXtpnJy1s3C",
"YzaVokWA3lnX",
"7FWkRhKi4ZOw",
"0U3o-wtCGVe-",
"3Jedm-hAGYNj",
"8R0_XE1qGbub",
"cV6VAF12rNnJ"
]
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"source": [
"### <img src=\"https://storage.googleapis.com/data-analytics-golden-demo/colab-diagrams/BigQuery-32-color.svg\" width=\"30\" valign=\"top\" alt=\"BigQuery\"> <img src=\"https://storage.googleapis.com/data-analytics-golden-demo/colab-diagrams/CloudSQL-32-color.svg\" width=\"30\" valign=\"top\" alt=\"AlloyDB\"> <img src=\"https://storage.googleapis.com/data-analytics-golden-demo/colab-diagrams/VertexAI-32-color.svg\" width=\"30\" valign=\"top\" alt=\"VertexAI\"> Architecture"
],
"metadata": {
"id": "meXOXbrf-ckY"
}
},
{
"cell_type": "code",
"source": [
"# Architecture Diagram\n",
"from IPython.display import Image\n",
"Image(url='https://storage.googleapis.com/data-analytics-golden-demo/colab-diagrams/Architecture-Data-Engineering-Agents.png', width=1000)"
],
"metadata": {
"id": "_fE-da9E-c96"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### <img src=\"https://storage.googleapis.com/data-analytics-golden-demo/colab-diagrams/BigQuery-32-color.svg\" width=\"30\" valign=\"top\" alt=\"BigQuery\"> View Raw Unprocessed Competitor Pricing data"
],
"metadata": {
"id": "KqmvoasiwGaI"
}
},
{
"cell_type": "code",
"source": [
"%%bigquery\n",
"----------------------------------------------------------------------------------------------------------------\n",
"-- Malformed data, Hotel name is not proper case, phone has various formats\n",
"----------------------------------------------------------------------------------------------------------------\n",
"SELECT * FROM `data-connect-demo2.raw_data.competitor_pricing`;"
],
"metadata": {
"id": "8Bm6o7-ZwGBh"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### <img src=\"https://storage.googleapis.com/data-analytics-golden-demo/colab-diagrams/BigQuery-32-color.svg\" width=\"30\" valign=\"top\" alt=\"BigQuery\"> Data Engineering Agents"
],
"metadata": {
"id": "C0TAaNe-ubYZ"
}
},
{
"cell_type": "code",
"source": [
"%%bigquery\n",
"----------------------------------------------------------------------------------------------------------------\n",
"-- Clean up, drop existing table in case we want schema changes\n",
"----------------------------------------------------------------------------------------------------------------\n",
"DROP TABLE IF EXISTS data-connect-demo2.cleaned_data.pricing_1;"
],
"metadata": {
"id": "tTTOSwXWuf8R"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"**Prompts**\n",
"1. Source and Destinations:\n",
" - I have a table competitor_pricing in dataset raw_data. I want to make the hotel name, room type and bed type columns to be lowercase and the first letter of each word uppercase. I want to save the data to a table named pricing_1 in dataset cleaned_data.\n",
"\n",
"2. Cleaning phone number\n",
" - I want the phone number field to only contain numbers.\n",
"\n",
"3. Cleaning dates\n",
" - In the date field replace forward slashes with a dash.\n",
"\n",
"4. Cleaning data type\n",
" - I want the room_size_sqft and price columns to be numeric data types in the destination table. Also make the date field a DATE datatype."
],
"metadata": {
"id": "PaYOSFspuhsy"
}
},
{
"cell_type": "code",
"source": [
"%%bigquery\n",
"----------------------------------------------------------------------------------------------------------------\n",
"-- Create a pricing table with a primary key\n",
"----------------------------------------------------------------------------------------------------------------\n",
"CREATE TABLE IF NOT EXISTS `data-connect-demo2.cleaned_data.pricing` AS\n",
"SELECT ROW_NUMBER() OVER (PARTITION BY 1) AS pricing_id, *\n",
" FROM `data-connect-demo2.cleaned_data.pricing_1`;"
],
"metadata": {
"id": "ZrB76B8NB3pS"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"%%bigquery\n",
"----------------------------------------------------------------------------------------------------------------\n",
"-- Show the imported / cleaned data from the Data Engineering Agent\n",
"----------------------------------------------------------------------------------------------------------------\n",
"SELECT *\n",
" FROM data-connect-demo2.cleaned_data.pricing;"
],
"metadata": {
"id": "15Xq4qR9y16q"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### <img src=\"https://storage.googleapis.com/data-analytics-golden-demo/colab-diagrams/BigQuery-32-color.svg\" width=\"30\" valign=\"top\" alt=\"BigQuery\"> Gemini and Text Embeddings"
],
"metadata": {
"id": "RI-8BnmHvTv7"
}
},
{
"cell_type": "code",
"source": [
"%%bigquery\n",
"----------------------------------------------------------------------------------------------------------------\n",
"-- Create our Gemini and Text Embedding models\n",
"----------------------------------------------------------------------------------------------------------------\n",
"-- Create our GenAI and Vector Embeddings models\n",
"CREATE MODEL IF NOT EXISTS `data-connect-demo2.cleaned_data.gemini_2_0_flash`\n",
" REMOTE WITH CONNECTION `data-connect-demo2.us-central1.vertex-ai`\n",
" OPTIONS (endpoint = 'gemini-2.0-flash');\n",
"\n",
"CREATE MODEL IF NOT EXISTS `data-connect-demo2.cleaned_data.text_embedding_005`\n",
" REMOTE WITH CONNECTION `data-connect-demo2.us-central1.vertex-ai`\n",
" OPTIONS (endpoint = 'text-embedding-005');"
],
"metadata": {
"id": "ICnWbCN-vTeu"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"We now want to create embeddings to match semantically similar items:\n",
"\n",
"- Balcony/Terrace/Veranda/Sundeck/Porch: (These are all open-air platforms attached to the room)\n",
"- Mini-fridge/Refrigerator/Fridge: (Different names for the same appliance)\n",
"- Working Desk/Writing Table/Desk: (All the same thing - a surface for working)\n",
"- Down Pillows and Duvets: (Specific type of bedding)\n"
],
"metadata": {
"id": "28XTsdxMw_y_"
}
},
{
"cell_type": "code",
"source": [
"%%bigquery\n",
"----------------------------------------------------------------------------------------------------------------\n",
"-- Create embeddings using text-embedding-005 on each room feature (we split it by the pipe)\n",
"----------------------------------------------------------------------------------------------------------------\n",
"CREATE TABLE IF NOT EXISTS `data-connect-demo2.cleaned_data.pricing_embeddings` AS\n",
"WITH split_room_features AS\n",
"(\n",
" SELECT pricing_id, SPLIT(LOWER(room_features), '|') AS room_features_array\n",
" FROM `cleaned_data.pricing`\n",
"),\n",
"room_features AS\n",
"(\n",
" SELECT pricing_id, room_feature\n",
" FROM split_room_features\n",
" JOIN UNNEST(room_features_array) AS room_feature\n",
")\n",
"SELECT pricing_id,\n",
" room_feature,\n",
" ml_generate_embedding_result AS vector_embedding\n",
" FROM ML.GENERATE_EMBEDDING(MODEL `data-connect-demo2.cleaned_data.text_embedding_005`,\n",
" (SELECT pricing_id, room_feature, room_feature AS content FROM room_features),\n",
" STRUCT(TRUE AS flatten_json_output,\n",
" 'SEMANTIC_SIMILARITY' as task_type,\n",
" 768 AS output_dimensionality));"
],
"metadata": {
"id": "OFGZJ1T4wrEH"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### <img src=\"https://storage.googleapis.com/data-analytics-golden-demo/colab-diagrams/BigQuery-32-color.svg\" width=\"30\" valign=\"top\" alt=\"BigQuery\"> Search Vector Embeddings"
],
"metadata": {
"id": "HAK-uM7pxSyN"
}
},
{
"cell_type": "code",
"source": [
"%%bigquery\n",
"----------------------------------------------------------------------------------------------------------------\n",
"-- Search: balcony\n",
"----------------------------------------------------------------------------------------------------------------\n",
"SELECT DISTINCT query.query AS search_string,\n",
" base.room_feature,\n",
" distance\n",
" FROM VECTOR_SEARCH(TABLE `data-connect-demo2.cleaned_data.pricing_embeddings`,\n",
" 'vector_embedding', -- column in table to search\n",
" (SELECT ml_generate_embedding_result,\n",
" content AS query\n",
" FROM ML.GENERATE_EMBEDDING(MODEL `data-connect-demo2.cleaned_data.text_embedding_005`,\n",
" (SELECT 'balcony' AS content),\n",
" STRUCT(TRUE AS flatten_json_output,\n",
" 'SEMANTIC_SIMILARITY' as task_type,\n",
" 768 AS output_dimensionality) -- struct\n",
" )),\n",
" top_k => 100)\n",
" WHERE distance < .66\n",
"ORDER BY distance;"
],
"metadata": {
"id": "bFlS5L6fxTJj"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"%%bigquery\n",
"----------------------------------------------------------------------------------------------------------------\n",
"-- Search: hair drying\n",
"----------------------------------------------------------------------------------------------------------------\n",
"SELECT DISTINCT query.query AS search_string,\n",
" base.room_feature,\n",
" distance\n",
" FROM VECTOR_SEARCH(TABLE `data-connect-demo2.cleaned_data.pricing_embeddings`,\n",
" 'vector_embedding', -- column in table to search\n",
" (SELECT ml_generate_embedding_result,\n",
" content AS query\n",
" FROM ML.GENERATE_EMBEDDING(MODEL `data-connect-demo2.cleaned_data.text_embedding_005`,\n",
" (SELECT 'hair drying' AS content),\n",
" STRUCT(TRUE AS flatten_json_output,\n",
" 'SEMANTIC_SIMILARITY' as task_type,\n",
" 768 AS output_dimensionality) -- struct\n",
" )),\n",
" top_k => 100)\n",
" WHERE distance < .66\n",
"ORDER BY distance;"
],
"metadata": {
"id": "1kIiPJvTx7o7"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### <img src=\"https://storage.googleapis.com/data-analytics-golden-demo/colab-diagrams/BigQuery-32-color.svg\" width=\"30\" valign=\"top\" alt=\"BigQuery\"> Compute Competitor Pricing"
],
"metadata": {
"id": "Eqiq1DH4yIYl"
}
},
{
"cell_type": "code",
"source": [
"%%bigquery\n",
"----------------------------------------------------------------------------------------------------------------\n",
"-- Find all rooms, using vector embeddings, with and without a \"Balcony\"\n",
"----------------------------------------------------------------------------------------------------------------\n",
"SELECT 'Balcony' AS amenity,\n",
" CAST(AVG(pricing.room_size_sqft) AS INT64) AS average_sq_ft,\n",
" CAST(AVG(pricing.price) AS INT64) AS average_price\n",
" FROM VECTOR_SEARCH(TABLE `data-connect-demo2.cleaned_data.pricing_embeddings`,\n",
" 'vector_embedding', -- column in table to search\n",
" (SELECT ml_generate_embedding_result,\n",
" content AS query\n",
" FROM ML.GENERATE_EMBEDDING(MODEL `data-connect-demo2.cleaned_data.text_embedding_005`,\n",
" (SELECT 'balcony' AS content),\n",
" STRUCT(TRUE AS flatten_json_output,\n",
" 'SEMANTIC_SIMILARITY' as task_type,\n",
" 768 AS output_dimensionality) -- struct\n",
" )),\n",
" top_k => 100) AS vector_table\n",
" INNER JOIN `data-connect-demo2.cleaned_data.pricing` AS pricing\n",
" ON vector_table.base.pricing_id = pricing.pricing_id\n",
" WHERE distance < .66\n",
" UNION ALL\n",
" SELECT 'No Balcony' AS amenity,\n",
" CAST(AVG(pricing.room_size_sqft) AS INT64) AS average_sq_ft,\n",
" CAST(AVG(pricing.price) AS INT64) AS average_price\n",
" FROM VECTOR_SEARCH(TABLE `data-connect-demo2.cleaned_data.pricing_embeddings`,\n",
" 'vector_embedding', -- column in table to search\n",
" (SELECT ml_generate_embedding_result,\n",
" content AS query\n",
" FROM ML.GENERATE_EMBEDDING(MODEL `data-connect-demo2.cleaned_data.text_embedding_005`,\n",
" (SELECT 'balcony' AS content),\n",
" STRUCT(TRUE AS flatten_json_output,\n",
" 'SEMANTIC_SIMILARITY' as task_type,\n",
" 768 AS output_dimensionality) -- struct\n",
" )),\n",
" top_k => 200) AS vector_table\n",
" INNER JOIN `data-connect-demo2.cleaned_data.pricing` AS pricing\n",
" ON vector_table.base.pricing_id = pricing.pricing_id\n",
" WHERE distance >= .66;"
],
"metadata": {
"id": "k_zqTm03yKtQ"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### <img src=\"https://storage.googleapis.com/data-analytics-golden-demo/colab-diagrams/CloudSQL-32-color.svg\" width=\"30\" valign=\"top\" alt=\"AlloyDB\"> AlloyDB: Vector Embeddings"
],
"metadata": {
"id": "XxHBvI9ay8Lj"
}
},
{
"cell_type": "markdown",
"source": [
"##### Open AlloyDB"
],
"metadata": {
"id": "qqzYGndw1iQI"
}
},
{
"cell_type": "markdown",
"source": [
"- https://console.cloud.google.com/alloydb/locations/us-central1/clusters/alloy-db-us-central1/studio?project=data-connect-demo2\n",
"- Login\n",
" - Database: postgres\n",
" - User: postgres\n",
" - Password: next25"
],
"metadata": {
"id": "oBDtNUPL1NbL"
}
},
{
"cell_type": "markdown",
"source": [
"##### Run this in AlloyDB: Query 1"
],
"metadata": {
"id": "_DuECn4V1pso"
}
},
{
"cell_type": "markdown",
"source": [
"```\n",
"----------------------------------------------------------------------------------------------------------------\n",
"-- The competition uses \"Terrace\" and we use Balcony\n",
"----------------------------------------------------------------------------------------------------------------\n",
"SELECT DISTINCT room_feature,\n",
" vector_embedding::vector <-> embedding('text-embedding-005', 'Terrace')::vector AS distance\n",
" FROM room_amenity\n",
" ORDER BY distance\n",
" LIMIT 100;\n",
"```"
],
"metadata": {
"id": "ESSfLvqB0qKr"
}
},
{
"cell_type": "markdown",
"source": [
"##### Run this in AlloyDB: Query 2"
],
"metadata": {
"id": "jQXtpnJy1s3C"
}
},
{
"cell_type": "markdown",
"source": [
"```\n",
"----------------------------------------------------------------------------------------------------------------\n",
"-- Compute our average square feet and price for rooms with and without a Balcony\n",
"----------------------------------------------------------------------------------------------------------------\n",
"SELECT 'Balcony' AS amenity,\n",
" CAST(AVG(room.square_feet) AS INT) AS average_sq_ft,\n",
" CAST(AVG(room_rate.room_rate) AS INT) AS average_price\n",
" FROM room\n",
" INNER JOIN room_amenity\n",
" ON room.room_id= room_amenity.room_id\n",
" AND room_amenity.room_feature = 'Balcony'\n",
" INNER JOIN room_rate\n",
" ON room_rate.room_id = room.room_id\n",
"UNION ALL\n",
"SELECT 'No Balcony' AS amenity,\n",
" CAST(AVG(room.square_feet) AS INT) AS average_sq_ft,\n",
" CAST(AVG(room_rate.room_rate) AS INT) AS average_price\n",
" FROM room\n",
" INNER JOIN room_rate\n",
" ON room_rate.room_id = room.room_id\n",
" AND room.room_id NOT IN (SELECT room_id FROM room_amenity WHERE room_feature = 'Balcony');\n",
"```"
],
"metadata": {
"id": "yNSmL-po0qk_"
}
},
{
"cell_type": "markdown",
"source": [
"### <img src=\"https://storage.googleapis.com/data-analytics-golden-demo/colab-diagrams/BigQuery-32-color.svg\" width=\"30\" valign=\"top\" alt=\"BigQuery\"> BigQuery / AlloyDB Federated Query"
],
"metadata": {
"id": "YzaVokWA3lnX"
}
},
{
"cell_type": "code",
"source": [
"%%bigquery\n",
"----------------------------------------------------------------------------------------------------------------\n",
"-- Let's compare prices between our competitor pricing (BigQuery) and our transactional data (AlloyDB)\n",
"----------------------------------------------------------------------------------------------------------------\n",
"SELECT * FROM EXTERNAL_QUERY\n",
"(\n",
"'us-central1.alloydb-transaction-data',\n",
"'''\n",
"-- Compute our average square feet and price for rooms with and without a Balcony\n",
"SELECT 'AlloyDB' AS source_system,\n",
" 'Balcony' AS amenity,\n",
" CAST(AVG(room.square_feet) AS INT) AS average_sq_ft,\n",
" CAST(AVG(room_rate.room_rate) AS INT) AS average_price\n",
" FROM room\n",
" INNER JOIN room_amenity\n",
" ON room.room_id= room_amenity.room_id\n",
" AND room_amenity.room_feature = 'Balcony'\n",
" INNER JOIN room_rate\n",
" ON room_rate.room_id = room.room_id\n",
"UNION ALL\n",
"SELECT 'AlloyDB' AS source_system,\n",
" 'No Balcony' AS amenity,\n",
" CAST(AVG(room.square_feet) AS INT) AS average_sq_ft,\n",
" CAST(AVG(room_rate.room_rate) AS INT) AS average_price\n",
" FROM room\n",
" INNER JOIN room_rate\n",
" ON room_rate.room_id = room.room_id\n",
" AND room.room_id NOT IN (SELECT room_id FROM room_amenity WHERE room_feature = 'Balcony');\n",
"'''\n",
") AS alloydb_data\n",
"UNION ALL\n",
"SELECT 'BigQuery' AS source_system,\n",
" 'Balcony' AS amenity,\n",
" CAST(AVG(pricing.room_size_sqft) AS INT64) AS average_sq_ft,\n",
" CAST(AVG(pricing.price) AS INT64) AS average_price\n",
" FROM VECTOR_SEARCH(TABLE `data-connect-demo2.cleaned_data.pricing_embeddings`,\n",
" 'vector_embedding', -- column in table to search\n",
" (SELECT ml_generate_embedding_result,\n",
" content AS query\n",
" FROM ML.GENERATE_EMBEDDING(MODEL `data-connect-demo2.cleaned_data.text_embedding_005`,\n",
" (SELECT 'balcony' AS content),\n",
" STRUCT(TRUE AS flatten_json_output,\n",
" 'SEMANTIC_SIMILARITY' as task_type,\n",
" 768 AS output_dimensionality) -- struct\n",
" )),\n",
" top_k => 100) AS vector_table\n",
" INNER JOIN `data-connect-demo2.cleaned_data.pricing` AS pricing\n",
" ON vector_table.base.pricing_id = pricing.pricing_id\n",
" WHERE distance < .66\n",
" UNION ALL\n",
" SELECT 'BigQuery' AS source_system,\n",
" 'No Balcony' AS amenity,\n",
" CAST(AVG(pricing.room_size_sqft) AS INT64) AS average_sq_ft,\n",
" CAST(AVG(pricing.price) AS INT64) AS average_price\n",
" FROM VECTOR_SEARCH(TABLE `data-connect-demo2.cleaned_data.pricing_embeddings`,\n",
" 'vector_embedding', -- column in table to search\n",
" (SELECT ml_generate_embedding_result,\n",
" content AS query\n",
" FROM ML.GENERATE_EMBEDDING(MODEL `data-connect-demo2.cleaned_data.text_embedding_005`,\n",
" (SELECT 'balcony' AS content),\n",
" STRUCT(TRUE AS flatten_json_output,\n",
" 'SEMANTIC_SIMILARITY' as task_type,\n",
" 768 AS output_dimensionality) -- struct\n",
" )),\n",
" top_k => 200) AS vector_table\n",
" INNER JOIN `data-connect-demo2.cleaned_data.pricing` AS pricing\n",
" ON vector_table.base.pricing_id = pricing.pricing_id\n",
" WHERE distance >= .66\n",
" ORDER BY source_system, amenity;"
],
"metadata": {
"id": "tn8N-le23tRB"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### <img src=\"https://storage.googleapis.com/data-analytics-golden-demo/colab-diagrams/BigQuery-32-color.svg\" width=\"30\" valign=\"top\" alt=\"BigQuery\"> BigQuery / <img src=\"https://storage.googleapis.com/data-analytics-golden-demo/colab-diagrams/VertexAI-32-color.svg\" width=\"30\" valign=\"top\" alt=\"VertexAI\"> Vertex AI - Gemini"
],
"metadata": {
"id": "7FWkRhKi4ZOw"
}
},
{
"cell_type": "markdown",
"source": [
"##### PIP install (only need to run once)"
],
"metadata": {
"id": "0U3o-wtCGVe-"
}
},
{
"cell_type": "code",
"source": [
"# PIP Installs\n",
"import sys\n",
"#!{sys.executable} -m pip install plotly"
],
"metadata": {
"id": "04z3OQOn4Z0L"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"##### Visualize the data"
],
"metadata": {
"id": "3Jedm-hAGYNj"
}
},
{
"cell_type": "code",
"source": [
"# Generate a chart with our data\n",
"import pandas as pd\n",
"import plotly.graph_objects as go\n",
"from plotly.subplots import make_subplots\n",
"\n",
"# 1. Recreate the DataFrame (from the image)\n",
"data = {\n",
" 'source_system': ['AlloyDB', 'AlloyDB', 'BigQuery', 'BigQuery'],\n",
" 'amenity': ['Balcony', 'No Balcony', 'Balcony', 'No Balcony'],\n",
" 'average_sq_ft': [461, 420, 458, 412],\n",
" 'average_price': [286, 257, 308, 257]\n",
"}\n",
"df = pd.DataFrame(data)\n",
"\n",
"# 2. Separate data for AlloyDB and BigQuery\n",
"alloydb_data = df[df['source_system'] == 'AlloyDB']\n",
"bigquery_data = df[df['source_system'] == 'BigQuery']\n",
"\n",
"# 3. Create the Plotly Figure\n",
"fig = make_subplots(specs=[[{\"secondary_y\": False}]])\n",
"\n",
"# Add traces for AlloyDB (Our Hotel)\n",
"fig.add_trace(\n",
" go.Bar(\n",
" x=alloydb_data['amenity'],\n",
" y=alloydb_data['average_price'],\n",
" name='Our Hotel (AlloyDB)',\n",
" marker_color='rgb(53, 106, 228)' # Choose your colors\n",
" ),\n",
" secondary_y=False,\n",
")\n",
"\n",
"# Add traces for BigQuery (Competitor)\n",
"fig.add_trace(\n",
" go.Bar(\n",
" x=bigquery_data['amenity'],\n",
" y=bigquery_data['average_price'],\n",
" name='Competitor Pricing (BigQuery)',\n",
" marker_color='rgb(55, 83, 109)' # Choose your colors\n",
" ),\n",
" secondary_y=False,\n",
")\n",
"\n",
"# 4. Add the Annotation (Arrow and Text)\n",
"fig.add_annotation(\n",
" x='Balcony', # x-coordinate of the arrow's end (AlloyDB Balcony bar)\n",
" y=295, # y-coordinate of the arrow's end (slightly above AlloyDB bar)\n",
" ax='Balcony', # x-coordinate of the arrow's start (same as x for a vertical arrow)\n",
" ay=350, # y-coordinate of the arrow's start (higher up)\n",
" xref='x',\n",
" yref='y',\n",
" axref='x',\n",
" ayref='y',\n",
" text='We are undercharging!', # Annotation text\n",
" showarrow=True,\n",
" arrowhead=2, # Style of the arrowhead\n",
" arrowsize=1,\n",
" arrowwidth=2,\n",
" arrowcolor=\"red\",\n",
" font=dict(\n",
" family=\"Courier New, monospace\",\n",
" size=16,\n",
" color=\"red\" # Text color\n",
" )\n",
")\n",
"\n",
"# 5. Update Layout (Titles, Axis Labels, etc.)\n",
"fig.update_layout(\n",
" title_text='Average Hotel Room Price Comparison',\n",
" xaxis_title='Amenity',\n",
" yaxis_title='Average Price',\n",
" barmode='group', # Group the bars side-by-side\n",
" xaxis_tickangle=-45, # Rotate x-axis labels if needed\n",
" template='plotly_white' # Use a clean white template\n",
"\n",
")\n",
"\n",
"fig.show()"
],
"metadata": {
"id": "kuc9ZwLh5bnX"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"##### Use Gemini with Retrieval Augmented Generation (RAG)"
],
"metadata": {
"id": "8R0_XE1qGbub"
}
},
{
"cell_type": "code",
"source": [
"%%bigquery\n",
"----------------------------------------------------------------------------------------------------------------\n",
"-- Now do a RAG using Gemini to process our data for:\n",
"-- I work at a hotel and just imported competitor pricing into BigQuery.\n",
"-- I reviewed the data for rooms that have a balcony and matched these to the rooms we have from our AlloyDB system.\n",
"-- I need to generate some insights to report to management.\n",
"-- Can you generate 2 to 3 sentences as to what we should recommend to management.\n",
"-- Explain your reasoning in your response.\n",
"-- <context>\n",
"-- {\"source_system\":\"AlloyDB\",\"amenity\":\"Balcony\",\"average_sq_ft\":461,\"average_price\":286}\n",
"-- \t{\"source_system\":\"AlloyDB\",\"amenity\":\"No Balcony\",\"average_sq_ft\":420,\"average_price\":257}\n",
"-- \t{\"source_system\":\"BigQuery\",\"amenity\":\"Balcony\",\"average_sq_ft\":458,\"average_price\":308}\n",
"-- \t{\"source_system\":\"BigQuery\",\"amenity\":\"No Balcony\",\"average_sq_ft\":412,\"average_price\":257}\n",
"-- </context>\n",
"----------------------------------------------------------------------------------------------------------------\n",
"WITH llm_prompt AS\n",
"(\n",
" SELECT \"\"\"\n",
" I work at a hotel and just imported competitor pricing into BigQuery.\n",
" I reviewed the data for rooms that have a balcony and matched these to the rooms we have from our AlloyDB system.\n",
" I need to generate some insights to report up to management.\n",
" Can you generate 2 to 3 sentenaces as to how what we should recommend to management.\n",
" Explain your reasoning in your response.\n",
" <context>\n",
" REPLACE-ME-WITH-EMBEDDING-SEARCH-RESULTS\n",
" </context>\"\"\" AS prompt\n",
")\n",
", federated_query AS\n",
"(\n",
" SELECT source_system,amenity,average_sq_ft,average_price\n",
" FROM EXTERNAL_QUERY('us-central1.alloydb-transaction-data',\n",
" '''\n",
" -- Compute our average square feet and price for rooms with and without a Balcony\n",
" SELECT 'AlloyDB' AS source_system,\n",
" 'Balcony' AS amenity,\n",
" CAST(AVG(room.square_feet) AS INT) AS average_sq_ft,\n",
" CAST(AVG(room_rate.room_rate) AS INT) AS average_price\n",
" FROM room\n",
" INNER JOIN room_amenity\n",
" ON room.room_id= room_amenity.room_id\n",
" AND room_amenity.room_feature = 'Balcony'\n",
" INNER JOIN room_rate\n",
" ON room_rate.room_id = room.room_id\n",
" UNION ALL\n",
" SELECT 'AlloyDB' AS source_system,\n",
" 'No Balcony' AS amenity,\n",
" CAST(AVG(room.square_feet) AS INT) AS average_sq_ft,\n",
" CAST(AVG(room_rate.room_rate) AS INT) AS average_price\n",
" FROM room\n",
" INNER JOIN room_rate\n",
" ON room_rate.room_id = room.room_id\n",
" AND room.room_id NOT IN (SELECT room_id FROM room_amenity WHERE room_feature = 'Balcony');\n",
" ''') AS alloydb_data\n",
" UNION ALL\n",
" SELECT 'BigQuery' AS source_system,\n",
" 'Balcony' AS amenity,\n",
" CAST(AVG(pricing.room_size_sqft) AS INT64) AS average_sq_ft,\n",
" CAST(AVG(pricing.price) AS INT64) AS average_price\n",
" FROM VECTOR_SEARCH(TABLE `data-connect-demo2.cleaned_data.pricing_embeddings`,\n",
" 'vector_embedding', -- column in table to search\n",
" (SELECT ml_generate_embedding_result,\n",
" content AS query\n",
" FROM ML.GENERATE_EMBEDDING(MODEL `data-connect-demo2.cleaned_data.text_embedding_005`,\n",
" (SELECT 'balcony' AS content),\n",
" STRUCT(TRUE AS flatten_json_output,\n",
" 'SEMANTIC_SIMILARITY' as task_type,\n",
" 768 AS output_dimensionality) -- struct\n",
" )),\n",
" top_k => 100) AS vector_table\n",
" INNER JOIN `data-connect-demo2.cleaned_data.pricing` AS pricing\n",
" ON vector_table.base.pricing_id = pricing.pricing_id\n",
" WHERE distance < .66\n",
" UNION ALL\n",
" SELECT 'BigQuery' AS source_system,\n",
" 'No Balcony' AS amenity,\n",
" CAST(AVG(pricing.room_size_sqft) AS INT64) AS average_sq_ft,\n",
" CAST(AVG(pricing.price) AS INT64) AS average_price\n",
" FROM VECTOR_SEARCH(TABLE `data-connect-demo2.cleaned_data.pricing_embeddings`,\n",
" 'vector_embedding', -- column in table to search\n",
" (SELECT ml_generate_embedding_result,\n",
" content AS query\n",
" FROM ML.GENERATE_EMBEDDING(MODEL `data-connect-demo2.cleaned_data.text_embedding_005`,\n",
" (SELECT 'balcony' AS content),\n",
" STRUCT(TRUE AS flatten_json_output,\n",
" 'SEMANTIC_SIMILARITY' as task_type,\n",
" 768 AS output_dimensionality) -- struct\n",
" )),\n",
" top_k => 200) AS vector_table\n",
" INNER JOIN `data-connect-demo2.cleaned_data.pricing` AS pricing\n",
" ON vector_table.base.pricing_id = pricing.pricing_id\n",
" WHERE distance >= .66\n",
" ORDER BY source_system, amenity\n",
")\n",
", embeddings_data AS\n",
"(\n",
" SELECT TO_JSON_STRING(STRUCT(source_system, amenity, average_sq_ft, average_price)) AS embeddings_json\n",
" FROM federated_query\n",
")\n",
", embeddings_array AS\n",
"(\n",
"SELECT ARRAY_AGG(embeddings_json) AS embeddings_json_array\n",
" FROM embeddings_data\n",
")\n",
"SELECT ml_generate_text_result.candidates[0].content.parts[0].text as llm_result\n",
" FROM ML.GENERATE_TEXT(MODEL`cleaned_data.gemini_2_0_flash`,\n",
" (SELECT REPLACE(prompt,\n",
" 'REPLACE-ME-WITH-EMBEDDING-SEARCH-RESULTS',\n",
" ARRAY_TO_STRING(embeddings_json_array, '\\n')) AS prompt\n",
" FROM llm_prompt CROSS JOIN embeddings_array),\n",
" STRUCT(\n",
" .5 AS temperature,\n",
" 2048 AS max_output_tokens,\n",
" 0.95 AS top_p,\n",
" 40 AS top_k)\n",
" )"
],
"metadata": {
"id": "uKHLM8oK51W4"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"**<font color='#4285f4'>Example Result:</font>**\n",
"\n",
"Here's a recommendation for management based on the competitor pricing data, along with the reasoning:\n",
"\n",
"**Recommendation:**\n",
"We should consider slightly increasing the price of our rooms with balconies to better align with competitor pricing, as they are charging a premium for balcony rooms. We should also consider highlighting the slightly larger average square footage of our balcony rooms as a value add when compared to competitors.\n",
"\n",
"**Reasoning:**\n",
"The data shows competitors are charging an average of \\$308 for balcony rooms, while we are charging $286. This suggests an opportunity to increase revenue by capitalizing on the value customers place on having a balcony. Additionally, our balcony rooms average slightly larger square footage (461 sq ft) compared to the competition (458 sq ft). Highlighting this difference can help justify a price increase and attract customers seeking more spacious accommodations."
],
"metadata": {
"id": "c3xx4FuS9pRp"
}
},
{
"cell_type": "markdown",
"source": [
"##### Optional - Verify Gemini's analysis by asking if the original question was accurated answered"
],
"metadata": {
"id": "cV6VAF12rNnJ"
}
},
{
"cell_type": "code",
"source": [
"%%bigquery\n",
"----------------------------------------------------------------------------------------------------------------\n",
"-- Ask Gemini to verify the RAG result\n",
"\n",
"-- I need you to respond with True or False if the following original prompt (<original-prompt>)\n",
"-- was correctly answered by a LLM (<llm-response>).\n",
"--\n",
"-- <original-prompt>\n",
"-- I work at a hotel and just imported competitor pricing into BigQuery.\n",
"-- I reviewed the data for rooms that have a balcony and matched these to the rooms we have from our AlloyDB system.\n",
"-- I need to generate some insights to report to management.\n",
"-- Can you generate 2 to 3 sentences as to what we should recommend to management.\n",
"-- Explain your reasoning in your response.\n",
"-- <context>\n",
"-- {\"source_system\":\"AlloyDB\",\"amenity\":\"Balcony\",\"average_sq_ft\":461,\"average_price\":286}\n",
"-- \t{\"source_system\":\"AlloyDB\",\"amenity\":\"No Balcony\",\"average_sq_ft\":420,\"average_price\":257}\n",
"-- \t{\"source_system\":\"BigQuery\",\"amenity\":\"Balcony\",\"average_sq_ft\":458,\"average_price\":308}\n",
"-- \t{\"source_system\":\"BigQuery\",\"amenity\":\"No Balcony\",\"average_sq_ft\":412,\"average_price\":257}\n",
"-- </context>\n",
"-- </original-prompt>\n",
"--\n",
"-- <llm-response>\n",
"-- Here's a recommendation for management, along with the reasoning:\n",
"-- Recommendation: We should consider slightly increasing the price of our balcony rooms, as\n",
"-- competitor pricing indicates they are achieving a $22 premium for similar rooms.\n",
"-- Given our comparable square footage, capturing a portion of this price difference could\n",
"-- significantly increase revenue without impacting occupancy.\n",
"-- Reasoning: The data shows that competitors are charging a higher average price ($308)\n",
"-- for balcony rooms compared to our current average price ($286) for similar rooms.\n",
"-- While the square footage is comparable, this price difference suggests there's room\n",
"-- to increase our revenue by adjusting our pricing strategy for balcony rooms to be\n",
"-- more in line with market rates.\n",
"-- </llm-response>\n",
"----------------------------------------------------------------------------------------------------------------\n",
"WITH llm_prompt AS\n",
"(\n",
" SELECT \"\"\"\n",
" I need you to respond with True or False if the following original prompt (<original-prompt>)\n",
" was correctly answered by a LLM (<llm-response>).\n",
"\n",
" <original-prompt>\n",
" I work at a hotel and just imported competitor pricing into BigQuery.\n",
" I reviewed the data for rooms that have a balcony and matched these to the rooms we have from our AlloyDB system.\n",
" I need to generate some insights to report to management.\n",
" Can you generate 2 to 3 sentences as to what we should recommend to management.\n",
" Explain your reasoning in your response.\n",
" <context>\n",
" {\"source_system\":\"AlloyDB\",\"amenity\":\"Balcony\",\"average_sq_ft\":461,\"average_price\":286}\n",
" {\"source_system\":\"AlloyDB\",\"amenity\":\"No Balcony\",\"average_sq_ft\":420,\"average_price\":257}\n",
" {\"source_system\":\"BigQuery\",\"amenity\":\"Balcony\",\"average_sq_ft\":458,\"average_price\":308}\n",
" {\"source_system\":\"BigQuery\",\"amenity\":\"No Balcony\",\"average_sq_ft\":412,\"average_price\":257}\n",
" </context>\n",
" </original-prompt>\n",
"\n",
" <llm-response>\n",
" \"Here's a recommendation for management, along with the reasoning:\\n\\n**Recommendation:** We should consider slightly increasing the price of our balcony rooms, as competitor pricing indicates they are achieving a $22 premium for similar rooms. Given our comparable square footage, capturing a portion of this price difference could significantly increase revenue without impacting occupancy.\\n\\n**Reasoning:** The data shows that competitors are charging a higher average price ($308) for balcony rooms compared to our current average price ($286) for similar rooms. While the square footage is comparable, this price difference suggests there's room to increase our revenue by adjusting our pricing strategy for balcony rooms to be more in line with market rates.\\n\"\n",
" </llm-response>\"\"\" AS prompt\n",
")\n",
"SELECT ml_generate_text_result.candidates[0].content.parts[0].text as llm_result\n",
" FROM ML.GENERATE_TEXT(MODEL`cleaned_data.gemini_2_0_flash`,\n",
" (SELECT prompt AS prompt FROM llm_prompt),\n",
" STRUCT(\n",
" .1 AS temperature,\n",
" 100 AS max_output_tokens,\n",
" 0.95 AS top_p,\n",
" 40 AS top_k)\n",
" )"
],
"metadata": {
"id": "ARzotMkfrOGk"
},
"execution_count": null,
"outputs": []
}
]
}