genai-for-marketing/notebooks/data_qa_with_sql.ipynb (629 lines of code) (raw):
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Copyright 2023 Google LLC\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Q&A with PaLM API and GoogleSQL"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Overview\n",
"\n",
"Data Q&A with PaLM API and SQL is a new way to interact with data. PaLM API is a large language model from Google AI, trained on a massive dataset of text and code. SQL is a standard language for accessing and manipulating data in databases. By combining these two technologies, you can ask questions about data in a natural language way, and PaLM API will generate SQL queries that can be used to answer your questions. \n",
"\n",
"The notebook is structured as follows:\n",
" - You will begin by querying a dataset and previewing its data\n",
" - Next, you formulate questions to ask your data\n",
" - Finally, we create a prompt, including all the metadata from the dataset and your question, and submit to the model.\n",
" The model will generate GoogleSQL code to query BigQuery.\n",
"\n",
"The prompts were inspired by the guidelines provided in this [paper](https://arxiv.org/pdf/2306.00739.pdf) - SQL-PALM: IMPROVED LARGE LANGUAGE MODEL ADAPTATION FOR TEXT-TO-SQL."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Install pre-requisites"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Install python packages."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! pip install google-cloud-datacatalog\n",
"! pip install google-cloud-aiplatform\n",
"! pip install pandas\n",
"! pip install google-api-python-client\n",
"! pip install python-dateutil\n",
"! pip install google-cloud-bigquery"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Before you execute this notebook, you must setup up your dataset on BigQuery and TagTemplate on Dataplex. \n",
"Follow the [Quick Start](./1_environment_setup.ipynb) notebook if you haven't done so."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"#### ⚠️ Do not forget to RESTART THE RUNTIME before continue.\n",
"\n",
"---"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Configure Google Cloud environment settings\n",
"\n",
"Set the following constants to reflect your GCP environment.\n",
"- `PROJECT_ID`: Your Google Cloud Project ID.\n",
"- `LOCATION`: The region to use for Vertex AI\n",
"- `DATASET_ID`: Id of the dataset we will submit queries\n",
"- `MODEL_NAME`: Name of the model to generate GoogleSQL from questions using natural language"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Set project parameters\n",
"PROJECT_ID = '<YOUR PROJECT ID HERE>'\n",
"\n",
"# [optional] Change the following parameters\n",
"LOCATION = 'us-central1'\n",
"DATASET_ID = 'cdp_dataset'\n",
"MODEL_NAME = 'text-bison'\n",
"\n",
"# Variables to query the dataset metadata\n",
"TAG_TEMPLATE_NAME = f'projects/{PROJECT_ID}/locations/{LOCATION}/tagTemplates/llmcdptemplate'\n",
"QUERY = f'SELECT * FROM `{PROJECT_ID}.cdp_dataset.INFORMATION_SCHEMA.TABLES` WHERE table_name NOT LIKE \"%metadata%\"'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"prompt = '''This is a task converting text into GoogleSQL statement. We will first given the dataset schema and then ask a question in text. You are asked to generate SQL statement.\n",
"Here is an example: Convert text to SQL:\n",
"[SCHEMA details for table `rl-llm-dev.cdp_dataset.customers`]\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: cart_total - Data Type: FLOAT64 - Primary Key: False - foreign Key: False - Description: The value of the items in the customer's shopping cart.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: channel - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The channel through which the customer was acquired.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: city - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The city where the customer lives.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: customer_id - Data Type: INT64 - Primary Key: True - foreign Key: False - Description: A unique identifier of the customer.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: email - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The customer's email address.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: is_media_follower - Data Type: BOOL - Primary Key: False - foreign Key: False - Description: Whether the customer is a social media follower.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: last_activity_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date of the customer's last account activity.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: last_purchase_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date the customer made their last purchase.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: last_sign_up_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date the customer signed up.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: loyalty_score - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: A score that measures the customer's engagement with the company.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: state - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The state where the customer lives.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: total_emails - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The total number of emails opened by the customer.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: total_purchases - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The total number of purchases made by the customer.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: total_value - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The total value of all purchases made by the customer.\n",
"[SCHEMA details for table `rl-llm-dev.cdp_dataset.transactions`]\n",
"Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: app_purchase_qnt - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The value of the in-app purchase.\n",
"Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: customer_id - Data Type: INT64 - Primary Key: False - foreign Key: True - Description: A unique identifier of the customer.\n",
"Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: product_id - Data Type: INT64 - Primary Key: False - foreign Key: True - Description: The code of the inventory item that was purchased.\n",
"Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: product_name - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The name of the product that was purchased.\n",
"Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transaction_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date the transaction was made.\n",
"Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transaction_id - Data Type: INT64 - Primary Key: True - foreign Key: False - Description: A unique identifier for the transaction.\n",
"Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transaction_value - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The total value of the transaction.\n",
"Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transation_qnt - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The quantity of items purchased in the transaction.\n",
"Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transation_type - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The type of transaction (e.g., purchase, refund, etc.).\n",
"[SCHEMA details for table `rl-llm-dev.cdp_dataset.events`]:\n",
"Full table name: `rl-llm-dev.cdp_dataset.events` - Column: customer_id - Data Type: INT64 - Primary Key: False - foreign Key: True - Description: A unique identifier of the customer.\n",
"Full table name: `rl-llm-dev.cdp_dataset.events` - Column: event_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date of the event.\n",
"Full table name: `rl-llm-dev.cdp_dataset.events` - Column: event_id - Data Type: INT64 - Primary Key: True - foreign Key: False - Description: A unique identifier for the event.\n",
"Full table name: `rl-llm-dev.cdp_dataset.events` - Column: event_type - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The type of event.\n",
"[Q]: What is the city and state with the most customer transactions?\n",
"[SQL]: SELECT c.city, c.state, SUM(t.transaction_qnt) as total_transactions\n",
"FROM `rl-llm-dev.cdp_dataset.customers` AS c\n",
"JOIN `rl-llm-dev.cdp_dataset.transactions` as t\n",
"ON c.customer_id = t.customer_id\n",
"GROUP BY c.city, c.state\n",
"ORDER BY total_transactions DESC\n",
"\n",
"Here is an example: Convert text to SQL:\n",
"[SCHEMA details for table `rl-llm-dev.cdp_dataset.customers`]\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: cart_total - Data Type: FLOAT64 - Primary Key: False - foreign Key: False - Description: The value of the items in the customer's shopping cart.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: channel - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The channel through which the customer was acquired.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: city - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The city where the customer lives.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: customer_id - Data Type: INT64 - Primary Key: True - foreign Key: False - Description: A unique identifier of the customer.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: email - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The customer's email address.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: is_media_follower - Data Type: BOOL - Primary Key: False - foreign Key: False - Description: Whether the customer is a social media follower.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: last_activity_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date of the customer's last account activity.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: last_purchase_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date the customer made their last purchase.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: last_sign_up_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date the customer signed up.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: loyalty_score - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: A score that measures the customer's engagement with the company.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: state - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The state where the customer lives.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: total_emails - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The total number of emails opened by the customer.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: total_purchases - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The total number of purchases made by the customer.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: total_value - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The total value of all purchases made by the customer.\n",
"[SCHEMA details for table `rl-llm-dev.cdp_dataset.transactions`]\n",
"Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: app_purchase_qnt - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The value of the in-app purchase.\n",
"Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: customer_id - Data Type: INT64 - Primary Key: False - foreign Key: True - Description: A unique identifier of the customer.\n",
"Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: product_id - Data Type: INT64 - Primary Key: False - foreign Key: True - Description: The code of the inventory item that was purchased.\n",
"Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: product_name - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The name of the product that was purchased.\n",
"Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transaction_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date the transaction was made.\n",
"Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transaction_id - Data Type: INT64 - Primary Key: True - foreign Key: False - Description: A unique identifier for the transaction.\n",
"Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transaction_value - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The total value of the transaction.\n",
"Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transation_qnt - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The quantity of items purchased in the transaction.\n",
"Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transation_type - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The type of transaction (e.g., purchase, refund, etc.).\n",
"[SCHEMA details for table `rl-llm-dev.cdp_dataset.events`]:\n",
"Full table name: `rl-llm-dev.cdp_dataset.events` - Column: customer_id - Data Type: INT64 - Primary Key: False - foreign Key: True - Description: A unique identifier of the customer.\n",
"Full table name: `rl-llm-dev.cdp_dataset.events` - Column: event_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date of the event.\n",
"Full table name: `rl-llm-dev.cdp_dataset.events` - Column: event_id - Data Type: INT64 - Primary Key: True - foreign Key: False - Description: A unique identifier for the event.\n",
"Full table name: `rl-llm-dev.cdp_dataset.events` - Column: event_type - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The type of event.\n",
"[Q]: What are the customer emails ordered by the quantity of purchases by the customer in the city of Atlanta and the state of Georgia?\n",
"[SQL]: SELECT c.email, c.total_purchases \n",
"FROM `rl-llm-dev.cdp_dataset.customers` as c\n",
"WHERE c.city = \"Atlanta\" AND c.state = \"Georgia\"\n",
"ORDER BY c.total_purchases DESC\n",
"\n",
"Here is an example: Convert text to SQL:\n",
"[SCHEMA details for table `rl-llm-dev.cdp_dataset.customers`]\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: cart_total - Data Type: FLOAT64 - Primary Key: False - foreign Key: False - Description: The value of the items in the customer's shopping cart.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: channel - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The channel through which the customer was acquired.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: city - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The city where the customer lives.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: customer_id - Data Type: INT64 - Primary Key: True - foreign Key: False - Description: A unique identifier of the customer.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: email - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The customer's email address.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: is_media_follower - Data Type: BOOL - Primary Key: False - foreign Key: False - Description: Whether the customer is a social media follower.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: last_activity_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date of the customer's last account activity.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: last_purchase_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date the customer made their last purchase.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: last_sign_up_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date the customer signed up.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: loyalty_score - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: A score that measures the customer's engagement with the company.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: state - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The state where the customer lives.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: total_emails - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The total number of emails opened by the customer.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: total_purchases - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The total number of purchases made by the customer.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: total_value - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The total value of all purchases made by the customer.\n",
"[SCHEMA details for table `rl-llm-dev.cdp_dataset.transactions`]\n",
"Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: app_purchase_qnt - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The value of the in-app purchase.\n",
"Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: customer_id - Data Type: INT64 - Primary Key: False - foreign Key: True - Description: A unique identifier of the customer.\n",
"Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: product_id - Data Type: INT64 - Primary Key: False - foreign Key: True - Description: The code of the inventory item that was purchased.\n",
"Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: product_name - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The name of the product that was purchased.\n",
"Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transaction_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date the transaction was made.\n",
"Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transaction_id - Data Type: INT64 - Primary Key: True - foreign Key: False - Description: A unique identifier for the transaction.\n",
"Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transaction_value - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The total value of the transaction.\n",
"Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transation_qnt - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The quantity of items purchased in the transaction.\n",
"Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transation_type - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The type of transaction (e.g., purchase, refund, etc.).\n",
"[SCHEMA details for table `rl-llm-dev.cdp_dataset.events`]:\n",
"Full table name: `rl-llm-dev.cdp_dataset.events` - Column: customer_id - Data Type: INT64 - Primary Key: False - foreign Key: True - Description: A unique identifier of the customer.\n",
"Full table name: `rl-llm-dev.cdp_dataset.events` - Column: event_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date of the event.\n",
"Full table name: `rl-llm-dev.cdp_dataset.events` - Column: event_id - Data Type: INT64 - Primary Key: True - foreign Key: False - Description: A unique identifier for the event.\n",
"Full table name: `rl-llm-dev.cdp_dataset.events` - Column: event_type - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The type of event.\n",
"[Q]: What are the customer emails ordered by the total transactions in app by the customer in the city of Atlanta and the state of Georgia?\n",
"[SQL]: SELECT c.email, SUM(t.app_purchase_qnt) as total_app_purchase\n",
"FROM `rl-llm-dev.cdp_dataset.customers` as c\n",
"JOIN `rl-llm-dev.cdp_dataset.transactions` as t\n",
"ON c.customer_id = t.customer_id\n",
"WHERE c.city = \"Atlanta\" AND c.state = \"Georgia\"\n",
"GROUP BY c.email, c.city, c.state\n",
"ORDER BY total_app_purchase DESC\n",
"\n",
"Here is an example: Convert text to SQL:\n",
"[SCHEMA details for table `rl-llm-dev.cdp_dataset.customers`]\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: cart_total - Data Type: FLOAT64 - Primary Key: False - foreign Key: False - Description: The value of the items in the customer's shopping cart.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: channel - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The channel through which the customer was acquired.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: city - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The city where the customer lives.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: customer_id - Data Type: INT64 - Primary Key: True - foreign Key: False - Description: A unique identifier of the customer.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: email - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The customer's email address.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: is_media_follower - Data Type: BOOL - Primary Key: False - foreign Key: False - Description: Whether the customer is a social media follower.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: last_activity_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date of the customer's last account activity.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: last_purchase_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date the customer made their last purchase.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: last_sign_up_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date the customer signed up.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: loyalty_score - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: A score that measures the customer's engagement with the company.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: state - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The state where the customer lives.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: total_emails - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The total number of emails opened by the customer.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: total_purchases - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The total number of purchases made by the customer.\n",
"Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: total_value - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The total value of all purchases made by the customer.\n",
"[SCHEMA details for table `rl-llm-dev.cdp_dataset.transactions`]\n",
"Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: app_purchase_qnt - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The value of the in-app purchase.\n",
"Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: customer_id - Data Type: INT64 - Primary Key: False - foreign Key: True - Description: A unique identifier of the customer.\n",
"Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: product_id - Data Type: INT64 - Primary Key: False - foreign Key: True - Description: The code of the inventory item that was purchased.\n",
"Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: product_name - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The name of the product that was purchased.\n",
"Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transaction_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date the transaction was made.\n",
"Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transaction_id - Data Type: INT64 - Primary Key: True - foreign Key: False - Description: A unique identifier for the transaction.\n",
"Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transaction_value - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The total value of the transaction.\n",
"Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transation_qnt - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The quantity of items purchased in the transaction.\n",
"Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transation_type - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The type of transaction (e.g., purchase, refund, etc.).\n",
"[SCHEMA details for table `rl-llm-dev.cdp_dataset.events`]:\n",
"Full table name: `rl-llm-dev.cdp_dataset.events` - Column: customer_id - Data Type: INT64 - Primary Key: False - foreign Key: True - Description: A unique identifier of the customer.\n",
"Full table name: `rl-llm-dev.cdp_dataset.events` - Column: event_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date of the event.\n",
"Full table name: `rl-llm-dev.cdp_dataset.events` - Column: event_id - Data Type: INT64 - Primary Key: True - foreign Key: False - Description: A unique identifier for the event.\n",
"Full table name: `rl-llm-dev.cdp_dataset.events` - Column: event_type - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The type of event.\n",
"[Q]: Retrieve top 10 customer emails ordered by loyalty score\n",
"[SQL]: SELECT c.email\n",
"FROM `rl-llm-dev.cdp_dataset.customers` as c\n",
"ORDER BY c.loyalty_score DESC \n",
"\n",
"Here is an example: Convert text to SQL:'''"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Initialize the SDK and import some modules."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import vertexai\n",
"\n",
"from google.cloud import bigquery\n",
"from google.cloud import datacatalog_v1\n",
"\n",
"from google.cloud.exceptions import NotFound\n",
"from google.cloud.exceptions import BadRequest\n",
"from IPython.display import display\n",
"from vertexai.preview.language_models import TextGenerationModel\n",
"\n",
"vertexai.init(project=PROJECT_ID, location=LOCATION)\n",
"bq_client = bigquery.Client(project=PROJECT_ID)\n",
"client_code_model = TextGenerationModel.from_pretrained(MODEL_NAME)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Preview dataset"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Run the next cell to query BigQuery and preview the tables from the dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query_table_1 = f'SELECT * FROM `{PROJECT_ID}.{DATASET_ID}.customers` LIMIT 3'\n",
"query_table_2 = f'SELECT * FROM `{PROJECT_ID}.{DATASET_ID}.events` LIMIT 3'\n",
"query_table_3 = f'SELECT * FROM `{PROJECT_ID}.{DATASET_ID}.transactions` LIMIT 3'\n",
"\n",
"for query, table_name in zip(\n",
" [query_table_1, query_table_2, query_table_3],\n",
" ['customers', 'events', 'transactions']\n",
" ):\n",
" print(f'==> Table: {table_name}')\n",
" query_job = bq_client.query(query)\n",
" rows = query_job.result()\n",
" display(rows.to_dataframe())"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Ask questions to your dataset"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The following functions extract all the relevant metadata from the dataset to use as a context to the prompt."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def get_tags_from_table(\n",
" dataset_id: str, \n",
" table_id: str, \n",
" project_id: str, \n",
" tag_template_name: str\n",
"):\n",
" \"\"\"Gets the tags from a BigQuery table.\n",
"\n",
" Args:\n",
" dataset_id:\n",
" The ID of the BigQuery dataset that contains the table.\n",
" table_id: \n",
" The ID of the BigQuery table.\n",
" project_id: \n",
" The ID of the Google Cloud project.\n",
" tag_template_name: \n",
" The name of the tag template.\n",
"\n",
" Returns:\n",
" A string containing the tags for the table.\n",
" \"\"\"\n",
" # Lookup Data Catalog's Entry referring to the table.\n",
" datacatalog_client = datacatalog_v1.DataCatalogClient()\n",
" resource_name = (\n",
" f\"//bigquery.googleapis.com/projects/{project_id}\"\n",
" f\"/datasets/{dataset_id}/tables/{table_id}\"\n",
" )\n",
" table_entry = datacatalog_client.lookup_entry(\n",
" request={\"linked_resource\": resource_name}\n",
" )\n",
"\n",
" # Make the request\n",
" page_result = datacatalog_client.list_tags(parent=table_entry.name)\n",
" tags_str = ''\n",
" # Handle the response\n",
" for response in page_result:\n",
" if response.template == tag_template_name:\n",
" desc = response.fields[\"description\"].string_value\n",
" data_type = response.fields[\"data_type\"].string_value\n",
" pk = response.fields[\"is_primary_key\"].bool_value\n",
" fk = response.fields[\"is_foreign_key\"].bool_value \n",
" tags_str += (\"Full table name: {} \"\n",
" \"- Column: {} \" \n",
" \"- Data Type: {} \" \n",
" \"- Primary Key: {} \" \n",
" \"- Foreign Key: {} \" \n",
" \"- Description: {}\\n\".format(\n",
" f'`{project_id}.{dataset_id}.{table_id}`', response.column, data_type, pk, fk, desc))\n",
"\n",
" return tags_str\n",
"\n",
"def get_metadata_from_dataset(\n",
" query: str,\n",
" project_id: str,\n",
" dataset_id: str, \n",
" tag_template_name: str\n",
"):\n",
" \"\"\"Gets the metadata for all tables in a BigQuery dataset.\n",
"\n",
" Args:\n",
" query: \n",
" The BigQuery query to run to get the list of tables.\n",
" project_id: \n",
" The ID of the BigQuery project.\n",
" dataset_id: \n",
" The ID of the BigQuery dataset.\n",
" tag_template_name: \n",
" The name of the BigQuery tag template to use to get the table descriptions.\n",
" state_key: \n",
" The key to use to store the metadata in the Streamlit session state.\n",
" \"\"\"\n",
" query_job = bq_client.query(query) # API request\n",
" rows = query_job.result()\n",
" metadata = []\n",
"\n",
" for row in rows:\n",
" full_table_path = f'`{project_id}.{dataset_id}.{row.table_name}`'\n",
" table_metadata = f'[SCHEMA details for table {full_table_path}]\\n'\n",
"\n",
" table_metadata += get_tags_from_table(\n",
" dataset_id, row.table_name, project_id, tag_template_name)\n",
" metadata.append(table_metadata)\n",
"\n",
" return metadata"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The following functions create a prompt to submit to the model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def generate_prompt(\n",
" question: str,\n",
" metadata: list\n",
"):\n",
" \"\"\"Generates a prompt for a GoogleSQL query compatible with BigQuery.\n",
"\n",
" Args:\n",
" question: \n",
" The question to answer.\n",
" metadata: \n",
" A list of dictionaries, where each dictionary describes a BigQuery table. \n",
" The dictionaries should have the following keys:\n",
" - name: The name of the table.\n",
" - schema: The schema of the table.\n",
" - description: A description of the table.\n",
" state_key: \n",
" The key to use to store the prompt in the session state.\n",
"\n",
" Returns:\n",
" The prompt.\n",
" \"\"\"\n",
" context = ''\n",
" for i in metadata:\n",
" context += i\n",
"\n",
" return f\"\"\"{prompt}\n",
"{context}\n",
"[Q]: {question}\n",
"[SQL]: \n",
"\"\"\""
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Questions"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Define your question using natural language."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"question = 'Retrieve top 10 customer emails ordered by transaction value'\n",
"# question = 'What are the customer emails ordered by the sum of transactions value by customers in the city of Atlanta and the state of Georgia?'"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The next cell will:\n",
" - Get the metadata from this dataset\n",
" - Generate a prompt\n",
" - Submit the prompt to PaLM API (Codey)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"metadata = get_metadata_from_dataset(\n",
" query=QUERY, \n",
" project_id=PROJECT_ID, \n",
" dataset_id=DATASET_ID,\n",
" tag_template_name=TAG_TEMPLATE_NAME)\n",
" \n",
"prompt_metadata = generate_prompt(question, metadata)\n",
" \n",
"gen_code = client_code_model.predict(\n",
" prompt = prompt_metadata,\n",
" max_output_tokens = 1024,\n",
" temperature=0.2)\n",
"\n",
"# Generated GoogleSQL code\n",
"print(gen_code.text)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The next cell will use the generated GoogleSQL to query BigQuery and print the results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Results\n",
"try:\n",
" result_query = bq_client.query(gen_code.text)\n",
" result_query.result()\n",
"except NotFound as nf:\n",
" print('Dataset or table not found.')\n",
" print('Review the table details and ask a different question.')\n",
"except BadRequest as br:\n",
" print('Bad query.')\n",
" print('Review the table details and ask a different question.')\n",
"else:\n",
" print('Query executed successfully. Retrieving dataset.')\n",
" final_result = result_query.to_dataframe()\n",
" print('Resulting table (limited by 50 rows)')\n",
" display(final_result.iloc[:50])"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "llm-dev-py311",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}