genai-for-marketing/notebooks/data_qa_with_sql.ipynb (629 lines of code) (raw):

{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Copyright 2023 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Data Q&A with PaLM API and GoogleSQL" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Overview\n", "\n", "Data Q&A with PaLM API and SQL is a new way to interact with data. PaLM API is a large language model from Google AI, trained on a massive dataset of text and code. SQL is a standard language for accessing and manipulating data in databases. By combining these two technologies, you can ask questions about data in a natural language way, and PaLM API will generate SQL queries that can be used to answer your questions. \n", "\n", "The notebook is structured as follows:\n", " - You will begin by querying a dataset and previewing its data\n", " - Next, you formulate questions to ask your data\n", " - Finally, we create a prompt, including all the metadata from the dataset and your question, and submit to the model.\n", " The model will generate GoogleSQL code to query BigQuery.\n", "\n", "The prompts were inspired by the guidelines provided in this [paper](https://arxiv.org/pdf/2306.00739.pdf) - SQL-PALM: IMPROVED LARGE LANGUAGE MODEL ADAPTATION FOR TEXT-TO-SQL." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Install pre-requisites" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Install python packages." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! pip install google-cloud-datacatalog\n", "! pip install google-cloud-aiplatform\n", "! pip install pandas\n", "! pip install google-api-python-client\n", "! pip install python-dateutil\n", "! pip install google-cloud-bigquery" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Before you execute this notebook, you must setup up your dataset on BigQuery and TagTemplate on Dataplex. \n", "Follow the [Quick Start](./1_environment_setup.ipynb) notebook if you haven't done so." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "#### ⚠️ Do not forget to RESTART THE RUNTIME before continue.\n", "\n", "---" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Configure Google Cloud environment settings\n", "\n", "Set the following constants to reflect your GCP environment.\n", "- `PROJECT_ID`: Your Google Cloud Project ID.\n", "- `LOCATION`: The region to use for Vertex AI\n", "- `DATASET_ID`: Id of the dataset we will submit queries\n", "- `MODEL_NAME`: Name of the model to generate GoogleSQL from questions using natural language" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Set project parameters\n", "PROJECT_ID = '<YOUR PROJECT ID HERE>'\n", "\n", "# [optional] Change the following parameters\n", "LOCATION = 'us-central1'\n", "DATASET_ID = 'cdp_dataset'\n", "MODEL_NAME = 'text-bison'\n", "\n", "# Variables to query the dataset metadata\n", "TAG_TEMPLATE_NAME = f'projects/{PROJECT_ID}/locations/{LOCATION}/tagTemplates/llmcdptemplate'\n", "QUERY = f'SELECT * FROM `{PROJECT_ID}.cdp_dataset.INFORMATION_SCHEMA.TABLES` WHERE table_name NOT LIKE \"%metadata%\"'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "prompt = '''This is a task converting text into GoogleSQL statement. We will first given the dataset schema and then ask a question in text. You are asked to generate SQL statement.\n", "Here is an example: Convert text to SQL:\n", "[SCHEMA details for table `rl-llm-dev.cdp_dataset.customers`]\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: cart_total - Data Type: FLOAT64 - Primary Key: False - foreign Key: False - Description: The value of the items in the customer's shopping cart.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: channel - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The channel through which the customer was acquired.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: city - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The city where the customer lives.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: customer_id - Data Type: INT64 - Primary Key: True - foreign Key: False - Description: A unique identifier of the customer.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: email - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The customer's email address.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: is_media_follower - Data Type: BOOL - Primary Key: False - foreign Key: False - Description: Whether the customer is a social media follower.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: last_activity_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date of the customer's last account activity.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: last_purchase_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date the customer made their last purchase.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: last_sign_up_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date the customer signed up.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: loyalty_score - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: A score that measures the customer's engagement with the company.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: state - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The state where the customer lives.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: total_emails - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The total number of emails opened by the customer.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: total_purchases - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The total number of purchases made by the customer.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: total_value - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The total value of all purchases made by the customer.\n", "[SCHEMA details for table `rl-llm-dev.cdp_dataset.transactions`]\n", "Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: app_purchase_qnt - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The value of the in-app purchase.\n", "Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: customer_id - Data Type: INT64 - Primary Key: False - foreign Key: True - Description: A unique identifier of the customer.\n", "Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: product_id - Data Type: INT64 - Primary Key: False - foreign Key: True - Description: The code of the inventory item that was purchased.\n", "Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: product_name - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The name of the product that was purchased.\n", "Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transaction_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date the transaction was made.\n", "Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transaction_id - Data Type: INT64 - Primary Key: True - foreign Key: False - Description: A unique identifier for the transaction.\n", "Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transaction_value - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The total value of the transaction.\n", "Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transation_qnt - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The quantity of items purchased in the transaction.\n", "Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transation_type - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The type of transaction (e.g., purchase, refund, etc.).\n", "[SCHEMA details for table `rl-llm-dev.cdp_dataset.events`]:\n", "Full table name: `rl-llm-dev.cdp_dataset.events` - Column: customer_id - Data Type: INT64 - Primary Key: False - foreign Key: True - Description: A unique identifier of the customer.\n", "Full table name: `rl-llm-dev.cdp_dataset.events` - Column: event_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date of the event.\n", "Full table name: `rl-llm-dev.cdp_dataset.events` - Column: event_id - Data Type: INT64 - Primary Key: True - foreign Key: False - Description: A unique identifier for the event.\n", "Full table name: `rl-llm-dev.cdp_dataset.events` - Column: event_type - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The type of event.\n", "[Q]: What is the city and state with the most customer transactions?\n", "[SQL]: SELECT c.city, c.state, SUM(t.transaction_qnt) as total_transactions\n", "FROM `rl-llm-dev.cdp_dataset.customers` AS c\n", "JOIN `rl-llm-dev.cdp_dataset.transactions` as t\n", "ON c.customer_id = t.customer_id\n", "GROUP BY c.city, c.state\n", "ORDER BY total_transactions DESC\n", "\n", "Here is an example: Convert text to SQL:\n", "[SCHEMA details for table `rl-llm-dev.cdp_dataset.customers`]\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: cart_total - Data Type: FLOAT64 - Primary Key: False - foreign Key: False - Description: The value of the items in the customer's shopping cart.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: channel - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The channel through which the customer was acquired.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: city - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The city where the customer lives.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: customer_id - Data Type: INT64 - Primary Key: True - foreign Key: False - Description: A unique identifier of the customer.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: email - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The customer's email address.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: is_media_follower - Data Type: BOOL - Primary Key: False - foreign Key: False - Description: Whether the customer is a social media follower.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: last_activity_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date of the customer's last account activity.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: last_purchase_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date the customer made their last purchase.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: last_sign_up_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date the customer signed up.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: loyalty_score - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: A score that measures the customer's engagement with the company.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: state - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The state where the customer lives.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: total_emails - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The total number of emails opened by the customer.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: total_purchases - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The total number of purchases made by the customer.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: total_value - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The total value of all purchases made by the customer.\n", "[SCHEMA details for table `rl-llm-dev.cdp_dataset.transactions`]\n", "Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: app_purchase_qnt - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The value of the in-app purchase.\n", "Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: customer_id - Data Type: INT64 - Primary Key: False - foreign Key: True - Description: A unique identifier of the customer.\n", "Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: product_id - Data Type: INT64 - Primary Key: False - foreign Key: True - Description: The code of the inventory item that was purchased.\n", "Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: product_name - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The name of the product that was purchased.\n", "Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transaction_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date the transaction was made.\n", "Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transaction_id - Data Type: INT64 - Primary Key: True - foreign Key: False - Description: A unique identifier for the transaction.\n", "Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transaction_value - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The total value of the transaction.\n", "Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transation_qnt - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The quantity of items purchased in the transaction.\n", "Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transation_type - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The type of transaction (e.g., purchase, refund, etc.).\n", "[SCHEMA details for table `rl-llm-dev.cdp_dataset.events`]:\n", "Full table name: `rl-llm-dev.cdp_dataset.events` - Column: customer_id - Data Type: INT64 - Primary Key: False - foreign Key: True - Description: A unique identifier of the customer.\n", "Full table name: `rl-llm-dev.cdp_dataset.events` - Column: event_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date of the event.\n", "Full table name: `rl-llm-dev.cdp_dataset.events` - Column: event_id - Data Type: INT64 - Primary Key: True - foreign Key: False - Description: A unique identifier for the event.\n", "Full table name: `rl-llm-dev.cdp_dataset.events` - Column: event_type - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The type of event.\n", "[Q]: What are the customer emails ordered by the quantity of purchases by the customer in the city of Atlanta and the state of Georgia?\n", "[SQL]: SELECT c.email, c.total_purchases \n", "FROM `rl-llm-dev.cdp_dataset.customers` as c\n", "WHERE c.city = \"Atlanta\" AND c.state = \"Georgia\"\n", "ORDER BY c.total_purchases DESC\n", "\n", "Here is an example: Convert text to SQL:\n", "[SCHEMA details for table `rl-llm-dev.cdp_dataset.customers`]\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: cart_total - Data Type: FLOAT64 - Primary Key: False - foreign Key: False - Description: The value of the items in the customer's shopping cart.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: channel - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The channel through which the customer was acquired.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: city - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The city where the customer lives.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: customer_id - Data Type: INT64 - Primary Key: True - foreign Key: False - Description: A unique identifier of the customer.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: email - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The customer's email address.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: is_media_follower - Data Type: BOOL - Primary Key: False - foreign Key: False - Description: Whether the customer is a social media follower.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: last_activity_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date of the customer's last account activity.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: last_purchase_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date the customer made their last purchase.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: last_sign_up_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date the customer signed up.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: loyalty_score - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: A score that measures the customer's engagement with the company.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: state - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The state where the customer lives.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: total_emails - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The total number of emails opened by the customer.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: total_purchases - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The total number of purchases made by the customer.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: total_value - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The total value of all purchases made by the customer.\n", "[SCHEMA details for table `rl-llm-dev.cdp_dataset.transactions`]\n", "Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: app_purchase_qnt - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The value of the in-app purchase.\n", "Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: customer_id - Data Type: INT64 - Primary Key: False - foreign Key: True - Description: A unique identifier of the customer.\n", "Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: product_id - Data Type: INT64 - Primary Key: False - foreign Key: True - Description: The code of the inventory item that was purchased.\n", "Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: product_name - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The name of the product that was purchased.\n", "Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transaction_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date the transaction was made.\n", "Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transaction_id - Data Type: INT64 - Primary Key: True - foreign Key: False - Description: A unique identifier for the transaction.\n", "Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transaction_value - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The total value of the transaction.\n", "Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transation_qnt - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The quantity of items purchased in the transaction.\n", "Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transation_type - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The type of transaction (e.g., purchase, refund, etc.).\n", "[SCHEMA details for table `rl-llm-dev.cdp_dataset.events`]:\n", "Full table name: `rl-llm-dev.cdp_dataset.events` - Column: customer_id - Data Type: INT64 - Primary Key: False - foreign Key: True - Description: A unique identifier of the customer.\n", "Full table name: `rl-llm-dev.cdp_dataset.events` - Column: event_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date of the event.\n", "Full table name: `rl-llm-dev.cdp_dataset.events` - Column: event_id - Data Type: INT64 - Primary Key: True - foreign Key: False - Description: A unique identifier for the event.\n", "Full table name: `rl-llm-dev.cdp_dataset.events` - Column: event_type - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The type of event.\n", "[Q]: What are the customer emails ordered by the total transactions in app by the customer in the city of Atlanta and the state of Georgia?\n", "[SQL]: SELECT c.email, SUM(t.app_purchase_qnt) as total_app_purchase\n", "FROM `rl-llm-dev.cdp_dataset.customers` as c\n", "JOIN `rl-llm-dev.cdp_dataset.transactions` as t\n", "ON c.customer_id = t.customer_id\n", "WHERE c.city = \"Atlanta\" AND c.state = \"Georgia\"\n", "GROUP BY c.email, c.city, c.state\n", "ORDER BY total_app_purchase DESC\n", "\n", "Here is an example: Convert text to SQL:\n", "[SCHEMA details for table `rl-llm-dev.cdp_dataset.customers`]\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: cart_total - Data Type: FLOAT64 - Primary Key: False - foreign Key: False - Description: The value of the items in the customer's shopping cart.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: channel - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The channel through which the customer was acquired.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: city - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The city where the customer lives.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: customer_id - Data Type: INT64 - Primary Key: True - foreign Key: False - Description: A unique identifier of the customer.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: email - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The customer's email address.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: is_media_follower - Data Type: BOOL - Primary Key: False - foreign Key: False - Description: Whether the customer is a social media follower.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: last_activity_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date of the customer's last account activity.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: last_purchase_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date the customer made their last purchase.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: last_sign_up_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date the customer signed up.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: loyalty_score - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: A score that measures the customer's engagement with the company.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: state - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The state where the customer lives.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: total_emails - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The total number of emails opened by the customer.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: total_purchases - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The total number of purchases made by the customer.\n", "Full table name: `rl-llm-dev.cdp_dataset.customers` - Column: total_value - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The total value of all purchases made by the customer.\n", "[SCHEMA details for table `rl-llm-dev.cdp_dataset.transactions`]\n", "Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: app_purchase_qnt - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The value of the in-app purchase.\n", "Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: customer_id - Data Type: INT64 - Primary Key: False - foreign Key: True - Description: A unique identifier of the customer.\n", "Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: product_id - Data Type: INT64 - Primary Key: False - foreign Key: True - Description: The code of the inventory item that was purchased.\n", "Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: product_name - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The name of the product that was purchased.\n", "Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transaction_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date the transaction was made.\n", "Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transaction_id - Data Type: INT64 - Primary Key: True - foreign Key: False - Description: A unique identifier for the transaction.\n", "Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transaction_value - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The total value of the transaction.\n", "Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transation_qnt - Data Type: INT64 - Primary Key: False - foreign Key: False - Description: The quantity of items purchased in the transaction.\n", "Full table name: `rl-llm-dev.cdp_dataset.transactions` - Column: transation_type - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The type of transaction (e.g., purchase, refund, etc.).\n", "[SCHEMA details for table `rl-llm-dev.cdp_dataset.events`]:\n", "Full table name: `rl-llm-dev.cdp_dataset.events` - Column: customer_id - Data Type: INT64 - Primary Key: False - foreign Key: True - Description: A unique identifier of the customer.\n", "Full table name: `rl-llm-dev.cdp_dataset.events` - Column: event_date - Data Type: DATE - Primary Key: False - foreign Key: False - Description: The date of the event.\n", "Full table name: `rl-llm-dev.cdp_dataset.events` - Column: event_id - Data Type: INT64 - Primary Key: True - foreign Key: False - Description: A unique identifier for the event.\n", "Full table name: `rl-llm-dev.cdp_dataset.events` - Column: event_type - Data Type: STRING - Primary Key: False - foreign Key: False - Description: The type of event.\n", "[Q]: Retrieve top 10 customer emails ordered by loyalty score\n", "[SQL]: SELECT c.email\n", "FROM `rl-llm-dev.cdp_dataset.customers` as c\n", "ORDER BY c.loyalty_score DESC \n", "\n", "Here is an example: Convert text to SQL:'''" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Initialize the SDK and import some modules." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import vertexai\n", "\n", "from google.cloud import bigquery\n", "from google.cloud import datacatalog_v1\n", "\n", "from google.cloud.exceptions import NotFound\n", "from google.cloud.exceptions import BadRequest\n", "from IPython.display import display\n", "from vertexai.preview.language_models import TextGenerationModel\n", "\n", "vertexai.init(project=PROJECT_ID, location=LOCATION)\n", "bq_client = bigquery.Client(project=PROJECT_ID)\n", "client_code_model = TextGenerationModel.from_pretrained(MODEL_NAME)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Preview dataset" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Run the next cell to query BigQuery and preview the tables from the dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "query_table_1 = f'SELECT * FROM `{PROJECT_ID}.{DATASET_ID}.customers` LIMIT 3'\n", "query_table_2 = f'SELECT * FROM `{PROJECT_ID}.{DATASET_ID}.events` LIMIT 3'\n", "query_table_3 = f'SELECT * FROM `{PROJECT_ID}.{DATASET_ID}.transactions` LIMIT 3'\n", "\n", "for query, table_name in zip(\n", " [query_table_1, query_table_2, query_table_3],\n", " ['customers', 'events', 'transactions']\n", " ):\n", " print(f'==> Table: {table_name}')\n", " query_job = bq_client.query(query)\n", " rows = query_job.result()\n", " display(rows.to_dataframe())" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Ask questions to your dataset" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The following functions extract all the relevant metadata from the dataset to use as a context to the prompt." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_tags_from_table(\n", " dataset_id: str, \n", " table_id: str, \n", " project_id: str, \n", " tag_template_name: str\n", "):\n", " \"\"\"Gets the tags from a BigQuery table.\n", "\n", " Args:\n", " dataset_id:\n", " The ID of the BigQuery dataset that contains the table.\n", " table_id: \n", " The ID of the BigQuery table.\n", " project_id: \n", " The ID of the Google Cloud project.\n", " tag_template_name: \n", " The name of the tag template.\n", "\n", " Returns:\n", " A string containing the tags for the table.\n", " \"\"\"\n", " # Lookup Data Catalog's Entry referring to the table.\n", " datacatalog_client = datacatalog_v1.DataCatalogClient()\n", " resource_name = (\n", " f\"//bigquery.googleapis.com/projects/{project_id}\"\n", " f\"/datasets/{dataset_id}/tables/{table_id}\"\n", " )\n", " table_entry = datacatalog_client.lookup_entry(\n", " request={\"linked_resource\": resource_name}\n", " )\n", "\n", " # Make the request\n", " page_result = datacatalog_client.list_tags(parent=table_entry.name)\n", " tags_str = ''\n", " # Handle the response\n", " for response in page_result:\n", " if response.template == tag_template_name:\n", " desc = response.fields[\"description\"].string_value\n", " data_type = response.fields[\"data_type\"].string_value\n", " pk = response.fields[\"is_primary_key\"].bool_value\n", " fk = response.fields[\"is_foreign_key\"].bool_value \n", " tags_str += (\"Full table name: {} \"\n", " \"- Column: {} \" \n", " \"- Data Type: {} \" \n", " \"- Primary Key: {} \" \n", " \"- Foreign Key: {} \" \n", " \"- Description: {}\\n\".format(\n", " f'`{project_id}.{dataset_id}.{table_id}`', response.column, data_type, pk, fk, desc))\n", "\n", " return tags_str\n", "\n", "def get_metadata_from_dataset(\n", " query: str,\n", " project_id: str,\n", " dataset_id: str, \n", " tag_template_name: str\n", "):\n", " \"\"\"Gets the metadata for all tables in a BigQuery dataset.\n", "\n", " Args:\n", " query: \n", " The BigQuery query to run to get the list of tables.\n", " project_id: \n", " The ID of the BigQuery project.\n", " dataset_id: \n", " The ID of the BigQuery dataset.\n", " tag_template_name: \n", " The name of the BigQuery tag template to use to get the table descriptions.\n", " state_key: \n", " The key to use to store the metadata in the Streamlit session state.\n", " \"\"\"\n", " query_job = bq_client.query(query) # API request\n", " rows = query_job.result()\n", " metadata = []\n", "\n", " for row in rows:\n", " full_table_path = f'`{project_id}.{dataset_id}.{row.table_name}`'\n", " table_metadata = f'[SCHEMA details for table {full_table_path}]\\n'\n", "\n", " table_metadata += get_tags_from_table(\n", " dataset_id, row.table_name, project_id, tag_template_name)\n", " metadata.append(table_metadata)\n", "\n", " return metadata" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The following functions create a prompt to submit to the model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def generate_prompt(\n", " question: str,\n", " metadata: list\n", "):\n", " \"\"\"Generates a prompt for a GoogleSQL query compatible with BigQuery.\n", "\n", " Args:\n", " question: \n", " The question to answer.\n", " metadata: \n", " A list of dictionaries, where each dictionary describes a BigQuery table. \n", " The dictionaries should have the following keys:\n", " - name: The name of the table.\n", " - schema: The schema of the table.\n", " - description: A description of the table.\n", " state_key: \n", " The key to use to store the prompt in the session state.\n", "\n", " Returns:\n", " The prompt.\n", " \"\"\"\n", " context = ''\n", " for i in metadata:\n", " context += i\n", "\n", " return f\"\"\"{prompt}\n", "{context}\n", "[Q]: {question}\n", "[SQL]: \n", "\"\"\"" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Questions" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Define your question using natural language." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "question = 'Retrieve top 10 customer emails ordered by transaction value'\n", "# question = 'What are the customer emails ordered by the sum of transactions value by customers in the city of Atlanta and the state of Georgia?'" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The next cell will:\n", " - Get the metadata from this dataset\n", " - Generate a prompt\n", " - Submit the prompt to PaLM API (Codey)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metadata = get_metadata_from_dataset(\n", " query=QUERY, \n", " project_id=PROJECT_ID, \n", " dataset_id=DATASET_ID,\n", " tag_template_name=TAG_TEMPLATE_NAME)\n", " \n", "prompt_metadata = generate_prompt(question, metadata)\n", " \n", "gen_code = client_code_model.predict(\n", " prompt = prompt_metadata,\n", " max_output_tokens = 1024,\n", " temperature=0.2)\n", "\n", "# Generated GoogleSQL code\n", "print(gen_code.text)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The next cell will use the generated GoogleSQL to query BigQuery and print the results." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Results\n", "try:\n", " result_query = bq_client.query(gen_code.text)\n", " result_query.result()\n", "except NotFound as nf:\n", " print('Dataset or table not found.')\n", " print('Review the table details and ask a different question.')\n", "except BadRequest as br:\n", " print('Bad query.')\n", " print('Review the table details and ask a different question.')\n", "else:\n", " print('Query executed successfully. Retrieving dataset.')\n", " final_result = result_query.to_dataframe()\n", " print('Resulting table (limited by 50 rows)')\n", " display(final_result.iloc[:50])" ] } ], "metadata": { "kernelspec": { "display_name": "llm-dev-py311", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }