notebooks/1_Setup_OpenDataQnA.ipynb

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "<div style=\"display: flex; align-items: left;\">\n", " <a href=\"https://sites.google.com/corp/google.com/genai-solutions/home?authuser=0\">\n", " <img src=\"../utilities/imgs/aaie.png\" style=\"margin-right\">\n", " </a>\n", "</div>" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "copyright" }, "outputs": [], "source": [ "# Copyright 2024 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "DRyGcAepAPJ5" }, "source": [ "\n", "<h1 align=\"center\">Open Data QnA - Chat with your SQL Database</h1> \n", "\n", "---\n", "\n", "This notebook first walks through the Vector Store setup needed for running the Open Data QnA application. \n", "\n", "Currently supported Source DBs are: \n", "- PostgreSQL on Google Cloud SQL \n", "- BigQuery\n", "\n", "Furthermore, the following vector stores are supported: \n", "- pgvector on PostgreSQL \n", "- BigQuery vector\n", "\n", "\n", "The setup part covers the following steps: \n", "> 1. Configuration: initial GCP project, IAM permissions, environment and database setup, including logging on BigQuery for analytics\n", "\n", "> 2. Creation of Table, Column and Known Good Query Embeddings in the Vector Store for Retrieval Augmented Generation (RAG)\n", "\n", "> 3. Setting up Firestore for persisting the session history for multi-turn conversations\n", "\n", "\n", "Afterwards, you will be able to run the Open Data QnA pipeline to generate SQL queries and answer questions over your data source. " ] }, { "cell_type": "markdown", "metadata": { "id": "jsWGZW_fUJjN" }, "source": [ "### 📒 Using this interactive notebook\n", "\n", "If you have not used this IDE with Jupyter notebooks before, it will ask you to install the Python and Jupyter extensions. Please go ahead and install them.\n", "\n", "Click the **run** icons ▶️ of each cell within this notebook.\n", "\n", "> 💡 Alternatively, you can run the currently selected cell with `Ctrl + Enter` (or `⌘ + Enter` on a Mac).\n", "\n", "> ⚠️ **To avoid any errors**, wait for each section to finish in order before clicking the next “run” icon.\n", "\n", "This sample must be connected to a **Google Cloud project**; nothing else is needed.\n", "\n", "You can use an existing project. Alternatively, you can create a new Cloud project [with cloud credits for free.](https://cloud.google.com/free/docs/gcp-free-tier)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **Change your Kernel to the .venv created with poetry from README.md**\n", "\n", "The path would look like e.g. 
/home/admin_/opendata/.venv/bin/python or ~cache/user/opendataqna/.venv/bin/python\n", "\n", "The kernel selector should look like the screenshot below before you proceed.\n", "\n", "\n", "![Kernel](../utilities/imgs/Kernel%20Changed.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set Python Module Path to Root" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "import sys\n", "\n", "module_path = os.path.abspath(os.path.join('..'))\n", "sys.path.append(module_path)" ] }, { "cell_type": "markdown", "metadata": { "id": "p4W6FPnrYEE8" }, "source": [ "### 🔗 **Connect Your Google Cloud Project**\n", "Time to connect your Google Cloud Project to this notebook. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Updated property [core/project].\n", "Project has been set to three-p-o\n" ] } ], "source": [ "#@markdown Please fill in the value below with your GCP project ID and then run the cell.\n", "PROJECT_ID = \"three-p-o\"\n", "\n", "# Quick input validations.\n", "assert PROJECT_ID, \"⚠️ Please provide your Google Cloud Project ID\"\n", "\n", "# Configure gcloud.\n", "!gcloud config set project {PROJECT_ID}\n", "print(f'Project has been set to {PROJECT_ID}')\n", "\n", "os.environ['GOOGLE_CLOUD_QUOTA_PROJECT']=PROJECT_ID\n", "os.environ['GOOGLE_CLOUD_PROJECT']=PROJECT_ID\n", "\n", "# If this errors out during authentication, restart the kernel and start again from the previous cell" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ⚙️ **Enable Required API Services in the GCP Project**" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Operation \"operations/acat.p2-978842762722-9cbec4fc-9e93-41cf-a354-4a2ff66161d4\" finished successfully.\n" ] } ], "source": [ "# Enable all the required APIs for the Open Data QnA solution\n", "\n", "!gcloud services enable \\\n", " cloudapis.googleapis.com \\\n", " compute.googleapis.com \\\n", " iam.googleapis.com \\\n", " run.googleapis.com \\\n", " sqladmin.googleapis.com \\\n", " aiplatform.googleapis.com \\\n", " bigquery.googleapis.com \\\n", " firestore.googleapis.com --project {PROJECT_ID}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Configure your inputs for the environments**\n", "\n", "This section assumes that a datasource is already set up in your GCP project. 
If a datasource has not been set up, use the notebooks below to copy a public dataset from BigQuery into Cloud SQL or BigQuery in your GCP project.\n", "\n", "\n", "Enabled Data Sources:\n", "* PostgreSQL on Google Cloud SQL (Copy Sample Data: [0_CopyDataToCloudSqlPG.ipynb](0_CopyDataToCloudSqlPG.ipynb))\n", "* BigQuery (Copy Sample Data: [0_CopyDataToBigQuery.ipynb](0_CopyDataToBigQuery.ipynb))\n", "\n", "Enabled Vector Stores:\n", "* pgvector on PostgreSQL \n", "* BigQuery vector\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 🤔 **Choose Data Source and Vector Store**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Sources to connect**\n", "- This solution lets you set up multiple data sources at the same time.\n", "- You can group tables from the same or different schemas/datasets into a user grouping and provide the details for each.\n", "- If your dataset/schema has many tables and you only want to run the solution against a few of them, create a grouping that contains just those tables.\n", "\n", "**Format for data_source_list.csv** (an example file is shown just before the configuration cell below)\n", "\n", "**source | user_grouping | schema | table**\n", "\n", "**source** - Supported data source. Options: bigquery, cloudsql-pg\n", "\n", "**user_grouping** - Logical grouping or use case name for tables from the same or different schema/dataset. When left blank, it defaults to the schema value in the next column\n", "\n", "**schema** - schema name for PostgreSQL or dataset name in BigQuery \n", "\n", "**table** - name of the table to run the solution against. Leave this column blank after filling in schema/dataset if you want to run the solution for the whole dataset/schema\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fill out the parameters and configuration settings below. \n", "These are the parameters for setting configurations for the vector store tables to be created. \n", "\n", "Additionally, you can specify whether you have and want to use known-good queries for the pipeline run and whether you want to enable logging.\n", "\n", "**Known good queries:** if you have known working user question <-> SQL query pairs, you can put them into the file `scripts/known_good_sql.csv`. This will be used as a caching layer and for in-context learning: if an exact match of the user question is found in the vector store, the pipeline will skip SQL generation and output the cached SQL query. If the similarity score is between 90-100%, the known good queries will be used as few-shot examples by the SQL Generator Agent. \n", "\n", "**Logging:** you can enable logging. If enabled, a dataset is created in BigQuery in your project, which stores a logging table with information from each pipeline run. This is especially helpful for debugging.\n", "\n", "**use_column_samples:** enable this flag to let the pipeline collect sample values of the columns from the source database. In use cases where the format or case sensitivity of values matters, these samples help the LLM understand the data better. Although this is a one-time setup, be aware that turning it on means sampling every column, which can be an expensive operation when there are many columns."
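] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example data_source_list.csv (illustrative)\n", "\n", "A minimal sketch of the file described above, assuming it carries a header row with the listed column names; the grouping, dataset and table names are placeholders, so replace them with your own:\n", "\n", "```csv\n", "source,user_grouping,schema,table\n", "bigquery,sales-analytics,sales_dataset,orders\n", "bigquery,sales-analytics,sales_dataset,customers\n", "cloudsql-pg,hr-demo,public,\n", "```\n", "\n", "The last row leaves the table column blank, so the whole public schema is included under the hr-demo grouping."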
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#[CONFIG]\n", "embedding_model = 'vertex' # Options: 'vertex' or 'vertex-lang'\n", "description_model = 'gemini-1.5-pro' # 'gemini-1.0-pro', 'gemini-1.5-pro', 'text-bison-32k', 'gemini-1.5-flash'\n", "vector_store = 'bigquery-vector' # Options: 'bigquery-vector', 'cloudsql-pgvector'\n", "logging = True # True or False \n", "kgq_examples = True # True or False\n", "use_session_history = True # True or False\n", "use_column_samples = True # True or False\n", "\n", "#[GCP]\n", "project_id = PROJECT_ID\n", "\n", "#[PGCLOUDSQL]\n", "# Default values for pgvector setup, change only if needed\n", "pg_region = 'us-central1'\n", "pg_instance = 'pg15-opendataqna'\n", "pg_database = 'opendataqna-db'\n", "pg_user = 'pguser'\n", "pg_password = 'pg123'\n", "\n", "#[BIGQUERY]\n", "# Name for the BQ dataset created for bigquery-vector and/or logging. Change names only if needed.\n", "bq_dataset_region = 'us-central1'\n", "bq_opendataqna_dataset_name = 'opendataqna'\n", "bq_log_table_name = 'audit_log_table'\n", "\n", "# Details for Firestore to store the chat session history\n", "firestore_region='us-central1'\n", "## firestore_database is named 'opendataqna-session-logs' (this is designed to not be customizable because the setup creates composite indexes from the backend)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Quick input verifications below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "# Input verification - Vector Store\n", "assert vector_store in {'bigquery-vector', 'cloudsql-pgvector'}, \"⚠️ Invalid VECTOR_STORE. 
Must be 'bigquery-vector' or 'cloudsql-pgvector'\"\n", "\n", "# Input verification - Firestore Region\n", "assert firestore_region, \"⚠️ Provide firestore region name\"\n", "\n", "if logging: \n", " assert bq_log_table_name, \"⚠️ Please provide a name for your log table if you want to use logging\"\n", "\n", "if vector_store == 'bigquery-vector':\n", " assert bq_dataset_region, \"⚠️ Please provide the Data Set Region\"\n", " assert bq_opendataqna_dataset_name, \"⚠️ Please provide the name of the logging/vector store dataset on BigQuery\"\n", "\n", "elif vector_store == 'cloudsql-pgvector':\n", " assert pg_region, \"⚠️ Please provide Region of the Cloud SQL Instance\"\n", " assert pg_instance, \"⚠️ Please provide the name of the Cloud SQL Instance\"\n", " assert pg_database, \"⚠️ Please provide the name of the PostgreSQL Database on the Cloud SQL Instance\"\n", " assert pg_user, \"⚠️ Please provide a username for the Cloud SQL Instance\"\n", " assert pg_password, \"⚠️ Please provide the Password for the PG_USER\"\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 💾 **Save Configuration to File** \n", "Save the configurations set in this notebook to `config.ini`. The parameters from this file are used in the notebooks and in various modules in the repo." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scripts import save_config\n", "\n", "save_config(embedding_model, description_model, vector_store, logging, kgq_examples, use_column_samples, PROJECT_ID,\n", " pg_region, pg_instance, pg_database, pg_user, pg_password, \n", " bq_dataset_region, bq_opendataqna_dataset_name, bq_log_table_name, firestore_region)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# **1. Vector Store Setup** (Run once)\n", "---\n", "\n", "This section walks through the Vector Store setup needed for running the Open Data QnA application. \n", "\n", "It covers the following steps: \n", "> 1. Configuration: environment and database setup, including logging on BigQuery for analytics\n", "\n", "> 2. Creation of Table, Column and Known Good Query Embeddings in the Vector Store for Retrieval Augmented Generation (RAG)\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ⚙️ **1.1 Database Setup for Vector Store**\n", "\n", "If 'bigquery-vector' is chosen as the vector store, the required dataset is created in BigQuery.\n", "\n", "If 'cloudsql-pgvector' is chosen as the vector store, pgvector is set up on the PostgreSQL instance on Cloud SQL (note that this version of the code only supports creating the vector store on the same instance as the source).\n", "\n", "The cell also creates a BigQuery dataset for the log table **if** logging is enabled and BigQuery is not already the vector store."
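] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*(Optional, illustration only)* For the bigquery-vector option, the database setup described above essentially comes down to creating a BigQuery dataset in the configured region. The next cell is a rough sketch of that step using the BigQuery Python client; it is not the solution's own code, and you can skip straight to `create_vector_store()` below, which performs the real setup." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustration only: a rough equivalent of the dataset creation that the\n", "# bigquery-vector setup performs. create_vector_store() in the next cell\n", "# does the actual work; this sketch just shows the underlying idea.\n", "from google.cloud import bigquery\n", "\n", "bq_client = bigquery.Client(project=project_id)\n", "dataset = bigquery.Dataset(f\"{project_id}.{bq_opendataqna_dataset_name}\")\n", "dataset.location = bq_dataset_region\n", "# exists_ok=True makes this a no-op if the dataset already exists\n", "bq_client.create_dataset(dataset, exists_ok=True)\n", "print([d.dataset_id for d in bq_client.list_datasets()])"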
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from env_setup import create_vector_store\n", "# Set up the vector store for embeddings\n", "create_vector_store() \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **1.2 Create Embeddings in Vector Store for RAG** " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 🖋️ **Create Table and Column Embeddings**\n", "\n", "In this step, table and column metadata is retrieved from the data source and embeddings are generated for both." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from env_setup import get_embeddings\n", "\n", "# Generate embeddings for tables and columns\n", "table_schema_embeddings, col_schema_embeddings = get_embeddings()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 💾 **Save the Table and Column Embeddings in the Vector Store**\n", "The table and column embeddings created in the step above are saved to the chosen Vector Store." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from env_setup import store_embeddings\n", "\n", "# Store table/column embeddings (asynchronous)\n", "await store_embeddings(table_schema_embeddings, col_schema_embeddings)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# **2. Firestore Database Setup**\n", "---\n", "\n", "This section walks through setting up the Firestore database that stores the conversation session history for multi-turn questions.\n", "\n", "It covers the following steps: \n", "> 1. Creation of the Firestore Database\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from env_setup import create_firestore_db\n", "\n", "create_firestore_db(firestore_region)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 🗄️ **3. Load Known Good SQL into Vector Store**\n", "Known Good Queries are used to build the query cache and provide few-shot examples. Creating a query cache is highly recommended for best outcomes! \n", "\n", "The following cell loads the Natural Language Question and Known Good SQL pairs into the Vector Store. These pairs are loaded from the `known_good_sql.csv` file inside the scripts folder. If you have your own Question-SQL examples, curate them in that .csv file before running the cell below. \n", "\n", "If no Known Good Queries are available at this time, you can use [3_LoadKnownGoodSQL.ipynb](3_LoadKnownGoodSQL.ipynb) to load them and create the query cache later. \n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Format of the Known Good SQL File (known_good_sql.csv)\n", "\n", "prompt | sql | user_grouping [3 columns]\n", "\n", "prompt ==> User question \n", "\n", "sql ==> SQL for the user question (note that the SQL should be enclosed in quotes and written on a single line; please remove any line breaks)\n", "\n", "user_grouping ==> This name should exactly match the grouping name you specified while creating the vector store. An illustrative example follows below." ] }
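, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example rows for known_good_sql.csv (illustrative)\n", "\n", "A minimal sketch assuming the file carries a header row with the three column names; the question, SQL and grouping below are placeholders, so substitute pairs that match your own data and the user_grouping you configured:\n", "\n", "```csv\n", "prompt,sql,user_grouping\n", "How many orders were placed in 2023?,\"SELECT COUNT(*) FROM sales_dataset.orders WHERE EXTRACT(YEAR FROM order_date) = 2023\",sales-analytics\n", "```\n", "\n", "The SQL stays on a single line and is wrapped in quotes so that any commas inside it do not break the CSV." ] }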
, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from env_setup import create_kgq_sql_table, store_kgq_sql_embeddings\n", "\n", "# Create table for known good queries (if enabled)\n", "await create_kgq_sql_table() \n", "\n", "# Store known good query embeddings (if enabled)\n", "await store_kgq_sql_embeddings() \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 🥁 If all the above steps are executed successfully, the following should be set up:\n", "\n", "* GCP project and all the required IAM permissions\n", "\n", "* Environment to run the solution\n", "\n", "* Data source and Vector store for the solution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__________________________________________________________________________________________________________________" ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.9" } }, "nbformat": 4, "nbformat_minor": 0 }