{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div style=\"display: flex; align-items: left;\">\n",
" <a href=\"https://sites.google.com/corp/google.com/genai-solutions/home?authuser=0\">\n",
" <img src=\"../utilities/imgs/aaie.png\" style=\"margin-right\">\n",
" </a>\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "copyright"
},
"outputs": [],
"source": [
"# Copyright 2024 Google LLC\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "DRyGcAepAPJ5"
},
"source": [
"\n",
"<h1 align=\"center\">Open Data QnA - Chat with your SQL Database</h1> \n",
"\n",
"\n",
"The pipeline run covers the following steps: \n",
"\n",
"> 1. Take user question and generate sql in the dialect corresponding to data source\n",
"\n",
"> 2. Execute the sql query and retreive the data\n",
"\n",
"> 3. Generate natural language respose and charts to display\n",
"\n",
"> 4. Clean Up resources\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jsWGZW_fUJjN"
},
"source": [
"### 📒 Using this interactive notebook\n",
"\n",
"If you have not used this IDE with jupyter notebooks it will ask for installing Python + Jupyter extensions. Please go ahead install them\n",
"\n",
"Click the **run** icons ▶️ of each cell within this notebook.\n",
"\n",
"> 💡 Alternatively, you can run the currently selected cell with `Ctrl + Enter` (or `⌘ + Enter` on a Mac).\n",
"\n",
"> ⚠️ **To avoid any errors**, wait for each section to finish in their order before clicking the next “run” icon.\n",
"\n",
"This sample must be connected to a **Google Cloud project**, but nothing else is needed other than your Google Cloud project.\n",
"\n",
"You can use an existing project. Alternatively, you can create a new Cloud project [with cloud credits for free.](https://cloud.google.com/free/docs/gcp-free-tier)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "RicDCkdI-hmp"
},
"source": [
"# 🚧 **0. Pre-requisites**\n",
"\n",
"Make sure that you have completed the intial setup process using [1_Setup_OpenDataQnA.ipynb](1_Setup_OpenDataQnA.ipynb). If the 1_Setup_OpenDataQnA notebook has been run successfully, the following are set up:\n",
"* GCP project and all the required IAM permissions\n",
"\n",
"* Data source and Vector store for the solution\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### **Change your Kernel to the created .venv with poetry from README.md**\n",
"\n",
"Path would look like e.g. /home/admin_/opendata/.venv/bin/python or ~cache/user/opendataqna/.venv/bin/python\n",
"\n",
"Below is the Kernel how it should look like before you proceed.\n",
"\n",
"\n",
""
]
},
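{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want a quick check, the cell below (a minimal sketch) prints the interpreter path the current kernel is using; it should end in `.venv/bin/python`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: confirm the active kernel is the poetry-created .venv interpreter\n",
"import sys\n",
"print(sys.executable)"
]
},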
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ⚙️ **1. Retreive Configuration Parameters**\n",
"The notebook will load all the configuration parameters from the `config.ini` file in the root directory. \n",
"Most of these parameters were set in the initial notebook `1_Setup_OpenDataQnA.ipynb` and save to the 'config.ini file.\n",
"Use the below cells to retrieve these values and specify additional ones required for this notebook. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import sys\n",
"module_path = os.path.abspath(os.path.join('..'))\n",
"sys.path.append(module_path)"
]
},
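{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to inspect the stored configuration directly, the sketch below reads `config.ini` from the repository root with Python's built-in `configparser`. Section and key names vary by setup, so the cell simply prints whatever is there."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A minimal sketch: dump the contents of config.ini (module_path is the repo root set above)\n",
"import configparser\n",
"import os\n",
"\n",
"config = configparser.ConfigParser()\n",
"config.read(os.path.join(module_path, 'config.ini'))\n",
"for section in config.sections():\n",
"    print(f'[{section}]')\n",
"    for key, value in config[section].items():\n",
"        print(f'{key} = {value}')"
]
},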
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Read your `PROJECT_ID` from the config.ini file, or set it manually below: "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from utilities import PROJECT_ID\n",
"\n",
"#Only set the variable if you don't want the project id from config.ini to use\n",
"#PROJECT_ID = ''"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 🔐 **2. Authenticate and Connect to Google Cloud Project**\n",
"Authenticate to Google Cloud as the IAM user logged into this notebook in order to access your Google Cloud Project.\n",
"\n",
"You can do this within Google Colab or using the Application Default Credentials in the Google Cloud CLI."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"Colab Auth\"\"\" \n",
"# from google.colab import auth\n",
"# auth.authenticate_user()\n",
"\n",
"\n",
"\"\"\"Google CLI Auth\"\"\"\n",
"# !gcloud auth application-default login\n",
"\n",
"\n",
"import google.auth\n",
"credentials, project_id = google.auth.default()\n",
"\n",
"# Configure gcloud.\n",
"!gcloud config set project {PROJECT_ID}\n",
"print(f'Project has been set to {PROJECT_ID}')\n",
"# !gcloud auth application-default set-quota-project {PROJECT_ID}\n",
"\n",
"import os\n",
"os.environ['GOOGLE_CLOUD_QUOTA_PROJECT']=PROJECT_ID\n",
"os.environ['GOOGLE_CLOUD_PROJECT']=PROJECT_ID"
]
},
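{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, verify which account is authenticated and which project is active before proceeding:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: list authenticated accounts and the active gcloud project\n",
"!gcloud auth list\n",
"!gcloud config get-value project"
]
},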
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ▶️ **3. Run the Open Data QnA Pipeline**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 🏃 **Run the Pipeline**\n",
"\n",
"The next cell executes the pipeline for answering natural language questions over structured data.\n",
"\n",
"The pipeline performs the following steps:\n",
"\n",
"1. **Agent Loading:** Initializes various agents for embedding questions, building SQL queries, validating SQL, debugging SQL, and generating responses.\n",
"2. **Data Source and Vector Store Configuration:** Sets the data source (BigQuery or PostgreSQL) and vector store (BigQuery or PostgreSQL) based on provided parameters and input files.\n",
"3. **Exact Match Search:** Attempts to find an exact match for the user's question in a knowledge graph cache (if enabled). If found, the cached SQL query is used.\n",
"4. **Similar Match and Schema Retrieval:** If no exact match is found, retrieves similar questions and associated SQL queries from the knowledge graph (if enabled). Also retrieves relevant table and column schemas based on similarity to the question.\n",
"5. **SQL Generation and Debugging:** Builds an initial SQL query using the retrieved information. If enabled, the debugger iteratively refines the query with potential validation and error feedback.\n",
"6. **SQL Execution and Response Generation:** Executes the final SQL query (if enabled) against the data source and retrieves results. A response agent then generates a natural language answer based on the results.\n",
"7. **Auditing:** Records the entire pipeline process, including generated SQL, responses, and potential errors, for later analysis.\n",
"\n",
"Args:\n",
"\n",
"* **session_id (str)** Session ID to identify the conversation\n",
"\n",
"* **user_question (str):** The natural language question to answer.\n",
"\n",
"* **user_grouping (str):** Based on what the grouping has been set across the table during setup same will be use to filter or support multiple approaches. Check you data_source_list.csv for the group name. If you didn't specify the group name it will default to `schema_name-source` format\n",
"\n",
"* **RUN_DEBUGGER (bool, optional):** Whether to run the SQL debugger. Defaults to True.\n",
"\n",
"* **EXECUTE_FINAL_SQL (bool, optional):** Whether to execute the final SQL query. Defaults to True.\n",
"\n",
"* **DEBUGGING_ROUNDS (int, optional):** The number of debugging rounds. Defaults to 2.\n",
"\n",
"* **LLM_VALIDATION (bool, optional):** Whether to use LLM for SQL validation during debugging. Defaults to True.\n",
"\n",
"* **Embedder_model (str, optional):** The name of the embedding model. Defaults to 'vertex'.\n",
"\n",
"* **SQLBuilder_model (str, optional):** The name of the SQL building model. Defaults to 'gemini-1.0-pro'.\n",
"\n",
"* **SQLChecker_model (str, optional):** The name of the SQL validation model. Defaults to 'gemini-1.0-pro'.\n",
"\n",
"* **SQLDebugger_model (str, optional):** The name of the SQL debugging model. Defaults to 'gemini-1.0-pro'.\n",
"\n",
"* **Responder_model (str, optional):** The name of the response generation model. Defaults to 'gemini-1.0-pro'.\n",
"\n",
"* **num_table_matches (int, optional):** The number of similar tables to retrieve. Defaults to 5.\n",
"\n",
"* **num_column_matches (int, optional):** The number of similar columns to retrieve. Defaults to 10.\n",
"\n",
"* **table_similarity_threshold (float, optional):** The similarity threshold for tables. Defaults to 0.3.\n",
"\n",
"* **column_similarity_threshold (float, optional):** The similarity threshold for columns. Defaults to 0.3.\n",
"\n",
"* **example_similarity_threshold (float, optional):** The similarity threshold for example questions. Defaults to 0.3.\n",
"\n",
"* **num_sql_matches (int, optional):** The number of similar SQL queries to retrieve. Defaults to 3.\n",
"\n",
"\n",
"Returns:\n",
"- **final_sql (str):** The final generated SQL query.\n",
"- **response (pandas.DataFrame or str):** The result of executing the SQL query or an error message.\n",
"- **_resp (str):** The final response generated by the response agent."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## DO NOT RE-RUN THE CELL UNTIL YOU WANT TO START NEW CONVERSATION"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"\n",
"import uuid\n",
"\n",
"session_id=str(uuid.uuid1())\n",
"\n",
"#Keep the session id to test for multi turn, change it for number conversation\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### ❓ **Ask your Natural Language Question**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### CHANGE THE QUESTION FOR EACH FOLLOW UP. IF NEW CONVERSATION RE-RUN ABOVE CELL AND THEN CHANGE THE QUESTION"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"Session ID ::: \" + str(session_id))\n",
"\n",
"#MovieExplorer-bigquery\n",
"user_question = \"What are the top 5 most common genre's in the dataset?\"\n",
"\n",
"#follow up-1\n",
"# user_question= 'Can you only do this for just movies'\n",
"\n",
"print(\"User Question:- \"+user_question)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from opendataqna import run_pipeline\n",
"\n",
"final_sql, response, _resp = await run_pipeline(session_id,\n",
" user_question,\n",
" user_grouping = 'MovieExplorer-bigquery',\n",
" RUN_DEBUGGER=True,\n",
" EXECUTE_FINAL_SQL=True,\n",
" DEBUGGING_ROUNDS = 2, \n",
" LLM_VALIDATION=False,\n",
" Embedder_model='vertex', # Options: 'vertex' or 'vertex-lang'\n",
" SQLBuilder_model= 'gemini-1.5-pro',\n",
" SQLChecker_model= 'gemini-1.5-pro',\n",
" SQLDebugger_model= 'gemini-1.5-pro',\n",
" Responder_model= 'gemini-1.5-pro',\n",
" num_table_matches = 5,\n",
" num_column_matches = 10,\n",
" table_similarity_threshold = 0.1,\n",
" column_similarity_threshold = 0.1, \n",
" example_similarity_threshold = 0.1, \n",
" num_sql_matches=3)\n",
"\n",
" \n",
"\n",
"print(\"*\"*50 +\"\\nGenerated SQL\\n\"+\"*\"*50+\"\\n\"+final_sql)\n",
"print(\"\\n\"+\"*\"*50 +\"\\nResults\\n\"+\"*\"*50)\n",
"display(response)\n",
"print(\"*\"*50 +\"\\nNatural Response\\n\"+\"*\"*50+\"\\n\"+_resp)"
]
},
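{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since `response` is either a `pandas.DataFrame` or an error string (see the return values above), you can optionally persist tabular results. A minimal sketch, with an illustrative filename:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: save tabular results locally; the filename below is illustrative\n",
"import pandas as pd\n",
"\n",
"if isinstance(response, pd.DataFrame):\n",
"    response.to_csv('pipeline_results.csv', index=False)\n",
"    print('Saved results to pipeline_results.csv')\n",
"else:\n",
"    print('No tabular results to save: ' + str(response))"
]
},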
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 📊 **Create Charts for the results** (Run only when you have proper results in the above cells)\n",
"Agent provides two suggestive google charts to display on a UI with element IDs chart_div and chart_div_1"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from opendataqna import visualize\n",
"\n",
"chart_js=''\n",
"chart_js,invalid_response = visualize(session_id,user_question,final_sql,response) #sending \n",
"# print(chart_js[\"chart_div_1\"])\n",
"if not invalid_response:\n",
" print(\"Chart Code Generated\")\n",
"else:\n",
" print(\"Error in visualization code generation: \"+ chart_js)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from IPython.display import HTML\n",
"\n",
"html_code = f'''\n",
"<script type=\"text/javascript\" src=\"https://www.gstatic.com/charts/loader.js\"></script>\n",
"<script type=\"text/javascript\">\n",
"{chart_js[\"chart_div\"]}\n",
"</script>\n",
"<div id=\"chart_div\"></div>\n",
"'''\n",
"\n",
"HTML(html_code)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"html_code = f'''\n",
"<script type=\"text/javascript\" src=\"https://www.gstatic.com/charts/loader.js\"></script>\n",
"<script type=\"text/javascript\">\n",
"{chart_js[\"chart_div_1\"]}\n",
"</script>\n",
"<div id=\"chart_div_1\"></div>\n",
"'''\n",
"\n",
"HTML(html_code)"
]
},
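{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also write a rendered chart to a standalone HTML file and open it outside the notebook. A minimal sketch, with an illustrative filename:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: save the last chart's HTML (chart_div_1) to a standalone file; the filename is illustrative\n",
"with open('chart_div_1.html', 'w') as f:\n",
"    f.write(html_code)\n",
"print('Wrote chart_div_1.html')"
]
},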
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 🗑 **Clean Up Notebook Resources**\n",
"Make sure to delete your Cloud SQL instance and BigQuery Datasets when your are finished with this notebook to avoid further costs. 💸 💰\n",
"\n",
"Uncomment and run the cell below to delete "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# # delete Cloud SQL instance\n",
"# !gcloud sql instances delete {PG_INSTANCE} -q\n",
"\n",
"# #delete BigQuery Dataset using bq utility\n",
"# !bq rm -r -f -d {BQ_DATASET_NAME}\n",
"\n",
"# #delete BigQuery 'Open Data QnA' Vector Store Dataset using bq utility\n",
"\n",
"# !bq rm -r -f -d {BQ_OPENDATAQNA_DATASET_NAME}\n",
"\n"
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 0
}