colab-enterprise/gen-ai-demo/Sample-Synthetic-Data-Generation-GenAI.ipynb

{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "clfviNVNXpqj" }, "source": [ "## <img src=\"https://lh3.googleusercontent.com/mUTbNK32c_DTSNrhqETT5aQJYFKok2HB1G2nk2MZHvG5bSs0v_lmDm_ArW7rgd6SDGHXo0Ak2uFFU96X6Xd0GQ=w160-h128\" width=\"45\" valign=\"top\" alt=\"BigQuery\"> Sample Synthetic Data Generation using GenAI" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### License" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "##################################################################################\n", "# Copyright 2024 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "# \n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "# \n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License.\n", "###################################################################################" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Notebook Overview" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- WARNING!!! This notebook was used for the original data generation for the demo. The notebook works, but there are newer notebooks in the project. This notebook has been kept for reference.\n", "\n", "- Use LLMs for code generation. You can start with a table schema or even a picture of your ERD (see Menu-Synthetic-Data-Generation-GenAI). \n", " 1. Create your table DDLs\n", " 2. Create LLM prompts for each table and ask it to populate the table with data\n", " 3. Provide the prompts with starting primary keys\n", " 4. 
Provide the prompts with foreign keys\n", "  5. The LLM can understand that it should generate 3 records for menu items (small, medium, large) with pricing set accordingly.\n", "  6. The LLM can read the description of each field and use that to generate valid values" ] }, { "cell_type": "markdown", "metadata": { "id": "DszuLZoo9A7k" }, "source": [ "### Initialize Python" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bhKxJadjWa1R" }, "outputs": [], "source": [ "import pandas as pd\n", "import json\n", "import bigframes.pandas as bf\n", "#from bigframesllm import BigFramesLLM\n", "from bigframes.ml.llm import GeminiTextGenerator" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "MSC6-rboip3h" }, "outputs": [], "source": [ "from google.cloud import bigquery\n", "client = bigquery.Client()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "lpU1P_fAXviJ" }, "outputs": [], "source": [ "PROJECT_ID = \"${project_id}\"\n", "REGION = \"us\"\n", "DATASET_ID = \"data_beans_synthetic_data\" # use a different dataset so we do not overwrite our demo data\n", "CONNECTION_NAME = \"vertex-ai\"\n", "\n", "connection = f\"{PROJECT_ID}.{REGION}.{CONNECTION_NAME}\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "qMijiOqUZmAp" }, "outputs": [], "source": [ "# bf.reset_session() # if you need to change the region\n", "bf.options.bigquery.project = PROJECT_ID\n", "bf.options.bigquery.location = REGION" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "zy7lRW09Ws1J" }, "outputs": [], "source": [ "session = bf.get_global_session()\n", "\n", "gemini_model = GeminiTextGenerator(session=session, connection_name=connection)" ] }, { "cell_type": "markdown", "metadata": { "id": "6YeExbVqf4ZE" }, "source": [ "### Supporting Functions" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "QtSLYNa_exfc" }, "outputs": [], "source": [ "def 
PrettyPrintJson(json_string):\n", " json_object = json.loads(json_string)\n", " json_formatted_str = json.dumps(json_object, indent=2)\n", " print(json_formatted_str)\n", " return json.dumps(json_object)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "sthZI1h2fe8H" }, "outputs": [], "source": [ "def LLM(prompt, isOutputJson, max_output_tokens=1024, temperature=0, top_p=0, top_k=1):\n", " print()\n", " print(\"Prompt: \", prompt)\n", " print()\n", " df_prompt = pd.DataFrame(\n", " {\n", " \"prompt\": [prompt],\n", " })\n", " bf_df_prompt = bf.read_pandas(df_prompt)\n", " prediction = gemini_model.predict(bf_df_prompt,\n", " max_output_tokens=max_output_tokens,\n", " temperature=temperature, # 0 to 1 (1 random)\n", " top_p=top_p, # 0 to 1 (1 random)\n", " top_k=top_k, # (1 to 40 random)\n", " ).to_pandas()\n", " try:\n", " # Remove common LLM output mistakes\n", " result = prediction['ml_generate_text_llm_result'][0]\n", "\n", " result = result.replace(\"```json\\n\",\"\")\n", " result = result.replace(\"```JSON\\n\",\"\")\n", " result = result.replace(\"```json\",\"\")\n", " result = result.replace(\"```JSON\",\"\")\n", " result = result.replace(\"```sql\\n\",\"\")\n", " result = result.replace(\"```SQL\\n\",\"\")\n", " result = result.replace(\"```sql\",\"\")\n", " result = result.replace(\"```SQL\",\"\")\n", " result = result.replace(\"```\",\"\")\n", "\n", " if isOutputJson:\n", " result = result.replace(\"\\n\",\" \")\n", " json_string = PrettyPrintJson(result)\n", " json_string = json_string.replace(\"'\",\"\\\\'\")\n", " json_string = json_string.strip()\n", " return json_string\n", " else:\n", " if \"INSERT INTO\" in result:\n", " # do nothing (do not escape the single ticks, the LLM should do this\n", " # automatically for any text fields)\n", " print(\"Do nothing\")\n", " else:\n", " result = result.replace(\"'\",\"\\\\'\")\n", " result = result.strip()\n", " return result\n", "\n", " except:\n", " print(\"Error (raw): \", 
prediction['ml_generate_text_llm_result'][0])\n", "    print(\"Error (result): \", result)\n", "    raise # re-raise so the callers' retry loops can detect the failure and retry\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_8p8PkhFibYw" }, "outputs": [], "source": [ "def GetTableSchema(dataset_name, table_name):\n", "  import io\n", "\n", "  dataset_ref = client.dataset(dataset_name, project=PROJECT_ID)\n", "  table_ref = dataset_ref.table(table_name)\n", "  table = client.get_table(table_ref)\n", "\n", "  f = io.StringIO(\"\")\n", "  client.schema_to_json(table.schema, f)\n", "  return f.getvalue()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "HgMmeQSWvq6k" }, "outputs": [], "source": [ "def GetForeignKeys(dataset_name, table_name, field_name):\n", "  sql = f\"\"\"\n", "  SELECT STRING_AGG(CAST({field_name} AS STRING), \",\" ORDER BY {field_name}) AS result\n", "    FROM `{PROJECT_ID}.{dataset_name}.{table_name}`\n", "   WHERE {field_name} <= 10 -- demo hack since the database is fully populated \n", "  \"\"\"\n", "  #print(sql)\n", "  df_result = client.query(sql).to_dataframe()\n", "  #display(df_result)\n", "  return df_result['result'].iloc[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "esIZXzKjsKIF" }, "outputs": [], "source": [ "def GetDistinctValues(dataset_name, table_name, field_name):\n", "  sql = f\"\"\"\n", "  SELECT STRING_AGG(DISTINCT {field_name}, \",\" ) AS result\n", "    FROM `{PROJECT_ID}.{dataset_name}.{table_name}`\n", "  \"\"\"\n", "  #print(sql)\n", "  df_result = client.query(sql).to_dataframe()\n", "  #display(df_result)\n", "  return df_result['result'].iloc[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Tto7sIdJ5vYi" }, "outputs": [], "source": [ "def GetStartingValue(dataset_name, table_name, field_name):\n", "  sql = f\"\"\"\n", "  SELECT IFNULL(MAX({field_name}),0) + 1 AS result\n", "    FROM `{PROJECT_ID}.{dataset_name}.{table_name}`\n", "  \"\"\"\n", "  #print(sql)\n", "  df_result = client.query(sql).to_dataframe()\n", "  
#display(df_result)\n", " return df_result['result'].iloc[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7sK-fbCaVMCZ" }, "outputs": [], "source": [ "def GetMaximumValue(dataset_name, table_name, field_name):\n", " sql = f\"\"\"\n", " SELECT IFNULL(MAX({field_name}),0) AS result\n", " FROM `{PROJECT_ID}.{dataset_name}.{table_name}`\n", " \"\"\"\n", " #print(sql)\n", " df_result = client.query(sql).to_dataframe()\n", " #display(df_result)\n", " return df_result['result'].iloc[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "DFqXl0cefaOe" }, "outputs": [], "source": [ "def RunQuery(sql):\n", " import time\n", "\n", " #return True # return early for now\n", "\n", " job_config = bigquery.QueryJobConfig(priority=bigquery.QueryPriority.INTERACTIVE)\n", " query_job = client.query(sql, job_config=job_config)\n", "\n", " # Check on the progress by getting the job's updated state.\n", " query_job = client.get_job(\n", " query_job.job_id, location=query_job.location\n", " )\n", " print(\"Job {} is currently in state {} with error result of {}\".format(query_job.job_id, query_job.state, query_job.error_result))\n", "\n", " while query_job.state != \"DONE\":\n", " time.sleep(2)\n", " query_job = client.get_job(\n", " query_job.job_id, location=query_job.location\n", " )\n", " print(\"Job {} is currently in state {} with error result of {}\".format(query_job.job_id, query_job.state, query_job.error_result))\n", "\n", " if query_job.error_result == None:\n", " return True\n", " else:\n", " return False" ] }, { "cell_type": "markdown", "metadata": { "id": "CUPdwNf_89Gq" }, "source": [ "### Create Tables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Here we are starting with our schema DDL and using our description to let the LLM know valid values.\n", "- Check out the Menu-Synthetic-Data-Generation-GenAI notebook to see how we can start with an ERD (image) and let the LLM read the image and create our 
schema based upon it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bigquery\n", "CREATE SCHEMA IF NOT EXISTS `${project_id}.data_beans_synthetic_data`;" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "V1DRyogrX0Yz" }, "outputs": [], "source": [ "sql = f\"\"\"\n", "CREATE TABLE IF NOT EXISTS `{PROJECT_ID}.{DATASET_ID}.company`\n", "(\n", " company_id INTEGER NOT NULL OPTIONS(description=\"Primary key.\"),\n", " company_name STRING NOT NULL OPTIONS(description=\"Name of the company.\"),\n", ")\n", "CLUSTER BY company_id;\n", "\"\"\"\n", "\n", "RunQuery(sql)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "mV5dmUl2X1S6" }, "outputs": [], "source": [ "sql = f\"\"\"\n", "CREATE TABLE IF NOT EXISTS `{PROJECT_ID}.{DATASET_ID}.location`\n", "(\n", " location_id INTEGER NOT NULL OPTIONS(description=\"Primary key.\"),\n", " company_id INTEGER NOT NULL OPTIONS(description=\"Foreign key: A valid value from the company table.\"),\n", " location_name STRING NOT NULL OPTIONS(description=\"Name of the location.\"),\n", " location_type STRING NOT NULL OPTIONS(description=\"Valid values for location_type are: 'Fixed', 'Mobile'\"),\n", ")\n", "CLUSTER BY location_id;\n", "\"\"\"\n", "\n", "RunQuery(sql)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "nvfwRUU2YTXC" }, "outputs": [], "source": [ "sql = f\"\"\"\n", "CREATE TABLE IF NOT EXISTS `{PROJECT_ID}.{DATASET_ID}.customer`\n", "(\n", " customer_id INTEGER NOT NULL OPTIONS(description=\"Primary key.\"),\n", " company_id INTEGER NOT NULL OPTIONS(description=\"Foreign key: A valid value from the company table.\"),\n", " customer_name STRING NOT NULL OPTIONS(description=\"Name of the customer.\"),\n", " customer_yob INT NOT NULL OPTIONS(description=\"Customer year of birth\"),\n", " customer_email STRING NOT NULL OPTIONS(description=\"Customer's email address\"),\n", " customer_inception_date DATE NOT 
NULL OPTIONS(description=\"Date of first customer interaction\")\n", "\n", ")\n", "CLUSTER BY customer_id;\n", "\"\"\"\n", "\n", "RunQuery(sql)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "nK3j19oMEpGn" }, "outputs": [], "source": [ "sql = f\"\"\"\n", "CREATE TABLE IF NOT EXISTS `{PROJECT_ID}.{DATASET_ID}.customer_profile`\n", "(\n", "    customer_id INTEGER NOT NULL OPTIONS(description=\"Primary key. Foreign key: Customer table.\"),\n", "    customer_llm_summary STRING NOT NULL OPTIONS(description=\"LLM generated summary of customer data.\"),\n", "    customer_lifetime_value STRING NOT NULL OPTIONS(description=\"Total sales for this customer.\"),\n", "    customer_cluster_id INT NOT NULL OPTIONS(description=\"Clustering algorithm id.\"),\n", "    customer_review_llm_summary STRING NOT NULL OPTIONS(description=\"LLM summary of all of the customer reviews.\"),\n", "    customer_survey_llm_summary STRING NOT NULL OPTIONS(description=\"LLM summary of all of the customer surveys.\")\n", "\n", ")\n", "CLUSTER BY customer_id;\n", "\"\"\"\n", "\n", "RunQuery(sql)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "hEdN9VbhFRnT" }, "outputs": [], "source": [ "sql = f\"\"\"\n", "CREATE TABLE IF NOT EXISTS `{PROJECT_ID}.{DATASET_ID}.customer_survey`\n", "(\n", "    customer_survey_id INTEGER NOT NULL OPTIONS(description=\"Primary key.\"),\n", "    customer_id INTEGER NOT NULL OPTIONS(description=\"Foreign key: Customer table\"),\n", "    survey_date DATE NOT NULL OPTIONS(description=\"The date of the survey\"),\n", "    question_1 STRING NOT NULL OPTIONS(description=\"A survey question.\"),\n", "    response_1 STRING NOT NULL OPTIONS(description=\"The customer's response.\"),\n", "    question_2 STRING NOT NULL OPTIONS(description=\"A survey question.\"),\n", "    response_2 STRING NOT NULL OPTIONS(description=\"The customer's response.\"),\n", "    question_3 STRING NOT NULL 
OPTIONS(description=\"The customer's response.\")\n", ")\n", "CLUSTER BY customer_id;\n", "\"\"\"\n", "\n", "RunQuery(sql)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "cKK7KEnaH_Em" }, "outputs": [], "source": [ "sql = f\"\"\"\n", "CREATE TABLE IF NOT EXISTS `{PROJECT_ID}.{DATASET_ID}.customer_review`\n", "(\n", "    customer_review_id INTEGER NOT NULL OPTIONS(description=\"Primary key.\"),\n", "    customer_id INTEGER NOT NULL OPTIONS(description=\"Foreign key: Customer table\"),\n", "    location_id INTEGER NOT NULL OPTIONS(description=\"Foreign key: Location table\"),\n", "    review_datetime TIMESTAMP NOT NULL OPTIONS(description=\"Date and time of the review.\"),\n", "    review_text STRING NOT NULL OPTIONS(description=\"The customer's review of the coffee.\"),\n", "    review_mp3 STRING OPTIONS(description=\"The GCS location of an attached MP3 file.\"),\n", "    review_image STRING OPTIONS(description=\"The GCS location of an attached image file.\"),\n", "    review_sentiment STRING OPTIONS(description=\"The sentiment of the review text.\"),\n", "    social_media_source STRING NOT NULL OPTIONS(description=\"The social media site the review was posted on.\"),\n", "    social_media_handle STRING NOT NULL OPTIONS(description=\"The customer's social media handle.\"),\n", "    gen_ai_recommended_action STRING OPTIONS(description=\"Valid values for gen_ai_recommended_action are: 'Send Survey', 'Send Coupon'\"),\n", "    gen_ai_reponse STRING OPTIONS(description=\"The generated response from the LLM.\"),\n", "    human_response STRING OPTIONS(description=\"The human manually entered response.\"),\n", "    response_sent_action STRING OPTIONS(description=\"Valid values for response_sent_action are: 'Sent LLM Response', 'Human called'\"),\n", "    response_sent_date TIMESTAMP OPTIONS(description=\"Date and time the response was sent.\")\n", "\n", ")\n", "CLUSTER BY customer_id;\n", "\"\"\"\n", "\n", "RunQuery(sql)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { 
"id": "bktx53fDfszf" }, "outputs": [], "source": [ "sql = f\"\"\"ALTER TABLE `{PROJECT_ID}.{DATASET_ID}.customer_review`\n", "  ADD COLUMN IF NOT EXISTS llm_detected_theme JSON OPTIONS(description=\"The LLM detected themes in the customer review.\");\"\"\"\n", "\n", "RunQuery(sql)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "1mIv7Q-DRxI4" }, "outputs": [], "source": [ "sql = f\"\"\"\n", "CREATE TABLE IF NOT EXISTS `{PROJECT_ID}.{DATASET_ID}.menu`\n", "(\n", "    menu_id INTEGER NOT NULL OPTIONS(description=\"Primary key.\"),\n", "    company_id INTEGER NOT NULL OPTIONS(description=\"Foreign key: Company table.\"),\n", "    item_name STRING NOT NULL OPTIONS(description=\"The name of the coffee drink\"),\n", "    item_price FLOAT64 NOT NULL OPTIONS(description=\"The price of the coffee\"),\n", "    item_description STRING NOT NULL OPTIONS(description=\"The detailed description of the coffee drink\"),\n", "    item_size STRING NOT NULL OPTIONS(description=\"Valid Values: Small, Medium, Large\")\n", "\n", ")\n", "CLUSTER BY menu_id;\n", "\"\"\"\n", "\n", "RunQuery(sql)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Uiu6p13mAjQG" }, "outputs": [], "source": [ "sql = f\"\"\"\n", "CREATE TABLE IF NOT EXISTS `{PROJECT_ID}.{DATASET_ID}.order`\n", "(\n", "    order_id INTEGER NOT NULL OPTIONS(description=\"Primary key.\"),\n", "    location_id INTEGER NOT NULL OPTIONS(description=\"Foreign key: Location table.\"),\n", "    customer_id INTEGER NOT NULL OPTIONS(description=\"Foreign key: Customer table.\"),\n", "    order_datetime TIMESTAMP NOT NULL OPTIONS(description=\"The datetime the order was started.\"),\n", "    order_completion_datetime TIMESTAMP NOT NULL OPTIONS(description=\"The datetime the order was completed.\")\n", "\n", ")\n", "CLUSTER BY order_id;\n", "\"\"\"\n", "\n", "RunQuery(sql)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "IoR5uQmLEOiJ" }, "outputs": [], "source": [ "sql = f\"\"\"\n", "CREATE TABLE 
IF NOT EXISTS `{PROJECT_ID}.{DATASET_ID}.order_item`\n", "(\n", " order_item_id INTEGER NOT NULL OPTIONS(description=\"Primary key.\"),\n", " order_id INTEGER NOT NULL OPTIONS(description=\"Foreign key: Order table\"),\n", " menu_id INTEGER NOT NULL OPTIONS(description=\"Foreign key: Menu table\"),\n", " quantity INTEGER NOT NULL OPTIONS(description=\"Number of items ordered\")\n", "\n", ")\n", "CLUSTER BY order_id;\n", "\"\"\"\n", "\n", "RunQuery(sql)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "TJcSIw-ym4d6" }, "outputs": [], "source": [ "sql = f\"\"\"\n", "CREATE TABLE IF NOT EXISTS `{PROJECT_ID}.{DATASET_ID}.coffee_farm`\n", "(\n", " coffee_farm_id INTEGER NOT NULL OPTIONS(description=\"Primary key.\"),\n", " name STRING NOT NULL OPTIONS(description=\"The name of the coffee farm.\"),\n", " latitude FLOAT64 NOT NULL OPTIONS(description=\"The latitude of the coffee farm.\"),\n", " longitude FLOAT64 NOT NULL OPTIONS(description=\"The longitude of the coffee farm.\"),\n", " lat_long GEOGRAPHY NOT NULL OPTIONS(description=\"The latitude and longitude of the coffee farm.\"),\n", " contact_name STRING NOT NULL OPTIONS(description=\"The main contact person of the coffee farm.\"),\n", " contact_email STRING NOT NULL OPTIONS(description=\"The email address of the coffee farm.\"),\n", " contact_code STRING NOT NULL OPTIONS(description=\"Alpha 3 Code\")\n", ")\n", "CLUSTER BY coffee_farm_id;\n", "\"\"\"\n", "\n", "RunQuery(sql)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "3OESaGAsbxis" }, "outputs": [], "source": [ "sql = f\"\"\"\n", "CREATE TABLE IF NOT EXISTS `{PROJECT_ID}.{DATASET_ID}.coffee_processor`\n", "(\n", " coffee_processor_id INTEGER NOT NULL OPTIONS(description=\"Primary key.\"),\n", " name STRING NOT NULL OPTIONS(description=\"The name of the coffee processing facility.\"),\n", " latitude FLOAT64 NOT NULL OPTIONS(description=\"The latitude of the coffee processing facility.\"),\n", " longitude 
FLOAT64 NOT NULL OPTIONS(description=\"The longitude of the coffee processing facility.\"),\n", " lat_long GEOGRAPHY NOT NULL OPTIONS(description=\"The latitude and longitude of the coffee processing facility.\"),\n", " contact_name STRING NOT NULL OPTIONS(description=\"The main contact person of the coffee processing facility.\"),\n", " contact_email STRING NOT NULL OPTIONS(description=\"The email address of the coffee processing facility.\"),\n", " contact_code STRING NOT NULL OPTIONS(description=\"Alpha 3 Code\")\n", ")\n", "CLUSTER BY coffee_processor_id;\n", "\"\"\"\n", "\n", "RunQuery(sql)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9y95NMOufFEF" }, "outputs": [], "source": [ "sql = f\"\"\"\n", "CREATE TABLE IF NOT EXISTS `{PROJECT_ID}.{DATASET_ID}.coffee_roaster`\n", "(\n", " coffee_roaster_id INTEGER NOT NULL OPTIONS(description=\"Primary key.\"),\n", " name STRING NOT NULL OPTIONS(description=\"The name of the coffee roaster facility.\"),\n", " latitude FLOAT64 NOT NULL OPTIONS(description=\"The latitude of the coffee roaster facility.\"),\n", " longitude FLOAT64 NOT NULL OPTIONS(description=\"The longitude of the coffee roaster facility.\"),\n", " lat_long GEOGRAPHY NOT NULL OPTIONS(description=\"The latitude and longitude of the coffee roaster facility.\"),\n", " contact_name STRING NOT NULL OPTIONS(description=\"The main contact person of the coffee roaster facility.\"),\n", " contact_email STRING NOT NULL OPTIONS(description=\"The email address of the coffee roaster facility.\"),\n", " contact_code STRING NOT NULL OPTIONS(description=\"Alpha 3 Code\")\n", ")\n", "CLUSTER BY coffee_roaster_id;\n", "\"\"\"\n", "\n", "RunQuery(sql)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "dd5pjKEzSaji" }, "outputs": [], "source": [ "sql = f\"\"\"\n", "CREATE TABLE IF NOT EXISTS `{PROJECT_ID}.{DATASET_ID}.city`\n", "(\n", " city_id INTEGER NOT NULL OPTIONS(description=\"Primary key.\"),\n", " city_name STRING 
NOT NULL OPTIONS(description=\"The name of the city.\"),\n", "    country_code STRING NOT NULL OPTIONS(description=\"Alpha 3 Code.\"),\n", "    latitude FLOAT64 NOT NULL OPTIONS(description=\"The latitude of the city.\"),\n", "    longitude FLOAT64 NOT NULL OPTIONS(description=\"The longitude of the city.\"),\n", "    lat_long GEOGRAPHY NOT NULL OPTIONS(description=\"The latitude and longitude of the city.\")\n", ")\n", "CLUSTER BY city_id;\n", "\"\"\"\n", "\n", "RunQuery(sql)\n", "\n", "sql = f\"\"\"\n", "INSERT INTO `{PROJECT_ID}.{DATASET_ID}.city` VALUES\n", "(1, 'New York City', 'USA', 40.7128, -74.0060, ST_GeogPoint(-74.0060, 40.7128)),\n", "(2, 'London', 'GBR', 51.5072, -0.1276, ST_GeogPoint(-0.1276, 51.5072)),\n", "(3, 'Tokyo', 'JPN', 35.6764, 139.6500, ST_GeogPoint(139.6500, 35.6764)),\n", "(4, 'San Francisco', 'USA', 37.7749, -122.4194, ST_GeogPoint(-122.4194, 37.7749));\n", "\"\"\"\n", "\n", "RunQuery(sql)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "fnQfyW0XR7jT" }, "outputs": [], "source": [ "sql = f\"\"\"\n", "CREATE TABLE IF NOT EXISTS `{PROJECT_ID}.{DATASET_ID}.sales_forecast`\n", "(\n", "    sales_forecast_id INTEGER NOT NULL OPTIONS(description=\"Primary key.\"),\n", "    forecast_date DATE NOT NULL OPTIONS(description=\"The date for the forecasted sales\"),\n", "    sales_forecast_amount FLOAT64 NOT NULL OPTIONS(description=\"The projected sales for the city.\"),\n", "    city_id INTEGER NOT NULL OPTIONS(description=\"Foreign key: City table.\")\n", ")\n", "CLUSTER BY sales_forecast_id;\n", "\"\"\"\n", "\n", "RunQuery(sql)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5DavsmDlXmLE" }, "outputs": [], "source": [ "sql = f\"\"\"\n", "CREATE TABLE IF NOT EXISTS `{PROJECT_ID}.{DATASET_ID}.marketing_campaign`\n", "(\n", "    marketing_id INTEGER NOT NULL OPTIONS(description=\"Primary key.\"),\n", "    marketing_campaign_name STRING NOT NULL OPTIONS(description=\"The name of the marketing campaign.\"),\n", "    marketing_cost FLOAT64 NOT NULL 
OPTIONS(description=\"The cost for the campaign.\"),\n", "    number_of_new_customers INT64 NOT NULL OPTIONS(description=\"The number of new customers acquired by the campaign.\")\n", ")\n", "CLUSTER BY marketing_id;\n", "\"\"\"\n", "\n", "RunQuery(sql)\n", "\n", "sql = f\"\"\"\n", "INSERT INTO `{PROJECT_ID}.{DATASET_ID}.marketing_campaign` VALUES\n", "(1, 'Google Ads', 15000, 1500),\n", "(2, 'Billboards', 4000, 200);\n", "\"\"\"\n", "\n", "RunQuery(sql)" ] }, { "cell_type": "markdown", "metadata": { "id": "Jr2sIqNP800E" }, "source": [ "### Company Table" ] }, { "cell_type": "markdown", "metadata": { "id": "RKpEobqAXesn" }, "source": [ "Create the names for the coffee companies.\n", "\n", "* Create the names outside of the SQL statement. This provides more control over the names.\n", "* Use the names for the SQL prompt.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Qj3hXqA_FZWm" }, "outputs": [], "source": [ "company_count = 10\n", "\n", "table_name = \"company\"\n", "primary_key = \"company_id\"\n", "\n", "schema = GetTableSchema(DATASET_ID, table_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "TvLGkqIcWqH6" }, "outputs": [], "source": [ "prompt = f\"\"\"Generate {company_count} creative names and return in the below json format.\n", "- The name should be new and not a name that is already used by an existing coffee company.\n", "- The name should be related to coffee.\n", "- The name should be related to a food truck type of service.\n", "\n", "JSON format: [ \"value\" ]\n", "Sample JSON Response: [ \"value1\", \"value2\" ]\n", "\"\"\"\n", "\n", "llm_valid_execution = False\n", "while llm_valid_execution == False:\n", "  try:\n", "    company_names = LLM(prompt, True, max_output_tokens=1024, temperature=1, top_p=1, top_k=40)\n", "    llm_valid_execution = True\n", "  except Exception as error:\n", "    print(\"An error occurred:\", error)\n", "\n", "print(f\"company_names: {company_names}\")" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": { "id": "N8mIkpstyyw0" }, "outputs": [], "source": [ "# Override (hardcoded)\n", "company_names = \"\"\"[\n", "  \"BeanStreet Brews\",\n", "  \"MochaWheels Cafe\",\n", "  \"Brew 'n Go Bistro\",\n", "  \"Rolling Roast Express\",\n", "  \"JavaJolt Junction\",\n", "  \"Café on Wheels\",\n", "  \"Espresso Eats Mobile\",\n", "  \"CuppaCruiser\",\n", "  \"The Roaming Roastery\",\n", "  \"StreetBean Delights\"\n", "]\"\"\"\n", "\n", "company_names = company_names.replace(\"'\",\"\\\\'\") # Deal with single quotes in output" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "dmdXJAxHy7xU" }, "outputs": [], "source": [ "company_names" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7qb4CFnsglfU" }, "outputs": [], "source": [ "starting_value = GetStartingValue(DATASET_ID, table_name, primary_key)\n", "\n", "prompt=f\"\"\"\n", "You are a database engineer and need to generate data for a table for the below schema.\n", "- The schema is for a Google Cloud BigQuery Table.\n", "- The table name is \"{PROJECT_ID}.{DATASET_ID}.{table_name}\".\n", "- Read the description of each field for valid values.\n", "- Do not preface the response with any special characters or 'sql'.\n", "- Generate {company_count} insert statements for this table.\n", "- Valid values for company_name are: {company_names}\n", "- The starting value of the field {primary_key} is {starting_value}.\n", "- Only generate a single statement, not multiple INSERTs.\n", "\n", "\n", "Example 1: INSERT INTO `my-project.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Sample'),(2, 'Sample');\n", "Example 2: INSERT INTO `my-project.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Data'),(2, 'Data'),(3, 'Data');\n", "\n", "Schema: {schema}\n", "\"\"\"\n", "\n", "llm_valid_execution = False\n", "while llm_valid_execution == False:\n", "  try:\n", "    sql = LLM(prompt, False, max_output_tokens=1024, temperature=1, top_p=1, top_k=40)\n", "    
print(\"---------------------------------\")\n", "    print(\"sql: \", sql)\n", "    print(\"---------------------------------\")\n", "    llm_valid_execution = RunQuery(sql)\n", "  except Exception as error:\n", "    print(\"An error occurred:\", error)" ] }, { "cell_type": "markdown", "metadata": { "id": "deSLhzKC8qHQ" }, "source": [ "### Location Table" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "AkAbXzGMFWEF" }, "outputs": [], "source": [ "location_count = 50\n", "location_list = \"New York, San Francisco, Miami\"\n", "\n", "table_name = \"company\"\n", "field_name = \"company_id\"\n", "company_ids = GetForeignKeys(DATASET_ID, table_name, field_name)\n", "\n", "table_name = \"location\"\n", "primary_key = \"location_id\"\n", "\n", "schema = GetTableSchema(DATASET_ID, table_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "QayPYuBQX3q9" }, "outputs": [], "source": [ "prompt = f\"\"\"Generate {location_count} creative names and return in the below json format.\n", "- The name should be new and not a name that is already used by an existing coffee company.\n", "- The name should be related to mobile coffee trucks that travel around the city.\n", "- The name should be related to the following locations: {location_list}\n", "\n", "JSON format: [ \"value\" ]\n", "Sample JSON Response: [ \"value1\", \"value2\" ]\n", "\"\"\"\n", "\n", "llm_valid_execution = False\n", "while llm_valid_execution == False:\n", "  try:\n", "    location_names = LLM(prompt, True, max_output_tokens=1024, temperature=1, top_p=1, top_k=40)\n", "    llm_valid_execution = True\n", "  except Exception as error:\n", "    print(\"An error occurred:\", error)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7hzH8uvczlN7" }, "outputs": [], "source": [ "# Override (hardcoded)\n", "\n", "location_names = \"\"\"[\n", "  \"CitySips Roaming Cafe\",\n", "  \"JavaJourney Express\",\n", "  \"UrbanCaffeine Cruiser\",\n", "  \"CafeWheels NY\",\n", "  
\"Golden Gate Grind Mobile\",\n", " \"Sunshine Sips on Wheels\",\n", " \"Big Apple Brew Bus\",\n", " \"Bay Brew Hauler\",\n", " \"Magic City Mocha Mobile\",\n", " \"Metropolis Mug Mover\",\n", " \"New York Nectar Nomad\",\n", " \"SFO Street Sips\",\n", " \"MiaMornings Mobile\",\n", " \"CityBeans Roam-uccino\",\n", " \"Sunrise City Sipper\",\n", " \"Gotham Grind on Wheels\",\n", " \"Bay Area Bean Bus\",\n", " \"Mia Mochaccino Mobile\",\n", " \"Cityscape Sip Stop\",\n", " \"Transit Brew Buggy\",\n", " \"Fog City Fueler\",\n", " \"Miami Metro Mugs\",\n", " \"NY Espresso Express\",\n", " \"Golden Grind Getter\",\n", " \"Sunny Side Sips Shuttle\",\n", " \"Empire Espresso Explorer\",\n", " \"SF Sidewalk Sipper\",\n", " \"Beachside Brew Bounder\",\n", " \"Urban Sipper's Shuttle\",\n", " \"NYC Nomadic Nectar\",\n", " \"Golden Bridge Brewmobile\",\n", " \"Sunny State Sipster\",\n", " \"Cafe Cruiser Central\",\n", " \"NY Neighborhood Nectar\",\n", " \"Frisco Fuel on Wheels\",\n", " \"MiaMug Mobility\",\n", " \"Metropolitan Mochaccino\",\n", " \"CitySips Street Surfer\",\n", " \"Golden Gate Gourmet Glide\",\n", " \"Beach Breeze Brew Bus\",\n", " \"City Roast Cruiser\",\n", " \"NYC Urban Uplifter\",\n", " \"Frisco Fresh Brews\",\n", " \"Miami Magic Mugs\",\n", " \"Coffee Cart Connection\",\n", " \"Empire City Espresso Explorer\",\n", " \"Golden Glow Grind Rover\",\n", " \"Sun-Kissed Sip & Go\",\n", " \"CityLife Latte Lorry\",\n", " \"Cityscape Sipper Shuttle\"\n", "]\"\"\"\n", "\n", "location_names = location_names.replace(\"'\",\"\\\\'\") # Deal with single quotes in output" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "3WpCWlL-lb9h" }, "outputs": [], "source": [ "starting_value = GetStartingValue(DATASET_ID, table_name, primary_key)\n", "\n", "prompt=f\"\"\"\n", "You are a database engineer and need to generate data for a table for the below schema.\n", "- The schema is for a Google Cloud BigQuery Table.\n", "- The table name is 
\"{PROJECT_ID}.{DATASET_ID}.{table_name}\".\n", "- Read the description of each field for valid values.\n", "- Do not preface the response with any special characters or 'sql'.\n", "- Generate {location_count} insert statements for this table.\n", "- Valid values for location_name are: {location_names}\n", "- Valid values for company_id are: {company_ids}\n", "- The starting value of the field {primary_key} is {starting_value}.\n", "- Only generate a single statement, not multiple INSERTs.\n", "\n", "Example 1: INSERT INTO `my-project.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Sample'),(2, 'Sample');\n", "Example 2: INSERT INTO `my-project.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Data'),(2, 'Data'),(3, 'Data');\n", "\n", "Schema: {schema}\n", "\"\"\"\n", "\n", "llm_valid_execution = False\n", "while llm_valid_execution == False:\n", "  try:\n", "    sql = LLM(prompt, False, max_output_tokens=1024, temperature=1, top_p=1, top_k=40)\n", "    print(\"---------------------------------\")\n", "    print(\"sql: \", sql)\n", "    print(\"---------------------------------\")\n", "    llm_valid_execution = RunQuery(sql)\n", "  except Exception as error:\n", "    print(\"An error occurred:\", error)" ] }, { "cell_type": "markdown", "metadata": { "id": "nlhRKtPAsAWz" }, "source": [ "### Customer Table" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "uk0dNDwba44U" }, "outputs": [], "source": [ "sql = f\"\"\"ALTER TABLE `{PROJECT_ID}.{DATASET_ID}.customer`\n", "  ADD COLUMN IF NOT EXISTS country_code STRING OPTIONS(description=\"The home country of the customer.\");\"\"\"\n", "\n", "RunQuery(sql)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "FwWRZChiFTIl" }, "outputs": [], "source": [ "customer_count = 5\n", "country = \"Great Britain\"\n", "country_code = \"GBR\"\n", "\n", "table_name = \"company\"\n", "field_name = \"company_id\"\n", "company_ids = GetForeignKeys(DATASET_ID, table_name, field_name)\n", "\n", "table_name = 
\"customer\"\n", "primary_key = \"customer_id\"\n", "\n", "schema = GetTableSchema(DATASET_ID, table_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "6DcfseTbsB8u" }, "outputs": [], "source": [ "loop_count = 1\n", "loop_index = 1\n", "\n", "while loop_index <= loop_count:\n", " print(f\"loop_index: {loop_index} | loop_count: {loop_count}\")\n", " starting_value = GetStartingValue(DATASET_ID, table_name, primary_key)\n", "\n", " prompt=f\"\"\"\n", " You are a database engineer and need to generate data for a table for the below schema.\n", " - The schema is for a Google Cloud BigQuery Table.\n", " - The table name is \"{PROJECT_ID}.{DATASET_ID}.{table_name}\".\n", " - Read the description of each field for valid values.\n", " - Do not preface the response with any special characters or 'sql'.\n", " - Generate {customer_count} insert statements for this table.\n", " - The customer_inception_date is a date and should be within the past 5 years.\n", " - The customer_name should be a first and last name commonly used in the country {country}.\n", " - Valid values for company_id are: {company_ids}\n", " - The starting value of the field {primary_key} is {starting_value}.\n", " - Only generate a single statement, not multiple INSERTs.\n", " - Set the country_code to {country_code}\n", "\n", " Example 1: INSERT INTO `my-dataset.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Sample'),(2, 'Sample');\n", " Example 2: INSERT INTO `my-dataset.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Data'),(2, 'Data'),(3, 'Data');\n", "\n", " Schema: {schema}\n", " \"\"\"\n", "\n", "\n", " llm_valid_execution = False\n", " while llm_valid_execution == False:\n", " try:\n", " sql = LLM(prompt, False, max_output_tokens=1024, temperature=1, top_p=1, top_k=40)\n", " print(\"---------------------------------\")\n", " print(\"sql: \", sql)\n", " print(\"---------------------------------\")\n", " llm_valid_execution = RunQuery(sql)\n", " loop_index = 
loop_index + 1\n", " except Exception as error:\n", " print(\"An error occurred:\", error)" ] }, { "cell_type": "markdown", "metadata": { "id": "ylfzCAIzEALT" }, "source": [ "### Customer Survey Table" ] }, { "cell_type": "markdown", "metadata": { "id": "ZEvAdTsKaLbB" }, "source": [ "* Generate a list of coffee survey questions\n", "* Save these so we can re-use them in each loop\n", "* Generate survey responses\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "W9NKsQYbEALe" }, "outputs": [], "source": [ "customer_survey_count = 3\n", "\n", "prompt = f\"\"\"You work in the marketing department and are gathering feedback on coffee served to customers.\n", "Generate {customer_survey_count} random coffee survey questions and return in the below JSON format.\n", "- Do not include any special characters in the json.\n", "- Do not include ```json in the output.\n", "\n", "JSON format: [ \"value\" ]\n", "Sample JSON Response: [ \"value1\", \"value2\" ]\n", "\"\"\"\n", "\n", "llm_valid_execution = False\n", "while llm_valid_execution == False:\n", " try:\n", " generated_customer_surveys = LLM(prompt, True, max_output_tokens=1024, temperature=1, top_p=1, top_k=40)\n", " llm_valid_execution = True\n", " except Exception as error:\n", " print(\"An error occurred:\", error)\n", "\n", "# Hard coded so we ask the same questions (for now)\n", "customer_surveys = '[\"How satisfied were you with the taste of our coffee?\", \"How likely are you to recommend our coffee to a friend?\", \"Which of the following best describes your overall experience with our coffee?\"]'\n", "\n", "# Load the json so we can index it\n", "customer_surveys_json = json.loads(customer_surveys)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "CceughTOFQW-" }, "outputs": [], "source": [ "rows_of_data_to_generate = 5\n", "\n", "table_name = \"customer\"\n", "field_name = \"customer_id\"\n", "customer_ids = GetForeignKeys(DATASET_ID, table_name, 
field_name)\n", "\n", "table_name = \"customer_survey\"\n", "primary_key = \"customer_survey_id\"\n", "\n", "schema = GetTableSchema(DATASET_ID, table_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UQL9jWr0EALe" }, "outputs": [], "source": [ "loop_count = 10\n", "loop_index = 1\n", "\n", "while loop_index <= loop_count:\n", " print(f\"loop_index: {loop_index} | loop_count: {loop_count}\")\n", " starting_value = GetStartingValue(DATASET_ID, table_name, primary_key)\n", "\n", " prompt=f\"\"\"\n", " You are a database engineer and need to generate data for a table for the below schema.\n", " - The schema is for a Google Cloud BigQuery Table.\n", " - The table name is \"{PROJECT_ID}.{DATASET_ID}.{table_name}\".\n", " - Read the description of each field for valid values.\n", " - Do not preface the response with any special characters or 'sql'.\n", " - Generate {rows_of_data_to_generate} insert statements for this table.\n", " - Valid values for customer_id are: {customer_ids}\n", " - The survey_date is a date and should be within the past 5 years.\n", " - The valid value for question_1 is {customer_surveys_json[0]}\n", " - The valid value for question_2 is {customer_surveys_json[1]}\n", " - The valid value for question_3 is {customer_surveys_json[2]}\n", " - The response for each question should be 2 to 20 words.\n", " - The response can have a positive, neutral or negative sentiment.\n", " - The starting value of the field {primary_key} is {starting_value}.\n", " - Escape single quotes with a backslash. 
Example: Adam's Answer: Adam\\'s\n", " - Only generate a single statement, not multiple INSERTs.\n", "\n", " Example 1: INSERT INTO `my-dataset.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Sample'),(2, 'Sample');\n", " Example 2: INSERT INTO `my-dataset.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Data'),(2, 'Data'),(3, 'Data');\n", "\n", " Schema: {schema}\n", " \"\"\"\n", "\n", " llm_valid_execution = False\n", " while llm_valid_execution == False:\n", " try:\n", " sql = LLM(prompt, False, max_output_tokens=1024, temperature=1, top_p=1, top_k=40)\n", " print(\"---------------------------------\")\n", " print(\"sql: \", sql)\n", " print(\"---------------------------------\")\n", " llm_valid_execution = RunQuery(sql)\n", " loop_index = loop_index + 1\n", " except Exception as error:\n", " print(\"An error occurred:\", error)" ] }, { "cell_type": "markdown", "metadata": { "id": "WbxE7eDp-L1j" }, "source": [ "### Customer Review Table" ] }, { "cell_type": "markdown", "metadata": { "id": "Mg8ZZEViq9lu" }, "source": [ "### Create customer reviews" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8VwUtOMh-L1j" }, "outputs": [], "source": [ "rows_of_data_to_generate = 3\n", "\n", "table_name = \"customer\"\n", "field_name = \"customer_id\"\n", "customer_ids = GetForeignKeys(DATASET_ID, table_name, field_name)\n", "\n", "table_name = \"location\"\n", "field_name = \"location_id\"\n", "location_ids = GetForeignKeys(DATASET_ID, table_name, field_name)\n", "\n", "table_name = \"customer_review\"\n", "primary_key = \"customer_review_id\"\n", "\n", "schema = GetTableSchema(DATASET_ID, table_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "LL6cYlwWCYQu" }, "outputs": [], "source": [ "# location_ids='1,11,21'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bVm36Q_1-L1k" }, "outputs": [], "source": [ "import random\n", "loop_count = 1000\n", "loop_index = 1\n", "\n", "while 
loop_index <= loop_count:\n", " print(f\"loop_index: {loop_index} | loop_count: {loop_count}\")\n", " starting_value = GetStartingValue(DATASET_ID, table_name, primary_key)\n", "\n", " if random.random() < .25:\n", " prompt=f\"\"\"\n", " You are a database engineer and need to generate data for a table for the below schema.\n", " You need to generate reviews for customers who have purchased your brewed coffee.\n", " Write a negative review in first person based upon the following: \"Bad Service\",\"Long Wait Time\",\"Slow Service\",\"Dirty\",\"Overpriced\",\"Overcrowded\",\"Noisy Location\",\"Lack of Allergan Information\",\"Inconsistent Quality\",\"Lack of Seating\",\"No Flavor\",\"Too weak\",\"Too strong\",\"Too bitter\",\"Limited Menu\"\n", " - The schema is for a Google Cloud BigQuery Table.\n", " - The table name is \"{PROJECT_ID}.{DATASET_ID}.{table_name}\".\n", " - Read the description of each field for valid values.\n", " - Do not preface the response with any special characters or 'sql'.\n", " - Generate {rows_of_data_to_generate} insert statements for this table.\n", " - Valid values for customer_id are: {customer_ids}\n", " - Valid values for location_id are: {location_ids}\n", " - The review_datetime is a date and should be within the past 5 years.\n", " - The response for each question should be 20 to 100 words.\n", " - The starting value of the field {primary_key} is {starting_value}.\n", " - Only generate a single statement, not multiple INSERTs.\n", " - Escape single quotes with a backslash. Example: Adam's Answer: Adam\\'s\n", " - Only generate data for these fields: customer_review_id, customer_id, location_id, review_datetime, review_text, social_media_source, social_media_handle\n", "\n", " Examples:\n", " Example 1: INSERT INTO `my-dataset.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Sample'),(2, 'Sample');\n", " Example 2: INSERT INTO `my-dataset.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Data'),(2, 'Data'),(3, 'Data');\n", "\n", " Schema: {schema}\n", " \"\"\"\n", " else:\n", " prompt=f\"\"\"\n", " You are a database engineer and need to generate data for a table for the below schema.\n", " You need to generate reviews for customers who have purchased your brewed coffee.\n", " Write a positive or neutral review in first person based upon the following: \"Good Service\",\"Short Wait Time\",\"Fast Service\",\"Clean\",\"Good value\",\"Cozy Seating Areas\",\"Quite Location\",\"Variety of Milk Alternatives\",\"Consistent Quality\",\"Lots of places to sit\",\"Lots of Flavor\",\"Good Taste\",\"Good Selection\"\n", " - The schema is for a Google Cloud BigQuery Table.\n", " - The table name is \"{PROJECT_ID}.{DATASET_ID}.{table_name}\".\n", " - Read the description of each field for valid values.\n", " - Do not preface the response with any special characters or 'sql'.\n", " - Generate {rows_of_data_to_generate} insert statements for this table.\n", " - Valid values for customer_id are: {customer_ids}\n", " - Valid values for location_id are: {location_ids}\n", " - The review_datetime is a date and should be within the past 5 years.\n", " - The response for each question should be 20 to 100 words.\n", " - The starting value of the field {primary_key} is {starting_value}.\n", " - Only generate a single statement, not multiple INSERTs.\n", " - Escape single quotes with a backslash. 
Example: Adam's Answer: Adam\\'s\n", " - Only generate data for these fields: customer_review_id, customer_id, location_id, review_datetime, review_text, social_media_source, social_media_handle\n", "\n", " Examples:\n", " Example 1: INSERT INTO `my-dataset.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Sample'),(2, 'Sample');\n", " Example 2: INSERT INTO `my-dataset.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Data'),(2, 'Data'),(3, 'Data');\n", "\n", " Schema: {schema}\n", " \"\"\"\n", "\n", " llm_valid_execution = False\n", " while llm_valid_execution == False:\n", " try:\n", " sql = LLM(prompt, False, max_output_tokens=1024, temperature=1, top_p=1, top_k=40)\n", " print(\"---------------------------------\")\n", " print(\"sql: \", sql)\n", " print(\"---------------------------------\")\n", " llm_valid_execution = RunQuery(sql)\n", " loop_index = loop_index + 1\n", " except Exception as error:\n", " print(\"An error occurred:\", error)" ] }, { "cell_type": "markdown", "metadata": { "id": "uWUogwtAqNTT" }, "source": [ "### Score the Sentiment" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "fSB45vFQPnaM" }, "outputs": [], "source": [ "sql = \"\"\"SELECT customer_review_id,\n", " review_text\n", " FROM `${project_id}.${bigquery_data_beans_curated_dataset}.customer_review`\n", " WHERE review_sentiment IS NULL\n", " ORDER BY customer_review_id\n", "\"\"\"\n", "\n", "# Fields to update\n", "# review_sentiment,\n", "# gen_ai_recommended_action,\n", "# gen_ai_reponse,\n", "# human_response,\n", "# response_sent_action,\n", "# response_sent_date\n", "\n", "df_process = client.query(sql).to_dataframe()\n", "\n", "for row in df_process.itertuples():\n", " customer_review_id = row.customer_review_id\n", " review_text = row.review_text\n", "\n", " llm_valid_execution = False\n", " while llm_valid_execution == False:\n", " try:\n", " prompt=f\"\"\"\n", " For the given review classify the sentiment as Positive, Neutral or Negative.\n", " 
Review: {review_text}\n", " \"\"\"\n", " review_sentiment = LLM(prompt, False, max_output_tokens=10, temperature=0, top_p=0, top_k=1)\n", "\n", " sql = f\"\"\"UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.customer_review`\n", " SET review_sentiment = '{review_sentiment}'\n", " WHERE customer_review_id = {customer_review_id}\n", " \"\"\"\n", "\n", " print(sql)\n", "\n", " llm_valid_execution = RunQuery(sql)\n", " except Exception as error:\n", " print(\"An error occurred:\", error)" ] }, { "cell_type": "markdown", "metadata": { "id": "6W_Dr1xGqRML" }, "source": [ "### Gen AI Response" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ZJeGxIYwqUlc" }, "outputs": [], "source": [ "sql = \"\"\"SELECT customer_review_id,\n", " review_text\n", " FROM `${project_id}.${bigquery_data_beans_curated_dataset}.customer_review`\n", " WHERE gen_ai_reponse IS NULL\n", " ORDER BY customer_review_id\"\"\"\n", "\n", "df_process = client.query(sql).to_dataframe()\n", "\n", "for row in df_process.itertuples():\n", " customer_review_id = row.customer_review_id\n", " review_text = row.review_text\n", "\n", " llm_valid_execution = False\n", " while llm_valid_execution == False:\n", " try:\n", " prompt=f\"\"\"\n", " Generate responses to the below review from a customer who purchased coffee and return the results in the below JSON format.\n", " The review can be positive, negative, or neutral.\n", " Provide a variety of responses, including thanking customers for positive reviews, addressing concerns in negative reviews, and engaging with neutral reviews.\n", " Please generate at least 5 different responses.\n", "\n", " JSON format: [ \"value\" ]\n", " Sample JSON Response: [ \"response 1\", \"response 2\", \"response 3\", \"response 4\", \"response 5\" ]\n", "\n", " Review: {review_text}\"\"\"\n", "\n", " json_result = LLM(prompt, True, max_output_tokens=1024, temperature=0, top_p=0, top_k=1)\n", " print(f\"json_result: 
{json_result}\")\n", "\n", " if json_result is None:\n", " llm_valid_execution = False\n", " else:\n", " sql = f\"\"\"UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.customer_review`\n", " SET gen_ai_reponse = '{json_result}'\n", " WHERE customer_review_id = {customer_review_id}\n", " \"\"\"\n", "\n", " print(f\"sql: {sql}\")\n", "\n", " llm_valid_execution = RunQuery(sql)\n", " except Exception as error:\n", " print(\"An error occurred:\", error)" ] }, { "cell_type": "markdown", "metadata": { "id": "azULzoGAqVGz" }, "source": [ "### Gen AI Recommended Action" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "noL3Es8tqjr0" }, "outputs": [], "source": [ "sql = \"\"\"SELECT customer_review_id,\n", " review_text\n", " FROM `${project_id}.${bigquery_data_beans_curated_dataset}.customer_review`\n", " WHERE gen_ai_recommended_action IS NULL\n", " ORDER BY customer_review_id\"\"\"\n", "\n", "df_process = client.query(sql).to_dataframe()\n", "\n", "for row in df_process.itertuples():\n", " customer_review_id = row.customer_review_id\n", " review_text = row.review_text\n", "\n", " llm_valid_execution = False\n", " while llm_valid_execution == False:\n", " try:\n", " prompt=\"\"\"\n", " Select one of the following actions based upon the below review from a customer who purchased coffee.\n", " - First randomly sort the actions.\n", " - Select the best action based upon the sentiment of the review.\n", " - It is okay to use the action \"Send the customer a coupon\" for both positive and negative reviews.\n", " - Return the results in the below JSON format.\n", " - Do not include any special characters or \"```json\" in the json output\n", "\n", " Actions\n", " - \"Thank the Customer\"\n", " - \"Apologize to the Customer\"\n", " - \"Send the customer a coupon\"\n", " - \"Call the customer\"\n", " - \"Promote Additional Products\"\n", " - \"Promise to Investigate\"\n", " - \"Encourage More Reviews\"\n", " - \"Invite Further Engagement\"\n", " - 
\"Reshare the review on other social media\"\n", "\n", " JSON format: { \"action\" : \"value\", \"explaination\" : \"llm explaination\" }\n", " Sample JSON Response: { \"action\" : \"Call the customer\", \"explaination\" : \"The customer left their phone number in the review.\" }\n", " Sample JSON Response: { \"action\" : \"Encourage More Reviews\", \"explaination\" : \"Thanks for the review, please keep posting.\" }\n", "\n", " Review:\"\"\"\n", " prompt = prompt + review_text\n", "\n", " json_result = LLM(prompt, True, max_output_tokens=1024, temperature=0, top_p=0, top_k=1)\n", " print(f\"json_result: {json_result}\")\n", "\n", " if json_result is None:\n", " llm_valid_execution = False\n", " else:\n", " sql = f\"\"\"UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.customer_review`\n", " SET gen_ai_recommended_action = '{json_result}'\n", " WHERE customer_review_id = {customer_review_id}\n", " \"\"\"\n", "\n", " print(f\"sql: {sql}\")\n", "\n", " llm_valid_execution = RunQuery(sql)\n", " except Exception as error:\n", " print(\"An error occurred:\", error)" ] }, { "cell_type": "markdown", "metadata": { "id": "mOtL7RgiZox9" }, "source": [ "### Detect Customer Themes" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "y7r3Kzovd-y2" }, "outputs": [], "source": [ "sql = \"\"\"SELECT customer_review_id,\n", " review_text\n", " FROM `${project_id}.${bigquery_data_beans_curated_dataset}.customer_review`\n", " WHERE llm_detected_theme IS NULL\n", " ORDER BY customer_review_id\"\"\"\n", "\n", "df_process = client.query(sql).to_dataframe()\n", "\n", "for row in df_process.itertuples():\n", " customer_review_id = row.customer_review_id\n", " review_text = row.review_text\n", "\n", " llm_valid_execution = False\n", " while llm_valid_execution == False:\n", " try:\n", " prompt=\"\"\"\n", " Classify the below customer review as one or more of the below themes.\n", " - Return the results in the below JSON format.\n", " - Include an explaination for selecting each theme.\n", " - Do not include double quotes in the explaination.\n", " - Do not include any special characters, double quotes or \"```json\" in the json output.\n", "\n", " Themes\n", " - \"Bad Service\"\n", " - \"Long Wait Time\"\n", " - \"Slow Service\"\n", " - \"Dirty\"\n", " - \"Overpriced\"\n", " - \"Overcrowded\"\n", " - \"Noisy Location\"\n", " - \"Lack of Allergan Information\"\n", " - \"Inconsistent Quality\"\n", " - \"Lack of Seating\"\n", " - \"No Flavor\"\n", " - \"Too weak\"\n", " - \"Too strong\"\n", " - \"Too bitter\"\n", " - \"Limited Menu\"\n", " - \"Good Service\"\n", " - \"Short Wait Time\"\n", " - \"Fast Service\"\n", " - \"Clean\"\n", " - \"Good value\"\n", " - \"Cozy Seating Areas\"\n", " - \"Quite Location\"\n", " - \"Variety of Milk Alternatives\"\n", " - \"Consistent Quality\"\n", " - \"Lots of places to sit\"\n", " - \"Lots of Flavor\"\n", " - \"Good Taste\"\n", " - \"Good Selection\"\n", "\n", " JSON format: [{ \"theme\" : \"value\", \"explaination\" : \"llm explaination\" }]\n", " Sample JSON Response: [{ \"theme\" : \"Fast Service\", \"explaination\" : \"The customer got their order fast.\" }]\n", " Sample JSON Response: [{ \"theme\" : \"Overpriced\", \"explaination\" : \"The customer said it was too expensive.\" }]\n", "\n", " Review:\"\"\"\n", " prompt = prompt + review_text\n", "\n", " json_result = LLM(prompt, True, max_output_tokens=1024, temperature=0, top_p=0, top_k=1)\n", " print(f\"json_result: {json_result}\")\n", "\n", " if json_result is None:\n", " llm_valid_execution = False\n", " else:\n", " sql = f\"\"\"UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.customer_review`\n", " SET llm_detected_theme = JSON'{json_result}'\n", " WHERE customer_review_id = {customer_review_id}\n", " \"\"\"\n", "\n", " print(f\"sql: {sql}\")\n", "\n", " llm_valid_execution = RunQuery(sql)\n", " except Exception as error:\n", " print(\"An error occurred:\", error)" ] }, { "cell_type": "markdown", "metadata": { 
"id": "Wuv45clVnqUf" }, "source": [ "### Menu Table" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "0wyRzw-QFHm-" }, "outputs": [], "source": [ "table_name = \"company\"\n", "field_name = \"company_id\"\n", "company_ids = GetForeignKeys(DATASET_ID, table_name, field_name)\n", "\n", "table_name = \"menu\"\n", "primary_key = \"menu_id\"\n", "\n", "schema = GetTableSchema(DATASET_ID, table_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "zlEe3aBvf33F" }, "outputs": [], "source": [ "loop_count = 10\n", "loop_index = 1\n", "menu_count = 10\n", "\n", "while loop_index <= loop_count:\n", " print(f\"loop_index: {loop_index} | loop_count: {loop_count}\")\n", "\n", " # Get Menu Names\n", " menu_count = 5 # We multiply this by 3 to get 15 (one for small, medium and large sizes)\n", "\n", " table_name = \"menu\"\n", " field_name = \"item_name\"\n", " existing_values = GetDistinctValues(DATASET_ID, table_name, field_name)\n", "\n", " prompt = f\"\"\"Generate {menu_count} different coffee drink names and return in the below json format.\n", " - The name can be an existing coffee drink or something new.\n", " - The name should be related to coffee.\n", " - Do not use any of these names: [{existing_values}]\n", " - Do not number the results.\n", "\n", " JSON format: [ \"value\" ]\n", " Sample JSON Response: [ \"value1\", \"value2\" ]\n", " \"\"\"\n", "\n", " llm_valid_execution = False\n", " while llm_valid_execution == False:\n", " try:\n", " menu_names = LLM(prompt, True, max_output_tokens=1024, temperature=1, top_p=1, top_k=40)\n", " llm_valid_execution = True\n", " except Exception as error:\n", " print(\"An error occurred:\", error)\n", "\n", "\n", " # Insert data\n", " starting_value = GetStartingValue(DATASET_ID, table_name, primary_key)\n", "\n", " prompt=f\"\"\"\n", " You are a database engineer and need to generate data for a table for the below schema.\n", " - The schema is for a Google Cloud BigQuery Table.\n", 
" - The table name is \"{PROJECT_ID}.{DATASET_ID}.{table_name}\".\n", " - Read the description of each field for valid values.\n", " - Do not preface the response with any special characters or 'sql'.\n", " - Generate {menu_count * 3} total rows for this table.\n", " - Valid values for company_id are: {company_ids}\n", " - Valid values for item_name are: {menu_names}\n", " - The starting value of the field {primary_key} is {starting_value}.\n", " - Only generate a single statement, not multiple INSERTs.\n", " - Create a Small, Medium and Large size for each item_name. The same company_id should be used as well for all 3 sizes.\n", "\n", " Example 1: INSERT INTO `my-dataset.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Sample'),(2, 'Sample');\n", " Example 2: INSERT INTO `my-dataset.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Data'),(2, 'Data'),(3, 'Data');\n", "\n", " Schema: {schema}\n", " \"\"\"\n", "\n", " llm_valid_execution = False\n", " while llm_valid_execution == False:\n", " try:\n", " sql = LLM(prompt, False, max_output_tokens=1024, temperature=1, top_p=1, top_k=40)\n", " print(\"---------------------------------\")\n", " print(\"sql: \", sql)\n", " print(\"---------------------------------\")\n", " llm_valid_execution = RunQuery(sql)\n", " loop_index = loop_index + 1\n", " except Exception as error:\n", " print(\"An error occurred:\", error)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "6TEqBTkAAqLF" }, "outputs": [], "source": [ "loop_count = 1\n", "loop_index = 1\n", "menu_count = 20\n", "\n", "while loop_index <= loop_count:\n", " print(f\"loop_index: {loop_index} | loop_count: {loop_count}\")\n", "\n", " table_name = \"menu\"\n", " field_name = \"item_name\"\n", " existing_values = GetDistinctValues(DATASET_ID, table_name, field_name)\n", "\n", " prompt = f\"\"\"Generate {menu_count} different foods that you would buy with coffee and return in the below json format.\n", " - The name can be an existing coffee 
shop food item or something new.\n", " - The name should be related to coffee.\n", " - Do not use any of these names: [{existing_values}]\n", " - Do not number the results.\n", "\n", " JSON format: [ \"value\" ]\n", " Sample JSON Response: [ \"value1\", \"value2\" ]\n", " \"\"\"\n", "\n", " llm_valid_execution = False\n", " while llm_valid_execution == False:\n", " try:\n", " #menu_names = LLM(prompt, True, max_output_tokens=1024, temperature=1, top_p=1, top_k=40)\n", " llm_valid_execution = True\n", " except Exception as error:\n", " print(\"An error occurred:\", error)\n", "\n", " menu_names = \"\"\"\n", " [\"Cinnamon Roll\",\n", " \"Espresso Brownie\",\n", " \"Coffee Tiramisu\",\n", " \"Cafe Latte Donut\",\n", " \"Coffee Glazed Bacon\",\n", " \"Hazelnut Croissant\",\n", " \"Mocha Pancakes\",\n", " \"Coffee-infused Oatmeal\",\n", " \"Cafe Muffin\",\n", " \"Espresso Chocolate Truffle\",\n", " \"Caffeine Chip Cookies\",\n", " \"Mocha Almond Croissant\",\n", " \"Coffee Bean Scones\",\n", " \"Java Chip Pancakes\",\n", " \"Hazelnut Latte Donut\",\n", " \"Cappuccino Cheesecake\",\n", " \"Coffee Tiramisu\",\n", " \"Caramel Macchiato Muffin\",\n", " \"Coffee-Infused Chocolate Truffles\",\n", " \"Coffee Bean Ice Cream\"]\"\"\"\n", " # Insert data\n", " starting_value = GetStartingValue(DATASET_ID, table_name, primary_key)\n", "\n", " prompt=f\"\"\"\n", " You are a database engineer and need to generate data for a table for the below schema.\n", " - The schema is for a Google Cloud BigQuery Table.\n", " - The table name is \"{PROJECT_ID}.{DATASET_ID}.{table_name}\".\n", " - Read the description of each field for valid values.\n", " - Do not preface the response with any special characters or 'sql'.\n", " - Generate {menu_count} total rows for this table.\n", " - Valid values for company_id are: {company_ids}\n", " - Valid values for item_name are: {menu_names}\n", " - The starting value of the field {primary_key} is {starting_value}.\n", " - Only generate a single statement, not 
multiple INSERTs.\n", " - Create a Small, Medium and Large size for each item_name. The same company_id should be used as well for all 3 sizes.\n", "\n", " Example 1: INSERT INTO `my-dataset.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Sample'),(2, 'Sample');\n", " Example 2: INSERT INTO `my-dataset.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Data'),(2, 'Data'),(3, 'Data');\n", "\n", " Schema: {schema}\n", " \"\"\"\n", "\n", " llm_valid_execution = False\n", " while llm_valid_execution == False:\n", " try:\n", " sql = LLM(prompt, False, max_output_tokens=1024, temperature=1, top_p=1, top_k=40)\n", " print(\"---------------------------------\")\n", " print(\"sql: \", sql)\n", " print(\"---------------------------------\")\n", " llm_valid_execution = RunQuery(sql)\n", " loop_index = loop_index + 1\n", " except Exception as error:\n", " print(\"An error occurred:\", error)" ] }, { "cell_type": "markdown", "metadata": { "id": "KeDoqMQNAFcE" }, "source": [ "### Order Table" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "j8Aoyg2hE_Ru" }, "outputs": [], "source": [ "order_count = 10\n", "\n", "table_name = \"customer\"\n", "field_name = \"customer_id\"\n", "max_customer_id = GetMaximumValue(DATASET_ID, table_name, field_name)\n", "\n", "table_name = \"location\"\n", "field_name = \"location_id\"\n", "max_location_id = GetMaximumValue(DATASET_ID, table_name, field_name)\n", "\n", "table_name = \"order\"\n", "primary_key = \"order_id\"\n", "\n", "schema = GetTableSchema(DATASET_ID, table_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "JPmtl23VAFcO" }, "outputs": [], "source": [ "loop_count = 50\n", "loop_index = 1\n", "\n", "while loop_index <= loop_count:\n", " print(f\"loop_index: {loop_index} | loop_count: {loop_count}\")\n", " starting_value = GetStartingValue(DATASET_ID, table_name, primary_key)\n", "\n", " prompt=f\"\"\"\n", " You are a database engineer and need to generate data for a table 
for the below schema.\n", " - The schema is for a Google Cloud BigQuery Table.\n", " - The table name is \"{PROJECT_ID}.{DATASET_ID}.{table_name}\".\n", " - Read the description of each field for valid values.\n", " - Do not preface the response with any special characters or 'sql'.\n", " - Generate {order_count} insert statements for this table.\n", " - The order_datetime is a date and should be within the past year.\n", " - The order_completion_datetime should be within 60 to 900 seconds of the order_datetime.\n", " - Valid values for location_id are between 1 and {max_location_id}.\n", " - Valid values for customer_id are between 1 and {max_customer_id}.\n", " - The starting value of the field {primary_key} is {starting_value}.\n", " - Only generate a single statement, not multiple INSERTs.\n", " - Timestamps should use this format: 2020-06-02 23:57:12.120174 UTC.\n", "\n", " Example 1: INSERT INTO `my-dataset.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Sample'),(2, 'Sample');\n", " Example 2: INSERT INTO `my-dataset.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Data'),(2, 'Data'),(3, 'Data');\n", "\n", " Schema: {schema}\n", " \"\"\"\n", "\n", " llm_valid_execution = False\n", " while llm_valid_execution == False:\n", " try:\n", " sql = LLM(prompt, False, max_output_tokens=1024, temperature=1, top_p=1, top_k=40)\n", " sql = sql.replace(\"\\\\'\",\"'\")\n", " print(\"---------------------------------\")\n", " print(\"sql: \", sql)\n", " print(\"---------------------------------\")\n", " llm_valid_execution = RunQuery(sql)\n", " loop_index = loop_index + 1\n", " except Exception as error:\n", " print(\"An error occurred:\", error)\n" ] }, { "cell_type": "markdown", "metadata": { "id": "0Bni86QYEusy" }, "source": [ "### Order Item Table" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "op7cuZl0F60U" }, "outputs": [], "source": [ "order_item_count = 3" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": 
"b1EdHmZrFbc9" }, "outputs": [], "source": [ "table_name = \"menu\"\n", "field_name = \"menu_id\"\n", "max_menu_id = GetMaximumValue(DATASET_ID, table_name, field_name)\n", "\n", "table_name = \"order\"\n", "field_name = \"order_id\"\n", "max_order_id = GetMaximumValue(DATASET_ID, table_name, field_name)\n", "\n", "table_name = \"order_item\"\n", "primary_key = \"order_item_id\"\n", "\n", "schema = GetTableSchema(DATASET_ID, table_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "4KYEq_cjEusy" }, "outputs": [], "source": [ "loop_count = 100\n", "loop_index = 1\n", "\n", "while loop_index <= loop_count:\n", " print(f\"loop_index: {loop_index} | loop_count: {loop_count}\")\n", " starting_value = GetStartingValue(DATASET_ID, table_name, primary_key)\n", "\n", " max_order_item_id = GetMaximumValue(DATASET_ID, table_name, \"order_id\")\n", "\n", " if max_order_item_id >= max_order_id:\n", " print(\"Breaking out of loop since we have items for each order.\")\n", " break\n", "\n", " order_id_to_generate_data = max_order_item_id + 1\n", " print(f\"order_id_to_generate_data: {order_id_to_generate_data}\")\n", "\n", " prompt=f\"\"\"\n", " You are a database engineer and need to generate data for a table for the below schema.\n", " - The schema is for a Google Cloud BigQuery Table.\n", " - The table name is \"{PROJECT_ID}.{DATASET_ID}.{table_name}\".\n", " - Read the description of each field for valid values.\n", " - Do not preface the response with any special characters or 'sql'.\n", " - Generate {order_item_count} insert statements for this table.\n", " - You can have 1 to 10 items for a single order_id.\n", " - The order_id should use the value of: {order_id_to_generate_data}.\n", " - Valid values for menu_id are between 1 and {max_menu_id}.\n", " - Valid values for quantity are between 1 and 4.\n", " - The starting value of the field {primary_key} is {starting_value}.\n", " - Only generate a single statement, not multiple INSERTs.\n", "\n", " 
Example 1: INSERT INTO `my-dataset.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Sample'),(2, 'Sample');\n", " Example 2: INSERT INTO `my-dataset.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Data'),(2, 'Data'),(3, 'Data');\n", "\n", " Schema: {schema}\n", " \"\"\"\n", "\n", "\n", " llm_valid_execution = False\n", " while llm_valid_execution == False:\n", " try:\n", " sql = LLM(prompt, False, max_output_tokens=1024, temperature=1, top_p=1, top_k=40)\n", " print(\"---------------------------------\")\n", " print(\"sql: \", sql)\n", " print(\"---------------------------------\")\n", " llm_valid_execution = RunQuery(sql)\n", " loop_index = loop_index + 1\n", " except Exception as error:\n", " print(\"An error occurred:\", error)" ] }, { "cell_type": "markdown", "metadata": { "id": "6LvkDmsAZVuQ" }, "source": [ "### Coffee Farm Table" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "u0GZ4jVSZVuR" }, "outputs": [], "source": [ "coffee_farm_count = 10\n", "\n", "table_name = \"coffee_farm\"\n", "primary_key = \"coffee_farm_id\"\n", "\n", "schema = GetTableSchema(DATASET_ID, table_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "0rm-TxQLmQuS" }, "outputs": [], "source": [ "table_name = \"coffee_farm\"\n", "field_name = \"name\"\n", "existing_values = GetDistinctValues(DATASET_ID, table_name, field_name)\n", "\n", "prompt = f\"\"\"Generate {coffee_farm_count} creative names and return in the below json format.\n", "- The name should be new and not a name that is already used by an existing coffee company.\n", "- The name should be related to coffee farms that grow coffee beans.\n", "- Do not use the following values: {existing_values}\n", "\n", "JSON format: [ \"value\" ]\n", "Sample JSON Response: [ \"value1\", \"value2\" ]\n", "\"\"\"\n", "\n", "llm_valid_execution = False\n", "while llm_valid_execution == False:\n", " try:\n", " coffee_farm_names = LLM(prompt, True, max_output_tokens=1024, 
temperature=1, top_p=1, top_k=40)\n", " llm_valid_execution = True\n", " except Exception as error:\n", " print(\"An error occurred:\", error)\n", "\n", "print(f\"coffee_farm_names: {coffee_farm_names}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "HFj7hytMZVuR" }, "outputs": [], "source": [ "starting_value = GetStartingValue(DATASET_ID, table_name, primary_key)\n", "\n", "prompt=f\"\"\"\n", "You are a database engineer and need to generate data for a table for the below schema.\n", "- The schema is for a Google Cloud BigQuery Table.\n", "- The table name is \"{PROJECT_ID}.{DATASET_ID}.{table_name}\".\n", "- Read the description of each field for valid values.\n", "- Do not preface the response with any special characters or 'sql'.\n", "- Generate {coffee_farm_count} insert statements for this table.\n", "- The starting value of the field {primary_key} is {starting_value}.\n", "- Only generate a single statement, not multiple INSERTs.\n", "- Valid values for name are: {coffee_farm_names}\n", "- Use the World Geodetic System (WGS) for the latitude and longitude values.\n", "- When inserting GEOGRAPHY data types use the BigQuery function ST_GEOGPOINT.\n", "- Escape single quotes with a backslash. 
Example: Adam's Answer: Adam\\'s\n", "\n", "Example 1: INSERT INTO `my-dataset.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Sample'),(2, 'Sample');\n", "Example 2: INSERT INTO `my-dataset.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Data'),(2, 'Data'),(3, 'Data');\n", "\n", "Schema: {schema}\n", "\"\"\"\n", "\n", "llm_valid_execution = False\n", "while llm_valid_execution == False:\n", " try:\n", " sql = LLM(prompt, False, max_output_tokens=1024, temperature=1, top_p=1, top_k=40)\n", " print(\"---------------------------------\")\n", " print(\"sql: \", sql)\n", " print(\"---------------------------------\")\n", " llm_valid_execution = RunQuery(sql)\n", " except Exception as error:\n", " print(\"An error occurred:\", error)" ] }, { "cell_type": "markdown", "metadata": { "id": "EG1vLI1rbY3o" }, "source": [ "### Coffee Processor Table" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "2WDgpdk_bY3p" }, "outputs": [], "source": [ "coffee_processor_count = 10\n", "\n", "table_name = \"coffee_processor\"\n", "primary_key = \"coffee_processor_id\"\n", "\n", "schema = GetTableSchema(DATASET_ID, table_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "HJgAgQckpbI4" }, "outputs": [], "source": [ "table_name = \"coffee_processor\"\n", "field_name = \"name\"\n", "existing_values = GetDistinctValues(DATASET_ID, table_name, field_name)\n", "\n", "prompt = f\"\"\"Generate {coffee_processor_count} creative names and return in the below json format.\n", "- The name should be new and not a name that is already used by an existing coffee company.\n", "- The name should be related to coffee processing facilities that process the raw beans.\n", "- Do not use the following values: {existing_values}\n", "\n", "JSON format: [ \"value\" ]\n", "Sample JSON Response: [ \"value1\", \"value2\" ]\n", "\"\"\"\n", "\n", "llm_valid_execution = False\n", "while llm_valid_execution == False:\n", " try:\n", " coffee_processor_names = 
LLM(prompt, True, max_output_tokens=1024, temperature=1, top_p=1, top_k=40)\n", " llm_valid_execution = True\n", " except Exception as error:\n", " print(\"An error occurred:\", error)\n", "\n", "print(f\"coffee_processor_names: {coffee_processor_names}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "dpO1oNZLbY3p" }, "outputs": [], "source": [ "starting_value = GetStartingValue(DATASET_ID, table_name, primary_key)\n", "\n", "prompt=f\"\"\"\n", "You are a database engineer and need to generate data for a table for the below schema.\n", "- The schema is for a Google Cloud BigQuery Table.\n", "- The table name is \"{PROJECT_ID}.{DATASET_ID}.{table_name}\".\n", "- Read the description of each field for valid values.\n", "- Do not preface the response with any special characters or 'sql'.\n", "- Generate a single INSERT statement with {coffee_processor_count} rows of data.\n", "- The starting value of the field {primary_key} is {starting_value}.\n", "- Only generate a single statement, not multiple INSERTs.\n", "- Valid values for name are: {coffee_processor_names}\n", "- Use the World Geodetic System (WGS) for the latitude and longitude values.\n", "- When inserting GEOGRAPHY data types use the BigQuery function ST_GEOGPOINT.\n", "- Escape single quotes with a backslash. 
Example: Adam's Answer: Adam\\'s\n", "\n", "Example 1: INSERT INTO `my-dataset.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Sample'),(2, 'Sample');\n", "Example 2: INSERT INTO `my-dataset.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Data'),(2, 'Data'),(3, 'Data');\n", "\n", "Schema: {schema}\n", "\"\"\"\n", "\n", "llm_valid_execution = False\n", "while llm_valid_execution == False:\n", " try:\n", " sql = LLM(prompt, False, max_output_tokens=1024, temperature=1, top_p=1, top_k=40)\n", " print(\"---------------------------------\")\n", " print(\"sql: \", sql)\n", " print(\"---------------------------------\")\n", " llm_valid_execution = RunQuery(sql)\n", " except Exception as error:\n", " print(\"An error occurred:\", error)" ] }, { "cell_type": "markdown", "metadata": { "id": "x0iOmN-ue1xl" }, "source": [ "### Coffee Roaster Table" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "D7A3AIEGe1xl" }, "outputs": [], "source": [ "coffee_roaster_count = 10\n", "\n", "table_name = \"coffee_roaster\"\n", "primary_key = \"coffee_roaster_id\"\n", "\n", "schema = GetTableSchema(DATASET_ID, table_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "AxDwKdUWuEPV" }, "outputs": [], "source": [ "table_name = \"coffee_roaster\"\n", "field_name = \"name\"\n", "existing_values = GetDistinctValues(DATASET_ID, table_name, field_name)\n", "\n", "prompt = f\"\"\"Generate {coffee_roaster_count} creative names and return in the below json format.\n", "- The name should be new and not a name that is already used by an existing coffee company.\n", "- The name should be related to coffee roasters that roast the beans to perfection.\n", "- Do not use the following values: {existing_values}\n", "\n", "JSON format: [ \"value\" ]\n", "Sample JSON Response: [ \"value1\", \"value2\" ]\n", "\"\"\"\n", "\n", "llm_valid_execution = False\n", "while llm_valid_execution == False:\n", " try:\n", " coffee_roaster_names = LLM(prompt, True, 
max_output_tokens=1024, temperature=1, top_p=1, top_k=40)\n", " llm_valid_execution = True\n", " except Exception as error:\n", " print(\"An error occurred:\", error)\n", "\n", "print(f\"coffee_roaster_names: {coffee_roaster_names}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "TYBHUYUye1xl" }, "outputs": [], "source": [ "starting_value = GetStartingValue(DATASET_ID, table_name, primary_key)\n", "\n", "prompt=f\"\"\"\n", "You are a database engineer and need to generate data for a table for the below schema.\n", "- The schema is for a Google Cloud BigQuery Table.\n", "- The table name is \"{PROJECT_ID}.{DATASET_ID}.{table_name}\".\n", "- Read the description of each field for valid values.\n", "- Do not preface the response with any special characters or 'sql'.\n", "- Generate a single INSERT statement with {coffee_roaster_count} rows of data.\n", "- The starting value of the field {primary_key} is {starting_value}.\n", "- Only generate a single statement, not multiple INSERTs.\n", "- Valid values for name are: {coffee_roaster_names}\n", "- Use the World Geodetic System (WGS) for the latitude and longitude values.\n", "- When inserting GEOGRAPHY data types use the BigQuery function ST_GEOGPOINT.\n", "- Escape single quotes with a backslash. 
Example: Adam's Answer: Adam\\'s\n", "\n", "Example 1: INSERT INTO `my-dataset.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Sample'),(2, 'Sample');\n", "Example 2: INSERT INTO `my-dataset.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Data'),(2, 'Data'),(3, 'Data');\n", "\n", "Schema: {schema}\n", "\"\"\"\n", "\n", "llm_valid_execution = False\n", "while llm_valid_execution == False:\n", " try:\n", " sql = LLM(prompt, False, max_output_tokens=1024, temperature=1, top_p=1, top_k=40)\n", " print(\"---------------------------------\")\n", " print(\"sql: \", sql)\n", " print(\"---------------------------------\")\n", " llm_valid_execution = RunQuery(sql)\n", " except Exception as error:\n", " print(\"An error occurred:\", error)" ] }, { "cell_type": "markdown", "metadata": { "id": "MJhOnDeYPgKL" }, "source": [ "### Create Location data (Location table and Location History table)" ] }, { "cell_type": "markdown", "metadata": { "id": "CYgnnW2egj0v" }, "source": [ "### Set the Location Table data (hard coded)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "kfQgzYDVOEqT" }, "outputs": [], "source": [ "sql = f\"\"\"ALTER TABLE `{PROJECT_ID}.{DATASET_ID}.location`\n", " ADD COLUMN IF NOT EXISTS city_id INT OPTIONS(description=\"The primary city for a location. 
For coffee trucks this is the current city.\"),\n", " ADD COLUMN IF NOT EXISTS current_latitude FLOAT64 OPTIONS(description=\"The current latitude of the coffee truck or physical store.\"),\n", " ADD COLUMN IF NOT EXISTS current_longitude FLOAT64 OPTIONS(description=\"The current longitude of the coffee truck or physical store.\"),\n", " ADD COLUMN IF NOT EXISTS current_lat_long GEOGRAPHY OPTIONS(description=\"The current latitude and longitude of the coffee truck or physical store.\")\n", " ;\"\"\"\n", "\n", "RunQuery(sql)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "EYdv-FgyRWxv" }, "outputs": [], "source": [ "\"\"\"\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 1 WHERE location_id = 1;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 2 WHERE location_id = 2;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 2 WHERE location_id = 3;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 3 WHERE location_id = 4;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 4 WHERE location_id = 5;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 4 WHERE location_id = 6;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 2 WHERE location_id = 7;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 4 WHERE location_id = 8;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 3 WHERE location_id = 9;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 3 WHERE location_id = 10;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 3 WHERE location_id = 11;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` 
SET city_id = 1 WHERE location_id = 12;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 2 WHERE location_id = 13;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 1 WHERE location_id = 14;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 3 WHERE location_id = 15;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 1 WHERE location_id = 16;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 4 WHERE location_id = 17;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 2 WHERE location_id = 18;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 3 WHERE location_id = 19;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 2 WHERE location_id = 20;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 2 WHERE location_id = 21;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 2 WHERE location_id = 22;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 3 WHERE location_id = 23;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 2 WHERE location_id = 24;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 1 WHERE location_id = 25;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 4 WHERE location_id = 26;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 4 WHERE location_id = 27;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 2 WHERE location_id = 28;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 2 WHERE 
location_id = 29;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 4 WHERE location_id = 30;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 4 WHERE location_id = 31;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 3 WHERE location_id = 32;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 3 WHERE location_id = 33;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 3 WHERE location_id = 34;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 1 WHERE location_id = 35;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 2 WHERE location_id = 36;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 1 WHERE location_id = 37;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 4 WHERE location_id = 38;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 4 WHERE location_id = 39;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 3 WHERE location_id = 40;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 1 WHERE location_id = 41;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 1 WHERE location_id = 42;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 2 WHERE location_id = 43;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 1 WHERE location_id = 44;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 1 WHERE location_id = 45;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 4 WHERE location_id = 46;\n", 
"UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 4 WHERE location_id = 47;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 1 WHERE location_id = 48;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 3 WHERE location_id = 49;\n", "UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET city_id = 4 WHERE location_id = 50;\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": { "id": "qVtZkcb4gZWs" }, "source": [ "### Create geo-location data for the Location table based upon the city" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Hn4ucbsCgYaQ" }, "outputs": [], "source": [ "sql = \"\"\"SELECT location_id,\n", " city_id\n", " FROM `${project_id}.${bigquery_data_beans_curated_dataset}.location`\n", " WHERE current_latitude IS NULL\n", " ORDER BY location_id\"\"\"\n", "\n", "df_process = client.query(sql).to_dataframe()\n", "\n", "for row in df_process.itertuples():\n", " location_id = row.location_id\n", " city_id = row.city_id\n", "\n", " city_name = \"\"\n", " match city_id:\n", " case 1:\n", " city_name = \"New York City\"\n", " case 2:\n", " city_name = \"London\"\n", " case 3:\n", " city_name = \"Tokyo\"\n", " case 4:\n", " city_name = \"San Francisco\"\n", " case _:\n", " city_name = \"ERROR\"\n", "\n", " llm_valid_execution = False\n", " while llm_valid_execution == False and city_name != \"ERROR\":\n", " try:\n", " prompt=f\"\"\"\n", " Generate a longitude and latitude for a coffee truck for the city {city_name}.\n", " Return the results as a SQL update statement:\n", " UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.location` SET current_latitude = ?, current_longitude = ?, current_lat_long = ST_GeogPoint(?, ?) 
WHERE location_id = {location_id};\n", " \"\"\"\n", "\n", " result = LLM(prompt, False, max_output_tokens=1024, temperature=0, top_p=0, top_k=1)\n", " result = result.replace(\"```sql\\n\",\"\").replace(\"\\n```\",\"\")\n", " print(f\"result: {result}\")\n", "\n", " llm_valid_execution = RunQuery(result)\n", " except Exception as error:\n", " print(\"An error occurred:\", error)" ] }, { "cell_type": "markdown", "metadata": { "id": "-5TdM4AUgfTJ" }, "source": [ "### Create the location history for each truck" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "N4ZBxn_iNJWQ" }, "outputs": [], "source": [ "sql = f\"\"\"\n", "CREATE TABLE IF NOT EXISTS `{PROJECT_ID}.{DATASET_ID}.location_history`\n", "(\n", " location_history_id INTEGER NOT NULL OPTIONS(description=\"Primary key.\"),\n", " location_id INTEGER NOT NULL OPTIONS(description=\"Foreign key: Location table.\"),\n", " city_id INTEGER NOT NULL OPTIONS(description=\"Foreign key: City table.\"),\n", " start_datetime TIMESTAMP NOT NULL OPTIONS(description=\"The start time of when this location was established.\"),\n", " stop_datetime TIMESTAMP NOT NULL OPTIONS(description=\"The end time of when this location was moved from.\"),\n", " latitude FLOAT64 NOT NULL OPTIONS(description=\"The latitude of the coffee truck.\"),\n", " longitude FLOAT64 NOT NULL OPTIONS(description=\"The longitude of the coffee truck.\"),\n", " lat_long GEOGRAPHY NOT NULL OPTIONS(description=\"The latitude and longitude of the coffee truck.\")\n", ")\n", "CLUSTER BY location_id;\n", "\"\"\"\n", "\n", "RunQuery(sql)" ] }, { "cell_type": "markdown", "metadata": { "id": "QRmg26z6fZ3F" }, "source": [ "### Customer Reviews" ] }, { "cell_type": "markdown", "metadata": { "id": "xpGwpvfcfZ3P" }, "source": [ "### Create customer reviews" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Dld0ZPKtfZ3Q" }, "outputs": [], "source": [ "rows_of_data_to_generate = 3\n", "\n", 
"table_name = \"customer\"\n", "field_name = \"customer_id\"\n", "customer_ids = GetForeignKeys(DATASET_ID, table_name, field_name)\n", "\n", "table_name = \"location\"\n", "field_name = \"location_id\"\n", "location_ids = GetForeignKeys(DATASET_ID, table_name, field_name)\n", "\n", "table_name = \"customer_review\"\n", "primary_key = \"customer_review_id\"\n", "\n", "schema = GetTableSchema(DATASET_ID, table_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Ye_DYSV4fZ3Q" }, "outputs": [], "source": [ "# location_ids='1,11,21'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "SLhtXcC8fZ3Q" }, "outputs": [], "source": [ "import random\n", "loop_count = 10000\n", "loop_index = 1\n", "\n", "while loop_index <= loop_count:\n", " print(f\"loop_index: {loop_index} | loop_count: {loop_count}\")\n", " starting_value = GetStartingValue(DATASET_ID, table_name, primary_key)\n", "\n", " if random.random() < .25:\n", " prompt=f\"\"\"\n", " You are a database engineer and need to generate data for a table for the below schema.\n", " You need to generate reviews for customers who have purchased your brewed coffee.\n", " Write a negative review in first person based upon the following: \"Bad Service\",\"Long Wait Time\",\"Slow Service\",\"Dirty\",\"Overpriced\",\"Overcrowded\",\"Noisy Location\",\"Lack of Allergen Information\",\"Inconsistent Quality\",\"Lack of Seating\",\"No Flavor\",\"Too weak\",\"Too strong\",\"Too bitter\",\"Limited Menu\"\n", " - The schema is for a Google Cloud BigQuery Table.\n", " - The table name is \"{PROJECT_ID}.{DATASET_ID}.{table_name}\".\n", " - Read the description of each field for valid values.\n", " - Do not preface the response with any special characters or 'sql'.\n", " - Generate {rows_of_data_to_generate} insert statements for this table.\n", " - Valid values for customer_id are: {customer_ids}\n", " - Valid values for location_id are: {location_ids}\n", " - The review_datetime is a date and 
"should be within the past 5 years.\n", " - The response for each question should be 20 to 100 words.\n", " - The starting value of the field {primary_key} is {starting_value}.\n", " - Only generate data for these fields: customer_review_id, customer_id, location_id, review_datetime, review_text, social_media_source, social_media_handle\n", " - Timestamps should use this format: 2020-06-02 23:57:12.120174 UTC.\n", "\n", " Examples:\n", " Example 1: INSERT INTO `my-dataset.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Sample'),(2, 'Sample');\n", " Example 2: INSERT INTO `my-dataset.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Data'),(2, 'Data'),(3, 'Data');\n", "\n", " Schema: {schema}\n", " \"\"\"\n", " else:\n", " prompt=f\"\"\"\n", " You are a database engineer and need to generate data for a table for the below schema.\n", " You need to generate reviews for customers who have purchased your brewed coffee.\n", " Write a positive or neutral review in first person based upon the following: \"Good Service\",\"Short Wait Time\",\"Fast Service\",\"Clean\",\"Good value\",\"Cozy Seating Areas\",\"Quiet Location\",\"Variety of Milk Alternatives\",\"Consistent Quality\",\"Lots of places to sit\",\"Lots of Flavor\",\"Good Taste\",\"Good Selection\"\n", " - The schema is for a Google Cloud BigQuery Table.\n", " - The table name is \"{PROJECT_ID}.{DATASET_ID}.{table_name}\".\n", " - Read the description of each field for valid values.\n", " - Do not preface the response with any special characters or 'sql'.\n", " - Generate {rows_of_data_to_generate} insert statements for this table.\n", " - Valid values for customer_id are: {customer_ids}\n", " - Valid values for location_id are: {location_ids}\n", " - The review_datetime is a date and should be within the past 5 years.\n", " - The response for each question should be 20 to 100 words.\n", " - The starting value of the field {primary_key} is {starting_value}.\n", " - Only generate data for these fields: customer_review_id, customer_id, location_id, review_datetime, 
review_text, social_media_source, social_media_handle\n", " - Timestamps should use this format: 2020-06-02 23:57:12.120174 UTC.\n", "\n", " Examples:\n", " Example 1: INSERT INTO `my-dataset.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Sample'),(2, 'Sample');\n", " Example 2: INSERT INTO `my-dataset.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Data'),(2, 'Data'),(3, 'Data');\n", "\n", " Schema: {schema}\n", " \"\"\"\n", "\n", " llm_valid_execution = False\n", " while llm_valid_execution == False:\n", " try:\n", " sql = LLM(prompt, False, max_output_tokens=1024, temperature=1, top_p=1, top_k=40)\n", " print(\"---------------------------------\")\n", " print(\"sql: \", sql)\n", " print(\"---------------------------------\")\n", " llm_valid_execution = RunQuery(sql)\n", " loop_index = loop_index + 1\n", " except Exception as error:\n", " print(\"An error occurred:\", error)" ] }, { "cell_type": "markdown", "metadata": { "id": "eHSGhbkZfZ3R" }, "source": [ "### Score the Sentiment" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UWBshLexfZ3R" }, "outputs": [], "source": [ "sql = \"\"\"SELECT customer_review_id,\n", " review_text\n", " FROM `${project_id}.${bigquery_data_beans_curated_dataset}.customer_review`\n", " WHERE review_sentiment IS NULL\n", " ORDER BY customer_review_id\n", "\"\"\"\n", "\n", "# Fields to update\n", "# review_sentiment,\n", "# gen_ai_recommended_action,\n", "# gen_ai_reponse,\n", "# human_response,\n", "# response_sent_action,\n", "# response_sent_date\n", "\n", "df_process = client.query(sql).to_dataframe()\n", "\n", "for row in df_process.itertuples():\n", " customer_review_id = row.customer_review_id\n", " review_text = row.review_text\n", "\n", " llm_valid_execution = False\n", " while llm_valid_execution == False:\n", " try:\n", " prompt=f\"\"\"\n", " For the given review classify the sentiment as Positive, Neutral or Negative.\n", " Only return one of these words: \"Positive\", \"Neutral\", 
\"Negative\"\n", " Review: {review_text}\n", " \"\"\"\n", " review_sentiment = LLM(prompt, False, max_output_tokens=10, temperature=0, top_p=0, top_k=1)\n", "\n", " sql = f\"\"\"UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.customer_review`\n", " SET review_sentiment = '{review_sentiment}'\n", " WHERE customer_review_id = {customer_review_id}\n", " \"\"\"\n", "\n", " print(sql)\n", "\n", " llm_valid_execution = RunQuery(sql)\n", " except Exception as error:\n", " print(\"An error occurred:\", error)" ] }, { "cell_type": "markdown", "metadata": { "id": "CqrzJy8ofn_V" }, "source": [ "### Detect Customer Themes" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "dRotgqBefn_e" }, "outputs": [], "source": [ "sql = \"\"\"SELECT customer_review_id,\n", " review_text\n", " FROM `${project_id}.${bigquery_data_beans_curated_dataset}.customer_review`\n", " WHERE llm_detected_theme IS NULL\n", " ORDER BY customer_review_id\"\"\"\n", "\n", "df_process = client.query(sql).to_dataframe()\n", "\n", "for row in df_process.itertuples():\n", " customer_review_id = row.customer_review_id\n", " review_text = row.review_text\n", "\n", " llm_valid_execution = False\n", " while llm_valid_execution == False:\n", " try:\n", " prompt=\"\"\"\n", " Classify the below customer review as one or more of the below themes.\n", " - Return the results in the below json format.\n", " - Include an explaination for selecting each theme.\n", " - Do not include double quotes in the explaination.\n", " - Do not include any special characters, double quotes or \"```json\" in the json output.\n", "\n", " Themes\n", " - \"Bad Service\"\n", " - \"Long Wait Time\"\n", " - \"Slow Service\"\n", " - \"Dirty\"\n", " - \"Overpriced\"\n", " - \"Overcrowded\"\n", " - \"Noisy Location\"\n", " - \"Lack of Allergen Information\"\n", " - \"Inconsistent Quality\"\n", " - \"Lack of Seating\"\n", " - \"No Flavor\"\n", " - \"Too weak\"\n", " - \"Too 
strong\"\n", " - \"Too bitter\"\n", " - \"Limited Menu\"\n", " - \"Good Service\"\n", " - \"Short Wait Time\"\n", " - \"Fast Service\"\n", " - \"Clean\"\n", " - \"Good value\"\n", " - \"Cozy Seating Areas\"\n", " - \"Quiet Location\"\n", " - \"Variety of Milk Alternatives\"\n", " - \"Consistent Quality\"\n", " - \"Lots of places to sit\"\n", " - \"Lots of Flavor\"\n", " - \"Good Taste\"\n", " - \"Good Selection\"\n", "\n", " JSON format: [{ \"theme\" : \"value\", \"explaination\" : \"llm explaination\" }]\n", " Sample JSON Response: [{ \"theme\" : \"Fast Service\", \"explaination\" : \"The customer got their order fast.\" }]\n", " Sample JSON Response: [{ \"theme\" : \"Overpriced\", \"explaination\" : \"The customer said it was too expensive.\" }]\n", "\n", " Review:\"\"\"\n", " prompt = prompt + review_text\n", "\n", " json_result = LLM(prompt, True, max_output_tokens=1024, temperature=0, top_p=0, top_k=1)\n", " print(f\"json_result: {json_result}\")\n", "\n", " if json_result == None:\n", " llm_valid_execution = False\n", " else:\n", " sql = f\"\"\"UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.customer_review`\n", " SET llm_detected_theme = JSON'{json_result}'\n", " WHERE customer_review_id = {customer_review_id}\n", " \"\"\"\n", "\n", " print(f\"sql: {sql}\")\n", "\n", " llm_valid_execution = RunQuery(sql)\n", " except Exception as error:\n", " print(\"An error occurred:\", error)" ] }, { "cell_type": "markdown", "metadata": { "id": "CZkWyPdPf191" }, "source": [ "### Gen AI Recommended Action" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "I7z5QNsTf19_" }, "outputs": [], "source": [ "sql = \"\"\"SELECT customer_review_id,\n", " review_text\n", " FROM `${project_id}.${bigquery_data_beans_curated_dataset}.customer_review`\n", " WHERE gen_ai_recommended_action IS NULL\n", " ORDER BY customer_review_id\"\"\"\n", "\n", "df_process = client.query(sql).to_dataframe()\n", "\n", "for row in df_process.itertuples():\n", " 
customer_review_id = row.customer_review_id\n", " review_text = row.review_text\n", "\n", " llm_valid_execution = False\n", " while llm_valid_execution == False:\n", " try:\n", " prompt=\"\"\"\n", " Select one of the following actions based upon the below customer review who purchased coffee.\n", " - First randomly sort the actions.\n", " - Select the best action based upon the sentiment of the review.\n", " - It is okay to use the action \"Send the customer a coupon\" for both positive and negative reviews.\n", " - Return the results the below json format.\n", " - Do not include any special characters or \"```json\" in the json output\n", "\n", " Actions\n", " - \"Thank the Customer\"\n", " - \"Apologize to the Customer\"\n", " - \"Send the customer a coupon\"\n", " - \"Call the customer\"\n", " - \"Promote Additional Products\"\n", " - \"Promise to Investigate\"\n", " - \"Encourage More Reviews\"\n", " - \"Invite Further Engagement\"\n", " - \"Reshare the review on other social media\"\n", "\n", " JSON format: { \"action\" : \"value\", \"explaination\" : \"llm explaination\" }\n", " Sample JSON Response: { \"action\" : \"Call the customer\", \"explaination\" : \"The customer left their phone number in the review.\" }\n", " Sample JSON Response: { \"action\" : \"Encourage More Reviews\", \"explaination\" : \"Thanks for the review, please keep posting.\" }\n", "\n", " Review:\"\"\"\n", " prompt = prompt + review_text\n", "\n", " json_result = LLM(prompt, True, max_output_tokens=1024, temperature=0, top_p=0, top_k=1)\n", " print(f\"json_result: {json_result}\")\n", "\n", " if json_result == None:\n", " llm_valid_execution = False\n", " else:\n", " sql = f\"\"\"UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.customer_review`\n", " SET gen_ai_recommended_action = '{json_result}'\n", " WHERE customer_review_id = {customer_review_id}\n", " \"\"\"\n", "\n", " print(f\"sql: {sql}\")\n", "\n", " llm_valid_execution = RunQuery(sql)\n", " except Exception as 
error:\n", " print(\"An error occurred:\", error)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "1EfLAJmrOdeA" }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "id": "oIGyz0ZmfT-p" }, "source": [ "### Gen AI Response" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "NknCJr4ejuV7" }, "outputs": [], "source": [ "sql = \"\"\"SELECT customer_review_id,\n", " review_text\n", " FROM `${project_id}.${bigquery_data_beans_curated_dataset}.customer_review`\n", " WHERE gen_ai_reponse IS NULL\n", " ORDER BY customer_review_id DESC\"\"\"\n", "\n", "df_process = client.query(sql).to_dataframe()\n", "\n", "for row in df_process.itertuples():\n", " customer_review_id = row.customer_review_id\n", " review_text = row.review_text\n", "\n", " llm_valid_execution = False\n", " while llm_valid_execution == False:\n", " try:\n", " prompt=f\"\"\"\n", " Generate responses to the below customer review who purchased coffee and return the results the below json format.\n", " The review can be positive, negative, or neutral.\n", " Provide a variety of responses, including thanking customers for positive reviews, addressing concerns in negative reviews, and engaging with neutral reviews.\n", " Please generate at least 5 different responses.\n", " Do not include newline characters in the response.\n", "\n", " JSON format: [ \"value\" ]\n", " Sample JSON Response: [ \"response 1\", \"response 2\", \"response 3\", \"response 4\", \"response 5\" ]\n", "\n", " Review: {review_text}\"\"\"\n", "\n", " json_result = LLM(prompt, True, max_output_tokens=1024, temperature=0, top_p=0, top_k=1)\n", " print(f\"json_result: {json_result}\")\n", "\n", " if json_result == None:\n", " llm_valid_execution = True\n", " llm_valid_execution = False\n", " else:\n", " sql = f\"\"\"UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.customer_review`\n", " SET gen_ai_reponse = '{json_result}'\n", " WHERE customer_review_id = 
{customer_review_id}\n",
" \"\"\"\n", "\n",
" print(f\"sql: {sql}\")\n", "\n",
" llm_valid_execution = RunQuery(sql)\n", " except Exception as error:\n", " print(\"An error occurred:\", error)" ] }, { "cell_type": "markdown", "metadata": { "id": "80fLTGc8BOji" }, "source": [ "### Learnings" ] }, { "cell_type": "markdown", "metadata": { "id": "Re1SPznGBRwB" }, "source": [ "Issues:\n",
"- LLMs take time for each call\n", "- Scaling to millions of rows is an issue (we can solve that with other solutions, ask us)\n", "- Output sometimes contains Markdown fences (``` or ```sql) that must be stripped\n", "- Dates are not always valid (e.g. \"2017-09-31\" cannot be cast to type DATE)\n", "- For UUIDs, use an INT first and swap to a UUID later (add a column and then do an update)\n", "- The LLM returns single quotes in strings; had to prompt around it or use string.replace\n", "- Probably need to use the Min/Max of the primary key for ints\n", "- Sometimes the LLM generates multiple INSERT INTO statements\n", "- Inserts sometimes come as many separate INSERT statements instead of a single INSERT with many VALUES\n", "\n",
"Learnings\n", "- The LLM can generate Small, Medium and Large (3 records for menu items) with pricing that is correct.\n", "- The LLM can understand the schema\n", "- The LLM can understand the description (valid values)" ] }, { "cell_type": "markdown", "metadata": { "id": "vyV1TXsNDkQr" }, "source": [ "**Coffee Tables**\n", "- Company (done)\n", "- Location (done)\n", "- Customer (done, needs loop)\n", "- Customer Profile (created table, 
needs code to populate LLM items)\n", "- Customer Survey (done, needs loop)\n", "- Customer Review\n", "- Menu\n", "- Order\n", "- Order Detail\n", "- Coffee Farm\n", "- Coffee Processor\n", "- Coffee Exporter\n", "- Coffee Roaster\n", "- Coffee Supply Chain\n", "\n", "**Streaming Data**\n", "- Coffee Sales\n", "- Customer Feedback\n", "- Location Telemetry\n", "\n", "\n" ] } ], "metadata": { "colab": { "collapsed_sections": [ "DszuLZoo9A7k", "6YeExbVqf4ZE", "CUPdwNf_89Gq", "Jr2sIqNP800E", "deSLhzKC8qHQ", "nlhRKtPAsAWz", "ylfzCAIzEALT", "WbxE7eDp-L1j", "uWUogwtAqNTT", "6W_Dr1xGqRML", "Wuv45clVnqUf", "KeDoqMQNAFcE", "0Bni86QYEusy", "6LvkDmsAZVuQ", "EG1vLI1rbY3o", "x0iOmN-ue1xl", "MJhOnDeYPgKL", "QRmg26z6fZ3F", "xpGwpvfcfZ3P", "eHSGhbkZfZ3R", "CqrzJy8ofn_V", "CZkWyPdPf191", "oIGyz0ZmfT-p", "80fLTGc8BOji" ], "name": "BigQuery table", "private_outputs": true, "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }