{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## <img src=\"https://lh3.googleusercontent.com/mUTbNK32c_DTSNrhqETT5aQJYFKok2HB1G2nk2MZHvG5bSs0v_lmDm_ArW7rgd6SDGHXo0Ak2uFFU96X6Xd0GQ=w160-h128\" width=\"45\" valign=\"top\" alt=\"BigQuery\"> Generating Synthetic Data (Customer Reviews) using GenAI\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### License" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "##################################################################################\n", "# Copyright 2024 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "# \n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "# \n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License.\n", "###################################################################################" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Notebook Overview" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- This notebook will use GenAI to generate fake customer reviews and then score each review's sentiment.\n", "\n", "- Quick Notes:\n", " - Why synthetic data:\n", " - No licensing fees\n", " - LLMs can read your schema and description fields, which means they understand what data to generate.\n", " - LLMs can generate fake names, addresses, and latitude/longitude based upon a fake address.\n", " - LLMs can generate positive, neutral, and negative reviews.\n", " - In order to generate synthetic data we need to make sure we can handle the following items:\n",
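"   (Aside: one way to guarantee a unique primary key is to let BigQuery mint one. A minimal sketch, not code from this notebook:)\n", "\n", "   ```sql\n", "   SELECT GENERATE_UUID() AS customer_review_id;\n", "   ```\n", "\n",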
" - Have a unique primary key (INTs cannot be duplicated; UUIDs are easier for synthetic generation)\n", " - Ensure that foreign key data is valid\n", "\n", "- Notebook Logic:\n", " 1. Get a list of valid customer foreign keys\n", " 2. Get a list of valid location foreign keys\n", " 3. Get the table schema from BigQuery (this also contains the description for each field)\n", " 4. Loop for the number of reviews to generate\n", " - Get the current maximum primary key and add one\n", " - Create our LLM prompt:\n", " - Determine if we are writing a negative or positive/neutral review\n", " - Provide a list of themes we want reviews based upon\n", " - Provide the prompt with a suggested range of review text words\n", " - Provide the prompt with our foreign keys\n", " - Provide the prompt with our primary key. Since we are asking for more than one review to be generated at a time, the LLM is smart enough to increment it for each row.\n", " - Tell the prompt that we want a single INSERT...INTO statement versus multiple.\n", " 5. Execute the generated customer review SQL statement against BigQuery. This will insert the data.\n", " 6. 
For the reviews generated\n", " - Create a prompt asking to determine the sentiment (Positive, Neutral or Negative)\n", " - Update BigQuery with the sentiment" ] }, { "cell_type": "markdown", "metadata": { "id": "DszuLZoo9A7k" }, "source": [ "## Initialize Python" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bhKxJadjWa1R" }, "outputs": [], "source": [ "import pandas as pd\n", "import json\n", "import bigframes.pandas as bf\n", "#from bigframesllm import BigFramesLLM\n", "from bigframes.ml.llm import GeminiTextGenerator" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "MSC6-rboip3h" }, "outputs": [], "source": [ "from google.cloud import bigquery\n", "client = bigquery.Client()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "lpU1P_fAXviJ" }, "outputs": [], "source": [ "PROJECT_ID = \"${project_id}\"\n", "REGION = \"us\"\n", "DATASET_ID = \"${bigquery_data_beans_curated_dataset}\"\n", "CONNECTION_NAME = \"vertex-ai\"\n", "\n", "connection = f\"{PROJECT_ID}.{REGION}.{CONNECTION_NAME}\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "qMijiOqUZmAp" }, "outputs": [], "source": [ "# bf.reset_session() # if you need to change the region\n", "bf.options.bigquery.project = PROJECT_ID\n", "bf.options.bigquery.location = REGION" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "zy7lRW09Ws1J" }, "outputs": [], "source": [ "session = bf.get_global_session()\n", "\n", "gemini_model = GeminiTextGenerator(session=session, connection_name=connection)" ] }, { "cell_type": "markdown", "metadata": { "id": "6YeExbVqf4ZE" }, "source": [ "## Supporting Functions" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "QtSLYNa_exfc" }, "outputs": [], "source": [ "def PrettyPrintJson(json_string):\n", " json_object = json.loads(json_string)\n", " json_formatted_str = json.dumps(json_object, indent=2)\n", " print(json_formatted_str)\n", " 
return json.dumps(json_object)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "sthZI1h2fe8H" }, "outputs": [], "source": [ "def LLM(prompt, isOutputJson, max_output_tokens=1024, temperature=0, top_p=0, top_k=1):\n", " print()\n", " print(\"Prompt: \", prompt)\n", " print()\n", " df_prompt = pd.DataFrame(\n", " {\n", " \"prompt\": [prompt],\n", " })\n", " bf_df_prompt = bf.read_pandas(df_prompt)\n", " prediction = gemini_model.predict(bf_df_prompt,\n", " max_output_tokens=max_output_tokens,\n", " temperature=temperature, # 0 to 1 (1 random)\n", " top_p=top_p, # 0 to 1 (1 random)\n", " top_k=top_k, # (1 to 40 random)\n", " ).to_pandas()\n", " try:\n", " result = prediction['ml_generate_text_llm_result'][0]\n", "\n", " # Strip the markdown code fences that LLMs commonly wrap around their output.\n", " # The longest marker comes first so shorter ones do not clip it.\n", " for fence in [\"```sql: bigquery\", \"```json\\n\", \"```JSON\\n\", \"```json\", \"```JSON\", \"```sql\\n\", \"```SQL\\n\", \"```sql\", \"```SQL\", \"```\"]:\n", " result = result.replace(fence, \"\")\n", "\n", " if isOutputJson:\n", " json_string = PrettyPrintJson(result)\n", " json_string = json_string.replace(\"'\",\"\\\\'\")\n", " json_string = json_string.strip()\n", " return json_string\n", " else:\n", " # result = result.replace(\"'\",\"\\\\'\")\n", " result = result.strip()\n", " return result\n", "\n", " except Exception as error:\n", " print(\"Error: \", error)\n", " print(\"Error (raw): \", prediction['ml_generate_text_llm_result'][0])\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_8p8PkhFibYw" }, "outputs": [], "source": [ "def GetTableSchema(dataset_name, table_name):\n", " import io\n", "\n", " dataset_ref = client.dataset(dataset_name, project=PROJECT_ID)\n", " 
table_ref = dataset_ref.table(table_name)\n", " table = client.get_table(table_ref)\n", "\n", " f = io.StringIO(\"\")\n", " client.schema_to_json(table.schema, f)\n", " return f.getvalue()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "HgMmeQSWvq6k" }, "outputs": [], "source": [ "def GetForeignKeys(dataset_name, table_name, field_name):\n", " sql = f\"\"\"\n", " SELECT STRING_AGG(CAST({field_name} AS STRING), \",\" ORDER BY {field_name}) AS result\n", " FROM `{PROJECT_ID}.{dataset_name}.{table_name}`\n", " WHERE {field_name} <= 10 -- demo hack since the database is fully populated\n", " \"\"\"\n", " #print(sql)\n", " df_result = client.query(sql).to_dataframe()\n", " #display(df_result)\n", " return df_result['result'].iloc[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "esIZXzKjsKIF" }, "outputs": [], "source": [ "def GetDistinctValues(dataset_name, table_name, field_name):\n", " sql = f\"\"\"\n", " SELECT STRING_AGG(DISTINCT {field_name}, \",\" ) AS result\n", " FROM `{PROJECT_ID}.{dataset_name}.{table_name}`\n", " \"\"\"\n", " #print(sql)\n", " df_result = client.query(sql).to_dataframe()\n", " #display(df_result)\n", " return df_result['result'].iloc[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Tto7sIdJ5vYi" }, "outputs": [], "source": [ "def GetStartingValue(dataset_name, table_name, field_name):\n", " sql = f\"\"\"\n", " SELECT IFNULL(MAX({field_name}),0) + 1 AS result\n", " FROM `{PROJECT_ID}.{dataset_name}.{table_name}`\n", " \"\"\"\n", " #print(sql)\n", " df_result = client.query(sql).to_dataframe()\n", " #display(df_result)\n", " return df_result['result'].iloc[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7sK-fbCaVMCZ" }, "outputs": [], "source": [ "def GetMaximumValue(dataset_name, table_name, field_name):\n", " sql = f\"\"\"\n", " SELECT IFNULL(MAX({field_name}),0) AS result\n", " FROM `{PROJECT_ID}.{dataset_name}.{table_name}`\n", 
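 " -- MAX over zero rows is NULL; IFNULL coerces it to 0 so an empty table is handled safely\n",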
" \"\"\"\n", " #print(sql)\n", " df_result = client.query(sql).to_dataframe()\n", " #display(df_result)\n", " return df_result['result'].iloc[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "DFqXl0cefaOe" }, "outputs": [], "source": [ "def RunQuery(sql):\n", " import time\n", "\n", " #return True # return early for now\n", "\n", " job_config = bigquery.QueryJobConfig(priority=bigquery.QueryPriority.INTERACTIVE)\n", " query_job = client.query(sql, job_config=job_config)\n", "\n", " # Check on the progress by getting the job's updated state.\n", " query_job = client.get_job(\n", " query_job.job_id, location=query_job.location\n", " )\n", " print(\"Job {} is currently in state {} with error result of {}\".format(query_job.job_id, query_job.state, query_job.error_result))\n", "\n", " while query_job.state != \"DONE\":\n", " time.sleep(2)\n", " query_job = client.get_job(\n", " query_job.job_id, location=query_job.location\n", " )\n", " print(\"Job {} is currently in state {} with error result of {}\".format(query_job.job_id, query_job.state, query_job.error_result))\n", "\n", " if query_job.error_result == None:\n", " return True\n", " else:\n", " return False" ] }, { "cell_type": "markdown", "metadata": { "id": "mOtL7RgiZox9" }, "source": [ "## Synthetic Data Generation" ] }, { "cell_type": "markdown", "metadata": { "id": "Mg8ZZEViq9lu" }, "source": [ "### Create customer reviews" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8VwUtOMh-L1j" }, "outputs": [], "source": [ "rows_of_data_to_generate = 3\n", "\n", "table_name = \"customer\"\n", "field_name = \"customer_id\"\n", "customer_ids = GetForeignKeys(DATASET_ID, table_name, field_name)\n", "\n", "table_name = \"location\"\n", "field_name = \"location_id\"\n", "location_ids = GetForeignKeys(DATASET_ID, table_name, field_name)\n", "\n", "table_name = \"customer_review\"\n", "primary_key = \"customer_review_id\"\n", "\n", "schema = GetTableSchema(DATASET_ID, 
table_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "LL6cYlwWCYQu" }, "outputs": [], "source": [ "# location_ids='1,11,21'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bVm36Q_1-L1k" }, "outputs": [], "source": [ "import random\n", "loop_count = 1 # can be set higher\n", "loop_index = 1\n", "\n", "while loop_index <= loop_count:\n", " print(f\"loop_index: {loop_index} | loop_count: {loop_count}\")\n", " starting_value = GetStartingValue(DATASET_ID, table_name, primary_key)\n", "\n", " if random.random() < .25:\n", " prompt=f\"\"\"\n", " You are a database engineer and need to write a single BigQuery SQL statement.\n", " You need to generate data for a table with the below schema.\n", " You need to generate reviews for customers who have purchased your brewed coffee.\n", " Write a negative review in first person based upon the following: \"Bad Service\",\"Long Wait Time\",\"Slow Service\",\"Dirty\",\"Overpriced\",\"Overcrowded\",\"Noisy Location\",\"Lack of Allergen Information\",\"Inconsistent Quality\",\"Lack of Seating\",\"No Flavor\",\"Too weak\",\"Too strong\",\"Too bitter\",\"Limited Menu\"\n", " - The schema is for a Google Cloud BigQuery Table.\n", " - The table name is \"{PROJECT_ID}.{DATASET_ID}.{table_name}\".\n", " - Read the description of each field for valid values.\n", " - Do not preface the response with any special characters or 'sql'.\n", " - Generate {rows_of_data_to_generate} rows for this table in a single INSERT statement.\n", " - Valid values for customer_id are: {customer_ids}\n", " - Valid values for location_id are: {location_ids}\n", " - The review_datetime is a date and should be within the past 5 years.\n", " - The review_text for each row should be 20 to 100 words.\n", " - The starting value of the field {primary_key} is {starting_value}.\n", " - Only generate data for these fields: customer_review_id, customer_id, location_id, review_datetime, review_text, social_media_source, social_media_handle\n", " 
- Timestamps should use this format: 2020-06-02 23:57:12.120174 UTC.\n", "\n", " Examples:\n", " Example 1: INSERT INTO `my-dataset.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Sample'),(2, 'Sample');\n", " Example 2: INSERT INTO `my-dataset.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Data'),(2, 'Data'),(3, 'Data');\n", "\n", " Schema: {schema}\n", " \"\"\"\n", " else:\n", " prompt=f\"\"\"\n", " You are a database engineer and need to write a single BigQuery SQL statement.\n", " You need to generate data for a table with the below schema.\n", " You need to generate reviews for customers who have purchased your brewed coffee.\n", " Write a positive or neutral review in first person based upon the following: \"Good Service\",\"Short Wait Time\",\"Fast Service\",\"Clean\",\"Good value\",\"Cozy Seating Areas\",\"Quiet Location\",\"Variety of Milk Alternatives\",\"Consistent Quality\",\"Lots of places to sit\",\"Lots of Flavor\",\"Good Taste\",\"Good Selection\"\n", " - The schema is for a Google Cloud BigQuery Table.\n", " - The table name is \"{PROJECT_ID}.{DATASET_ID}.{table_name}\".\n", " - Read the description of each field for valid values.\n", " - Do not preface the response with any special characters or 'sql'.\n", " - Generate {rows_of_data_to_generate} rows for this table in a single INSERT statement.\n", " - Valid values for customer_id are: {customer_ids}\n", " - Valid values for location_id are: {location_ids}\n", " - The review_datetime is a date and should be within the past 5 years.\n", " - The review_text for each row should be 20 to 100 words.\n", " - The starting value of the field {primary_key} is {starting_value}.\n", " - Only generate data for these fields: customer_review_id, customer_id, location_id, review_datetime, review_text, social_media_source, social_media_handle\n", " - Timestamps should use this format: 2020-06-02 23:57:12.120174 UTC.\n", "\n", " Examples:\n", " Example 1: INSERT INTO `my-dataset.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Sample'),(2, 'Sample');\n", " Example 2: INSERT INTO 
`my-dataset.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Data'),(2, 'Data'),(3, 'Data');\n", "\n", " Schema: {schema}\n", " \"\"\"\n", "\n", " llm_valid_execution = False\n", " temperature=.8\n", " while llm_valid_execution == False:\n", " try:\n", " sql = LLM(prompt, False, max_output_tokens=1024, temperature=temperature, top_p=1, top_k=40)\n", " print(\"---------------------------------\")\n", " print(\"sql: \", sql)\n", " print(\"---------------------------------\")\n", " llm_valid_execution = RunQuery(sql)\n", " if llm_valid_execution:\n", " loop_index = loop_index + 1\n", " except Exception as error:\n", " # Back off the temperature (but never below zero) and retry\n", " temperature = max(temperature - .1, 0)\n", " print(\"An error occurred:\", error)" ] }, { "cell_type": "markdown", "metadata": { "id": "uWUogwtAqNTT" }, "source": [ "### Score the Sentiment" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "fSB45vFQPnaM" }, "outputs": [], "source": [ "sql = \"\"\"SELECT customer_review_id,\n", " review_text\n", " FROM `${project_id}.${bigquery_data_beans_curated_dataset}.customer_review`\n", " WHERE review_sentiment IS NULL\n", " ORDER BY customer_review_id\n", "\"\"\"\n", "\n", "# Fields to update\n", "# review_sentiment,\n", "# gen_ai_recommended_action,\n", "# gen_ai_reponse,\n", "# human_response,\n", "# response_sent_action,\n", "# response_sent_date\n", "\n", "df_process = client.query(sql).to_dataframe()\n", "\n", "for row in df_process.itertuples():\n", " customer_review_id = row.customer_review_id\n", " review_text = row.review_text\n", "\n", " llm_valid_execution = False\n", " while llm_valid_execution == False:\n", " try:\n", " prompt=f\"\"\"\n", " For the given review, classify the sentiment as Positive, Neutral, or Negative.\n", " Only return one of these words: \"Positive\", \"Neutral\", \"Negative\"\n", " Review: {review_text}\n", " \"\"\"\n", " review_sentiment = LLM(prompt, False, max_output_tokens=10, temperature=0, top_p=0, top_k=1)\n", "\n", " sql = f\"\"\"UPDATE 
`${project_id}.${bigquery_data_beans_curated_dataset}.customer_review`\n", " SET review_sentiment = '{review_sentiment}'\n", " WHERE customer_review_id = {customer_review_id}\n", " \"\"\"\n", "\n", " print(sql)\n", "\n", " llm_valid_execution = RunQuery(sql)\n", " except Exception as error:\n", " print(\"An error occurred:\", error)" ] } ], "metadata": { "colab": { "collapsed_sections": [ "DszuLZoo9A7k", "6YeExbVqf4ZE" ], "name": "BigQuery table", "private_outputs": true, "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }