colab-enterprise/gen-ai-demo/Customer-Reviews-Synthetic-Data-Generation-GenAI.ipynb (585 lines of code) (raw):
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## <img src=\"https://lh3.googleusercontent.com/mUTbNK32c_DTSNrhqETT5aQJYFKok2HB1G2nk2MZHvG5bSs0v_lmDm_ArW7rgd6SDGHXo0Ak2uFFU96X6Xd0GQ=w160-h128\" width=\"45\" valign=\"top\" alt=\"BigQuery\"> Generating Synthetic Data (Customer Reviews) using GenAI\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### License"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"##################################################################################\n",
"# Copyright 2024 Google LLC\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"# \n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"# \n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License.\n",
"###################################################################################"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Notebook Overview"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- This notebook will use GenAI to generate fake customer reviews and then score the review's sentiment.\n",
"\n",
"- Quick Notes:\n",
" - Why synthetic data:\n",
" - No licensing fees\n",
" - LLMs can read your schema and description fields which means it understands what data to generate.\n",
" - LLMs can generate fake names, addresses, latitude/longitude based upon a fake address.\n",
" - LLMs can positive, neutrel and negative reviews.\n",
" - In order to generate synthetic data we need to make sure we can handle the following items:\n",
" - Have a unique primary key (INTs cannot be deplicated, UUIDs are easier to for synthetic generation)\n",
" - Ensure that foreign key data is valid\n",
"\n",
"- Notebook Logic:\n",
" 1. Get a list of valid customer foreign keys\n",
" 2. Get a list of valid location foreign keys\n",
" 3. Get the table schema from BigQuery (this also contains the description for each field)\n",
" 4. Loop for the number of review to generate\n",
" - Get the current maximum primary key and add one\n",
" - Create our LLM prompt:\n",
" - Determine if we are writing a negative or positive/neutral review \n",
" - Provide a list of themes we want reviews based upon\n",
" - Provide the prompt with a suggust range of reivew text words\n",
" - Provide the prompt with our foreign keys\n",
" - Provide the prompt with our primary key. And since we are asking for more the one review to be generated at a time, the LLM is smart enough to increment it for each row.\n",
" - Provide the prompt that we want a single INSERT..INTO statemend versus multiple.\n",
" 5. Execute the generated customer review SQL statement against BigQuery. This will insert the data.\n",
" 6. For the reviews generated\n",
" - Create a prompt asking to determine the sentiment (Positive, Neutral or Negative)\n",
" - Update BigQuery with the sentiment"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "DszuLZoo9A7k"
},
"source": [
"## Initialize Python"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "bhKxJadjWa1R"
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import json\n",
"import bigframes.pandas as bf\n",
"#from bigframesllm import BigFramesLLM\n",
"from bigframes.ml.llm import GeminiTextGenerator"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "MSC6-rboip3h"
},
"outputs": [],
"source": [
"from google.cloud import bigquery\n",
"client = bigquery.Client()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "lpU1P_fAXviJ"
},
"outputs": [],
"source": [
"PROJECT_ID = \"${project_id}\"\n",
"REGION = \"us\"\n",
"DATASET_ID = \"${bigquery_data_beans_curated_dataset}\"\n",
"CONNECTION_NAME = \"vertex-ai\"\n",
"\n",
"connection = f\"{PROJECT_ID}.{REGION}.{CONNECTION_NAME}\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "qMijiOqUZmAp"
},
"outputs": [],
"source": [
"# bf.reset_session() # if you need to change the region\n",
"bf.options.bigquery.project = PROJECT_ID\n",
"bf.options.bigquery.location = REGION"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "zy7lRW09Ws1J"
},
"outputs": [],
"source": [
"session = bf.get_global_session()\n",
"\n",
"gemini_model = GeminiTextGenerator(session=session, connection_name=connection)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6YeExbVqf4ZE"
},
"source": [
"## Supporting Functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "QtSLYNa_exfc"
},
"outputs": [],
"source": [
"def PrettyPrintJson(json_string):\n",
" json_object = json.loads(json_string)\n",
" json_formatted_str = json.dumps(json_object, indent=2)\n",
" print(json_formatted_str)\n",
" return json.dumps(json_object)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "sthZI1h2fe8H"
},
"outputs": [],
"source": [
"def LLM(prompt, isOutputJson, max_output_tokens=1024, temperature=0, top_p=0, top_k=1):\n",
" print()\n",
" print(\"Prompt: \", prompt)\n",
" print()\n",
" df_prompt = pd.DataFrame(\n",
" {\n",
" \"prompt\": [prompt],\n",
" })\n",
" bf_df_prompt = bf.read_pandas(df_prompt)\n",
" prediction = gemini_model.predict(bf_df_prompt,\n",
" max_output_tokens=max_output_tokens,\n",
" temperature=temperature, # 0 to 1 (1 random)\n",
" top_p=top_p, # 0 to 1 (1 random)\n",
" top_k=top_k, # (1 to 40 random)\n",
" ).to_pandas()\n",
" try:\n",
" # Remove common LLM output mistakes\n",
" result = prediction['ml_generate_text_llm_result'][0]\n",
"\n",
" result = result.replace(\"```json\\n\",\"\")\n",
" result = result.replace(\"```JSON\\n\",\"\")\n",
" result = result.replace(\"```json\",\"\")\n",
" result = result.replace(\"```JSON\",\"\")\n",
" result = result.replace(\"```sql\\n\",\"\")\n",
" result = result.replace(\"```SQL\\n\",\"\")\n",
" result = result.replace(\"```sql\",\"\")\n",
" result = result.replace(\"```sql: bigquery\",\"\") \n",
" result = result.replace(\"```SQL\",\"\")\n",
" result = result.replace(\"```\",\"\")\n",
"\n",
" if isOutputJson:\n",
" json_string = PrettyPrintJson(result)\n",
" json_string = json_string.replace(\"'\",\"\\\\'\")\n",
" json_string = json_string.strip()\n",
" return json_string\n",
" else:\n",
" # result = result.replace(\"'\",\"\\\\'\")\n",
" result = result.strip()\n",
" return result\n",
"\n",
" except:\n",
" print(\"Error (raw): \", prediction['ml_generate_text_llm_result'][0])\n",
" print(\"Error (result): \", result)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "_8p8PkhFibYw"
},
"outputs": [],
"source": [
"def GetTableSchema(dataset_name, table_name):\n",
" import io\n",
"\n",
" dataset_ref = client.dataset(dataset_name, project=PROJECT_ID)\n",
" table_ref = dataset_ref.table(table_name)\n",
" table = client.get_table(table_ref)\n",
"\n",
" f = io.StringIO(\"\")\n",
" client.schema_to_json(table.schema, f)\n",
" return f.getvalue()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "HgMmeQSWvq6k"
},
"outputs": [],
"source": [
"def GetForeignKeys(dataset_name, table_name, field_name):\n",
" sql = f\"\"\"\n",
" SELECT STRING_AGG(CAST({field_name} AS STRING), \",\" ORDER BY {field_name}) AS result\n",
" FROM `{PROJECT_ID}.{dataset_name}.{table_name}`\n",
" WHERE {field_name} <= 10 -- demo hack since the database is fully populated\n",
" \"\"\"\n",
" #print(sql)\n",
" df_result = client.query(sql).to_dataframe()\n",
" #display(df_result)\n",
" return df_result['result'].iloc[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "esIZXzKjsKIF"
},
"outputs": [],
"source": [
"def GetDistinctValues(dataset_name, table_name, field_name):\n",
" sql = f\"\"\"\n",
" SELECT STRING_AGG(DISTINCT {field_name}, \",\" ) AS result\n",
" FROM `{PROJECT_ID}.{dataset_name}.{table_name}`\n",
" \"\"\"\n",
" #print(sql)\n",
" df_result = client.query(sql).to_dataframe()\n",
" #display(df_result)\n",
" return df_result['result'].iloc[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Tto7sIdJ5vYi"
},
"outputs": [],
"source": [
"def GetStartingValue(dataset_name, table_name, field_name):\n",
" sql = f\"\"\"\n",
" SELECT IFNULL(MAX({field_name}),0) + 1 AS result\n",
" FROM `{PROJECT_ID}.{dataset_name}.{table_name}`\n",
" \"\"\"\n",
" #print(sql)\n",
" df_result = client.query(sql).to_dataframe()\n",
" #display(df_result)\n",
" return df_result['result'].iloc[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "7sK-fbCaVMCZ"
},
"outputs": [],
"source": [
"def GetMaximumValue(dataset_name, table_name, field_name):\n",
" sql = f\"\"\"\n",
" SELECT IFNULL(MAX({field_name}),0) AS result\n",
" FROM `{PROJECT_ID}.{dataset_name}.{table_name}`\n",
" \"\"\"\n",
" #print(sql)\n",
" df_result = client.query(sql).to_dataframe()\n",
" #display(df_result)\n",
" return df_result['result'].iloc[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "DFqXl0cefaOe"
},
"outputs": [],
"source": [
"def RunQuery(sql):\n",
" import time\n",
"\n",
" #return True # return early for now\n",
"\n",
" job_config = bigquery.QueryJobConfig(priority=bigquery.QueryPriority.INTERACTIVE)\n",
" query_job = client.query(sql, job_config=job_config)\n",
"\n",
" # Check on the progress by getting the job's updated state.\n",
" query_job = client.get_job(\n",
" query_job.job_id, location=query_job.location\n",
" )\n",
" print(\"Job {} is currently in state {} with error result of {}\".format(query_job.job_id, query_job.state, query_job.error_result))\n",
"\n",
" while query_job.state != \"DONE\":\n",
" time.sleep(2)\n",
" query_job = client.get_job(\n",
" query_job.job_id, location=query_job.location\n",
" )\n",
" print(\"Job {} is currently in state {} with error result of {}\".format(query_job.job_id, query_job.state, query_job.error_result))\n",
"\n",
" if query_job.error_result == None:\n",
" return True\n",
" else:\n",
" return False"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "mOtL7RgiZox9"
},
"source": [
"## Synthetic Data Generation"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Mg8ZZEViq9lu"
},
"source": [
"### Create customer reviews"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "8VwUtOMh-L1j"
},
"outputs": [],
"source": [
"rows_of_data_to_generate = 3\n",
"\n",
"table_name = \"customer\"\n",
"field_name = \"customer_id\"\n",
"customer_ids = GetForeignKeys(DATASET_ID, table_name, field_name)\n",
"\n",
"table_name = \"location\"\n",
"field_name = \"location_id\"\n",
"location_ids = GetForeignKeys(DATASET_ID, table_name, field_name)\n",
"\n",
"table_name = \"customer_review\"\n",
"primary_key = \"customer_review_id\"\n",
"\n",
"schema = GetTableSchema(DATASET_ID, table_name)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "LL6cYlwWCYQu"
},
"outputs": [],
"source": [
"# location_ids='1,11,21'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "bVm36Q_1-L1k"
},
"outputs": [],
"source": [
"import random\n",
"loop_count = 1 # can be set higher\n",
"loop_index = 1\n",
"\n",
"while loop_index <= loop_count:\n",
" print(f\"loop_index: {loop_index} | loop_count: {loop_count}\")\n",
" starting_value = GetStartingValue(DATASET_ID, table_name, primary_key)\n",
"\n",
" if random.random() < .25:\n",
" prompt=f\"\"\"\n",
" You are a database engineer and need write a single BigQuery SQL statement.\n",
" You to generate data for a table for the below schema.\n",
" You need to generate reviews for customers who have purchased your brewed coffee.\n",
" Write a negative in first person based upon the following: \"Bad Service\",\"Long Wait Time\",\"Slow Service\",\"Dirty\",\"Overpriced\",\"Overcrowded\",\"Noisy Location\",\"Lack of Allergan Information\",\"Inconsistent Quality\",\"Lack of Seating\",\"No Flavor\",\"Too weak\",\"Too strong\",\"Too bitter\",\"Limited Menu\"\n",
" - The schema is for a Google Cloud BigQuery Table.\n",
" - The table name is \"{PROJECT_ID}.{DATASET_ID}.{table_name}\".\n",
" - Read the description of each field for valid values.\n",
" - Do not preface the response with any special characters or 'sql'.\n",
" - Generate {rows_of_data_to_generate} insert statements for this table.\n",
" - Valid values for customer_id are: {customer_ids}\n",
" - Valid values for location_id are: {location_ids}\n",
" - The review_datetime is a date and should be within the past 5 years.\n",
" - The response for each question should be 20 to 100 words.\n",
" - The starting value of the field {primary_key} is {starting_value}.\n",
" - Only generate data for these fields: customer_review_id, customer_id, location_id, review_datetime, review_text, social_media_source, social_media_handle\n",
" - Timestamps should use this format: 2020-06-02 23:57:12.120174 UTC.\n",
"\n",
" Examples:\n",
" Example 1: INSERT INTO `my-dataset.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Sample'),(2, 'Sample');\n",
" Example 2: INSERT INTO `my-dataset.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Data'),(2, 'Data'),(3, 'Data');\n",
"\n",
" Schema: {schema}\n",
" \"\"\"\n",
" else:\n",
" prompt=f\"\"\"\n",
" You are a database engineer and need write a single BigQuery SQL statement.\n",
" You to generate data for a table for the below schema.\n",
" You need to generate reviews for customers who have purchased your brewed coffee.\n",
" Write a positive or neutral review in first person based upon the following: \"Good Service\",\"Short Wait Time\",\"Fast Service\",\"Clean\",\"Good value\",\"Cozy Seating Areas\",\"Quite Location\",\"Variety of Milk Alternatives\",\"Consistent Quality\",\"Lots of places to sit\",\"Lots of Flavor\",\"Good Taste\",\"Good Selection\"\n",
" - The schema is for a Google Cloud BigQuery Table.\n",
" - The table name is \"{PROJECT_ID}.{DATASET_ID}.{table_name}\".\n",
" - Read the description of each field for valid values.\n",
" - Do not preface the response with any special characters or 'sql'.\n",
" - Generate {rows_of_data_to_generate} insert statements for this table.\n",
" - Valid values for customer_id are: {customer_ids}\n",
" - The review_datetime is a date and should be within the past 5 years.\n",
" - The response for each question should be 20 to 100 words.\n",
" - The starting value of the field {primary_key} is {starting_value}.\n",
" - Only generate data for these fields: customer_review_id, customer_id, location_id, review_datetime, review_text, social_media_source, social_media_handle\n",
" - Timestamps should use this format: 2020-06-02 23:57:12.120174 UTC.\n",
"\n",
" Examples:\n",
" Example 1: INSERT INTO `my-dataset.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Sample'),(2, 'Sample');\n",
" Example 2: INSERT INTO `my-dataset.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Data'),(2, 'Data'),(3, 'Data');\n",
"\n",
" Schema: {schema}\n",
" \"\"\"\n",
"\n",
" llm_valid_execution = False\n",
" temperature=.8\n",
" while llm_valid_execution == False:\n",
" try:\n",
" sql = LLM(prompt, False, max_output_tokens=1024, temperature=temperature, top_p=1, top_k=40)\n",
" print(\"---------------------------------\")\n",
" print(\"sql: \", sql)\n",
" print(\"---------------------------------\")\n",
" llm_valid_execution = RunQuery(sql)\n",
" loop_index = loop_index + 1\n",
" except Exception as error:\n",
" temperature = temperature - .1\n",
" print(\"An error occurred:\", error)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "uWUogwtAqNTT"
},
"source": [
"### Score the Sentiment"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "fSB45vFQPnaM"
},
"outputs": [],
"source": [
"sql = \"\"\"SELECT customer_review_id,\n",
" review_text\n",
" FROM `${project_id}.${bigquery_data_beans_curated_dataset}.customer_review`\n",
" WHERE review_sentiment IS NULL\n",
" ORDER BY customer_review_id\n",
"\"\"\"\n",
"\n",
"# Fields to update\n",
"# review_sentiment,\n",
"# gen_ai_recommended_action,\n",
"# gen_ai_reponse,\n",
"# human_response,\n",
"# response_sent_action,\n",
"# response_sent_date\n",
"\n",
"df_process = client.query(sql).to_dataframe()\n",
"\n",
"for row in df_process.itertuples():\n",
" customer_review_id = row.customer_review_id\n",
" review_text = row.review_text\n",
"\n",
" llm_valid_execution = False\n",
" while llm_valid_execution == False:\n",
" try:\n",
" prompt=f\"\"\"\n",
" For the given review classify the sentiment as Positive, Neutral or Negative.\n",
" Only return one of these words: \"Positive\", \"Neutral\", \"Negative\"\n",
" Review: {review_text}\n",
" \"\"\"\n",
" review_sentiment = LLM(prompt, False, max_output_tokens=10, temperature=0, top_p=0, top_k=1)\n",
"\n",
" sql = f\"\"\"UPDATE `${project_id}.${bigquery_data_beans_curated_dataset}.customer_review`\n",
" SET review_sentiment = '{review_sentiment}'\n",
" WHERE customer_review_id = {customer_review_id}\n",
" \"\"\"\n",
"\n",
" print (sql)\n",
"\n",
" llm_valid_execution = RunQuery(sql)\n",
" llm_valid_execution = True\n",
" except Exception as error:\n",
" print(\"An error occurred:\", error)"
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [
"DszuLZoo9A7k",
"6YeExbVqf4ZE"
],
"name": "BigQuery table",
"private_outputs": true,
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}