qwiklabs/colab-enterprise/gen-ai-demo/Menu-Synthetic-Data-Generation-GenAI.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "k6eIqerFOzyj"
},
"source": [
"## <img src=\"https://lh3.googleusercontent.com/mUTbNK32c_DTSNrhqETT5aQJYFKok2HB1G2nk2MZHvG5bSs0v_lmDm_ArW7rgd6SDGHXo0Ak2uFFU96X6Xd0GQ=w160-h128\" width=\"45\" valign=\"top\" alt=\"BigQuery\"> Generate Synthetic Menu data and images using Gemini Pro\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### License"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"##################################################################################\n",
"# Copyright 2024 Google LLC\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"# \n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"# \n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License.\n",
"###################################################################################"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Notebook Overview"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- This notebook demonstrates how data engineers can increase the speed of their development work using LLMs. In the notebook, LLMs will be used to generate synthetic data, understand an image of an ERD to create the DDL, and use image generation to create realistic images of products.\n",
"\n",
"- Notebook Logic:\n",
" 1. Create the menu table by showing Gemini Pro a picture of the tables to create.\n",
"    1. First, download a picture of our ERD.\n",
"    2. Construct an LLM prompt telling it to read the ERD and generate our SQL.\n",
"    3. Execute the SQL to create the new table and its primary keys.\n",
" 2. Create an LLM prompt to generate some unique coffee names.\n",
" 3. Create an LLM prompt to generate the INSERT statements for our menu table based upon the unique coffee names.\n",
"    - The prompt can create small, medium and large sizes.\n",
"    - The LLM is smart enough to price each size correctly.\n",
"    - The LLM can generate our image prompt for Imagen2.\n",
"    - The prompt can pass in all the foreign and primary key values.\n",
" 4. Execute the generated SQL.\n",
" 5. Create an LLM prompt to create some food items.\n",
" 6. Execute the generated SQL.\n",
" 7. Read the menu items and the image prompts.\n",
" 8. Generate the coffee and food images.\n",
" 9. Upload the images to GCS."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8zy0eEJmHxRZ"
},
"source": [
"## Initialize Python"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "3wiruT266H3e"
},
"outputs": [],
"source": [
"project_id=\"${project_id}\"\n",
"location=\"us-central1\"\n",
"model_id = \"imagegeneration@005\"\n",
"\n",
"# No need to set these\n",
"city_names=[\"New York City\", \"London\", \"Tokyo\", \"San Francisco\"]\n",
"city_ids=[1,2,3,4]\n",
"city_languages=[\"American English\", \"British English\", \"Japanese\", \"American English\"]\n",
"number_of_coffee_trucks = \"4\"\n",
"\n",
"dataset_id = \"data_beans_synthetic_data\"\n",
"\n",
"gcs_storage_bucket = \"${data_beans_curated_bucket}\"\n",
"gcs_storage_path = \"data-beans/menu-images/\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "z4NpP0pCH0pj"
},
"outputs": [],
"source": [
"from PIL import Image\n",
"from IPython.display import HTML\n",
"import IPython.display\n",
"import google.auth\n",
"import google.auth.transport.requests\n",
"import requests\n",
"import json\n",
"import uuid\n",
"import base64\n",
"import os\n",
"import cv2\n",
"\n",
"from google.cloud import bigquery\n",
"client = bigquery.Client()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "YtZuFgjbOjso"
},
"source": [
"## Imagen2 / Gemini Pro / Gemini Pro Vision (Helper Functions)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "xUolPsMFOjpZ"
},
"source": [
"#### Imagen2"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "LPf6NurhNi2l"
},
"outputs": [],
"source": [
"def ImageGen(prompt):\n",
" creds, project = google.auth.default()\n",
" auth_req = google.auth.transport.requests.Request() # required to obtain an access token\n",
" creds.refresh(auth_req)\n",
" access_token=creds.token\n",
"\n",
" headers = {\n",
" \"Content-Type\" : \"application/json\",\n",
" \"Authorization\" : \"Bearer \" + access_token\n",
" }\n",
"\n",
" # https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/image-generation\n",
" url = f\"https://{location}-aiplatform.googleapis.com/v1/projects/{project}/locations/{location}/publishers/google/models/imagegeneration:predict\"\n",
"\n",
" payload = {\n",
" \"instances\": [\n",
" {\n",
" \"prompt\": prompt\n",
" }\n",
" ],\n",
" \"parameters\": {\n",
" \"sampleCount\": 1\n",
" }\n",
" }\n",
"\n",
" response = requests.post(url, json=payload, headers=headers)\n",
"\n",
" if response.status_code == 200:\n",
" image_data = json.loads(response.content)[\"predictions\"][0][\"bytesBase64Encoded\"]\n",
" image_data = base64.b64decode(image_data)\n",
" filename= str(uuid.uuid4()) + \".png\"\n",
" with open(filename, \"wb\") as f:\n",
" f.write(image_data)\n",
" print(\"Image generated OK.\")\n",
" return filename\n",
" else:\n",
" error = f\"Error with prompt:'{prompt}' Status:'{response.status_code}' Text:'{response.text}'\"\n",
" raise RuntimeError(error)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "E5CFSdK3HxYm"
},
"source": [
"#### Gemini Pro LLM"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "9jTBzcSIMbwg"
},
"outputs": [],
"source": [
"def GeminiProLLM(prompt, temperature = .8, topP = .8, topK = 40):\n",
"\n",
" if temperature < 0:\n",
" temperature = 0\n",
"\n",
" creds, project = google.auth.default()\n",
" auth_req = google.auth.transport.requests.Request() # required to obtain an access token\n",
" creds.refresh(auth_req)\n",
" access_token=creds.token\n",
"\n",
" headers = {\n",
" \"Content-Type\" : \"application/json\",\n",
" \"Authorization\" : \"Bearer \" + access_token\n",
" }\n",
"\n",
" # https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/gemini\n",
" url = f\"https://{location}-aiplatform.googleapis.com/v1/projects/{project_id}/locations/{location}/publishers/google/models/gemini-2.0-flash:streamGenerateContent\"\n",
"\n",
" payload = {\n",
" \"contents\": {\n",
" \"role\": \"user\",\n",
" \"parts\": {\n",
" \"text\": prompt\n",
" },\n",
" },\n",
" \"safety_settings\": {\n",
" \"category\": \"HARM_CATEGORY_SEXUALLY_EXPLICIT\",\n",
" \"threshold\": \"BLOCK_LOW_AND_ABOVE\"\n",
" },\n",
" \"generation_config\": {\n",
" \"temperature\": temperature,\n",
" \"topP\": topP,\n",
" \"topK\": topK,\n",
" \"maxOutputTokens\": 8192,\n",
" \"candidateCount\": 1\n",
" }\n",
" }\n",
"\n",
" response = requests.post(url, json=payload, headers=headers)\n",
"\n",
" if response.status_code == 200:\n",
" json_response = json.loads(response.content)\n",
" llm_response = \"\"\n",
" for item in json_response:\n",
" try:\n",
" llm_response = llm_response + item[\"candidates\"][0][\"content\"][\"parts\"][0][\"text\"]\n",
" except Exception as err:\n",
" print(f\"response.content: {response.content}\")\n",
" raise RuntimeError(err)\n",
"\n",
" # Remove the markdown code-fence characters (returned when asking for a JSON reply)\n",
" llm_response = llm_response.replace(\"```json\",\"\")\n",
" llm_response = llm_response.replace(\"```\",\"\")\n",
"\n",
" # print(f\"llm_response:\\n{llm_response}\")\n",
" return llm_response\n",
" else:\n",
" error = f\"Error with prompt:'{prompt}' Status:'{response.status_code}' Text:'{response.text}'\"\n",
" raise RuntimeError(error)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-L93udtrH1Oz"
},
"source": [
"#### Gemini Pro Vision LLM"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ecvrUyp0BcXg"
},
"outputs": [],
"source": [
"# Use Gemini with vision (image) input\n",
"def GeminiProVisionLLM(prompt, imageBase64, temperature = .4, topP = 1, topK = 32):\n",
"\n",
" if temperature < 0:\n",
" temperature = 0\n",
"\n",
" creds, project = google.auth.default()\n",
" auth_req = google.auth.transport.requests.Request() # required to obtain an access token\n",
" creds.refresh(auth_req)\n",
" access_token=creds.token\n",
"\n",
" headers = {\n",
" \"Content-Type\" : \"application/json\",\n",
" \"Authorization\" : \"Bearer \" + access_token\n",
" }\n",
"\n",
" # https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/gemini\n",
" url = f\"https://{location}-aiplatform.googleapis.com/v1/projects/{project_id}/locations/{location}/publishers/google/models/gemini-2.0-flash:streamGenerateContent\"\n",
"\n",
" payload = {\n",
" \"contents\": [\n",
" {\n",
" \"role\": \"user\",\n",
" \"parts\": [\n",
" {\n",
" \"text\": prompt\n",
" },\n",
" {\n",
" \"inlineData\": {\n",
" \"mimeType\": \"image/png\",\n",
" \"data\": f\"{imageBase64}\"\n",
" }\n",
" }\n",
" ]\n",
" }\n",
" ],\n",
" \"safety_settings\": {\n",
" \"category\": \"HARM_CATEGORY_SEXUALLY_EXPLICIT\",\n",
" \"threshold\": \"BLOCK_LOW_AND_ABOVE\"\n",
" },\n",
" \"generation_config\": {\n",
" \"temperature\": temperature,\n",
" \"topP\": topP,\n",
" \"topK\": topK,\n",
" \"maxOutputTokens\": 2048,\n",
" \"candidateCount\": 1\n",
" }\n",
" }\n",
"\n",
" response = requests.post(url, json=payload, headers=headers)\n",
"\n",
" if response.status_code == 200:\n",
" json_response = json.loads(response.content)\n",
" llm_response = \"\"\n",
" for item in json_response:\n",
" llm_response = llm_response + item[\"candidates\"][0][\"content\"][\"parts\"][0][\"text\"]\n",
"\n",
" # Remove the markdown code-fence characters (returned when asking for a JSON reply)\n",
" llm_response = llm_response.replace(\"```json\",\"\")\n",
" llm_response = llm_response.replace(\"```\",\"\")\n",
"\n",
" # print(f\"llm_response:\\n{llm_response}\")\n",
" return llm_response\n",
" else:\n",
" error = f\"Error with prompt:'{prompt}' Status:'{response.status_code}' Text:'{response.text}'\"\n",
" raise RuntimeError(error)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "QNz6pofvfXDS"
},
"outputs": [],
"source": [
"# Use Gemini with vision (image) input, accepting multiple image files\n",
"def GeminiProVisionMultipleFileLLM(prompt, image_prompt, temperature = .4, topP = 1, topK = 32):\n",
" creds, project = google.auth.default()\n",
" auth_req = google.auth.transport.requests.Request() # required to obtain an access token\n",
" creds.refresh(auth_req)\n",
" access_token=creds.token\n",
"\n",
" headers = {\n",
" \"Content-Type\" : \"application/json\",\n",
" \"Authorization\" : \"Bearer \" + access_token\n",
" }\n",
"\n",
" # https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/gemini\n",
" url = f\"https://{location}-aiplatform.googleapis.com/v1/projects/{project_id}/locations/{location}/publishers/google/models/gemini-2.0-flash:streamGenerateContent\"\n",
"\n",
"\n",
" parts = []\n",
" new_item = {\n",
" \"text\": prompt\n",
" }\n",
" parts.append(new_item)\n",
"\n",
" for item in image_prompt:\n",
" new_item = {\n",
" \"text\": f\"Image Name: {item['llm_image_filename']}:\\n\"\n",
" }\n",
" parts.append(new_item)\n",
" new_item = {\n",
" \"inlineData\": {\n",
" \"mimeType\": \"image/png\",\n",
" \"data\": item[\"llm_image_base64\"]\n",
" }\n",
" }\n",
" parts.append(new_item)\n",
"\n",
" payload = {\n",
" \"contents\": [\n",
" {\n",
" \"role\": \"user\",\n",
" \"parts\": parts\n",
" }\n",
" ],\n",
" \"safety_settings\": {\n",
" \"category\": \"HARM_CATEGORY_SEXUALLY_EXPLICIT\",\n",
" \"threshold\": \"BLOCK_LOW_AND_ABOVE\"\n",
" },\n",
" \"generation_config\": {\n",
" \"temperature\": temperature,\n",
" \"topP\": topP,\n",
" \"topK\": topK,\n",
" \"maxOutputTokens\": 2048,\n",
" \"candidateCount\": 1\n",
" }\n",
" }\n",
"\n",
" response = requests.post(url, json=payload, headers=headers)\n",
"\n",
" if response.status_code == 200:\n",
" json_response = json.loads(response.content)\n",
" llm_response = \"\"\n",
" for item in json_response:\n",
" llm_response = llm_response + item[\"candidates\"][0][\"content\"][\"parts\"][0][\"text\"]\n",
"\n",
" # Remove the markdown code-fence characters (returned when asking for a JSON reply)\n",
" llm_response = llm_response.replace(\"```json\",\"\")\n",
" llm_response = llm_response.replace(\"```\",\"\")\n",
"\n",
" # print(f\"llm_response:\\n{llm_response}\")\n",
" return llm_response\n",
" else:\n",
" error = f\"Error with prompt:'{prompt}' Status:'{response.status_code}' Text:'{response.text}'\"\n",
" raise RuntimeError(error)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "rVCY93IyXPoO"
},
"source": [
"#### SQL Functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "jHCtXYuRNU0p"
},
"outputs": [],
"source": [
"def RunQuery(sql):\n",
" import time\n",
"\n",
" if (sql.strip().upper().startswith(\"SELECT\") or sql.strip().upper().startswith(\"WITH\")):\n",
" df_result = client.query(sql).to_dataframe()\n",
" return df_result\n",
" else:\n",
" job_config = bigquery.QueryJobConfig(priority=bigquery.QueryPriority.INTERACTIVE)\n",
" query_job = client.query(sql, job_config=job_config)\n",
"\n",
" # Check on the progress by getting the job's updated state.\n",
" query_job = client.get_job(\n",
" query_job.job_id, location=query_job.location\n",
" )\n",
" print(\"Job {} is currently in state {} with error result of {}\".format(query_job.job_id, query_job.state, query_job.error_result))\n",
"\n",
" while query_job.state != \"DONE\":\n",
" time.sleep(2)\n",
" query_job = client.get_job(\n",
" query_job.job_id, location=query_job.location\n",
" )\n",
" print(\"Job {} is currently in state {} with error result of {}\".format(query_job.job_id, query_job.state, query_job.error_result))\n",
"\n",
" if query_job.error_result == None:\n",
" return True\n",
" else:\n",
" return False"
]
},
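{
"cell_type": "markdown",
"metadata": {},
"source": [
"The polling loop above can also be written with the client library's blocking `result()` call. The sketch below is illustrative only (the notebook keeps the explicit loop so the job state can be printed while it runs); the function name `RunStatementBlocking` is ours, not part of the demo."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def RunStatementBlocking(sql):\n",
"    # result() blocks until the job finishes and raises on failure,\n",
"    # which replaces the manual sleep/poll loop above.\n",
"    query_job = client.query(sql)\n",
"    query_job.result()\n",
"    return query_job.error_result is None"
]
},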
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "p9j1GdAwNifB"
},
"outputs": [],
"source": [
"def GetNextPrimaryKey(fully_qualified_table_name, field_name):\n",
" sql = f\"\"\"\n",
" SELECT IFNULL(MAX({field_name}),0) AS result\n",
" FROM `{fully_qualified_table_name}`\n",
" \"\"\"\n",
" # print(sql)\n",
" df_result = client.query(sql).to_dataframe()\n",
" # display(df_result)\n",
" return df_result['result'].iloc[0] + 1"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ws3i8HVGkk5p"
},
"outputs": [],
"source": [
"def GetTableSchema(dataset_name, table_name):\n",
" import io\n",
"\n",
" dataset_ref = client.dataset(dataset_name, project=project_id)\n",
" table_ref = dataset_ref.table(table_name)\n",
" table = client.get_table(table_ref)\n",
"\n",
" f = io.StringIO(\"\")\n",
" client.schema_to_json(table.schema, f)\n",
" return f.getvalue()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Qx0q3yUNl63e"
},
"outputs": [],
"source": [
"def GetStartingValue(dataset_name, table_name, field_name):\n",
" sql = f\"\"\"\n",
" SELECT IFNULL(MAX({field_name}),0) + 1 AS result\n",
" FROM `{project_id}.{dataset_name}.{table_name}`\n",
" \"\"\"\n",
" #print(sql)\n",
" df_result = client.query(sql).to_dataframe()\n",
" #display(df_result)\n",
" return df_result['result'].iloc[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "6z5Fc5Mel6sH"
},
"outputs": [],
"source": [
"def GetForeignKeys(dataset_name, table_name, field_name):\n",
" sql = f\"\"\"\n",
" SELECT STRING_AGG(CAST({field_name} AS STRING), \",\" ORDER BY {field_name}) AS result\n",
" FROM `{project_id}.{dataset_name}.{table_name}`\n",
" \"\"\"\n",
" #print(sql)\n",
" df_result = client.query(sql).to_dataframe()\n",
" #display(df_result)\n",
" return df_result['result'].iloc[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "9-6THGgjnDKg"
},
"outputs": [],
"source": [
"def GetDistinctValues(dataset_name, table_name, field_name):\n",
" sql = f\"\"\"\n",
" SELECT STRING_AGG(DISTINCT {field_name}, \",\" ) AS result\n",
" FROM `{project_id}.{dataset_name}.{table_name}`\n",
" \"\"\"\n",
" #print(sql)\n",
" df_result = client.query(sql).to_dataframe()\n",
" #display(df_result)\n",
" return df_result['result'].iloc[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Run DDL\n",
"# This has been done to make the code re-runnable for the demo\n",
"# Logic for the PKs and FKs still needs to be added; the prompt will be updated to generate it\n",
"\n",
"def RunDDL(sql):\n",
" import time\n",
"\n",
" sql = f\"\"\"CREATE SCHEMA IF NOT EXISTS `{project_id}.{dataset_id}`;\n",
"\n",
"CREATE TABLE IF NOT EXISTS `{project_id}.{dataset_id}.menu`\n",
"(\n",
" menu_id INT64 NOT NULL OPTIONS(description=\"Primary key. Menu table.\"),\n",
" company_id INT64 NOT NULL OPTIONS(description=\"Foreign key: Company table.\"),\n",
" item_name STRING NOT NULL OPTIONS(description=\"The name of the menu item.\"),\n",
" item_price FLOAT64 NOT NULL OPTIONS(description=\"The price of the menu item.\"),\n",
" item_description STRING NOT NULL OPTIONS(description=\"The description of the menu item.\"),\n",
" item_size STRING NOT NULL OPTIONS(description=\"The size of the menu item.\"),\n",
" llm_item_description_prompt STRING OPTIONS(description=\"LLM prompt used to generate the menu item description.\"),\n",
" llm_item_description STRING OPTIONS(description=\"LLM-generated description of the menu item.\"),\n",
" llm_item_image_prompt STRING OPTIONS(description=\"LLM prompt used to generate the menu item image.\"),\n",
" llm_item_image_url STRING OPTIONS(description=\"URL of the LLM-generated menu item image.\")\n",
")\n",
"CLUSTER BY menu_id;\n",
"\n",
"CREATE TABLE IF NOT EXISTS `{project_id}.{dataset_id}.company`\n",
"(\n",
"company_id INT64 NOT NULL OPTIONS(description=\"Primary key. Company table.\"),\n",
"company_name STRING NOT NULL OPTIONS(description=\"The name of the company.\")\n",
")\n",
"CLUSTER BY company_id;\n",
"\n",
"ALTER TABLE `{project_id}.{dataset_id}.menu` DROP PRIMARY KEY IF EXISTS;\n",
"ALTER TABLE `{project_id}.{dataset_id}.menu` ADD PRIMARY KEY (menu_id) NOT ENFORCED;\n",
"\n",
"ALTER TABLE `{project_id}.{dataset_id}.company` DROP PRIMARY KEY IF EXISTS;\n",
"ALTER TABLE `{project_id}.{dataset_id}.company` ADD PRIMARY KEY (company_id) NOT ENFORCED;\n",
"\n",
"ALTER TABLE `{project_id}.{dataset_id}.menu` DROP CONSTRAINT IF EXISTS `menu.fk$1`;\n",
"ALTER TABLE `{project_id}.{dataset_id}.menu` ADD FOREIGN KEY (company_id) REFERENCES `{project_id}.{dataset_id}.company`(company_id) NOT ENFORCED;\n",
" \"\"\"\n",
"\n",
" # To see the constraints in BigQuery\n",
" # SELECT * FROM `data-beans-demo-5r3n4jbe01.data_beans_synthetic_data.INFORMATION_SCHEMA.TABLE_CONSTRAINTS`\n",
"\n",
"\n",
" job_config = bigquery.QueryJobConfig(priority=bigquery.QueryPriority.INTERACTIVE)\n",
" query_job = client.query(sql, job_config=job_config)\n",
"\n",
" # Check on the progress by getting the job's updated state.\n",
" query_job = client.get_job(\n",
" query_job.job_id, location=query_job.location\n",
" )\n",
" print(\"Job {} is currently in state {} with error result of {}\".format(query_job.job_id, query_job.state, query_job.error_result))\n",
"\n",
" while query_job.state != \"DONE\":\n",
" time.sleep(2)\n",
" query_job = client.get_job(\n",
" query_job.job_id, location=query_job.location\n",
" )\n",
" print(\"Job {} is currently in state {} with error result of {}\".format(query_job.job_id, query_job.state, query_job.error_result))\n",
"\n",
" if query_job.error_result == None:\n",
" return True\n",
" else:\n",
" return False"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "BlxddNzpmAgp"
},
"source": [
"#### Helper Functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "E1yrPjvVXNCz"
},
"outputs": [],
"source": [
"def convert_png_to_base64(image_path):\n",
" image = cv2.imread(image_path)\n",
"\n",
" # Convert the image to a base64 string.\n",
" _, buffer = cv2.imencode('.png', image)\n",
" base64_string = base64.b64encode(buffer).decode('utf-8')\n",
"\n",
" return base64_string"
]
},
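{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since the downloaded ERD file is already PNG-encoded, the `cv2` decode/re-encode round trip above is not strictly required. A minimal alternative sketch (the function name `convert_file_to_base64` is ours and is not used elsewhere in this notebook) reads the raw bytes directly:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def convert_file_to_base64(image_path):\n",
"    # Read the raw bytes; no image decoding is needed because the\n",
"    # file on disk is already a valid PNG.\n",
"    with open(image_path, \"rb\") as f:\n",
"        return base64.b64encode(f.read()).decode(\"utf-8\")"
]
},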
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ylq2crklNuCB"
},
"outputs": [],
"source": [
"# This was generated by GenAI\n",
"\n",
"def copy_file_to_gcs(local_file_path, bucket_name, destination_blob_name):\n",
" \"\"\"Copies a file from a local drive to a GCS bucket.\n",
"\n",
" Args:\n",
" local_file_path: The full path to the local file.\n",
" bucket_name: The name of the GCS bucket to upload to.\n",
" destination_blob_name: The desired name of the uploaded file in the bucket.\n",
"\n",
" Returns:\n",
" None\n",
" \"\"\"\n",
"\n",
" import os\n",
" from google.cloud import storage\n",
"\n",
" # Ensure the file exists locally\n",
" if not os.path.exists(local_file_path):\n",
" raise FileNotFoundError(f\"Local file '{local_file_path}' not found.\")\n",
"\n",
" # Create a storage client\n",
" storage_client = storage.Client()\n",
"\n",
" # Get a reference to the bucket\n",
" bucket = storage_client.bucket(bucket_name)\n",
"\n",
" # Create a blob object with the desired destination path\n",
" blob = bucket.blob(destination_blob_name)\n",
"\n",
" # Upload the file from the local filesystem\n",
" content_type = \"\"\n",
" if local_file_path.endswith(\".html\"):\n",
" content_type = \"text/html; charset=utf-8\"\n",
"\n",
" if local_file_path.endswith(\".json\"):\n",
" content_type = \"application/json; charset=utf-8\"\n",
"\n",
" if content_type == \"\":\n",
" blob.upload_from_filename(local_file_path)\n",
" else:\n",
" blob.upload_from_filename(local_file_path, content_type = content_type)\n",
"\n",
" print(f\"File '{local_file_path}' uploaded to GCS bucket '{bucket_name}' as '{destination_blob_name}. Content-Type: {content_type}'.\")"
]
},
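{
"cell_type": "markdown",
"metadata": {},
"source": [
"The chain of `endswith` checks above can be condensed into a suffix-to-content-type mapping. The helper below is a sketch only (the name `guess_content_type` and the `.png` entry are our additions, not part of the demo):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def guess_content_type(file_path):\n",
"    # Map known file suffixes to explicit content types; return None to\n",
"    # let the storage client pick its default.\n",
"    suffix_map = {\n",
"        \".html\": \"text/html; charset=utf-8\",\n",
"        \".json\": \"application/json; charset=utf-8\",\n",
"        \".png\": \"image/png\",\n",
"    }\n",
"    for suffix, content_type in suffix_map.items():\n",
"        if file_path.endswith(suffix):\n",
"            return content_type\n",
"    return None"
]
},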
{
"cell_type": "markdown",
"metadata": {
"id": "MOnML8jpdwzg"
},
"source": [
"## Menu Synthetic Data and Image Generation"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "xk2c2lvsnHZI"
},
"source": [
"#### Download the ERD image and Generate our Database Schema DDL"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "0E29ZF_newIa"
},
"outputs": [],
"source": [
"import requests\n",
"import shutil\n",
"\n",
"menu_erd_filename = \"Data-Beans-Menu-ERD.png\"\n",
"\n",
"# Specify the image URL\n",
"img_url = f\"https://storage.googleapis.com/data-analytics-golden-demo/data-beans/v1/colab-supporting-images/{menu_erd_filename}\"\n",
"\n",
"# Send a GET request to fetch the image\n",
"response = requests.get(img_url, stream=True)\n",
"\n",
"# Check for successful download\n",
"if response.status_code == 200:\n",
" # Set decode_content to True so gzip/deflate-encoded responses are decompressed\n",
" response.raw.decode_content = True\n",
"\n",
" # Open a local file in binary write mode\n",
" with open(menu_erd_filename, \"wb\") as f:\n",
" # Copy image data to the local file in chunks\n",
" shutil.copyfileobj(response.raw, f)\n",
"\n",
" print(\"Image downloaded successfully!\")\n",
"else:\n",
" print(\"Image download failed with status code:\", response.status_code)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "JZm_BtyiflQh"
},
"outputs": [],
"source": [
"print(f\"Filename: {menu_erd_filename}\")\n",
"img = Image.open(menu_erd_filename)\n",
"img.thumbnail([800,389]) # width, height\n",
"IPython.display.display(img)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "MJS1jALgeBL-"
},
"outputs": [],
"source": [
"llm_erd_prompt=f\"\"\"Use BigQuery SQL commands to create the following:\n",
"- Create a new BigQuery schema named \"{dataset_id}\".\n",
"- Use only BigQuery datatypes. Double and triple check this since it causes a lot of errors.\n",
"- Create the BigQuery DDLs for the attached ERD.\n",
"- Create primary keys for each table using the ALTER command. Use the \"NOT ENFORCED\" keyword.\n",
"- Create foreign keys for each table using the ALTER command. Use the \"NOT ENFORCED\" keyword.\n",
"- For each field add an OPTIONS for the description.\n",
"- Cluster the table by the primary key.\n",
"- For columns that can be null do not add \"NULL\" to the created SQL statement. BigQuery leaves this blank.\n",
"- All ALTER TABLE statements should be at the bottom of the generated script.\n",
"- The ALTER TABLE statements should be ordered with the primary key statements first and then the foreign key statements. Order matters!\n",
"- Double check your work especially that you used ONLY BigQuery data types.\n",
"\n",
"Previous Errors that have been generated by this script. Be sure to check your work to avoid encountering these.\n",
"- Query error: Type not found: FLOAT at [6:12]\n",
"- Query error: Table test.company does not have Primary Key constraints at [25:1]\n",
"\n",
"Example:\n",
"CREATE TABLE IF NOT EXISTS `{project_id}.{dataset_id}.customer`\n",
"(\n",
" customer_id INTEGER NOT NULL OPTIONS(description=\"Primary key. Customer table.\"),\n",
" country_id INTEGER NOT NULL OPTIONS(description=\"Foreign key: Country table.\"),\n",
" customer_llm_summary STRING NOT NULL OPTIONS(description=\"LLM generated summary of customer data.\"),\n",
" customer_lifetime_value STRING NOT NULL OPTIONS(description=\"Total sales for this customer.\"),\n",
" customer_cluster_id FLOAT NOT NULL OPTIONS(description=\"Clustering algorithm id.\"),\n",
" customer_review_llm_summary STRING OPTIONS(description=\"LLM summary of all of the customer reviews.\"),\n",
" customer_survey_llm_summary STRING OPTIONS(description=\"LLM summary of all of the customer surveys.\")\n",
")\n",
"CLUSTER BY customer_id;\n",
"\n",
"CREATE TABLE IF NOT EXISTS `{project_id}.{dataset_id}.country`\n",
"(\n",
"country_id INTEGER NOT NULL OPTIONS(description=\"Primary key. Country table.\"),\n",
"country_name STRING NOT NULL OPTIONS(description=\"The name of the country.\")\n",
")\n",
"CLUSTER BY country_id;\n",
"\n",
"\n",
"ALTER TABLE `{project_id}.{dataset_id}.customer` ADD PRIMARY KEY (customer_id) NOT ENFORCED;\n",
"ALTER TABLE `{project_id}.{dataset_id}.country` ADD PRIMARY KEY (country_id) NOT ENFORCED;\n",
"\n",
"ALTER TABLE `{project_id}.{dataset_id}.customer` ADD FOREIGN KEY (country_id) REFERENCES `{project_id}.{dataset_id}.country`(country_id) NOT ENFORCED;\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "gI-WIjuOh9pd"
},
"outputs": [],
"source": [
"imageBase64 = convert_png_to_base64(menu_erd_filename)\n",
"\n",
"llm_response = GeminiProVisionLLM(llm_erd_prompt, imageBase64, temperature=.2, topP=1, topK=32)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "q9De2HudibHl"
},
"outputs": [],
"source": [
"# These data type fixes should eventually be handled by the prompt itself\n",
"llm_response = llm_response.replace(\"STRING NULL OPTIONS\",\"STRING OPTIONS\")\n",
"llm_response = llm_response.replace(\"JSON NULL OPTIONS\",\"JSON OPTIONS\")\n",
"llm_response = llm_response.replace(\"BOOLEAN NULL OPTIONS\",\"BOOLEAN OPTIONS\")\n",
"\n",
"# Run this by hand (it will create a new dataset so we do not overwrite our main demo dataset)\n",
"print(llm_response)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "iCgstDOxmY2a"
},
"outputs": [],
"source": [
"RunDDL(llm_response)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Iuv25Ot7kH9t"
},
"source": [
"### Generate Data"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "YRChhVaUms-n"
},
"source": [
"#### Company Table"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "T3DMardzkVN2"
},
"outputs": [],
"source": [
"company_count = 1\n",
"\n",
"table_name = \"company\"\n",
"primary_key = \"company_id\"\n",
"\n",
"schema = GetTableSchema(dataset_id, table_name)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "5uhJ9oqwkykN"
},
"outputs": [],
"source": [
"company_names_prompt = f\"\"\"Generate {company_count} creative names and return in the below json format.\n",
"- The name should be new and not a name that is already used by an existing coffee company.\n",
"- The name should be related to coffee.\n",
"- The name should be related to a food truck type of service.\n",
"\n",
"JSON format: [ \"value\" ]\n",
"Sample JSON Response: [ \"value1\", \"value2\" ]\n",
"\"\"\"\n",
"\n",
"llm_success = False\n",
"temperature=.8\n",
"while llm_success == False:\n",
" try:\n",
" company_names = GeminiProLLM(company_names_prompt, temperature=temperature, topP=.8, topK = 40)\n",
" company_names_json = json.loads(company_names)\n",
" llm_success = True\n",
" except Exception:\n",
" # Reduce the temperature for more accurate generation\n",
" temperature = temperature - .05\n",
" print(\"Regenerating...\")\n",
"\n",
"print(f\"company_names: {company_names}\")"
]
},
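{
"cell_type": "markdown",
"metadata": {},
"source": [
"The retry loop above lowers the temperature after each failure but never gives up. A bounded variant is sketched below (the `max_attempts` limit and the function name `generate_json_with_retry` are our additions, assuming the same `GeminiProLLM` helper):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def generate_json_with_retry(prompt, max_attempts=5, temperature=.8):\n",
"    # Retry with a decreasing temperature, giving up after max_attempts\n",
"    # so an unparseable response cannot loop forever.\n",
"    for attempt in range(max_attempts):\n",
"        try:\n",
"            response = GeminiProLLM(prompt, temperature=temperature, topP=.8, topK=40)\n",
"            return json.loads(response)\n",
"        except json.JSONDecodeError:\n",
"            temperature = max(temperature - .05, 0)\n",
"            print(\"Regenerating...\")\n",
"    raise RuntimeError(f\"No valid JSON after {max_attempts} attempts\")"
]
},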
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "v22VV9nx0ruY"
},
"outputs": [],
"source": [
"# Override the generated names with a fixed value so the demo data is deterministic\n",
"company_names = [\"Data Beans\"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "wgDPSRgPlFRo"
},
"outputs": [],
"source": [
"starting_value = GetStartingValue(dataset_id, table_name, primary_key)\n",
"\n",
"company_names_sql_prompt=f\"\"\"\n",
"You are a database engineer and need to generate data for a table for the below schema.\n",
"- The schema is for a Google Cloud BigQuery Table.\n",
"- The table name is \"{project_id}.{dataset_id}.{table_name}\".\n",
"- Read the description of each field for valid values.\n",
"- Do not preface the response with any special characters or 'sql'.\n",
"- Generate {company_count} insert statements for this table.\n",
"- Valid values for company_name are: {company_names}\n",
"- The starting value of the field {primary_key} is {starting_value}.\n",
"- Only generate a single statement, not multiple INSERTs.\n",
"\n",
"\n",
"Example 1: INSERT INTO `my-dataset.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Sample'),(2, 'Sample');\n",
"Example 2: INSERT INTO `my-dataset.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Data'),(2, 'Data'),(3, 'Data');\n",
"\n",
"Schema: {schema}\n",
"\"\"\"\n",
"\n",
"llm_success = False\n",
"temperature=.8\n",
"while llm_success == False:\n",
" try:\n",
" sql = GeminiProLLM(company_names_sql_prompt, temperature=temperature, topP=.8, topK = 40)\n",
" print(f\"SQL: {sql}\")\n",
"\n",
" # Only run the INSERT when the table is empty (starting value of 1) so re-runs do not duplicate data\n",
" if starting_value == 1:\n",
" llm_success = RunQuery(sql)\n",
" else:\n",
" llm_success = True\n",
" except Exception:\n",
" # Reduce the temperature for more accurate generation\n",
" temperature = temperature - .05\n",
" print(\"Regenerating...\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "f0vcij24m00o"
},
"source": [
"#### Menu Table (Coffee)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "fDH4tKa6kLIF"
},
"outputs": [],
"source": [
"table_name = \"company\"\n",
"field_name = \"company_id\"\n",
"company_ids = GetForeignKeys(dataset_id, table_name, field_name)\n",
"\n",
"table_name = \"menu\"\n",
"primary_key = \"menu_id\"\n",
"\n",
"schema = GetTableSchema(dataset_id, table_name)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "MhFoKlYam9dw"
},
"outputs": [],
"source": [
"menu_count = 3 # We multiply this by 3 to get 9 rows (one each for the small, medium and large sizes)\n",
"\n",
"table_name = \"menu\"\n",
"field_name = \"item_name\"\n",
"existing_values = GetDistinctValues(dataset_id, table_name, field_name)\n",
"\n",
"menu_items_prompt = f\"\"\"Generate {menu_count} different coffee drink names and return in the below json format.\n",
"- The name can be an existing coffee drink or think outside the box for something new.\n",
"- The name should be related to coffee.\n",
"- Do not use any of these names: [{existing_values}]\n",
"- Do not number the results.\n",
"\n",
"JSON format: [ \"value\" ]\n",
"Sample JSON Response: [ \"value1\", \"value2\" ]\n",
"\"\"\"\n",
"\n",
"llm_success = False\n",
"temperature=.8\n",
"while not llm_success:\n",
" try:\n",
" menu_names = GeminiProLLM(menu_items_prompt, temperature=temperature, topP=.8, topK = 40)\n",
" menu_names_json = json.loads(menu_names)\n",
" llm_success = True\n",
"  except Exception:\n",
"    # Reduce the temperature for more deterministic generation, keeping it non-negative\n",
"    temperature = max(temperature - .05, 0)\n",
" print(\"Regenerating...\")\n",
"\n",
"print(f\"menu_names: {menu_names}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "65MpRat_n3n1"
},
"outputs": [],
"source": [
"# Insert data\n",
"starting_value = GetStartingValue(dataset_id, table_name, primary_key)\n",
"\n",
"menu_names_sql_prompt=f\"\"\"\n",
"You are a database engineer and need to generate data for a table with the below schema.\n",
"- The schema is for a Google Cloud BigQuery Table.\n",
"- The table name is \"{project_id}.{dataset_id}.{table_name}\".\n",
"- Read the description of each field for valid values.\n",
"- Do not preface the response with any special characters or 'sql'.\n",
"- Generate {menu_count * 3} total rows for this table.\n",
"- Valid values for company_id are: {company_ids}\n",
"- Valid values for item_name are: {menu_names}\n",
"- The starting value of the field {primary_key} is {starting_value}.\n",
"- Only generate a single statement, not multiple INSERTs.\n",
"- Create a Small, Medium and Large size for each item_name. The same company_id should be used as well for all 3 sizes.\n",
"- For the field \"llm_item_image_prompt\", limit the text to 256 characters.\n",
"- For the field \"llm_item_image_url\" use the following pattern and replace [[menu_id]] with the generated menu id: https://storage.cloud.google.com/{gcs_storage_bucket}/{gcs_storage_path}[[menu_id]].png\n",
"\n",
"Example 1: INSERT INTO `my-project.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Sample'),(2, 'Sample');\n",
"Example 2: INSERT INTO `my-project.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Data'),(2, 'Data'),(3, 'Data');\n",
"\n",
"Schema: {schema}\n",
"\"\"\"\n",
"\n",
"llm_success = False\n",
"temperature=.8\n",
"while not llm_success:\n",
" try:\n",
" sql = GeminiProLLM(menu_names_sql_prompt, temperature=temperature, topP=.8, topK = 40)\n",
" print(f\"SQL: {sql}\")\n",
" llm_success = RunQuery(sql)\n",
"  except Exception:\n",
"    # Reduce the temperature for more deterministic generation, keeping it non-negative\n",
"    temperature = max(temperature - .05, 0)\n",
" print(\"Regenerating...\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "VmOKq9nzrbhj"
},
"outputs": [],
"source": [
"# Query to get a list of menu items and the prompts for the images\n",
"\n",
"sql = f\"\"\"SELECT menu_id,\n",
" item_name,\n",
" llm_item_image_prompt\n",
" FROM `{project_id}.{dataset_id}.{table_name}`\n",
" WHERE menu_id BETWEEN {starting_value} AND {(starting_value - 1) + (menu_count * 3)}\n",
" ORDER BY menu_id\"\"\"\n",
"\n",
"print(f\"SQL: {sql}\")\n",
"df_process = client.query(sql).to_dataframe()\n",
"coffee_drink_image_files = []\n",
"\n",
"for row in df_process.itertuples():\n",
" menu_id = row.menu_id\n",
" item_name = row.item_name\n",
" llm_item_image_prompt = row.llm_item_image_prompt + \" The image should be related to a coffee drink you would buy.\"\n",
"\n",
" print(f\"item_name: {item_name}\")\n",
" print(f\"llm_item_image_prompt: {llm_item_image_prompt}\")\n",
" try:\n",
" image_file = ImageGen(llm_item_image_prompt)\n",
"    coffee_drink_image_files.append({\n",
" \"menu_id\" : menu_id,\n",
" \"item_name\" : item_name,\n",
" \"llm_item_image_prompt\" : llm_item_image_prompt,\n",
" \"gcs_storage_bucket\" : gcs_storage_bucket,\n",
" \"gcs_storage_path\" : gcs_storage_path,\n",
" \"llm_image_filename\" : image_file\n",
" })\n",
"  except Exception:\n",
" print(\"Image failed to generate.\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "0NiOCtQEs76B"
},
"outputs": [],
"source": [
"# View the results\n",
"for item in coffee_drink_image_files:\n",
" print(f\"menu_id: {item['menu_id']}\")\n",
" print(f\"item_name: {item['item_name']}\")\n",
" print(f\"llm_item_image_prompt: {item['llm_item_image_prompt']}\")\n",
" img = Image.open(item[\"llm_image_filename\"])\n",
" img.thumbnail([500,500])\n",
" IPython.display.display(img)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "AmpFMOp59SkP"
},
"source": [
"#### Menu Table (Food)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "YcUNIquQ9wMY"
},
"outputs": [],
"source": [
"table_name = \"menu\"\n",
"field_name = \"item_name\"\n",
"existing_values = GetDistinctValues(dataset_id, table_name, field_name)\n",
"\n",
"menu_items_prompt = f\"\"\"Generate {menu_count} different foods that you would buy with coffee and return in the below json format.\n",
"- The name can be an existing food, or something new and inventive.\n",
"- The items need to be food items and not coffee drinks.\n",
"- The name should be related to coffee or a play on words around coffee.\n",
"- Do not use any of these names: [{existing_values}]\n",
"- Do not number the results.\n",
"\n",
"JSON format: [ \"value\" ]\n",
"Sample JSON Response: [ \"value1\", \"value2\" ]\n",
"\"\"\"\n",
"\n",
"llm_success = False\n",
"temperature=.8\n",
"while not llm_success:\n",
" try:\n",
" menu_names = GeminiProLLM(menu_items_prompt, temperature=temperature, topP=.8, topK = 40)\n",
" menu_names_json = json.loads(menu_names)\n",
" llm_success = True\n",
"  except Exception:\n",
"    # Reduce the temperature for more deterministic generation, keeping it non-negative\n",
"    temperature = max(temperature - .05, 0)\n",
" print(\"Regenerating...\")\n",
"\n",
"print(f\"menu_names: {menu_names}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "3QLwoNo6-Fo0"
},
"outputs": [],
"source": [
"# Insert data\n",
"starting_value = GetStartingValue(dataset_id, table_name, primary_key)\n",
"\n",
"menu_names_sql_prompt=f\"\"\"\n",
"You are a database engineer and need to generate data for a table with the below schema.\n",
"- The schema is for a Google Cloud BigQuery Table.\n",
"- The table name is \"{project_id}.{dataset_id}.{table_name}\".\n",
"- Read the description of each field for valid values.\n",
"- Do not preface the response with any special characters or 'sql'.\n",
"- Generate {menu_count} total rows for this table.\n",
"- Valid values for company_id are: {company_ids}\n",
"- Valid values for item_name are: {menu_names}\n",
"- The starting value of the field {primary_key} is {starting_value}.\n",
"- Only generate a single statement, not multiple INSERTs.\n",
"- Hardcode the field \"item_size\" to \"n/a\"\n",
"- For the field \"llm_item_image_prompt\", limit the text to 256 characters.\n",
"- For the field \"llm_item_image_url\" use the following pattern and replace [[menu_id]] with the generated menu id: https://storage.cloud.google.com/{gcs_storage_bucket}/{gcs_storage_path}[[menu_id]].png\n",
"\n",
"Example 1: INSERT INTO `my-project.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Sample'),(2, 'Sample');\n",
"Example 2: INSERT INTO `my-project.my-dataset.my-table` (field_1, field_2) VALUES (1, 'Data'),(2, 'Data'),(3, 'Data');\n",
"\n",
"Schema: {schema}\n",
"\"\"\"\n",
"\n",
"llm_success = False\n",
"temperature=.8\n",
"while not llm_success:\n",
" try:\n",
" sql = GeminiProLLM(menu_names_sql_prompt, temperature=temperature, topP=.8, topK = 40)\n",
" print(f\"SQL: {sql}\")\n",
" llm_success = RunQuery(sql)\n",
"  except Exception:\n",
"    # Reduce the temperature for more deterministic generation, keeping it non-negative\n",
"    temperature = max(temperature - .05, 0)\n",
" print(\"Regenerating...\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "IbftZef8-u7M"
},
"outputs": [],
"source": [
"# Query to get a list of menu items and the prompts for the images\n",
"\n",
"sql = f\"\"\"SELECT menu_id,\n",
" item_name,\n",
" llm_item_image_prompt\n",
" FROM `{project_id}.{dataset_id}.{table_name}`\n",
" WHERE menu_id BETWEEN {starting_value} AND { starting_value - 1 + menu_count }\n",
" ORDER BY menu_id\"\"\"\n",
"\n",
"print(f\"SQL: {sql}\")\n",
"df_process = client.query(sql).to_dataframe()\n",
"food_image_files = []\n",
"\n",
"for row in df_process.itertuples():\n",
" menu_id = row.menu_id\n",
" item_name = row.item_name\n",
" llm_item_image_prompt = row.llm_item_image_prompt\n",
"\n",
" print(f\"item_name: {item_name}\")\n",
" print(f\"llm_item_image_prompt: {llm_item_image_prompt}\")\n",
" try:\n",
" image_file = ImageGen(llm_item_image_prompt)\n",
"    food_image_files.append({\n",
" \"menu_id\" : menu_id,\n",
" \"item_name\" : item_name,\n",
" \"llm_item_image_prompt\" : llm_item_image_prompt,\n",
" \"gcs_storage_bucket\" : gcs_storage_bucket,\n",
" \"gcs_storage_path\" : gcs_storage_path,\n",
" \"llm_image_filename\" : image_file\n",
" })\n",
"  except Exception:\n",
" print(\"Image failed to generate.\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "PMqVuHje-z1R"
},
"outputs": [],
"source": [
"# View the results\n",
"for item in food_image_files:\n",
" print(f\"menu_id: {item['menu_id']}\")\n",
" print(f\"item_name: {item['item_name']}\")\n",
" print(f\"llm_item_image_prompt: {item['llm_item_image_prompt']}\")\n",
" img = Image.open(item[\"llm_image_filename\"])\n",
" img.thumbnail([500,500])\n",
" IPython.display.display(img)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9879tAKZCxvJ"
},
"source": [
"#### Save the results to storage"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "aRMySmnxDgea"
},
"outputs": [],
"source": [
"# When we created the sample data for our table, we also asked the LLM to generate the correct GCS / HTTP path\n",
"\n",
"# Copy all coffee drink image files to storage\n",
"for item in coffee_drink_image_files:\n",
" copy_file_to_gcs(item[\"llm_image_filename\"],item[\"gcs_storage_bucket\"], item[\"gcs_storage_path\"] + str(item['menu_id']) + \".png\")\n",
"\n",
"# Copy all food image files to storage\n",
"for item in food_image_files:\n",
" copy_file_to_gcs(item[\"llm_image_filename\"],item[\"gcs_storage_bucket\"], item[\"gcs_storage_path\"] + str(item['menu_id']) + \".png\")"
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [
"k6eIqerFOzyj",
"8zy0eEJmHxRZ",
"YtZuFgjbOjso",
"xUolPsMFOjpZ",
"E5CFSdK3HxYm",
"-L93udtrH1Oz",
"rVCY93IyXPoO",
"BlxddNzpmAgp"
],
"name": "BigQuery table",
"private_outputs": true,
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}