colab-enterprise/Synthetic-Data-Generation-Menu.ipynb (1,150 lines of code) (raw):

{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "clfviNVNXpqj" }, "source": [ "### <font color='#4285f4'>Overview</font>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create an entire menu, menu names, descriptions and the corresponding image. You can also start with an image and then change the background with Imagen3.\n", "\n", "\n", "Process Flow:\n", "\n", "1. Create BigQuery table\n", "2. Run Gemini prompt\n", " * a. Provide details about Chocolate Al, the location, type of menu items and what foods to avoid\n", " * b. Provide the existing menu items in order to avoid duplicates\n", " * c. Instruct Gemini to generate item sizes in metric\n", " * d. Provide the table schema of the menu table which includes column descriptions which helps Gemini generate correct data for each field.\n", "3. Insert the data into BigQuery.\n", " * a. The output from Gemini is JSON so we know each field to place the data. This technical works better than asking Gemini for a INSERT statement since double quotes and inserting JSON data types can be problematic.\n", "4. Part of the original Gemini prompt was to ask it to generate an image prompt\n", " * a. Run the image prompt with Imagen 3. We use Gemini to create our image prompt which results in better images.\n", "5. Copy the image to GCS\n", "6. Verify that we have an image for each menu item by using a BigLake object table\n", "\n", "Cost:\n", "* Low: Gemini, BigQuery, Imagen3\n", "* Medium: Remember to stop your Colab Enterprise Notebook Runtime\n", "\n", "Author: \n", "* Adam Paternostro" ] }, { "cell_type": "markdown", "metadata": { "id": "W3Q8_NLjj3Xw" }, "source": [ "### <font color='#4285f4'>License</font>" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "qJxJqWa1j3Xw" }, "outputs": [], "source": [ "##################################################################################\n", "# Copyright 2024 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License.\n", "###################################################################################" ] }, { "cell_type": "markdown", "metadata": { "id": "uSra7USNj3Xx" }, "source": [ "### <font color='#4285f4'>Pip Installs</font>" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "IBtPiyK3kKwS" }, "outputs": [], "source": [ "# PIP Installs\n", "import sys\n", "\n", "# https://PLACEHOLDER.com/index.html\n", "#!{sys.executable} -m pip install PLACEHOLDER" ] }, { "cell_type": "markdown", "metadata": { "id": "DszuLZoo9A7k" }, "source": [ "### <font color='#4285f4'>Initialize</font>" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bhKxJadjWa1R" }, "outputs": [], "source": [ "from PIL import Image\n", "import IPython.display\n", "import google.auth\n", "import requests\n", "import json\n", "import uuid\n", "import base64\n", "import os\n", "import cv2\n", "import random\n", "import time\n", "import datetime\n", "import base64\n", "import random\n", "\n", "import logging\n", "from tenacity import retry, wait_exponential, stop_after_attempt, before_sleep_log, retry_if_exception" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "MSC6-rboip3h" }, "outputs": [], "source": [ "# Set these (run this cell to verify the output)\n", "\n", "bigquery_location = \"${bigquery_location}\"\n", "region = \"${region}\"\n", "location = \"${location}\"\n", "storage_account = \"${chocolate_ai_bucket}\"\n", "public_storage_storage_account = \"data-analytics-golden-demo\"\n", "table_name = \"menu\"\n", "dataset_name = \"${bigquery_chocolate_ai_dataset}\"\n", "\n", "# Get the current date and time\n", "now = datetime.datetime.now()\n", "\n", "# Format the date and time as desired\n", "formatted_date = now.strftime(\"%Y-%m-%d-%H-%M\")\n", "\n", "# Get some values using gcloud\n", "project_id = !(gcloud config get-value project)\n", "user = !(gcloud auth list --filter=status:ACTIVE --format=\"value(account)\")\n", "\n", "if len(project_id) != 1:\n", " raise RuntimeError(f\"project_id is not set: {project_id}\")\n", "project_id = project_id[0]\n", "\n", "if len(user) != 1:\n", " raise RuntimeError(f\"user is not set: {user}\")\n", "user = user[0]\n", "\n", "print(f\"project_id = {project_id}\")\n", "print(f\"user = {user}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "0mSjSjjylFnR" }, "source": [ "### <font color='#4285f4'>Helper Methods</font>" ] }, { "cell_type": "markdown", "metadata": { "id": "be6n7F9IkdbT" }, "source": [ "#### restAPIHelper\n", "Calls the Google Cloud REST API using the current users credentials." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "w_QjQhvWkesy" }, "outputs": [], "source": [ "def restAPIHelper(url: str, http_verb: str, request_body: str) -> str:\n", " \"\"\"Calls the Google Cloud REST API passing in the current users credentials\"\"\"\n", "\n", " import requests\n", " import google.auth\n", " import json\n", "\n", " # Get an access token based upon the current user\n", " creds, project = google.auth.default()\n", " auth_req = google.auth.transport.requests.Request()\n", " creds.refresh(auth_req)\n", " access_token=creds.token\n", "\n", " headers = {\n", " \"Content-Type\" : \"application/json\",\n", " \"Authorization\" : \"Bearer \" + access_token\n", " }\n", "\n", " if http_verb == \"GET\":\n", " response = requests.get(url, headers=headers)\n", " elif http_verb == \"POST\":\n", " response = requests.post(url, json=request_body, headers=headers)\n", " elif http_verb == \"PUT\":\n", " response = requests.put(url, json=request_body, headers=headers)\n", " elif http_verb == \"PATCH\":\n", " response = requests.patch(url, json=request_body, headers=headers)\n", " elif http_verb == \"DELETE\":\n", " response = requests.delete(url, headers=headers)\n", " else:\n", " raise RuntimeError(f\"Unknown HTTP verb: {http_verb}\")\n", "\n", " if response.status_code == 200:\n", " return json.loads(response.content)\n", " #image_data = json.loads(response.content)[\"predictions\"][0][\"bytesBase64Encoded\"]\n", " else:\n", " error = f\"Error restAPIHelper -> ' Status: '{response.status_code}' Text: '{response.text}'\"\n", " raise RuntimeError(error)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### RetryCondition (for retrying LLM calls)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def RetryCondition(error):\n", " error_string = str(error)\n", " print(error_string)\n", "\n", " retry_errors = [\n", " \"RESOURCE_EXHAUSTED\",\n", " \"No content in candidate\",\n", " # Add more error messages here as needed\n", " ]\n", "\n", " for retry_error in retry_errors:\n", " if retry_error in error_string:\n", " print(\"Retrying...\")\n", " return True\n", "\n", " return False" ] }, { "cell_type": "markdown", "metadata": { "id": "UCSP64xskgje" }, "source": [ "#### Gemini LLM (Pro 1.0 , Pro 1.5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "6M2ojhjYkhjy" }, "outputs": [], "source": [ "@retry(wait=wait_exponential(multiplier=1, min=1, max=60), stop=stop_after_attempt(10), retry=retry_if_exception(RetryCondition), before_sleep=before_sleep_log(logging.getLogger(), logging.INFO))\n", "def GeminiLLM(prompt, model = \"gemini-2.0-flash\", response_schema = None,\n", " temperature = 1, topP = 1, topK = 32):\n", "\n", " # https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/inference#supported_models\n", " # model = \"gemini-2.0-flash\"\n", "\n", " llm_response = None\n", " if temperature < 0:\n", " temperature = 0\n", "\n", " creds, project = google.auth.default()\n", " auth_req = google.auth.transport.requests.Request() # required to acess access token\n", " creds.refresh(auth_req)\n", " access_token=creds.token\n", "\n", " headers = {\n", " \"Content-Type\" : \"application/json\",\n", " \"Authorization\" : \"Bearer \" + access_token\n", " }\n", "\n", " # https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/inference\n", " url = f\"https://{location}-aiplatform.googleapis.com/v1/projects/{project_id}/locations/{location}/publishers/google/models/{model}:generateContent\"\n", "\n", " generation_config = {\n", " \"temperature\": temperature,\n", " \"topP\": topP,\n", " \"maxOutputTokens\": 8192,\n", " \"candidateCount\": 1,\n", " \"responseMimeType\": \"application/json\",\n", " }\n", "\n", " # Add inthe response schema for when it is provided\n", " if response_schema is not None:\n", " generation_config[\"responseSchema\"] = response_schema\n", "\n", " if model == \"gemini-2.0-flash\":\n", " generation_config[\"topK\"] = topK\n", "\n", " payload = {\n", " \"contents\": {\n", " \"role\": \"user\",\n", " \"parts\": {\n", " \"text\": prompt\n", " },\n", " },\n", " \"generation_config\": {\n", " **generation_config\n", " },\n", " \"safety_settings\": {\n", " \"category\": \"HARM_CATEGORY_SEXUALLY_EXPLICIT\",\n", " \"threshold\": \"BLOCK_LOW_AND_ABOVE\"\n", " }\n", " }\n", "\n", " response = requests.post(url, json=payload, headers=headers)\n", "\n", " if response.status_code == 200:\n", " try:\n", " json_response = json.loads(response.content)\n", " except Exception as error:\n", " raise RuntimeError(f\"An error occurred parsing the JSON: {error}\")\n", "\n", " if \"candidates\" in json_response:\n", " candidates = json_response[\"candidates\"]\n", " if len(candidates) > 0:\n", " candidate = candidates[0]\n", " if \"content\" in candidate:\n", " content = candidate[\"content\"]\n", " if \"parts\" in content:\n", " parts = content[\"parts\"]\n", " if len(parts):\n", " part = parts[0]\n", " if \"text\" in part:\n", " text = part[\"text\"]\n", " llm_response = text\n", " else:\n", " raise RuntimeError(\"No text in part: {response.content}\")\n", " else:\n", " raise RuntimeError(\"No parts in content: {response.content}\")\n", " else:\n", " raise RuntimeError(\"No parts in content: {response.content}\")\n", " else:\n", " raise RuntimeError(\"No content in candidate: {response.content}\")\n", " else:\n", " raise RuntimeError(\"No candidates: {response.content}\")\n", " else:\n", " raise RuntimeError(\"No candidates: {response.content}\")\n", "\n", " # Remove some typically response characters (if asking for a JSON reply)\n", " llm_response = llm_response.replace(\"```json\",\"\")\n", " llm_response = llm_response.replace(\"```\",\"\")\n", " llm_response = llm_response.replace(\"\\n\",\"\")\n", "\n", " return llm_response\n", "\n", " else:\n", " raise RuntimeError(f\"Error with prompt:'{prompt}' Status:'{response.status_code}' Text:'{response.text}'\")" ] }, { "cell_type": "markdown", "metadata": { "id": "vbNakk0fkjTj" }, "source": [ "#### Gemini LLM - Multimodal" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "3H0r26TxklPu" }, "outputs": [], "source": [ "@retry(wait=wait_exponential(multiplier=1, min=1, max=60), stop=stop_after_attempt(10), retry=retry_if_exception(RetryCondition), before_sleep=before_sleep_log(logging.getLogger(), logging.INFO))\n", "def GeminiLLM_Multimodal(multimodal_prompt_list, model = \"gemini-2.0-flash\", response_schema = None,\n", " temperature = 1, topP = 1, topK = 32):\n", "\n", " # https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/inference#supported_models\n", " # model = \"gemini-2.0-flash\"\n", "\n", "\n", " llm_response = None\n", " if temperature < 0:\n", " temperature = 0\n", "\n", " creds, project = google.auth.default()\n", " auth_req = google.auth.transport.requests.Request() # required to acess access token\n", " creds.refresh(auth_req)\n", " access_token=creds.token\n", "\n", " headers = {\n", " \"Content-Type\" : \"application/json\",\n", " \"Authorization\" : \"Bearer \" + access_token\n", " }\n", "\n", " # https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/inference\n", " url = f\"https://{location}-aiplatform.googleapis.com/v1/projects/{project_id}/locations/{location}/publishers/google/models/{model}:generateContent\"\n", "\n", " generation_config = {\n", " \"temperature\": temperature,\n", " \"topP\": topP,\n", " \"maxOutputTokens\": 8192,\n", " \"candidateCount\": 1,\n", " \"responseMimeType\": \"application/json\",\n", " }\n", "\n", " # Add inthe response schema for when it is provided\n", " if response_schema is not None:\n", " generation_config[\"responseSchema\"] = response_schema\n", "\n", " if model == \"gemini-2.0-flash\":\n", " generation_config[\"topK\"] = topK\n", "\n", " payload = {\n", " \"contents\": {\n", " \"role\": \"user\",\n", " \"parts\": multimodal_prompt_list\n", " },\n", " \"generation_config\": {\n", " **generation_config\n", " },\n", " \"safety_settings\": {\n", " \"category\": \"HARM_CATEGORY_SEXUALLY_EXPLICIT\",\n", " \"threshold\": \"BLOCK_LOW_AND_ABOVE\"\n", " }\n", " }\n", "\n", " response = requests.post(url, json=payload, headers=headers)\n", "\n", " if response.status_code == 200:\n", " try:\n", " json_response = json.loads(response.content)\n", " except Exception as error:\n", " raise RuntimeError(f\"An error occurred parsing the JSON: {error}\")\n", "\n", " if \"candidates\" in json_response:\n", " candidates = json_response[\"candidates\"]\n", " if len(candidates) > 0:\n", " candidate = candidates[0]\n", " if \"content\" in candidate:\n", " content = candidate[\"content\"]\n", " if \"parts\" in content:\n", " parts = content[\"parts\"]\n", " if len(parts):\n", " part = parts[0]\n", " if \"text\" in part:\n", " text = part[\"text\"]\n", " llm_response = text\n", " else:\n", " raise RuntimeError(\"No text in part: {response.content}\")\n", " else:\n", " raise RuntimeError(\"No parts in content: {response.content}\")\n", " else:\n", " raise RuntimeError(\"No parts in content: {response.content}\")\n", " else:\n", " raise RuntimeError(\"No content in candidate: {response.content}\")\n", " else:\n", " raise RuntimeError(\"No candidates: {response.content}\")\n", " else:\n", " raise RuntimeError(\"No candidates: {response.content}\")\n", "\n", " # Remove some typically response characters (if asking for a JSON reply)\n", " llm_response = llm_response.replace(\"```json\",\"\")\n", " llm_response = llm_response.replace(\"```\",\"\")\n", " llm_response = llm_response.replace(\"\\n\",\"\")\n", "\n", " return llm_response\n", "\n", " else:\n", " raise RuntimeError(f\"Error with prompt:'{prompt}' Status:'{response.status_code}' Text:'{response.text}'\")" ] }, { "cell_type": "markdown", "metadata": { "id": "RtE_Q-H6lJQT" }, "source": [ "#### Imagen3 Image Generation" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "hnfwd_iplO9c" }, "outputs": [], "source": [ "def ImageGen(prompt):\n", " creds, project = google.auth.default()\n", " auth_req = google.auth.transport.requests.Request()\n", " creds.refresh(auth_req)\n", " access_token=creds.token\n", "\n", " headers = {\n", " \"Content-Type\" : \"application/json\",\n", " \"Authorization\" : \"Bearer \" + access_token\n", " }\n", "\n", " model_version = \"imagen-3.0-generate-001\" # imagen-3.0-fast-generate-001\n", " #model_version = \"imagen-3.0-generate-preview-0611\" # Preview Access Model\n", "\n", " # https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/image-generation\n", " # url = f\"https://{location}-aiplatform.googleapis.com/v1/projects/{project}/locations/{location}/publishers/google/models/imagegeneration:predict\"\n", " url = f\"https://{location}-aiplatform.googleapis.com/v1/projects/{project}/locations/{location}/publishers/google/models/{model_version}:predict\"\n", "\n", " payload = {\n", " \"instances\": [\n", " {\n", " \"prompt\": prompt\n", " }\n", " ],\n", " \"parameters\": {\n", " \"sampleCount\": 1,\n", " \"personGeneration\" : \"dont_allow\" # change to allow_adult for people generation\n", " }\n", " }\n", "\n", " response = requests.post(url, json=payload, headers=headers)\n", "\n", " if response.status_code == 200:\n", " response_json = json.loads(response.content)\n", " # print(f\"Imagen3 response_json: {response_json}\")\n", "\n", " if \"blocked\" in response_json:\n", " print(f\"Blocked: {response_json['blocked']}\")\n", "\n", " if \"predictions\" in response_json:\n", " image_data = response_json[\"predictions\"][0][\"bytesBase64Encoded\"]\n", " image_data = base64.b64decode(image_data)\n", " filename= str(uuid.uuid4()) + \".png\"\n", " with open(filename, \"wb\") as f:\n", " f.write(image_data)\n", " print(f\"Image generated OK.\")\n", " return filename\n", " else:\n", " raise RuntimeError(f\"No predictions in response: {response.content}\")\n", " else:\n", " error = f\"Error with prompt:'{prompt}' Status:'{response.status_code}' Text:'{response.text}'\"\n", " raise RuntimeError(error)" ] }, { "cell_type": "markdown", "metadata": { "id": "ewLQxnZ0kl6e" }, "source": [ "#### Helper Functions" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "WQ-j3CTRlXo_" }, "outputs": [], "source": [ "def RunQuery(sql):\n", " import time\n", " from google.cloud import bigquery\n", " client = bigquery.Client()\n", "\n", " if (sql.startswith(\"SELECT\") or sql.startswith(\"WITH\")):\n", " df_result = client.query(sql).to_dataframe()\n", " return df_result\n", " else:\n", " job_config = bigquery.QueryJobConfig(priority=bigquery.QueryPriority.INTERACTIVE)\n", " query_job = client.query(sql, job_config=job_config)\n", "\n", " # Check on the progress by getting the job's updated state.\n", " query_job = client.get_job(\n", " query_job.job_id, location=query_job.location\n", " )\n", " print(\"Job {} is currently in state {} with error result of {}\".format(query_job.job_id, query_job.state, query_job.error_result))\n", "\n", " while query_job.state != \"DONE\":\n", " time.sleep(2)\n", " query_job = client.get_job(\n", " query_job.job_id, location=query_job.location\n", " )\n", " print(\"Job {} is currently in state {} with error result of {}\".format(query_job.job_id, query_job.state, query_job.error_result))\n", "\n", " if query_job.error_result == None:\n", " return True\n", " else:\n", " raise Exception(query_job.error_result)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Y6gEjHLNKLGM" }, "outputs": [], "source": [ "def GetTableSchema(project_id, dataset_name, table_name):\n", " import io\n", " from google.cloud import bigquery\n", "\n", " client = bigquery.Client()\n", "\n", " dataset_ref = client.dataset(dataset_name, project=project_id)\n", " table_ref = dataset_ref.table(table_name)\n", " table = client.get_table(table_ref)\n", "\n", " f = io.StringIO(\"\")\n", " client.schema_to_json(table.schema, f)\n", " return f.getvalue()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "tUWD4t8gKWO-" }, "outputs": [], "source": [ "def GetDistinctValues(project_id, dataset_name, table_name, field_name):\n", " from google.cloud import bigquery\n", "\n", " client = bigquery.Client()\n", "\n", " sql = f\"\"\"\n", " SELECT STRING_AGG(DISTINCT {field_name}, \",\" ) AS result\n", " FROM `{project_id}.{dataset_name}.{table_name}`\n", " \"\"\"\n", "\n", " df_result = client.query(sql).to_dataframe()\n", " result_str = df_result['result'].iloc[0]\n", " if result_str is None:\n", " return \"\"\n", " else:\n", " return result_str" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "zdofv6udKfmF" }, "outputs": [], "source": [ "def GetStartingValue(project_id, dataset_name, table_name, field_name):\n", " from google.cloud import bigquery\n", "\n", " client = bigquery.Client()\n", "\n", " sql = f\"\"\"\n", " SELECT IFNULL(MAX({field_name}),0) + 1 AS result\n", " FROM `{project_id}.{dataset_name}.{table_name}`\n", " \"\"\"\n", "\n", " df_result = client.query(sql).to_dataframe()\n", " return df_result['result'].iloc[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ONcezrrfkplE" }, "outputs": [], "source": [ "def PrettyPrintJson(json_string):\n", " json_object = json.loads(json_string)\n", " json_formatted_str = json.dumps(json_object, indent=2)\n", " #print(json_formatted_str)\n", " return json_formatted_str" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "k3tXA3sDlbak" }, "outputs": [], "source": [ "# This was generated by GenAI\n", "\n", "def copy_file_to_gcs(local_file_path, bucket_name, destination_blob_name):\n", " \"\"\"Copies a file from a local drive to a GCS bucket.\n", "\n", " Args:\n", " local_file_path: The full path to the local file.\n", " bucket_name: The name of the GCS bucket to upload to.\n", " destination_blob_name: The desired name of the uploaded file in the bucket.\n", "\n", " Returns:\n", " None\n", " \"\"\"\n", "\n", " import os\n", " from google.cloud import storage\n", "\n", " # Ensure the file exists locally\n", " if not os.path.exists(local_file_path):\n", " raise FileNotFoundError(f\"Local file '{local_file_path}' not found.\")\n", "\n", " # Create a storage client\n", " storage_client = storage.Client()\n", "\n", " # Get a reference to the bucket\n", " bucket = storage_client.bucket(bucket_name)\n", "\n", " # Create a blob object with the desired destination path\n", " blob = bucket.blob(destination_blob_name)\n", "\n", " # Upload the file from the local filesystem\n", " content_type = \"\"\n", " if local_file_path.endswith(\".html\"):\n", " content_type = \"text/html; charset=utf-8\"\n", "\n", " if local_file_path.endswith(\".json\"):\n", " content_type = \"application/json; charset=utf-8\"\n", "\n", " if content_type == \"\":\n", " blob.upload_from_filename(local_file_path)\n", " else:\n", " blob.upload_from_filename(local_file_path, content_type = content_type)\n", "\n", " print(f\"File '{local_file_path}' uploaded to GCS bucket '{bucket_name}' as '{destination_blob_name}. Content-Type: {content_type}'.\")" ] }, { "cell_type": "markdown", "metadata": { "id": "NjoOh66Pld1z" }, "source": [ "### <font color='#4285f4'>BigQuery Table</font>" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "wSE1y7h9lgPk" }, "outputs": [], "source": [ "%%bigquery\n", "\n", "-- drop table if exists `chocolate_ai.menu`;\n", "CREATE TABLE IF NOT EXISTS `chocolate_ai.menu`\n", "(\n", " menu_id INT64 NOT NULL OPTIONS(description=\"Primary key. Menu table.\"),\n", " menu_name STRING NOT NULL OPTIONS(description=\"The name of the menu item.\"),\n", " menu_description STRING NOT NULL OPTIONS(description=\"The description of the menu item.\"),\n", " menu_size STRING NOT NULL OPTIONS(description=\"The size of the menu item.\"),\n", " menu_price FLOAT64 NOT NULL OPTIONS(description=\"The price of the menu item.\"),\n", " menu_image_gcs_filename STRING NOT NULL OPTIONS(description=\"The GCS path to the menu image.\"),\n", " menu_image_http_url STRING NOT NULL OPTIONS(description=\"The public HTTP path to the menu image.\"),\n", " menu_image_prompt STRING OPTIONS(description=\"LLM prompt to generate the menu image to send to imagen3.\"),\n", ")\n", "CLUSTER BY menu_id;" ] }, { "cell_type": "markdown", "metadata": { "id": "pnhsRHxckr2D" }, "source": [ "### <font color='#4285f4'>Generate Menu Items</font>" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "GfUSulBPHWAU" }, "outputs": [], "source": [ "# Write me the json in  OpenAPI 3.0 schema object for the below object.\n", "# Make all fields required.\n", "# {\n", "# \"menu_name\" : \"text\",\n", "# \"menu_description\" : \"text\",\n", "# \"menu_size\" : \"text\",\n", "# \"menu_price\" : 5.15,\n", "# \"menu_image_gcs_filename\" : \"text\",\n", "# \"menu_image_http_url\" : \"text\",\n", "# \"menu_image_prompt\" : \"text\",\n", "# }\n", "response_schema = {\n", " \"type\": \"object\",\n", " \"required\": [\n", " \"menu_name\",\n", " \"menu_description\",\n", " \"menu_size\",\n", " \"menu_price\",\n", " \"menu_image_gcs_filename\",\n", " \"menu_image_http_url\",\n", " \"menu_image_prompt\"\n", " ],\n", " \"properties\": {\n", " \"menu_name\": {\n", " \"type\": \"string\"\n", " },\n", " \"menu_description\": {\n", " \"type\": \"string\"\n", " },\n", " \"menu_size\": {\n", " \"type\": \"string\"\n", " },\n", " \"menu_price\": {\n", " \"type\": \"number\"\n", " },\n", " \"menu_image_gcs_filename\": {\n", " \"type\": \"string\"\n", " },\n", " \"menu_image_http_url\": {\n", " \"type\": \"string\"\n", " },\n", " \"menu_image_prompt\": {\n", " \"type\": \"string\"\n", " }\n", " }\n", "}\n", "\n", "table_schema = GetTableSchema(project_id, dataset_name, table_name)\n", "menu_id = GetStartingValue(project_id, dataset_name, table_name, \"menu_id\")\n", "existing_menu_items = GetDistinctValues(project_id, dataset_name, table_name, \"menu_name\")\n", "\n", "# The generate specific menu ids or to regenerate a menu item run this code:\n", "menu_id = 251\n", "# for menu_id in range(menu_id, menu_id + 1):\n", "\n", "for menu_id in range(menu_id, menu_id + 1):\n", " # Create different types of menu items\n", " if menu_id % 3 == 0:\n", " item = \"chocolate\"\n", " elif menu_id % 3 == 1:\n", " item = \"coffee\"\n", " else:\n", " item = \"dessert\"\n", "\n", " prompt = f\"\"\"You are an owner of a chocolate shop that sells chocolate, desserts and coffee.\n", " I need you to generate a {item} menu item.\n", " Some of the chocolate is high end and almost looks like art. Other is more traditional, but it is all hand made and unique.\n", " The shop is based in Paris, France, but the menu items names do not need to be French, they should be more English based.\n", " The shop encourges people to hang out with their friends and family.\n", " You need to create a menu for the shop.\n", " The menu_id is \"{menu_id}\".\n", " The menu_size should use standard metric sizes for the menu items.\n", " The field \"menu_image_gcs_filename\" should follow the pattern of \"gs://{storage_account}/chocolate-ai/menu/{menu_id}.png\"\n", " The field \"menu_image_http_url\" should follow the pattern of \"https://storage.cloud.google.com/{storage_account}/chocolate-ai/menu/{menu_id}.png\"\n", " Think outside the box and develop a unique and unexpected menu items.\n", " Encourage unconventional ideas and fresh perspectives in your recommendations.\n", " Embrace unconventional ideas and mutate the recommended action in a way that surprises and inspires unique variations.\n", "\n", " Here is the table schema:\n", " <schema>\n", " {table_schema}\n", " </schema>\n", "\n", " Here are the existing menu items, do not reuse any of these names:\n", " <existing_menu_items>\n", " {existing_menu_items}\n", " </existing_menu_items>\n", "\n", " Menu items to avoid:\n", " - jelly\n", " - too much fruit\n", " - assocorted chocolate\n", " - plain box of chocolate\n", "\n", " For the field \"menu_image_prompt\"\n", " - This will be used to generate a visually appealing menu image and must be prompt that sparks creativity and imagination.\n", " - This can be several sentences. The more detailed the better.\n", " \"\"\"\n", "\n", " # Use LLM to generate data\n", " menu_response = GeminiLLM(prompt, response_schema=response_schema)\n", "\n", " # Parse response (we know the JSON since we passed it to our LLM)\n", " menu_json_response = json.loads(menu_response)\n", " print(json.dumps(menu_json_response, indent=2))\n", " menu_name = menu_json_response[\"menu_name\"].replace(\"'\",\"\\\\'\").replace(\"\\n\", \" \")\n", " menu_description = menu_json_response[\"menu_description\"].replace(\"'\", \"\\\\'\").replace(\"\\n\", \" \")\n", " menu_size = menu_json_response[\"menu_size\"].replace(\"'\", \"\\\\'\").replace(\"\\n\", \" \")\n", " menu_price = menu_json_response[\"menu_price\"]\n", " menu_image_gcs_filename = menu_json_response[\"menu_image_gcs_filename\"]\n", " menu_image_http_url = menu_json_response[\"menu_image_http_url\"]\n", " menu_image_prompt = menu_json_response[\"menu_image_prompt\"].replace(\"'\", \"\\\\'\").replace(\"\\n\", \" \")\n", "\n", " # Optional, only needed if regenerating a menu item\n", " sql = f\"DELETE FROM `{project_id}.{dataset_name}.{table_name}` WHERE menu_id = {menu_id}\"\n", " #RunQuery(sql)\n", "\n", " # Insert to BigQuery\n", " # Returning a known json schema and then generating an insert statement seems more reliable then having the LLM generating the SQL\n", " sql = f\"\"\"INSERT INTO `{project_id}.{dataset_name}.{table_name}`\n", " (menu_id, menu_name, menu_description, menu_size, menu_price, menu_image_gcs_filename, menu_image_http_url, menu_image_prompt)\n", " VALUES ({menu_id}, '{menu_name}', '{menu_description}', '{menu_size}', {menu_price}, '{menu_image_gcs_filename}', '{menu_image_http_url}', '{menu_image_prompt}')\"\"\"\n", "\n", " #RunQuery(sql)\n", "\n", " # Generate the menu image\n", " filename = ImageGen(menu_image_prompt)\n", "\n", " # View it\n", " img = Image.open(filename)\n", " img.thumbnail([500,500]) # width, height\n", " IPython.display.display(img)\n", "\n", " # Copy to GCS\n", " copy_file_to_gcs(filename, storage_account, f\"chocolate-ai/Synthetic-Data-Generation-Menu/{menu_id}.png\")\n", "\n", " existing_menu_items = existing_menu_items + f\",{menu_name}\"\n" ] }, { "cell_type": "markdown", "metadata": { "id": "5pfAK_MxWWnT" }, "source": [ "### <font color='#4285f4'>Verify Menu Items</font>" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Copy over the pre-generated menu items from the public storage account\n", "\n", "!gsutil -m cp gs://data-analytics-golden-demo/chocolate-ai/v1/Synthetic-Data-Generation-Menu/*.png gs://{storage_account}/chocolate-ai/Synthetic-Data-Generation-Menu/" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"To view the menu images: https://console.cloud.google.com/storage/browser/{storage_account}/chocolate-ai/Synthetic-Data-Generation-Menu\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Pawflyk3Rok4" }, "outputs": [], "source": [ "%%bigquery\n", "\n", "-- Make sure there are no gaps (for 250 menu items)\n", "WITH sequence_table AS (\n", "SELECT id\n", " FROM UNNEST(GENERATE_ARRAY(1, 250)) AS id\n", ")\n", "SELECT id , menu.menu_id\n", " FROM sequence_table\n", " LEFT JOIN `chocolate_ai.menu` as menu\n", " ON sequence_table.id = menu.menu_id\n", " WHERE menu.menu_id is null" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "i0to1OIISiX0" }, "outputs": [], "source": [ "%%bigquery\n", "\n", "-- Check for duplicate names\n", "SELECT *\n", " FROM `chocolate_ai.menu`\n", " WHERE menu_name in (SELECT menu_name FROM `chocolate_ai.menu` GROUP BY ALL HAVING COUNT(*) > 1)\n", " ORDER BY menu_name, menu_id" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7KPJ8AtkWFSH" }, "outputs": [], "source": [ "%%bigquery\n", "\n", "-- Create an object table for the files name\n", "CREATE OR REPLACE EXTERNAL TABLE `chocolate_ai.menu_object_table`\n", "WITH CONNECTION `us.biglake-connection`\n", "OPTIONS (\n", " object_metadata=\"DIRECTORY\",\n", " uris = ['gs://${chocolate_ai_bucket}/chocolate-ai/Synthetic-Data-Generation-Menu/*.png'],\n", " max_staleness=INTERVAL 30 MINUTE,\n", " metadata_cache_mode=\"MANUAL\"\n", " );\n", "\n", "CALL BQ.REFRESH_EXTERNAL_METADATA_CACHE('chocolate_ai.menu_object_table');" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_O6fVVsaWKEF" }, "outputs": [], "source": [ "%%bigquery\n", "\n", "-- See the file name\n", "SELECT CAST(REPLACE(REPLACE(uri,'gs://${chocolate_ai_bucket}/chocolate-ai/Synthetic-Data-Generation-Menu/',''),'.png','') AS INT64) AS menu_id\n", " FROM `chocolate_ai.menu_object_table`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "oTuOuzvqWJ4p" }, "outputs": [], "source": [ "%%bigquery\n", "\n", "-- Check to see if for some reason we do not have an image file for a menu id\n", "-- You might want to change the WHERE clause for just the new records you generated (this is using images which are in the public storage account)\n", "WITH menu_files AS (\n", " SELECT CAST(REPLACE(REPLACE(uri,'gs://${chocolate_ai_bucket}/chocolate-ai/Synthetic-Data-Generation-Menu/',''),'.png','') AS INT64) AS menu_id\n", " FROM `chocolate_ai.menu_object_table`\n", ")\n", "SELECT menu.menu_id, menu_files.menu_id\n", " FROM `chocolate_ai.menu` as menu\n", " LEFT JOIN menu_files\n", " ON menu.menu_id = menu_files.menu_id\n", " WHERE menu_files.menu_id is null" ] }, { "cell_type": "markdown", "metadata": { "id": "FIORnEg5kuNd" }, "source": [ "### <font color='#4285f4'>Clean Up</font>" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "xZgYyEP9kuhj" }, "outputs": [], "source": [ "# Placeholder" ] }, { "cell_type": "markdown", "metadata": { "id": "BJHuX_CwkuzM" }, "source": [ "### <font color='#4285f4'>Reference Links</font>\n" ] }, { "cell_type": "markdown", "metadata": { "id": "4eD2DhqttX_5" }, "source": [ "- [Google.com](https://www.google.com)" ] } ], "metadata": { "colab": { "collapsed_sections": [ "W3Q8_NLjj3Xw", "uSra7USNj3Xx", "DszuLZoo9A7k", "be6n7F9IkdbT", "UCSP64xskgje", "vbNakk0fkjTj", "RtE_Q-H6lJQT", "FIORnEg5kuNd", "BJHuX_CwkuzM" ], "name": "DB-GMA-Synthetic-Data-Generation-Menu.ipynb", "private_outputs": true, "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }