colab-enterprise/rideshare_llm_ai_lakehouse_demo.ipynb (1,340 lines of code) (raw):

{ "cells": [ { "cell_type": "markdown", "id": "9mIA1BTRrphs", "metadata": { "id": "9mIA1BTRrphs" }, "source": [ "# <img src=\"https://lh3.googleusercontent.com/mUTbNK32c_DTSNrhqETT5aQJYFKok2HB1G2nk2MZHvG5bSs0v_lmDm_ArW7rgd6SDGHXo0Ak2uFFU96X6Xd0GQ=w160-h128\" width=\"45\" valign=\"top\" alt=\"BigQuery\"> Welcome to Rideshare AI Lakehouse powered by BigQuery Studio\n", "\n", "Watch the demo on <a href=\"https://youtu.be/5ZStKD4joK4\">YouTube</a>" ] }, { "cell_type": "markdown", "id": "LylylPEjaXrC", "metadata": { "id": "LylylPEjaXrC" }, "source": [ "## **Step 01: Overview**" ] }, { "cell_type": "markdown", "id": "9LeW2mN1s6sy", "metadata": { "id": "9LeW2mN1s6sy" }, "source": [ "### Summary\n" ] }, { "cell_type": "markdown", "id": "6HNmymbgtFWr", "metadata": { "id": "6HNmymbgtFWr" }, "source": [ "This portion of the demo will showcase how you can build an AI Lakehouse using Google Cloud's Data and Analytics stack. The demo is for our fitcious company Rideshare Plus where we will use both qualitative and quantitative analysis to build a complete customer and driver profile. This notebook will highlight:\n", "- Using LLMs to extract customer and driver preferences based upon customer reviews.\n", "- Using LLMs to understand our customers and to create a summarized customer understand to improve our interactions with our customers.\n", "- Using LLMs to understand our drivers and how they are interacting with our customers and an overall customer statisification summary.\n", "\n", "All code is located on GitHub: https://goo.gle/dagd\n", "\n", "Please check out the other part of the demo which uses AI, image object analysis and data at scale to predict the most profitable pickup locations." ] }, { "cell_type": "markdown", "id": "U3gYlvLItPXU", "metadata": { "id": "U3gYlvLItPXU" }, "source": [ "### Details" ] }, { "cell_type": "markdown", "id": "dpdULnrStPID", "metadata": { "id": "dpdULnrStPID" }, "source": [ "- For each customer review\n", " - Speech to Text\n", "\n", "- Process the Customer Reviews (extract knowledge)\n", " - Sentiment\n", " - Extract Themes\n", " - Driver view\n", " - Customer view\n", " - Summarize Themes\n", " - For each Driver\n", " - For each Customer\n", " - Summarize All Reviews\n", " - For each Driver\n", " - We can use this as an Employee Review for the driver to suggust training.\n", " - For each Customer\n", " - We can use this to inform the driver of the customer preferences so the driver can taylor their ride for the customer.\n", " - Perform Quantitative analysis and summarize with LLM\n", " - For each Driver\n", " - For each Customer\n", "\n", "- LLM (Other)\n", " - LLM can be used to see if our LLM Summary is:\n", " - Good Quality\n", " - Correct for puncation and grammer" ] }, { "cell_type": "markdown", "id": "WPtkAs20tXpx", "metadata": { "id": "WPtkAs20tXpx" }, "source": [ "## **Step 02: Customer Review Speech to Text**" ] }, { "cell_type": "markdown", "id": "H_svdrJqtrDS", "metadata": { "id": "H_svdrJqtrDS" }, "source": [ "### Summary\n" ] }, { "cell_type": "markdown", "id": "sTMN0fYGtzvj", "metadata": { "id": "sTMN0fYGtzvj" }, "source": [ "We will start by processing our audio files at scale. These can be achieved through object tables which will monitor a storage account for new customer reviews. Upon detecting a review, BigQuery can ask Vertex AI Speech to Text (STT) for the extracted text.\n", "\n", "For each audio file containing the review we run it through the STT service to extract the text.\n", "\n", "* First we create an Object Table pointing to the GCS bucket holding the audio files.\n", "* We call `BQML.TRANSCRIBE audio`" ] }, { "cell_type": "markdown", "id": "ac98761d", "metadata": {}, "source": [ "Inspect the Object Table" ] }, { "cell_type": "code", "execution_count": null, "id": "fe0b01c4", "metadata": {}, "outputs": [], "source": [ "%%bigquery\n", "SELECT REGEXP_EXTRACT(uri, r'text_synth_(\\d+)\\.mp3') as trip_id,*\n", " FROM EXTERNAL_OBJECT_TRANSFORM(TABLE ${bigquery_rideshare_llm_raw_dataset}.biglake_rideshare_audios, ['SIGNED_URL'])\n", " LIMIT 5;" ] }, { "cell_type": "markdown", "id": "67ab545d", "metadata": {}, "source": [ "Process one particular customer audio review:\n", " - Get the signed_url using `EXTERNAL_OBJECT_TRANSFORM` on the Object Table\n", " - Download the audio file and inpect it (play it)\n", " - Run the speech to the cloud function and review the transcription results\n", " - Change the Trip Id to 24406991 for a Negative review" ] }, { "cell_type": "code", "execution_count": null, "id": "e7b5bef4", "metadata": {}, "outputs": [], "source": [ "trip_id = '30618713' #@param{type:\"string\"}\n", "params = { \"trip_id\": trip_id}" ] }, { "cell_type": "code", "execution_count": null, "id": "AVrrmWUaHTuj", "metadata": { "id": "AVrrmWUaHTuj" }, "outputs": [], "source": [ "%%bigquery signed_url --params $params\n", "WITH signed_urls AS (\n", " SELECT signed_url,\n", " REGEXP_EXTRACT(uri, r'text_synth_(\\d+)\\.mp3') as trip_id\n", " FROM EXTERNAL_OBJECT_TRANSFORM(TABLE ${bigquery_rideshare_llm_raw_dataset}.biglake_rideshare_audios, ['SIGNED_URL']))\n", " SELECT signed_url FROM signed_urls WHERE trip_id = @trip_id;" ] }, { "cell_type": "code", "execution_count": null, "id": "3d93dca1", "metadata": {}, "outputs": [], "source": [ "import requests\n", "response = requests.get(signed_url['signed_url'][0],stream=True)\n", "with open(f'text_synth_{trip_id}.mp3', 'wb') as f:\n", " f.write(response.content)" ] }, { "cell_type": "code", "execution_count": null, "id": "e09825c1", "metadata": {}, "outputs": [], "source": [ "import IPython.display\n", "IPython.display.Audio(f'text_synth_{trip_id}.mp3',rate=16000)" ] }, { "cell_type": "code", "execution_count": null, "id": "9cdc7194", "metadata": {}, "outputs": [], "source": [ "%%bigquery --params $params\n", "BEGIN\n", "CREATE TEMP TABLE customer_reviews_transcript AS (SELECT ${bigquery_rideshare_llm_raw_dataset}.ext_udf_ai_extract_text(signed_url) AS customer_review_text,\n", " REGEXP_EXTRACT(uri, r'text_synth_(\\d+)\\.mp3') as trip_id\n", " FROM EXTERNAL_OBJECT_TRANSFORM(TABLE ${bigquery_rideshare_llm_raw_dataset}.biglake_rideshare_audios, ['SIGNED_URL']) WHERE REGEXP_EXTRACT(uri, r'text_synth_(\\d+)\\.mp3') = @trip_id );\n", "SELECT JSON_VALUE(ml_generate_text_result, '$.candidates[0].content.parts[0].text') AS result,\n", " prompt,\n", " ml_generate_text_result\n", " FROM ML.GENERATE_TEXT(MODEL `${project_id}.${bigquery_rideshare_llm_curated_dataset}.gemini_model`,\n", " ( SELECT CONCAT('The following phase is a customer review for a rideshare company. ',\n", " 'Fix grammar and punctuation in the phase. ',\n", " 'Correct any words that might be out of context (e.g. ride chair should be rideshare). ', \n", " 'Phrase: ', customer_review_text) as prompt\n", " FROM customer_reviews_transcript\n", " WHERE trip_id = @trip_id),\n", "STRUCT(\n", " 1 AS temperature,\n", " 512 AS max_output_tokens,\n", " 0 AS top_p,\n", " 1 AS top_k\n", " ));\n", " END;" ] }, { "cell_type": "markdown", "id": "Z32uOzrquJml", "metadata": { "id": "Z32uOzrquJml" }, "source": [ "## **Step 03: Review Sentiment**" ] }, { "cell_type": "markdown", "id": "SYmSnDBduLOz", "metadata": { "id": "SYmSnDBduLOz" }, "source": [ "### Summary" ] }, { "cell_type": "markdown", "id": "CIAVjNIauMzN", "metadata": { "id": "CIAVjNIauMzN" }, "source": [ "For each customer review we run it through the LLM.\n", "1. First we construct a prompt and save this in a field in our table.\n", "2. Ask the LLM to score these in bulk.\n", "3. We then extract the scored text from the JSON result.\n", "\n", "*The complete code is in notebook: rideshare_llm_step_01_customer_sentiment_analysis*" ] }, { "cell_type": "markdown", "id": "ywWde8bRuYwm", "metadata": { "id": "ywWde8bRuYwm" }, "source": [ "### Code" ] }, { "cell_type": "code", "execution_count": null, "id": "9vCgj9k8ucH3", "metadata": { "id": "9vCgj9k8ucH3" }, "outputs": [], "source": [ "%%bigquery\n", "\n", "-- Select a Positive Review from the Enriched Zone\n", "-- For each review we created a LLM prompt (llm_sentiment_prompt)\n", " /* Sample Prompt:\n", " For the given review classify the sentiment as Positive, Neutral or Negative.\n", " Review: My rideshare driver was amazing! He was very professional and attentive.\n", " He made sure to pay close attention to the road without any distractions. He also made sure to provide me with a comfortable and safe ride.\n", " I would definitely recommend this driver to anyone looking for a rideshare.\n", " */\n", "-- We will then use this prompt to pass to our LLM and receive the scored output\n", "\n", "SELECT customer.customer_name,\n", " driver.driver_name,\n", " customer_review.customer_review_text,\n", " customer_review.llm_sentiment_prompt\n", " FROM `${project_id}.${bigquery_rideshare_llm_enriched_dataset}.customer_review` AS customer_review\n", " INNER JOIN `${project_id}.${bigquery_rideshare_llm_enriched_dataset}.customer` AS customer\n", " ON customer_review.customer_id = customer.customer_id\n", " INNER JOIN `${project_id}.${bigquery_rideshare_llm_enriched_dataset}.driver` AS driver\n", " ON customer_review.driver_id = driver.driver_id\n", " WHERE trip_id = 20857125;" ] }, { "cell_type": "markdown", "id": "pvswIy14zy4d", "metadata": { "id": "pvswIy14zy4d" }, "source": [ "**ML.GENERATE_TEXT model**\n", "\n", "We will use the ML.GENERATE_TEXT model which is Google's Gemini Large Language Model (LLM)\n", "- <a href=\"https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-generate-text\">Documentation Link</a>" ] }, { "cell_type": "code", "execution_count": null, "id": "OkuuZ0crwKyv", "metadata": { "id": "OkuuZ0crwKyv" }, "outputs": [], "source": [ "%%bigquery\n", "\n", "-- Same query as above, but we will pass the prompt to the LLM along with setting our LLM model parameters\n", "SELECT JSON_VALUE(ml_generate_text_result, '$.candidates[0].content.parts[0].text') AS result,\n", " prompt,\n", " ml_generate_text_result\n", " FROM ML.GENERATE_TEXT(MODEL`${project_id}.${bigquery_rideshare_llm_curated_dataset}.gemini_model`,\n", " (SELECT customer_review.llm_sentiment_prompt AS prompt\n", " FROM `${project_id}.${bigquery_rideshare_llm_enriched_dataset}.customer_review` AS customer_review\n", " INNER JOIN `${project_id}.${bigquery_rideshare_llm_enriched_dataset}.customer` AS customer\n", " ON customer_review.customer_id = customer.customer_id\n", " INNER JOIN `${project_id}.${bigquery_rideshare_llm_enriched_dataset}.driver` AS driver\n", " ON customer_review.driver_id = driver.driver_id\n", " WHERE trip_id = 20857125),\n", "STRUCT(\n", " 0 AS temperature,\n", " 50 AS max_output_tokens,\n", " 0 AS top_p,\n", " 1 AS top_k\n", " ));" ] }, { "cell_type": "markdown", "id": "2NVlNOYsxZUK", "metadata": { "id": "2NVlNOYsxZUK" }, "source": [ "**Understanding our LLM parameters**\n", "- <a href=\"https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/text#request_body\">Documentation Link</a>\n", "\n", "|Parameter|Range|Description|\n", "|----|----|----|\n", "|temperature|0.0 -> 1.0|Temperature controls the degree of randomness in token selection. Lower temperatures are good for prompts that require a less open-ended or creative response, while higher temperatures can lead to more diverse or creative results. A temperature of 0 means that the highest probability tokens are always selected. In this case, responses for a given prompt are mostly deterministic, but a small amount of variation is still possible.|\n", "|max_output_tokens|1 -> 1024|Sets the maximum number of tokens that the model outputs. Specify a lower value for shorter responses and a higher value for longer responses. The default is 50. A token is approximately four characters.|\n", "|top_p|0.0 -> 1.0|Specify a lower value for less random responses and a higher value for more random responses. This value sets a threshold probability and selects the smallest set of tokens whose cumulative probability exceeds the threshold.|\n", "|top_k|1 -> 40|Specify a lower value for less random responses and a higher value for more random responses. A `top_k` of 1 means the selected token is the most probable among all tokens in the model's vocabulary (also called greedy decoding). In contrast, a `top_k` of 3 means that the next token is selected from the top 3 most probable tokens (using temperature). For each token selection step, the `top_k` tokens with the highest probabilities are sampled. Then tokens are further filtered based on `top_p` with the final token selected using temperature sampling.|\n" ] }, { "cell_type": "markdown", "id": "ccfXzEVH1CyV", "metadata": { "id": "ccfXzEVH1CyV" }, "source": [] }, { "cell_type": "markdown", "id": "mUcOlBni1PMv", "metadata": { "id": "mUcOlBni1PMv" }, "source": [ "## **Step 04: Extract Themes from Customer Reviews**" ] }, { "cell_type": "markdown", "id": "bMZli9Or1WQd", "metadata": { "id": "bMZli9Or1WQd" }, "source": [ "### Summary" ] }, { "cell_type": "markdown", "id": "xfP_7XWi1XV-", "metadata": { "id": "xfP_7XWi1XV-" }, "source": [ "- We want to use our customer reviews to understand our customer preferences. - At the same time we can use this information to understand what our drivers are doing. \n", "- For instance if customers consistently complain about a driver's car being too cold, we can extract that the driver keeps their car cold. \n", "- If the same customer mentions that they are typically cold we can extract that the customer likes a warmer car.\n", "\n", "*The complete code is in notebook: rideshare_llm_step_02_driver_themes and rideshare_llm_step_03_customer_themes*" ] }, { "cell_type": "markdown", "id": "K4SlVuD51Y7e", "metadata": { "id": "K4SlVuD51Y7e" }, "source": [ "### Code (Driver Theme Extraction)" ] }, { "cell_type": "markdown", "id": "JBARFNJYBIiV", "metadata": { "id": "JBARFNJYBIiV" }, "source": [ "Create a **Qualitative** data analysis for **Driver** Habits\n", " - We have found some patterns that we want to dive deeper into:\n", " - trunk space\n", " - driving speed\n", " - hours worked\n", " - preferred pickup locations\n", " - average trip distance\n", " - crossing state lines\n", " - vechicle cleanliness\n", " - vechile temperature\n", " - maximum passengers\n", " - conversation with customer\n", " - music playing\n", " - distracted driver\n", " - target pay\n", "\n", "*The complete code is in notebook: rideshare_llm_step_02_driver_themes* " ] }, { "cell_type": "code", "execution_count": null, "id": "m9n3rfLb2wIl", "metadata": { "id": "m9n3rfLb2wIl" }, "outputs": [], "source": [ "%%bigquery\n", "\n", "-- View the customer review\n", "\n", "SELECT customer.customer_name,\n", " driver.driver_name,\n", " customer_review.customer_review_text,\n", " FROM `${project_id}.${bigquery_rideshare_llm_enriched_dataset}.customer_review` AS customer_review\n", " INNER JOIN `${project_id}.${bigquery_rideshare_llm_enriched_dataset}.customer` AS customer\n", " ON customer_review.customer_id = customer.customer_id\n", " INNER JOIN `${project_id}.${bigquery_rideshare_llm_enriched_dataset}.driver` AS driver\n", " ON customer_review.driver_id = driver.driver_id\n", " WHERE trip_id = 15369855;" ] }, { "cell_type": "code", "execution_count": null, "id": "UZx7T3WE9Kej", "metadata": { "id": "UZx7T3WE9Kej" }, "outputs": [], "source": [ "%%bigquery\n", "\n", "SELECT JSON_VALUE(ml_generate_text_result, '$.candidates[0].content.parts[0].text') AS result,\n", " prompt,\n", " ml_generate_text_result\n", " FROM ML.GENERATE_TEXT(MODEL`${project_id}.${bigquery_rideshare_llm_curated_dataset}.gemini_model`,\n", " (SELECT CONCAT(\n", "\"\"\"\n", "Classify the text as one or more of the following categories and return in the below json format.\n", "- \"trunk space small\"\n", "- \"trunk space large\"\n", "- \"driving too fast\"\n", "- \"driving too slow\"\n", "- \"clean car\"\n", "- \"dirty car\"\n", "- \"car too hot\"\n", "- \"car too cold\"\n", "- \"driver likes conversation\"\n", "- \"driver likes no conversation\"\n", "- \"driver likes music\"\n", "- \"driver likes no music\"\n", "- \"distracted driver\"\n", "\n", "JSON format: [ \"value\" ] \n", "Sample JSON Response: [ \"dirty car\", \"car too cold\" ] \n", "\n", "\n", "Text:\n", "\"\"\", customer_review_text) AS prompt\n", " FROM `${project_id}.${bigquery_rideshare_llm_enriched_dataset}.customer_review`\n", " WHERE trip_id = 15369855),\n", "STRUCT(\n", " 0 AS temperature,\n", " 1024 AS max_output_tokens,\n", " 0 AS top_p,\n", " 1 AS top_k\n", " ))" ] }, { "cell_type": "markdown", "id": "nZH_mOGIAW48", "metadata": { "id": "nZH_mOGIAW48" }, "source": [ "### Code (Customer Theme Extraction)" ] }, { "cell_type": "markdown", "id": "06c1gEAm_o87", "metadata": { "id": "06c1gEAm_o87" }, "source": [ "Create a **Qualitative** data analysis for **Customer** Habits\n", " - We have found some patterns that we want to dive deeper into:\n", " - trunk space\n", " - driving speed\n", " - vechicle cleanliness\n", " - vechile temperature\n", " - conversation\n", " - music playing\n", " - distracted driver\n", "\n", "*The complete code is in notebook: rideshare_llm_step_03_customer_themes*\n" ] }, { "cell_type": "code", "execution_count": null, "id": "dgj6URxuAPq7", "metadata": { "id": "dgj6URxuAPq7" }, "outputs": [], "source": [ "%%bigquery\n", "\n", "-- Same review as the driver, but now from the customers standpoint of view\n", "\n", "SELECT JSON_VALUE(ml_generate_text_result, '$.candidates[0].content.parts[0].text') AS result,\n", " prompt,\n", " ml_generate_text_result\n", " FROM ML.GENERATE_TEXT(MODEL`${project_id}.${bigquery_rideshare_llm_curated_dataset}.gemini_model`,\n", " (SELECT CONCAT(\n", "\"\"\"\n", "Classify the customer review as one or more of the following categories and return in the below json format.\n", "- \"customer has small luggage\"\n", "- \"customer has large luggage\"\n", "- \"customer likes to drive fast\"\n", "- \"customer likes to drive slow\"\n", "- \"customer likes a clean car\"\n", "- \"customer likes the temperature warm\"\n", "- \"customer likes the temperature cold\"\n", "- \"customer likes conversation\"\n", "- \"customer likes no conversation\"\n", "- \"customer likes music\"\n", "- \"customer likes quiet\"\n", "\n", "JSON format: [ \"value\" ] \n", "Sample JSON Response: [ \"customer likes music\", \"customer likes a clean car\" ] \n", "\n", "Review:\n", "\"\"\", customer_review_text) AS prompt\n", " FROM `${project_id}.${bigquery_rideshare_llm_enriched_dataset}.customer_review`\n", " WHERE trip_id = 15369855),\n", "STRUCT(\n", " 0 AS temperature,\n", " 1024 AS max_output_tokens,\n", " 0 AS top_p,\n", " 1 AS top_k\n", " ))" ] }, { "cell_type": "markdown", "id": "LksdqAMuB25B", "metadata": { "id": "LksdqAMuB25B" }, "source": [ "## **Step 05: Summarize Themes**" ] }, { "cell_type": "markdown", "id": "3wmau8B7DDsi", "metadata": { "id": "3wmau8B7DDsi" }, "source": [ "### Summary" ] }, { "cell_type": "markdown", "id": "qbSsJzHIDFfT", "metadata": { "id": "qbSsJzHIDFfT" }, "source": [ "Now that we have extracted all the themes for both the driver and customer poiint of view:\n", "- Each theme needs to mapped to a category (e.g. music)\n", " - We might have 10 \"likes music on\"\n", " - We might have 50 \"likes music off\"\n", "- For each category (music) determine the prevailing preference\n", " - Since we have 50 \"likes music off\" we will use that as our \"winner\"\n", "- Write a summary using the LLM to process this data so we can review in just a few sentences.\n", "\n", "*The complete code is in notebook: rideshare_llm_step_04_driver_summary and rideshare_llm_step_05_customer_summary*" ] }, { "cell_type": "markdown", "id": "EIJWNWHKDGG_", "metadata": { "id": "EIJWNWHKDGG_" }, "source": [ "### Code (Driver Theme Summary)" ] }, { "cell_type": "markdown", "id": "dgapHJDlIpxt", "metadata": { "id": "dgapHJDlIpxt" }, "source": [ "Process the themes for each driver by passing their attibutes to the LLM.\n", "\n", "*The complete code is in notebook: rideshare_llm_step_04_driver_summary*" ] }, { "cell_type": "code", "execution_count": null, "id": "i0mA2PKIDvO9", "metadata": { "id": "i0mA2PKIDvO9" }, "outputs": [], "source": [ "%%bigquery\n", "\n", "-- After we have determined the prevailing attribute for our Driver, combine them so we can create a LLM Prompt\n", "\n", "SELECT driver_id,\n", " STRING_AGG(\n", " CASE WHEN TRIM(extracted_driver_attribute) = 'driver likes music' THEN 'Attribute: the driver likes the radio on\\n'\n", " WHEN TRIM(extracted_driver_attribute) = 'driver likes no music' THEN 'Attribute: the driver likes the radio off\\n'\n", " WHEN TRIM(extracted_driver_attribute) = 'trunk space large' THEN 'Attribute: the driver has a large amount of trunk space\\n'\n", " WHEN TRIM(extracted_driver_attribute) = 'trunk space small' THEN 'Attribute: the driver has a small trunk\\n'\n", " WHEN TRIM(extracted_driver_attribute) = 'Category: trunk space small' THEN 'Attribute: the driver has a small trunk\\n'\n", " WHEN TRIM(extracted_driver_attribute) = 'driver likes conversation' THEN 'Attribute: the driver likes to have a conversation with the passengers\\n'\n", " WHEN TRIM(extracted_driver_attribute) = 'driver likes no conversation' THEN 'Attribute: the driver does not like to talk to the passengers\\n'\n", " WHEN TRIM(extracted_driver_attribute) = 'driving too fast' THEN 'Attribute: the driver tends to drive too fast\\n'\n", " WHEN TRIM(extracted_driver_attribute) = 'driving too slow' THEN 'Attribute: the driver drives too slow\\n'\n", " WHEN TRIM(extracted_driver_attribute) = 'clean car' THEN 'Attribute: the driver keeps their car clean on the inside and out\\n'\n", " WHEN TRIM(extracted_driver_attribute) = 'dirty car' THEN 'Attribute: the driver has a dirty car\\n'\n", " WHEN TRIM(extracted_driver_attribute) = 'car too hot' THEN 'Attribute: the driver has a keeps their car too warm inside\\n'\n", " WHEN TRIM(extracted_driver_attribute) = 'car too cold' THEN 'Attribute: the driver keeps their car too cold\\n'\n", " WHEN TRIM(extracted_driver_attribute) = 'distracted driver' THEN 'Attribute: the driver is a distracted driver\\n'\n", " WHEN TRIM(extracted_driver_attribute) = 'safe driver' THEN 'Attribute: the driver is a safe driver\\n'\n", " ELSE ''\n", " END,'') AS driver_attribute_agg\n", " FROM `${project_id}.${bigquery_rideshare_llm_enriched_dataset}.driver_attribute`\n", "GROUP BY driver_id\n", "LIMIT 5;" ] }, { "cell_type": "markdown", "id": "eClNNTAeO3HJ", "metadata": { "id": "eClNNTAeO3HJ" }, "source": [ "Create the Driver Attibute Summary" ] }, { "cell_type": "code", "execution_count": null, "id": "Y9NHt32RElaQ", "metadata": { "id": "Y9NHt32RElaQ" }, "outputs": [], "source": [ "%%bigquery\n", "\n", "-- We will take the concatenated list of attributes and run them through our LLM to create a Driver Theme/Attibute Summary\n", "\n", "SELECT JSON_VALUE(ml_generate_text_result, '$.candidates[0].content.parts[0].text') AS result,\n", " FROM ML.GENERATE_TEXT(MODEL`${project_id}.${bigquery_rideshare_llm_curated_dataset}.gemini_model`,\n", " (SELECT\n", "\"\"\"\n", "Write a 20 to 500 word summary for the following attributes.\n", "1. Randomly sort the attributes\n", "2. Select the first 3 to 5 attributes\n", "3. Write the summary in present tense for only the first 3 to 5 attributes\n", "4. The driver's name is Danilo Gottardi\n", "5. Use sentences with varying lengths\n", "6. Write 2 to 5 sentences\n", "\n", "Attribute: the driver likes the radio off\n", "Attribute: the driver has a small trunk\n", "Attribute: the driver does not like to talk to the passengers\n", "Attribute: the driver drives too slow\n", "Attribute: the driver has a dirty car\n", "Attribute: the driver keeps their car too cold\n", "Attribute: the driver is a distracted driver\n", "\"\"\" AS prompt),\n", "STRUCT(\n", " -- Set the parameters for a more creative/random responses\n", " -- You can adjust these during the demo\n", " .5 AS temperature, -- Change this to Zero and One to see the results (demo default is .5)\n", " 1024 AS max_output_tokens,\n", " 1 AS top_p,\n", " 40 AS top_k\n", " ));" ] }, { "cell_type": "markdown", "id": "hmMmFaFWHsHi", "metadata": { "id": "hmMmFaFWHsHi" }, "source": [ "**Demo:** Rerun the above SQL changing the temperature parameter to see the varying responses." ] }, { "cell_type": "markdown", "id": "WorrEPQzIg4r", "metadata": { "id": "WorrEPQzIg4r" }, "source": [ "### Code (Customer Theme Summary)" ] }, { "cell_type": "markdown", "id": "UCA2FrhpI4q1", "metadata": { "id": "UCA2FrhpI4q1" }, "source": [ "Process the themes for each customer by passing their attibutes to the LLM.\n", "\n", "*The complete code is in notebook: rideshare_llm_step_05_customer_summary*" ] }, { "cell_type": "code", "execution_count": null, "id": "eu5zWoWeI478", "metadata": { "id": "eu5zWoWeI478" }, "outputs": [], "source": [ "%%bigquery\n", "\n", "-- View our aggregated customer preferences\n", "\n", "SELECT customer_id,\n", " STRING_AGG(extracted_customer_attribute,', ') AS customer_attribute_agg\n", " FROM `${project_id}.${bigquery_rideshare_llm_enriched_dataset}.customer_attribute`\n", "WHERE customer_id IN (3760, 549, 2672, 3025)\n", "GROUP BY customer_id;" ] }, { "cell_type": "markdown", "id": "bNeLnSMtO-M6", "metadata": { "id": "bNeLnSMtO-M6" }, "source": [ "Create the Customer Attibute Summary" ] }, { "cell_type": "code", "execution_count": null, "id": "Ry0xtGVgJ7p4", "metadata": { "id": "Ry0xtGVgJ7p4" }, "outputs": [], "source": [ "%%bigquery\n", "\n", "-- We will take the concatenated list of attributes and run them through our LLM to create a Customer Theme/Attibute Summary\n", "-- Customer Id: 3760\n", "\n", "SELECT JSON_VALUE(ml_generate_text_result, '$.candidates[0].content.parts[0].text') AS result,\n", " FROM ML.GENERATE_TEXT(MODEL`${project_id}.${bigquery_rideshare_llm_curated_dataset}.gemini_model`,\n", " (SELECT\n", "\"\"\"\n", "Write a 50 to 600 word summary for the following customer preferences.\n", "1. The customer's name is Paulina Peruzzi.\n", "2. Write the summary in present tense.\n", "3. Write the summary from the customers prespective.\n", "4. Do not repeat the same subject in the summary.\n", "5. Write 3 to 6 sentences.\n", "\n", "Preference: likes a clean car.\n", "Preference: likes a warm vehicle inside.\n", "Preference: likes to have a conversation.\n", "Preference: likes their driver to drive slow.\n", "Preference: prefers the radio off.\n", "\"\"\" AS prompt),\n", "STRUCT(\n", " -- Set the parameters for a more creative/random responses\n", " -- You can adjust these during the demo\n", " .75 AS temperature, -- Change this to Zero and One to see the results (demo default is .75)\n", " 1024 AS max_output_tokens,\n", " 1 AS top_p,\n", " 40 AS top_k\n", " ));" ] }, { "cell_type": "markdown", "id": "gOgqG_F2MVcv", "metadata": { "id": "gOgqG_F2MVcv" }, "source": [ "**Demo:** Rerun the above SQL changing the temperature parameter to see the varying responses." ] }, { "cell_type": "markdown", "id": "s3L2_59nPG3H", "metadata": { "id": "s3L2_59nPG3H" }, "source": [ "## **Step 06: Summarize All Reviews**" ] }, { "cell_type": "markdown", "id": "DEO76K_IPQdx", "metadata": { "id": "DEO76K_IPQdx" }, "source": [ "### Summary" ] }, { "cell_type": "markdown", "id": "4y33SqPYPUzP", "metadata": { "id": "4y33SqPYPUzP" }, "source": [ "For each Driver summarize all the customer reviews for that driver.\n", "\n", "For each Customer summarize all their reviews for their various drivers.\n", "\n", "*The complete code is in notebook: rideshare_llm_step_04_driver_summary and rideshare_llm_step_05_customer_summary*" ] }, { "cell_type": "markdown", "id": "Ms2Bus35PUfJ", "metadata": { "id": "Ms2Bus35PUfJ" }, "source": [ "### Code (Driver Review Summary)" ] }, { "cell_type": "code", "execution_count": null, "id": "ulwJl82wPyGZ", "metadata": { "id": "ulwJl82wPyGZ" }, "outputs": [], "source": [ "%%bigquery\n", "\n", "SELECT CONCAT('Write a 100 to 600 word summary for the following customer reviews.\\n',\n", " '1. Write the summary in present tense.\\n',\n", " '2. Write the summary from the customers prespective.\\n',\n", " '3. Do not repeat the same subject in the summary.\\n',\n", " '4. The reviews are for the driver ', driver_name, ' at a rideshare company.\\n',\n", " '5. The reviews are written by different customers.\\n',\n", " '6. Write 3 to 6 sentences.\\n',\n", " customer_review_agg) AS prompt\n", " FROM (\n", " SELECT customer_review.driver_id,\n", " STRING_AGG(CONCAT(\"Review: \",customer_review_text),'\\n') AS customer_review_agg\n", " FROM (SELECT driver_id,\n", " customer_review_text\n", " FROM (SELECT trip.driver_id,\n", " trip.pickup_time,\n", " customer_review.customer_review_text,\n", " ROW_NUMBER() OVER (PARTITION BY trip.driver_id ORDER BY trip.pickup_time) AS row_nbr\n", " FROM `${project_id}.${bigquery_rideshare_llm_enriched_dataset}.customer_review` AS customer_review\n", " INNER JOIN `${project_id}.${bigquery_rideshare_llm_enriched_dataset}.trip` AS trip\n", " ON customer_review.trip_id = trip.trip_id\n", " AND customer_review.driver_id = 23 -- Change this for demoing different drivers\n", " ) AS reviews\n", " WHERE row_nbr < 50 -- top 50 most recent\n", " ) AS customer_review\n", " GROUP BY driver_id\n", "\n", " ) AS top_reviews\n", " INNER JOIN `${project_id}.${bigquery_rideshare_llm_enriched_dataset}.driver` AS driver\n", " ON driver.driver_id = top_reviews.driver_id;" ] }, { "cell_type": "code", "execution_count": null, "id": "fBL1ycN_P8GL", "metadata": { "id": "fBL1ycN_P8GL" }, "outputs": [], "source": [ "%%bigquery\n", "\n", "-- Driver\n", "\n", "SELECT JSON_VALUE(ml_generate_text_result, '$.candidates[0].content.parts[0].text') AS result,\n", " FROM ML.GENERATE_TEXT(MODEL`${project_id}.${bigquery_rideshare_llm_curated_dataset}.gemini_model`,\n", " (\n", " SELECT CONCAT('Write a 100 to 600 word summary for the following customer reviews.\\n',\n", " '1. Write the summary in present tense.\\n',\n", " '2. Write the summary from the customers prespective.\\n',\n", " '3. Do not repeat the same subject in the summary.\\n',\n", " '4. The reviews are for the driver ', driver_name, ' at a rideshare company.\\n',\n", " '5. The reviews are written by different customers.\\n',\n", " '6. Write 3 to 6 sentences.\\n',\n", " customer_review_agg) AS prompt\n", " FROM (\n", " SELECT customer_review.driver_id,\n", " STRING_AGG(CONCAT(\"Review: \",customer_review_text),'\\n') AS customer_review_agg\n", " FROM (SELECT driver_id,\n", " customer_review_text\n", " FROM (SELECT trip.driver_id,\n", " trip.pickup_time,\n", " customer_review.customer_review_text,\n", " ROW_NUMBER() OVER (PARTITION BY trip.driver_id ORDER BY trip.pickup_time) AS row_nbr\n", " FROM `${project_id}.${bigquery_rideshare_llm_enriched_dataset}.customer_review` AS customer_review\n", " INNER JOIN `${project_id}.${bigquery_rideshare_llm_enriched_dataset}.trip` AS trip\n", " ON customer_review.trip_id = trip.trip_id\n", " AND customer_review.driver_id = 23 -- Change this for demoing different drivers\n", " ) AS reviews\n", " WHERE row_nbr < 50 -- top 50 most recent\n", " ) AS customer_review\n", " GROUP BY driver_id\n", " ) AS top_reviews\n", " INNER JOIN `${project_id}.${bigquery_rideshare_llm_enriched_dataset}.driver` AS driver\n", " ON driver.driver_id = top_reviews.driver_id),\n", "STRUCT(\n", " -- Set the parameters for a more creative/random responses\n", " -- You can adjust these during the demo\n", " .75 AS temperature, -- Change this to Zero and One to see the results (demo default is .75)\n", " 1024 AS max_output_tokens,\n", " 1 AS top_p,\n", " 40 AS top_k\n", " ));" ] }, { "cell_type": "markdown", "id": "44Kx1w_qPeDz", "metadata": { "id": "44Kx1w_qPeDz" }, "source": [ "### Code (Customer Review Summary)" ] }, { "cell_type": "code", "execution_count": null, "id": "RADTgOuKPzCS", "metadata": { "id": "RADTgOuKPzCS" }, "outputs": [], "source": [ "%%bigquery\n", "\n", "SELECT CONCAT('Write a 100 to 600 word summary for the following customer reviews.\\n',\n", " '1. The reviews are written by ', customer.customer_name, '.\\n',\n", " '2. Write the summary in present tense.\\n',\n", " '3. Do not repeat the same subject in the summary.\\n',\n", " '4. The reviews are for different drivers.\\n',\n", " '5. The reviews are a single rideshare company.\\n',\n", " '6. The drivers all work for the rideshare company.\\n',\n", " '7. Write 3 to 6 sentences.\\n',\n", " customer_review_agg)\n", " FROM (SELECT customer_id,\n", " STRING_AGG(CONCAT('Review: ',customer_review_text,'\\n'),'') AS customer_review_agg\n", " FROM `${project_id}.${bigquery_rideshare_llm_enriched_dataset}.customer_review`\n", " WHERE customer_id = 3556\n", " GROUP BY customer_id) AS customer_review\n", " INNER JOIN `${project_id}.${bigquery_rideshare_llm_enriched_dataset}.customer` AS customer\n", " ON customer.customer_id = customer_review.customer_id;" ] }, { "cell_type": "code", "execution_count": null, "id": "f4x2LGjfSinV", "metadata": { "id": "f4x2LGjfSinV" }, "outputs": [], "source": [ "%%bigquery\n", "\n", "-- Customer\n", "\n", "SELECT JSON_VALUE(ml_generate_text_result, '$.candidates[0].content.parts[0].text') AS result,\n", " FROM ML.GENERATE_TEXT(MODEL`${project_id}.${bigquery_rideshare_llm_curated_dataset}.gemini_model`,\n", " (SELECT CONCAT('Write a 100 to 600 word summary for the following customer reviews.\\n',\n", " '1. The reviews are written by ', customer.customer_name, '.\\n',\n", " '2. Write the summary in present tense.\\n',\n", " '3. Do not repeat the same subject in the summary.\\n',\n", " '4. The reviews are for different drivers.\\n',\n", " '5. The reviews are a single rideshare company.\\n',\n", " '6. The drivers all work for the rideshare company.\\n',\n", " '7. Write 3 to 6 sentences.\\n',\n", " customer_review_agg) AS prompt\n", " FROM (SELECT customer_id,\n", " STRING_AGG(CONCAT('Review: ',customer_review_text,'\\n'),'') AS customer_review_agg\n", " FROM `${project_id}.${bigquery_rideshare_llm_enriched_dataset}.customer_review`\n", " WHERE customer_id = 3556\n", " GROUP BY customer_id) AS customer_review\n", " INNER JOIN `${project_id}.${bigquery_rideshare_llm_enriched_dataset}.customer` AS customer\n", " ON customer.customer_id = customer_review.customer_id),\n", "STRUCT(\n", " -- Set the parameters for a more creative/random responses\n", " -- You can adjust these during the demo\n", " 1 AS temperature, -- Change this to 0, .75 and 1 to see the results (demo default is 1)\n", " 1024 AS max_output_tokens,\n", " 1 AS top_p,\n", " 40 AS top_k\n", " ));" ] }, { "cell_type": "markdown", "id": "iUg7kigqTlh1", "metadata": { "id": "iUg7kigqTlh1" }, "source": [ "## **Step 07: Perform Quantitative analysis and summarize with LLM**" ] }, { "cell_type": "markdown", "id": "wos_HI-OT3Df", "metadata": { "id": "wos_HI-OT3Df" }, "source": [ "### Summary" ] }, { "cell_type": "markdown", "id": "jbPArD7HT6pP", "metadata": { "id": "jbPArD7HT6pP" }, "source": [ "Using a our trip data, it was summed, averaged and analyized to find some common patterns of drivers. The data is then passed to the LLM to create a summary.\n", "\n", "To see how the quanitative data was created please see the stored procedure\n", "- ${project_id}.${bigquery_rideshare_llm_enriched_dataset}.sp_step_01_quantitative_analysis\n", "\n", "*The complete code is in notebook: rideshare_llm_step_06_driver_quantitative_analysis and rideshare_llm_step_07_customer_quantitative_analysis*" ] }, { "cell_type": "markdown", "id": "1tpCXW_1T7mu", "metadata": { "id": "1tpCXW_1T7mu" }, "source": [ "### Code (Driver Quantitative Analysis)" ] }, { "cell_type": "code", "execution_count": null, "id": "HnArKqNOTzlY", "metadata": { "id": "HnArKqNOTzlY" }, "outputs": [], "source": [ "%%bigquery\n", "SELECT driver_quantitative_analysis_prompt\n", " FROM `${project_id}.${bigquery_rideshare_llm_enriched_dataset}.driver` AS driver\n", " WHERE driver_id = 66;" ] }, { "cell_type": "code", "execution_count": null, "id": "cqkfn7MIUUl2", "metadata": { "id": "cqkfn7MIUUl2" }, "outputs": [], "source": [ "%%bigquery\n", "\n", "SELECT JSON_VALUE(ml_generate_text_result, '$.candidates[0].content.parts[0].text') AS result,\n", " FROM ML.GENERATE_TEXT(MODEL`${project_id}.${bigquery_rideshare_llm_curated_dataset}.gemini_model`,\n", " (SELECT\n", "\"\"\"\n", "Write a 3 to 8 sentence summary of the following attributes of a driver in third person gender neutral form:\n", "- The driver picks up customers at 4 pickup locations. This is an average number. These locations are: Midtown South, Elmhurst, University Heights/Morris Heights, Brownsville.\n", "- The driver is not willing to drive accross state lines.\n", "- The driver typically does not pickup or dropoff at the airport.\n", "- The driver only works on weekdays.\n", "- The driver likes to work a split shift which appears to target rush hour.\n", "\"\"\" AS prompt),\n", "STRUCT(\n", " -- Set the parameters for a more creative/random responses\n", " -- You can adjust these during the demo\n", " .8 AS temperature, -- Change this to Zero and One to see the results (demo default is .8)\n", " 1024 AS max_output_tokens,\n", " .70 AS top_p,\n", " 25 AS top_k\n", " ));" ] }, { "cell_type": "markdown", "id": "rMlYOIw0Vg2n", "metadata": { "id": "rMlYOIw0Vg2n" }, "source": [ "### Code (Customer Quantitative Analysis)" ] }, { "cell_type": "code", "execution_count": null, "id": "c1BsBoU4VlD_", "metadata": { "id": "c1BsBoU4VlD_" }, "outputs": [], "source": [ "%%bigquery\n", "\n", "SELECT customer_quantitative_analysis_prompt\n", " FROM `${project_id}.${bigquery_rideshare_llm_enriched_dataset}.customer` AS customer\n", " WHERE customer_id = 3219;" ] }, { "cell_type": "code", "execution_count": null, "id": "HWbs3wpGWgJR", "metadata": { "id": "HWbs3wpGWgJR" }, "outputs": [], "source": [ "%%bigquery\n", "\n", "SELECT JSON_VALUE(ml_generate_text_result, '$.candidates[0].content.parts[0].text') AS result,\n", " FROM ML.GENERATE_TEXT(MODEL`${project_id}.${bigquery_rideshare_llm_curated_dataset}.gemini_model`,\n", " (SELECT\n", "\"\"\"\n", "Write a 2 to 3 sentence summary of the following attributes of a customer who uses a rideshare services.\n", "- Martin Hernandez uses the service on weekdays.\n", "- Martin Hernandez likes to use the service during the morning and afternoon rush hours.\n", "\"\"\" AS prompt),\n", "STRUCT(\n", " -- Set the parameters for a more creative/random responses\n", " -- You can adjust these during the demo\n", " .8 AS temperature, -- Change this to Zero and One to see the results (demo default is .8)\n", " 1024 AS max_output_tokens,\n", " .70 AS top_p,\n", " 25 AS top_k\n", " ));" ] }, { "cell_type": "markdown", "id": "xbEe-y9ccy27", "metadata": { "id": "xbEe-y9ccy27" }, "source": [ "## **Step 08: Using LLMs to correct LLMs**" ] }, { "cell_type": "markdown", "id": "HX_cmEdUc_AD", "metadata": { "id": "HX_cmEdUc_AD" }, "source": [ "### Summary" ] }, { "cell_type": "markdown", "id": "w2fJZy4cdBfV", "metadata": { "id": "w2fJZy4cdBfV" }, "source": [ "During the demo sometime the LLM output incomplete text and had grammatical errors. You can use the LLM to correct this." ] }, { "cell_type": "markdown", "id": "Ip5iw7JadMp0", "metadata": { "id": "Ip5iw7JadMp0" }, "source": [ "### Code" ] }, { "cell_type": "code", "execution_count": null, "id": "Qnp3nFurdO0S", "metadata": { "id": "Qnp3nFurdO0S" }, "outputs": [], "source": [ "%%bigquery\n", "\n", "SELECT JSON_VALUE(ml_generate_text_result, '$.candidates[0].content.parts[0].text') AS result,\n", " FROM ML.GENERATE_TEXT(MODEL`${project_id}.${bigquery_rideshare_llm_curated_dataset}.gemini_model`,\n", " (SELECT\n", "\"\"\"\n", "For the following text only make the following changes:\n", "1. Correct spelling\n", "2. Correct incomplete sentences\n", "3. Fix punctuation mistakes\n", "4. Use the correct pronouns of he/him\n", "\n", "Text: The drivers car was clean and has a large trunk for storage.. Mike driving very carefully. He like to have radio ons. They is not distracted while driving\n", "\"\"\" AS prompt),\n", "STRUCT(\n", " .1 AS temperature,\n", " 1024 AS max_output_tokens,\n", " .1 AS top_p,\n", " 10 AS top_k\n", " ));" ] } ], "metadata": { "colab": { "private_outputs": true, "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.4" } }, "nbformat": 4, "nbformat_minor": 5 }