{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "oXnEutuDQa9c"
},
"outputs": [],
"source": [
"# Copyright 2024 Google LLC\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "JAPoU8Sm5E6e"
},
"source": [
"# Getting Started with the Live API using Gen AI SDK\n",
"\n",
"\n",
"<table align=\"left\">\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/multimodal-live-api/intro_multimodal_live_api_genai_sdk.ipynb\">\n",
" <img width=\"32px\" src=\"https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg\" alt=\"Google Colaboratory logo\"><br> Open in Colab\n",
" </a>\n",
" </td>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Fmultimodal-live-api%2Fintro_multimodal_live_api_genai_sdk.ipynb\">\n",
" <img width=\"32px\" src=\"https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN\" alt=\"Google Cloud Colab Enterprise logo\"><br> Open in Colab Enterprise\n",
" </a>\n",
" </td>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/multimodal-live-api/intro_multimodal_live_api_genai_sdk.ipynb\">\n",
" <img src=\"https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg\" alt=\"Vertex AI logo\"><br> Open in Vertex AI Workbench\n",
" </a>\n",
" </td>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/multimodal-live-api/intro_multimodal_live_api_genai_sdk.ipynb\">\n",
" <img width=\"32px\" src=\"https://www.svgrepo.com/download/217753/github.svg\" alt=\"GitHub logo\"><br> View on GitHub\n",
" </a>\n",
" </td>\n",
"</table>\n",
"\n",
"<div style=\"clear: both;\"></div>\n",
"\n",
"<b>Share to:</b>\n",
"\n",
"<a href=\"https://www.linkedin.com/sharing/share-offsite/?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/multimodal-live-api/intro_multimodal_live_api_genai_sdk.ipynb\" target=\"_blank\">\n",
" <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/8/81/LinkedIn_icon.svg\" alt=\"LinkedIn logo\">\n",
"</a>\n",
"\n",
"<a href=\"https://bsky.app/intent/compose?text=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/multimodal-live-api/intro_multimodal_live_api_genai_sdk.ipynb\" target=\"_blank\">\n",
" <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/7/7a/Bluesky_Logo.svg\" alt=\"Bluesky logo\">\n",
"</a>\n",
"\n",
"<a href=\"https://twitter.com/intent/tweet?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/multimodal-live-api/intro_multimodal_live_api_genai_sdk.ipynb\" target=\"_blank\">\n",
" <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/5a/X_icon_2.svg\" alt=\"X logo\">\n",
"</a>\n",
"\n",
"<a href=\"https://reddit.com/submit?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/multimodal-live-api/intro_multimodal_live_api_genai_sdk.ipynb\" target=\"_blank\">\n",
" <img width=\"20px\" src=\"https://redditinc.com/hubfs/Reddit%20Inc/Brand/Reddit_Logo.png\" alt=\"Reddit logo\">\n",
"</a>\n",
"\n",
"<a href=\"https://www.facebook.com/sharer/sharer.php?u=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/multimodal-live-api/intro_multimodal_live_api_genai_sdk.ipynb\" target=\"_blank\">\n",
" <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/51/Facebook_f_logo_%282019%29.svg\" alt=\"Facebook logo\">\n",
"</a>\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "84f0f73a0f76"
},
"source": [
"| Authors |\n",
"| --- |\n",
"| [Eric Dong](https://github.com/gericdong) |\n",
"| [Holt Skinner](https://github.com/holtskinner) |"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "tvgnzT1CKxrO"
},
"source": [
"## Overview\n",
"\n",
"The Live API enables low-latency bidirectional voice and video interactions with Gemini. The API can process text, audio, and video input, and it can provide text and audio output. This tutorial demonstrates the following simple examples to help you get started with the Live API using the Google Gen AI SDK in Vertex AI.\n",
"\n",
"- Text-to-text generation\n",
"- Text-to-audio generation\n",
"- Text-to-audio conversation\n",
"- Function calling\n",
"- Code execution\n",
"- Google Search\n",
"- Audio transcription\n",
"\n",
"See the [Live API](https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-live) page for more details."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "gPiTOAHURvTM"
},
"source": [
"## Getting Started"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CHRZUpfWSEpp"
},
"source": [
"### Install Google Gen AI SDK for Python\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "sG3_LKsWSD3A"
},
"outputs": [],
"source": [
"%pip install --upgrade --quiet google-genai"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HlMVjiAWSMNX"
},
"source": [
"### Authenticate your notebook environment (Colab only)\n",
"\n",
"If you are running this notebook on Google Colab, run the cell below to authenticate your environment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "12fnq4V0SNV3"
},
"outputs": [],
"source": [
"import sys\n",
"\n",
"if \"google.colab\" in sys.modules:\n",
" from google.colab import auth\n",
"\n",
" auth.authenticate_user()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0Ef0zVX-X9Bg"
},
"source": [
"### Import libraries\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "xBCH3hnAX9Bh"
},
"outputs": [],
"source": [
"from IPython.display import Audio, Markdown, display\n",
"from google.genai.types import (\n",
" AudioTranscriptionConfig,\n",
" Content,\n",
" GoogleSearch,\n",
" LiveConnectConfig,\n",
" Part,\n",
" PrebuiltVoiceConfig,\n",
" SpeechConfig,\n",
" Tool,\n",
" ToolCodeExecution,\n",
" VoiceConfig,\n",
")\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "LymmEN6GSTn-"
},
"source": [
"### Set Google Cloud project information and create client\n",
"\n",
"To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).\n",
"\n",
"Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Nqwi-5ufWp_B"
},
"outputs": [],
"source": [
"# Use the environment variable if the user doesn't provide Project ID.\n",
"import os\n",
"\n",
"PROJECT_ID = \"[your-project-id]\" # @param {type: \"string\", placeholder: \"[your-project-id]\", isTemplate: true}\n",
"if not PROJECT_ID or PROJECT_ID == \"[your-project-id]\":\n",
" PROJECT_ID = str(os.environ.get(\"GOOGLE_CLOUD_PROJECT\"))\n",
"\n",
"LOCATION = os.environ.get(\"GOOGLE_CLOUD_REGION\", \"us-central1\")\n",
"\n",
"from google import genai\n",
"\n",
"client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5M7EKckIYVFy"
},
"source": [
"### Use the Gemini 2.0 Flash model\n",
"\n",
"The Live API is a new capability introduced with the [Gemini 2.0 Flash model](https://cloud.google.com/vertex-ai/generative-ai/docs/gemini-v2)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "-coEslfWPrxo"
},
"outputs": [],
"source": [
"MODEL_ID = \"gemini-2.0-flash-live-preview-04-09\" # @param {type: \"string\"}"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "b51c5ced31f7"
},
"source": [
"## Use the Live API\n",
"\n",
"The Live API is a stateful API that uses [WebSockets](https://en.wikipedia.org/wiki/WebSocket). This section shows some basic examples of how to use the Live API for text-to-text and text-to-audio generation."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Q1DE3s_LIUuE"
},
"source": [
"### **Example 1**: Text-to-text generation\n",
"\n",
"You send a text prompt and receive a text message.\n",
"\n",
"**Notes**\n",
"- A session `session` represents a single WebSocket connection between the client and the server.\n",
"- A session configuration includes the model, generation parameters, system instructions, and tools.\n",
" - `response_modalities` accepts `TEXT` or `AUDIO`.\n",
"- After a new session is initiated, the session can exchange messages with the server to\n",
" - Send text, audio, or video to the server.\n",
" - Receive audio, text, or function call responses from the server.\n",
"- When sending messages to the server, set `end_of_turn` to `True` to indicate that the server content generation should start with the currently accumulated prompt. Otherwise, the server awaits additional messages before starting generation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "NbJZzc7CIha5"
},
"outputs": [],
"source": [
"async with client.aio.live.connect(\n",
" model=MODEL_ID,\n",
" config=LiveConnectConfig(response_modalities=[\"TEXT\"]),\n",
") as session:\n",
" text_input = \"Hello? Gemini are you there?\"\n",
" display(Markdown(f\"**Input:** {text_input}\"))\n",
"\n",
" await session.send_client_content(\n",
" turns=Content(role=\"user\", parts=[Part(text=text_input)])\n",
" )\n",
"\n",
" response = []\n",
"\n",
" async for message in session.receive():\n",
" if message.text:\n",
" response.append(message.text)\n",
"\n",
" display(Markdown(f\"**Response >** {''.join(response)}\"))"
]
},
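{
"cell_type": "markdown",
"metadata": {},
"source": [
"The notes above mention that a session configuration can also carry system instructions and generation parameters, which this tutorial doesn't otherwise use. The cell below is a minimal sketch of such a configuration; treat the `system_instruction`, `temperature`, and `max_output_tokens` fields as assumptions based on the Gen AI SDK's `LiveConnectConfig` type rather than settings required by the Live API."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A sketch of a richer session configuration (field names assumed from LiveConnectConfig).\n",
"config = LiveConnectConfig(\n",
"    response_modalities=[\"TEXT\"],\n",
"    # System instructions steer the model for the whole session.\n",
"    system_instruction=Content(parts=[Part(text=\"You are a concise assistant.\")]),\n",
"    # Generation parameters (assumed fields) control sampling and response length.\n",
"    temperature=0.4,\n",
"    max_output_tokens=512,\n",
")\n",
"\n",
"async with client.aio.live.connect(model=MODEL_ID, config=config) as session:\n",
"    text_input = \"Give me one fun fact about WebSockets.\"\n",
"    display(Markdown(f\"**Input:** {text_input}\"))\n",
"\n",
"    await session.send_client_content(\n",
"        turns=Content(role=\"user\", parts=[Part(text=text_input)])\n",
"    )\n",
"\n",
"    response = []\n",
"    async for message in session.receive():\n",
"        if message.text:\n",
"            response.append(message.text)\n",
"\n",
"    display(Markdown(f\"**Response >** {''.join(response)}\"))"
]
},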
{
"cell_type": "markdown",
"metadata": {
"id": "cG3346aA9sRR"
},
"source": [
"### **Example 2**: Text-to-audio generation\n",
"\n",
"You send a text prompt and receive a model response in audio.\n",
"\n",
"- The Live API supports the following voices:\n",
" - `Puck`\n",
" - `Charon`\n",
" - `Kore`\n",
" - `Fenrir`\n",
" - `Aoede`\n",
"- To specify a voice, set the `voice_name` within the `speech_config` object, as part of your session configuration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Iz3OkQ-a51QM"
},
"outputs": [],
"source": [
"voice_name = \"Aoede\" # @param [\"Aoede\", \"Puck\", \"Charon\", \"Kore\", \"Fenrir\", \"Leda\", \"Orus\", \"Zephyr\"]\n",
"\n",
"config = LiveConnectConfig(\n",
" response_modalities=[\"AUDIO\"],\n",
" speech_config=SpeechConfig(\n",
" voice_config=VoiceConfig(\n",
" prebuilt_voice_config=PrebuiltVoiceConfig(\n",
" voice_name=voice_name,\n",
" )\n",
" ),\n",
" ),\n",
")\n",
"\n",
"async with client.aio.live.connect(\n",
" model=MODEL_ID,\n",
" config=config,\n",
") as session:\n",
" text_input = \"Hello? Gemini are you there?\"\n",
" display(Markdown(f\"**Input:** {text_input}\"))\n",
"\n",
" await session.send_client_content(\n",
" turns=Content(role=\"user\", parts=[Part(text=text_input)])\n",
" )\n",
"\n",
" audio_data = []\n",
" async for message in session.receive():\n",
" if (\n",
" message.server_content.model_turn\n",
" and message.server_content.model_turn.parts\n",
" ):\n",
" for part in message.server_content.model_turn.parts:\n",
" if part.inline_data:\n",
" audio_data.append(\n",
" np.frombuffer(part.inline_data.data, dtype=np.int16)\n",
" )\n",
"\n",
" if audio_data:\n",
" display(Audio(np.concatenate(audio_data), rate=24000, autoplay=True))"
]
},
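{
"cell_type": "markdown",
"metadata": {},
"source": [
"The audio chunks returned by the API are 16-bit PCM samples at a 24 kHz sample rate, which is why the cell above reads them with `dtype=np.int16` and plays them with `rate=24000`. If you want to keep the generated audio, the following sketch writes the concatenated samples from the previous cell to a mono WAV file using Python's standard `wave` module (the file name is arbitrary)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import wave\n",
"\n",
"# Write the 16-bit PCM samples collected above to a mono WAV file at 24 kHz.\n",
"# Assumes `audio_data` is the list of int16 arrays produced by the previous cell.\n",
"if audio_data:\n",
"    samples = np.concatenate(audio_data)\n",
"    with wave.open(\"gemini_response.wav\", \"wb\") as wf:\n",
"        wf.setnchannels(1)  # mono\n",
"        wf.setsampwidth(2)  # 2 bytes per sample (int16)\n",
"        wf.setframerate(24000)  # 24 kHz output sample rate\n",
"        wf.writeframes(samples.tobytes())"
]
},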
{
"cell_type": "markdown",
"metadata": {
"id": "JOBlWf566HOx"
},
"source": [
"### **Example 3**: Text-to-audio conversation\n",
"\n",
"**Step 1**: You set up a conversation with the API that allows you to send text prompts and receive audio responses.\n",
"\n",
"**Notes**\n",
"\n",
"- While the model keeps track of in-session interactions, explicit session history accessible through the API isn't available yet. When a session is terminated the corresponding context is erased."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "bhY0P0qpRP5y"
},
"outputs": [],
"source": [
"config = LiveConnectConfig(response_modalities=[\"AUDIO\"])\n",
"\n",
"\n",
"async def main() -> None:\n",
" async with client.aio.live.connect(model=MODEL_ID, config=config) as session:\n",
"\n",
" async def send() -> bool:\n",
" text_input = input(\"Input > \")\n",
" if text_input.lower() in (\"q\", \"quit\", \"exit\"):\n",
" return False\n",
" await session.send_client_content(\n",
" turns=Content(role=\"user\", parts=[Part(text=text_input)])\n",
" )\n",
"\n",
" return True\n",
"\n",
" async def receive() -> None:\n",
"\n",
" audio_data = []\n",
"\n",
" async for message in session.receive():\n",
" if (\n",
" message.server_content.model_turn\n",
" and message.server_content.model_turn.parts\n",
" ):\n",
" for part in message.server_content.model_turn.parts:\n",
" if part.inline_data:\n",
" audio_data.append(\n",
" np.frombuffer(part.inline_data.data, dtype=np.int16)\n",
" )\n",
"\n",
" if message.server_content.turn_complete:\n",
" display(Markdown(\"**Response >**\"))\n",
" display(\n",
" Audio(np.concatenate(audio_data), rate=24000, autoplay=True)\n",
" )\n",
" break\n",
"\n",
" return\n",
"\n",
" while True:\n",
" if not await send():\n",
" break\n",
" await receive()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "94IeUUb3e90M"
},
"source": [
"**Step 2** Run the chat, input your prompts, or type `q`, `quit` or `exit` to exit.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "2UvgUDIYJqfw"
},
"outputs": [],
"source": [
"await main()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "907da7836dcf"
},
"source": [
"### **Example 4**: Function calling\n",
"\n",
"You can use function calling to create a description of a function, then pass that description to the model in a request. The response from the model includes the name of a function that matches the description and the arguments to call it with.\n",
"\n",
"**Notes**:\n",
"\n",
"- All functions must be declared at the start of the session by sending tool definitions.\n",
"- Currently only one tool is supported in the API."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "0f4657af21e3"
},
"outputs": [],
"source": [
"def get_current_weather(location: str) -> str:\n",
" \"\"\"Example method. Returns the current weather.\n",
"\n",
" Args:\n",
" location: The city and state, e.g. San Francisco, CA\n",
" \"\"\"\n",
" weather_map: dict[str, str] = {\n",
" \"Boston, MA\": \"snowing\",\n",
" \"San Francisco, CA\": \"foggy\",\n",
" \"Seattle, WA\": \"raining\",\n",
" \"Austin, TX\": \"hot\",\n",
" \"Chicago, IL\": \"windy\",\n",
" }\n",
" return weather_map.get(location, \"unknown\")\n",
"\n",
"\n",
"config = LiveConnectConfig(\n",
" response_modalities=[\"TEXT\"],\n",
" tools=[get_current_weather],\n",
")\n",
"\n",
"async with client.aio.live.connect(\n",
" model=MODEL_ID,\n",
" config=config,\n",
") as session:\n",
" text_input = \"Get the current weather in Boston.\"\n",
" display(Markdown(f\"**Input:** {text_input}\"))\n",
"\n",
" await session.send_client_content(\n",
" turns=Content(role=\"user\", parts=[Part(text=text_input)])\n",
" )\n",
"\n",
" async for message in session.receive():\n",
" if message.tool_call:\n",
" for function_call in message.tool_call.function_calls:\n",
" display(Markdown(f\"**FunctionCall >** {str(function_call)}\"))"
]
},
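{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell above only prints the `FunctionCall` messages that the model returns. In an application you would execute the matching function and send its result back so the model can use it in its final answer. The cell below is a minimal sketch of that round trip, reusing `config` and `get_current_weather` from the previous cell; it assumes the SDK's `FunctionResponse` type and `session.send_tool_response()` method, and the shape of the `response` payload is an assumption."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from google.genai.types import FunctionResponse\n",
"\n",
"async with client.aio.live.connect(\n",
"    model=MODEL_ID,\n",
"    config=config,\n",
") as session:\n",
"    text_input = \"Get the current weather in Boston.\"\n",
"    display(Markdown(f\"**Input:** {text_input}\"))\n",
"\n",
"    await session.send_client_content(\n",
"        turns=Content(role=\"user\", parts=[Part(text=text_input)])\n",
"    )\n",
"\n",
"    response = []\n",
"\n",
"    async for message in session.receive():\n",
"        if message.text:\n",
"            response.append(message.text)\n",
"        if message.tool_call:\n",
"            function_responses = []\n",
"            for function_call in message.tool_call.function_calls:\n",
"                # Run the local Python function with the arguments the model chose.\n",
"                result = get_current_weather(**function_call.args)\n",
"                function_responses.append(\n",
"                    FunctionResponse(\n",
"                        id=function_call.id,\n",
"                        name=function_call.name,\n",
"                        response={\"result\": result},  # payload shape is an assumption\n",
"                    )\n",
"                )\n",
"            # Send the results back so the model can use them in its final answer.\n",
"            await session.send_tool_response(function_responses=function_responses)\n",
"\n",
"    display(Markdown(f\"**Response >** {''.join(response)}\"))"
]
},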
{
"cell_type": "markdown",
"metadata": {
"id": "23cb7ab89311"
},
"source": [
"### **Example 5**: Code Execution\n",
"\n",
" You can use code execution capability to generate and execute Python code directly within the API.\n",
"\n",
" In this example, you initialize the code execution tool by passing `code_execution` in a `Tool` definition, and register this tool with the model when a session is initiated."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "TZu8j7nrBfCi"
},
"outputs": [],
"source": [
"config = LiveConnectConfig(\n",
" response_modalities=[\"TEXT\"],\n",
" tools=[Tool(code_execution=ToolCodeExecution())],\n",
")\n",
"\n",
"async with client.aio.live.connect(\n",
" model=MODEL_ID,\n",
" config=config,\n",
") as session:\n",
" text_input = \"Write tool code to calculate the 15th fibonacci number then find the nearest palindrome to it\"\n",
" display(Markdown(f\"**Input:** {text_input}\"))\n",
"\n",
" await session.send_client_content(\n",
" turns=Content(role=\"user\", parts=[Part(text=text_input)])\n",
" )\n",
"\n",
" response = []\n",
"\n",
" async for message in session.receive():\n",
" if message.text:\n",
" response.append(message.text)\n",
" if message.server_content.model_turn:\n",
" if message.server_content.model_turn.parts:\n",
" for part in message.server_content.model_turn.parts:\n",
" if part.executable_code:\n",
" display(\n",
" Markdown(\n",
" f\"\"\"\n",
"**Executable code:**\n",
"```py\n",
"{part.executable_code.code}\n",
"```\n",
"\"\"\"\n",
" )\n",
" )\n",
" if part.code_execution_result:\n",
" display(\n",
" Markdown(\n",
" f\"\"\"\n",
"**Code execution result:**\n",
"```py\n",
"{part.code_execution_result.output}\n",
"```\n",
"\"\"\"\n",
" )\n",
" )\n",
"\n",
" display(Markdown(f\"**Response >** {''.join(response)}\"))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "73660342318d"
},
"source": [
"### **Example 6**: Google Search\n",
"\n",
"The `google_search` tool lets the model conduct Google searches. For example, try asking it about events that are too recent to be in the training data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "e64fc3a94d49"
},
"outputs": [],
"source": [
"config = LiveConnectConfig(\n",
" response_modalities=[\"TEXT\"],\n",
" tools=[Tool(google_search=GoogleSearch())],\n",
")\n",
"\n",
"async with client.aio.live.connect(\n",
" model=MODEL_ID,\n",
" config=config,\n",
") as session:\n",
" text_input = (\n",
" \"Tell me about the largest earthquake in California the week of Dec 5 2024?\"\n",
" )\n",
" display(Markdown(f\"**Input:** {text_input}\"))\n",
"\n",
" await session.send_client_content(\n",
" turns=Content(role=\"user\", parts=[Part(text=text_input)])\n",
" )\n",
"\n",
" response = []\n",
"\n",
" async for message in session.receive():\n",
" if message.text:\n",
" response.append(message.text)\n",
"\n",
" display(Markdown(f\"**Response >** {''.join(response)}\"))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "J28mNbeBY4Ri"
},
"source": [
"### **Example 7**: Audio transcription\n",
"\n",
"The Live API provides transcriptions for both input and output audio."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Cvrjwb_pY-1_"
},
"outputs": [],
"source": [
"config = LiveConnectConfig(\n",
" response_modalities=[\"AUDIO\"],\n",
" input_audio_transcription=AudioTranscriptionConfig(),\n",
" output_audio_transcription=AudioTranscriptionConfig(),\n",
")\n",
"\n",
"\n",
"async with client.aio.live.connect(\n",
" model=MODEL_ID,\n",
" config=config,\n",
") as session:\n",
" text_input = \"Hello? Gemini are you there?\"\n",
" display(Markdown(f\"**Input:** {text_input}\"))\n",
"\n",
" await session.send_client_content(\n",
" turns=Content(role=\"user\", parts=[Part(text=text_input)])\n",
" )\n",
"\n",
" audio_data = []\n",
" input_transcription = []\n",
" output_transcription = []\n",
"\n",
" async for message in session.receive():\n",
" if (\n",
" message.server_content.input_transcription\n",
" and message.server_content.input_transcription.text\n",
" ):\n",
" input_transcription.append(message.server_content.input_transcription)\n",
" if (\n",
" message.server_content.output_transcription\n",
" and message.server_content.output_transcription.text\n",
" ):\n",
" output_transcription.append(\n",
" message.server_content.output_transcription.text\n",
" )\n",
" if (\n",
" message.server_content.model_turn\n",
" and message.server_content.model_turn.parts\n",
" ):\n",
" for part in message.server_content.model_turn.parts:\n",
" if part.inline_data:\n",
" audio_data.append(\n",
" np.frombuffer(part.inline_data.data, dtype=np.int16)\n",
" )\n",
"\n",
" if input_transcription:\n",
" display(Markdown(f\"**Input transcription >** {''.join(input_transcription)}\"))\n",
"\n",
" if audio_data:\n",
" display(Audio(np.concatenate(audio_data), rate=24000, autoplay=True))\n",
"\n",
" if output_transcription:\n",
" display(Markdown(f\"**Output transcription >** {''.join(output_transcription)}\"))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "usjiqTDXfk_6"
},
"source": [
"## What's next\n",
"\n",
"- Learn how to [build a web application that enables you to use your voice and camera to talk to Gemini 2.0 through the Live API.](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/gemini/multimodal-live-api/websocket-demo-app)\n",
"- See the [Live API reference docs](https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-live).\n",
"- See the [Google Gen AI SDK reference docs](https://googleapis.github.io/python-genai/).\n",
"- Explore other notebooks in the [Google Cloud Generative AI GitHub repository](https://github.com/GoogleCloudPlatform/generative-ai)."
]
}
],
"metadata": {
"colab": {
"name": "intro_multimodal_live_api_genai_sdk.ipynb",
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}