gemini/function-calling/multimodal_function_calling.ipynb (1,279 lines of code) (raw):

{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "id": "ijGzTHJJUCPY" }, "outputs": [], "source": [ "# Copyright 2024 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "VEqbX8OhE8y9" }, "source": [ "# Multimodal Function Calling with the Gemini API & Python SDK\n", "\n", "<table align=\"left\">\n", " <td style=\"text-align: center\">\n", " <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/function-calling/multimodal_function_calling.ipynb\">\n", " <img src=\"https://cloud.google.com/ml-engine/images/colab-logo-32px.png\" alt=\"Google Colaboratory logo\"><br> Run in Colab\n", " </a>\n", " </td>\n", " <td style=\"text-align: center\">\n", " <a href=\"https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Ffunction-calling%2Fmultimodal_function_calling.ipynb\">\n", " <img width=\"32px\" src=\"https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN\" alt=\"Google Cloud Colab Enterprise logo\"><br> Run in Colab Enterprise\n", " </a>\n", " </td> \n", " <td style=\"text-align: center\">\n", " <a href=\"https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/function-calling/multimodal_function_calling.ipynb\">\n", " <img 
src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" alt=\"GitHub logo\"><br> View on GitHub\n", " </a>\n", " </td>\n", " <td style=\"text-align: center\">\n", " <a href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/function-calling/multimodal_function_calling.ipynb\">\n", " <img src=\"https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32\" alt=\"Vertex AI logo\"><br> Open in Vertex AI Workbench\n", " </a>\n", " </td>\n", "</table>\n", "\n", "<div style=\"clear: both;\"></div>\n", "\n", "<b>Share to:</b>\n", "\n", "<a href=\"https://www.linkedin.com/sharing/share-offsite/?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/function-calling/multimodal_function_calling.ipynb\" target=\"_blank\">\n", " <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/8/81/LinkedIn_icon.svg\" alt=\"LinkedIn logo\">\n", "</a>\n", "\n", "<a href=\"https://bsky.app/intent/compose?text=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/function-calling/multimodal_function_calling.ipynb\" target=\"_blank\">\n", " <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/7/7a/Bluesky_Logo.svg\" alt=\"Bluesky logo\">\n", "</a>\n", "\n", "<a href=\"https://twitter.com/intent/tweet?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/function-calling/multimodal_function_calling.ipynb\" target=\"_blank\">\n", " <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/5a/X_icon_2.svg\" alt=\"X logo\">\n", "</a>\n", "\n", "<a href=\"https://reddit.com/submit?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/function-calling/multimodal_function_calling.ipynb\" target=\"_blank\">\n", " <img width=\"20px\" 
src=\"https://redditinc.com/hubfs/Reddit%20Inc/Brand/Reddit_Logo.png\" alt=\"Reddit logo\">\n", "</a>\n", "\n", "<a href=\"https://www.facebook.com/sharer/sharer.php?u=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/function-calling/multimodal_function_calling.ipynb\" target=\"_blank\">\n", " <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/51/Facebook_f_logo_%282019%29.svg\" alt=\"Facebook logo\">\n", "</a> \n" ] }, { "cell_type": "markdown", "metadata": { "id": "84e7e432e6ff" }, "source": [ "| | |\n", "|-|-|\n", "|Author(s) | [Kristopher Overholt](https://github.com/koverholt) |" ] }, { "cell_type": "markdown", "metadata": { "id": "CkHPv2myT2cx" }, "source": [ "## Overview\n", "\n", "### Introduction to Multimodal Function Calling with Gemini\n", "\n", "This notebook demonstrates a powerful [Function Calling](https://cloud.google.com/vertex-ai/docs/generative-ai/multimodal/function-calling) capability of the Gemini model: support for multimodal inputs. With multimodal function calling, you can go beyond traditional text inputs, enabling Gemini to understand your intent and predict function calls and function parameters based on various inputs like images, audio, video, and PDFs. Function calling can also be referred to as *function calling with controlled generation*, which guarantees that output generated by the model always adheres to a specific schema so that you receive consistently formatted responses.\n", "\n", "Previously, implementing multimodal function calling required two separate calls to the Gemini API: one to extract information from media, and another to generate a function call based on the extracted text. This process was cumbersome, prone to errors, and resulted in the loss of detail in valuable contextual information. 
Gemini's multimodal function calling capability streamlines this workflow, enabling a single API call that efficiently processes multimodal inputs for accurate function predictions and structured outputs. \n", "\n", "### How It Works\n", "\n", "1. **Define Functions and Tools:** Describe your functions, then group them into `Tool` objects for Gemini to use.\n", "2. **Send Inputs and Prompt:** Provide Gemini with multimodal input (image, audio, PDF, etc.) and a prompt describing your request.\n", "3. **Gemini Predicts Action:** Gemini analyzes the multimodal input and prompt to predict the best function to call and its parameters.\n", "4. **Execute and Return:** Use Gemini's prediction to make API calls, then send the results back to Gemini.\n", "5. **Generate Response:** Gemini uses the API results to provide a final, natural language response to the user. \n", "\n", "This notebook will guide you through practical examples of using Gemini's multimodal function calling to build intelligent applications that go beyond the limitations of text-only interactions. " ] }, { "cell_type": "markdown", "metadata": { "id": "DrkcqHrrwMAo" }, "source": [ "### Objectives\n", "\n", "In this tutorial, you will learn how to use the Gemini API in Vertex AI with the Google Gen AI SDK for Python to make function calls with multimodal inputs, using the Gemini 2.0 (`gemini-2.0-flash`) model. 
You'll explore how Gemini can process and understand various input types, including images, video, audio, and PDFs, to predict function calls and their parameters.\n", "\n", "You will complete the following tasks:\n", "\n", "- Install the Google Gen AI SDK for Python.\n", "- Define functions that can be called by Gemini.\n", "- Package functions into tools.\n", "- Send multimodal inputs (images, video, audio, PDFs) and prompts to Gemini.\n", "- Extract predicted function calls and their parameters from Gemini's response.\n", "- Use the predicted output to make API calls to external systems (demonstrated with an image input example). \n", "- Return API responses to Gemini for natural language response generation (demonstrated with an image input example). " ] }, { "cell_type": "markdown", "metadata": { "id": "C9nEPojogw-g" }, "source": [ "### Costs\n", "\n", "This tutorial uses billable components of Google Cloud:\n", "\n", "- Vertex AI\n", "\n", "Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "r11Gu7qNgx1p" }, "source": [ "## Getting Started\n" ] }, { "cell_type": "markdown", "metadata": { "id": "No17Cw5hgx12" }, "source": [ "### Install Google Gen AI SDK" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "tFy3H3aPgx12" }, "outputs": [], "source": [ "%pip install --upgrade --quiet google-genai wikipedia" ] }, { "cell_type": "markdown", "metadata": { "id": "R5Xep4W9lq-Z" }, "source": [ "### Restart current runtime\n", "\n", "To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which will restart the current kernel."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "XRvKdaPDTznN" }, "outputs": [], "source": [ "# Restart kernel after installs so that your environment can access the new packages\n", "import IPython\n", "\n", "app = IPython.Application.instance()\n", "app.kernel.do_shutdown(True)" ] }, { "cell_type": "markdown", "metadata": { "id": "SbmM4z7FOBpM" }, "source": [ "<div class=\"alert alert-block alert-warning\">\n", "<b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. ⚠️</b>\n", "</div>\n" ] }, { "cell_type": "markdown", "metadata": { "id": "dmWOrTJ3gx13" }, "source": [ "### Authenticate your notebook environment (Colab only)\n", "\n", "If you are running this notebook on Google Colab, run the following cell to authenticate your environment. This step is not required if you are using [Vertex AI Workbench](https://cloud.google.com/vertex-ai-workbench)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "NyKGtVQjgx13" }, "outputs": [], "source": [ "import sys\n", "\n", "if \"google.colab\" in sys.modules:\n", " from google.colab import auth\n", "\n", " auth.authenticate_user()" ] }, { "cell_type": "markdown", "metadata": { "id": "DF4l8DTdWgPY" }, "source": [ "### Set Google Cloud project information and create the Gen AI client\n", "\n", "To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).\n", "\n", "Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment)."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Nqwi-5ufWp_B" }, "outputs": [], "source": [ "import os\n", "\n", "PROJECT_ID = \"[your-project-id]\" # @param {type: \"string\", placeholder: \"[your-project-id]\", isTemplate: true}\n", "if not PROJECT_ID or PROJECT_ID == \"[your-project-id]\":\n", " PROJECT_ID = str(os.environ.get(\"GOOGLE_CLOUD_PROJECT\"))\n", "\n", "LOCATION = os.environ.get(\"GOOGLE_CLOUD_REGION\", \"us-central1\")\n", "\n", "from google import genai\n", "\n", "client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)" ] }, { "cell_type": "markdown", "metadata": { "id": "2a7225e4390a" }, "source": [ "## Multimodal Function Calling in Action" ] }, { "cell_type": "markdown", "metadata": { "id": "jXHfaVS66_01" }, "source": [ "### Import libraries\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "lslYAvw37JGQ" }, "outputs": [], "source": [ "from IPython.display import Markdown, display\n", "from google.genai.types import (\n", " FunctionDeclaration,\n", " GenerateContentConfig,\n", " Part,\n", " Tool,\n", " UserContent,\n", ")\n", "import wikipedia" ] }, { "cell_type": "markdown", "metadata": { "id": "8cd167f70edf" }, "source": [ "### Initialize model\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ed564d2b905c" }, "outputs": [], "source": [ "MODEL_ID = \"gemini-2.0-flash\"" ] }, { "cell_type": "markdown", "metadata": { "id": "aa432a6e021a" }, "source": [ "### Image-Based Function Calling: Finding Animal Habitats\n", "\n", "In this example, you'll send along an image of a bird and ask Gemini to identify its habitat. 
This involves defining a function that looks up regions where a given animal is found, creating a tool that uses this function, and then sending a request to Gemini.\n", "\n", "<img src=\"https://storage.googleapis.com/github-repo/generative-ai/gemini/function-calling/multi-color-bird.jpg\" width=\"250px\">\n", "\n", "First, you define a `FunctionDeclaration` called `get_wildlife_region`. This function takes the name of an animal species as input and returns information about its typical region." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ae36049d4512" }, "outputs": [], "source": [ "get_wildlife_region = FunctionDeclaration(\n", " name=\"get_wildlife_region\",\n", " description=\"Look up the region where an animal can be found\",\n", " parameters={\n", " \"type\": \"object\",\n", " \"properties\": {\n", " \"animal\": {\"type\": \"string\", \"description\": \"Species of animal\"}\n", " },\n", " },\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "0933d807af15" }, "source": [ "Next, you create a `Tool` object that includes your `get_wildlife_region` function. Tools help group related functions that Gemini can use:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "c4218e572dc8" }, "outputs": [], "source": [ "image_tool = Tool(\n", " function_declarations=[\n", " get_wildlife_region,\n", " ],\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "28abd9546c48" }, "source": [ "Now you're ready to send a request to Gemini. Call `client.models.generate_content` with the image to analyze, along with a prompt. The `tools` field of `GenerateContentConfig` tells Gemini to consider the functions in your `image_tool`."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "38b842d71bce" }, "outputs": [], "source": [ "response = client.models.generate_content(\n", " model=MODEL_ID,\n", " contents=[\n", " Part.from_uri(\n", " file_uri=\"gs://github-repo/generative-ai/gemini/function-calling/multi-color-bird.jpg\",\n", " mime_type=\"image/jpeg\",\n", " ),\n", " \"What is the typical habitat or region where this animal lives?\",\n", " ],\n", " config=GenerateContentConfig(temperature=0, tools=[image_tool]),\n", ")\n", "response_function_call = response.function_calls[0]\n", "response_function_call" ] }, { "cell_type": "markdown", "metadata": { "id": "065787dbaa26" }, "source": [ "Let's examine the response from Gemini. You can extract the predicted function name:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "e2b92a75e5b9" }, "outputs": [], "source": [ "function_name = response.function_calls[0].name\n", "function_name" ] }, { "cell_type": "markdown", "metadata": { "id": "4a6ba2cf6937" }, "source": [ "You can also get the arguments that Gemini predicted for the function call:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "c89f16d5082e" }, "outputs": [], "source": [ "function_args = {key: value for key, value in response.function_calls[0].args.items()}\n", "function_args" ] }, { "cell_type": "markdown", "metadata": { "id": "180ef53579a0" }, "source": [ "Now, you'll call an external API (in this case, using the `wikipedia` Python package) using the animal name that Gemini extracted from the image:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "07eed3ae7aa3" }, "outputs": [], "source": [ "api_response = wikipedia.page(function_args[\"animal\"]).content\n", "api_response[:500]" ] }, { "cell_type": "markdown", "metadata": { "id": "f238cad25c36" }, "source": [ "Finally, you return the API response to Gemini so it can generate a final answer in natural language:" ] }, { "cell_type": 
"code", "execution_count": null, "metadata": { "id": "02ee532ce187" }, "outputs": [], "source": [ "response = client.models.generate_content(\n", " model=MODEL_ID,\n", " contents=[\n", " UserContent(\n", " parts=[\n", " Part.from_uri(\n", " file_uri=\"gs://github-repo/generative-ai/gemini/function-calling/multi-color-bird.jpg\",\n", " mime_type=\"image/jpeg\",\n", " ),\n", " Part.from_text(\n", " text=\"Inspect the image and get the regions where this animal can be found\",\n", " ),\n", " ]\n", " ),\n", " Part.from_function_call(\n", " name=response_function_call.name, args=response_function_call.args\n", " ), # Function call that Gemini predicted in the first request\n", " Part.from_function_response(\n", " name=function_name,\n", " response={\n", " \"content\": api_response, # Return the API response to the Gemini model\n", " },\n", " ),\n", " ],\n", ")\n", "\n", "display(Markdown(response.text))" ] }, { "cell_type": "markdown", "metadata": { "id": "6e5f13d0e644" }, "source": [ "This example showcases how Gemini's multimodal function calling processes an image, predicts a relevant function and its parameters, and integrates with external APIs to give the user a comprehensive answer. This process opens up exciting possibilities for building intelligent applications that can \"see\" and understand the world around them via API calls to Gemini." ] }, { "cell_type": "markdown", "metadata": { "id": "6036faa1fb70" }, "source": [ "### Video-Based Function Calling: Identifying Product Features" ] }, { "cell_type": "markdown", "metadata": { "id": "4dd489a96132" }, "source": [ "Now let's explore how Gemini can extract information from videos for the purpose of invoking a function call. 
You'll use a video showcasing multiple products and ask Gemini to identify their key features.\n", "\n", "<img src=\"https://storage.googleapis.com/github-repo/generative-ai/gemini/function-calling/made-by-google-24.gif\" width=\"600px\">\n", "\n", "Start by defining a function called `get_feature_info` that takes a list of product features as input and could potentially be used to retrieve additional details about those features:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "41d1ed66b8b3" }, "outputs": [], "source": [ "get_feature_info = FunctionDeclaration(\n", " name=\"get_feature_info\",\n", " description=\"Get additional information about a product feature\",\n", " parameters={\n", " \"type\": \"object\",\n", " \"properties\": {\n", " \"features\": {\n", " \"type\": \"array\",\n", " \"description\": \"A list of product features\",\n", " \"items\": {\"type\": \"string\", \"description\": \"Product feature\"},\n", " }\n", " },\n", " },\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "b972769f37e6" }, "source": [ "Next, create a tool that includes your `get_feature_info` function:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "f134fc04e6bb" }, "outputs": [], "source": [ "video_tool = Tool(\n", " function_declarations=[\n", " get_feature_info,\n", " ],\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "c16c497b85d3" }, "source": [ "Send a video to Gemini, along with a prompt asking for information about the product features, making sure to include your `video_tool` in the `tools` field of `GenerateContentConfig`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "09fbe282c3d3" }, "outputs": [], "source": [ "response = client.models.generate_content(\n", " model=MODEL_ID,\n", " contents=[\n", " Part.from_uri(\n", " file_uri=\"gs://github-repo/generative-ai/gemini/function-calling/made-by-google-24.mp4\",\n", " mime_type=\"video/mp4\",\n", " ),\n", " \"Inspect the video and get information 
about the product features shown\",\n", " ],\n", " config=GenerateContentConfig(temperature=0, tools=[video_tool]),\n", ")\n", "\n", "response.function_calls" ] }, { "cell_type": "markdown", "metadata": { "id": "4115fd61850b" }, "source": [ "Gemini correctly predicted the `get_feature_info` function:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "1ae3bb7a4847" }, "outputs": [], "source": [ "function_name = response.function_calls[0].name\n", "function_name" ] }, { "cell_type": "markdown", "metadata": { "id": "c17668290dd0" }, "source": [ "And you can see the list of product features that Gemini extracted from the video, which are available as structured function arguments that adhere to the JSON schema we defined in the `FunctionDeclaration`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "992c59809c7b" }, "outputs": [], "source": [ "function_args = {key: value for key, value in response.function_calls[0].args.items()}\n", "function_args" ] }, { "cell_type": "markdown", "metadata": { "id": "bd8dd10f1219" }, "source": [ "This example demonstrates Gemini's ability to understand video content. By defining a relevant function, you can use Gemini to extract structured information from videos and perform further actions based on that information.\n", "\n", "Now that the multimodal function call response is complete, you could use the function name and function arguments to call an external API using any REST API or client library of your choice, similar to how we did in the previous example with the `wikipedia` Python package.\n", "\n", "Since this sample notebook is focused on the mechanics of multimodal function calling rather than the subsequent function calls and API calls, we'll move on to another example with different multimodal inputs. You can refer to other sample notebooks on Gemini Function Calling for more details on where to go from here." 
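] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a minimal sketch of that next step, the cell below routes the predicted call to a local Python handler. The `lookup_feature_info` handler and the `HANDLERS` mapping are hypothetical placeholders introduced for illustration and are not part of the Gen AI SDK; a real application would call a product catalog or REST API at this point. The sketch assumes `function_name` and `function_args` still hold the values extracted above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Hypothetical handler: a real application would query a product API here.\n", "def lookup_feature_info(features=()):\n", "    return {feature: \"(placeholder) details not looked up\" for feature in features}\n", "\n", "\n", "# Map each declared function name to the local code that implements it.\n", "HANDLERS = {\"get_feature_info\": lookup_feature_info}\n", "\n", "handler = HANDLERS.get(function_name)\n", "if handler is None:\n", "    api_response = {\"error\": f\"No handler registered for {function_name}\"}\n", "else:\n", "    api_response = handler(**function_args)\n", "api_response"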
] }, { "cell_type": "markdown", "metadata": { "id": "41a08eda3be8" }, "source": [ "### Audio-Based Function Calling: Generating Book Recommendations" ] }, { "cell_type": "markdown", "metadata": { "id": "a94cd078b43c" }, "source": [ "In this example, you'll explore using audio input with Gemini's multimodal function calling. You'll send a podcast episode to Gemini and ask for book recommendations related to the topics discussed.\n", "\n", "<font color=\"green\">>>> \"SRE is just a production system specific manifestation of systems thinking ... and we kind of do it in an informal way.\"</font><br/>\n", "<font color=\"purple\">>>> \"The book called 'Thinking in Systems' ... it's a really good primer on this topic.\"</font><br/>\n", "<font color=\"green\">>>> \"An example of ... systems structure behavior thinking ... is the idea of like the cascading failure, that kind of vicious cycle of load that causes retries that causes more load ... \"</font><br/>\n", "<font color=\"purple\">>>> \"The worst pattern is the single embedded SRE that turns into the ops person ... 
you just end up doing all of the toil, all of the grunt work.\"</font><br/>\n", "<font color=\"green\">>>> \"Take that moment, take a breath, and really analyze the problem and understand how it's working as a system and understand how you can intervene to improve that.\"</font><br/>\n", "<font color=\"purple\">>>> \"Avoid just doing what you've done before and kicking the can down the road, and really think deeply about your problems.\"</font><br/>\n", "\n", "Define a function called `get_recommended_books` that takes a list of topics as input and (hypothetically) returns relevant book recommendations:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9714025043bd" }, "outputs": [], "source": [ "get_recommended_books = FunctionDeclaration(\n", " name=\"get_recommended_books\",\n", " description=\"Get recommended books based on a list of topics\",\n", " parameters={\n", " \"type\": \"object\",\n", " \"properties\": {\n", " \"topics\": {\n", " \"type\": \"array\",\n", " \"description\": \"A list of topics\",\n", " \"items\": {\"type\": \"string\", \"description\": \"Topic\"},\n", " },\n", " },\n", " },\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "f23465d0938f" }, "source": [ "Now create a tool that includes your newly defined function:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "d61600788e03" }, "outputs": [], "source": [ "audio_tool = Tool(\n", " function_declarations=[\n", " get_recommended_books,\n", " ],\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "560afbd15a17" }, "source": [ "Provide Gemini with the audio file and a prompt to recommend books based on the podcast content:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "47228e6631a3" }, "outputs": [], "source": [ "response = client.models.generate_content(\n", " model=MODEL_ID,\n", " contents=[\n", " Part.from_uri(\n", " 
file_uri=\"gs://github-repo/generative-ai/gemini/function-calling/google-cloud-sre-podcast-s2-e8.mp3\",\n", " mime_type=\"audio/mpeg\",\n", " ),\n", " \"Inspect the audio file and generate a list of recommended books based on the topics discussed.\",\n", " ],\n", " config=GenerateContentConfig(temperature=0, tools=[audio_tool]),\n", ")\n", "response.function_calls" ] }, { "cell_type": "markdown", "metadata": { "id": "db9b85de9752" }, "source": [ "You can see that Gemini has successfully predicted your `get_recommended_books` function:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "eabef4d9faf4" }, "outputs": [], "source": [ "function_name = response.function_calls[0].name\n", "function_name" ] }, { "cell_type": "markdown", "metadata": { "id": "ea00f52eb487" }, "source": [ "And the function arguments contain the list of topics that Gemini identified and extracted from the input audio file:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8c8f32e930c9" }, "outputs": [], "source": [ "function_args = {key: value for key, value in response.function_calls[0].args.items()}\n", "function_args" ] }, { "cell_type": "markdown", "metadata": { "id": "8acd15dd7cec" }, "source": [ "This example highlights Gemini's capacity to understand and extract information from audio, enabling you to create applications that respond to spoken content or audio-based interactions." ] }, { "cell_type": "markdown", "metadata": { "id": "93577f2d2fe1" }, "source": [ "### PDF-Based Function Calling: Extracting Company Data from Invoices" ] }, { "cell_type": "markdown", "metadata": { "id": "924cf8d1c711" }, "source": [ "This example demonstrates how to use Gemini's multimodal function calling to process PDF documents. 
You'll work with a set of invoices and extract the names of the (fictitious) companies involved.\n", "\n", "<img src=\"https://storage.googleapis.com/github-repo/generative-ai/gemini/function-calling/invoice-synthetic-overview.png\" width=\"1000px\">\n", "\n", "Define a function called `get_company_information` that (in a real-world scenario) could be used to fetch details about a given list of companies:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ba57e626e9d2" }, "outputs": [], "source": [ "get_company_information = FunctionDeclaration(\n", " name=\"get_company_information\",\n", " description=\"Get information about a list of companies\",\n", " parameters={\n", " \"type\": \"object\",\n", " \"properties\": {\n", " \"companies\": {\n", " \"type\": \"array\",\n", " \"description\": \"A list of companies\",\n", " \"items\": {\"type\": \"string\", \"description\": \"Company name\"},\n", " }\n", " },\n", " },\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "fae1c7a7d8a9" }, "source": [ "Package your newly defined function into a tool:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "a62ed01019f0" }, "outputs": [], "source": [ "invoice_tool = Tool(\n", " function_declarations=[\n", " get_company_information,\n", " ],\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "91dcbcbb0f50" }, "source": [ "Now you can provide Gemini with multiple PDF invoices and ask it to get company information:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "e509abf4d73a" }, "outputs": [], "source": [ "response = client.models.generate_content(\n", " model=MODEL_ID,\n", " contents=[\n", " Part.from_uri(\n", " file_uri=\"gs://github-repo/generative-ai/gemini/function-calling/invoice-synthetic-1.pdf\",\n", " mime_type=\"application/pdf\",\n", " ),\n", " Part.from_uri(\n", " file_uri=\"gs://github-repo/generative-ai/gemini/function-calling/invoice-synthetic-2.pdf\",\n", " 
mime_type=\"application/pdf\",\n", " ),\n", " Part.from_uri(\n", " file_uri=\"gs://github-repo/generative-ai/gemini/function-calling/invoice-synthetic-3.pdf\",\n", " mime_type=\"application/pdf\",\n", " ),\n", " Part.from_uri(\n", " file_uri=\"gs://github-repo/generative-ai/gemini/function-calling/invoice-synthetic-4.pdf\",\n", " mime_type=\"application/pdf\",\n", " ),\n", " Part.from_uri(\n", " file_uri=\"gs://github-repo/generative-ai/gemini/function-calling/invoice-synthetic-5.pdf\",\n", " mime_type=\"application/pdf\",\n", " ),\n", " \"Inspect the PDF files of invoices and retrieve information about each company\",\n", " ],\n", " config=GenerateContentConfig(temperature=0, tools=[invoice_tool]),\n", ")\n", "response.function_calls" ] }, { "cell_type": "markdown", "metadata": { "id": "974a138f3c6c" }, "source": [ "As expected, Gemini predicted the `get_company_information` function:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "766fdbafed76" }, "outputs": [], "source": [ "function_name = response.function_calls[0].name\n", "function_name" ] }, { "cell_type": "markdown", "metadata": { "id": "c80e9f280f5c" }, "source": [ "The function arguments contain the list of company names extracted from the PDF invoices:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9fa7a22d85b2" }, "outputs": [], "source": [ "function_args = {key: value for key, value in response.function_calls[0].args.items()}\n", "function_args" ] }, { "cell_type": "markdown", "metadata": { "id": "03e710b9d195" }, "source": [ "This example shows the power of Gemini for processing and extracting structured data from documents, a common requirement in many real-world applications." 
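] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The examples so far index `response.function_calls[0]` directly. As a hedged sketch, assuming a response may carry no function call at all (for example, when the model answers in plain text), the cell below reads the predicted calls defensively. The `extract_function_calls` helper is a hypothetical name introduced for illustration, not an SDK function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def extract_function_calls(response):\n", "    \"\"\"Return (name, args) pairs for every predicted call, or an empty list.\"\"\"\n", "    calls = response.function_calls or []\n", "    return [(call.name, dict(call.args or {})) for call in calls]\n", "\n", "\n", "for name, args in extract_function_calls(response):\n", "    print(name, args)"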
] }, { "cell_type": "markdown", "metadata": { "id": "77f53d886376" }, "source": [ "### Image-Based Chat: Building a Multimodal Chatbot" ] }, { "cell_type": "markdown", "metadata": { "id": "d145dc63a74a" }, "source": [ "Let's put it all together and build a simple multimodal chatbot. This chatbot will understand image inputs and respond to questions using the functions you define.\n", "\n", "<img src=\"https://storage.googleapis.com/github-repo/generative-ai/gemini/function-calling/baby-fox-info.png\" width=\"500px\">\n", "\n", "First, define three functions: `get_animal_details`, `get_location_details`, and `check_color_palette`. These functions represent the capabilities of your chatbot and could potentially be used to retrieve additional details using REST API calls:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "1618b25ad3e0" }, "outputs": [], "source": [ "get_animal_details = FunctionDeclaration(\n", " name=\"get_animal_details\",\n", " description=\"Look up information about a given animal species\",\n", " parameters={\n", " \"type\": \"object\",\n", " \"properties\": {\n", " \"animal\": {\"type\": \"string\", \"description\": \"Species of animal\"}\n", " },\n", " },\n", ")\n", "\n", "get_location_details = FunctionDeclaration(\n", " name=\"get_location_details\",\n", " description=\"Look up information about a given location\",\n", " parameters={\n", " \"type\": \"object\",\n", " \"properties\": {\"location\": {\"type\": \"string\", \"description\": \"Location\"}},\n", " },\n", ")\n", "\n", "check_color_palette = FunctionDeclaration(\n", " name=\"check_color_palette\",\n", " description=\"Check hex color codes for accessibility\",\n", " parameters={\n", " \"type\": \"object\",\n", " \"properties\": {\n", " \"colors\": {\n", " \"type\": \"array\",\n", " \"description\": \"A list of colors in hexadecimal format\",\n", " \"items\": {\n", " \"type\": \"string\",\n", " \"description\": \"Hexadecimal representation of color, as in 
#355E3B\",\n", " },\n", " }\n", " },\n", " },\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "ca63d74adeba" }, "source": [ "Group your functions into a tool:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "178ce7754626" }, "outputs": [], "source": [ "chat_tool = Tool(\n", " function_declarations=[\n", " get_animal_details,\n", " get_location_details,\n", " check_color_palette,\n", " ],\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "0eb1e7629b9e" }, "source": [ "Start a chat session with Gemini, providing it with your `chat_tool`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ac1ebff348c9" }, "outputs": [], "source": [ "chat = client.chats.create(\n", " model=MODEL_ID, config=GenerateContentConfig(temperature=0, tools=[chat_tool])\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "05bb7db4be62" }, "source": [ "Send an image of a fox, along with a simple prompt:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "c3e47a96df7e" }, "outputs": [], "source": [ "response = chat.send_message(\n", " [\n", " Part.from_uri(\n", " file_uri=\"gs://github-repo/generative-ai/gemini/function-calling/baby-fox.jpg\",\n", " mime_type=\"image/jpeg\",\n", " ),\n", " \"Tell me about this animal\",\n", " ]\n", ")\n", "\n", "response.function_calls" ] }, { "cell_type": "markdown", "metadata": { "id": "38c96a599b94" }, "source": [ "Now ask about the location details in the image:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "153f7b93eb65" }, "outputs": [], "source": [ "response = chat.send_message(\n", " [\n", " Part.from_uri(\n", " file_uri=\"gs://github-repo/generative-ai/gemini/function-calling/baby-fox.jpg\",\n", " mime_type=\"image/jpeg\",\n", " ),\n", " \"Tell me details about this location\",\n", " ]\n", ")\n", "\n", "response.function_calls" ] }, { "cell_type": "markdown", "metadata": { "id": "db6363659da8" }, "source": [ "And finally, ask 
for a color palette based on the image:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "af519b9c7bc5" }, "outputs": [], "source": [ "response = chat.send_message(\n", " [\n", " Part.from_uri(\n", " file_uri=\"gs://github-repo/generative-ai/gemini/function-calling/baby-fox.jpg\",\n", " mime_type=\"image/jpeg\",\n", " ),\n", " \"Get the color palette of this image and check it for accessibility\",\n", " ]\n", ")\n", "\n", "response.function_calls" ] }, { "cell_type": "markdown", "metadata": { "id": "4e38eb5eeb5b" }, "source": [ "While this chatbot doesn't actually execute the predicted functions, it demonstrates how to create an interactive experience using multimodal inputs and function calling in a chat format. You can extend this example by implementing REST API calls or client library requests for each function to create a truly functional and engaging multimodal chatbot that's connected to the real world." ] }, { "cell_type": "markdown", "metadata": { "id": "ee5711d51ae0" }, "source": [ "## Conclusions\n", "\n", "In this notebook, you explored the powerful capabilities of Gemini's multimodal function calling. You learned how to:\n", "\n", "- Define functions and package them into tools.\n", "- Send multimodal inputs (images, video, audio, PDFs) and prompts to Gemini.\n", "- Extract predicted function calls and their parameters.\n", "- Use the predicted output to make (or potentially make) API calls.\n", "- Return API responses to Gemini for natural language generation.\n", "\n", "You've seen how Gemini can understand and act on a range of different multimodal inputs, which opens up a world of possibilities for building innovative and engaging multimodal applications.
You can now use these powerful tools to create your own intelligent applications that seamlessly integrate media, natural language, and calls to external APIs and systems.\n", "\n", "Experiment with different modalities, functions, and prompts to discover the full potential of Gemini's multimodal and function calling capabilities. To continue learning, explore the other sample notebooks in this repository and the [documentation for Gemini Function Calling](https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/function-calling)." ] } ], "metadata": { "colab": { "name": "multimodal_function_calling.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }