In [1]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Multimodal Prompting with Gemini: Working with Videos

<table align="left">
<td style="text-align: center">
<a href="https://colab.research.google.com/github/GoogleCloudPlatform/applied-ai-engineering-samples/blob/main/genai-on-vertex-ai/gemini/prompting_recipes/multimodal/multimodal_prompting_video.ipynb">
<img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo"><br> Run in Colab
</a>
</td>
      <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fapplied-ai-engineering-samples%2Fmain%2Fgenai-on-vertex-ai%2Fgemini%2Fprompting_recipes%2Fmultimodal%2Fmultimodal_prompting_video.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>
<td style="text-align: center">
<a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/applied-ai-engineering-samples/main/genai-on-vertex-ai/gemini/prompting_recipes/multimodal/multimodal_prompting_video.ipynb">
<img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
</a>
</td>    
<td style="text-align: center">
<a href="https://github.com/GoogleCloudPlatform/applied-ai-engineering-samples/blob/main/genai-on-vertex-ai/gemini/prompting_recipes/multimodal/multimodal_prompting_video.ipynb">
<img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
</a>
</td>
</table>

| | |
|-|-|
| Author(s) | [Michael Chertushkin](https://github.com/misha-chertushkin) |
| Reviewer(s) | [Rajesh Thallam](https://github.com/rthallam), [Skander Hannachi](https://github.com/skanderhn)  |
| Last updated | 2024-09-16 |

# Overview

---

Gemini models support adding image, audio, video, and PDF files to text or chat prompts and return a text or code response. Gemini 2.0 Flash supports up to 1 million input tokens and up to 1 hour of video per prompt. Gemini can also analyze the audio embedded in a video. You can add videos to Gemini requests to perform [video analysis tasks](https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/video-understanding) such as video summarization, video chapterization (or localization), key event detection, scene analysis, captioning, and transcription.
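
To get a feel for how video length relates to the input token limit, the sketch below estimates the tokens a video consumes. The per-second rates (258 tokens per sampled frame at 1 frame per second, 32 tokens per second of audio) and the 1,048,576-token window are assumptions that this notebook does not state, so check the current model documentation before relying on them. The function name `estimate_video_tokens` is hypothetical.

```python
# Rough token budget for a video prompt. All constants are assumptions,
# not values defined in this notebook; verify them against the model docs.
TOKENS_PER_FRAME = 258        # assumed tokens per sampled video frame
FRAMES_PER_SECOND = 1         # assumed sampling rate
AUDIO_TOKENS_PER_SECOND = 32  # assumed tokens per second of audio
CONTEXT_WINDOW = 1_048_576    # assumed input token limit


def estimate_video_tokens(duration_seconds: int, with_audio: bool = True) -> int:
    """Return an approximate input token count for a video of the given length."""
    per_second = TOKENS_PER_FRAME * FRAMES_PER_SECOND
    if with_audio:
        per_second += AUDIO_TOKENS_PER_SECOND
    return duration_seconds * per_second


for minutes in (1, 45, 60):
    tokens = estimate_video_tokens(minutes * 60)
    fits = "fits" if tokens <= CONTEXT_WINDOW else "exceeds the window"
    print(f"{minutes:>2} min with audio: ~{tokens:,} tokens ({fits})")
```

The estimate ignores the text prompt and any other parts of the request, so treat it as a lower bound.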

---

This notebook covers prompting recipes and strategies for working with Gemini on videos, with examples along the way. It is organized as follows:

- Video Understanding
- Key event detection
- Using System instruction
- Analyzing videos with step-by-step reasoning
- Generating structured output
- Using context caching for repeated queries

---

# Getting Started

The following steps are necessary to run this notebook, no matter what notebook environment you're using.

If you're entirely new to Google Cloud, [get started here](https://cloud.google.com/docs/get-started).

## Google Cloud Project Setup

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.
1. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).
1. [Enable the Service Usage API](https://console.cloud.google.com/apis/library/serviceusage.googleapis.com)
1. [Enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).
1. [Enable the Cloud Storage API](https://console.cloud.google.com/flows/enableapi?apiid=storage.googleapis.com).

## Google Cloud Permissions

**To run the complete Notebook, including the optional section, you will need to have the [Owner role](https://cloud.google.com/iam/docs/understanding-roles) for your project.**

If you want to skip the optional section, you need at least the following [roles](https://cloud.google.com/iam/docs/granting-changing-revoking-access):
* **`roles/serviceusage.serviceUsageAdmin`** to enable APIs
* **`roles/iam.serviceAccountAdmin`** to modify service agent permissions
* **`roles/aiplatform.user`** to use AI Platform components
* **`roles/storage.objectAdmin`** to modify and delete GCS buckets

## Install Vertex AI SDK for Python and other dependencies (If Needed)

The cell below installs or upgrades the Vertex AI SDK for Python quietly for the current user.

In [None]:
! pip install google-cloud-aiplatform --upgrade --quiet --user

## Restart Runtime

To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which will restart the current kernel.

In [None]:
# Restart kernel after installs so that your environment can access the new packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

## Authenticate

If you're using Colab, run the code in the next cell. Follow the popups and authenticate with an account that has access to your Google Cloud [project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#identifying_projects).

If you're running this notebook somewhere besides Colab, make sure your environment has the right Google Cloud access. If that's a new concept to you, consider looking into [Application Default Credentials for your local environment](https://cloud.google.com/docs/authentication/provide-credentials-adc#local-dev) and [initializing the Google Cloud CLI](https://cloud.google.com/docs/authentication/gcloud). In many cases, running `gcloud auth application-default login` in a shell on the machine running the notebook kernel is sufficient.

More authentication options are discussed [here](https://cloud.google.com/docs/authentication).

In [1]:
# Colab authentication.
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()
    print("Authenticated")

## Set Google Cloud project information and Initialize Vertex AI SDK

To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

Make sure to change `PROJECT_ID` in the next cell. You can leave the value of `REGION` as is unless you have a specific reason to change it.

In [None]:
import vertexai

PROJECT_ID = "[your-project-id]"  # @param {type:"string"}
REGION = "us-central1"  # @param {type:"string"}

vertexai.init(project=PROJECT_ID, location=REGION)
print("Vertex AI SDK initialized.")
print(f"Vertex AI SDK version = {vertexai.__version__}")
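
If you prefer not to hard-code the project ID, one option is to read it from the environment. The sketch below is a minimal example that assumes the `GOOGLE_CLOUD_PROJECT` environment variable is set in your runtime; the helper name `resolve_project_id` is hypothetical and is not part of the Vertex AI SDK.

```python
import os


def resolve_project_id(default: str = "[your-project-id]") -> str:
    """Pick a project ID from the environment, falling back to the given default."""
    # GOOGLE_CLOUD_PROJECT is assumed to be set by you or your environment.
    project_id = os.environ.get("GOOGLE_CLOUD_PROJECT", default).strip()
    # Reject empty values and the unedited "[your-project-id]" placeholder.
    if not project_id or project_id.startswith("["):
        raise ValueError(
            "Set PROJECT_ID to a real Google Cloud project ID before initializing the SDK."
        )
    return project_id


# Example (commented out so the sketch does not fail when nothing is configured):
# PROJECT_ID = resolve_project_id()
```

Failing early on the placeholder value gives a clearer error than a failed API call later in the notebook.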

## Import Libraries

In [None]:
from vertexai.generative_models import (GenerationConfig, GenerativeModel,
                                        HarmBlockThreshold, HarmCategory, Part)

## Define Utility functions

In [None]:
import http.client
import textwrap
import typing
import urllib.request

from google.cloud import storage
from IPython import display
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"


def wrap(string, max_width=80):
    return textwrap.fill(string, max_width)


def get_bytes_from_url(url: str) -> bytes:
    with urllib.request.urlopen(url) as response:
        response = typing.cast(http.client.HTTPResponse, response)
        data = response.read()  # avoid shadowing the built-in `bytes`
    return data


def get_bytes_from_gcs(gcs_path: str):
    bucket_name = gcs_path.split("/")[2]
    object_prefix = "/".join(gcs_path.split("/")[3:])
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.get_blob(object_prefix)
    return blob.download_as_bytes()


def display_image(image_url: str, width: int = 300, height: int = 200):
    if image_url.startswith("gs://"):
        image_bytes = get_bytes_from_gcs(image_url)
    else:
        image_bytes = get_bytes_from_url(image_url)
    display.display(display.Image(data=image_bytes, width=width, height=height))


def display_video(video_url: str, width: int = 300, height: int = 200):
    if video_url.startswith("gs://"):
        video_bytes = get_bytes_from_gcs(video_url)
    else:
        video_bytes = get_bytes_from_url(video_url)
    display.display(
        display.Video(
            data=video_bytes,
            width=width,
            height=height,
            embed=True,
            mimetype="video/mp4",
        )
    )

def display_audio(audio_url: str, width: int = 300, height: int = 200):
    if audio_url.startswith("gs://"):
        audio_bytes = get_bytes_from_gcs(audio_url)
    else:
        audio_bytes = get_bytes_from_url(audio_url)
    display.display(display.Audio(data=audio_bytes, embed=True))


def print_prompt(contents: list[str | Part]):
    for content in contents:
        if isinstance(content, Part):
            if content.mime_type.startswith("image"):
                display_image(image_url=content.file_data.file_uri)
            elif content.mime_type.startswith("video"):
                display_video(video_url=content.file_data.file_uri)
            elif content.mime_type.startswith("audio"):
                display_audio(audio_url=content.file_data.file_uri)
            else:
                print(content)
        else:
            print(content)

## Initialize Gemini

In [None]:
# Gemini Config
GENERATION_CONFIG = {
    "max_output_tokens": 8192,
    "temperature": 0.1,
    "top_p": 0.95,
}

SAFETY_CONFIG = {
    HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_ONLY_HIGH,
    HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
    HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
}

gemini = GenerativeModel(model_name="gemini-2.0-flash-001")
videos_path_prefix = (
    "gs://public-aaie-genai-samples/gemini/prompting_recipes/multimodal/videos"
)


def generate(
    model,
    contents,
    safety_settings=SAFETY_CONFIG,
    generation_config=GENERATION_CONFIG,
    as_markdown=False,
):
    responses = model.generate_content(
        contents=contents,
        generation_config=generation_config,
        safety_settings=safety_settings,
        stream=False,
    )
    if isinstance(responses, list):
        for response in responses:
            if as_markdown:
                display.display(display.Markdown(response.text))
            else:
                print(wrap(response.text), end="")
    else:
        if as_markdown:
            display.display(display.Markdown(responses.text))
        else:
            print(wrap(responses.text), end="")

In [None]:
display_video(
    video_url="gs://public-aaie-genai-samples/gemini/prompting_recipes/multimodal/videos/video_1.mp4"
)

# Prompt #1. Video Understanding

This task requires input in two modalities: text and video. The API call below uses a simple prompt that is not optimal; we improve it in the next step.

In [5]:
video_path = f"{videos_path_prefix}/video_1.mp4"
video_content = Part.from_uri(uri=video_path, mime_type="video/mp4")
prompt = """Provide a description of the video. The description should also 
contain anything important which people say in the video."""

contents = [video_content, prompt]
# print_prompt(contents)

In [6]:
generate(gemini, contents)

Here is a description of the video:  The video shows a person tossing a pink
collapsible cup in the air and catching it. The background is a white curtain.
The person's arm and hand are visible. The cup is the main focus of the video.
The video is shot in a bright, minimalist style. There is no audio in the video.

As we can see, the model correctly identified what happens in the video, but it did not provide much detail. Let's modify the prompt.

### Video Understanding. Advanced Prompt


In [7]:
prompt = """You are an expert video analyzer. Your task is to analyze the video 
and produce a detailed description of what happens in the video.

Key Points:
- Use timestamps (in MM:SS format) to output key events from the video.
- Add information about what happens at each timestamp.
- Add information about entities in the video and capture the relationship between them.
- Highlight the central theme or focus of the video.

Remember:
- Try to recover hidden meaning from the scene. For example, some hidden humor 
  or some hidden context.
"""

contents = [video_content, prompt]
generate(gemini, contents, as_markdown=True)

Here's a detailed analysis of the video:

*   **00:00** A hand throws a pink collapsible cup into the air against a white curtain backdrop.
*   **00:01** The hand catches the cup.
*   **00:02** The hand throws the cup into the air again, this time in its collapsed form.
*   **00:03** The hand catches the cup in its expanded form.
*   **00:04** The hand shakes the cup.
*   **00:05** The hand holds the cup still.
*   **00:06** The hand moves the cup around.
*   **00:07** The hand throws the cup into the air again.
*   **00:08** The hand holds the cup still.
*   **00:09** The hand shakes the cup.
*   **00:10** The hand holds the cup still.

The central theme of the video is showcasing a pink collapsible cup. The hand interacts with the cup by throwing it in the air, catching it, shaking it, and holding it still. The white curtain backdrop provides a clean and simple background, drawing attention to the cup and the hand's interaction with it.


The response to the updated prompt captures much more detail. This prompt is fairly generic and can be reused for other videos, but we can also make it more specific, for example to find out when a certain event happened.
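
Because the prompt asks for timestamps in MM:SS format, the response can be post-processed with ordinary string tools. The sketch below extracts `(seconds, description)` pairs from Markdown bullets shaped like the output above. It assumes the model keeps that bullet format, which is not guaranteed, and the helper name `parse_timestamped_events` is hypothetical.

```python
import re

# Matches bullets such as "*   **00:02** The hand throws the cup again."
EVENT_PATTERN = re.compile(
    r"^\*[ \t]+\*\*(\d{2}):(\d{2})\*\*[ \t]+(.*)$", re.MULTILINE
)


def parse_timestamped_events(markdown_text: str) -> list[tuple[int, str]]:
    """Return (seconds_from_start, description) pairs found in the response text."""
    events = []
    for minutes, seconds, description in EVENT_PATTERN.findall(markdown_text):
        events.append((int(minutes) * 60 + int(seconds), description.strip()))
    return events


sample = """*   **00:00** A hand throws a pink collapsible cup into the air.
*   **00:01** The hand catches the cup.
*   **00:02** The hand throws the cup again."""

for seconds, description in parse_timestamped_events(sample):
    print(f"{seconds:>3}s  {description}")
```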

# Prompt #2. Video Understanding: Key events detection


In [8]:
prompt = """You are an expert video analyzer. Your task is to analyze the video 
and produce a detailed description of what happens in the video.

Key Points:
- Use timestamps (in MM:SS format) to output key events from the video.
- Add information about what happens at each timestamp.
- Add information about entities in the video and capture the relationship between them.
- Highlight the central theme or focus of the video.

Remember:
- Try to recover hidden meaning from the scene. For example, some hidden humor 
  or some hidden context.

At which moment the cup was thrown for the second time?
"""

contents = [video_content, prompt]
generate(gemini, contents, as_markdown=True)

Here's a detailed analysis of the video:

*   **00:00** A hand throws a pink collapsible cup into the air against a white curtain backdrop.
*   **00:01** The hand catches the cup.
*   **00:02** The hand throws the cup again.
*   **00:03** The hand catches the cup again.
*   **00:04** The hand shakes the cup.
*   **00:07** The hand throws the cup again.
*   **00:08** The hand catches the cup again.
*   **00:09** The hand shakes the cup.

The central theme of the video is a person playing with a pink collapsible cup, repeatedly throwing it into the air and catching it.


# Prompt #3. Video Understanding: Using System instruction

System instruction (SI) is an effective way to steer Gemini's behavior and shape how the model responds to your prompt. SI can describe model behavior such as persona, goal, tasks to perform, output format, tone, style, and any constraints.

SI is more "sticky" (consistent) across multi-turn conversations. If you want the model to follow a behavior consistently, the system instruction is the best place for that instruction.

In this example, we move the task rules to the system instruction and keep the question about a specific event in the user prompt.
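
One way to keep the reusable rules separate from the per-turn question is to assemble the system instruction from named pieces. The sketch below is only an illustration of that split; the helper `build_system_instruction` and the two sample questions are hypothetical and are not part of the Vertex AI SDK.

```python
def build_system_instruction(
    persona: str, key_points: list[str], reminders: list[str]
) -> str:
    """Assemble a system instruction from a persona, task rules, and reminders."""
    lines = [persona.strip(), "", "Key Points:"]
    lines.extend(f"- {point}" for point in key_points)
    lines.extend(["", "Remember:"])
    lines.extend(f"- {reminder}" for reminder in reminders)
    return "\n".join(lines) + "\n"


system_instruction = build_system_instruction(
    persona="You are an expert video analyzer.",
    key_points=["Use timestamps (in MM:SS format) to output key events from the video."],
    reminders=["Try to recover hidden meaning from the scene."],
)

# The same instruction is reused while only the user prompt changes per turn.
for question in ("When is the cup thrown?", "What color is the cup?"):
    print(f"system: {len(system_instruction)} chars | user: {question}")
```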

In [9]:
system_prompt = """You are an expert video analyzer. Your task is to analyze the video 
and produce a detailed description of what happens in the video.

Key Points:
- Use timestamps (in MM:SS format) to output key events from the video.
- Add information about what happens at each timestamp.
- Add information about entities in the video and capture the relationship between them.
- Highlight the central theme or focus of the video.

Remember:
- Try to recover hidden meaning from the scene. For example, some hidden humor 
  or some hidden context.
"""

prompt = "At which moment the cup was thrown for the second time?"

In [10]:
gemini_si = GenerativeModel(
    model_name="gemini-2.0-flash-001", system_instruction=system_prompt
)

contents = [video_content, prompt]
generate(gemini_si, contents, as_markdown=True)

[00:07] The cup was thrown for the second time.


# Prompt #4. Video Understanding: Step-by-step reasoning

The answer above is wrong: the model reported 00:07, while the earlier timestamped descriptions place the second throw at 00:02. The model answered without first listing all the moments when the cup is thrown. Let's fix this with step-by-step reasoning.

In [11]:
step_by_step_prompt = """Describe the video. Analyze the video step-by-step. 
Output all times when the cup is thrown with timestamps. 
After that output the timestamp, when the cup is thrown for the second time.
"""

contents = [video_content, step_by_step_prompt]
generate(gemini_si, contents, as_markdown=True)

Here's a breakdown of the video:

The central theme of the video revolves around a person playfully tossing and catching a pink, collapsible cup. The background is a simple white curtain, keeping the focus entirely on the cup and the hand interacting with it.

Here's a step-by-step analysis with timestamps:

*   **00:00** The person throws the pink cup into the air.
*   **00:01** The person catches the pink cup.
*   **00:02** The person throws the pink cup into the air again.
*   **00:03** The person catches the pink cup.
*   **00:07** The person throws the pink cup into the air for the third time.

The cup is thrown for the second time at **00:02**.

# Prompt #5. Video Understanding: Get structured outputs

Gemini models can generate structured outputs such as JSON when you provide a blueprint for the model's output. This feature is also referred to as [controlled generation](https://developers.googleblog.com/en/mastering-controlled-generation-with-gemini-15-schema-adherence/).

In this example, we show how to get Gemini to return structured output (JSON) from a video analysis. One way to improve the model's understanding of video (or any multimodal) content is to prompt it to explain its "reasoning" about the response. This has proven to be a very effective method, although it can increase latency.

[Vertex AI Gemini API](https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/control-generated-output) makes it easy to return JSON output by setting the response MIME type to `application/json`. Optionally, you can also set `response_schema` to a JSON schema that the model output should follow.

In [13]:
response_schema = {
    "type": "ARRAY",
    "items": {
        "type": "OBJECT",
        "properties": {
            "harmfulness_reasoning": {
                "type": "STRING",
                "description": "Step-by-step detailed reasoning about how harmful is the video",
            },
            "harmfulness_score": {
                "type": "INTEGER",
                "description": "Number between 0 and 5 indicating how harmful is the video",
            },
        },
        "required": ["harmfulness_reasoning", "harmfulness_score"],
    },
}

In [14]:
structured_prompt = """You are an expert video analyzer. Your task is to analyze the video 
and produce a harmfulness score - how harmful this video can be for kids."""

contents = [video_content, structured_prompt]

generate(
    gemini,
    contents,
    generation_config=GenerationConfig(
        response_mime_type="application/json", response_schema=response_schema
    ),
)

[   {     "harmfulness_reasoning": "The video features a hand tossing and
catching a pink cup. There is no indication of any harmful or dangerous content,
nor does it contain any themes or visuals that would be considered inappropriate
for children. The scene is simple and does not present any risk of promoting
negative behaviors.",     "harmfulness_score": 0   } ]

The model returned the correct score for the video. The schema asks for "reasoning" along with the score, and placing the "reasoning" field before the "score" field tends to give more consistent and correct scores. The intuition is that the LLM generates the "reasoning" first and can then rely on it when producing the score.
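
Even with a response schema, it can be worth validating the JSON before using it downstream. The sketch below parses a response string and checks the two fields defined in the schema above. It assumes you have the raw response text available (the `generate` helper in this notebook only prints it), and the helper name `parse_harmfulness` and the sample string are hypothetical.

```python
import json


def parse_harmfulness(response_text: str) -> list[dict]:
    """Parse the JSON response and check the fields required by the schema."""
    items = json.loads(response_text)
    if not isinstance(items, list):
        raise ValueError("Expected a JSON array at the top level.")
    for item in items:
        if not isinstance(item, dict):
            raise ValueError("Expected each array element to be a JSON object.")
        reasoning = item.get("harmfulness_reasoning")
        score = item.get("harmfulness_score")
        if not isinstance(reasoning, str) or not reasoning:
            raise ValueError("Missing or empty 'harmfulness_reasoning'.")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 5:
            raise ValueError(
                f"'harmfulness_score' must be an integer from 0 to 5, got {score!r}."
            )
    return items


sample_response = (
    '[{"harmfulness_reasoning": "A hand tosses and catches a pink cup.", '
    '"harmfulness_score": 0}]'
)
for item in parse_harmfulness(sample_response):
    print(item["harmfulness_score"], "-", item["harmfulness_reasoning"])
```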

# Prompt #6. Video Understanding: Context Caching

[Context caching](https://cloud.google.com/vertex-ai/generative-ai/docs/context-cache/context-cache-overview?hl=en) is a method to reduce the cost of requests that repeat content with a high input token count. It can also reduce latency, at the cost of storing the content in the cache. You can specify an expiration time that controls how long the content stays in the cache.

Context caching is most useful when we want to:
- repeatedly ask questions about a long video
- reduce cost and latency

In [15]:
long_video_path = f"{videos_path_prefix}/long_video_1.mp4"
long_video_content = Part.from_uri(uri=long_video_path, mime_type="video/mp4")

prompt = """Describe what happens in the beginning, in the middle and in the 
end of the video. Also, list the name of the main character and any problems 
they face."""

contents = [long_video_content, prompt]
# print_prompt(contents)

In [16]:
# Time the call without context caching
from timeit import default_timer as timer

start = timer()
generate(gemini, contents)
end = timer()

print(f"\nTime elapsed: {end - start} seconds")

Here's a breakdown of the video:  **Beginning (0:00 - 1:25):**  *   The video
opens with the title card for "Sherlock Jr." starring Buster Keaton, presented
by Joseph M. Schenck. *   Credits for the story, photography, art direction, and
electrician are shown. *   Copyright information for 1924 is displayed. *   A
proverb appears: "Don't try to do two things at once and expect to do justice to
both." *   The opening narration sets the scene: a boy working as a moving
picture operator in a small-town theater is also studying to be a detective.
**Middle (1:26 - 38:59):**  *   The video shows the main character, a young man
with a mustache, reading a book titled "How-To-Be-A-Detective" in an empty
theater. *   His boss tells him to clean the theater instead of reading
detective books. *   The young man is shown sweeping the theater and then
walking to a confectionery store. *   He sees a girl in the store and wants to
buy her chocolates, but he doesn't have enough money. *   The girl's fa

In [17]:
import datetime

from vertexai.preview import caching
from vertexai.preview.generative_models import GenerativeModel

cached_content = caching.CachedContent.create(
    model_name="gemini-2.0-flash-001",
    contents=[long_video_content],
    ttl=datetime.timedelta(hours=1),
    display_name="long video cache",
)

model_cached = GenerativeModel.from_cached_content(cached_content=cached_content)

In [18]:
# Call with context caching
start = timer()
responses = model_cached.generate_content(
    prompt,
    generation_config=GENERATION_CONFIG,
    safety_settings=SAFETY_CONFIG,
    stream=False,
)
end = timer()

print(wrap(responses.text), end="")

print(f"\nTime elapsed: {end - start} seconds")

Here's a breakdown of the video:  **Beginning (0:00-1:25):**  *   The video
starts with the title card for the film "Sherlock Jr." starring Buster Keaton,
presented by Joseph M. Schenck. *   Credits are shown, including director,
writers, photography, art director, and electrician. *   Copyright information
is displayed, indicating the film was copyrighted in 1924. *   A proverb is
presented: "Don't try to do two things at once and expect to do justice to
both." *   The introduction explains that the story is about a boy who tried to
do two things at once: work as a moving picture operator and study to be a
detective.  **Middle (1:26-38:59):**  *   The scene shifts to a movie theater
where a young man (Buster Keaton) is reading a book titled "How-To-Be-A-
Detective." *   His boss tells him to clean the theater instead of reading. *
The young man is shown working at the theater, but he is distracted by his
detective studies. *   He tries to buy a box of chocolates for a girl he likes,
b

In this run, the request with context caching was faster than the one without it. The request also costs less, because we did not need to send the video again with the prompt.

Context caching is therefore ideal for repeated questions against the same long file, such as a video, document, or audio recording.
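
A single timed call can be noisy, so a fairer comparison repeats each call a few times. The sketch below is a minimal timing helper built on the same `timeit.default_timer` used above; the name `median_latency` is hypothetical, and the workload shown is a stand-in so the example runs without calling the API.

```python
import statistics
from timeit import default_timer as timer
from typing import Callable


def median_latency(call: Callable[[], object], repeats: int = 3) -> float:
    """Run `call` several times and return the median wall-clock time in seconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    durations = []
    for _ in range(repeats):
        start = timer()
        call()
        durations.append(timer() - start)
    return statistics.median(durations)


# Stand-in workload so the sketch runs without calling the API.
print(f"median: {median_latency(lambda: sum(range(100_000))):.4f} seconds")
```

In this notebook you could pass `lambda: model_cached.generate_content(prompt)` and `lambda: gemini.generate_content(contents)` instead of the stand-in. Keep in mind that every repeat is a billed request.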

# Conclusion

This notebook demonstrated various examples of working with Gemini on videos. The following general prompting strategies can help achieve better performance from Gemini on multimodal prompts:

1. Craft clear and concise instructions.
1. Add your video or any media first for single-media prompts.
1. Add few-shot examples to the prompt to show the model how you want the task done and the expected output.
1. Break down the task step-by-step.
1. Specify the output format.
1. Ask Gemini to include reasoning in its response along with decisions or scores.
1. Use context caching for repeated queries.

Specifically, when working with videos, the following may help:

1. Specify timestamp format when localizing videos.
1. Ask Gemini to focus on visual content for well-known video clips.
1. Process long videos in segments for dense outputs.
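
As one possible way to apply the last tip, the sketch below splits a video's duration into fixed-length time ranges and builds one prompt per range, using the MM:SS format from the first tip. The helper names, the 25-minute duration, and the 10-minute segment length are illustrative assumptions; this notebook does not prescribe how to segment a video.

```python
def format_mmss(total_seconds: int) -> str:
    """Format a number of seconds as MM:SS."""
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes:02d}:{seconds:02d}"


def split_into_segments(
    duration_seconds: int, segment_seconds: int
) -> list[tuple[str, str]]:
    """Return consecutive (start, end) MM:SS ranges that cover the whole video."""
    if duration_seconds < 0 or segment_seconds <= 0:
        raise ValueError("duration must be >= 0 and segment length must be > 0")
    segments = []
    for start in range(0, duration_seconds, segment_seconds):
        end = min(start + segment_seconds, duration_seconds)
        segments.append((format_mmss(start), format_mmss(end)))
    return segments


# Illustrative values: a 25-minute video processed in 10-minute segments.
for start, end in split_into_segments(duration_seconds=25 * 60, segment_seconds=10 * 60):
    print(f"Describe only what happens between {start} and {end}.")
```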


---