gemini/use-cases/video-thumbnail-generation/video_thumbnail_generation.ipynb

{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "id": "ijGzTHJJUCPY" }, "outputs": [], "source": [ "# Copyright 2024 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "VEqbX8OhE8y9" }, "source": [ "# Video Thumbnail Generation using Gemini\n", "\n", "<table align=\"left\">\n", " <td style=\"text-align: center\">\n", " <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/video-thumbnail-generation/video_thumbnail_generation.ipynb\">\n", " <img src=\"https://cloud.google.com/ml-engine/images/colab-logo-32px.png\" alt=\"Google Colaboratory logo\"><br> Run in Colab\n", " </a>\n", " </td>\n", " <td style=\"text-align: center\">\n", " <a href=\"https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Fuse-cases%2Fvideo-thumbnail-generation%2Fvideo_thumbnail_generation.ipynb\">\n", " <img width=\"32px\" src=\"https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN\" alt=\"Google Cloud Colab Enterprise logo\"><br> Run in Colab Enterprise\n", " </a>\n", " </td> \n", " <td style=\"text-align: center\">\n", " <a href=\"https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/video-thumbnail-generation/video_thumbnail_generation.ipynb\">\n", " <img src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" alt=\"GitHub logo\"><br> View on GitHub\n", " </a>\n", " </td>\n", " <td style=\"text-align: center\">\n", " <a href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/use-cases/video-thumbnail-generation/video_thumbnail_generation.ipynb\">\n", " <img src=\"https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32\" alt=\"Vertex AI logo\"><br> Open in Vertex AI Workbench\n", " </a>\n", " </td>\n", "</table>\n", "\n", "<div style=\"clear: both;\"></div>\n", "\n", "<b>Share to:</b>\n", "\n", "<a href=\"https://www.linkedin.com/sharing/share-offsite/?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/video-thumbnail-generation/video_thumbnail_generation.ipynb\" target=\"_blank\">\n", " <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/8/81/LinkedIn_icon.svg\" alt=\"LinkedIn logo\">\n", "</a>\n", "\n", "<a href=\"https://bsky.app/intent/compose?text=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/video-thumbnail-generation/video_thumbnail_generation.ipynb\" target=\"_blank\">\n", " <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/7/7a/Bluesky_Logo.svg\" alt=\"Bluesky logo\">\n", "</a>\n", "\n", "<a 
href=\"https://twitter.com/intent/tweet?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/video-thumbnail-generation/video_thumbnail_generation.ipynb\" target=\"_blank\">\n", " <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/5a/X_icon_2.svg\" alt=\"X logo\">\n", "</a>\n", "\n", "<a href=\"https://reddit.com/submit?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/video-thumbnail-generation/video_thumbnail_generation.ipynb\" target=\"_blank\">\n", " <img width=\"20px\" src=\"https://redditinc.com/hubfs/Reddit%20Inc/Brand/Reddit_Logo.png\" alt=\"Reddit logo\">\n", "</a>\n", "\n", "<a href=\"https://www.facebook.com/sharer/sharer.php?u=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/video-thumbnail-generation/video_thumbnail_generation.ipynb\" target=\"_blank\">\n", " <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/51/Facebook_f_logo_%282019%29.svg\" alt=\"Facebook logo\">\n", "</a> " ] }, { "cell_type": "markdown", "metadata": { "id": "f0cc0f48513b" }, "source": [ "| | |\n", "|-|-|\n", "|Author(s) | [Kartik Chaudhary](https://github.com/kartikgill)|" ] }, { "cell_type": "markdown", "metadata": { "id": "DrkcqHrrwMAo" }, "source": [ "## Objectives\n", "\n", "In this tutorial, you will learn how to extract meaningful thumbnail images from a video using Gemini 2.0 model.\n", "\n", "You will complete the following tasks:\n", "\n", "- Install the Google Gen AI SDK for Python\n", "- Use the Gemini API in Vertex AI to interact with the Gemini model\n", " - Extract thumbnails for a Video along with captions using Gemini\n", " - Use **[`moviepy`](https://zulko.github.io/moviepy/)** Python library for frame extraction for a given timestamp\n", " - Using a better prompt to improve results" ] }, { "cell_type": "markdown", "metadata": { "id": "C9nEPojogw-g" }, "source": [ "### Costs\n", "\n", "This tutorial uses billable components of Google Cloud:\n", "\n", "- Vertex AI\n", "\n", "Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "r11Gu7qNgx1p" }, "source": [ "## Getting Started\n" ] }, { "cell_type": "markdown", "metadata": { "id": "No17Cw5hgx12" }, "source": [ "### Install libraries for Python\n", "\n", "- **[Google Gen AI SDK](https://cloud.google.com/vertex-ai/generative-ai/docs/sdks/overview)**: to call the Gemini API in Vertex AI.\n", "- **[moviepy](https://zulko.github.io/moviepy/)**: A module for video editing." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "tFy3H3aPgx12" }, "outputs": [], "source": [ "%pip install --upgrade --quiet google-genai moviepy" ] }, { "cell_type": "markdown", "metadata": { "id": "dmWOrTJ3gx13" }, "source": [ "### Authenticate your notebook environment (Colab only)\n", "\n", "If you are running this notebook on Google Colab, run the following cell to authenticate your environment. 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "NyKGtVQjgx13" }, "outputs": [], "source": [ "import sys\n", "\n", "# Additional authentication is required for Google Colab\n", "if \"google.colab\" in sys.modules:\n", "    # Authenticate user to Google Cloud\n", "    from google.colab import auth\n", "\n", "    auth.authenticate_user()" ] }, { "cell_type": "markdown", "metadata": { "id": "DF4l8DTdWgPY" }, "source": [ "### Set Google Cloud project information and initialize Vertex AI SDK\n", "\n", "To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).\n", "\n", "Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Nqwi-5ufWp_B" }, "outputs": [], "source": [ "# Use the environment variable if the user doesn't provide Project ID.\n", "import os\n", "\n", "from google import genai\n", "\n", "PROJECT_ID = \"[your-project-id]\"  # @param {type: \"string\", placeholder: \"[your-project-id]\", isTemplate: true}\n", "if not PROJECT_ID or PROJECT_ID == \"[your-project-id]\":\n", "    PROJECT_ID = str(os.environ.get(\"GOOGLE_CLOUD_PROJECT\"))\n", "\n", "LOCATION = os.environ.get(\"GOOGLE_CLOUD_REGION\", \"us-central1\")\n", "\n", "client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)" ] }, { "cell_type": "markdown", "metadata": { "id": "jXHfaVS66_01" }, "source": [ "### Import libraries\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "lslYAvw37JGQ" }, "outputs": [], "source": [ "from google.genai.types import GenerateContentConfig, Part\n", "import matplotlib.pyplot as plt\n", "import moviepy\n", "from moviepy import VideoFileClip" ] }, { "cell_type": "markdown", "metadata": { "id": "4437b7608c8e" }, "source": [ "## Using the Gemini model\n", "\n", "The Gemini model is a foundation model that performs well at a variety of multimodal tasks such as visual understanding, classification, summarization, and creating content from image, audio and video. It's adept at processing visual and text inputs such as photographs, documents, infographics, and screenshots."
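 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before sending a video, you can optionally run a quick sanity check of the client configured above. The cell below is a minimal sketch: it sends a short text-only prompt to `gemini-2.0-flash` (the same model this notebook selects in the next section) and prints the response." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional sanity check: a text-only request confirms that the client is\n", "# authenticated and the Gemini API is reachable before we move on to video.\n", "test_response = client.models.generate_content(\n", "    model=\"gemini-2.0-flash\",  # same model the notebook selects below\n", "    contents=\"In one sentence, what makes a good video thumbnail?\",\n", ")\n", "print(test_response.text)"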
] }, { "cell_type": "markdown", "metadata": { "id": "BY1nfXrqRxVX" }, "source": [ "### Load the Gemini model\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "2998506fe6d1" }, "outputs": [], "source": [ "MODEL_ID = \"gemini-2.0-flash\"" ] }, { "cell_type": "markdown", "metadata": { "id": "f0b295e7dee5" }, "source": [ "### Sample Video path from Google Cloud Storage\n", "\n", "gs://github-repo/generative-ai/gemini/use-cases/video-thumbnail-generation/sample_video_google_trips.webm\n", "\n", "![Google Trips](https://cloud.google.com/vertex-ai/generative-ai/docs/prompt-gallery/samples/video_video_q_and_a_89?hl=en)\n", "\n", "#### [Click here to watch/download this video](https://cloud.google.com/vertex-ai/generative-ai/docs/prompt-gallery/samples/video_video_q_and_a_89?hl=en)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "d3df3e7686dd" }, "outputs": [], "source": [ "video_uri = \"gs://github-repo/generative-ai/gemini/use-cases/video-thumbnail-generation/sample_video_google_trips.webm\"" ] }, { "cell_type": "markdown", "metadata": { "id": "e332de1fb5df" }, "source": [ "### Creating a local copy of the video for easy frame extraction" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "a1a71d47c5c1" }, "outputs": [], "source": [ "!gsutil cp {video_uri} sample_video.webm" ] }, { "cell_type": "markdown", "metadata": { "id": "125906be5cfa" }, "source": [ "### Creating a MoviePy Clip Object (Helps in extracting frame at a given timestamp)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "931570104d40" }, "outputs": [], "source": [ "clip = VideoFileClip(\"sample_video.webm\")" ] }, { "cell_type": "markdown", "metadata": { "id": "fd2aea00f91f" }, "source": [ "### Define a function to Call Gemini API" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bf07e42bf777" }, "outputs": [], "source": [ "def call_gemini(\n", " prompt: str,\n", " gcs_video_path: str,\n", ") -> dict:\n", " \"\"\"Call Gemini API with video and prompt.\"\"\"\n", " # define fixed schema for Gemini outputs\n", " response_schema = {\n", " \"type\": \"array\",\n", " \"items\": {\n", " \"type\": \"object\",\n", " \"properties\": {\n", " \"timestamp\": {\n", " \"type\": \"string\",\n", " },\n", " \"caption\": {\n", " \"type\": \"string\",\n", " },\n", " },\n", " \"required\": [\"timestamp\", \"caption\"],\n", " },\n", " }\n", " # model configurations\n", " generation_config = GenerateContentConfig(\n", " temperature=1,\n", " top_p=0.8,\n", " max_output_tokens=8192,\n", " response_schema=response_schema,\n", " response_mime_type=\"application/json\",\n", " )\n", " # creating video input for API call\n", " video_input = Part.from_uri(\n", " file_uri=gcs_video_path,\n", " mime_type=\"video/webm\",\n", " )\n", " # calling Gemini API\n", " response = client.models.generate_content(\n", " model=MODEL_ID,\n", " contents=[video_input, prompt],\n", " config=generation_config,\n", " )\n", " return response.parsed" ] }, { "cell_type": "markdown", "metadata": { "id": "8908704c327f" }, "source": [ "### Defining a function to parse output and display results" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "b27574e5c060" }, "outputs": [], "source": [ "def display_results(\n", " json_response: dict,\n", " clip: moviepy.video.io.VideoFileClip.VideoFileClip,\n", ") -> None:\n", " \"\"\"Parse json output, extract thumbnail frames and display.\"\"\"\n", "\n", " # Image plotting settings\n", " fig, ax = 
{ "cell_type": "markdown", "metadata": { "id": "8908704c327f" }, "source": [ "### Define a function to parse the output and display results" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "b27574e5c060" }, "outputs": [], "source": [ "def display_results(\n", "    json_response: list,\n", "    clip: moviepy.video.io.VideoFileClip.VideoFileClip,\n", ") -> None:\n", "    \"\"\"Parse the JSON output, extract the thumbnail frames, and display them.\"\"\"\n", "\n", "    # Image plotting settings\n", "    fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(12, 9))\n", "\n", "    # Extract the frame for each timestamp and plot the images\n", "    counter = 0\n", "    for item in json_response:\n", "        timestamp = item[\"timestamp\"]\n", "        caption = item[\"caption\"]\n", "        frame = clip.get_frame(timestamp)\n", "        row, col = counter // 2, counter % 2\n", "        ax[row, col].imshow(frame)\n", "        ax[row, col].set_title(caption, fontdict={\"fontsize\": 9})\n", "        counter += 1\n", "\n", "    plt.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "973247741f50" }, "source": [ "# Case 1: Using a Simple Prompt" ] }, { "cell_type": "markdown", "metadata": { "id": "39445140da28" }, "source": [ "### Write a basic prompt for thumbnail generation" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "29bc908cd474" }, "outputs": [], "source": [ "basic_prompt = (\n", "    \"\"\"Generate 4 thumbnail images from the given video file with short captions.\"\"\"\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "7ad3135b0247" }, "source": [ "### Call the Gemini API with the basic prompt and video" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "35ce6c942d21" }, "outputs": [], "source": [ "response_dict = call_gemini(\n", "    prompt=basic_prompt,\n", "    gcs_video_path=video_uri,\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "916b62896ee6" }, "source": [ "### Show the JSON output from Gemini" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "2c3e94043a97" }, "outputs": [], "source": [ "print(response_dict)" ] }, { "cell_type": "markdown", "metadata": { "id": "a4a72354a934" }, "source": [ "### Display thumbnail results with captions" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "b57da95884ee" }, "outputs": [], "source": [ "display_results(response_dict, clip)" ] }, { "cell_type": "markdown", "metadata": { "id": "903857971e6b" }, "source": [ "# Case 2: Using an Advanced Prompt" ] }, { "cell_type": "markdown", "metadata": { "id": "50d6d00f63bd" }, "source": [ "### Write an advanced prompt for better thumbnail generation" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "19771c8d3bf3" }, "outputs": [], "source": [ "advanced_prompt = \"\"\"You are an expert in video content creation and content marketing.\n", "You have the ability to find the best thumbnails from a video and provide short, meaningful, and catchy captions for them.\n", "Your task is to find the best 4 thumbnails from a given video, along with short, meaningful captions that are good for marketing.\n", "Consider the following rules while generating thumbnails:\n", "\n", "- The thumbnail should focus clearly on the key objects and people, with less focus on the background\n", "- The thumbnail image should be high quality and bright; avoid blurry images\n", "- The thumbnail image and caption together should tell a story\n", "- The thumbnail caption should be good for marketing\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": { "id": "a22dd4dc041c" }, "source": [ "### Call the Gemini API with the advanced prompt" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "32c86bd57475" }, "outputs": [], "source": [ "response_dict_advanced = call_gemini(\n", "    prompt=advanced_prompt,\n", "    gcs_video_path=video_uri,\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "0c67055675ae" }, "source": [ "### Show the JSON output" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "91ce4125090e" }, "outputs": [], "source": [
"print(response_dict_advanced)" ] }, { "cell_type": "markdown", "metadata": { "id": "11616f4f3f17" }, "source": [ "### displaying final thumbnails with captions" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "40e7a1c967fd" }, "outputs": [], "source": [ "display_results(response_dict_advanced, clip)" ] }, { "cell_type": "markdown", "metadata": { "id": "4fc9eac1d116" }, "source": [ "### Observations\n", "\n", "#### Better prompting shows the following effects on results\n", "- Results have improved in quality\n", "- Captions are more meaningful\n", "- Thumbnail images and captions tell a better story" ] }, { "cell_type": "markdown", "metadata": { "id": "d0d201f5af81" }, "source": [ "## Conclusion\n", "\n", "- We just saw that Gemini has multimodal capabilities, and can be used for video understanding.\n", "- Results can be improved by better prompting with proper guidelines and expectations." ] } ], "metadata": { "colab": { "name": "video_thumbnail_generation.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }