{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "id": "ur8xi4C7S06n" }, "outputs": [], "source": [ "# Copyright 2024 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "JAPoU8Sm5E6e" }, "source": [ "# Create a Multi-Speaker Podcast with Gemini 2.0 & Text-to-Speech\n", "\n", "<table align=\"left\">\n", " <td style=\"text-align: center\">\n", " <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/audio/speech/use-cases/podcast/multi-speaker-podcast.ipynb\">\n", " <img width=\"32px\" src=\"https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg\" alt=\"Google Colaboratory logo\"><br> Open in Colab\n", " </a>\n", " </td>\n", " <td style=\"text-align: center\">\n", " <a href=\"https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Faudio%2Fspeech%2Fuse-cases%2Fpodcast%2Fmulti-speaker-podcast.ipynb\">\n", " <img width=\"32px\" src=\"https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN\" alt=\"Google Cloud Colab Enterprise logo\"><br> Open in Colab Enterprise\n", " </a>\n", " </td>\n", " <td style=\"text-align: center\">\n", " <a 
href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/audio/speech/use-cases/podcast/multi-speaker-podcast.ipynb\">\n", " <img src=\"https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg\" alt=\"Vertex AI logo\"><br> Open in Vertex AI Workbench\n", " </a>\n", " </td>\n", " <td style=\"text-align: center\">\n", " <a href=\"https://github.com/GoogleCloudPlatform/generative-ai/blob/main/audio/speech/use-cases/podcast/multi-speaker-podcast.ipynb\">\n", " <img width=\"32px\" src=\"https://www.svgrepo.com/download/217753/github.svg\" alt=\"GitHub logo\"><br> View on GitHub\n", " </a>\n", " </td>\n", "</table>\n", "\n", "<div style=\"clear: both;\"></div>\n", "\n", "\n", "<table align=\"left\">\n", " <td style=\"text-align: center\">\n", "<b>Share to:</b>\n", "\n", "<a href=\"https://www.linkedin.com/sharing/share-offsite/?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/audio/speech/use-cases/podcast/multi-speaker-podcast.ipynb\" target=\"_blank\">\n", " <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/8/81/LinkedIn_icon.svg\" alt=\"LinkedIn logo\">\n", "</a>\n", "\n", "<a href=\"https://bsky.app/intent/compose?text=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/audio/speech/use-cases/podcast/multi-speaker-podcast.ipynb\" target=\"_blank\">\n", " <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/7/7a/Bluesky_Logo.svg\" alt=\"Bluesky logo\">\n", "</a>\n", "\n", "<a href=\"https://twitter.com/intent/tweet?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/audio/speech/use-cases/podcast/multi-speaker-podcast.ipynb\" target=\"_blank\">\n", " <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/5a/X_icon_2.svg\" alt=\"X logo\">\n", "</a>\n", "\n", "<a 
href=\"https://reddit.com/submit?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/audio/speech/use-cases/podcast/multi-speaker-podcast.ipynb\" target=\"_blank\">\n", " <img width=\"20px\" src=\"https://redditinc.com/hubfs/Reddit%20Inc/Brand/Reddit_Logo.png\" alt=\"Reddit logo\">\n", "</a>\n", "\n", "<a href=\"https://www.facebook.com/sharer/sharer.php?u=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/audio/speech/use-cases/podcast/multi-speaker-podcast.ipynb\" target=\"_blank\">\n", " <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/51/Facebook_f_logo_%282019%29.svg\" alt=\"Facebook logo\">\n", "</a>\n", "</td>\n", "</table>" ] }, { "cell_type": "markdown", "metadata": { "id": "84f0f73a0f76" }, "source": [ "| | |\n", "|-|-|\n", "| Author(s) | [Souvik Mukherjee](https://github.com/talktosauvik/), [Holt Skinner](https://github.com/holtskinner) |" ] }, { "cell_type": "markdown", "metadata": { "id": "tvgnzT1CKxrO" }, "source": [ "## Overview\n", "\n", "This notebook demonstrates how to use the [Gemini API in Vertex AI](https://cloud.google.com/vertex-ai/generative-ai/docs/overview) to generate an engaging multi-speaker podcast using studio voices in the [Text-to-Speech API](https://cloud.google.com/text-to-speech). \n", "\n", "> **NOTE:** [Generate dialogue with multiple speakers](https://cloud.google.com/text-to-speech/docs/create-dialogue-with-multispeakers) is only available to projects on the allowlist. 
Please [contact Google Cloud](https://cloud.google.com/contact) if you want to use this feature.\n", "\n", "This can be useful for creating interviews, interactive storytelling, video games, e-learning platforms, and accessibility solutions.\n", "\n", "The steps performed include:\n", "\n", "- Load a PDF file from a Google Cloud Storage bucket or public URL\n", "- Summarize the content as a two-speaker dialogue using Gemini 2.0 Flash\n", "- Return the dialogue in a predefined JSON schema using [Controlled Generation](https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/control-generated-output)\n", "- Create a multi-speaker conversation from the JSON script using Text-to-Speech\n", "- Generate the audio as an MP3 file\n", "\n", "For a more advanced example using LangGraph, check out [Build Your Own AI Podcasting Agent with LangGraph & Gemini: AI-Powered Podcast Creation with Automated Research, Writing, and Refinement](https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/orchestration/langgraph_gemini_podcast.ipynb)." ] }, { "cell_type": "markdown", "metadata": { "id": "61RBz8LLbxCR" }, "source": [ "## Get started" ] }, { "cell_type": "markdown", "metadata": { "id": "No17Cw5hgx12" }, "source": [ "### Install Google Gen AI SDK for Python\n", "\n", "Install the following packages required to execute this notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "tFy3H3aPgx12" }, "outputs": [], "source": [ "%pip install --upgrade --quiet google-genai google-cloud-texttospeech" ] }, { "cell_type": "markdown", "metadata": { "id": "R5Xep4W9lq-Z" }, "source": [ "### Restart runtime\n", "\n", "To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which restarts the current kernel.\n", "\n", "The restart might take a minute or longer. After it's restarted, continue to the next step." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "XRvKdaPDTznN" }, "outputs": [], "source": [ "import IPython\n", "\n", "app = IPython.Application.instance()\n", "app.kernel.do_shutdown(True)" ] }, { "cell_type": "markdown", "metadata": { "id": "SbmM4z7FOBpM" }, "source": [ "<div class=\"alert alert-block alert-warning\">\n", "<b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. ⚠️</b>\n", "</div>\n" ] }, { "cell_type": "markdown", "metadata": { "id": "dmWOrTJ3gx13" }, "source": [ "### Authenticating your notebook environment\n", "\n", "* If you are using **Colab** to run this notebook, run the cell below and continue.\n", "* If you are using **Vertex AI Workbench**, check out the setup instructions [here](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/setup-env)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "NyKGtVQjgx13" }, "outputs": [], "source": [ "import sys\n", "\n", "# Additional authentication is required for Google Colab\n", "if \"google.colab\" in sys.modules:\n", "    # Authenticate user to Google Cloud\n", "    from google.colab import auth\n", "\n", "    auth.authenticate_user()" ] }, { "cell_type": "markdown", "metadata": { "id": "DF4l8DTdWgPY" }, "source": [ "### Set Google Cloud project information and create client\n", "\n", "To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).\n", "\n", "Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).\n", "\n", "\n", "* Note the **available regions** for Text-to-Speech; see the [documentation](https://cloud.google.com/text-to-speech/docs/endpoints)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Nqwi-5ufWp_B" }, "outputs": [], "source": [ "import 
os\n", "\n", "from google import genai\n", "\n", "PROJECT_ID = \"[your-project-id]\" # @param {type: \"string\", placeholder: \"[your-project-id]\", isTemplate: true}\n", "if not PROJECT_ID or PROJECT_ID == \"[your-project-id]\":\n", "    PROJECT_ID = str(os.environ.get(\"GOOGLE_CLOUD_PROJECT\"))\n", "\n", "VERTEXAI_LOCATION = os.environ.get(\"GOOGLE_CLOUD_REGION\", \"us-central1\")\n", "TTS_LOCATION = \"us\" # @param {type:\"string\"}\n", "\n", "! gcloud config set project {PROJECT_ID}\n", "! gcloud auth application-default set-quota-project {PROJECT_ID}\n", "! gcloud auth application-default login -q" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "id": "698a012fd0cc" }, "outputs": [], "source": [ "client = genai.Client(vertexai=True, project=PROJECT_ID, location=VERTEXAI_LOCATION)" ] }, { "cell_type": "markdown", "metadata": { "id": "5303c05f7aa6" }, "source": [ "### Import libraries" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "id": "6fc324893334" }, "outputs": [], "source": [ "import json\n", "\n", "from IPython.display import Audio\n", "from google.api_core.client_options import ClientOptions\n", "from google.cloud import texttospeech_v1beta1 as texttospeech\n", "from google.genai.types import GenerateContentConfig, Part" ] }, { "cell_type": "markdown", "metadata": { "id": "ac0c378bb46a" }, "source": [ "### Load the Gemini 2.0 Flash model\n", "\n", "Learn more about all [Gemini models on Vertex AI](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models#gemini-models)." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": { "id": "0b696580b27c" }, "outputs": [], "source": [ "MODEL_ID = \"gemini-2.0-flash-001\" # @param {type: \"string\"}" ] }, { "cell_type": "markdown", "metadata": { "id": "e43229f3ad4f" }, "source": [ "### Define constants" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "id": "cf93d5f0ce00" }, "outputs": [], "source": [ "DEFAULT_LANGUAGE = \"en-US\"\n", "STUDIO_VOICE = \"en-US-Studio-MultiSpeaker\"\n", "\n", "SYSTEM_INSTRUCTION = \"\"\"You are a podcast writer. Your task is to generate a short podcast-style dialogue between two speakers, Speaker R and Speaker S.\"\"\"\n", "\n", "response_schema = {\n", "    \"type\": \"object\",\n", "    \"properties\": {\n", "        \"dialogue\": {\n", "            \"type\": \"array\",\n", "            \"items\": {\n", "                \"type\": \"object\",\n", "                \"properties\": {\n", "                    \"speaker\": {\"type\": \"string\"},\n", "                    \"line\": {\"type\": \"string\"},\n", "                },\n", "                \"required\": [\"speaker\", \"line\"],\n", "            },\n", "        }\n", "    },\n", "    \"required\": [\"dialogue\"],\n", "}" ] }, { "cell_type": "markdown", "metadata": { "id": "MzeTv4NsxGoN" }, "source": [ "### Helper functions" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "id": "qO9cgw1-xGWF" }, "outputs": [], "source": [ "def generate_podcast_script(file_uri: str) -> list:\n", "    \"\"\"Generates a podcast script using Gemini with controlled JSON output.\"\"\"\n", "    prompt = \"\"\"\n", "    The dialogue should be engaging and natural, with each speaker contributing roughly equal amounts. Return the dialogue as a JSON array of objects, where each object has a 'speaker' (either 'R' or 'S') and a 'line' property.\n", "\n", "    Use the following information to create the content for the podcast dialogue:\n", "    \"\"\"\n", "\n", "    try:\n", "        response = client.models.generate_content(\n", "            model=MODEL_ID,\n", "            contents=[\n", "                prompt,\n", "                Part.from_uri(file_uri=file_uri, mime_type=\"application/pdf\"),\n", "            ],\n", "            config=GenerateContentConfig(\n", "                # Pass the system instruction so the model knows the two speaker roles\n", "                system_instruction=SYSTEM_INSTRUCTION,\n", "                temperature=1,\n", "                top_p=0.95,\n", "                max_output_tokens=8192,\n", "                response_mime_type=\"application/json\",\n", "                response_schema=response_schema,\n", "            ),\n", "        )\n", "        generated_json = json.loads(response.text)\n", "        return generated_json[\"dialogue\"]\n", "    except (json.JSONDecodeError, KeyError, TypeError) as e:\n", "        # TypeError covers an empty response, where response.text is None\n", "        print(f\"Error generating or parsing JSON script: {e}. Returning empty list.\")\n", "        return []\n", "\n", "\n", "def synthesize_podcast(dialogue: list[dict], output_file: str):\n", "    \"\"\"Synthesizes speech for a podcast using MultiSpeakerMarkup.\"\"\"\n", "    tts_client = texttospeech.TextToSpeechClient(\n", "        client_options=ClientOptions(\n", "            api_endpoint=f\"{TTS_LOCATION}-texttospeech.googleapis.com\"\n", "        )\n", "    )\n", "    multi_speaker_markup = texttospeech.MultiSpeakerMarkup(\n", "        turns=[\n", "            texttospeech.MultiSpeakerMarkup.Turn(\n", "                text=turn_data[\"line\"], speaker=turn_data[\"speaker\"]\n", "            )\n", "            for turn_data in dialogue\n", "        ]\n", "    )\n", "\n", "    response = tts_client.synthesize_speech(\n", "        input=texttospeech.SynthesisInput(multi_speaker_markup=multi_speaker_markup),\n", "        voice=texttospeech.VoiceSelectionParams(\n", "            language_code=DEFAULT_LANGUAGE, name=STUDIO_VOICE\n", "        ),\n", "        audio_config=texttospeech.AudioConfig(\n", "            audio_encoding=texttospeech.AudioEncoding.MP3\n", "        ),\n", "    )\n", "\n", "    with open(output_file, \"wb\") as out:\n", "        out.write(response.audio_content)\n", "        print(f\"Audio content written to file `{output_file}`\")" ] }, { "cell_type": "markdown", 
"metadata": { "id": "ltb_QilKxW01" }, "source": [ "## Call the Text-to-Speech API with script content" ] }, { "cell_type": "markdown", "metadata": { "id": "ZDGzyxkbx0Py" }, "source": [ "### Generate the podcast script from the content\n", "\n", "For this example, we will be using the [Gemini 2.0 paper from arXiv](https://arxiv.org/abs/2403.05530).\n", "\n", "You can replace this with the URL with any publicly-accessible PDF." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "DoJgwgWRxYBS" }, "outputs": [], "source": [ "PDF_URL = \"gs://github-repo/2403_05530.pdf\" # @param {type: \"string\"}\n", "dialogue = generate_podcast_script(PDF_URL)\n", "print(\"Generated Dialogue:\")\n", "print(dialogue)" ] }, { "cell_type": "markdown", "metadata": { "id": "h9k_otpSyBd4" }, "source": [ "### Write the audio content into the output file" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "XtD1oG7ZyFMm" }, "outputs": [], "source": [ "if dialogue:\n", " output_filename = \"podcast_output.mp3\"\n", " synthesize_podcast(dialogue, output_filename)\n", "else:\n", " print(\"No dialogue generated. Skipping audio synthesis.\")" ] }, { "cell_type": "markdown", "metadata": { "id": "SsbWsbJMyGqP" }, "source": [ "### Listen to the audio file" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "rCsneZpOyKwk" }, "outputs": [], "source": [ "Audio(output_filename)" ] } ], "metadata": { "colab": { "name": "multi-speaker-podcast.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }