genai-on-vertex-ai/gemini/model_upgrades/summarization/vertex_colab/summarization_eval.ipynb (256 lines of code) (raw):

{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "id": "sZtfr2Gyx_qM" }, "outputs": [], "source": [ "# Copyright 2025 Google LLC\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"as is\" basis,\n", "# without warranties or conditions of any kind, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "LbBTb55fx_qN" }, "source": [ "\n", "## **Summarization Eval Recipe**\n", "\n", "This Eval Recipe demonstrates how to compare performance of two models on a summarization task using [Vertex AI Evaluation Service](https://cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-overview)." ] }, { "cell_type": "markdown", "metadata": { "id": "AjNAklK4x_qN" }, "source": [ "<table align=\"left\">\n", " <td style=\"text-align: center\">\n", " <a href=\"https://art-analytics.appspot.com/r.html?uaid=G-FHXEFWTT4E&utm_source=aRT-eval-recipe-summarization&utm_medium=aRT-clicks&utm_campaign=eval-recipe-summarization&destination=eval-recipe-summarization&url=https%3A%2F%2Fcolab.research.google.com%2Fgithub%2FGoogleCloudPlatform%2Fapplied-ai-engineering-samples%2Fblob%2Fmain%2Fgenai-on-vertex-ai%2Fgemini%2Fmodel_upgrades%2Fsummarization%2Fvertex_colab%2Fsummarization_eval.ipynb\">\n", " <img width=\"32px\" src=\"https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg\" alt=\"Google Colaboratory logo\"><br> Open in Colab\n", " </a>\n", " </td>\n", " <td style=\"text-align: center\">\n", " <a href=\"https://art-analytics.appspot.com/r.html?uaid=G-FHXEFWTT4E&utm_source=aRT-eval-recipe-summarization&utm_medium=aRT-clicks&utm_campaign=eval-recipe-summarization&destination=eval-recipe-summarization&url=https%3A%2F%2Fconsole.cloud.google.com%2Fvertex-ai%2Fcolab%2Fimport%2Fhttps%3A%252F%252Fraw.githubusercontent.com%252FGoogleCloudPlatform%252Fapplied-ai-engineering-samples%252Fmain%252Fgenai-on-vertex-ai%252Fgemini%252Fmodel_upgrades%252Fsummarization%252Fvertex_colab%252Fsummarization_eval.ipynb\">\n", " <img width=\"32px\" src=\"https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN\" alt=\"Google Cloud Colab Enterprise logo\"><br> Open in Colab Enterprise\n", " </a>\n", " </td>\n", " <td style=\"text-align: center\">\n", " <a href=\"https://art-analytics.appspot.com/r.html?uaid=G-FHXEFWTT4E&utm_source=aRT-eval-recipe-summarization&utm_medium=aRT-clicks&utm_campaign=eval-recipe-summarization&destination=eval-recipe-summarization&url=https%3A%2F%2Fconsole.cloud.google.com%2Fvertex-ai%2Fworkbench%2Fdeploy-notebook%3Fdownload_url%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fapplied-ai-engineering-samples%2Fmain%2Fgenai-on-vertex-ai%2Fgemini%2Fmodel_upgrades%2Fsummarization%2Fvertex_colab%2Fsummarization_eval.ipynb\">\n", " <img src=\"https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg\" alt=\"Vertex AI logo\"><br> Open in Vertex AI Workbench\n", " </a>\n", " </td>\n", " <td style=\"text-align: center\">\n", " <a href=\"https://github.com/GoogleCloudPlatform/applied-ai-engineering-samples/blob/main/genai-on-vertex-ai/gemini/model_upgrades/summarization/vertex_colab/summarization_eval.ipynb\">\n", " <img width=\"32px\" src=\"https://upload.wikimedia.org/wikipedia/commons/9/91/Octicons-mark-github.svg\" alt=\"GitHub logo\"><br> View on GitHub\n", " </a>\n", " </td>\n", "</table>" ] }, { "cell_type": "markdown", "metadata": { "id": "YYtBxd6y-Ju2" }, "source": [ "- Use case: summarize a news article.\n", "\n", "- Metric: this eval uses an Autorater (LLM Judge) to rate Summarization Quality.\n", "\n", "- Evaluation Dataset is based on [XSum](https://github.com/EdinburghNLP/XSum). It includes 5 news articles stored as plain text files, and a JSONL file with ground truth labels: [`dataset.jsonl`](./dataset.jsonl). Each record in this file includes 2 attributes:\n", " - `document`: relative path to the plain text file containing the news article\n", " - `reference`: ground truth label (short summary of the article)\n", "\n", "- Prompt Template is a zero-shot prompt located in [`prompt_template.txt`](./prompt_template.txt) with variable `document` that gets populated from the corresponding dataset attribute.\n", "\n", "Step 1 of 4: Configure eval settings\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ZZnEA6GZ-kMW" }, "outputs": [], "source": [ "%%writefile .env\n", "PROJECT_ID=your-project-id # Google Cloud Project ID\n", "LOCATION=us-central1 # Region for all required Google Cloud services\n", "EXPERIMENT_NAME=eval-summarization # Creates Vertex AI Experiment to track the eval runs\n", "MODEL_BASELINE=gemini-1.5-flash-001 # Name of your current model\n", "MODEL_CANDIDATE=gemini-2.0-flash-001 # This model will be compared to the baseline model\n", "DATASET_URI=\"gs://gemini_assets/summarization/dataset.jsonl\" # Evaluation dataset in Google Cloud Storage\n", "PROMPT_TEMPLATE_URI=gs://gemini_assets/summarization/prompt_template.txt # Text file in Google Cloud Storage" ] }, { "cell_type": "markdown", "metadata": { "id": "aqeOCM7k9t5h" }, "source": [ "Step 2 of 4: Install Python libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "id": "bR0rvHA3Lby6" }, "outputs": [], "source": [ "%pip install --upgrade --user --quiet google-cloud-aiplatform[evaluation] python-dotenv\n", "# The error \"session crashed\" is expected. Please ignore it and proceed to the next cell.\n", "import IPython\n", "IPython.Application.instance().kernel.do_shutdown(True)" ] }, { "cell_type": "markdown", "metadata": { "id": "9OUfM5Lz9_aM" }, "source": [ "Step 3 of 4: Authenticate to Google Cloud (requires permission to open a popup window)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "VRFZFC6OLby7" }, "outputs": [], "source": [ "import os\n", "import sys\n", "import vertexai\n", "from dotenv import load_dotenv\n", "from google.cloud import storage\n", "\n", "load_dotenv(override=True)\n", "if os.getenv(\"PROJECT_ID\") == \"your-project-id\":\n", " raise ValueError(\"Please configure your Google Cloud Project ID in the first cell.\")\n", "if \"google.colab\" in sys.modules:\n", " from google.colab import auth\n", " auth.authenticate_user()\n", "vertexai.init(project=os.getenv('PROJECT_ID'), location=os.getenv('LOCATION'))" ] }, { "cell_type": "markdown", "metadata": { "id": "CQvekvLt9SWD" }, "source": [ "Step 4 of 4: Run the eval on both models and compare the Accuracy scores" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "KfG9JG9VHaNw" }, "outputs": [], "source": [ "import json\n", "import pandas as pd\n", "from datetime import datetime\n", "from IPython.display import clear_output\n", "from vertexai.evaluation import EvalTask, EvalResult, MetricPromptTemplateExamples\n", "from vertexai.generative_models import GenerativeModel, HarmBlockThreshold, HarmCategory\n", "\n", "def load_file(gcs_uri: str) -> str:\n", " blob = storage.Blob.from_string(gcs_uri, storage.Client())\n", " return blob.download_as_string().decode('utf-8')\n", "\n", "def load_dataset(dataset_uri: str):\n", " jsonl = load_file(dataset_uri)\n", " samples = [json.loads(line) for line in jsonl.splitlines() if line.strip()]\n", " df = pd.DataFrame(samples)\n", " df['document_text'] = df['document_uri'].apply(lambda document_uri: load_file(document_uri))\n", " return df[['document_text', 'reference']]\n", "\n", "def run_eval(model: str) -> EvalResult:\n", " timestamp = f\"{datetime.now().strftime('%b-%d-%H-%M-%S')}\".lower()\n", " return EvalTask(\n", " dataset=load_dataset(os.getenv(\"DATASET_URI\")),\n", " metrics=[MetricPromptTemplateExamples.Pointwise.SUMMARIZATION_QUALITY],\n", " experiment=os.getenv('EXPERIMENT_NAME')\n", " ).evaluate(\n", " model=GenerativeModel(model),\n", " prompt_template=load_file(os.getenv(\"PROMPT_TEMPLATE_URI\")),\n", " experiment_run_name=f\"{timestamp}-{model.replace('.', '-')}\"\n", " )\n", "\n", "baseline_results = run_eval(os.getenv(\"MODEL_BASELINE\"))\n", "candidate_results = run_eval(os.getenv(\"MODEL_CANDIDATE\"))\n", "clear_output()\n", "print(f\"Baseline model score: {baseline_results.summary_metrics['summarization_quality/mean']:.2f}\")\n", "print(f\"Candidate model score: {candidate_results.summary_metrics['summarization_quality/mean']:.2f}\")\n" ] }, { "cell_type": "markdown", "metadata": { "id": "EhZZ030LJGL3" }, "source": [ "You can access all prompts and model responses in `candidate_results.metrics_table`\n", "\n", "Dataset ([XSum](https://github.com/EdinburghNLP/XSum)) citation:\n", " @InProceedings{xsum-emnlp,\n", " author = {Shashi Narayan and Shay B. Cohen and Mirella Lapata},\n", " title = {Don't Give Me the Details, Just the Summary! {T}opic-Aware Convolutional Neural Networks for Extreme Summarization},\n", " booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},\n", " year = {2018}\n", "}\n", "\n", "Please use our [documentation](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval) to learn about all available metrics and customization options." ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "3.10.12", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 0 }