genai-on-vertex-ai/gemini/model_upgrades/summarization/vertex_colab/summarization_eval.ipynb
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "sZtfr2Gyx_qM"
},
"outputs": [],
"source": [
"# Copyright 2025 Google LLC\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"as is\" basis,\n",
"# without warranties or conditions of any kind, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "LbBTb55fx_qN"
},
"source": [
"\n",
"## **Summarization Eval Recipe**\n",
"\n",
"This Eval Recipe demonstrates how to compare performance of two models on a summarization task using [Vertex AI Evaluation Service](https://cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-overview)."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "AjNAklK4x_qN"
},
"source": [
"<table align=\"left\">\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://art-analytics.appspot.com/r.html?uaid=G-FHXEFWTT4E&utm_source=aRT-eval-recipe-summarization&utm_medium=aRT-clicks&utm_campaign=eval-recipe-summarization&destination=eval-recipe-summarization&url=https%3A%2F%2Fcolab.research.google.com%2Fgithub%2FGoogleCloudPlatform%2Fapplied-ai-engineering-samples%2Fblob%2Fmain%2Fgenai-on-vertex-ai%2Fgemini%2Fmodel_upgrades%2Fsummarization%2Fvertex_colab%2Fsummarization_eval.ipynb\">\n",
" <img width=\"32px\" src=\"https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg\" alt=\"Google Colaboratory logo\"><br> Open in Colab\n",
" </a>\n",
" </td>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://art-analytics.appspot.com/r.html?uaid=G-FHXEFWTT4E&utm_source=aRT-eval-recipe-summarization&utm_medium=aRT-clicks&utm_campaign=eval-recipe-summarization&destination=eval-recipe-summarization&url=https%3A%2F%2Fconsole.cloud.google.com%2Fvertex-ai%2Fcolab%2Fimport%2Fhttps%3A%252F%252Fraw.githubusercontent.com%252FGoogleCloudPlatform%252Fapplied-ai-engineering-samples%252Fmain%252Fgenai-on-vertex-ai%252Fgemini%252Fmodel_upgrades%252Fsummarization%252Fvertex_colab%252Fsummarization_eval.ipynb\">\n",
" <img width=\"32px\" src=\"https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN\" alt=\"Google Cloud Colab Enterprise logo\"><br> Open in Colab Enterprise\n",
" </a>\n",
" </td>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://art-analytics.appspot.com/r.html?uaid=G-FHXEFWTT4E&utm_source=aRT-eval-recipe-summarization&utm_medium=aRT-clicks&utm_campaign=eval-recipe-summarization&destination=eval-recipe-summarization&url=https%3A%2F%2Fconsole.cloud.google.com%2Fvertex-ai%2Fworkbench%2Fdeploy-notebook%3Fdownload_url%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fapplied-ai-engineering-samples%2Fmain%2Fgenai-on-vertex-ai%2Fgemini%2Fmodel_upgrades%2Fsummarization%2Fvertex_colab%2Fsummarization_eval.ipynb\">\n",
" <img src=\"https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg\" alt=\"Vertex AI logo\"><br> Open in Vertex AI Workbench\n",
" </a>\n",
" </td>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://github.com/GoogleCloudPlatform/applied-ai-engineering-samples/blob/main/genai-on-vertex-ai/gemini/model_upgrades/summarization/vertex_colab/summarization_eval.ipynb\">\n",
" <img width=\"32px\" src=\"https://upload.wikimedia.org/wikipedia/commons/9/91/Octicons-mark-github.svg\" alt=\"GitHub logo\"><br> View on GitHub\n",
" </a>\n",
" </td>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "YYtBxd6y-Ju2"
},
"source": [
"- Use case: summarize a news article.\n",
"\n",
"- Metric: this eval uses an Autorater (LLM Judge) to rate Summarization Quality.\n",
"\n",
"- Evaluation Dataset is based on [XSum](https://github.com/EdinburghNLP/XSum). It includes 5 news articles stored as plain text files, and a JSONL file with ground truth labels: [`dataset.jsonl`](./dataset.jsonl). Each record in this file includes 2 attributes:\n",
" - `document`: relative path to the plain text file containing the news article\n",
" - `reference`: ground truth label (short summary of the article)\n",
"\n",
"- Prompt Template is a zero-shot prompt located in [`prompt_template.txt`](./prompt_template.txt) with variable `document` that gets populated from the corresponding dataset attribute.\n",
"\n",
"Step 1 of 4: Configure eval settings\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ZZnEA6GZ-kMW"
},
"outputs": [],
"source": [
"%%writefile .env\n",
"PROJECT_ID=your-project-id # Google Cloud Project ID\n",
"LOCATION=us-central1 # Region for all required Google Cloud services\n",
"EXPERIMENT_NAME=eval-summarization # Creates Vertex AI Experiment to track the eval runs\n",
"MODEL_BASELINE=gemini-1.5-flash-001 # Name of your current model\n",
"MODEL_CANDIDATE=gemini-2.0-flash-001 # This model will be compared to the baseline model\n",
"DATASET_URI=\"gs://gemini_assets/summarization/dataset.jsonl\" # Evaluation dataset in Google Cloud Storage\n",
"PROMPT_TEMPLATE_URI=gs://gemini_assets/summarization/prompt_template.txt # Text file in Google Cloud Storage"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "aqeOCM7k9t5h"
},
"source": [
"Step 2 of 4: Install Python libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"id": "bR0rvHA3Lby6"
},
"outputs": [],
"source": [
"%pip install --upgrade --user --quiet google-cloud-aiplatform[evaluation] python-dotenv\n",
"# The error \"session crashed\" is expected. Please ignore it and proceed to the next cell.\n",
"import IPython\n",
"IPython.Application.instance().kernel.do_shutdown(True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9OUfM5Lz9_aM"
},
"source": [
"Step 3 of 4: Authenticate to Google Cloud (requires permission to open a popup window)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "VRFZFC6OLby7"
},
"outputs": [],
"source": [
"import os\n",
"import sys\n",
"import vertexai\n",
"from dotenv import load_dotenv\n",
"from google.cloud import storage\n",
"\n",
"load_dotenv(override=True)\n",
"if os.getenv(\"PROJECT_ID\") == \"your-project-id\":\n",
" raise ValueError(\"Please configure your Google Cloud Project ID in the first cell.\")\n",
"if \"google.colab\" in sys.modules:\n",
" from google.colab import auth\n",
" auth.authenticate_user()\n",
"vertexai.init(project=os.getenv('PROJECT_ID'), location=os.getenv('LOCATION'))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CQvekvLt9SWD"
},
"source": [
"Step 4 of 4: Run the eval on both models and compare the Accuracy scores"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "KfG9JG9VHaNw"
},
"outputs": [],
"source": [
"import json\n",
"import pandas as pd\n",
"from datetime import datetime\n",
"from IPython.display import clear_output\n",
"from vertexai.evaluation import EvalTask, EvalResult, MetricPromptTemplateExamples\n",
"from vertexai.generative_models import GenerativeModel, HarmBlockThreshold, HarmCategory\n",
"\n",
"def load_file(gcs_uri: str) -> str:\n",
" blob = storage.Blob.from_string(gcs_uri, storage.Client())\n",
" return blob.download_as_string().decode('utf-8')\n",
"\n",
"def load_dataset(dataset_uri: str):\n",
" jsonl = load_file(dataset_uri)\n",
" samples = [json.loads(line) for line in jsonl.splitlines() if line.strip()]\n",
" df = pd.DataFrame(samples)\n",
" df['document_text'] = df['document_uri'].apply(lambda document_uri: load_file(document_uri))\n",
" return df[['document_text', 'reference']]\n",
"\n",
"def run_eval(model: str) -> EvalResult:\n",
" timestamp = f\"{datetime.now().strftime('%b-%d-%H-%M-%S')}\".lower()\n",
" return EvalTask(\n",
" dataset=load_dataset(os.getenv(\"DATASET_URI\")),\n",
" metrics=[MetricPromptTemplateExamples.Pointwise.SUMMARIZATION_QUALITY],\n",
" experiment=os.getenv('EXPERIMENT_NAME')\n",
" ).evaluate(\n",
" model=GenerativeModel(model),\n",
" prompt_template=load_file(os.getenv(\"PROMPT_TEMPLATE_URI\")),\n",
" experiment_run_name=f\"{timestamp}-{model.replace('.', '-')}\"\n",
" )\n",
"\n",
"baseline_results = run_eval(os.getenv(\"MODEL_BASELINE\"))\n",
"candidate_results = run_eval(os.getenv(\"MODEL_CANDIDATE\"))\n",
"clear_output()\n",
"print(f\"Baseline model score: {baseline_results.summary_metrics['summarization_quality/mean']:.2f}\")\n",
"print(f\"Candidate model score: {candidate_results.summary_metrics['summarization_quality/mean']:.2f}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "EhZZ030LJGL3"
},
"source": [
"You can access all prompts and model responses in `candidate_results.metrics_table`\n",
"\n",
"Dataset ([XSum](https://github.com/EdinburghNLP/XSum)) citation:\n",
" @InProceedings{xsum-emnlp,\n",
" author = {Shashi Narayan and Shay B. Cohen and Mirella Lapata},\n",
" title = {Don't Give Me the Details, Just the Summary! {T}opic-Aware Convolutional Neural Networks for Extreme Summarization},\n",
" booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},\n",
" year = {2018}\n",
"}\n",
"\n",
"Please use our [documentation](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval) to learn about all available metrics and customization options."
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "3.10.12",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 0
}