gemini/evaluation/migration_guide_preview_to_GA_sdk.ipynb

{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "id": "OzHQmCjBjZOs" }, "outputs": [], "source": [ "# Copyright 2024 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "5hCmIhlJjZOt" }, "source": [ "# Gen AI Evaluation Service SDK Preview-to-GA Migration Guide | Gen AI Evaluation SDK Tutorial\n", "\n", "\n", "<table align=\"left\">\n", " <td style=\"text-align: center\">\n", " <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/evaluation/migration_guide_preview_to_GA_sdk.ipynb\">\n", " <img width=\"32px\" src=\"https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg\" alt=\"Google Colaboratory logo\"><br> Open in Colab\n", " </a>\n", " </td>\n", " <td style=\"text-align: center\">\n", " <a href=\"https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Fevaluation%2Fmigration_guide_preview_to_GA_sdk.ipynb\">\n", " <img width=\"32px\" src=\"https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN\" alt=\"Google Cloud Colab Enterprise logo\"><br> Open in Colab Enterprise\n", " </a>\n", " </td>\n", " <td style=\"text-align: center\">\n", " <a href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/evaluation/migration_guide_preview_to_GA_sdk.ipynb\">\n", " <img src=\"https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg\" alt=\"Vertex AI logo\"><br> Open in Vertex AI Workbench\n", " </a>\n", " </td>\n", " <td style=\"text-align: center\">\n", " <a href=\"https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/evaluation/migration_guide_preview_to_GA_sdk.ipynb\">\n", " <img width=\"32px\" src=\"https://www.svgrepo.com/download/217753/github.svg\" alt=\"GitHub logo\"><br> View on GitHub\n", " </a>\n", " </td>\n", "</table>\n", "\n", "<div style=\"clear: both;\"></div>\n", "\n", "<b>Share to:</b>\n", "\n", "<a href=\"https://www.linkedin.com/sharing/share-offsite/?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/evaluation/migration_guide_preview_to_GA_sdk.ipynb\" target=\"_blank\">\n", " <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/8/81/LinkedIn_icon.svg\" alt=\"LinkedIn logo\">\n", "</a>\n", "\n", "<a href=\"https://bsky.app/intent/compose?text=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/evaluation/migration_guide_preview_to_GA_sdk.ipynb\" target=\"_blank\">\n", " <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/7/7a/Bluesky_Logo.svg\" alt=\"Bluesky logo\">\n", "</a>\n", "\n", "<a 
href=\"https://twitter.com/intent/tweet?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/evaluation/migration_guide_preview_to_GA_sdk.ipynb\" target=\"_blank\">\n", " <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/5a/X_icon_2.svg\" alt=\"X logo\">\n", "</a>\n", "\n", "<a href=\"https://reddit.com/submit?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/evaluation/migration_guide_preview_to_GA_sdk.ipynb\" target=\"_blank\">\n", " <img width=\"20px\" src=\"https://redditinc.com/hubfs/Reddit%20Inc/Brand/Reddit_Logo.png\" alt=\"Reddit logo\">\n", "</a>\n", "\n", "<a href=\"https://www.facebook.com/sharer/sharer.php?u=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/evaluation/migration_guide_preview_to_GA_sdk.ipynb\" target=\"_blank\">\n", " <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/51/Facebook_f_logo_%282019%29.svg\" alt=\"Facebook logo\">\n", "</a> " ] }, { "cell_type": "markdown", "metadata": { "id": "kPgYxQc1OvFn" }, "source": [ "| | |\n", "|-|-|\n", "|Author(s) | [Jason Dai](https://github.com/jsondai), [Xi Liu](https://github.com/xiliucity) |" ] }, { "cell_type": "markdown", "metadata": { "id": "YHpyjyTyjZOu" }, "source": [ "**_NOTE_**: This notebook has been tested in the following environment:\n", "\n", "* Python version = 3.9" ] }, { "cell_type": "markdown", "metadata": { "id": "VlVVMkanjZOu" }, "source": [ "## Overview\n", "\n", "\n", "In this tutorial, you will get detailed guidance on how to migrate from the Preview version to the latest GA version of *Vertex AI Python SDK for Gen AI Evaluation Service* to evaluate **Retrieval-Augmented Generation** (RAG) and compare two models **side-by-side (SxS)**.\n", "\n", "In the GA release, instead of providing predefined black-box model-based metrics, the evaluation service start providing capability to support defining metrics based on your own criteria. You can still run out-of-box metrics through `MetricPromptTemplateExamples` class we provide in the SDK. The examples covers the following metrics in both Pointwise and Pairwise style.\n", "* `coherence`\n", "* `fluency`\n", "* `safety`\n", "* `groundedness`\n", "* `instruction_following`\n", "* `verbosity`\n", "* `text_quality`\n", "* `summarization_quality`\n", "* `question_answering_quality`\n", "* `multi_turn_chat_quality`\n", "* `multi_turn_safety`\n", "\n", "This notebook would focus on handling the breaking changes. 
{ "cell_type": "markdown", "metadata": { "id": "jFwAqWJgjZOu" }, "source": [ "## Getting Started" ] }, { "cell_type": "markdown", "metadata": { "id": "1450a0568bdc" }, "source": [ "### Install Vertex AI Python SDK for Gen AI Evaluation Service" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "BzjEsUbFjZOv" }, "outputs": [], "source": [ "%pip install --upgrade --user --quiet google-cloud-aiplatform[evaluation]" ] }, { "cell_type": "markdown", "metadata": { "id": "DhgPNOM-jZOv" }, "source": [ "### Restart runtime\n", "To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which restarts the current kernel.\n", "\n", "The restart might take a minute or longer. After it's restarted, continue to the next step." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "lhsENQeVjZOv" }, "outputs": [], "source": [ "import IPython\n", "\n", "app = IPython.Application.instance()\n", "app.kernel.do_shutdown(True)" ] }, { "cell_type": "markdown", "metadata": { "id": "947vOmGSjZOv" }, "source": [ "<div class=\"alert alert-block alert-warning\">\n", "<b>⚠️ The kernel is going to restart. Wait until it's finished before continuing to the next step. ⚠️</b>\n", "</div>\n" ] }, { "cell_type": "markdown", "metadata": { "id": "8AwQGoaCjZOv" }, "source": [ "### Authenticate your notebook environment (Colab only)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "id": "RqtFi1ssjZOv" }, "outputs": [], "source": [ "import sys\n", "\n", "if \"google.colab\" in sys.modules:\n", " from google.colab import auth\n", "\n", " auth.authenticate_user()" ] }, { "cell_type": "markdown", "metadata": { "id": "ZIO45hdAjZOw" }, "source": [ "### Set Google Cloud project information and initialize Vertex AI SDK" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "id": "wFd_RXDLjZOw" }, "outputs": [], "source": [ "PROJECT_ID = \"[your-project-id]\" # @param {type:\"string\"}\n", "LOCATION = \"us-central1\" # @param {type:\"string\"}\n", "EXPERIMENT = \"eval-migration-ga\" # @param {type:\"string\"}\n", "\n", "if not PROJECT_ID or PROJECT_ID == \"[your-project-id]\":\n", " raise ValueError(\"Please set your PROJECT_ID\")\n", "\n", "import vertexai\n", "\n", "vertexai.init(project=PROJECT_ID, location=LOCATION)" ] }, { "cell_type": "markdown", "metadata": { "id": "97T5LEM1jZOw" }, "source": [ "### Import libraries\n", "\n", "Please update the import path to the GA version of the SDK by changing `from vertexai.preview.evaluation` to **`from vertexai.evaluation`**."
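, "\n", "\n", "For example, here is a minimal before/after sketch of the change (showing a subset of the classes imported in the next cell):\n", "\n", "```python\n", "# Preview SDK (old import path):\n", "# from vertexai.preview.evaluation import EvalTask, PointwiseMetric\n", "\n", "# GA SDK (new import path):\n", "from vertexai.evaluation import EvalTask, PointwiseMetric\n", "```"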
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "Fx5fiNH7jZOw" }, "outputs": [], "source": [ "import pandas as pd\n", "from vertexai.evaluation import (\n", " EvalTask,\n", " MetricPromptTemplateExamples,\n", " PairwiseMetric,\n", " PointwiseMetric,\n", ")\n", "from vertexai.generative_models import GenerativeModel\n", "from vertexai.preview.evaluation import notebook_utils" ] }, { "cell_type": "markdown", "metadata": { "id": "k65D8EW-jZOw" }, "source": [ "## How to handle discontinued metrics\n", "\n", "We removed the following metrics support from SDK:\n", "* `question_answering_helpfulness`\n", "* `question_answering_relevance`\n", "* `question_answering_correctness`\n", "* `summarization_helpfulness`\n", "* `summarization_verbosity`\n", "* `fulfillment` (rename to `instruction_following`)\n", "\n", "The rationale of this is because\n", "* We now provide two generic metric interface (`PointwiseMetirc` and `PairwiseMetric`) for customers to define the metrics with their own criteria, which is more transpairent and more affective. We also provide\n", "* Many of the metrics here should not be task specific. For example, `helpfulness`, `relevance`, and `verbosity` can be applied to all text-related tasks.\n", "* Some metrics here are not intuitive to users and are very subjective. For example, what is `question_answering_helpfulness`? How to define `helpfulness`? For different customers, the criteria of `helpfulness` can be totally different.\n", "\n", "**We recommend you to use the new `MetricPromptTemplateExamples` we provide, and adjust them for your own use cases.** However, if you still want to use the above discontinued metrics:\n", "* You can pin to an old version of the SDK, since we still maintain the API to support those metrics. The previous version `1.62.0` would be the recommended preview version to pin to. Example code:\n", "\n", " ```%pip install -q google-cloud-aiplatform[rapid-evaluation]==1.62.0```\n", "\n", "* You can use `instruction_following` to replace `fulfillment`, use `verbosity` to replace `summarization_verbosity`.\n", "\n", "* We provide examples below to help you define `fulfillment`, `helpfulness`, `relevance` in case you would still like to use them in your application.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "VniN4E877-Xj" }, "source": [ "### Define my own version of the discontinued metrics" ] }, { "cell_type": "markdown", "metadata": { "id": "2oH25Qc2jZOw" }, "source": [ "#### Prepare Dataset" ] }, { "cell_type": "markdown", "metadata": { "id": "pfKz1jZcjZOw" }, "source": [ "To evaluate the RAG generated answers, the evaluation dataset is required to contain the following fields:\n", "\n", "* Prompt: The user supplied prompt consisting of the User Question and the RAG Retrieved Context\n", "* Response: The RAG Generated Answer" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "id": "-U27Ifc7jZOw" }, "outputs": [], "source": [ "questions = [\n", " \"Which part of the brain does short-term memory seem to rely on?\",\n", " \"What provided the Roman senate with exuberance?\",\n", " \"What area did the Hasan-jalalians command?\",\n", "]\n", "\n", "retrieved_contexts = [\n", " \"Short-term memory is supported by transient patterns of neuronal communication, dependent on regions of the frontal lobe (especially dorsolateral prefrontal cortex) and the parietal lobe. Long-term memory, on the other hand, is maintained by more stable and permanent changes in neural connections widely spread throughout the brain. 
The hippocampus is essential (for learning new information) to the consolidation of information from short-term to long-term memory, although it does not seem to store information itself. Without the hippocampus, new memories are unable to be stored into long-term memory, as learned from patient Henry Molaison after removal of both his hippocampi, and there will be a very short attention span. Furthermore, it may be involved in changing neural connections for a period of three months or more after the initial learning.\",\n", " \"In 62 BC, Pompey returned victorious from Asia. The Senate, elated by its successes against Catiline, refused to ratify the arrangements that Pompey had made. Pompey, in effect, became powerless. Thus, when Julius Caesar returned from a governorship in Spain in 61 BC, he found it easy to make an arrangement with Pompey. Caesar and Pompey, along with Crassus, established a private agreement, now known as the First Triumvirate. Under the agreement, Pompey's arrangements would be ratified. Caesar would be elected consul in 59 BC, and would then serve as governor of Gaul for five years. Crassus was promised a future consulship.\",\n", " \"The Seljuk Empire soon started to collapse. In the early 12th century, Armenian princes of the Zakarid noble family drove out the Seljuk Turks and established a semi-independent Armenian principality in Northern and Eastern Armenia, known as Zakarid Armenia, which lasted under the patronage of the Georgian Kingdom. The noble family of Orbelians shared control with the Zakarids in various parts of the country, especially in Syunik and Vayots Dzor, while the Armenian family of Hasan-Jalalians controlled provinces of Artsakh and Utik as the Kingdom of Artsakh.\",\n", "]\n", "\n", "generated_answers = [\n", " \"frontal lobe and the parietal lobe\",\n", " \"The Roman Senate was filled with exuberance due to successes against Catiline.\",\n", " \"The Hasan-Jalalians commanded the area of Syunik and Vayots Dzor.\",\n", "]\n", "\n", "baseline_answers = [\n", " \"the frontal cortex and the parietal cortex, which are crucial for sensory and cognitive functions\",\n", " \"The Roman Senate celebrated triumphantly after significant victories over Catiline, bolstering their political influence\",\n", " \"The Hasan-Jalalians held control over the regions of Syunik and Vayots Dzor, maintaining power through strategic alliances and military strength\",\n", "]\n", "\n", "eval_dataset = pd.DataFrame(\n", " {\n", " \"instruction\": questions,\n", " \"context\": retrieved_contexts,\n", " \"response\": generated_answers,\n", " }\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "a6AlHJ_bjZOw" }, "source": [ "#### Define the metrics\n", "\n", "\n", "We will define `fulfillment`, `helpfulness`, and `relevance` here in order to replace the discontinued ones." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "id": "1lh_AFu380NZ" }, "outputs": [], "source": [ "relevance_prompt_template = \"\"\"\n", "# Instruction\n", "You are a professional writing evaluator. 
Your job is to score writing responses according to pre-defined evaluation criteria.\n", "You will be assessing question answering relevance, which measures the ability to respond with relevant information when asked a question.\n", "You will assign the writing response a score from 5, 4, 3, 2, 1, following the INDIVIDUAL RATING RUBRIC and EVALUATION STEPS.\n", "\n", "# Evaluation\n", "## Criteria\n", "Relevance: The response should be relevant to the instruction and directly address the instruction.\n", "\n", "## Rating Rubric\n", "5 (completely relevant): Response is entirely relevant to the instruction and provides clearly defined information that addresses the instruction's core needs directly.\n", "4 (mostly relevant): Response is mostly relevant to the instruction and addresses the instruction mostly directly.\n", "3 (somewhat relevant): Response is somewhat relevant to the instruction and may address the instruction indirectly, but could be more relevant and more direct.\n", "2 (somewhat irrelevant): Response is minimally relevant to the instruction and does not address the instruction directly.\n", "1 (irrelevant): Response is completely irrelevant to the instruction.\n", "\n", "## Evaluation Steps\n", "STEP 1: Assess relevance: is response relevant to the instruction and directly address the instruction?\n", "STEP 2: Score based on the criteria and rubrics.\n", "\n", "Give step by step explanations for your scoring, and only choose scores from 5, 4, 3, 2, 1.\n", "\n", "# User Inputs and AI-generated Response\n", "## User Inputs\n", "### INSTRUCTION\n", "{instruction}\n", "\n", "### CONTEXT\n", "{context}\n", "\n", "## AI-generated Response\n", "{response}\n", "\"\"\"\n", "\n", "relevance = PointwiseMetric(\n", " metric=\"relevance\",\n", " metric_prompt_template=relevance_prompt_template,\n", ")" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "id": "tLb0qS_O98ZL" }, "outputs": [], "source": [ "helpfulness_prompt_template = \"\"\"\n", "# Instruction\n", "You are a professional writing evaluator. Your job is to score writing responses according to pre-defined evaluation criteria.\n", "You will be assessing question answering helpfulness, which measures the ability to provide important details when answering a question.\n", "You will assign the writing response a score from 5, 4, 3, 2, 1, following the INDIVIDUAL RATING RUBRIC and EVALUATION STEPS.\n", "\n", "# Evaluation\n", "## Criteria\n", "Helpfulness: The response is comprehensive with well-defined key details. The user would feel very satisfied with the content in a good response.\n", "\n", "## Rating Rubric\n", "5 (completely helpful): Response is useful and very comprehensive with well-defined key details to address the needs in the question and usually beyond what explicitly asked. The user would feel very satisfied with the content in the response.\n", "4 (mostly helpful): Response is very relevant to the question, providing clearly defined information that addresses the question's core needs. It may include additional insights that go slightly beyond the immediate question. The user would feel quite satisfied with the content in the response.\n", "3 (somewhat helpful): Response is relevant to the question and provides some useful content, but could be more relevant, well-defined, comprehensive, and/or detailed. 
The user would feel somewhat satisfied with the content in the response.\n", "2 (somewhat unhelpful): Response is minimally relevant to the question and may provide some vaguely useful information, but it lacks clarity and detail. It might contain minor inaccuracies. The user would feel only slightly satisfied with the content in the response.\n", "1 (unhelpful): Response is useless/irrelevant, contains inaccurate/deceptive/misleading information, and/or contains harmful/offensive content. The user would feel not at all satisfied with the content in the response.\n", "\n", "## Evaluation Steps\n", "STEP 1: Assess comprehensiveness: does the response provide specific, comprehensive, and clearly defined information for the user needs expressed in the question?\n", "STEP 2: Assess relevance: When appropriate for the question, does the response exceed the question by providing relevant details and related information to contextualize content and help the user better understand the response.\n", "STEP 3: Assess accuracy: Is the response free of inaccurate, deceptive, or misleading information?\n", "STEP 4: Assess safety: Is the response free of harmful or offensive content?\n", "\n", "Give step by step explanations for your scoring, and only choose scores from 5, 4, 3, 2, 1.\n", "\n", "# User Inputs and AI-generated Response\n", "## User Inputs\n", "### INSTRUCTION\n", "{instruction}\n", "\n", "### CONTEXT\n", "{context}\n", "\n", "## AI-generated Response\n", "{response}\n", "\"\"\"\n", "\n", "helpfulness = PointwiseMetric(\n", " metric=\"helpfulness\",\n", " metric_prompt_template=helpfulness_prompt_template,\n", ")" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "id": "37spZd6x-fZJ" }, "outputs": [], "source": [ "fulfillment_prompt_template = \"\"\"\n", "# Instruction\n", "You are a professional writing evaluator. Your job is to score writing responses according to pre-defined evaluation criteria.\n", "You will be assessing fulfillment, which measures the ability to follow instructions.\n", "You will assign the writing response a score from 5, 4, 3, 2, 1, following the INDIVIDUAL RATING RUBRIC and EVALUATION STEPS.\n", "\n", "# Evaluation\n", "## Criteria\n", "Instruction following: The response demonstrates a clear understanding of the instructions, satisfying all of the instruction's requirements.\n", "\n", "## Rating Rubric\n", "5 (complete fulfillment): Response addresses all aspects and adheres to all requirements of the instruction. The user would feel like their instruction was completely understood.\n", "4 (good fulfillment): Response addresses most aspects and requirements of the instruction. It might miss very minor details or have slight deviations from requirements. The user would feel like their instruction was well understood.\n", "3 (some fulfillment): Response does not address some minor aspects and/or ignores some requirements of the instruction. The user would feel like their instruction was partially understood.\n", "2 (poor fulfillment): Response addresses some aspects of the instruction but misses key requirements or major components. The user would feel like their instruction was misunderstood in significant ways.\n", "1 (no fulfillment): Response does not address the most important aspects of the instruction. 
The user would feel like their request was not at all understood.\n", "\n", "## Evaluation Steps\n", "STEP 1: Assess instruction understanding: Does the response address the intent of the instruction such that a user would not feel the instruction was ignored or misinterpreted by the response?\n", "STEP 2: Assess requirements adherence: Does the response adhere to any requirements indicated in the instruction such as an explicitly specified word length, tone, format, or information that the response should include?\n", "\n", "Give step by step explanations for your scoring, and only choose scores from 5, 4, 3, 2, 1.\n", "\n", "# User Inputs and AI-generated Response\n", "## User Inputs\n", "### INSTRUCTION\n", "{instruction}\n", "\n", "## AI-generated Response\n", "{response}\n", "\"\"\"\n", "\n", "fulfillment = PointwiseMetric(\n", " metric=\"fulfillment\",\n", " metric_prompt_template=fulfillment_prompt_template,\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "Ea0dMyPY_Fnh" }, "source": [ "#### Run Evaluation with defined metrics\n", "\n", "Now you can run evaluation as before, using these three metrics." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "hufni_ie_L9i" }, "outputs": [], "source": [ "eval_result = EvalTask(\n", " dataset=eval_dataset,\n", " metrics=[\n", " relevance,\n", " helpfulness,\n", " fulfillment,\n", " ],\n", " experiment=EXPERIMENT,\n", ").evaluate()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "PTWN8A0X-c1i" }, "outputs": [], "source": [ "notebook_utils.display_eval_result(eval_result)" ] }, { "cell_type": "markdown", "metadata": { "id": "zFnSqBUpA6sO" }, "source": [ "## How to handle the new input schema\n", "\n", "\n", "In the GA release, all of the `MetricPromptTemplateExamples` take `prompt` and `response`/`baseline_model_response` as inputs, instead of more fine-grained inputs such as `instruction` and `context`. The rationales are:\n", "* For most users, the full input prompt is what they have, rather than a separate `instruction` and `context`. It is also difficult to preprocess the user input prompt and break it down into `instruction` and `context`.\n", "* In an increasing number of use cases, we've observed that the input prompts contain such complex and intertwined information that they can't be broken down any further.\n", "\n", "The solution is simple:\n", "* **(Recommended)** Assemble \"instruction\" and \"context\" into a full input prompt with a simple prompt template such as `{instruction}: {context}`, and then use it for evaluation, or simply assemble them with a line of Python code.\n",
"* (Not recommended) Modify the `MetricPromptTemplateExamples` template and make it take \"instruction\" and \"context\" as inputs instead of \"prompt\".\n" ] }, { "cell_type": "markdown", "metadata": { "id": "_gJ4R_CvB7wI" }, "source": [ "### Example of preprocessing the dataset" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "id": "8x71euRlAvzZ" }, "outputs": [], "source": [ "new_eval_dataset = pd.DataFrame(\n", " {\n", " \"prompt\": [\n", " \"Answer the question: \" + question + \" Context: \" + context\n", " for question, context in zip(questions, retrieved_contexts)\n", " ],\n", " \"response\": generated_answers,\n", " }\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "uLW1Zk38B7Ok" }, "outputs": [], "source": [ "# Run evaluation with new metric prompt template examples\n", "eval_result = EvalTask(\n", " dataset=new_eval_dataset,\n", " metrics=[\n", " \"question_answering_quality\",\n", " \"groundedness\",\n", " \"safety\",\n", " \"instruction_following\",\n", " ],\n", " experiment=EXPERIMENT,\n", ").evaluate()\n", "\n", "notebook_utils.display_eval_result(eval_result)" ] }, { "cell_type": "markdown", "metadata": { "id": "Vge-TGGz3mXV" }, "source": [ "## How to migrate to `PairwiseMetric` for AutoSxS Evaluation\n", "\n", "The pipeline-based `AutoSxS` evaluation will be deprecated and replaced by the GA version of the Gen AI Eval Service SDK. The rationales are:\n", "\n", "* Better judge model (Autorater) quality: the Gen AI Eval Service SDK uses the latest Gemini model instead of the legacy `PaLM` model that AutoSxS uses.\n", "\n", "* Faster and easier to use: the SDK provides a faster and more intuitive interface than pipelines, allowing users to perform side-by-side (SxS) evaluation and see results more rapidly.\n", "\n", "* More flexibility: You can define your own pairwise comparison criteria and rating rubrics, and compute multiple pairwise metrics together in an `EvalTask`.\n", "\n", "**Solution:**\n", "\n", "* Use the `PairwiseMetric` class in the Gen AI Eval Service SDK to perform SxS evaluation for two models.\n", "\n", "* If you have a stored evaluation dataset in Google Cloud Storage (GCS) or BigQuery (BQ), you can directly provide the URI in the `dataset` parameter when defining your `EvalTask` (see the sketch after this list).\n", "\n", "* If your dataset contains fine-grained columns like `instruction` and `context`, assemble them into a full input prompt with a simple prompt template such as `{instruction}: {context}`, and then use it for evaluation, or simply assemble them with a line of Python code.\n" ] },
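{ "cell_type": "markdown", "metadata": {}, "source": [ "As a minimal sketch of the stored-dataset option above: the GCS and BigQuery URIs below are placeholders, and the sketch assumes the stored file or table already contains the columns required by your chosen metrics (for example `prompt`, `response`, and `baseline_model_response` for a bring-your-own-response pairwise evaluation).\n", "\n", "```python\n", "# Provide a Cloud Storage JSONL/CSV URI (or a BigQuery table URI) directly as the dataset\n", "byod_pairwise_task = EvalTask(\n", "    dataset=\"gs://your-bucket/path/to/eval_dataset.jsonl\",  # or \"bq://your-project.your_dataset.your_table\"\n", "    metrics=[MetricPromptTemplateExamples.Pairwise.SAFETY],\n", "    experiment=EXPERIMENT,\n", ")\n", "# byod_pairwise_result = byod_pairwise_task.evaluate()\n", "```" ] },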
{ "cell_type": "markdown", "metadata": { "id": "sFHRuWjv86uG" }, "source": [ "### Evaluate two models side-by-side with `PairwiseMetric`" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "id": "B0_JmClz4t5x" }, "outputs": [], "source": [ "eval_dataset = pd.DataFrame(\n", " {\n", " \"prompt\": [\n", " \"Answer the question: \" + question + \" Context: \" + context\n", " for question, context in zip(questions, retrieved_contexts)\n", " ],\n", " }\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "sDAaYDqY9FBu" }, "outputs": [], "source": [ "metric_name = \"pairwise_text_quality\"\n", "pairwise_text_quality_result = EvalTask(\n", " dataset=eval_dataset,\n", " metrics=[\n", " PairwiseMetric(\n", " metric=metric_name,\n", " metric_prompt_template=MetricPromptTemplateExamples.get_prompt_template(\n", " metric_name\n", " ),\n", " # Specify baseline model for pairwise comparison\n", " baseline_model=GenerativeModel(\n", " \"gemini-2.0-flash\",\n", " ),\n", " ),\n", " ],\n", " experiment=EXPERIMENT,\n", ").evaluate(\n", " # Specify candidate model for pairwise comparison\n", " model=GenerativeModel(\"gemini-2.0-flash\"),\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bPAoJu53-mfx" }, "outputs": [], "source": [ "notebook_utils.display_eval_result(pairwise_text_quality_result)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "NaUyJFbAHsNN" }, "outputs": [], "source": [ "notebook_utils.display_explanations(\n", " pairwise_text_quality_result, metrics=[metric_name], num=1\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "lpSbexN9I-vW" }, "outputs": [], "source": [ "candidate_model_win_rate = round(\n", " pairwise_text_quality_result.summary_metrics[\n", " f\"{metric_name}/candidate_model_win_rate\"\n", " ]\n", " * 100\n", ")\n", "print(\n", " f\"Win rate: Autorater prefers Candidate Model over Baseline Model {candidate_model_win_rate}% of the time.\"\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "J4LDtmU2_tnU" }, "source": [ "### Bring-your-own-response for SxS Evaluation" ] }, { "cell_type": "markdown", "metadata": { "id": "valIMVs53f47" }, "source": [ "#### Calculate a pairwise metric on the saved responses in the eval dataset" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "id": "fK6jHUXhOn61" }, "outputs": [], "source": [ "eval_dataset = pd.DataFrame(\n", " {\n", " \"question\": questions,\n", " \"context\": retrieved_contexts,\n", " \"response\": generated_answers,\n", " \"baseline_model_response\": baseline_answers,\n", " }\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "HtvjARWYj66F" }, "outputs": [], "source": [ "# Define an EvalTask with 2 example pairwise metrics\n", "byor_pairwise_result = EvalTask(\n", " dataset=eval_dataset,\n", " metrics=[\n", " MetricPromptTemplateExamples.Pairwise.VERBOSITY,\n", " MetricPromptTemplateExamples.Pairwise.SAFETY,\n", " ],\n", " experiment=EXPERIMENT,\n", ").evaluate(\n", " prompt_template=\"Answer the question: {question} Context: {context}\",\n", " evaluation_service_qps=5,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "DbYpGVBbILtt" }, "outputs": [], "source": [ "notebook_utils.display_eval_result(byor_pairwise_result)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "cPa03kOGQZz9" }, "outputs": [], "source": [
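"# Display the autorater's explanations for the pairwise verbosity comparison\n",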
"notebook_utils.display_explanations(\n", " byor_pairwise_result, metrics=[\"pairwise_verbosity\"]\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "tECY8u-E69Eo" }, "outputs": [], "source": [ "candidate_model_win_rate = round(\n", " byor_pairwise_result.summary_metrics[\"pairwise_verbosity/candidate_model_win_rate\"]\n", " * 100\n", ")\n", "print(\n", " f\"Win rate: Autorater prefers Candidate Model over Baseline Model {candidate_model_win_rate}% of time.\"\n", ")" ] } ], "metadata": { "colab": { "collapsed_sections": [ "tfQ7sPtOjZOw", "42svczJqjZOw" ], "name": "migration_guide_preview_to_GA_sdk.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }