{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ur8xi4C7S06n"
},
"outputs": [],
"source": [
"# Copyright 2025 Google LLC\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "JAPoU8Sm5E6e"
},
"source": [
"# Evaluate your autorater with meta-evaluation\n",
"\n",
"<table align=\"left\">\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/tree/main/gemini/evaluation/evaluate_autorater.ipynb\">\n",
" <img width=\"32px\" src=\"https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg\" alt=\"Google Colaboratory logo\"><br> Open in Colab\n",
" </a>\n",
" </td>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Fevaluation%2Fevaluate_autorater.ipynb\">\n",
" <img width=\"32px\" src=\"https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN\" alt=\"Google Cloud Colab Enterprise logo\"><br> Open in Colab Enterprise\n",
" </a>\n",
" </td>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/evaluation/evaluate_autorater.ipynb\">\n",
" <img src=\"https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg\" alt=\"Vertex AI logo\"><br> Open in Vertex AI Workbench\n",
" </a>\n",
" </td>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://console.cloud.google.com/bigquery/import?url=https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/evaluation/evaluate_autorater.ipynb\">\n",
" <img src=\"https://www.gstatic.com/images/branding/gcpiconscolors/bigquery/v1/32px.svg\" alt=\"BigQuery Studio logo\"><br> Open in BigQuery Studio\n",
" </a>\n",
" </td>\n",
" <td style=\"text-align: center\">\n",
" <a href=\"https://github.com/GoogleCloudPlatform/generative-ai/blob/main/tree/main/gemini/evaluation/evaluate_autorater.ipynb\">\n",
" <img width=\"32px\" src=\"https://www.svgrepo.com/download/217753/github.svg\" alt=\"GitHub logo\"><br> View on GitHub\n",
" </a>\n",
" </td>\n",
"</table>\n",
"\n",
"<div style=\"clear: both;\"></div>\n",
"\n",
"<b>Share to:</b>\n",
"\n",
"<a href=\"https://www.linkedin.com/sharing/share-offsite/?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/evaluation/evaluate_autorater.ipynb\" target=\"_blank\">\n",
" <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/8/81/LinkedIn_icon.svg\" alt=\"LinkedIn logo\">\n",
"</a>\n",
"\n",
"<a href=\"https://bsky.app/intent/compose?text=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/evaluation/evaluate_autorater.ipynb\" target=\"_blank\">\n",
" <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/7/7a/Bluesky_Logo.svg\" alt=\"Bluesky logo\">\n",
"</a>\n",
"\n",
"<a href=\"https://twitter.com/intent/tweet?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/evaluation/evaluate_autorater.ipynb\" target=\"_blank\">\n",
" <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/5a/X_icon_2.svg\" alt=\"X logo\">\n",
"</a>\n",
"\n",
"<a href=\"https://reddit.com/submit?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/evaluation/evaluate_autorater.ipynb\" target=\"_blank\">\n",
" <img width=\"20px\" src=\"https://redditinc.com/hubfs/Reddit%20Inc/Brand/Reddit_Logo.png\" alt=\"Reddit logo\">\n",
"</a>\n",
"\n",
"<a href=\"https://www.facebook.com/sharer/sharer.php?u=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/evaluation/evaluate_autorater.ipynb\" target=\"_blank\">\n",
" <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/51/Facebook_f_logo_%282019%29.svg\" alt=\"Facebook logo\">\n",
"</a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "84f0f73a0f76"
},
"source": [
"| | |\n",
"|-|-|\n",
"| Author(s) | [Yuan (Emily) Xue](https://www.linkedin.com/in/yuan-emily-xue-3483012/), [Irina Sigler](https://www.linkedin.com/in/irina-sigler-298a59b1/)|"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "tvgnzT1CKxrO"
},
"source": [
"## Overview\n",
"\n",
"It's crucial to assess the effectiveness of your Large Language Model (LLM) evaluator to ensure it's guiding you correctly. This process, known as meta-evaluation, is a key step in establishing a task-specific evaluation framework.\n",
"\n",
"Essentially, it involves comparing the performance of automated evaluation systems (evaluators) against human judgments to determine how well they align with human preferences. This calibration is often done using agreement or correlation measures, depending on the specific evaluation task.\n",
"\n",
"This tutorial offers the design of two simple autoraters and a step-by-step guide to evaluate autoraters."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "61RBz8LLbxCR"
},
"source": [
"## Get started"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "No17Cw5hgx12"
},
"source": [
"### Install Vertex AI SDK and other required packages\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "tFy3H3aPgx12"
},
"outputs": [],
"source": [
"%pip install --upgrade --user --quiet google-cloud-aiplatform"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "R5Xep4W9lq-Z"
},
"source": [
"### Restart runtime\n",
"\n",
"To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which restarts the current kernel.\n",
"\n",
"The restart might take a minute or longer. After it's restarted, continue to the next step."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "XRvKdaPDTznN"
},
"outputs": [],
"source": [
"import IPython\n",
"\n",
"app = IPython.Application.instance()\n",
"app.kernel.do_shutdown(True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "SbmM4z7FOBpM"
},
"source": [
"<div class=\"alert alert-block alert-warning\">\n",
"<b>⚠️ The kernel is going to restart. In Colab or Colab Enterprise, you might see an error message that says \"Your session crashed for an unknown reason.\" This is expected. Wait until it's finished before continuing to the next step. ⚠️</b>\n",
"</div>\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "dmWOrTJ3gx13"
},
"source": [
"### Authenticate your notebook environment (Colab only)\n",
"\n",
"If you're running this notebook on Google Colab, run the cell below to authenticate your environment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "NyKGtVQjgx13"
},
"outputs": [],
"source": [
"import sys\n",
"\n",
"if \"google.colab\" in sys.modules:\n",
" from google.colab import auth\n",
"\n",
" auth.authenticate_user()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "DF4l8DTdWgPY"
},
"source": [
"### Set Google Cloud project information and initialize Vertex AI SDK\n",
"\n",
"To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).\n",
"\n",
"Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Nqwi-5ufWp_B"
},
"outputs": [],
"source": [
"# Use the environment variable if the user doesn't provide Project ID.\n",
"import os\n",
"\n",
"import vertexai\n",
"\n",
"PROJECT_ID = \"[your-project-id]\" # @param {type: \"string\", placeholder: \"[your-project-id]\", isTemplate: true}\n",
"\n",
"if not PROJECT_ID or PROJECT_ID == \"[your-project-id]\":\n",
" PROJECT_ID = str(os.environ.get(\"GOOGLE_CLOUD_PROJECT\"))\n",
"\n",
"LOCATION = os.environ.get(\"GOOGLE_CLOUD_REGION\", \"us-central1\")\n",
"\n",
"vertexai.init(project=PROJECT_ID, location=LOCATION)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5303c05f7aa6"
},
"source": [
"## Import libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "6fc324893334"
},
"outputs": [],
"source": [
"import abc\n",
"from collections.abc import Callable\n",
"import dataclasses\n",
"import random\n",
"import re\n",
"\n",
"import pandas as pd\n",
"from scipy.stats import kendalltau, spearmanr\n",
"from sklearn.metrics import cohen_kappa_score, confusion_matrix\n",
"from vertexai.generative_models import GenerativeModel"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "e43229f3ad4f"
},
"source": [
"## Load model as autorater"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "cf93d5f0ce00"
},
"outputs": [],
"source": [
"MODEL_ID = \"gemini-2.0-flash\" # @param {type:\"string\", isTemplate: true}\n",
"\n",
"autorater_model = GenerativeModel(MODEL_ID)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Fp6XQ3WZ-Uf0"
},
"source": [
"## Set Meta-evaluation components"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2mDDmEFocmO2"
},
"source": [
"### Define Autoraters\n",
"\n",
"The following code defines a system for automatically rating responses to prompts, particularly for comparing pairs of responses.\n",
"\n",
"The system has the following components:\n",
"\n",
"- **`base_model`** which is assumed to be a language model capable of generating text.\n",
"\n",
"- **`AutoRater`** class provides an abstract framework for response rating, implemented concretely by `BasicRater` and `SelfConsistencyRater`. **`BasicRater`** rates a single example using a chosen prompt method from a prompt map. **`SelfConsistencyRater`** enhances reliability by calling the base model multiple times with the same example and aggregating the results for a consensus.\n",
"\n",
"- **`The RaterInstruction`** dataclass pairs a prompt template with a corresponding parsing function to interpret the model's output. The parsing function uses regular expressions for extracting the winner or tie from the model's response. It handles potential errors gracefully and provides flexibility through different prompt methods and rating strategies.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "sk8gbE3acqjH"
},
"outputs": [],
"source": [
"@dataclasses.dataclass\n",
"class Example:\n",
" \"\"\"Represents a single example for rating.\n",
"\n",
" Attributes:\n",
" prompt: The prompt string.\n",
" response1: The first response string.\n",
" response2: The second response string.\n",
" \"\"\"\n",
"\n",
" prompt: str\n",
" response1: str\n",
" response2: str\n",
"\n",
"\n",
"@dataclasses.dataclass\n",
"class RaterInstruction:\n",
" \"\"\"Stores a rating prompt and its associated result parser.\n",
"\n",
" Attributes:\n",
" prompt: The prompt string for the rater.\n",
" result_parser: A function to parse the rater's output string into an integer score.\n",
" \"\"\"\n",
"\n",
" prompt: str\n",
" result_parser: Callable[[str], int | None]\n",
"\n",
"\n",
"def simple_no_tie_result_parser(result_str: str) -> int | None:\n",
" \"\"\"Parses a result string for simple prompts (no ties allowed).\n",
"\n",
" Args:\n",
" result_str: The rater's output string.\n",
"\n",
" Returns:\n",
" -1 if response 1 is better, 1 if response 2 is better, or None if the output is invalid.\n",
" \"\"\"\n",
" matches = re.findall(r\"<winner>(.*?)</winner>\", result_str)\n",
" if not matches or len(matches) > 1:\n",
" return None\n",
" if matches[0] == \"1\":\n",
" return -1\n",
" elif matches[0] == \"2\":\n",
" return 1\n",
" else:\n",
" return None\n",
"\n",
"\n",
"def simple_with_tie_result_parser(result_str: str) -> int | None:\n",
" \"\"\"Parses a result string for simple prompts (ties allowed).\n",
"\n",
" Args:\n",
" result_str: The rater's output string.\n",
"\n",
" Returns:\n",
" -1 if response 1 is better, 1 if response 2 is better, 0 if it's a tie, or None if the output is invalid.\n",
" \"\"\"\n",
" if \"<tie>\" in result_str:\n",
" return 0\n",
"\n",
" matches = re.findall(r\"<winner>(.*?)</winner>\", result_str)\n",
" if not matches or len(matches) > 1:\n",
" return None\n",
" if matches[0] == \"1\":\n",
" return -1\n",
" elif matches[0] == \"2\":\n",
" return 1\n",
" else:\n",
" return None\n",
"\n",
"\n",
"def reward_bench_gemini_no_tie_parser(result_str: str) -> int | None:\n",
" \"\"\"Parses a result string for reward-bench Gemini prompts (no ties).\n",
"\n",
" Args:\n",
" result_str: The rater's output string.\n",
"\n",
" Returns:\n",
" -1 if response 1 (A) is better, 1 if response 2 (B) is better, or None if the output is invalid.\n",
" \"\"\"\n",
" match = re.search(r\"\\[\\[(A|B)\\]\\]\", result_str)\n",
" if match:\n",
" if match.group(1) == \"A\":\n",
" return -1\n",
" elif match.group(1) == \"B\":\n",
" return 1\n",
" else:\n",
" return None\n",
" else:\n",
" return None\n",
"\n",
"\n",
"PROMPT_MAP = {\n",
" \"simple_no_tie\": RaterInstruction(\n",
" prompt=\"\"\"\n",
" ## You are an impartial judge to evaluate the quality of two responses to the user prompt.\n",
"\n",
" [Start of Prompt]{prompt} [end of prompt]\n",
" [Start of Response 1]{response1}[End of the response 1]\n",
" [Start of Response 2]{response2}[End of the response 2]\n",
"\n",
" Your output should strictly follow this format: <winner>1</winner>, if response 1 is better; <winner>2</winner>, if response 2 is better.\n",
" \"\"\",\n",
" result_parser=simple_no_tie_result_parser,\n",
" ),\n",
" \"simple_with_tie\": RaterInstruction(\n",
" prompt=\"\"\"\n",
" ## You are an impartial judge to evaluate the quality of two responses to the user prompt.\n",
"\n",
" [Start of Prompt]{prompt} [end of prompt]\n",
" [Start of Response 1]{response1}[End of the response 1]\n",
" [Start of Response 2]{response2}[End of the response 2]\n",
"\n",
" Your output should strictly follow this format: <winner>1</winner>, if response 1 is better; <winner>2</winner>, if response 2 is better; <tie> if you cannot determine a winner\n",
" \"\"\",\n",
" result_parser=simple_with_tie_result_parser,\n",
" ),\n",
" \"reward-bench_gemini_no_tie\": RaterInstruction(\n",
" prompt=(\n",
" \"Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. \"\n",
" \"You should choose the assistant that follows the user's instructions and answers the user's question better. \"\n",
" \"Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. \"\n",
" \"Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. \"\n",
" \"Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. \"\n",
" \"Be as objective as possible. \"\n",
" \"Your output should only consist of '[[A]]' if assistant A is better, or '[[B]]' if assistant B is better. Omit any other output.\\n\"\n",
" \"[User Question]\\n{prompt}\\n\\n[The Start of Assistant A's Answer]\\n{response1}\\n[The End of Assistant A's Answer]\\n\\n[The Start of Assistant B's Answer]\\n{response2}\\n[The End of Assistant B's Answer]\",\n",
" ),\n",
" result_parser=simple_no_tie_result_parser,\n",
" ),\n",
"}\n",
"\n",
"\n",
"class AutoRater(abc.ABC):\n",
" \"\"\"Abstract base class for automatic rating of responses.\"\"\"\n",
"\n",
" def __init__(self, name, base_model, prompt_method):\n",
" \"\"\"Initializes the AutoRater.\n",
"\n",
" Args:\n",
" name: The name of the rater.\n",
" base_model: The base language model to use for rating.\n",
" prompt_method: The name of the prompt method to use.\n",
" \"\"\"\n",
" self.name = name\n",
" self.base_model = base_model\n",
" self.prompt_method = prompt_method\n",
"\n",
" @abc.abstractmethod\n",
" def rate_one_example(self, example: Example) -> int:\n",
" \"\"\"Rates a single example.\n",
"\n",
" Args:\n",
" example: The example to rate.\n",
"\n",
" Returns:\n",
" The rating for the example.\n",
" \"\"\"\n",
" # for pair-wise: -1, 0, 1\n",
" # for point-wise: scale 1-5\n",
" pass # This method should also be implemented by subclasses\n",
"\n",
" def rate_batch(self, id_to_example: dict[str, Example]) -> dict[str, int]:\n",
" \"\"\"Rates a batch of examples.\n",
"\n",
" Args:\n",
" id_to_example: A dictionary mapping example IDs to examples.\n",
"\n",
" Returns:\n",
" A dictionary mapping example IDs to ratings.\n",
" \"\"\"\n",
" id_to_rating = {}\n",
" for id, ex in id_to_example.items():\n",
" rating = self.rate_one_example(ex)\n",
" if rating is not None:\n",
" id_to_rating[id] = rating\n",
" return id_to_rating\n",
"\n",
"\n",
"class BasicRater(AutoRater):\n",
" \"\"\"A basic rater that uses a single call to the language model.\"\"\"\n",
"\n",
" def __init__(self, base_model, prompt_method):\n",
" \"\"\"Initializes the BasicRater.\n",
"\n",
" Args:\n",
" base_model: The base language model.\n",
" prompt_method: The prompt method name.\n",
" \"\"\"\n",
" super().__init__(\"Basic\", base_model, prompt_method)\n",
"\n",
" def rate_one_example(self, example: Example) -> int | None:\n",
" \"\"\"Rates a single example using a single LLM call.\n",
"\n",
" Args:\n",
" example: The example to rate.\n",
"\n",
" Returns:\n",
" The rating for the example, or None if an error occurred.\n",
" \"\"\"\n",
" autorater_prompt = PROMPT_MAP[self.prompt_method].prompt.format(\n",
" prompt=example.prompt,\n",
" response1=example.response1,\n",
" response2=example.response2,\n",
" )\n",
" try:\n",
" response = self.base_model.generate_content(autorater_prompt)\n",
" except Exception as e:\n",
" print(e)\n",
" return None\n",
" if response.candidates and response.candidates[0].content.parts:\n",
" result_str = response.candidates[0].content.parts[0].text\n",
" print(result_str)\n",
" result = PROMPT_MAP[self.prompt_method].result_parser(result_str)\n",
" print(result)\n",
" return result\n",
"\n",
"\n",
"class SelfConsistencyRater(AutoRater):\n",
" \"\"\"A self-consistency rater that uses multiple calls to the language model and aggregates the results.\"\"\"\n",
"\n",
" def __init__(self, base_model, prompt_method, num_calls):\n",
" \"\"\"Initializes the SelfConsistencyRater.\n",
"\n",
" Args:\n",
" base_model: The base language model.\n",
" prompt_method: The prompt method name.\n",
" num_calls: The number of calls to make to the language model.\n",
" \"\"\"\n",
" super().__init__(\"SelfConsistency\", base_model, prompt_method)\n",
" self.num_calls = num_calls\n",
"\n",
" def _get_consensus_result(self, result_list: list[int]) -> int:\n",
" \"\"\"Aggregates the results from multiple calls to the language model.\n",
"\n",
" Args:\n",
" result_list: A list of integer results.\n",
"\n",
" Returns:\n",
" The aggregated result.\n",
" \"\"\"\n",
" avg = sum(result_list) / len(result_list)\n",
" if avg > 0.5:\n",
" return 1\n",
" if avg < -0.5:\n",
" return -1\n",
" return 0\n",
"\n",
" def rate_one_example(self, example: Example) -> int | None:\n",
" \"\"\"Rates a single example using multiple LLM calls and consensus aggregation.\n",
"\n",
" Args:\n",
" example: The example to rate.\n",
"\n",
"\n",
" Returns:\n",
" The aggregated rating for the example, or None if an error occurred or no valid results were obtained.\n",
" \"\"\"\n",
" result_list = []\n",
" for i in range(self.num_calls):\n",
" autorater_prompt = PROMPT_MAP[self.prompt_method].prompt.format(\n",
" prompt=example.prompt,\n",
" response1=example.response1,\n",
" response2=example.response2,\n",
" )\n",
" try:\n",
" response = self.base_model.generate_content(autorater_prompt)\n",
" except Exception as e:\n",
" print(e)\n",
" continue\n",
"\n",
" if response.candidates and response.candidates[0].content.parts:\n",
" result_str = response.candidates[0].content.parts[0].text\n",
" print(result_str)\n",
" result = PROMPT_MAP[self.prompt_method].result_parser(result_str)\n",
" print(result)\n",
" if result is not None:\n",
" result_list.append(result)\n",
" print(result_list)\n",
" if result_list:\n",
" result = self._get_consensus_result(result_list)\n",
" print(result)\n",
" return result\n",
" else:\n",
" return None"
]
},
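{
 "cell_type": "markdown",
 "metadata": {
  "id": "f3a1b2c4d5e6"
 },
 "source": [
  "Before rating a full dataset, you can sanity-check the pieces: the parsers can be exercised directly on example output strings, and `BasicRater` can rate a single handcrafted `Example`. The prompt and responses below are made up for illustration; the final printed rating depends on the model's output (-1 if it prefers response 1, 1 if it prefers response 2, `None` if the output cannot be parsed)."
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "metadata": {
  "id": "a9c8b7d6e5f4"
 },
 "outputs": [],
 "source": [
  "# Exercise the result parsers on example output strings (deterministic).\n",
  "print(simple_no_tie_result_parser(\"<winner>2</winner>\"))  # expected: 1\n",
  "print(reward_bench_gemini_no_tie_parser(\"[[A]]\"))  # expected: -1\n",
  "\n",
  "# Rate a single handcrafted example (made-up content for illustration).\n",
  "sanity_example = Example(\n",
  "    prompt=\"What is the capital of France?\",\n",
  "    response1=\"The capital of France is Paris.\",\n",
  "    response2=\"I am not certain, but it might be Lyon.\",\n",
  ")\n",
  "\n",
  "sanity_rater = BasicRater(autorater_model, \"simple_no_tie\")\n",
  "sanity_rating = sanity_rater.rate_one_example(sanity_example)\n",
  "print(sanity_rating)"
 ]
},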
{
"cell_type": "markdown",
"metadata": {
"id": "Zaej-nmd7Vyc"
},
"source": [
"### Define datasets import and process utils\n",
"\n",
"After you define the Autorater rating system, you prepare you evaluation data. The `process_reward_bench` function prepares data from the [AllenAI reward benchmark](https://github.com/allenai/reward-bench) for use in training or evaluating preference models.\n",
"\n",
"You read data from a specified parquet file ('raw' or 'filtered'), and creates a dictionary mapping question IDs to Example objects, each containing a prompt and two responses. A random \"golden rating\" (-1 or 1) is assigned to each example, indicating which response is preferred. This simulated preference data can then be used for tasks like training a reward model.\n",
"\n",
"Notice how the `process_reward_bench` function also handles limiting the number of processed examples based on the `total_count` parameter.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "3-JBSxUh8Qg_"
},
"outputs": [],
"source": [
"def process_reward_bench(split: str = \"filtered\", total_count: int = None):\n",
" \"\"\"Processes reward benchmark data.\n",
"\n",
" This function processes reward benchmark data based on the specified split\n",
" and total count. If 'total_count' is less than 0, all data will be used.\n",
"\n",
" Args:\n",
" split: The split of data to use ('filtered' by default).\n",
" total_count: The maximum number of data points to process. If not set,\n",
" all data will be used.\n",
"\n",
" Returns:\n",
" This function does not return any value (implicitly returns None).\n",
" \"\"\"\n",
" split_to_path = {\n",
" \"raw\": \"data/raw-00000-of-00001.parquet\",\n",
" \"filtered\": \"data/filtered-00000-of-00001.parquet\",\n",
" }\n",
" df = pd.read_parquet(\"hf://datasets/allenai/reward-bench/\" + split_to_path[split])\n",
" id_to_example = {}\n",
" id_to_golden_rating = {}\n",
"\n",
" for index, (id, row) in enumerate(df.iterrows()):\n",
" if total_count is not None and index > total_count - 1:\n",
" break\n",
" question_id = f'{row[\"subset\"]}-{row[\"id\"]}'\n",
" golden_rating = random.choice([-1, 1])\n",
" if golden_rating == -1:\n",
" response1 = row[\"chosen\"]\n",
" response2 = row[\"rejected\"]\n",
" else:\n",
" response2 = row[\"chosen\"]\n",
" response1 = row[\"rejected\"]\n",
" prompt = row[\"prompt\"]\n",
" example = Example(prompt=prompt, response1=response1, response2=response2)\n",
" id_to_example[question_id] = example\n",
" id_to_golden_rating[question_id] = golden_rating\n",
"\n",
" return id_to_example, id_to_golden_rating"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "uzS_HbuJ9srO"
},
"source": [
"### Meta Evaluation methods\n",
"\n",
"To implement the meta evaluation you define two functions:\n",
"\n",
"- `get_caliboration_result` computes several metrics to evaluate the performance of a model, including Spearman's rank correlation, Kendall's tau, Cohen's Kappa, and generates a confusion matrix. These metrics are commonly used in machine learning to assess the agreement between predicted and actual labels.\n",
"\n",
"- `get_aligned_lists` takes two dictionaries, each mapping IDs to ratings (e.g., automatically generated ratings and human-assigned golden ratings). It sorts the shared keys between the two dictionaries and creates two aligned lists of ratings corresponding to the sorted IDs, enabling paired comparisons between auto generated and true labels. This function is crucial for preparing data for comparison and evaluation, ensuring that the ratings being compared actually correspond to the same data points.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "xfdCDlK7UjDB"
},
"outputs": [],
"source": [
"def get_caliboration_result(\n",
" model_outputs: list[int],\n",
" golden_labels: list[int],\n",
" labels: list[int] = [-1, 0, 1],\n",
" weights: str = \"quadratic\",\n",
"):\n",
" \"\"\"Calculates various evaluation metrics for model outputs against golden labels.\n",
"\n",
" Args:\n",
" model_outputs: Predicted labels from the model.\n",
" golden_labels: True labels.\n",
" labels: Unique labels in the dataset. Default is [-1, 0, 1].\n",
" weights: Weighting scheme for Cohen's Kappa. Default is \"quadratic\".\n",
"\n",
" Returns:\n",
" A tuple containing the confusion matrix (DataFrame), Cohen's Kappa score, Spearman's rank correlation, and Kendall's rank correlation.\n",
" \"\"\"\n",
" spearman, _ = spearmanr(model_outputs, golden_labels)\n",
" kendall, _ = kendalltau(model_outputs, golden_labels)\n",
" kappa = cohen_kappa_score(\n",
" model_outputs, golden_labels, labels=labels, weights=weights\n",
" )\n",
"\n",
" conf_matrix = confusion_matrix(golden_labels, model_outputs, labels=labels)\n",
" conf_matrix_df = pd.DataFrame(\n",
" conf_matrix,\n",
" index=[\"Gold_1\", \"Gold_Tie\", \"Gold_2\"],\n",
" columns=[\"Model_1\", \"Model_Tie\", \"Model_2\"],\n",
" )\n",
"\n",
" return conf_matrix_df, kappa, spearman, kendall\n",
"\n",
"\n",
"def get_aligned_lists(\n",
" id_to_auto_ratings: dict, id_to_golden_rating: dict\n",
") -> tuple[list, list]:\n",
" \"\"\"Aligns two dictionaries of ratings based on shared keys (IDs).\n",
"\n",
" Args:\n",
" id_to_auto_ratings: Dictionary mapping IDs to automatically generated ratings.\n",
" id_to_golden_rating: Dictionary mapping IDs to golden ratings.\n",
"\n",
" Returns:\n",
" A tuple containing two aligned lists of ratings.\n",
" \"\"\"\n",
" id_list = sorted(id_to_auto_ratings.keys())\n",
" auto_ratings = [id_to_auto_ratings[id] for id in id_list]\n",
" golden_ratings = [id_to_golden_rating[id] for id in id_list]\n",
" return auto_ratings, golden_ratings"
]
},
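{
 "cell_type": "markdown",
 "metadata": {
  "id": "c2d3e4f5a6b7"
 },
 "source": [
  "As a quick check that the metrics behave as expected, you can run `get_calibration_result` on a small synthetic pair of rating lists. The values below are made up: the simulated autorater disagrees with the golden labels on two of eight items, so kappa and both correlations should come out clearly positive but below 1."
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "metadata": {
  "id": "d4e5f6a7b8c9"
 },
 "outputs": [],
 "source": [
  "# Synthetic ratings for a quick metric sanity check (made-up values).\n",
  "toy_golden = [-1, -1, 1, 1, -1, 1, 1, -1]\n",
  "toy_auto = [-1, -1, 1, 1, 1, 1, -1, -1]  # disagrees with the golden labels on two items\n",
  "\n",
  "toy_conf_matrix, toy_kappa, toy_spearman, toy_kendall = get_calibration_result(\n",
  "    model_outputs=toy_auto,\n",
  "    golden_labels=toy_golden,\n",
  "    labels=[-1, 0, 1],\n",
  "    weights=\"quadratic\",\n",
  ")\n",
  "print(toy_conf_matrix)\n",
  "print(f\"kappa: {toy_kappa:.3f}, spearman: {toy_spearman:.3f}, kendall: {toy_kendall:.3f}\")"
 ]
},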
{
"cell_type": "markdown",
"metadata": {
"id": "UYqBU1g1VH7E"
},
"source": [
"### Evaluate your autorater"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "i4CeB7wpWErY"
},
"source": [
"#### Load the evaluation dataset\n",
"\n",
"Load the dataset to evaluate the autorater."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "eJi4GRLoWB6h"
},
"outputs": [],
"source": [
"id_to_example, id_to_golden_rating = process_reward_bench(\n",
" split=\"filtered\", total_count=100\n",
")\n",
"len(id_to_example)"
]
},
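{
 "cell_type": "markdown",
 "metadata": {
  "id": "e5f6a7b8c9d0"
 },
 "source": [
  "Optionally, inspect one loaded example to confirm the prompt, the two responses, and the golden rating look as expected. The attribute names come from the `Example` dataclass defined earlier; the slicing just keeps the printout short."
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "metadata": {
  "id": "f6a7b8c9d0e1"
 },
 "outputs": [],
 "source": [
  "# Peek at one loaded example and its golden rating (truncated for readability).\n",
  "sample_id = next(iter(id_to_example))\n",
  "sample = id_to_example[sample_id]\n",
  "\n",
  "print(f\"ID: {sample_id}\")\n",
  "print(f\"Prompt: {sample.prompt[:200]}\")\n",
  "print(f\"Response 1: {sample.response1[:200]}\")\n",
  "print(f\"Response 2: {sample.response2[:200]}\")\n",
  "print(f\"Golden rating: {id_to_golden_rating[sample_id]}\")"
 ]
},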
{
"cell_type": "markdown",
"metadata": {
"id": "kXhjmwbqWJ8G"
},
"source": [
"#### Basic evaluation\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "vW4GJMt7c0ai"
},
"source": [
"##### Run the evaluation\n",
"\n",
"Evaluate your autorater using the basic rater on a single example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "VxDNIo1PWNax"
},
"outputs": [],
"source": [
"rater = BasicRater(autorater_model, \"simple_no_tie\")\n",
"id_to_basic_auto_ratings = rater.rate_batch(id_to_example)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "H6YFKVkSWoX5"
},
"source": [
"##### Calculate Meta Evaluation scores\n",
"\n",
"Compute alignement metrics to assess the agreement between predicted and actual labels."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "eRFeYO8PWtwL"
},
"outputs": [],
"source": [
"auto_ratings, golden_ratings = get_aligned_lists(\n",
" id_to_basic_auto_ratings, id_to_golden_rating\n",
")\n",
"conf_matrix_df, kappa_score, corr_spearman, corr_kendall = get_caliboration_result(\n",
" model_outputs=auto_ratings,\n",
" golden_labels=golden_ratings,\n",
" labels=[-1, 0, 1],\n",
" weights=\"quadratic\",\n",
")\n",
"print(f\"kappa_score: {kappa_score}\")\n",
"print(f\"corr_spearman: {corr_spearman}\")\n",
"print(f\"corr_kendall: {corr_kendall}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "9R8nEaA3XIdb"
},
"outputs": [],
"source": [
"print(conf_matrix_df)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "7bJCsXuOXO4I"
},
"source": [
"#### Self consistent evaluation"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "PmirMTxVdIuh"
},
"source": [
"##### Run the evaluation\n",
"\n",
"Evaluate your autorater by calling the base model multiple times with the same example and aggregating the results for a consensus."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "O3GsO6dTXPfO"
},
"outputs": [],
"source": [
"sc_rater = SelfConsistencyRater(autorater_model, \"simple_no_tie\", 3)\n",
"id_to_sc_auto_ratings = sc_rater.rate_batch(id_to_example)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "di0BSuncdfWj"
},
"source": [
"##### Calculate Meta Evaluation scores\n",
"\n",
"Compute alignement metrics to assess the autorater."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "tMRKb13hXUPr"
},
"outputs": [],
"source": [
"auto_ratings, golden_ratings = get_aligned_lists(\n",
" id_to_sc_auto_ratings, id_to_golden_rating\n",
")\n",
"print(len(auto_ratings))\n",
"print(len(golden_ratings))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "WFXJLM23XV3U"
},
"outputs": [],
"source": [
"conf_matrix_df, kappa_score, corr_spearman, corr_kendall = get_caliboration_result(\n",
" model_outputs=auto_ratings,\n",
" golden_labels=golden_ratings,\n",
" labels=[-1, 0, 1],\n",
" weights=\"quadratic\",\n",
")\n",
"print(f\"kappa_score: {kappa_score}\")\n",
"print(f\"corr_spearman: {corr_spearman}\")\n",
"print(f\"corr_kendall: {corr_kendall}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "rR9GyCHWXXdr"
},
"outputs": [],
"source": [
"print(conf_matrix_df)"
]
}
],
"metadata": {
"colab": {
"name": "evaluate_autorater.ipynb",
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}