
{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Evaluate Base Model Endpoints using Azure AI Evaluation APIs\n", "\n", "## Objective\n", "\n", "This tutorial provides a step-by-step guide on how to evaluate prompts against variety of model endpoints deployed on Azure AI Platform or non Azure AI platforms. \n", "\n", "This guide uses Python Class as an application target which is passed to Evaluate API provided by PromptFlow SDK to evaluate results generated by LLM models against provided prompts. \n", "\n", "This tutorial uses the following Azure AI services:\n", "\n", "- [azure-ai-evaluation](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk)\n", "\n", "## Time\n", "\n", "You should expect to spend 30 minutes running this sample. \n", "\n", "## About this example\n", "\n", "This example demonstrates evaluating model endpoints responses against provided prompts using azure-ai-evaluation\n", "\n", "## Before you begin\n", "\n", "### Installation\n", "\n", "Install the following packages required to execute this notebook. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install azure-ai-evaluation\n", "%pip install promptflow-azure\n", "%pip install promptflow-tracing\n", "%pip install marshmallow==3.23.3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parameters and imports" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import random\n", "import os\n", "from pprint import pprint\n", "# Use the following code to set the environment variables if not already set. If set, you can skip this step.\n", "\n", "os.environ[\"AZURE_OPENAI_API_VERSION\"] = \"2024-05-01-preview\"#This is the api version for the model for example 2024-05-01-preview\n", "os.environ[\"AZURE_OPENAI_DEPLOYMENT\"] = \"gpt-4o\"#This is the deployment name for the model for example got-4o\n", "os.environ[\"AZURE_OPENAI_ENDPOINT\"] = \"https://xxxxx.openai.azure.com/\"#This is end point for deployment for judging the model\n", "os.environ[\"AZURE_SUBSCRIPTION_ID\"] = \"xxxxxx\"#This is end point for deployment for judging the model\n", "os.environ[\"AZURE_AI_FOUNDRY_RESOURCE_GROUP\"] = \"xxxxxx\"#This is the resource group for AI Foundry\n", "os.environ[\"AZURE_AI_FOUNDRY_PROJECT_NAME\"] = \"xxxxxx\"#This is the project name for AI Foundry\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "print(os.environ['AZURE_OPENAI_ENDPOINT'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Target Application\n", "\n", "We will use Evaluate API provided by Prompt Flow SDK. It requires a target Application or python Function, which handles a call to LLMs and retrieve responses. \n", "\n", "In the notebook, we will use an Application Target `ModelEndpoints` to get answers from multiple model endpoints against provided question aka prompts. \n", "\n", "This application target requires list of model endpoints and their authentication keys. For simplicity, we have provided them in the `env_var` variable which is passed into init() function of `ModelEndpoints`." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "env_var = {\n", " \"gpt4\": {\n", " \"endpoint\": \"https://xxxxx.openai.azure.com/openai/deployments/xxxxx/chat/completions?api-version=xxxxx\",\n", " \"key\": \"xxxx\",\n", " },\n", " \"gpt35-turbo\": {\n", " \"endpoint\": \"https://xxxxx.openai.azure.com/openai/deployments/xxxxx/chat/completions?api-version=xxxxx\",\n", " \"key\": \"xxxx\",\n", " }\n", "}\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Please provide Azure AI Project details so that traces and eval results are pushing in the project in Azure AI Studio." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "azure_ai_project = {\n", " \"subscription_id\": os.environ[\"AZURE_SUBSCRIPTION_ID\"],\n", " \"resource_group_name\": os.environ[\"AZURE_AI_FOUNDRY_RESOURCE_GROUP\"],\n", " \"project_name\": os.environ[\"AZURE_AI_FOUNDRY_PROJECT_NAME\"],\n", "}" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model Endpoints\n", "The following code demonstrates how to call various model endpoints, and is configured based on `env_var` set above. For any model in `env_var`, if you do not have that model deployed in your AI project, please comment it out. If you have a model that you would like to test that does not correspond with one of the types seen below, please include that type in the `__call__` function and create a helper function to call the model's endpoint via REST. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pygmentize model_endpoints.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data\n", "\n", "Following code reads Json file \"data.jsonl\" which contains inputs to the Application Target function. It provides question, context and ground truth on each line. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_json(\"data.jsonl\", lines=True)\n", "print(df.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configuration\n", "To use Relevance and Cohenrence Evaluator, we will Azure Open AI model details as a Judge that can be passed as model config." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "model_config = {\n", " \"azure_endpoint\": os.environ.get(\"AZURE_OPENAI_ENDPOINT\"),\n", " \"azure_deployment\": os.environ.get(\"AZURE_OPENAI_DEPLOYMENT\"),\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run the evaluation\n", "\n", "The Following code runs Evaluate API and uses Content Safety, Relevance and Coherence Evaluator to evaluate results from different models.\n", "\n", "The following are the few parameters required by Evaluate API. \n", "\n", "+ Data file (Prompts): It represents data file 'data.jsonl' in JSON format. Each line contains question, context and ground truth for evaluators. \n", "\n", "+ Application Target: It is name of python class which can route the calls to specific model endpoints using model name in conditional logic. \n", "\n", "+ Model Name: It is an identifier of model so that custom code in the App Target class can identify the model type and call respective LLM model using endpoint URL and auth key. 
\n", "\n", "+ Evaluators: List of evaluators is provided, to evaluate given prompts (questions) as input and output (answers) from LLM models. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pathlib\n", "\n", "from azure.ai.evaluation import evaluate\n", "from azure.ai.evaluation import (\n", " RelevanceEvaluator,\n", ")\n", "from model_endpoints import ModelEndpoints\n", "\n", "relevance_evaluator = RelevanceEvaluator(model_config)\n", "\n", "models = [\n", " \"gpt4\",\n", " \"gpt35-turbo\"\n", "]\n", "results = {}\n", "path = str(pathlib.Path(pathlib.Path.cwd())) + \"/data.jsonl\"\n", "\n", "for model in models:\n", " randomNum = random.randint(1111, 9999)\n", " results[model] = evaluate(\n", " evaluation_name=\"Eval-Run-\" + str(randomNum) + \"-\" + model.title(),\n", " data=path,\n", " target=ModelEndpoints(env_var, model),\n", " evaluators={\n", " \"relevance\": relevance_evaluator,\n", " },\n", " evaluator_config={\n", " \"relevance\": {\n", " \"column_mapping\": {\n", " \"response\": \"${target.response}\",\n", " \"context\": \"${data.context}\",\n", " \"query\": \"${data.query}\",\n", " },\n", " },\n", " },\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "View the results" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "results" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "pd = pd.DataFrame(results)\n", "gpt4scores = [row['outputs.relevance.relevance'] for row in pd.gpt4.rows]\n", "np.mean(gpt4scores)\n", "gpt35_turbo_scores = [row['outputs.relevance.relevance'] for row in pd['gpt35-turbo'].rows]\n", "np.mean(gpt35_turbo_scores)\n", "\n", "print(\"GPT-4 Mean Relevance Score: \", np.mean(gpt4scores))\n", "print(\"GPT-35-Turbo Mean Relevance Score: \", np.mean(gpt35_turbo_scores))" ] } ], "metadata": { "kernelspec": { "display_name": "base", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 2 }