{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Simulate and Evaluate using AI as Judge Quality Evaluators with Azure AI Evaluation SDK\n",
"\n",
"## Objective\n",
"\n",
"This tutorial provides a step-by-step guide on how to evaluate prompts against a variety of model endpoints deployed on the Azure AI platform or on non-Azure platforms. \n",
"\n",
"This guide uses a Python class as an application target, which is passed to the Evaluate API provided by the Azure AI Evaluation SDK to evaluate results generated by LLM models against the provided prompts. \n",
"\n",
"This tutorial uses the following Azure AI services:\n",
"\n",
"- [azure-ai-evaluation](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk)\n",
"\n",
"## Time\n",
"\n",
"You should expect to spend 30 minutes running this sample. \n",
"\n",
"## About this example\n",
"\n",
"This example demonstrates evaluating model endpoint responses against provided prompts using the azure-ai-evaluation package.\n",
"\n",
"## Before you begin\n",
"\n",
"### Installation\n",
"\n",
"Install the following packages required to execute this notebook. "
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"%pip install azure-ai-evaluation\n",
"%pip install promptflow-azure\n",
"%pip install azure-identity\n",
"%pip install --upgrade openai\n",
"#%pip install marshmallow==3.23.3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Parameters and imports"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import os\n",
"import json\n",
"import importlib.resources as pkg_resources\n",
"import requests\n",
"from typing import Any, Dict, List, Optional\n",
"from pathlib import Path\n",
"from pprint import pprint\n",
"from azure.ai.evaluation import evaluate\n",
"from azure.ai.evaluation import GroundednessEvaluator\n",
"from azure.ai.evaluation.simulator import Simulator\n",
"from openai import AzureOpenAI\n",
"from azure.identity import DefaultAzureCredential, get_bearer_token_provider\n"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"os.environ[\"AZURE_SUBSCRIPTION_ID\"] = \"xxxxx\"  # This is the subscription ID for the Azure service\n",
"os.environ[\"AZURE_OPENAI_ENDPOINT\"] = \"xxxxx\"  # This is the endpoint for the Azure OpenAI service\n",
"os.environ[\"AZURE_OPENAI_API_VERSION\"] = \"2024-05-01-preview\"  # This is the API version for the Azure OpenAI service\n",
"os.environ[\"AZURE_OPENAI_DEPLOYMENT\"] = \"gpt-4o\"  # This is the deployment for the Azure OpenAI service\n",
"os.environ[\"AZURE_AI_FOUNDRY_RESOURCE_GROUP\"] = \"xxxxxx\"  # This is the resource group for AI Foundry\n",
"os.environ[\"AZURE_AI_FOUNDRY_PROJECT_NAME\"] = \"xxxxxx\"  # This is the project name for AI Foundry\n",
"\n",
"project_scope = {\n",
"    \"subscription_id\": os.environ.get(\"AZURE_SUBSCRIPTION_ID\"),\n",
"    \"resource_group_name\": os.environ.get(\"AZURE_AI_FOUNDRY_RESOURCE_GROUP\"),\n",
"    \"project_name\": os.environ.get(\"AZURE_AI_FOUNDRY_PROJECT_NAME\"),\n",
"}\n",
"model_config = {\n",
"    \"azure_endpoint\": os.environ.get(\"AZURE_OPENAI_ENDPOINT\"),\n",
"    \"azure_deployment\": os.environ.get(\"AZURE_OPENAI_DEPLOYMENT\"),\n",
"}\n",
"\n",
"search_endpoint = \"xxxxx\"  # This is the endpoint for the Azure search service\n",
"index_name = \"xxxxx\"  # This is the index name for the Azure search service\n",
"api_key = \"xxxxx\"  # This is the API key for the Azure search service\n"
]
},
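{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `model_config` above omits an API key, so the simulator and evaluators are expected to authenticate with Microsoft Entra ID (for example via `DefaultAzureCredential`). As a minimal sketch of the same keyless pattern, assuming your identity has access to the Azure OpenAI resource, a client can be built directly like this:\n",
"\n",
"```python\n",
"import os\n",
"\n",
"from openai import AzureOpenAI\n",
"from azure.identity import DefaultAzureCredential, get_bearer_token_provider\n",
"\n",
"# Exchange the default Azure credential for bearer tokens scoped to Cognitive Services\n",
"token_provider = get_bearer_token_provider(\n",
"    DefaultAzureCredential(), \"https://cognitiveservices.azure.com/.default\"\n",
")\n",
"\n",
"client = AzureOpenAI(\n",
"    azure_endpoint=os.environ[\"AZURE_OPENAI_ENDPOINT\"],\n",
"    api_version=os.environ[\"AZURE_OPENAI_API_VERSION\"],\n",
"    azure_ad_token_provider=token_provider,\n",
")\n",
"```"
]
},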
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create the simulator"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.evaluation.simulator import Simulator\n",
"\n",
"simulator = Simulator(model_config=model_config)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Connect the application endpoint to the simulator"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"from application_endpoint import ApplicationEndpoint\n",
"\n",
"\n",
"async def callback(\n",
"    messages: List[Dict],\n",
"    stream: bool = False,\n",
"    session_state: Any = None,  # noqa: ANN401\n",
"    context: Optional[Dict[str, Any]] = None,\n",
") -> dict:\n",
"    messages_list = messages[\"messages\"]\n",
"    # get last message\n",
"    latest_message = messages_list[-1]\n",
"    query = latest_message[\"content\"]\n",
"    context = latest_message.get(\"context\", None)\n",
"    # call model end point\n",
"    model_endpoint = ApplicationEndpoint(model_config)\n",
"    response = model_endpoint(query, None)\n",
"    print(response)\n",
"    # we are formatting the response to follow the openAI chat protocol format\n",
"    formatted_response = {\n",
"        \"content\": response[\"response\"],\n",
"        \"role\": \"assistant\",\n",
"        \"context\": context,\n",
"    }\n",
"    messages[\"messages\"].append(formatted_response)\n",
"    return {\"messages\": messages[\"messages\"], \"stream\": stream, \"session_state\": session_state, \"context\": context}"
]
},
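{
"cell_type": "markdown",
"metadata": {},
"source": [
"The callback above relies on `ApplicationEndpoint` from the local `application_endpoint` module, which is not shown in this notebook. A minimal sketch of what such a class might look like (a hypothetical illustration, not the actual implementation) is below: it accepts the model configuration, calls the chat deployment, and returns a dictionary with a `response` key, which is the shape the callback expects:\n",
"\n",
"```python\n",
"import os\n",
"\n",
"from openai import AzureOpenAI\n",
"from azure.identity import DefaultAzureCredential, get_bearer_token_provider\n",
"\n",
"\n",
"class ApplicationEndpoint:\n",
"    \"\"\"Hypothetical application target: wraps a single Azure OpenAI chat deployment.\"\"\"\n",
"\n",
"    def __init__(self, model_config: dict):\n",
"        token_provider = get_bearer_token_provider(\n",
"            DefaultAzureCredential(), \"https://cognitiveservices.azure.com/.default\"\n",
"        )\n",
"        self.client = AzureOpenAI(\n",
"            azure_endpoint=model_config[\"azure_endpoint\"],\n",
"            api_version=os.environ[\"AZURE_OPENAI_API_VERSION\"],\n",
"            azure_ad_token_provider=token_provider,\n",
"        )\n",
"        self.deployment = model_config[\"azure_deployment\"]\n",
"\n",
"    def __call__(self, query: str, context: str | None = None) -> dict:\n",
"        # Build a simple chat request; include retrieved context when available\n",
"        messages = [{\"role\": \"user\", \"content\": query if not context else f\"{context}\\n\\n{query}\"}]\n",
"        completion = self.client.chat.completions.create(model=self.deployment, messages=messages)\n",
"        return {\"response\": completion.choices[0].message.content}\n",
"```"
]
},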
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get the context from Azure AI Search"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"\n",
"\n",
"def generate_text_from_index(search_term: str) -> str:\n",
"    # Query the Azure AI Search index and concatenate the content of the top matching documents\n",
"    url = f\"{search_endpoint}/indexes/{index_name}/docs/search?api-version=2024-07-01\"\n",
"    headers = {\"Content-Type\": \"application/json\", \"Api-key\": api_key}\n",
"    search_query = {\"search\": search_term, \"top\": 10}\n",
"    print(f\"Search query is {search_query}\")\n",
"    response = requests.post(url=url, headers=headers, data=json.dumps(search_query))\n",
"\n",
"    text = \"\"\n",
"    print(f\"Status code is {response.status_code}\")\n",
"    if response.status_code == 200:\n",
"        results = response.json()\n",
"        print(results)\n",
"        for result in results[\"value\"]:\n",
"            text += result[\"content\"]\n",
"\n",
"    # Trim the concatenated text so the simulator prompt stays a manageable size\n",
"    return text[:5000]\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Specify a search term to retrieve documents from the index"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"search_term = \"Responsible AI\"\n",
"text = generate_text_from_index(search_term)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Call the simulator"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"outputs = await simulator(\n",
"    target=callback,\n",
"    text=text,\n",
"    num_queries=2,\n",
"    max_conversation_turns=2,\n",
"    tasks=[\n",
"        \"I want to learn more about Responsible AI\",\n",
"        \"I want to know how to implement Responsible AI in my organization\",\n",
"    ],\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Save the simulated data"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": [
"simulated_output_file = Path(\"simulated_output.json\")\n",
"# Overwrite the file on each run so repeated runs do not append duplicate, invalid JSON\n",
"with simulated_output_file.open(\"w\") as f:\n",
"    json.dump(outputs, f)"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": [
"eval_data_file = Path(\"simulated_eval_data.jsonl\")\n",
"with eval_data_file.open(\"w\") as file:\n",
"    for output in outputs:\n",
"        file.write(output.to_eval_qr_json_lines())"
]
},
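{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each call to `to_eval_qr_json_lines()` writes the simulated conversation in the query/response format expected by the evaluators. Assuming the default format, a line in `simulated_eval_data.jsonl` looks roughly like this (illustrative values, not actual output):\n",
"\n",
"```json\n",
"{\"query\": \"I want to learn more about Responsible AI\", \"response\": \"Responsible AI is ...\", \"context\": \"...\"}\n",
"```"
]
},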
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Validate the results\n",
"\n",
"We will use the Evaluate API provided by the Azure AI Evaluation SDK. It can evaluate either a live application target (a Python class or function that calls the LLM and returns responses) or a data file of previously generated responses.\n",
"\n",
"In this notebook, the responses were already generated by the simulator through the `ApplicationEndpoint` application target, so we run the Evaluate API directly against the saved `simulated_eval_data.jsonl` file."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Provide your Azure AI project details so that traces and evaluation results are pushed to the project in Azure AI Studio."
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [],
"source": [
"azure_ai_project = {\n",
" \"subscription_id\": os.environ[\"AZURE_SUBSCRIPTION_ID\"],\n",
" \"resource_group_name\": os.environ[\"AZURE_AI_FOUNDRY_RESOURCE_GROUP\"],\n",
" \"project_name\": os.environ[\"AZURE_AI_FOUNDRY_PROJECT_NAME\"],\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data\n",
"\n",
"The following code reads the JSON Lines file \"simulated_eval_data.jsonl\" generated by the simulator. Each line provides the query, the model response, and the context used by the evaluators. "
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_json(\"simulated_eval_data.jsonl\", lines=True)\n",
"print(df.head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Configuration\n",
"To use AI-assisted quality evaluators such as Relevance and Coherence, we provide Azure OpenAI model details as the judge, which are passed in as the model configuration."
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"model_config = {\n",
" \"azure_endpoint\": os.environ.get(\"AZURE_OPENAI_ENDPOINT\"),\n",
" \"azure_deployment\": os.environ.get(\"AZURE_OPENAI_DEPLOYMENT\"),\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Run the evaluation\n",
"\n",
"The following code runs the Evaluate API and uses the Content Safety, Coherence, Relevance, Groundedness, and Fluency evaluators to score the simulated conversations.\n",
"\n",
"These are the main parameters required by the Evaluate API: \n",
"\n",
"+ Data file: The JSON Lines file 'simulated_eval_data.jsonl' produced above. Each line contains the query, response, and context used by the evaluators. \n",
"\n",
"+ Evaluators: A dictionary of evaluators used to score each row. The AI-assisted quality evaluators use the Azure OpenAI judge model from `model_config`, while the Content Safety evaluator uses the Azure AI project and credential. \n",
"\n",
"+ Evaluator config: Column mappings that tell each evaluator which fields of the data file to read as query, response, and context. \n",
"\n",
"+ Azure AI project: The project details used to log the evaluation run and its results in Azure AI Studio. "
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [],
"source": [
"import pathlib\n",
"\n",
"from azure.ai.evaluation import evaluate\n",
"from azure.ai.evaluation import (\n",
" ContentSafetyEvaluator,\n",
" RelevanceEvaluator,\n",
" CoherenceEvaluator,\n",
" GroundednessEvaluator,\n",
" FluencyEvaluator,\n",
" SimilarityEvaluator,\n",
")\n",
"from application_endpoint import ApplicationEndpoint\n",
"\n",
"\n",
"content_safety_evaluator = ContentSafetyEvaluator(\n",
" azure_ai_project=azure_ai_project, credential=DefaultAzureCredential()\n",
")\n",
"relevance_evaluator = RelevanceEvaluator(model_config)\n",
"coherence_evaluator = CoherenceEvaluator(model_config)\n",
"groundedness_evaluator = GroundednessEvaluator(model_config)\n",
"fluency_evaluator = FluencyEvaluator(model_config)\n",
"similarity_evaluator = SimilarityEvaluator(model_config)\n",
"\n",
"path = str(pathlib.Path(pathlib.Path.cwd())) + \"/simulated_eval_data.jsonl\"\n",
"\n",
"results = evaluate(\n",
" evaluation_name=\"Eval-Run\" + \"-\" + \"GPT35-08012025\",\n",
" data=path,\n",
" evaluators={\n",
" \"content_safety\": content_safety_evaluator,\n",
" \"coherence\": coherence_evaluator,\n",
" \"relevance\": relevance_evaluator,\n",
" \"groundedness\": groundedness_evaluator,\n",
" \"fluency\": fluency_evaluator,\n",
" # \"similarity\": similarity_evaluator,\n",
" },\n",
" azure_ai_project=azure_ai_project,\n",
" evaluator_config={\n",
" \"content_safety\": {\"column_mapping\": {\"query\": \"${data.query}\", \"response\": \"${data.response}\"}},\n",
" \"coherence\": {\"column_mapping\": {\"response\": \"${data.response}\", \"query\": \"${data.query}\"}},\n",
" \"relevance\": {\n",
" \"column_mapping\": {\"response\": \"${data.response}\", \"context\": \"${data.context}\", \"query\": \"${data.query}\"}\n",
" },\n",
" \"groundedness\": {\n",
" \"column_mapping\": {\n",
" \"response\": \"${data.response}\",\n",
" \"context\": \"${data.context}\",\n",
" \"query\": \"${data.query}\",\n",
" }\n",
" },\n",
" \"fluency\": {\n",
" \"column_mapping\": {\"response\": \"${data.response}\", \"context\": \"${data.context}\", \"query\": \"${data.query}\"}\n",
" },\n",
" # \"similarity\": {\n",
" # \"column_mapping\": {\"response\": \"${data.response}\", \"context\": \"${data.context}\", \"query\": \"${data.query}\"}\n",
" # },\n",
" },\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"View the results"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [],
"source": [
"pprint(results)"
]
},
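{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dictionary returned by `evaluate` contains the per-row scores under `rows` and aggregate scores under `metrics`; when an Azure AI project is supplied, it typically also includes a `studio_url` linking to the run in Azure AI Studio. A quick way to inspect just the aggregates:\n",
"\n",
"```python\n",
"# Aggregate metrics across all evaluated rows (e.g. mean coherence, relevance, groundedness)\n",
"pprint(results[\"metrics\"])\n",
"\n",
"# Link to the evaluation run in Azure AI Studio, if available\n",
"print(results.get(\"studio_url\"))\n",
"```"
]
},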
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>inputs.query</th>\n",
" <th>inputs.response</th>\n",
" <th>inputs.context</th>\n",
" <th>outputs.coherence.coherence</th>\n",
" <th>outputs.coherence.gpt_coherence</th>\n",
" <th>outputs.coherence.coherence_reason</th>\n",
" <th>outputs.relevance.relevance</th>\n",
" <th>outputs.relevance.gpt_relevance</th>\n",
" <th>outputs.relevance.relevance_reason</th>\n",
" <th>outputs.groundedness.groundedness</th>\n",
" <th>outputs.groundedness.gpt_groundedness</th>\n",
" <th>outputs.groundedness.groundedness_reason</th>\n",
" <th>outputs.fluency.fluency</th>\n",
" <th>outputs.fluency.gpt_fluency</th>\n",
" <th>outputs.fluency.fluency_reason</th>\n",
" <th>line_number</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>I'm curious about the guidelines for Responsib...</td>\n",
" <td>The title of the document that outlines Micros...</td>\n",
" <td>None</td>\n",
" <td>4</td>\n",
" <td>4</td>\n",
" <td>The RESPONSE is coherent and effectively addre...</td>\n",
" <td>4</td>\n",
" <td>4</td>\n",
" <td>The RESPONSE accurately and completely provide...</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>The RESPONSE introduces information about Micr...</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>The response is clear and grammatically correc...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>What are some key principles outlined in Micro...</td>\n",
" <td>Microsoft's Responsible AI Standard outlines s...</td>\n",
" <td>None</td>\n",
" <td>4</td>\n",
" <td>4</td>\n",
" <td>The RESPONSE is coherent and effectively addre...</td>\n",
" <td>5</td>\n",
" <td>5</td>\n",
" <td>The RESPONSE fully addresses the QUERY by list...</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>The RESPONSE is completely unrelated to any CO...</td>\n",
" <td>4</td>\n",
" <td>4</td>\n",
" <td>The RESPONSE is well-articulated, with good gr...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>I'm curious about the specific goals outlined ...</td>\n",
" <td>The General Requirements of the Responsible AI...</td>\n",
" <td>None</td>\n",
" <td>4</td>\n",
" <td>4</td>\n",
" <td>The RESPONSE is coherent as it directly answer...</td>\n",
" <td>5</td>\n",
" <td>5</td>\n",
" <td>The RESPONSE fully addresses the QUERY by prov...</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>The RESPONSE is entirely unrelated to any CONT...</td>\n",
" <td>4</td>\n",
" <td>4</td>\n",
" <td>The RESPONSE demonstrates proficient fluency w...</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>That's a comprehensive list of goals! Can you ...</td>\n",
" <td>When implementing Responsible AI in your organ...</td>\n",
" <td>None</td>\n",
" <td>4</td>\n",
" <td>4</td>\n",
" <td>The RESPONSE is coherent and effectively addre...</td>\n",
" <td>4</td>\n",
" <td>4</td>\n",
" <td>The RESPONSE fully addresses the QUERY by prov...</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>The RESPONSE introduces a topic about Responsi...</td>\n",
" <td>4</td>\n",
" <td>4</td>\n",
" <td>The RESPONSE is well-articulated, with good co...</td>\n",
" <td>3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" inputs.query \\\n",
"0 I'm curious about the guidelines for Responsib... \n",
"1 What are some key principles outlined in Micro... \n",
"2 I'm curious about the specific goals outlined ... \n",
"3 That's a comprehensive list of goals! Can you ... \n",
"\n",
" inputs.response inputs.context \\\n",
"0 The title of the document that outlines Micros... None \n",
"1 Microsoft's Responsible AI Standard outlines s... None \n",
"2 The General Requirements of the Responsible AI... None \n",
"3 When implementing Responsible AI in your organ... None \n",
"\n",
" outputs.coherence.coherence outputs.coherence.gpt_coherence \\\n",
"0 4 4 \n",
"1 4 4 \n",
"2 4 4 \n",
"3 4 4 \n",
"\n",
" outputs.coherence.coherence_reason \\\n",
"0 The RESPONSE is coherent and effectively addre... \n",
"1 The RESPONSE is coherent and effectively addre... \n",
"2 The RESPONSE is coherent as it directly answer... \n",
"3 The RESPONSE is coherent and effectively addre... \n",
"\n",
" outputs.relevance.relevance outputs.relevance.gpt_relevance \\\n",
"0 4 4 \n",
"1 5 5 \n",
"2 5 5 \n",
"3 4 4 \n",
"\n",
" outputs.relevance.relevance_reason \\\n",
"0 The RESPONSE accurately and completely provide... \n",
"1 The RESPONSE fully addresses the QUERY by list... \n",
"2 The RESPONSE fully addresses the QUERY by prov... \n",
"3 The RESPONSE fully addresses the QUERY by prov... \n",
"\n",
" outputs.groundedness.groundedness outputs.groundedness.gpt_groundedness \\\n",
"0 1 1 \n",
"1 1 1 \n",
"2 1 1 \n",
"3 1 1 \n",
"\n",
" outputs.groundedness.groundedness_reason outputs.fluency.fluency \\\n",
"0 The RESPONSE introduces information about Micr... 3 \n",
"1 The RESPONSE is completely unrelated to any CO... 4 \n",
"2 The RESPONSE is entirely unrelated to any CONT... 4 \n",
"3 The RESPONSE introduces a topic about Responsi... 4 \n",
"\n",
" outputs.fluency.gpt_fluency \\\n",
"0 3 \n",
"1 4 \n",
"2 4 \n",
"3 4 \n",
"\n",
" outputs.fluency.fluency_reason line_number \n",
"0 The response is clear and grammatically correc... 0 \n",
"1 The RESPONSE is well-articulated, with good gr... 1 \n",
"2 The RESPONSE demonstrates proficient fluency w... 2 \n",
"3 The RESPONSE is well-articulated, with good co... 3 "
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame(results[\"rows\"])"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}