{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Evaluating Model Groundedness with Azure AI Evaluation SDK\n",
"\n",
"This notebook aims to simulate and evaluate the groundedness of a model endpoint using the Azure AI Evaluation SDK. Groundedness refers to the extent to which the responses generated by a model are based on reliable and verifiable information. Ensuring that a model's outputs are grounded is crucial for maintaining the accuracy and trustworthiness of AI systems.\n",
"\n",
"In this notebook, we will:\n",
"\n",
"1. Set up the Azure AI Evaluation SDK.\n",
"2. Define the dataset for evaluating groundedness, which will vary based on the specific use case of your model.\n",
"3. Simulate the model endpoint and generate responses.\n",
"4. Evaluate the groundedness of the model's responses using the Azure AI Evaluation SDK.\n",
"\n",
"The dataset used for evaluating groundedness will be tailored to the particular application of your model. For instance, if your model is designed for customer support, the dataset might consist of common customer queries and the corresponding accurate responses. If your model is used for medical diagnosis, the dataset would include medical cases and verified diagnostic information.\n",
"\n",
"By the end of this notebook, you will have a clear understanding of how to assess the groundedness of your model's outputs and ensure that they are based on solid and reliable information.\n",
"\n",
"This tutorial uses the following Azure AI services:\n",
"\n",
"- [azure-ai-evaluation](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk)\n",
"\n",
"## Time\n",
"\n",
"You should expect to spend 30 minutes running this sample. \n",
"\n",
"## About this example\n",
"\n",
"This example demonstrates evaluating model endpoints responses against provided prompts using azure-ai-evaluation\n",
"\n",
"## Before you begin\n",
"\n",
"### Installation\n",
"\n",
"Install the following packages required to execute this notebook. \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install azure-ai-evaluation --upgrade\n",
"%pip install promptflow-azure\n",
"%pip install azure-identity"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Parameters and imports"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we define the data, `grounding.json` on which we will simulate query and response pairs to help us evaluate the groundedness of our model's responses. Based on the use case of your model, the data you use to evaluate groundedness might differ. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"gather": {
"logged": 1733168529241
}
},
"outputs": [],
"source": [
"import os\n",
"from typing import Any, Dict, List, Optional\n",
"import json\n",
"from pathlib import Path\n",
"\n",
"from azure.ai.evaluation import evaluate\n",
"from azure.ai.evaluation import GroundednessEvaluator,GroundednessProEvaluator\n",
"from azure.ai.evaluation.simulator import Simulator\n",
"from openai import AzureOpenAI\n",
"import importlib.resources as pkg_resources\n",
"from azure.identity import DefaultAzureCredential, get_bearer_token_provider"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1733168531372
},
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [],
"source": [
"import os\n",
"from dotenv import load_dotenv\n",
"load_dotenv(\"../.credentials.env\")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"gather": {
"logged": 1733168533295
}
},
"outputs": [],
"source": [
"azure_ai_project = {\n",
" \"subscription_id\": os.environ.get(\"AZURE_SUBSCRIPTION_ID\"),\n",
" \"resource_group_name\": os.environ.get(\"AZURE_RESOURCE_GROUP\"),\n",
" \"project_name\": os.environ.get(\"AZURE_PROJECT_NAME\"),\n",
"}\n",
"\n",
"model_config = {\n",
" \"azure_endpoint\": os.environ.get(\"AZURE_OPENAI_ENDPOINT\"),\n",
" \"api_key\": os.environ.get(\"AZURE_OPENAI_API_KEY\"),\n",
" \"azure_deployment\": os.environ.get(\"AZURE_OPENAI_DEPLOYMENT\"),\n",
" \"api_version\": os.environ.get(\"AZURE_OPENAI_API_VERSION\"),\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1733168535314
},
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [],
"source": [
"print(azure_ai_project)\n",
"print(model_config)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data\n",
"Here we define the data, `grounding.json` on which we will simulate query and response pairs to help us evaluate the groundedness of our model's responses. Based on the use case of your model, the data you use to evaluate groundedness might differ. "
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"gather": {
"logged": 1733168537661
}
},
"outputs": [],
"source": [
"resource_name = \"grounding.json\"\n",
"package = \"azure.ai.evaluation.simulator._data_sources\"\n",
"conversation_turns = []\n",
"\n",
"with pkg_resources.path(package, resource_name) as grounding_file, Path.open(grounding_file, \"r\") as file:\n",
" data = json.load(file)\n",
"\n",
"for item in data:\n",
" conversation_turns.append([item])\n",
" if len(conversation_turns) == 2:\n",
" break"
]
},
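{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check, the cell below inspects the `conversation_turns` list we just built. Each entry is a list containing a single opening turn taken from `grounding.json`; the exact shape of each item depends on the contents of that file."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: inspect the conversation turns that will seed the simulator\n",
"print(f\"Number of conversations to simulate: {len(conversation_turns)}\")\n",
"# Preview the first turn (truncated, since items from grounding.json can be long)\n",
"print(json.dumps(conversation_turns[0], indent=2)[:500])"
]
},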
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Target Endpoint\n",
"\n",
"We will use Evaluate API provided by Azure AI Evaluations SDK. It requires a target Application or python Function, which handles a call to LLMs and retrieve responses. "
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"gather": {
"logged": 1733168543338
}
},
"outputs": [],
"source": [
"def example_application_response(query: str, context: str) -> str:\n",
" deployment = os.environ.get(\"AZURE_OPENAI_DEPLOYMENT\")\n",
" endpoint = os.environ.get(\"AZURE_OPENAI_ENDPOINT\")\n",
" token_provider = get_bearer_token_provider(DefaultAzureCredential(), \"https://cognitiveservices.azure.com/.default\")\n",
"\n",
" # Get a client handle for the AOAI model\n",
" client = AzureOpenAI(\n",
" azure_endpoint=endpoint,\n",
" api_version=os.environ.get(\"AZURE_OPENAI_API_VERSION\"),\n",
" azure_ad_token_provider=token_provider,\n",
" )\n",
"\n",
" # Prepare the messages\n",
" messages = [\n",
" {\n",
" \"role\": \"system\",\n",
" \"content\": f\"You are a user assistant who helps answer questions based on some context.\\n\\nContext: '{context}'\",\n",
" },\n",
" {\"role\": \"user\", \"content\": query},\n",
" ]\n",
" # Call the model\n",
" completion = client.chat.completions.create(\n",
" model=deployment,\n",
" messages=messages,\n",
" max_tokens=800,\n",
" temperature=0.7,\n",
" top_p=0.95,\n",
" frequency_penalty=0,\n",
" presence_penalty=0,\n",
" stop=None,\n",
" stream=False,\n",
" )\n",
"\n",
" message = completion.to_dict()[\"choices\"][0][\"message\"]\n",
" if isinstance(message, dict):\n",
" message = message[\"content\"]\n",
" return message"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Run the simulator\n",
"\n",
"The interactions between your endpoint (in this case, `example_application_response`) and the simulator is managed by a callback method, `custom_simulator_callback` and this method is used to format the request to your endpoint and the response from the endpoint."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"gather": {
"logged": 1733168547308
}
},
"outputs": [],
"source": [
"async def custom_simulator_callback(\n",
" messages: List[Dict],\n",
" stream: bool = False,\n",
" session_state: Optional[str] = None,\n",
" context: Optional[Dict[str, Any]] = None,\n",
") -> dict:\n",
" messages_list = messages[\"messages\"]\n",
" # get last message\n",
" latest_message = messages_list[-1]\n",
" application_input = latest_message[\"content\"]\n",
" context = latest_message.get(\"context\", None)\n",
" # call your endpoint or ai application here\n",
" response = example_application_response(query=application_input, context=context)\n",
" # we are formatting the response to follow the openAI chat protocol format\n",
" message = {\n",
" \"content\": response,\n",
" \"role\": \"assistant\",\n",
" \"context\": context,\n",
" }\n",
" messages[\"messages\"].append(message)\n",
" return {\"messages\": messages[\"messages\"], \"stream\": stream, \"session_state\": session_state, \"context\": context}"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1733168550286
}
},
"outputs": [],
"source": [
"custom_simulator = Simulator(model_config=model_config)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1733168585321
}
},
"outputs": [],
"source": [
"outputs = await custom_simulator(\n",
" target=custom_simulator_callback,\n",
" conversation_turns=conversation_turns,\n",
" max_conversation_turns=1,\n",
" concurrent_async_tasks=10,\n",
")"
]
},
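{
"cell_type": "markdown",
"metadata": {},
"source": [
"The simulator returns one result per conversation. Before converting them for evaluation, you can optionally peek at the first simulated exchange; the objects returned here also expose `to_eval_qr_json_lines()`, which we use in the next step."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: preview the first simulated conversation returned by the simulator\n",
"print(f\"Number of simulated conversations: {len(outputs)}\")\n",
"print(outputs[0])"
]
},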
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Convert the outputs to a format that can be evaluated"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"gather": {
"logged": 1733168643300
}
},
"outputs": [],
"source": [
"output_file = \"ground_sim_output.jsonl\"\n",
"with open(output_file, \"w\") as file:\n",
" for output in outputs:\n",
" file.write(output.to_eval_qr_json_lines())"
]
},
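{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each line of `ground_sim_output.jsonl` is a JSON object that typically includes `query`, `response`, and `context` fields, which is the shape the groundedness evaluators expect. The optional cell below previews the first record that was written."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: preview the first record written to the evaluation dataset\n",
"with open(output_file) as f:\n",
"    first_record = json.loads(f.readline())\n",
"print(json.dumps(first_record, indent=2))"
]
},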
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Run the evaluation\n",
"\n",
"In this section, we will run the evaluation using the `GroundednessEvaluator` and the `evaluate` function from the Azure AI Evaluation SDK. The evaluation will assess the groundedness of the model's responses based on the dataset produced by the `Simulator` above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1733168660289
}
},
"outputs": [],
"source": [
"from azure.identity import DefaultAzureCredential\n",
"credential = DefaultAzureCredential()\n",
"\n",
"groundedness_evaluator = GroundednessEvaluator(model_config=model_config)\n",
"groundedness_pro_evaluator = GroundednessProEvaluator(azure_ai_project=azure_ai_project, credential=credential)\n",
"\n",
"eval_output = evaluate(\n",
" data=output_file,\n",
" evaluators={\n",
" \"groundedness\": groundedness_evaluator,\n",
" \"groundedness_pro\": groundedness_pro_evaluator,\n",
"\n",
" },\n",
" azure_ai_project=azure_ai_project,\n",
")\n",
"\n",
"print(eval_output)"
]
},
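{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result returned by `evaluate` is a dictionary that typically contains aggregate `metrics`, per-row results under `rows`, and, when an `azure_ai_project` is supplied, a `studio_url` linking to the run in the portal. The cell below prints a compact summary; the key names are assumptions based on current SDK behavior and may differ across versions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Summarize the evaluation results (key names assumed from current SDK behavior)\n",
"print(\"Aggregate metrics:\")\n",
"for name, value in eval_output.get(\"metrics\", {}).items():\n",
"    print(f\"  {name}: {value}\")\n",
"print(f\"Number of evaluated rows: {len(eval_output.get('rows', []))}\")\n",
"print(f\"Studio URL: {eval_output.get('studio_url')}\")"
]
},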
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1733168702297
},
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [],
"source": [
"from pprint import pprint \n",
"pprint(eval_output)"
]
}
],
"metadata": {
"kernel_info": {
"name": "python310-sdkv2"
},
"kernelspec": {
"display_name": "aoai_eval",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
},
"microsoft": {
"host": {
"AzureML": {
"notebookHasBeenCompleted": true
}
},
"ms_spell_check": {
"ms_spell_check_language": "en"
}
},
"nteract": {
"version": "nteract-front-end@1.0.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}