2-notebooks/3-quality_attributes/2-evaluation.ipynb (487 lines of code) (raw):
{
"cells": [
{
"cell_type": "markdown",
"id": "ee0c0ebe",
"metadata": {},
"source": [
"# 🏋️♀️ Health & Fitness Evaluations with Azure AI Foundry 🏋️♂️\n",
"\n",
"This notebook demonstrates how to **evaluate** a Generative AI model (or application) using the **Azure AI Foundry** ecosystem. We'll highlight three key Python SDKs:\n",
"1. **`azure-ai-projects`** (`AIProjectClient`): manage & orchestrate evaluations in the cloud.\n",
"2. **`azure-ai-inference`**: perform model inference (optional but helpful if generating data for evaluation).\n",
"3. **`azure-ai-evaluation`**: run automated metrics for LLM output quality & safety.\n",
"\n",
"We'll create or use some synthetic \"health & fitness\" Q&A data, then measure how well your model is answering. We'll do both **local** evaluation and **cloud** evaluation (on an Azure AI Foundry project).\n",
"\n",
"> **Disclaimer**: This covers a hypothetical health & fitness scenario. **No real medical advice** is provided. Always consult professionals.\n",
"\n",
"## Notebook Contents\n",
"1. [Setup & Imports](#1-Setup-and-Imports)\n",
"2. [Local Evaluation Examples](#3-Local-Evaluation)\n",
"3. [Cloud Evaluation with `AIProjectClient`](#4-Cloud-Evaluation)\n",
"4. [Extra Topics](#5-Extra-Topics)\n",
" - [Risk & Safety Evaluators](#5.1-Risk-and-Safety)\n",
" - [More Quality Evaluators](#5.2-Quality)\n",
" - [Custom Evaluators](#5.3-Custom)\n",
" - [Simulators & Adversarial Data](#5.4-Simulators)\n",
"5. [Conclusion](#6-Conclusion)\n"
]
},
{
"cell_type": "markdown",
"id": "5bfadf84",
"metadata": {
"id": "1-Setup-and-Imports"
},
"source": [
"## 1. Setup and Imports\n",
"We'll install necessary libraries, import them, and define some synthetic data. \n",
"\n",
"### Dependencies\n",
"- `azure-ai-projects` for orchestrating evaluations in your Azure AI Foundry Project.\n",
"- `azure-ai-evaluation` for built-in or custom metrics (like Relevance, Groundedness, F1Score, etc.).\n",
"- `azure-ai-inference` (optional) if you'd like to generate completions to produce data to evaluate.\n",
"- `azure-identity` (for Azure authentication via `DefaultAzureCredential`).\n",
"\n",
"### Synthetic Data\n",
"We'll create a small JSONL with *health & fitness* Q&A pairs, including `query`, `response`, `context`, and `ground_truth`. This simulates a scenario where we have user questions, the model's answers, plus a reference ground truth.\n",
"\n",
"You can adapt this approach to any domain: e.g., finance, e-commerce, etc.\n",
"\n",
"<img src=\"./seq-diagrams/2-evals.png\" alt=\"Evaluation Flow\" width=\"30%\"/>\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "8b889daf",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"%%capture\n",
"# If you need to install these, uncomment:\n",
"# !pip install azure-ai-projects azure-ai-evaluation azure-ai-inference azure-identity\n",
"# !pip install opentelemetry-sdk azure-core-tracing-opentelemetry # optional for advanced tracing\n",
"\n",
"import json\n",
"import os\n",
"import uuid\n",
"from pathlib import Path\n",
"from typing import Dict, Any\n",
"\n",
"from azure.identity import DefaultAzureCredential\n",
"\n",
"# We'll create a synthetic dataset in JSON Lines format\n",
"synthetic_eval_data = [\n",
" {\n",
" \"query\": \"How can I start a beginner workout routine at home?\",\n",
" \"context\": \"Workout routines can include push-ups, bodyweight squats, lunges, and planks.\",\n",
" \"response\": \"You can just go for 10 push-ups total.\",\n",
" \"ground_truth\": \"At home, you can start with short, low-intensity workouts: push-ups, lunges, planks.\"\n",
" },\n",
" {\n",
" \"query\": \"Are diet sodas healthy for daily consumption?\",\n",
" \"context\": \"Sugar-free or diet drinks may reduce sugar intake, but they still contain artificial sweeteners.\",\n",
" \"response\": \"Yes, diet sodas are 100% healthy.\",\n",
" \"ground_truth\": \"Diet sodas have fewer sugars than regular soda, but 'healthy' is not guaranteed due to artificial additives.\"\n",
" },\n",
" {\n",
" \"query\": \"What's the capital of France?\",\n",
" \"context\": \"France is in Europe. Paris is the capital.\",\n",
" \"response\": \"London.\",\n",
" \"ground_truth\": \"Paris.\"\n",
" }\n",
"]\n",
"\n",
"# Write them to a local JSONL file\n",
"eval_data_path = Path(\"./health_fitness_eval_data.jsonl\")\n",
"with eval_data_path.open(\"w\", encoding=\"utf-8\") as f:\n",
" for row in synthetic_eval_data:\n",
" f.write(json.dumps(row) + \"\\n\")\n",
"\n",
"print(f\"Sample evaluation data written to {eval_data_path.resolve()}\")"
]
},
{
"cell_type": "markdown",
"id": "da2d5598",
"metadata": {
"id": "3-Local-Evaluation"
},
"source": [
"# 3. Local Evaluation Examples\n",
"\n",
"We'll show how to run local, code-based evaluation on a JSONL dataset. We'll:\n",
"1. **Load** the data.\n",
"2. **Define** one or more evaluators. (e.g. `F1ScoreEvaluator`, `RelevanceEvaluator`, `GroundednessEvaluator`, or custom.)\n",
"3. **Run** `evaluate(...)` to produce a dictionary of metrics.\n",
"\n",
"> We can also do multi-turn conversation data or add extra columns like `ground_truth` for advanced metrics.\n",
"\n",
"## Example 1: Combining F1Score, Relevance & Groundedness\n",
"We'll combine:\n",
"- `F1ScoreEvaluator` (NLP-based, compares `response` to `ground_truth`)\n",
"- `RelevanceEvaluator` (AI-assisted, uses GPT to judge how well `response` addresses `query`)\n",
"- `GroundednessEvaluator` (checks how well the response is anchored in the provided `context`)\n",
"- A custom code-based evaluator that logs response length.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c9f04f13",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from pathlib import Path\n",
"from azure.ai.evaluation import (\n",
" evaluate,\n",
" F1ScoreEvaluator,\n",
" RelevanceEvaluator,\n",
" GroundednessEvaluator\n",
")\n",
"\n",
"# Our custom evaluator to measure response length.\n",
"def response_length_eval(response, **kwargs):\n",
" return {\"resp_length\": len(response)}\n",
"\n",
"# We'll define an example GPT-based config (if we want AI-assisted evaluators). \n",
"# This is needed for AI-assisted evaluators. Fill with your Azure OpenAI config.\n",
"# If you skip some evaluators, you can omit.\n",
"model_config = {\n",
" \"azure_endpoint\": os.environ.get(\"AOAI_ENDPOINT\", \"https://dummy-endpoint.azure.com\"),\n",
" \"api_key\": os.environ.get(\"AOAI_API_KEY\", \"fake-key\"),\n",
" \"azure_deployment\": os.environ.get(\"AOAI_DEPLOYMENT\", \"gpt-4\"),\n",
" \"api_version\": os.environ.get(\"AOAI_API_VERSION\", \"2023-07-01-preview\"),\n",
"}\n",
"\n",
"eval_data_path = Path(\"./health_fitness_eval_data.jsonl\")\n",
"\n",
"f1_eval = F1ScoreEvaluator()\n",
"rel_eval = RelevanceEvaluator(model_config=model_config)\n",
"ground_eval = GroundednessEvaluator(model_config=model_config)\n",
"\n",
"# We'll run evaluate(...) with these evaluators.\n",
"results = evaluate(\n",
" data=str(eval_data_path),\n",
" evaluators={\n",
" \"f1_score\": f1_eval,\n",
" \"relevance\": rel_eval,\n",
" \"groundedness\": ground_eval,\n",
" \"resp_len\": response_length_eval\n",
" },\n",
" evaluator_config={\n",
" \"f1_score\": {\n",
" \"column_mapping\": {\n",
" \"response\": \"${data.response}\",\n",
" \"ground_truth\": \"${data.ground_truth}\"\n",
" }\n",
" },\n",
" \"relevance\": {\n",
" \"column_mapping\": {\n",
" \"query\": \"${data.query}\",\n",
" \"response\": \"${data.response}\"\n",
" }\n",
" },\n",
" \"groundedness\": {\n",
" \"column_mapping\": {\n",
" \"context\": \"${data.context}\",\n",
" \"response\": \"${data.response}\"\n",
" }\n",
" },\n",
" \"resp_len\": {\n",
" \"column_mapping\": {\n",
" \"response\": \"${data.response}\"\n",
" }\n",
" }\n",
" }\n",
")\n",
"\n",
"print(\"Local evaluation result =>\")\n",
"print(results)"
]
},
{
"cell_type": "markdown",
"id": "9d525400",
"metadata": {},
"source": [
"**Inspecting Local Results**\n",
"\n",
"The `evaluate(...)` call returns a dictionary with:\n",
"- **`metrics`**: aggregated metrics across rows (like average F1, Relevance, or Groundedness)\n",
"- **`rows`**: row-by-row results with inputs and evaluator outputs\n",
"- **`traces`**: debugging info (if any)\n",
"\n",
"You can further analyze these results, store them in a database, or integrate them into your CI/CD pipeline."
]
},
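{
"cell_type": "markdown",
"id": "b3e1a7c2",
"metadata": {},
"source": [
"**Quick analysis sketch**\n",
"\n",
"Below is a minimal sketch of how you might dig into `results` with pandas (usually already available in notebook environments). The `metrics` and `rows` keys follow the structure described above; the `outputs.<evaluator>.<metric>` column naming is an assumption about how `evaluate(...)` prefixes evaluator outputs, so check `rows_df.columns` in your own run.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Aggregated metrics (e.g. mean F1 / relevance / groundedness across rows)\n",
"print(\"Aggregated metrics:\")\n",
"for name, value in results[\"metrics\"].items():\n",
"    print(f\"  {name}: {value}\")\n",
"\n",
"# Row-level results: original inputs plus one output column per evaluator metric\n",
"rows_df = pd.DataFrame(results[\"rows\"])\n",
"print(rows_df.head())\n",
"\n",
"# Example: flag rows with a low F1 score for manual review\n",
"f1_col = \"outputs.f1_score.f1_score\"  # assumed column name; verify against rows_df.columns\n",
"if f1_col in rows_df.columns:\n",
"    low_f1 = rows_df[rows_df[f1_col] < 0.5]\n",
"    print(f\"{len(low_f1)} rows scored below 0.5 on F1\")\n",
"```\n"
]
},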
{
"cell_type": "markdown",
"id": "c7b903ea",
"metadata": {
"id": "4-Cloud-Evaluation"
},
"source": [
"# 4. Cloud Evaluation with `AIProjectClient`\n",
"\n",
"Sometimes, we want to:\n",
"- Evaluate large or sensitive datasets in the cloud (scalability, governed access).\n",
"- Keep track of evaluation results in an Azure AI Foundry project.\n",
"- Optionally schedule recurring evaluations.\n",
"\n",
"We'll do that by:\n",
"1. **Upload** the local JSONL to your Azure AI Foundry project.\n",
"2. **Create** an `Evaluation` referencing built-in or custom evaluator definitions.\n",
"3. **Poll** until the job is done (with retry logic for resilience).\n",
"4. **Review** the results in the portal or via `project_client.evaluations.get(...)`.\n",
"\n",
"### Prerequisites\n",
"- An Azure AI Foundry project with a valid **Connection String** (from your project’s Overview page).\n",
"- An Azure OpenAI deployment (if using AI-assisted evaluators).\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "68d936ba",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from azure.ai.projects import AIProjectClient\n",
"from azure.ai.projects.models import (\n",
" Evaluation, Dataset, EvaluatorConfiguration, ConnectionType\n",
")\n",
"from azure.ai.evaluation import F1ScoreEvaluator, RelevanceEvaluator, ViolenceEvaluator\n",
"from azure.identity import DefaultAzureCredential\n",
"from azure.core.exceptions import ServiceResponseError\n",
"import time\n",
"\n",
"# 1) Connect to Azure AI Foundry project\n",
"project_conn_str = os.environ.get(\"PROJECT_CONNECTION_STRING\")\n",
"credential = DefaultAzureCredential()\n",
"\n",
"project_client = AIProjectClient.from_connection_string(\n",
" credential=credential,\n",
" conn_str=project_conn_str\n",
")\n",
"print(\"✅ Created AIProjectClient.\")\n",
"\n",
"# 2) Upload data for evaluation\n",
"uploaded_data_id, _ = project_client.upload_file(str(eval_data_path))\n",
"print(\"✅ Uploaded JSONL to project. Data asset ID:\", uploaded_data_id)\n",
"\n",
"# 3) Prepare an Azure OpenAI connection for AI-assisted evaluators\n",
"default_conn = project_client.connections.get_default(ConnectionType.AZURE_OPEN_AI)\n",
"\n",
"deployment_name = os.environ.get(\"AOAI_DEPLOYMENT\", \"gpt-4\")\n",
"api_version = os.environ.get(\"AOAI_API_VERSION\", \"2023-07-01-preview\")\n",
"\n",
"# 4) Construct the evaluation object\n",
"model_config = default_conn.to_evaluator_model_config(\n",
" deployment_name=deployment_name,\n",
" api_version=api_version\n",
")\n",
"\n",
"evaluation = Evaluation(\n",
" display_name=\"Health Fitness Remote Evaluation\",\n",
" description=\"Evaluating dataset for correctness.\",\n",
" data=Dataset(id=uploaded_data_id),\n",
" evaluators={\n",
" \"f1_score\": EvaluatorConfiguration(id=F1ScoreEvaluator.id),\n",
" \"relevance\": EvaluatorConfiguration(\n",
" id=RelevanceEvaluator.id,\n",
" init_params={\"model_config\": model_config}\n",
" ),\n",
" \"violence\": EvaluatorConfiguration(\n",
" id=ViolenceEvaluator.id,\n",
" init_params={\"azure_ai_project\": project_client.scope}\n",
" )\n",
" }\n",
")\n",
"\n",
"# Helper: Create evaluation with retry logic\n",
"def create_evaluation_with_retry(project_client, evaluation, max_retries=3, retry_delay=5):\n",
" for attempt in range(max_retries):\n",
" try:\n",
" result = project_client.evaluations.create(evaluation=evaluation)\n",
" return result\n",
" except ServiceResponseError as e:\n",
" if attempt == max_retries - 1:\n",
" raise\n",
" print(f\"⚠️ Attempt {attempt+1} failed: {str(e)}. Retrying in {retry_delay} seconds...\")\n",
" time.sleep(retry_delay)\n",
"\n",
"# 5) Create & track the evaluation using retry logic\n",
"cloud_eval = create_evaluation_with_retry(project_client, evaluation)\n",
"print(\"✅ Created evaluation job. ID:\", cloud_eval.id)\n",
"\n",
"# 6) Poll or fetch final status\n",
"fetched_eval = project_client.evaluations.get(cloud_eval.id)\n",
"print(\"Current status:\", fetched_eval.status)\n",
"if hasattr(fetched_eval, 'properties'):\n",
" link = fetched_eval.properties.get(\"AiStudioEvaluationUri\", \"\")\n",
" if link:\n",
" print(\"View details in Foundry:\", link)\n",
"else:\n",
" print(\"No link found.\")"
]
},
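{
"cell_type": "markdown",
"id": "7d4f2e9a",
"metadata": {},
"source": [
"**Waiting for completion in the notebook**\n",
"\n",
"If you'd rather wait inside the notebook than watch the portal, a small polling loop like the sketch below works. The terminal status strings used here (`Completed`, `Failed`, `Canceled`) are assumptions; check the values your project actually returns before relying on them.\n",
"\n",
"```python\n",
"import time\n",
"\n",
"# Re-fetch the evaluation until it reaches a terminal state.\n",
"terminal_states = {\"Completed\", \"Failed\", \"Canceled\"}  # assumed status values\n",
"\n",
"while True:\n",
"    fetched_eval = project_client.evaluations.get(cloud_eval.id)\n",
"    print(\"Status:\", fetched_eval.status)\n",
"    if fetched_eval.status in terminal_states:\n",
"        break\n",
"    time.sleep(30)  # avoid hammering the service\n",
"```\n"
]
},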
{
"cell_type": "markdown",
"id": "091290cd",
"metadata": {},
"source": [
"### Viewing Cloud Evaluation Results\n",
"- Navigate to the **Evaluations** tab in your AI Foundry project to see your evaluation job.\n",
"- Open the evaluation to view aggregated metrics and row-level details.\n",
"- For AI-assisted or risk & safety evaluators, you'll see both average scores and detailed per-row results."
]
},
{
"cell_type": "markdown",
"id": "55e3a2c4",
"metadata": {},
"source": [
"# 5. Extra Topics\n",
"We'll do a quick overview of some advanced features:\n",
"1. [Risk & Safety Evaluators](#5.1-Risk-and-Safety)\n",
"2. [Additional Quality Evaluators](#5.2-Quality)\n",
"3. [Custom Evaluators](#5.3-Custom)\n",
"4. [Simulators & Adversarial Data](#5.4-Simulators)\n"
]
},
{
"cell_type": "markdown",
"id": "7490e0eb",
"metadata": {
"id": "5.1-Risk-and-Safety"
},
"source": [
"## 5.1 Risk & Safety Evaluators\n",
"\n",
"Azure AI Foundry includes built-in evaluators that detect content risks. Examples include:\n",
"- **ViolenceEvaluator**: detects violent or harmful content.\n",
"- **SexualEvaluator**: checks for explicit content.\n",
"- **HateUnfairnessEvaluator**: flags hateful content.\n",
"- **SelfHarmEvaluator**: detects self-harm related content.\n",
"- **ProtectedMaterialEvaluator**: identifies copyrighted or protected content.\n",
"\n",
"These evaluators accept a `query` and `response` (and sometimes `context`) to provide severity labels and scores.\n",
"\n",
"For example:\n",
"```python\n",
"from azure.ai.evaluation import ViolenceEvaluator\n",
"\n",
"violence_eval = ViolenceEvaluator(\n",
" credential=DefaultAzureCredential(),\n",
" azure_ai_project={\n",
" \"subscription_id\": \"...\",\n",
" \"resource_group_name\": \"...\",\n",
" \"project_name\": \"...\"\n",
" }\n",
")\n",
"result = violence_eval(query=\"What is the capital of France?\", response=\"Paris\")\n",
"print(result)\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "92a94f46",
"metadata": {
"id": "5.2-Quality"
},
"source": [
"## 5.2 Additional Quality Evaluators\n",
"Beyond `F1Score` and `Relevance`, there are many built-ins:\n",
"- **GroundednessEvaluator**: Checks if the response is anchored in the provided context.\n",
"- **CoherenceEvaluator**: Measures the logical flow of the response.\n",
"- **FluencyEvaluator**: Assesses grammatical correctness.\n",
"\n",
"These metrics can help you fine-tune your model’s performance."
]
},
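{
"cell_type": "markdown",
"id": "4c8d1b56",
"metadata": {},
"source": [
"As a sketch, they follow the same pattern as `RelevanceEvaluator`: instantiate with a `model_config`, then call with the relevant fields. Exact call signatures can vary slightly between `azure-ai-evaluation` versions, so treat the argument names below as assumptions and check the evaluator docstrings in your installed version.\n",
"\n",
"```python\n",
"from azure.ai.evaluation import CoherenceEvaluator, FluencyEvaluator\n",
"\n",
"# Reuses the `model_config` dict defined in the local evaluation example above.\n",
"coherence_eval = CoherenceEvaluator(model_config=model_config)\n",
"fluency_eval = FluencyEvaluator(model_config=model_config)\n",
"\n",
"sample_query = \"How can I start a beginner workout routine at home?\"\n",
"sample_response = \"Start with short bodyweight sessions: push-ups, squats, lunges, and planks.\"\n",
"\n",
"print(coherence_eval(query=sample_query, response=sample_response))\n",
"print(fluency_eval(response=sample_response))  # some versions also expect a `query` argument\n",
"```\n"
]
},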
{
"cell_type": "markdown",
"id": "cd6f718e",
"metadata": {
"id": "5.3-Custom"
},
"source": [
"## 5.3 Custom Evaluators\n",
"You can build your own evaluators. For instance, a simple evaluator that measures the length of a response:\n",
"```python\n",
"class AnswerLengthEvaluator:\n",
" def __call__(self, response: str, **kwargs):\n",
" return {\"answer_length\": len(response)}\n",
"```\n",
"\n",
"You can then integrate it with the local or cloud evaluation workflow."
]
},
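{
"cell_type": "markdown",
"id": "e5a9d301",
"metadata": {},
"source": [
"A minimal sketch of that integration, reusing the `eval_data_path` dataset and the column-mapping pattern from the local evaluation section:\n",
"\n",
"```python\n",
"from azure.ai.evaluation import evaluate\n",
"\n",
"answer_length = AnswerLengthEvaluator()\n",
"\n",
"length_results = evaluate(\n",
"    data=str(eval_data_path),\n",
"    evaluators={\"answer_length\": answer_length},\n",
"    evaluator_config={\n",
"        \"answer_length\": {\n",
"            \"column_mapping\": {\"response\": \"${data.response}\"}\n",
"        }\n",
"    },\n",
")\n",
"print(length_results[\"metrics\"])\n",
"```\n"
]
},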
{
"cell_type": "markdown",
"id": "bde67f1f",
"metadata": {
"id": "5.4-Simulators"
},
"source": [
"## 5.4 Simulators & Adversarial Data\n",
"If you need to generate synthetic or adversarial evaluation data, the `azure-ai-evaluation` package provides simulators. \n",
"\n",
"For example, you can simulate adversarial queries using `AdversarialSimulator` to test model safety and robustness."
]
},
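{
"cell_type": "markdown",
"id": "9f0c6ab7",
"metadata": {},
"source": [
"A rough sketch of that flow: the simulator sends adversarial prompts to a `target` callback (your app or model), collects the exchanges, and you can then score them with the risk & safety evaluators. The scenario name, callback contract, and output handling shown here are assumptions; confirm the exact interface against the simulator docs for your installed version.\n",
"\n",
"```python\n",
"from azure.ai.evaluation.simulator import AdversarialSimulator, AdversarialScenario\n",
"from azure.identity import DefaultAzureCredential\n",
"\n",
"azure_ai_project = {\n",
"    \"subscription_id\": \"...\",\n",
"    \"resource_group_name\": \"...\",\n",
"    \"project_name\": \"...\"\n",
"}\n",
"\n",
"simulator = AdversarialSimulator(\n",
"    azure_ai_project=azure_ai_project,\n",
"    credential=DefaultAzureCredential()\n",
")\n",
"\n",
"# The callback receives the simulated conversation and must append your app's reply.\n",
"async def target_callback(messages, stream=False, session_state=None, context=None):\n",
"    last_user_message = messages[\"messages\"][-1][\"content\"]\n",
"    reply = \"I can't help with that.\"  # in practice, send last_user_message to your model\n",
"    messages[\"messages\"].append({\"role\": \"assistant\", \"content\": reply})\n",
"    return {\"messages\": messages[\"messages\"], \"stream\": stream,\n",
"            \"session_state\": session_state, \"context\": context}\n",
"\n",
"# The simulator call is async; notebooks support top-level await.\n",
"outputs = await simulator(\n",
"    scenario=AdversarialScenario.ADVERSARIAL_QA,\n",
"    target=target_callback,\n",
"    max_simulation_results=3\n",
")\n",
"print(outputs)\n",
"```\n"
]
},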
{
"cell_type": "markdown",
"id": "2f63eee3",
"metadata": {
"id": "6-Conclusion"
},
"source": [
"# 6. Conclusion 🏁\n",
"\n",
"We covered:\n",
"1. **Local** evaluations with `evaluate(...)` on JSONL data (now including a groundedness metric).\n",
"2. **Cloud** evaluations with `AIProjectClient` including retry logic for robustness.\n",
"3. Built-in **risk & safety** and **quality** evaluators.\n",
"4. **Custom** evaluators for advanced scenarios.\n",
"5. **Simulators** for generating adversarial data.\n",
"\n",
"**Next Steps**:\n",
"- Adjust your model and prompts based on evaluation feedback.\n",
"- Integrate these evaluations into your CI/CD pipelines.\n",
"- Combine with observability tools for deeper insights.\n",
"\n",
"> **Best of luck** building robust and responsible AI solutions with Azure AI Foundry!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}