2-notebooks/3-quality_attributes/2-evaluation.ipynb (487 lines of code) (raw):
{
"cells": [
{
"cell_type": "markdown",
"id": "ee0c0ebe",
"metadata": {},
"source": [
"# 🏋️♀️ Health & Fitness Evaluations with Azure AI Foundry 🏋️♂️\n",
"\n",
"This notebook demonstrates how to **evaluate** a Generative AI model (or application) using the **Azure AI Foundry** ecosystem. We'll highlight three key Python SDKs:\n",
"1. **`azure-ai-projects`** (`AIProjectClient`): manage & orchestrate evaluations in the cloud.\n",
"2. **`azure-ai-inference`**: perform model inference (optional but helpful if generating data for evaluation).\n",
"3. **`azure-ai-evaluation`**: run automated metrics for LLM output quality & safety.\n",
"\n",
"We'll create or use some synthetic \"health & fitness\" Q&A data, then measure how well your model is answering. We'll do both **local** evaluation and **cloud** evaluation (on an Azure AI Foundry project).\n",
"\n",
"> **Disclaimer**: This covers a hypothetical health & fitness scenario. **No real medical advice** is provided. Always consult professionals.\n",
"\n",
"## Notebook Contents\n",
"1. [Setup & Imports](#1-Setup-and-Imports)\n",
"2. [Local Evaluation Examples](#3-Local-Evaluation)\n",
"3. [Cloud Evaluation with `AIProjectClient`](#4-Cloud-Evaluation)\n",
"4. [Extra Topics](#5-Extra-Topics)\n",
" - [Risk & Safety Evaluators](#5.1-Risk-and-Safety)\n",
" - [More Quality Evaluators](#5.2-Quality)\n",
" - [Custom Evaluators](#5.3-Custom)\n",
" - [Simulators & Adversarial Data](#5.4-Simulators)\n",
"5. [Conclusion](#6-Conclusion)\n"
]
},
{
"cell_type": "markdown",
"id": "5bfadf84",
"metadata": {
"id": "1-Setup-and-Imports"
},
"source": [
"## 1. Setup and Imports\n",
"We'll install necessary libraries, import them, and define some synthetic data. \n",
"\n",
"### Dependencies\n",
"- `azure-ai-projects` for orchestrating evaluations in your Azure AI Foundry Project.\n",
"- `azure-ai-evaluation` for built-in or custom metrics (like Relevance, Groundedness, F1Score, etc.).\n",
"- `azure-ai-inference` (optional) if you'd like to generate completions to produce data to evaluate.\n",
"- `azure-identity` (for Azure authentication via `DefaultAzureCredential`).\n",
"\n",
"### Synthetic Data\n",
"We'll create a small JSONL with *health & fitness* Q&A pairs, including `query`, `response`, `context`, and `ground_truth`. This simulates a scenario where we have user questions, the model's answers, plus a reference ground truth.\n",
"\n",
"You can adapt this approach to any domain: e.g., finance, e-commerce, etc.\n",
"\n",
"<img src=\"./seq-diagrams/2-evals.png\" alt=\"Evaluation Flow\" width=\"30%\"/>\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "8b889daf",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"%%capture\n",
"# If you need to install these, uncomment:\n",
"# !pip install azure-ai-projects azure-ai-evaluation azure-ai-inference azure-identity\n",
"# !pip install opentelemetry-sdk azure-core-tracing-opentelemetry # optional for advanced tracing\n",
"\n",
"import json\n",
"import os\n",
"import uuid\n",
"from pathlib import Path\n",
"from typing import Dict, Any\n",
"\n",
"from azure.identity import DefaultAzureCredential\n",
"\n",
"# We'll create a synthetic dataset in JSON Lines format\n",
"synthetic_eval_data = [\n",
" {\n",
" \"query\": \"How can I start a beginner workout routine at home?\",\n",
" \"context\": \"Workout routines can include push-ups, bodyweight squats, lunges, and planks.\",\n",
" \"response\": \"You can just go for 10 push-ups total.\",\n",
" \"ground_truth\": \"At home, you can start with short, low-intensity workouts: push-ups, lunges, planks.\"\n",
" },\n",
" {\n",
" \"query\": \"Are diet sodas healthy for daily consumption?\",\n",
" \"context\": \"Sugar-free or diet drinks may reduce sugar intake, but they still contain artificial sweeteners.\",\n",
" \"response\": \"Yes, diet sodas are 100% healthy.\",\n",
" \"ground_truth\": \"Diet sodas have fewer sugars than regular soda, but 'healthy' is not guaranteed due to artificial additives.\"\n",
" },\n",
" {\n",
" \"query\": \"What's the capital of France?\",\n",
" \"context\": \"France is in Europe. Paris is the capital.\",\n",
" \"response\": \"London.\",\n",
" \"ground_truth\": \"Paris.\"\n",
" }\n",
"]\n",
"\n",
"# Write them to a local JSONL file\n",
"eval_data_path = Path(\"./health_fitness_eval_data.jsonl\")\n",
"with eval_data_path.open(\"w\", encoding=\"utf-8\") as f:\n",
" for row in synthetic_eval_data:\n",
" f.write(json.dumps(row) + \"\\n\")\n",
"\n",
"print(f\"Sample evaluation data written to {eval_data_path.resolve()}\")"
]
},
{
"cell_type": "markdown",
"id": "da2d5598",
"metadata": {
"id": "3-Local-Evaluation"
},
"source": [
"# 3. Local Evaluation Examples\n",
"\n",
"We'll show how to run local, code-based evaluation on a JSONL dataset. We'll:\n",
"1. **Load** the data.\n",
"2. **Define** one or more evaluators. (e.g. `F1ScoreEvaluator`, `RelevanceEvaluator`, `GroundednessEvaluator`, or custom.)\n",
"3. **Run** `evaluate(...)` to produce a dictionary of metrics.\n",
"\n",
"> We can also do multi-turn conversation data or add extra columns like `ground_truth` for advanced metrics.\n",
"\n",
"## Example 1: Combining F1Score, Relevance & Groundedness\n",
"We'll combine:\n",
"- `F1ScoreEvaluator` (NLP-based, compares `response` to `ground_truth`)\n",
"- `RelevanceEvaluator` (AI-assisted, uses GPT to judge how well `response` addresses `query`)\n",
"- `GroundednessEvaluator` (checks how well the response is anchored in the provided `context`)\n",
"- A custom code-based evaluator that logs response length.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c9f04f13",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from pathlib import Path\n",
"from azure.ai.evaluation import (\n",
" evaluate,\n",
" F1ScoreEvaluator,\n",
" RelevanceEvaluator,\n",
" GroundednessEvaluator\n",
")\n",
"\n",
"# Our custom evaluator to measure response length.\n",
"def response_length_eval(response, **kwargs):\n",
" return {\"resp_length\": len(response)}\n",
"\n",
"# We'll define an example GPT-based config (if we want AI-assisted evaluators). \n",
"# This is needed for AI-assisted evaluators. Fill with your Azure OpenAI config.\n",
"# If you skip some evaluators, you can omit.\n",
"model_config = {\n",
" \"azure_endpoint\": os.environ.get(\"AOAI_ENDPOINT\", \"https://dummy-endpoint.azure.com\"),\n",
" \"api_key\": os.environ.get(\"AOAI_API_KEY\", \"fake-key\"),\n",
" \"azure_deployment\": os.environ.get(\"AOAI_DEPLOYMENT\", \"gpt-4\"),\n",
" \"api_version\": os.environ.get(\"AOAI_API_VERSION\", \"2023-07-01-preview\"),\n",
"}\n",
"\n",
"eval_data_path = Path(\"./health_fitness_eval_data.jsonl\")\n",
"\n",
"f1_eval = F1ScoreEvaluator()\n",
"rel_eval = RelevanceEvaluator(model_config=model_config)\n",
"ground_eval = GroundednessEvaluator(model_config=model_config)\n",
"\n",
"# We'll run evaluate(...) with these evaluators.\n",
"results = evaluate(\n",
" data=str(eval_data_path),\n",
" evaluators={\n",
" \"f1_score\": f1_eval,\n",
" \"relevance\": rel_eval,\n",
" \"groundedness\": ground_eval,\n",
" \"resp_len\": response_length_eval\n",
" },\n",
" evaluator_config={\n",
" \"f1_score\": {\n",
" \"column_mapping\": {\n",
" \"response\": \"${data.response}\",\n",
" \"ground_truth\": \"${data.ground_truth}\"\n",
" }\n",
" },\n",
" \"relevance\": {\n",
" \"column_mapping\": {\n",
" \"query\": \"${data.query}\",\n",
" \"response\": \"${data.response}\"\n",
" }\n",
" },\n",
" \"groundedness\": {\n",
" \"column_mapping\": {\n",
" \"context\": \"${data.context}\",\n",
" \"response\": \"${data.response}\"\n",
" }\n",
" },\n",
" \"resp_len\": {\n",
" \"column_mapping\": {\n",
" \"response\": \"${data.response}\"\n",
" }\n",
" }\n",
" }\n",
")\n",
"\n",
"print(\"Local evaluation result =>\")\n",
"print(results)"
]
},
{
"cell_type": "markdown",
"id": "9d525400",
"metadata": {},
"source": [
"**Inspecting Local Results**\n",
"\n",
"The `evaluate(...)` call returns a dictionary with:\n",
"- **`metrics`**: aggregated metrics across rows (like average F1, Relevance, or Groundedness)\n",
"- **`rows`**: row-by-row results with inputs and evaluator outputs\n",
"- **`traces`**: debugging info (if any)\n",
"\n",
"You can further analyze these results, store them in a database, or integrate them into your CI/CD pipeline."
]
},
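{
"cell_type": "markdown",
"id": "b3e1a7c2",
"metadata": {},
"source": [
"**Quick analysis sketch**\n",
"\n",
"Below is a minimal sketch of how you might dig into `results` with pandas (usually already available in notebook environments). The `metrics` and `rows` keys follow the structure described above; the `outputs.<evaluator>.<metric>` column naming is an assumption about how `evaluate(...)` prefixes evaluator outputs, so check `rows_df.columns` in your own run.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Aggregated metrics (e.g. mean F1 / relevance / groundedness across rows)\n",
"print(\"Aggregated metrics:\")\n",
"for name, value in results[\"metrics\"].items():\n",
"    print(f\"  {name}: {value}\")\n",
"\n",
"# Row-level results: original inputs plus one output column per evaluator metric\n",
"rows_df = pd.DataFrame(results[\"rows\"])\n",
"print(rows_df.head())\n",
"\n",
"# Example: flag rows with a low F1 score for manual review\n",
"f1_col = \"outputs.f1_score.f1_score\"  # assumed column name; verify against rows_df.columns\n",
"if f1_col in rows_df.columns:\n",
"    low_f1 = rows_df[rows_df[f1_col] < 0.5]\n",
"    print(f\"{len(low_f1)} rows scored below 0.5 on F1\")\n",
"```\n"
]
},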
{
"cell_type": "markdown",
"id": "c7b903ea",
"metadata": {
"id": "4-Cloud-Evaluation"
},
"source": [
"# 4. Cloud Evaluation with `AIProjectClient`\n",
"\n",
"Sometimes, we want to:\n",
"- Evaluate large or sensitive datasets in the cloud (scalability, governed access).\n",
"- Keep track of evaluation results in an Azure AI Foundry project.\n",
"- Optionally schedule recurring evaluations.\n",
"\n",
"We'll do that by:\n",
"1. **Upload** the local JSONL to your Azure AI Foundry project.\n",
"2. **Create** an `Evaluation` referencing built-in or custom evaluator definitions.\n",
"3. **Poll** until the job is done (with retry logic for resilience).\n",
"4. **Review** the results in the portal or via `project_client.evaluations.get(...)`.\n",
"\n",
"### Prerequisites\n",
"- An Azure AI Foundry project with a valid **Connection String** (from your project’s Overview page).\n",
"- An Azure OpenAI deployment (if using AI-assisted evaluators).\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "68d936ba",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from azure.ai.projects import AIProjectClient\n",
"from azure.ai.projects.models import (\n",
" Evaluation, Dataset, EvaluatorConfiguration, ConnectionType\n",
")\n",
"from azure.ai.evaluation import F1ScoreEvaluator, RelevanceEvaluator, ViolenceEvaluator\n",
"from azure.identity import DefaultAzureCredential\n",
"from azure.core.exceptions import ServiceResponseError\n",
"import time\n",
"\n",
"# 1) Connect to Azure AI Foundry project\n",
"project_conn_str = os.environ.get(\"PROJECT_CONNECTION_STRING\")\n",
"credential = DefaultAzureCredential()\n",
"\n",
"project_client = AIProjectClient.from_connection_string(\n",
" credential=credential,\n",
" conn_str=project_conn_str\n",
")\n",
"print(\"✅ Created AIProjectClient.\")\n",
"\n",
"# 2) Upload data for evaluation\n",
"uploaded_data_id, _ = project_client.upload_file(str(eval_data_path))\n",
"print(\"✅ Uploaded JSONL to project. Data asset ID:\", uploaded_data_id)\n",
"\n",
"# 3) Prepare an Azure OpenAI connection for AI-assisted evaluators\n",
"default_conn = project_client.connections.get_default(ConnectionType.AZURE_OPEN_AI)\n",
"\n",
"deployment_name = os.environ.get(\"AOAI_DEPLOYMENT\", \"gpt-4\")\n",
"api_version = os.environ.get(\"AOAI_API_VERSION\", \"2023-07-01-preview\")\n",
"\n",
"# 4) Construct the evaluation object\n",
"model_config = default_conn.to_evaluator_model_config(\n",
" deployment_name=deployment_name,\n",
" api_version=api_version\n",
")\n",
"\n",
"evaluation = Evaluation(\n",
" display_name=\"Health Fitness Remote Evaluation\",\n",
" description=\"Evaluating dataset for correctness.\",\n",
" data=Dataset(id=uploaded_data_id),\n",
" evaluators={\n",
" \"f1_score\": EvaluatorConfiguration(id=F1ScoreEvaluator.id),\n",
" \"relevance\": EvaluatorConfiguration(\n",
" id=RelevanceEvaluator.id,\n",
" init_params={\"model_config\": model_config}\n",
" ),\n",
" \"violence\": EvaluatorConfiguration(\n",
" id=ViolenceEvaluator.id,\n",
" init_params={\"azure_ai_project\": project_client.scope}\n",
" )\n",
" }\n",
")\n",
"\n",
"# Helper: Create evaluation with retry logic\n",
"def create_evaluation_with_retry(project_client, evaluation, max_retries=3, retry_delay=5):\n",
" for attempt in range(max_retries):\n",
" try:\n",
" result = project_client.evaluations.create(evaluation=evaluation)\n",
" return result\n",
" except ServiceResponseError as e:\n",
" if attempt == max_retries - 1:\n",
" raise\n",
" print(f\"⚠️ Attempt {attempt+1} failed: {str(e)}. Retrying in {retry_delay} seconds...\")\n",
" time.sleep(retry_delay)\n",
"\n",
"# 5) Create & track the evaluation using retry logic\n",
"cloud_eval = create_evaluation_with_retry(project_client, evaluation)\n",
"print(\"✅ Created evaluation job. ID:\", cloud_eval.id)\n",
"\n",
"# 6) Poll or fetch final status\n",
"fetched_eval = project_client.evaluations.get(cloud_eval.id)\n",
"print(\"Current status:\", fetched_eval.status)\n",
"if hasattr(fetched_eval, 'properties'):\n",
" link = fetched_eval.properties.get(\"AiStudioEvaluationUri\", \"\")\n",
" if link:\n",
" print(\"View details in Foundry:\", link)\n",
"else:\n",
" print(\"No link found.\")"
]
},
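{
"cell_type": "markdown",
"id": "7d4f2e9a",
"metadata": {},
"source": [
"**Waiting for completion in the notebook**\n",
"\n",
"If you'd rather wait inside the notebook than watch the portal, a small polling loop like the sketch below works. The terminal status strings used here (`Completed`, `Failed`, `Canceled`) are assumptions; check the values your project actually returns before relying on them.\n",
"\n",
"```python\n",
"import time\n",
"\n",
"# Re-fetch the evaluation until it reaches a terminal state.\n",
"terminal_states = {\"Completed\", \"Failed\", \"Canceled\"}  # assumed status values\n",
"\n",
"while True:\n",
"    fetched_eval = project_client.evaluations.get(cloud_eval.id)\n",
"    print(\"Status:\", fetched_eval.status)\n",
"    if fetched_eval.status in terminal_states:\n",
"        break\n",
"    time.sleep(30)  # avoid hammering the service\n",
"```\n"
]
},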
{
"cell_type": "markdown",
"id": "091290cd",
"metadata": {},
"source": [
"### Viewing Cloud Evaluation Results\n",
"- Navigate to the **Evaluations** tab in your AI Foundry project to see your evaluation job.\n",
"- Open the evaluation to view aggregated metrics and row-level details.\n",
"- For AI-assisted or risk & safety evaluators, you'll see both average scores and detailed per-row results."
]
},
{
"cell_type": "markdown",
"id": "55e3a2c4",
"metadata": {},
"source": [
"# 5. Extra Topics\n",
"We'll do a quick overview of some advanced features:\n",
"1. [Risk & Safety Evaluators](#5.1-Risk-and-Safety)\n",
"2. [Additional Quality Evaluators](#5.2-Quality)\n",
"3. [Custom Evaluators](#5.3-Custom)\n",
"4. [Simulators & Adversarial Data](#5.4-Simulators)\n"
]
},
{
"cell_type": "markdown",
"id": "7490e0eb",
"metadata": {
"id": "5.1-Risk-and-Safety"
},
"source": [
"## 5.1 Risk & Safety Evaluators\n",
"\n",
"Azure AI Foundry includes built-in evaluators that detect content risks. Examples include:\n",
"- **ViolenceEvaluator**: detects violent or harmful content.\n",
"- **SexualEvaluator**: checks for explicit content.\n",
"- **HateUnfairnessEvaluator**: flags hateful content.\n",
"- **SelfHarmEvaluator**: detects self-harm related content.\n",
"- **ProtectedMaterialEvaluator**: identifies copyrighted or protected content.\n",
"\n",
"These evaluators accept a `query` and `response` (and sometimes `context`) to provide severity labels and scores.\n",
"\n",
"For example:\n",
"```python\n",
"from azure.ai.evaluation import ViolenceEvaluator\n",
"\n",
"violence_eval = ViolenceEvaluator(\n",
" credential=DefaultAzureCredential(),\n",
" azure_ai_project={\n",
" \"subscription_id\": \"...\",\n",
" \"resource_group_name\": \"...\",\n",
" \"project_name\": \"...\"\n",
" }\n",
")\n",
"result = violence_eval(query=\"What is the capital of France?\", response=\"Paris\")\n",
"print(result)\n",
"```\n"
]
},
{
"cell_type": "markdown",
"id": "92a94f46",
"metadata": {
"id": "5.2-Quality"
},
"source": [
"## 5.2 Additional Quality Evaluators\n",
"Beyond `F1Score` and `Relevance`, there are many built-ins:\n",
"- **GroundednessEvaluator**: Checks if the response is anchored in the provided context.\n",
"- **CoherenceEvaluator**: Measures the logical flow of the response.\n",
"- **FluencyEvaluator**: Assesses grammatical correctness.\n",
"\n",
"These metrics can help you fine-tune your model’s performance."
]
},
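{
"cell_type": "markdown",
"id": "4c8d1b56",
"metadata": {},
"source": [
"As a sketch, they follow the same pattern as `RelevanceEvaluator`: instantiate with a `model_config`, then call with the relevant fields. Exact call signatures can vary slightly between `azure-ai-evaluation` versions, so treat the argument names below as assumptions and check the evaluator docstrings in your installed version.\n",
"\n",
"```python\n",
"from azure.ai.evaluation import CoherenceEvaluator, FluencyEvaluator\n",
"\n",
"# Reuses the `model_config` dict defined in the local evaluation example above.\n",
"coherence_eval = CoherenceEvaluator(model_config=model_config)\n",
"fluency_eval = FluencyEvaluator(model_config=model_config)\n",
"\n",
"sample_query = \"How can I start a beginner workout routine at home?\"\n",
"sample_response = \"Start with short bodyweight sessions: push-ups, squats, lunges, and planks.\"\n",
"\n",
"print(coherence_eval(query=sample_query, response=sample_response))\n",
"print(fluency_eval(response=sample_response))  # some versions also expect a `query` argument\n",
"```\n"
]
},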
{
"cell_type": "markdown",
"id": "cd6f718e",
"metadata": {
"id": "5.3-Custom"
},
"source": [
"## 5.3 Custom Evaluators\n",
"You can build your own evaluators. For instance, a simple evaluator that measures the length of a response:\n",
"```python\n",
"class AnswerLengthEvaluator:\n",
" def __call__(self, response: str, **kwargs):\n",
" return {\"answer_length\": len(response)}\n",
"```\n",
"\n",
"You can then integrate it with the local or cloud evaluation workflow."
]
},
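{
"cell_type": "markdown",
"id": "e5a9d301",
"metadata": {},
"source": [
"A minimal sketch of that integration, reusing the `eval_data_path` dataset and the column-mapping pattern from the local evaluation section:\n",
"\n",
"```python\n",
"from azure.ai.evaluation import evaluate\n",
"\n",
"answer_length = AnswerLengthEvaluator()\n",
"\n",
"length_results = evaluate(\n",
"    data=str(eval_data_path),\n",
"    evaluators={\"answer_length\": answer_length},\n",
"    evaluator_config={\n",
"        \"answer_length\": {\n",
"            \"column_mapping\": {\"response\": \"${data.response}\"}\n",
"        }\n",
"    },\n",
")\n",
"print(length_results[\"metrics\"])\n",
"```\n"
]
},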
{
"cell_type": "markdown",
"id": "bde67f1f",
"metadata": {
"id": "5.4-Simulators"
},
"source": [
"## 5.4 Simulators & Adversarial Data\n",
"If you need to generate synthetic or adversarial evaluation data, the `azure-ai-evaluation` package provides simulators. \n",
"\n",
"For example, you can simulate adversarial queries using `AdversarialSimulator` to test model safety and robustness."
]
},
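{
"cell_type": "markdown",
"id": "9f0c6ab7",
"metadata": {},
"source": [
"A rough sketch of that flow: the simulator sends adversarial prompts to a `target` callback (your app or model), collects the exchanges, and you can then score them with the risk & safety evaluators. The scenario name, callback contract, and output handling shown here are assumptions; confirm the exact interface against the simulator docs for your installed version.\n",
"\n",
"```python\n",
"from azure.ai.evaluation.simulator import AdversarialSimulator, AdversarialScenario\n",
"from azure.identity import DefaultAzureCredential\n",
"\n",
"azure_ai_project = {\n",
"    \"subscription_id\": \"...\",\n",
"    \"resource_group_name\": \"...\",\n",
"    \"project_name\": \"...\"\n",
"}\n",
"\n",
"simulator = AdversarialSimulator(\n",
"    azure_ai_project=azure_ai_project,\n",
"    credential=DefaultAzureCredential()\n",
")\n",
"\n",
"# The callback receives the simulated conversation and must append your app's reply.\n",
"async def target_callback(messages, stream=False, session_state=None, context=None):\n",
"    last_user_message = messages[\"messages\"][-1][\"content\"]\n",
"    reply = \"I can't help with that.\"  # in practice, send last_user_message to your model\n",
"    messages[\"messages\"].append({\"role\": \"assistant\", \"content\": reply})\n",
"    return {\"messages\": messages[\"messages\"], \"stream\": stream,\n",
"            \"session_state\": session_state, \"context\": context}\n",
"\n",
"# The simulator call is async; notebooks support top-level await.\n",
"outputs = await simulator(\n",
"    scenario=AdversarialScenario.ADVERSARIAL_QA,\n",
"    target=target_callback,\n",
"    max_simulation_results=3\n",
")\n",
"print(outputs)\n",
"```\n"
]
},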
{
"cell_type": "markdown",
"id": "2f63eee3",
"metadata": {
"id": "6-Conclusion"
},
"source": [
"# 6. Conclusion 🏁\n",
"\n",
"We covered:\n",
"1. **Local** evaluations with `evaluate(...)` on JSONL data (now including a groundedness metric).\n",
"2. **Cloud** evaluations with `AIProjectClient` including retry logic for robustness.\n",
"3. Built-in **risk & safety** and **quality** evaluators.\n",
"4. **Custom** evaluators for advanced scenarios.\n",
"5. **Simulators** for generating adversarial data.\n",
"\n",
"**Next Steps**:\n",
"- Adjust your model and prompts based on evaluation feedback.\n",
"- Integrate these evaluations into your CI/CD pipelines.\n",
"- Combine with observability tools for deeper insights.\n",
"\n",
"> **Best of luck** building robust and responsible AI solutions with Azure AI Foundry!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}