2_eval-design-ptn/02_azure-evaluation-sdk/01.1_quality-evaluator.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Quality Evaluators with the Azure AI Evaluation SDK\n",
    "This sample shows the basic way to evaluate a generative AI application in your development environment with the Azure AI Evaluation SDK.\n",
    "\n",
    "> ✨ ***Note*** <br>\n",
    "> Please check the reference document before you get started - https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "vscode": {
     "languageId": "shellscript"
    }
   },
   "source": [
    "## 🔨 Current Support and Limitations (as of 2025-01-14)\n",
    "- Check the region support for the Azure AI Evaluation SDK: https://learn.microsoft.com/en-us/azure/ai-studio/concepts/evaluation-metrics-built-in?tabs=warning#region-support\n",
    "\n",
    "### Region support for evaluations\n",
    "| Region | Hate and Unfairness, Sexual, Violent, Self-Harm, XPIA, ECI (Text) | Groundedness (Text) | Protected Material (Text) | Hate and Unfairness, Sexual, Violent, Self-Harm, Protected Material (Image) |\n",
    "|---------------------|------------------------------------------------------------------|---------------------|----------------------------|----------------------------------------------------------------------------|\n",
    "| North Central US | no | no | no | yes |\n",
    "| East US 2 | yes | yes | yes | yes |\n",
    "| Sweden Central | yes | yes | yes | yes |\n",
    "| US North Central | yes | no | yes | yes |\n",
    "| France Central | yes | yes | yes | yes |\n",
    "| Switzerland West | yes | no | no | yes |\n",
    "\n",
    "### Region support for adversarial simulation\n",
    "| Region | Adversarial Simulation (Text) | Adversarial Simulation (Image) |\n",
    "|-------------------|-------------------------------|---------------------------------|\n",
    "| UK South | yes | no |\n",
    "| East US 2 | yes | yes |\n",
    "| Sweden Central | yes | yes |\n",
    "| US North Central | yes | yes |\n",
    "| France Central | yes | no |\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## ✔️ Pricing and billing\n",
    "- Effective 1/14/2025, Azure AI Safety Evaluations are no longer free in public preview. They are billed based on consumption as follows:\n",
    "\n",
    "| Service Name | Safety Evaluations | Price Per 1K Tokens (USD) |\n",
    "|---------------------------|--------------------------|---------------------------|\n",
    "| Azure Machine Learning | Input pricing for 3P | $0.02 |\n",
    "| Azure Machine Learning | Output pricing for 3P | $0.06 |\n",
    "| Azure Machine Learning | Input pricing for 1P | $0.012 |\n",
    "| Azure Machine Learning | Output pricing for 1P | $0.012 |\n"
   ]
  },
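  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Prerequisites\n",
    "The cells below assume the Azure AI Evaluation and Azure AI Projects client libraries are already installed, together with `azure-identity` and `python-dotenv` for authentication and configuration, plus `pandas` and `matplotlib` for the result analysis at the end. As a minimal sketch (versions are intentionally left unpinned; pin them to the SDK versions you have validated), you can install them with:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install the packages imported in this notebook (skip if your environment already has them).\n",
    "%pip install azure-ai-evaluation azure-ai-projects azure-identity python-dotenv pandas matplotlib"
   ]
  },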
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import json\n",
    "import pathlib\n",
    "\n",
    "import pandas as pd\n",
    "from pprint import pprint\n",
    "from dotenv import load_dotenv\n",
    "\n",
    "from azure.identity import DefaultAzureCredential\n",
    "from azure.ai.projects import AIProjectClient\n",
    "from azure.ai.projects.models import (\n",
    "    Evaluation,\n",
    "    Dataset,\n",
    "    EvaluatorConfiguration,\n",
    "    EvaluationSchedule,\n",
    "    RecurrenceTrigger,\n",
    "    ApplicationInsightsConfiguration\n",
    ")\n",
    "\n",
    "from azure.ai.evaluation import evaluate\n",
    "from azure.ai.evaluation import (\n",
    "    RelevanceEvaluator,\n",
    "    CoherenceEvaluator,\n",
    "    GroundednessEvaluator,\n",
    "    GroundednessProEvaluator,\n",
    "    FluencyEvaluator,\n",
    "    SimilarityEvaluator,\n",
    "    F1ScoreEvaluator,\n",
    "    RetrievalEvaluator\n",
    ")\n",
    "\n",
    "load_dotenv(override=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "credential = DefaultAzureCredential()\n",
    "\n",
    "azure_ai_project_client = AIProjectClient.from_connection_string(\n",
    "    credential=credential,\n",
    "    # At the moment, the connection string should be in the format\n",
    "    # \"<Region>.api.azureml.ms;<AzureSubscriptionId>;<ResourceGroup>;<HubName>\"\n",
    "    # e.g. eastus2.api.azureml.ms;xxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxx;rg-sample;sample-project-eastus2\n",
    "    conn_str=os.environ.get(\"AZURE_AI_PROJECT_CONN_STR\"),\n",
    ")\n",
    "\n",
    "model_config = {\n",
    "    \"azure_endpoint\": os.environ.get(\"AZURE_OPENAI_ENDPOINT\"),\n",
    "    \"api_key\": os.environ.get(\"AZURE_OPENAI_API_KEY\"),\n",
    "    \"azure_deployment\": os.environ.get(\"AZURE_OPENAI_CHAT_DEPLOYMENT_NAME\"),\n",
    "    \"api_version\": os.environ.get(\"AZURE_OPENAI_API_VERSION\"),\n",
    "    \"type\": \"azure_openai\",\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "query_response = dict(\n",
    "    query=\"Which tent is the most waterproof?\",\n",
    "    context=\"The Alpine Explorer Tent is the most water-proof of all tents available.\",\n",
    "    response=\"The Alpine Explorer Tent is the most waterproof.\"\n",
    ")\n",
    "\n",
    "conversation_str = \"\"\"{\"messages\": [ { \"content\": \"Which tent is the most waterproof?\", \"role\": \"user\" }, { \"content\": \"The Alpine Explorer Tent is the most waterproof\", \"role\": \"assistant\", \"context\": \"From our product list the Alpine Explorer Tent is the most waterproof. The Adventure Dining Table has higher weight.\" }, { \"content\": \"How much does it cost?\", \"role\": \"user\" }, { \"content\": \"$120.\", \"role\": \"assistant\", \"context\": \"The Alpine Explorer Tent is $120.\"} ] }\"\"\"\n",
    "conversation = json.loads(conversation_str)"
   ]
  },
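  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before running any evaluator, it can help to confirm that the environment variables the cells above rely on are actually set. The cell below is a minimal, optional sanity check; the variable names mirror the `os.environ.get(...)` calls above, and `AZURE_AI_PROJECT_CONN_STR` is only needed if you log results to Azure AI Foundry."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional sanity check: verify the environment variables used in this notebook are set.\n",
    "required_vars = [\n",
    "    \"AZURE_OPENAI_ENDPOINT\",\n",
    "    \"AZURE_OPENAI_API_KEY\",\n",
    "    \"AZURE_OPENAI_CHAT_DEPLOYMENT_NAME\",\n",
    "    \"AZURE_OPENAI_API_VERSION\",\n",
    "]\n",
    "optional_vars = [\"AZURE_AI_PROJECT_CONN_STR\"]  # only needed when logging to Azure AI Foundry\n",
    "\n",
    "missing = [name for name in required_vars if not os.environ.get(name)]\n",
    "if missing:\n",
    "    print(f\"Missing required environment variables: {missing}\")\n",
    "else:\n",
    "    print(\"All required environment variables are set.\")\n",
    "\n",
    "for name in optional_vars:\n",
    "    if not os.environ.get(name):\n",
    "        print(f\"Optional variable {name} is not set.\")"
   ]
  },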
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🧪 AI-assisted Retrieval evaluator\n",
    "- Evaluates a retrieval score for a given query and context, or for a multi-turn conversation, and includes the reasoning behind the score.\n",
    "- The retrieval measure assesses the AI system's performance in retrieving information for additional context (e.g. a RAG scenario).\n",
    "- Retrieval scores range from 1 to 5, with 1 being the worst and 5 being the best."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initializing RetrievalEvaluator\n",
    "retrieval_eval = RetrievalEvaluator(model_config)\n",
    "\n",
    "query_response = dict(\n",
    "    query=\"Which tent is the most waterproof?\",\n",
    "    context=\"The Alpine Explorer Tent is the most water-proof of all tents available.\",\n",
    "    response=\"The Alpine Explorer Tent is the most waterproof.\"\n",
    ")\n",
    "\n",
    "# query_response = dict(\n",
    "#     query=\"어떤 텐트가 방수 기능이 있어?\",  # \"Which tent is waterproof?\"\n",
    "#     context=\"알파인 익스플로러 텐트가 모든 텐트 중 가장 방수 기능이 뛰어남\",  # \"The Alpine Explorer Tent is the most waterproof of all tents.\"\n",
    "#     response=\"알파인 익스플로러 텐트가 방수 기능이 있습니다.\"  # \"The Alpine Explorer Tent is waterproof.\"\n",
    "# )\n",
    "\n",
    "# Running the Retrieval evaluator on a query and response pair\n",
    "retrieval_score = retrieval_eval(\n",
    "    **query_response\n",
    ")\n",
    "print(retrieval_score)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🧪 AI-assisted RelevanceEvaluator\n",
    "- Relevance refers to how effectively a response addresses a question. It assesses the accuracy, completeness, and direct relevance of the response based solely on the given information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "relevance_eval = RelevanceEvaluator(model_config)\n",
    "\n",
    "query_response = dict(\n",
    "    query=\"Which tent is the most waterproof?\",\n",
    "    # context=\"The Alpine Explorer Tent is the most water-proof of all tents available.\",\n",
    "    response=\"The Alpine Explorer Tent is the most waterproof.\"\n",
    ")\n",
    "\n",
    "relevance_score = relevance_eval(\n",
    "    **query_response\n",
    ")\n",
    "print(relevance_score)\n",
    "\n",
    "# Conversation input\n",
    "relevance_conv_score = relevance_eval(conversation=conversation)\n",
    "print(relevance_conv_score)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🧪 AI-assisted CoherenceEvaluator\n",
    "- Coherence refers to the logical and orderly presentation of ideas in a response, allowing the reader to easily follow and understand the writer's train of thought. A coherent answer directly addresses the question with clear connections between sentences and paragraphs, using appropriate transitions and a logical sequence of ideas."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "coherence_eval = CoherenceEvaluator(model_config)\n",
    "\n",
    "query_response = dict(\n",
    "    query=\"Which tent is the most waterproof?\",\n",
    "    # context=\"The Alpine Explorer Tent is the most water-proof of all tents available.\",\n",
    "    response=\"The Alpine Explorer Tent is the most waterproof.\"\n",
    ")\n",
    "\n",
    "coherence_score = coherence_eval(\n",
    "    **query_response\n",
    ")\n",
    "print(coherence_score)\n",
    "\n",
    "# Conversation input\n",
    "coherence_conv_score = coherence_eval(conversation=conversation)\n",
    "print(coherence_conv_score)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🧪 AI-assisted FluencyEvaluator\n",
    "- Fluency refers to the effectiveness and clarity of written communication, focusing on grammatical accuracy, vocabulary range, sentence complexity, coherence, and overall readability. It assesses how smoothly ideas are conveyed and how easily the text can be understood by the reader."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fluency_eval = FluencyEvaluator(model_config)\n",
    "\n",
    "query_response = dict(\n",
    "    # query=\"Which tent is the most waterproof?\",\n",
    "    # context=\"The Alpine Explorer Tent is the most water-proof of all tents available.\",\n",
    "    response=\"The Alpine Explorer Tent is the most waterproof.\"\n",
    ")\n",
    "\n",
    "fluency_score = fluency_eval(\n",
    "    **query_response\n",
    ")\n",
    "print(fluency_score)\n",
    "\n",
    "# Conversation input\n",
    "fluency_conv_score = fluency_eval(conversation=conversation)\n",
    "print(fluency_conv_score)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🧪 AI-assisted SimilarityEvaluator\n",
    "- The similarity metric is calculated by instructing a language model to follow the metric definition and a set of grading rubrics, evaluate the user inputs, and output a score on a 5-point scale (higher means better quality). See the reference document linked at the top for the full definition and grading rubric.\n",
    "- The recommended scenario is NLP tasks with a user query. Use it when you want an objective evaluation of an AI model's performance, particularly in text generation tasks where you have access to ground-truth responses. Similarity enables you to assess the generated text's semantic alignment with the desired content, helping to gauge the model's quality and accuracy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "similarity_eval = SimilarityEvaluator(model_config)\n",
    "\n",
    "query_response = dict(\n",
    "    query=\"Which tent is the most waterproof?\",\n",
    "    # context=\"The Alpine Explorer Tent is the most water-proof of all tents available.\",\n",
    "    response=\"The Alpine Explorer Tent is the most waterproof.\",\n",
    "    ground_truth=\"The Alpine Explorer tent has a rainfly waterproof rating of 2000mm\"\n",
    ")\n",
    "\n",
    "similarity_score = similarity_eval(\n",
    "    **query_response\n",
    ")\n",
    "print(similarity_score)\n",
    "\n",
    "# Conversation input is not supported for SimilarityEvaluator\n",
    "# similarity_conv_score = similarity_eval(conversation=conversation)\n",
    "# print(similarity_conv_score)"
   ]
  },
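  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### F1ScoreEvaluator (sketch)\n",
    "`F1ScoreEvaluator` is imported at the top of this notebook but not exercised above. Like `SimilarityEvaluator`, it compares a response against a ground truth, but it computes a token-overlap F1 score rather than prompting a judge model, so it does not take `model_config`. A minimal sketch, reusing the `query_response` dict defined in the similarity cell above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# F1 is computed from token overlap between the response and the ground truth;\n",
    "# no judge model (and therefore no model_config) is required.\n",
    "f1_eval = F1ScoreEvaluator()\n",
    "\n",
    "f1_score = f1_eval(\n",
    "    response=query_response[\"response\"],\n",
    "    ground_truth=query_response[\"ground_truth\"],\n",
    ")\n",
    "print(f1_score)"
   ]
  },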
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🧪 AI-assisted Groundedness evaluator\n",
    "- Prompt-based groundedness, which uses your own model deployment to output a score and an explanation for the score, is currently supported in all regions.\n",
    "- The Groundedness Pro evaluator leverages Azure AI Content Safety (AACS) via its integration into Azure AI Foundry evaluations. No deployment is required, as a back-end service provides the model that outputs a score and reasoning. Groundedness Pro is currently supported in the East US 2 and Sweden Central regions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initializing the prompt-based Groundedness evaluator\n",
    "# (GroundednessProEvaluator does not need model_config; see the sketch below)\n",
    "groundedness_eval = GroundednessEvaluator(model_config)\n",
    "\n",
    "query_response = dict(\n",
    "    query=\"Which tent is the most waterproof?\",  # optional\n",
    "    context=\"The Alpine Explorer Tent is the most water-proof of all tents available.\",\n",
    "    response=\"The Alpine Explorer Tent is the most waterproof.\"\n",
    ")\n",
    "\n",
    "# query_response = dict(\n",
    "#     query=\"어떤 텐트가 방수 기능이 있어?\",  # optional; \"Which tent is waterproof?\"\n",
    "#     context=\"알파인 익스플로러 텐트가 모든 텐트 중 가장 방수 기능이 뛰어남\",  # \"The Alpine Explorer Tent is the most waterproof of all tents.\"\n",
    "#     response=\"알파인 익스플로러 텐트가 방수 기능이 있습니다.\"  # \"The Alpine Explorer Tent is waterproof.\"\n",
    "# )\n",
    "\n",
    "# Running the Groundedness evaluator on a query and response pair\n",
    "groundedness_score = groundedness_eval(\n",
    "    **query_response\n",
    ")\n",
    "print(groundedness_score)\n",
    "\n",
    "# Conversation input for the Groundedness evaluator\n",
    "groundedness_conv_score = groundedness_eval(conversation=conversation)\n",
    "print(groundedness_conv_score)"
   ]
  },
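  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### GroundednessProEvaluator (sketch)\n",
    "`GroundednessProEvaluator` is imported at the top of this notebook but not exercised above. The sketch below shows one way it could be wired up, assuming your project is in a supported region (East US 2 or Sweden Central) and that your installed SDK version accepts the Azure AI project scope as a dict of `subscription_id`, `resource_group_name`, and `project_name`. The environment variable names used for those values are illustrative (they are not part of the `.env` settings used elsewhere in this notebook); check the reference document linked at the top for the exact form your SDK version expects."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch only: GroundednessProEvaluator is service-backed, so it takes the Azure AI project\n",
    "# scope and a credential instead of model_config. The environment variable names below\n",
    "# (AZURE_SUBSCRIPTION_ID, AZURE_RESOURCE_GROUP, AZURE_AI_PROJECT_NAME) are assumptions for\n",
    "# this example, not values read elsewhere in this notebook.\n",
    "azure_ai_project = {\n",
    "    \"subscription_id\": os.environ.get(\"AZURE_SUBSCRIPTION_ID\"),\n",
    "    \"resource_group_name\": os.environ.get(\"AZURE_RESOURCE_GROUP\"),\n",
    "    \"project_name\": os.environ.get(\"AZURE_AI_PROJECT_NAME\"),\n",
    "}\n",
    "\n",
    "groundedness_pro_eval = GroundednessProEvaluator(\n",
    "    azure_ai_project=azure_ai_project,\n",
    "    credential=credential,\n",
    ")\n",
    "\n",
    "groundedness_pro_score = groundedness_pro_eval(\n",
    "    **query_response\n",
    ")\n",
    "print(groundedness_pro_score)"
   ]
  },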
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🚀 Run Evaluators Locally"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "input_path = \"data/evaluate_test_data.jsonl\"\n",
    "output_path = \"data/local_evaluation_output.json\"\n",
    "\n",
    "# https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/flow-evaluate-sdk\n",
    "retrieval_evaluator = RetrievalEvaluator(model_config)\n",
    "fluency_evaluator = FluencyEvaluator(model_config)\n",
    "groundedness_evaluator = GroundednessEvaluator(model_config)\n",
    "relevance_evaluator = RelevanceEvaluator(model_config)\n",
    "coherence_evaluator = CoherenceEvaluator(model_config)\n",
    "similarity_evaluator = SimilarityEvaluator(model_config)\n",
    "\n",
    "# Map the JSONL columns to the evaluator input parameters\n",
    "column_mapping = {\n",
    "    \"query\": \"${data.query}\",\n",
    "    \"ground_truth\": \"${data.ground_truth}\",\n",
    "    \"response\": \"${data.response}\",\n",
    "    \"context\": \"${data.context}\",\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "\n",
    "result = evaluate(\n",
    "    evaluation_name=f\"evaluation_local_{int(time.time())}\",\n",
    "    data=input_path,\n",
    "    evaluators={\n",
    "        \"Retrieval\": retrieval_evaluator,\n",
    "        \"Fluency\": fluency_evaluator,\n",
    "        \"Groundedness\": groundedness_evaluator,\n",
    "        \"Relevance\": relevance_evaluator,\n",
    "        \"Coherence\": coherence_evaluator,\n",
    "        \"Similarity\": similarity_evaluator,\n",
    "    },\n",
    "    # The keys in evaluator_config must match the evaluator names above\n",
    "    evaluator_config={\n",
    "        \"Retrieval\": {\"column_mapping\": column_mapping},\n",
    "        \"Fluency\": {\"column_mapping\": column_mapping},\n",
    "        \"Groundedness\": {\"column_mapping\": column_mapping},\n",
    "        \"Relevance\": {\"column_mapping\": column_mapping},\n",
    "        \"Coherence\": {\"column_mapping\": column_mapping},\n",
    "        \"Similarity\": {\"column_mapping\": column_mapping},\n",
    "    },\n",
    "    # If you want to record the results in Azure AI Foundry, add the line below\n",
    "    # azure_ai_project=azure_ai_project_client,\n",
    "    output_path=output_path,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "from collections import Counter\n",
    "\n",
    "with open(output_path, \"r\", encoding=\"utf-8\") as f:\n",
    "    data = json.load(f)\n",
    "\n",
    "rows = data[\"rows\"]\n",
    "\n",
    "rating_fields = [\n",
    "    \"outputs.Retrieval.retrieval\",\n",
    "    \"outputs.Coherence.coherence\",\n",
    "    \"outputs.Fluency.fluency\",\n",
    "    \"outputs.Groundedness.groundedness\",\n",
    "    \"outputs.Relevance.relevance\",\n",
    "    \"outputs.Similarity.similarity\",\n",
    "]\n",
    "\n",
    "fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(18, 15))\n",
    "\n",
    "for ax, field in zip(axes.flatten(), rating_fields):\n",
    "    # Count the occurrences of each score (1-5) for this metric\n",
    "    counter = Counter(row[field] for row in rows if field in row)\n",
    "\n",
    "    x = [1, 2, 3, 4, 5]\n",
    "    y = [counter.get(score, 0) for score in x]\n",
    "\n",
    "    ax.bar(x, y)\n",
    "    ax.set_title(field)\n",
    "    ax.set_xlabel(\"Score\")\n",
    "    ax.set_ylabel(\"Count\")\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "venv_agent",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}