2_eval-design-ptn/02_azure-evaluation-sdk/01.6_batch-eval-with-custom_eval.ipynb

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Batch evaluation with your custom evaluators\n", "The following sample shows the basic way to evaluate a Generative AI application in your development environment with your custom evaluators.\n", "\n", "> ✨ ***Note*** <br>\n", "> Please check the reference document before you get started - https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk" ] }, { "cell_type": "markdown", "metadata": { "vscode": { "languageId": "shellscript" } }, "source": [ "## 🔨 Current Support and Limitations (as of 2025-01-14) \n", "- Check the region support for the Azure AI Evaluation SDK. https://learn.microsoft.com/en-us/azure/ai-studio/concepts/evaluation-metrics-built-in?tabs=warning#region-support\n", "\n", "### Region support for evaluations\n", "| Region | Hate and Unfairness, Sexual, Violent, Self-Harm, XPIA, ECI (Text) | Groundedness (Text) | Protected Material (Text) | Hate and Unfairness, Sexual, Violent, Self-Harm, Protected Material (Image) |\n", "|---------------------|------------------------------------------------------------------|---------------------|----------------------------|----------------------------------------------------------------------------|\n", "| North Central US | no | no | no | yes |\n", "| East US 2 | yes | yes | yes | yes |\n", "| Sweden Central | yes | yes | yes | yes |\n", "| US North Central | yes | no | yes | yes |\n", "| France Central | yes | yes | yes | yes |\n", "| Switzerland West | yes | no | no | yes |\n", "\n", "### Region support for adversarial simulation\n", "| Region | Adversarial Simulation (Text) | Adversarial Simulation (Image) |\n", "|-------------------|-------------------------------|---------------------------------|\n", "| UK South | yes | no |\n", "| East US 2 | yes | yes |\n", "| Sweden Central | yes | yes |\n", "| US North Central | yes | yes |\n", "| France Central | yes | no |\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ✔️ Pricing and billing\n", "- Effective 1/14/2025, Azure AI Safety Evaluations will no longer be free in public preview. 
It will be billed based on consumption as follows:\n", "\n", "| Service Name | Safety Evaluations | Price Per 1K Tokens (USD) |\n", "|---------------------------|--------------------------|---------------------------|\n", "| Azure Machine Learning | Input pricing for 3P | $0.02 |\n", "| Azure Machine Learning | Output pricing for 3P | $0.06 |\n", "| Azure Machine Learning | Input pricing for 1P | $0.012 |\n", "| Azure Machine Learning | Output pricing for 1P | $0.012 |\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import os\n", "import json\n", "import pathlib\n", "\n", "from pprint import pprint\n", "from azure.identity import DefaultAzureCredential\n", "from dotenv import load_dotenv\n", "from azure.ai.projects import AIProjectClient\n", "from azure.ai.projects.models import (\n", " Evaluation,\n", " Dataset,\n", " EvaluatorConfiguration,\n", " ConnectionType,\n", " EvaluationSchedule,\n", " RecurrenceTrigger,\n", " ApplicationInsightsConfiguration,\n", ")\n", "\n", "from azure.ai.evaluation import evaluate\n", "from azure.ai.evaluation import (\n", " ContentSafetyEvaluator,\n", " RelevanceEvaluator,\n", " CoherenceEvaluator,\n", " GroundednessEvaluator,\n", " GroundednessProEvaluator,\n", " FluencyEvaluator,\n", " SimilarityEvaluator,\n", " F1ScoreEvaluator,\n", " RetrievalEvaluator,\n", ")\n", "\n", "from azure.ai.ml import MLClient\n", "\n", "\n", "load_dotenv(override=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 🚀 Run Evaluators in Azure Cloud (azure.ai.evaluation.evaluate)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "credential = DefaultAzureCredential()\n", "\n", "azure_ai_project_conn_str = os.environ.get(\"AZURE_AI_PROJECT_CONN_STR\")\n", "subscription_id = azure_ai_project_conn_str.split(\";\")[1]\n", "resource_group_name = azure_ai_project_conn_str.split(\";\")[2]\n", "project_name = azure_ai_project_conn_str.split(\";\")[3]\n", "\n", "azure_ai_project_dict = {\n", " \"subscription_id\": subscription_id,\n", " \"resource_group_name\": resource_group_name,\n", " \"project_name\": project_name,\n", "}\n", "\n", "azure_ai_project_client = AIProjectClient.from_connection_string(\n", " credential=DefaultAzureCredential(), conn_str=azure_ai_project_conn_str\n", ")\n", "\n", "\n", "model_config = {\n", " \"azure_endpoint\": os.environ.get(\"AZURE_OPENAI_ENDPOINT\"),\n", " \"api_key\": os.environ.get(\"AZURE_OPENAI_API_KEY\"),\n", " \"azure_deployment\": os.environ.get(\"AZURE_OPENAI_CHAT_DEPLOYMENT_NAME\"),\n", " \"api_version\": os.environ.get(\"AZURE_OPENAI_API_VERSION\"),\n", " \"type\": \"azure_openai\",\n", "}\n", "\n", "ml_client = MLClient(credential, subscription_id, resource_group_name, project_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set the evaluation dataset\n", "- Use the query dataset stored in your storage account. These query-response records serve as the seed for the evaluation. 
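\n", "\n", "Each line of the evaluation JSONL file is expected to look roughly like the following record; the field names match the column mapping used later in this notebook:\n", "\n", "```json\n", "{\"query\": \"...\", \"context\": \"...\", \"response\": \"...\", \"ground_truth\": \"...\"}\n", "```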
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from openai import AzureOpenAI\n", "\n", "\n", "aoai_api_endpoint = os.getenv(\"AZURE_OPENAI_ENDPOINT\")\n", "aoai_api_key = os.getenv(\"AZURE_OPENAI_API_KEY\")\n", "aoai_api_version = os.getenv(\"AZURE_OPENAI_API_VERSION\")\n", "aoai_deployment_name = os.getenv(\"AZURE_OPENAI_CHAT_DEPLOYMENT_NAME\")\n", "\n", "try:\n", " client = AzureOpenAI(\n", " azure_endpoint=aoai_api_endpoint,\n", " api_key=aoai_api_key,\n", " api_version=aoai_api_version,\n", " )\n", "\n", " print(\"=== Initialized AzuureOpenAI client ===\")\n", " print(f\"AZURE_OPENAI_ENDPOINT={aoai_api_endpoint}\")\n", " print(f\"AZURE_OPENAI_API_VERSION={aoai_api_version}\")\n", " print(f\"AZURE_OPENAI_DEPLOYMENT_NAME={aoai_deployment_name}\")\n", "\n", "except (ValueError, TypeError) as e:\n", " print(e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### You have all the data (Query + Context + Response + Ground Truth)\n", "- Assume that you already have all the data (Query + Context + Response + Ground Truth) in a local folder" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "input_path = \"./data/sythetic_evaluation_data.jsonl\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 🚀 Run Evaluators for local and upload to cloud (azure.ai.evaluation.evaluate)\n", "- set up your custom GroundnessEvaluator, ExactMatchEvaluator for Local environment with the following steps." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "from typing_extensions import override\n", "\n", "class CustomGroundednessEvaluator(GroundednessEvaluator):\n", " \"\"\"\n", " Evaluates groundedness score for a given query (optional), response, and context or a multi-turn conversation,\n", " including reasoning.\n", "\n", " The groundedness measure assesses the correspondence between claims in an AI-generated answer and the source\n", " context, making sure that these claims are substantiated by the context. Even if the responses from LLM are\n", " factually correct, they'll be considered ungrounded if they can't be verified against the provided sources\n", " (such as your input source or your database). Use the groundedness metric when you need to verify that\n", " AI-generated responses align with and are validated by the provided context.\n", "\n", " Groundedness scores range from 0.0 to 1.0, with 0.0 being the least grounded and 1.0 being the grounded.\n", "\n", " :param model_config: Configuration for the Azure OpenAI model.\n", " :type model_config: Union[~azure.ai.evaluation.AzureOpenAIModelConfiguration,\n", " ~azure.ai.evaluation.OpenAIModelConfiguration]\n", "\n", " .. admonition:: Example:\n", "\n", " .. literalinclude:: ../samples/evaluation_samples_evaluate.py\n", " :start-after: [START groundedness_evaluator]\n", " :end-before: [END groundedness_evaluator]\n", " :language: python\n", " :dedent: 8\n", " :caption: Initialize and call a GroundednessEvaluator.\n", "\n", " .. 
note::\n", "\n", " To align with our support of a diverse set of models, an output key without the `gpt_` prefix has been added.\n", " To maintain backwards compatibility, the old key with the `gpt_` prefix is still be present in the output;\n", " however, it is recommended to use the new key moving forward as the old key will be deprecated in the future.\n", " \"\"\"\n", " current_dir = os.getcwd()\n", " \n", " # need to set the new prompty file path because the variables are still used in the parent call method\n", " _PROMPTY_FILE_NO_QUERY = os.path.join(current_dir, 'custom-groundedness.prompty')\n", " _PROMPTY_FILE_WITH_QUERY = os.path.join(current_dir, 'custom-groundedness.prompty')\n", "\n", " \n", " @override\n", " def __init__(self, model_config):\n", " \n", " super().__init__(model_config)\n", " current_dir = os.getcwd()\n", " prompty_path = os.path.join(current_dir, \"custom-groundedness.prompty\") # Default to no query\n", " super(GroundednessEvaluator, self).__init__(model_config=model_config, prompty_file=prompty_path, result_key=\"custom-groundedness\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from ExactMatchEvaluator._exact_match import ExactMatchEvaluator\n", "\n", "# check the source code of the ExactMatchEvaluator\n", "# This is a custom evaluator that relates to the task of evaluating in local environment\n", "'''\n", " class ExactMatchEvaluator(EvaluatorBase):\n", " \"\"\"\n", " Evaluates whether the response exactly matches the ground truth.\n", "\n", " This evaluator returns a score of 1.0 if the response is identical to the ground truth,\n", " and 0.0 otherwise. It is useful for tasks that require strict correctness such as factual QA.\n", "\n", " Example:\n", " evaluator = ExactMatchEvaluator()\n", " result = evaluator(ground_truth=\"Hello, world!\", response=\"Hello, world!\")\n", " print(result) # {'exact_match_score': 1.0}\n", " \"\"\"\n", "\n", " id = \"update with your azure ml asset id\"\n", " \"\"\"Evaluator identifier, experimental and to be used only with evaluation in cloud.\"\"\"\n", "\n", " @override\n", " def __init__(self):\n", " super().__init__()\n", "\n", " @override\n", " async def _do_eval(self, eval_input: Dict) -> Dict[str, float]:\n", " \"\"\"Evaluate whether the response matches the ground truth exactly.\"\"\"\n", " ground_truth = eval_input[\"ground_truth\"].strip()\n", " response = eval_input[\"response\"].strip()\n", " \n", " score = 1.0 if ground_truth == response else 0.0\n", "\n", " return {\n", " \"exact_match_score\": score,\n", " }\n", "\n", " @overload\n", " def __call__(self, *, ground_truth: str, response: str):\n", " \"\"\"\n", " Evaluate whether the response matches the ground truth exactly.\n", "\n", " :keyword response: The response to be evaluated.\n", " :paramtype response: str\n", " :keyword ground_truth: The ground truth to be compared against.\n", " :paramtype ground_truth: str\n", " :return: The exact match score.\n", " :rtype: Dict[str, float]\n", " \"\"\"\n", "\n", " @override\n", " def __call__(self, *args, **kwargs):\n", " \"\"\"Evaluate whether the response matches the ground truth exactly.\"\"\"\n", " return super().__call__(*args, **kwargs)\n", "\n", "'''" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "exactMatchEvaluator = ExactMatchEvaluator()\n", "\n", "exactMatch = exactMatchEvaluator(\n", " ground_truth=\"What is the speed of light?\", response=\"What is the speed of light?\"\n", ")\n", "\n", "exactMatch" 
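, "\n", "# Expected output, per the ExactMatchEvaluator docstring above: {'exact_match_score': 1.0}"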
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# input_path is reused from the cell above\n", "output_path = \"./data/local_upload_cloud_evaluation_output.json\"\n", "\n", "\n", "# https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/flow-evaluate-sdk\n", "\n", "retrieval_evaluator = RetrievalEvaluator(model_config)\n", "custom_groundedness_evaluator = CustomGroundednessEvaluator(model_config)\n", "relevance_evaluator = RelevanceEvaluator(model_config)\n", "similarity_evaluator = SimilarityEvaluator(model_config)\n", "exactMatchEvaluator = ExactMatchEvaluator()\n", "\n", "column_mapping = {\n", " \"query\": \"${data.query}\",\n", " \"ground_truth\": \"${data.ground_truth}\",\n", " \"response\": \"${data.response}\",\n", " \"context\": \"${data.context}\",\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import datetime\n", "\n", "result = evaluate(\n", " evaluation_name=f\"custom_evaluation_local_upload_cloud_{datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\",\n", " data=input_path,\n", " evaluators={\n", " \"retrieval\": retrieval_evaluator,\n", " \"custom-groundedness\": custom_groundedness_evaluator,\n", " \"relevance\": relevance_evaluator,\n", " \"similarity\": similarity_evaluator,\n", " \"exact_match\": exactMatchEvaluator,\n", " },\n", " evaluator_config={\n", " \"retrieval\": {\"column_mapping\": column_mapping},\n", " \"custom-groundedness\": {\"column_mapping\": column_mapping},\n", " \"relevance\": {\"column_mapping\": column_mapping},\n", " \"similarity\": {\"column_mapping\": column_mapping},\n", " \"exact_match\": {\"column_mapping\": column_mapping},\n", " },\n", " azure_ai_project=azure_ai_project_dict,\n", " output_path=output_path,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualize evaluation results as HTML\n", "- You can visualize the evaluation results as HTML with the following steps."
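, "\n", "Before generating the HTML report, you can take a quick look at the `result` object returned by `evaluate()` above. This is a minimal sketch, assuming the returned dictionary exposes `metrics` and `rows` keys (the JSON file written to `output_path` carries the same `rows` structure used by the report below):\n", "\n", "```python\n", "pprint(result[\"metrics\"])  # aggregate score per evaluator\n", "pd.DataFrame(result[\"rows\"]).head()  # per-row inputs and evaluator outputs\n", "```"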
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "import matplotlib.pyplot as plt\n", "import base64\n", "from io import BytesIO\n", "import numpy as np\n", "from datetime import datetime\n", "import json\n", "\n", "def generate_evaluation_report(data_file):\n", " import matplotlib.pyplot as plt\n", "\n", " # Load JSON file\n", " with open(data_file, 'r', encoding='utf-8') as f:\n", " data = json.load(f)\n", " rows = data.get('rows', [])\n", "\n", " # Define score ranges\n", " score_range = [0, 1, 2, 3, 4, 5]\n", " freq_retrieval = {score: 0 for score in score_range}\n", " freq_relevance = {score: 0 for score in score_range}\n", " freq_similarity = {score: 0 for score in score_range}\n", " freq_groundedness = {score: 0 for score in score_range}\n", " freq_exact_match = {score: 0 for score in [0, 1]}\n", "\n", " # Count occurrences of each score\n", " for row in rows:\n", " retrieval_score = row.get('outputs.retrieval.retrieval', 0)\n", " relevance_score = row.get('outputs.relevance.relevance', 0)\n", " similarity_score = row.get('outputs.similarity.similarity', 0)\n", " groundedness_score = row.get('outputs.custom-groundedness.custom-groundedness', 0)\n", " exact_match_score = row.get('outputs.exact_match.exact_match_score', 0)\n", " \n", " if retrieval_score in freq_retrieval:\n", " freq_retrieval[retrieval_score] += 1\n", " if relevance_score in freq_relevance:\n", " freq_relevance[relevance_score] += 1\n", " if similarity_score in freq_similarity:\n", " freq_similarity[similarity_score] += 1\n", " if groundedness_score in freq_groundedness:\n", " freq_groundedness[groundedness_score] += 1\n", " if exact_match_score in freq_exact_match:\n", " freq_exact_match[exact_match_score] += 1\n", "\n", " # Function to generate bar chart\n", " def generate_chart(freq_dict, title):\n", " fig, ax = plt.subplots()\n", " x = np.arange(len(freq_dict))\n", " ax.bar(x, [freq_dict.get(score, 0) for score in freq_dict], width=0.5, label=title)\n", " ax.set_xticks(x)\n", " ax.set_xticklabels(list(freq_dict.keys()))\n", " ax.set_xlabel('Score')\n", " ax.set_ylabel('Frequency')\n", " ax.set_title(title)\n", " ax.legend()\n", " buf = BytesIO()\n", " plt.savefig(buf, format='png')\n", " buf.seek(0)\n", " chart_data = base64.b64encode(buf.read()).decode('utf-8')\n", " plt.close(fig)\n", " return chart_data\n", "\n", " # Generate charts\n", " retrieval_chart = generate_chart(freq_retrieval, 'Retrieval Score Distribution')\n", " exact_match_chart = generate_chart(freq_exact_match, 'Exact Match Score Distribution')\n", "\n", " # Generate combined response chart\n", " def generate_response_chart():\n", " fig, ax = plt.subplots()\n", " x = np.arange(len(score_range))\n", " width = 0.3\n", " ax.bar(x - width, [freq_relevance[score] for score in score_range], width, label='Relevance')\n", " ax.bar(x, [freq_similarity[score] for score in score_range], width, label='Similarity')\n", " ax.bar(x + width, [freq_groundedness[score] for score in score_range], width, label='Custom-Groundedness')\n", " ax.set_xticks(x)\n", " ax.set_xticklabels(score_range)\n", " ax.set_xlabel('Score')\n", " ax.set_ylabel('Frequency')\n", " ax.set_title('Response Score Distribution')\n", " ax.legend()\n", " buf = BytesIO()\n", " plt.savefig(buf, format='png')\n", " buf.seek(0)\n", " chart_data = base64.b64encode(buf.read()).decode('utf-8')\n", " plt.close(fig)\n", " return chart_data\n", "\n", " response_chart = generate_response_chart()\n", "\n", " # Generate HTML table\n", " 
table_html = '<table border=\"1\" style=\"border-collapse: collapse;\"><tr><th>Query</th><th>Response</th><th>Relevance</th><th>Similarity</th><th>Custom-Groundedness</th><th>Exact Match</th></tr>'\n", " for row in rows:\n", " table_html += f\"<tr><td>{row.get('inputs.query', '')}</td><td>{row.get('inputs.response', '')}</td><td>{row.get('outputs.relevance.relevance', '')}</td><td>{row.get('outputs.similarity.similarity', '')}</td><td>{row.get('outputs.custom-groundedness.custom-groundedness', '')}</td><td>{row.get('outputs.exact_match.exact_match_score', '')}</td></tr>\"\n", " table_html += '</table>'\n", "\n", " # Generate HTML content\n", " html_content = f\"\"\"\n", " <html>\n", " <head>\n", " <meta charset=\"UTF-8\">\n", " <title>Evaluation Results</title>\n", " <style>\n", " .image-container {{ display: flex; flex-wrap: wrap; justify-content: space-around; }}\n", " .image-container div {{ margin: 10px; text-align: center; }}\n", " img {{ max-width: 100%; height: auto; }}\n", " </style>\n", " </head>\n", " <body>\n", " <h1>Evaluation Results</h1>\n", " <div class=\"image-container\">\n", " <div><h2>Retrieval Score Distribution</h2><img src=\"data:image/png;base64,{retrieval_chart}\"/></div>\n", " <div><h2>Response Score Distribution</h2><img src=\"data:image/png;base64,{response_chart}\"/></div>\n", " <div><h2>Exact Match Score Distribution</h2><img src=\"data:image/png;base64,{exact_match_chart}\"/></div>\n", " </div>\n", " <h2>Results Table</h2>\n", " {table_html}\n", " </body>\n", " </html>\n", " \"\"\"\n", "\n", " # Save HTML file with timestamp\n", " timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')\n", " filename = f'{timestamp}_evaluation_results.html'\n", " with open(filename, 'w', encoding='utf-8') as f:\n", " f.write(html_content)\n", "\n", " print(f\"HTML file '{filename}' generated.\")\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "generate_evaluation_report('data/local_upload_cloud_evaluation_output.json')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![\"evaluation_result\"](images/evaluation_result_local_upload_cloud1.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 🚀 Run Evaluators in Azure Cloud (azure.ai.projects.models.Evaluation)\n", "- set up your custom evaluator for cloud environment with the following steps." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import logging\n", "from azure.core.exceptions import ResourceNotFoundError, ResourceExistsError\n", "from azure.ai.ml.entities import Model\n", "\n", "logger = logging.getLogger(__name__)\n", "\n", "\n", "def get_or_create_model_asset(\n", " ml_client,\n", " model_name,\n", " model_dir=\"ExactMatchEvaluator\",\n", " model_type=\"custom_model\",\n", " update=True,\n", "):\n", "\n", " try:\n", " latest_model_version = max(\n", " [int(m.version) for m in ml_client.models.list(name=model_name)]\n", " )\n", " if update:\n", " raise ResourceExistsError(\"Found Model asset, but will update the Model.\")\n", " else:\n", " model_asset = ml_client.models.get(\n", " name=model_name, version=latest_model_version\n", " )\n", " logger.info(f\"Found Model asset: {model_name}. 
Will not create again\")\n", " except (ResourceNotFoundError, ResourceExistsError) as e:\n", "\n", " logger.info(f\"Exception: {e}\")\n", " run_model = Model(\n", " name=model_name,\n", " path=model_dir,\n", " description=\"Model created from run.\",\n", " properties={\n", " \"_default-display-file\": \"./ExactMatchEvaluator/_exact_match_cloud.py\",\n", " \"is-evaluator\": True,\n", " \"is-promptflow\": True,\n", " \"show-artifact\": True,\n", " },\n", " type=model_type,\n", " )\n", " model_asset = ml_client.models.create_or_update(run_model)\n", " logger.info(f\"Created Model asset: {model_name}\")\n", "\n", " return model_asset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "exact_match_model_asset = get_or_create_model_asset(\n", " ml_client,\n", " model_name=\"ExactMatchEvaluator\",\n", " model_dir=\"ExactMatchEvaluator\",\n", " model_type=\"custom_model\",\n", " update=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get the azure ml asset id from Azure Machine Learning\n", "- You can get the asset id from Azure Machine Learning workspace.\n", "![azureml asset id](images/copy_azure_ml_asset_id.png)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# id for each evaluator can be found in your AI Studio registry - please see documentation for more information\n", "# init_params is the configuration for the model to use to perform the evaluation\n", "# data_mapping is used to map the output columns of your query to the names required by the evaluator\n", "# Evaluator parameter format - https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk#evaluator-parameter-format\n", "evaluators_cloud = {\n", " \"exact_match\": EvaluatorConfiguration(\n", " # need to check the azure machine learning service to find the appropriate id\n", " # id=\"azureml://locations/swedencentral/workspaces/c68cb823-8b8f-4f88-bcc0-2c9f49675905/models/ExactMatchEvaluator/versions/5\",\n", " id=\"your azureml asset id for ExactMatchEvaluator\",\n", " data_mapping={\n", " \"query\": \"${data.query}\",\n", " \"response\": \"${data.predicted}\",\n", " \"ground_truth\": \"${data.actual}\",\n", " },\n", " ),\n", " \"similarity\": EvaluatorConfiguration(\n", " # currently bug in the SDK, please use the id below\n", " # id=SimilarityEvaluator.id,\n", " id=\"azureml://registries/azureml/models/Similarity-Evaluator/versions/3\",\n", " init_params={\"model_config\": model_config},\n", " data_mapping={\n", " \"query\": \"${data.query}\",\n", " \"response\": \"${data.predicted}\",\n", " \"ground_truth\": \"${data.actual}\",\n", " },\n", " ),\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data\n", "- The following code demonstrates how to upload the data for evaluation to your Azure AI project. Below we use evaluate_test_data.jsonl which exemplifies LLM-generated data in the query-response format expected by the Azure AI Evaluation SDK. For your use case, you should upload data in the same format, which can be generated using the Simulator from Azure AI Evaluation SDK.\n", "\n", "- Alternatively, if you already have an existing dataset for evaluation, you can use that by finding the link to your dataset in your registry or find the dataset ID." 
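, "\n", "For reference, a record in the `custom_data.jsonl` file uploaded below is expected to look roughly like the following; the field names `query`, `predicted`, and `actual` are assumed here based on the `data_mapping` entries configured above:\n", "\n", "```json\n", "{\"query\": \"...\", \"predicted\": \"...\", \"actual\": \"...\"}\n", "```"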
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# # Upload data for evaluation\n", "data_id, _ = azure_ai_project_client.upload_file(\"data/custom_data.jsonl\")\n", "# data_id = \"azureml://registries/<registry>/data/<dataset>/versions/<version>\"\n", "# To use an existing dataset, replace the above line with the following line\n", "# data_id = \"<dataset_id>\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Configure Evaluators to Run\n", "- The code below demonstrates how to configure the evaluators you want to run. In this example, we use the F1ScoreEvaluator, RelevanceEvaluator and the ViolenceEvaluator, but all evaluators supported by Azure AI Evaluation are supported by cloud evaluation and can be configured here. You can either import the classes from the SDK and reference them with the .id property, or you can find the fully formed id of the evaluator in the AI Studio registry of evaluators, and use it here. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "evaluation = Evaluation(\n", " display_name=f\"custom_evaluation_cloud_{datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\",\n", " description=\"Cloud Evaluation of dataset\",\n", " data=Dataset(id=data_id),\n", " evaluators=evaluators_cloud,\n", ")\n", "\n", "# Create evaluation\n", "evaluation_response = azure_ai_project_client.evaluations.create(\n", " evaluation=evaluation,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from tqdm import notebook\n", "import time\n", "\n", "\n", "# Monitor the status of the run_result\n", "def monitor_status(project_client: AIProjectClient, evaluation_response_id: str):\n", " with notebook.tqdm(total=3, desc=\"Running Status\", unit=\"step\") as pbar:\n", " status = project_client.evaluations.get(evaluation_response_id).status\n", " if status == \"Queued\":\n", " pbar.update(1)\n", " while status != \"Completed\" and status != \"Failed\":\n", " if status == \"Running\" and pbar.n < 2:\n", " pbar.update(1)\n", " notebook.tqdm.write(f\"Current Status: {status}\")\n", " time.sleep(10)\n", " status = project_client.evaluations.get(evaluation_response_id).status\n", " while pbar.n < 3:\n", " pbar.update(1)\n", " notebook.tqdm.write(\"Operation Completed\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "monitor_status(azure_ai_project_client, evaluation_response.id)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Add custom chart \n", "- After running the evaluation, you can add custom chart to visualize the evaluation results in Azure AI Foundry. \n", "![custom chart](images/exact_match_chart.png)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "venv_eval", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.11" } }, "nbformat": 4, "nbformat_minor": 2 }