2_eval-design-ptn/02_azure-evaluation-sdk/01.5_custom-evaluator.ipynb (450 lines of code) (raw):

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Custom Evaluator with the Azure AI Evaluation SDK\n", "The following sample shows the basic way to create custom evaluator to test a Generative AI application in your development environment with the Azure AI evaluation SDK.\n", "\n", "> โœจ ***Note*** <br>\n", "> Please check the reference document before you get started - https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk" ] }, { "cell_type": "markdown", "metadata": { "vscode": { "languageId": "shellscript" } }, "source": [ "## ๐Ÿ”จ Current Support and Limitations (as of 2025-01-14) \n", "- Check the region support for the Azure AI Evaluation SDK. https://learn.microsoft.com/en-us/azure/ai-studio/concepts/evaluation-metrics-built-in?tabs=warning#region-support\n", "\n", "### Region support for evaluations\n", "| Region | Hate and Unfairness, Sexual, Violent, Self-Harm, XPIA, ECI (Text) | Groundedness (Text) | Protected Material (Text) | Hate and Unfairness, Sexual, Violent, Self-Harm, Protected Material (Image) |\n", "|---------------------|------------------------------------------------------------------|---------------------|----------------------------|----------------------------------------------------------------------------|\n", "| North Central US | no | no | no | yes |\n", "| East US 2 | yes | yes | yes | yes |\n", "| Sweden Central | yes | yes | yes | yes |\n", "| US North Central | yes | no | yes | yes |\n", "| France Central | yes | yes | yes | yes |\n", "| Switzerland West | yes | no | no | yes |\n", "\n", "### Region support for adversarial simulation\n", "| Region | Adversarial Simulation (Text) | Adversarial Simulation (Image) |\n", "|-------------------|-------------------------------|---------------------------------|\n", "| UK South | yes | no |\n", "| East US 2 | yes | yes |\n", "| Sweden Central | yes | yes |\n", "| US North Central | yes | yes |\n", "| France Central | yes | no |\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## โœ”๏ธ Pricing and billing\n", "- Effective 1/14/2025, Azure AI Safety Evaluations will no longer be free in public preview. It will be billed based on consumption as following:\n", "\n", "| Service Name | Safety Evaluations | Price Per 1K Tokens (USD) |\n", "|---------------------------|--------------------------|---------------------------|\n", "| Azure Machine Learning | Input pricing for 3P | $0.02 |\n", "| Azure Machine Learning | Output pricing for 3P | $0.06 |\n", "| Azure Machine Learning | Input pricing for 1P | $0.012 |\n", "| Azure Machine Learning | Output pricing for 1P | $0.012 |\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import os\n", "import json\n", "\n", "from pprint import pprint\n", "from azure.ai.evaluation import evaluate\n", "from azure.ai.evaluation import RelevanceEvaluator\n", "from azure.ai.evaluation import GroundednessEvaluator, GroundednessProEvaluator\n", "from azure.identity import DefaultAzureCredential\n", "from dotenv import load_dotenv\n", "from azure.ai.projects import AIProjectClient\n", "from azure.ai.projects.models import (\n", " Evaluation,\n", " Dataset,\n", " EvaluatorConfiguration,\n", " ConnectionType,\n", " EvaluationSchedule,\n", " RecurrenceTrigger,\n", " ApplicationInsightsConfiguration\n", ")\n", "import pathlib\n", "\n", "from azure.ai.evaluation import evaluate\n", "from azure.ai.evaluation import (\n", " ContentSafetyEvaluator,\n", " RelevanceEvaluator,\n", " CoherenceEvaluator,\n", " GroundednessEvaluator,\n", " FluencyEvaluator,\n", " SimilarityEvaluator,\n", " F1ScoreEvaluator,\n", " RetrievalEvaluator\n", ")\n", "\n", "\n", "\n", "load_dotenv(override=True)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "credential = DefaultAzureCredential()\n", "\n", "azure_ai_project_conn_str = os.environ.get(\"AZURE_AI_PROJECT_CONN_STR\")\n", "subscription_id = azure_ai_project_conn_str.split(\";\")[1]\n", "resource_group_name = azure_ai_project_conn_str.split(\";\")[2]\n", "project_name = azure_ai_project_conn_str.split(\";\")[3]\n", "\n", "azure_ai_project_dict = {\n", " \"subscription_id\": subscription_id,\n", " \"resource_group_name\": resource_group_name,\n", " \"project_name\": project_name,\n", "}\n", "\n", "azure_ai_project_client = AIProjectClient.from_connection_string(\n", " credential=DefaultAzureCredential(),\n", " conn_str=azure_ai_project_conn_str\n", ")\n", "\n", "\n", "model_config = {\n", " \"azure_endpoint\": os.environ.get(\"AZURE_OPENAI_ENDPOINT\"),\n", " \"api_key\": os.environ.get(\"AZURE_OPENAI_API_KEY\"),\n", " \"azure_deployment\": os.environ.get(\"AZURE_OPENAI_CHAT_DEPLOYMENT_NAME\"),\n", " \"api_version\": os.environ.get(\"AZURE_OPENAI_API_VERSION\"),\n", " \"type\": \"azure_openai\",\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "input_path = \"data/sythetic_evaluation_data.jsonl\"\n", "output_path = \"data/custom_evaluation_output.json\"\n", "\n", "\n", "# https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/flow-evaluate-sdk\n", "retrieval_evaluator = RetrievalEvaluator(model_config)\n", "fluency_evaluator = FluencyEvaluator(model_config)\n", "groundedness_evaluator = GroundednessEvaluator(model_config)\n", "relevance_evaluator = RelevanceEvaluator(model_config)\n", "coherence_evaluator = CoherenceEvaluator(model_config)\n", "similarity_evaluator = SimilarityEvaluator(model_config)\n", "\n", "column_mapping = {\n", " \"query\": \"${data.query}\",\n", " \"ground_truth\": \"${data.ground_truth}\",\n", " \"response\": \"${data.response}\",\n", " \"context\": \"${data.context}\",\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ๐Ÿงช AI-assisted Groundedness evaluator\n", "- Prompt-based groundedness using your own model deployment to output a score and an explanation for the score is currently supported in all regions.\n", "- Groundedness Pro evaluator leverages Azure AI Content Safety Service (AACS) via integration into the Azure AI Foundry evaluations. No deployment is required, as a back-end service will provide the models for you to output a score and reasoning. Groundedness Pro is currently supported in the East US 2 and Sweden Central regions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "# Initialzing Groundedness and Groundedness Pro evaluators\n", "groundedness_eval = GroundednessEvaluator(model_config)\n", "# No need to set the model_config for GroundednessProEvaluator\n", "\n", "query_response = dict(\n", " query=\"Which tent is the most waterproof?\", # optional\n", " context=\"The Alpine Explorer Tent is the most water-proof of all tents available.\",\n", " response=\"The Alpine Explorer Tent is the most waterproof.\"\n", ")\n", "\n", "\n", "# query_response = dict(\n", "# query=\"์–ด๋–ค ํ…ํŠธ๊ฐ€ ๋ฐฉ์ˆ˜ ๊ธฐ๋Šฅ์ด ์žˆ์–ด?\", # optional\n", "# context=\"์•ŒํŒŒ์ธ ์ต์Šคํ”Œ๋กœ๋Ÿฌ ํ…ํŠธ๊ฐ€ ๋ชจ๋“  ํ…ํŠธ ์ค‘ ๊ฐ€์žฅ ๋ฐฉ์ˆ˜ ๊ธฐ๋Šฅ์ด ๋›ฐ์–ด๋‚จ\",\n", "# response=\"์•ŒํŒŒ์ธ ์ต์Šคํ”Œ๋กœ๋Ÿฌ ํ…ํŠธ๊ฐ€ ๋ฐฉ์ˆ˜ ๊ธฐ๋Šฅ์ด ์žˆ์Šต๋‹ˆ๋‹ค.\"\n", "# )\n", "\n", "# Running Groundedness Evaluator on a query and response pair\n", "groundedness_score = groundedness_eval(\n", " **query_response\n", ")\n", "print(groundedness_score)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ๐Ÿงช Customize prebuilt GroundnessEvaluator\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "import os\n", "from typing_extensions import override\n", "\n", "\n", "# Since the prebuilt evaluators are not designed to ouput the results as boolean values, you need to use numbers to represent the boolean values\n", "# 1 for True and 0 for False\n", "\n", "\n", "class CustomGroundednessEvaluator(GroundednessEvaluator):\n", " \"\"\"\n", " Evaluates groundedness score for a given query (optional), response, and context or a multi-turn conversation,\n", " including reasoning.\n", "\n", " The groundedness measure assesses the correspondence between claims in an AI-generated answer and the source\n", " context, making sure that these claims are substantiated by the context. Even if the responses from LLM are\n", " factually correct, they'll be considered ungrounded if they can't be verified against the provided sources\n", " (such as your input source or your database). Use the groundedness metric when you need to verify that\n", " AI-generated responses align with and are validated by the provided context.\n", "\n", " Groundedness scores range from 0.0 to 1.0, with 0.0 being the least grounded and 1.0 being the grounded.\n", "\n", " :param model_config: Configuration for the Azure OpenAI model.\n", " :type model_config: Union[~azure.ai.evaluation.AzureOpenAIModelConfiguration,\n", " ~azure.ai.evaluation.OpenAIModelConfiguration]\n", "\n", " .. admonition:: Example:\n", "\n", " .. literalinclude:: ../samples/evaluation_samples_evaluate.py\n", " :start-after: [START groundedness_evaluator]\n", " :end-before: [END groundedness_evaluator]\n", " :language: python\n", " :dedent: 8\n", " :caption: Initialize and call a GroundednessEvaluator.\n", "\n", " .. note::\n", "\n", " To align with our support of a diverse set of models, an output key without the `gpt_` prefix has been added.\n", " To maintain backwards compatibility, the old key with the `gpt_` prefix is still be present in the output;\n", " however, it is recommended to use the new key moving forward as the old key will be deprecated in the future.\n", " \"\"\"\n", " \n", " # need to set the new prompty file path because the variables are still used in the parent call method\n", " current_dir = os.getcwd()\n", " _PROMPTY_FILE_NO_QUERY = os.path.join(current_dir, \"custom-groundedness.prompty\") \n", " _PROMPTY_FILE_WITH_QUERY = os.path.join(current_dir, \"custom-groundedness.prompty\") \n", "\n", " \n", " @override\n", " def __init__(self, model_config):\n", " \n", " super().__init__(model_config)\n", " current_dir = os.getcwd()\n", " prompty_path = os.path.join(current_dir, \"custom-groundedness.prompty\") # Default to no query\n", " super(GroundednessEvaluator, self).__init__(model_config=model_config, prompty_file=prompty_path, result_key=\"custom-groundedness\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Initialzing Groundedness and Groundedness Pro evaluators\n", "custom_groundedness_eval = CustomGroundednessEvaluator(model_config)\n", "# No need to set the model_config for GroundednessProEvaluator\n", "\n", "query_response = dict(\n", " query=\"Which tent is the most waterproof?\", # optional\n", " context=\"The Alpine Explorer Tent is the most water-proof of all tents available.\",\n", " response=\"The Alpine Explorer Tent is the most waterproof.\"\n", ")\n", "\n", "\n", "# query_response = dict(\n", "# query=\"์–ด๋–ค ํ…ํŠธ๊ฐ€ ๋ฐฉ์ˆ˜ ๊ธฐ๋Šฅ์ด ์žˆ์–ด?\", # optional\n", "# context=\"์•ŒํŒŒ์ธ ์ต์Šคํ”Œ๋กœ๋Ÿฌ ํ…ํŠธ๊ฐ€ ๋ชจ๋“  ํ…ํŠธ ์ค‘ ๊ฐ€์žฅ ๋ฐฉ์ˆ˜ ๊ธฐ๋Šฅ์ด ๋›ฐ์–ด๋‚จ\",\n", "# response=\"์•ŒํŒŒ์ธ ์ต์Šคํ”Œ๋กœ๋Ÿฌ ํ…ํŠธ๊ฐ€ ๋ฐฉ์ˆ˜ ๊ธฐ๋Šฅ์ด ์žˆ์Šต๋‹ˆ๋‹ค.\"\n", "# )\n", "\n", "# Running Groundedness Evaluator on a query and response pair\n", "custom_groundedness_score = custom_groundedness_eval(\n", " **query_response\n", ")\n", "print(custom_groundedness_score)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ๐Ÿงช Customize prebuilt RetrievalEvaluator" ] }, { "cell_type": "code", "execution_count": 113, "metadata": {}, "outputs": [], "source": [ "input_path = \"./data/queries_responses_ada2_hybrid.jsonl\"\n", "\n", "context_list = []\n", "\n", "with open(input_path, 'r') as file:\n", " context_list = [json.loads(next(file))['document_content'] for _ in range(3)]\n", "\n", "query = \"<๊ฐค๋Ÿฌ๊ทธ S24 ์‹œ๋ฆฌ์ฆˆ>์˜ ์ƒˆ๋กœ์šด ์ ๊ณผ ๋‹ค๋ฅธ ์ ์€ ๋ฌด์—‡์ธ๊ฐ€์š”?\"\n", "# context = \"\\n \".join(context_list)" ] }, { "cell_type": "code", "execution_count": 115, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'custom-retrieval': 5.0, 'gpt_custom-retrieval': 5.0, 'custom-retrieval_reason': 'The context chunks are highly relevant to the query, providing detailed information about the new features and differences of the ๊ฐค๋Ÿฌ๊ทธ S24 series. The most relevant information is presented at the top, making it easy for the reader to find the answers they are looking for.'}\n" ] } ], "source": [ "from CustomRetrievalEvaluator._custom_retrieval import CustomRetrievalEvaluator\n", "\n", "custom_retrieval_eval = CustomRetrievalEvaluator(model_config)\n", "\n", "query_response = dict(\n", " query=query, \n", " context=context_list\n", ")\n", "\n", "# Running RetrievalEvaluator Evaluator on a query and response pair\n", "retrieval_score = custom_retrieval_eval(\n", " **query_response\n", ")\n", "print(retrieval_score)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ๐Ÿงช Create New Custom Evaluator\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "import os\n", "\n", "from promptflow.client import load_flow\n", "\n", "\n", "class FriendlinessEvaluator:\n", " def __init__(self, model_config):\n", " current_dir = os.getcwd()\n", " prompty_path = os.path.join(current_dir, \"friendliness.prompty\")\n", " self._flow = load_flow(source=prompty_path, model={\"configuration\": model_config})\n", "\n", " def __call__(self, *, response: str, **kwargs):\n", " llm_response = self._flow(response=response)\n", " try:\n", " response = json.loads(llm_response)\n", " except Exception:\n", " response = llm_response\n", " return response" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "friendliness_eval = FriendlinessEvaluator(model_config)\n", "\n", "friendliness_score = friendliness_eval(response=\"I will not apologize for my behavior!\")\n", "print(friendliness_score)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "friendliness_eval = FriendlinessEvaluator(model_config)\n", "\n", "friendliness_score = friendliness_eval(response=\"I love you!\")\n", "print(friendliness_score)" ] } ], "metadata": { "kernelspec": { "display_name": "venv_agent", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.11" } }, "nbformat": 4, "nbformat_minor": 2 }