2_eval-design-ptn/02_azure-evaluation-sdk/01.2_batch-eval-with-your-data.ipynb:

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Batch evaluation with your own data\n", "The following sample shows the basic way to evaluate a Generative AI application in your development environment with the Azure AI evaluation SDK.\n", "\n", "> ✨ ***Note*** <br>\n", "> Please check the reference document before you get started - https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk" ] }, { "cell_type": "markdown", "metadata": { "vscode": { "languageId": "shellscript" } }, "source": [ "## 🔨 Current Support and Limitations (as of 2025-01-14) \n", "- Check the region support for the Azure AI Evaluation SDK. https://learn.microsoft.com/en-us/azure/ai-studio/concepts/evaluation-metrics-built-in?tabs=warning#region-support\n", "\n", "### Region support for evaluations\n", "| Region | Hate and Unfairness, Sexual, Violent, Self-Harm, XPIA, ECI (Text) | Groundedness (Text) | Protected Material (Text) | Hate and Unfairness, Sexual, Violent, Self-Harm, Protected Material (Image) |\n", "|---------------------|------------------------------------------------------------------|---------------------|----------------------------|----------------------------------------------------------------------------|\n", "| North Central US | no | no | no | yes |\n", "| East US 2 | yes | yes | yes | yes |\n", "| Sweden Central | yes | yes | yes | yes |\n", "| US North Central | yes | no | yes | yes |\n", "| France Central | yes | yes | yes | yes |\n", "| Switzerland West | yes | no | no | yes |\n", "\n", "### Region support for adversarial simulation\n", "| Region | Adversarial Simulation (Text) | Adversarial Simulation (Image) |\n", "|-------------------|-------------------------------|---------------------------------|\n", "| UK South | yes | no |\n", "| East US 2 | yes | yes |\n", "| Sweden Central | yes | yes |\n", "| US North Central | yes | yes |\n", "| France Central | yes | no |\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ✔️ Pricing and billing\n", "- Effective 1/14/2025, Azure AI Safety Evaluations will no longer be free in public preview. 
It will be billed based on consumption as follows:\n", "\n", "| Service Name | Safety Evaluations | Price Per 1K Tokens (USD) |\n", "|---------------------------|--------------------------|---------------------------|\n", "| Azure Machine Learning | Input pricing for 3P | $0.02 |\n", "| Azure Machine Learning | Output pricing for 3P | $0.06 |\n", "| Azure Machine Learning | Input pricing for 1P | $0.012 |\n", "| Azure Machine Learning | Output pricing for 1P | $0.012 |\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import os\n", "import json\n", "import pathlib\n", "\n", "from pprint import pprint\n", "from azure.identity import DefaultAzureCredential\n", "from dotenv import load_dotenv\n", "from azure.ai.projects import AIProjectClient\n", "from azure.ai.projects.models import (\n", " Evaluation,\n", " Dataset,\n", " EvaluatorConfiguration,\n", " ConnectionType,\n", " EvaluationSchedule,\n", " RecurrenceTrigger,\n", " ApplicationInsightsConfiguration,\n", ")\n", "\n", "from azure.ai.evaluation import evaluate\n", "from azure.ai.evaluation import (\n", " ContentSafetyEvaluator,\n", " RelevanceEvaluator,\n", " CoherenceEvaluator,\n", " GroundednessEvaluator,\n", " GroundednessProEvaluator,\n", " FluencyEvaluator,\n", " SimilarityEvaluator,\n", " F1ScoreEvaluator,\n", " RetrievalEvaluator,\n", ")\n", "\n", "from azure.ai.ml import MLClient\n", "\n", "load_dotenv(override=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 🚀 Run evaluators locally and upload the results to the cloud (azure.ai.evaluation.evaluate)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "credential = DefaultAzureCredential()\n", "\n", "azure_ai_project_conn_str = os.environ.get(\"AZURE_AI_PROJECT_CONN_STR\")\n", "subscription_id = azure_ai_project_conn_str.split(\";\")[1]\n", "resource_group_name = azure_ai_project_conn_str.split(\";\")[2]\n", "project_name = azure_ai_project_conn_str.split(\";\")[3]\n", "\n", "azure_ai_project_dict = {\n", " \"subscription_id\": subscription_id,\n", " \"resource_group_name\": resource_group_name,\n", " \"project_name\": project_name,\n", "}\n", "\n", "azure_ai_project_client = AIProjectClient.from_connection_string(\n", " credential=DefaultAzureCredential(), conn_str=azure_ai_project_conn_str\n", ")\n", "\n", "model_config = {\n", " \"azure_endpoint\": os.environ.get(\"AZURE_OPENAI_ENDPOINT\"),\n", " \"api_key\": os.environ.get(\"AZURE_OPENAI_API_KEY\"),\n", " \"azure_deployment\": os.environ.get(\"AZURE_OPENAI_CHAT_DEPLOYMENT_NAME\"),\n", " \"api_version\": os.environ.get(\"AZURE_OPENAI_API_VERSION\"),\n", " \"type\": \"azure_openai\",\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 🚀 Generate response dataset with Azure OpenAI\n", "- Use your model to generate answers for the query dataset. These response records become the input for the evaluation. By customizing your prompts, you can produce text tailored to your domain."
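 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, a domain-specific system prompt can be prepended when generating responses. The next cell is an illustrative sketch only: it assumes the same AZURE_OPENAI_* environment variables used elsewhere in this notebook, and DOMAIN_SYSTEM_PROMPT and the sample query are placeholders to replace with your own domain instructions and data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative sketch: generate a domain-tailored response for a single query.\n", "# Assumes the AZURE_OPENAI_* environment variables used elsewhere in this notebook;\n", "# DOMAIN_SYSTEM_PROMPT and the sample query are placeholders.\n", "import os\n", "from openai import AzureOpenAI\n", "\n", "DOMAIN_SYSTEM_PROMPT = (\n", "    \"You are a support assistant for Contoso products. \"\n", "    \"Answer concisely and use the provided context when possible.\"\n", ")\n", "\n", "sketch_client = AzureOpenAI(\n", "    azure_endpoint=os.getenv(\"AZURE_OPENAI_ENDPOINT\"),\n", "    api_key=os.getenv(\"AZURE_OPENAI_API_KEY\"),\n", "    api_version=os.getenv(\"AZURE_OPENAI_API_VERSION\"),\n", ")\n", "\n", "completion = sketch_client.chat.completions.create(\n", "    model=os.getenv(\"AZURE_OPENAI_CHAT_DEPLOYMENT_NAME\"),\n", "    messages=[\n", "        {\"role\": \"system\", \"content\": DOMAIN_SYSTEM_PROMPT},\n", "        {\"role\": \"user\", \"content\": \"What warranty options are available?\"},\n", "    ],\n", "    temperature=0,\n", ")\n", "print(completion.choices[0].message.content)"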
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from openai import AzureOpenAI\n", "\n", "aoai_api_endpoint = os.getenv(\"AZURE_OPENAI_ENDPOINT\")\n", "aoai_api_key = os.getenv(\"AZURE_OPENAI_API_KEY\")\n", "aoai_api_version = os.getenv(\"AZURE_OPENAI_API_VERSION\")\n", "aoai_deployment_name = os.getenv(\"AZURE_OPENAI_CHAT_DEPLOYMENT_NAME\")\n", "\n", "try:\n", " client = AzureOpenAI(\n", " azure_endpoint=aoai_api_endpoint,\n", " api_key=aoai_api_key,\n", " api_version=aoai_api_version,\n", " )\n", "\n", " print(\"=== Initialized AzuureOpenAI client ===\")\n", " print(f\"AZURE_OPENAI_ENDPOINT={aoai_api_endpoint}\")\n", " print(f\"AZURE_OPENAI_API_VERSION={aoai_api_version}\")\n", " print(f\"AZURE_OPENAI_DEPLOYMENT_NAME={aoai_deployment_name}\")\n", "\n", "except (ValueError, TypeError) as e:\n", " print(e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Case1. When you have all the data (Query + Response + Ground Truth)\n", "- If you have all the data, you can evaluate the model / service with the following steps.\n", "- Assume that you already upload the data file (csv) to Azure Blob storage " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from azure.storage.blob import BlobServiceClient\n", "\n", "\n", "def upload_evaldata_to_blob(save_path, blob_name, container_name=\"eval-container\"):\n", " # Create a blob client using the local file name as the name for the blob\n", " # Retrieve the storage connection string from environment variables\n", " blob_conn_str = os.getenv(\"AZURE_STORAGE_BLOB_CONNECTION_STRING\")\n", " if not blob_conn_str or blob_conn_str == \"\":\n", " raise ValueError(\"AZURE_STORAGE_BLOB_CONNECTION_STRING must be set.\")\n", "\n", " blob_service_client = BlobServiceClient.from_connection_string(blob_conn_str)\n", "\n", " if not blob_service_client.get_container_client(container_name).exists():\n", " blob_service_client.create_container(container_name)\n", "\n", " container_client = blob_service_client.get_container_client(container_name)\n", "\n", " # Upload the created file\n", " with open(f\"data/{blob_name}\", \"rb\") as data:\n", " container_client.upload_blob(blob_name, data, overwrite=True)\n", " blob_client = container_client.get_blob_client(blob_name)\n", "\n", " # Download CSV from Azure Blob and save locally\n", " with open(save_path, \"wb\") as f:\n", " blob_data = blob_client.download_blob()\n", " blob_data.readinto(f)\n", " print(f\"Downloaded blob data to {save_path}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "blob_name = \"eval_all_data.csv\"\n", "container_name = \"eval-container\"\n", "save_path = \"case1_temp_data.csv\"\n", "upload_evaldata_to_blob(save_path, blob_name, container_name)\n", "df = pd.read_csv(save_path)\n", "df.head(5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a jsonl file from the query data\n", "outname = \"case1_all_data.jsonl\"\n", "\n", "outdir = \"./data\"\n", "if not os.path.exists(outdir):\n", " os.mkdir(outdir)\n", "\n", "input_path = os.path.join(outdir, outname)\n", "df.to_json(input_path, orient=\"records\", lines=True, force_ascii=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Case2. 
When you have only query and ground truth data (Query + Ground Truth)\n", "- If you have only the queries and the ground truth, first generate the responses and then evaluate the model / service with the following steps.\n", "- Assume that you have already uploaded the data file (CSV) to Azure Blob Storage." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "blob_name = \"eval_all_data.csv\"\n", "container_name = \"eval-container\"\n", "save_path = \"case2_temp_data.csv\"\n", "upload_evaldata_to_blob(save_path, blob_name, container_name)\n", "df = pd.read_csv(save_path)\n", "df.head(5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a jsonl file from the query data\n", "outname = \"case2_temp_data.jsonl\"\n", "\n", "outdir = \"./data\"\n", "if not os.path.exists(outdir):\n", " os.mkdir(outdir)\n", "\n", "query_path = os.path.join(outdir, outname)\n", "df.to_json(query_path, orient=\"records\", lines=True, force_ascii=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import tqdm\n", "\n", "# This is the final jsonl file with the response added\n", "# it will be used for evaluation\n", "input_path = \"./data/case2_query_response_data.jsonl\"\n", "with open(input_path, \"w\", encoding=\"utf-8\") as outfile:\n", " outfile.write(\"\")\n", "\n", "\n", "with open(query_path, \"r\", encoding=\"utf-8\") as infile, open(\n", " input_path, \"a\", encoding=\"utf-8\"\n", ") as outfile:\n", "\n", " for idx, line in enumerate(infile):\n", " print(f\"=== Processing line {idx} ===\")\n", " data = json.loads(line)\n", " resp = client.chat.completions.create(\n", " model=aoai_deployment_name,\n", " messages=[{\"role\": \"user\", \"content\": data[\"query\"]}],\n", " temperature=0,\n", " )\n", "\n", " response_text = resp.choices[0].message.content\n", " print(response_text)\n", " data[\"response\"] = response_text\n", "\n", " outfile.write(json.dumps(data, ensure_ascii=False) + \"\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluate the model / service with the prepared data\n", "- Make sure you have already run Case 1 or Case 2 so that the input data is ready."
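 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Optionally, you can sanity-check the prepared JSONL before running the evaluators. The next cell is a minimal sketch and not part of the original flow: it reuses the input_path variable set in Case 1 or Case 2, and the required_columns set is an assumption based on the column mapping used below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional sanity check: verify each JSONL record has the columns the evaluators expect.\n", "# required_columns is an assumption based on the column_mapping defined in the next cell.\n", "import json\n", "\n", "required_columns = {\"query\", \"response\", \"context\", \"ground_truth\"}\n", "\n", "with open(input_path, \"r\", encoding=\"utf-8\") as f:\n", "    for idx, line in enumerate(f):\n", "        record = json.loads(line)\n", "        missing = required_columns - record.keys()\n", "        if missing:\n", "            print(f\"Record {idx} is missing columns: {sorted(missing)}\")\n", "print(\"Column check finished.\")"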
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "input_path = input_path # be sure which case you took\n", "output_path = \"./data/cloud_evaluation_output.json\"\n", "\n", "\n", "# https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/flow-evaluate-sdk\n", "retrieval_evaluator = RetrievalEvaluator(model_config)\n", "fluency_evaluator = FluencyEvaluator(model_config)\n", "groundedness_evaluator = GroundednessEvaluator(model_config)\n", "relevance_evaluator = RelevanceEvaluator(model_config)\n", "coherence_evaluator = CoherenceEvaluator(model_config)\n", "similarity_evaluator = SimilarityEvaluator(model_config)\n", "\n", "column_mapping = {\n", " \"query\": \"${data.query}\",\n", " \"ground_truth\": \"${data.ground_truth}\",\n", " \"response\": \"${data.response}\",\n", " \"context\": \"${data.context}\",\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import datetime\n", "\n", "result = evaluate(\n", " evaluation_name=f\"evaluation_local_upload_cloud_{datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\",\n", " data=input_path,\n", " evaluators={\n", " \"groundedness\": groundedness_evaluator,\n", " \"retrieval\": retrieval_evaluator,\n", " \"relevance\": relevance_evaluator,\n", " \"coherence\": coherence_evaluator,\n", " \"fluency\": fluency_evaluator,\n", " \"similarity\": similarity_evaluator,\n", " },\n", " evaluator_config={\n", " \"groundedness\": {\"column_mapping\": column_mapping},\n", " \"retrieval\": {\"column_mapping\": column_mapping},\n", " \"relevance\": {\"column_mapping\": column_mapping},\n", " \"coherence\": {\"column_mapping\": column_mapping},\n", " \"fluency\": {\"column_mapping\": column_mapping},\n", " \"similarity\": {\"column_mapping\": column_mapping},\n", " },\n", " azure_ai_project=azure_ai_project_dict,\n", " output_path=output_path,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!az login --scope https://graph.microsoft.com//.default" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 🚀 Run Evaluators in Azure Cloud (azure.ai.projects.models.Evaluation)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# id for each evaluator can be found in your AI Studio registry - please see documentation for more information\n", "# init_params is the configuration for the model to use to perform the evaluation\n", "# data_mapping is used to map the output columns of your query to the names required by the evaluator\n", "# Evaluator parameter format - https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk#evaluator-parameter-format\n", "evaluators_cloud = {\n", " \"f1_score\": EvaluatorConfiguration(\n", " id=F1ScoreEvaluator.id,\n", " ),\n", " \"relevance\": EvaluatorConfiguration(\n", " id=RelevanceEvaluator.id,\n", " init_params={\"model_config\": model_config},\n", " data_mapping={\n", " \"query\": \"${data.query}\",\n", " \"context\": \"${data.context}\",\n", " \"response\": \"${data.response}\",\n", " },\n", " ),\n", " \"groundedness\": EvaluatorConfiguration(\n", " id=GroundednessEvaluator.id,\n", " init_params={\"model_config\": model_config},\n", " data_mapping={\n", " \"query\": \"${data.query}\",\n", " \"context\": \"${data.context}\",\n", " \"response\": \"${data.response}\",\n", " },\n", " ),\n", " # \"retrieval\": EvaluatorConfiguration(\n", " # #from azure.ai.evaluation._evaluators._common.math import 
list_mean_nan_safe\\nModuleNotFoundError: No module named 'azure.ai.evaluation._evaluators._common.math'\n", " # id=RetrievalEvaluator.id,\n", " # #id=\"azureml://registries/azureml/models/Retrieval-Evaluator/versions/2\",\n", " # init_params={\"model_config\": model_config},\n", " # data_mapping={\"query\": \"${data.query}\", \"context\": \"${data.context}\", \"response\": \"${data.response}\"},\n", " # ),\n", " \"coherence\": EvaluatorConfiguration(\n", " id=CoherenceEvaluator.id,\n", " init_params={\"model_config\": model_config},\n", " data_mapping={\"query\": \"${data.query}\", \"response\": \"${data.response}\"},\n", " ),\n", " \"fluency\": EvaluatorConfiguration(\n", " id=FluencyEvaluator.id,\n", " init_params={\"model_config\": model_config},\n", " data_mapping={\n", " \"query\": \"${data.query}\",\n", " \"context\": \"${data.context}\",\n", " \"response\": \"${data.response}\",\n", " },\n", " ),\n", " \"similarity\": EvaluatorConfiguration(\n", " # There is currently a bug in the SDK; use the registry id below instead\n", " # id=SimilarityEvaluator.id,\n", " id=\"azureml://registries/azureml/models/Similarity-Evaluator/versions/3\",\n", " init_params={\"model_config\": model_config},\n", " data_mapping={\n", " \"query\": \"${data.query}\",\n", " \"response\": \"${data.response}\",\n", " \"ground_truth\": \"${data.ground_truth}\",\n", " },\n", " ),\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data\n", "- The following code demonstrates how to upload the data for evaluation to your Azure AI project. Below we use evaluate_test_data.jsonl, which exemplifies LLM-generated data in the query-response format expected by the Azure AI Evaluation SDK. For your use case, you should upload data in the same format, which can be generated using the Simulator from the Azure AI Evaluation SDK.\n", "\n", "- Alternatively, if you already have an existing dataset for evaluation, you can use it by referencing the dataset link in your registry or the dataset ID." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Upload data for evaluation\n", "data_id, _ = azure_ai_project_client.upload_file(\"data/evaluate_test_data.jsonl\")\n", "# To use an existing dataset instead, replace the line above with one of the following:\n", "# data_id = \"azureml://registries/<registry>/data/<dataset>/versions/<version>\"\n", "# data_id = \"<dataset_id>\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Configure Evaluators to Run\n", "- The code below demonstrates how to configure the evaluators you want to run. In this example, we use the F1ScoreEvaluator, RelevanceEvaluator, GroundednessEvaluator, CoherenceEvaluator, FluencyEvaluator, and SimilarityEvaluator, but all evaluators supported by Azure AI Evaluation are supported by cloud evaluation and can be configured here. You can either import the classes from the SDK and reference them with the .id property, or you can find the fully formed id of the evaluator in the AI Studio registry of evaluators, and use it here. 
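" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next cell is a small optional sketch that prints the fully formed id strings referenced by the EvaluatorConfiguration entries above, using the .id property of the evaluator classes imported at the top of this notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional: print the registry ids referenced by the EvaluatorConfiguration entries above.\n", "for evaluator_cls in (\n", "    F1ScoreEvaluator,\n", "    RelevanceEvaluator,\n", "    GroundednessEvaluator,\n", "    CoherenceEvaluator,\n", "    FluencyEvaluator,\n", "):\n", "    print(f\"{evaluator_cls.__name__}: {evaluator_cls.id}\")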
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "evaluation = Evaluation(\n", " display_name=f\"evaluation_cloud_{datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\",\n", " description=\"Cloud Evaluation of dataset\",\n", " data=Dataset(id=data_id),\n", " evaluators=evaluators_cloud,\n", ")\n", "\n", "# Create evaluation\n", "evaluation_response = azure_ai_project_client.evaluations.create(\n", " evaluation=evaluation,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from tqdm import notebook\n", "import time\n", "\n", "\n", "# Monitor the status of the run_result\n", "def monitor_status(project_client: AIProjectClient, evaluation_response_id: str):\n", " with notebook.tqdm(total=3, desc=\"Running Status\", unit=\"step\") as pbar:\n", " status = project_client.evaluations.get(evaluation_response_id).status\n", " if status == \"Queued\":\n", " pbar.update(1)\n", " while status != \"Completed\" and status != \"Failed\":\n", " if status == \"Running\" and pbar.n < 2:\n", " pbar.update(1)\n", " notebook.tqdm.write(f\"Current Status: {status}\")\n", " time.sleep(10)\n", " status = project_client.evaluations.get(evaluation_response_id).status\n", " while pbar.n < 3:\n", " pbar.update(1)\n", " notebook.tqdm.write(\"Operation Completed\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "monitor_status(azure_ai_project_client, evaluation_response.id)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Check the evaluation result in Azure AI Foundry \n", "- After running the evaluation, you can check the evaluation results in Azure AI Foundry. You can find the evaluation results in the Evaluation tab of your project." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get evaluation\n", "get_evaluation_response = azure_ai_project_client.evaluations.get(\n", " evaluation_response.id\n", ")\n", "\n", "print(\"----------------------------------------------------------------\")\n", "print(\"Created evaluation, evaluation ID: \", get_evaluation_response.id)\n", "print(\"Evaluation status: \", get_evaluation_response.status)\n", "print(\n", " \"AI Foundry Portal URI: \",\n", " get_evaluation_response.properties[\"AiStudioEvaluationUri\"],\n", ")\n", "print(\"----------------------------------------------------------------\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Cloud Evaluation Result](./images/cloud_evaluation_result.png)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"----------------------------------------------------------------\")\n", "print(\"Created evaluation, evaluation ID: \", get_evaluation_response.id)\n", "print(\"Evaluation status: \", get_evaluation_response.status)\n", "print(\n", " \"AI Foundry Portal URI: \",\n", " get_evaluation_response.properties[\"AiStudioEvaluationUri\"],\n", ")\n", "print(\"-----------------------------------\")" ] } ], "metadata": { "kernelspec": { "display_name": "venv_agent", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.11" } }, "nbformat": 4, "nbformat_minor": 2 }