Lab1_ai_evaluation/evaluate_base_model_endpoint.ipynb

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Assess Base Model Endpoints using Azure AI Evaluation APIs\n", "\n", "\n", "## Objective\n", "\n", "This tutorial walks you through evaluating prompts against different model endpoints deployed on the Azure AI platform or on other platforms.\n", "\n", "We'll use a Python class as the target application. It is passed to the `evaluate` API from the azure-ai-evaluation SDK so we can see how different LLMs respond to the same prompts.\n", "\n", "We'll be using these Azure AI services:\n", "\n", "- [azure-ai-evaluation](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk)\n", "\n", "## Time\n", "\n", "Allocate approximately 30 minutes to run this sample.\n", "\n", "\n", "## About this example\n", "\n", "This example illustrates how to evaluate model endpoint responses using azure-ai-evaluation.\n", "\n", "## Before you begin\n", "\n", "### Installation\n", "\n", "Install the packages required to execute this notebook." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install azure-ai-evaluation\n", "%pip install promptflow-azure\n", "%pip install promptflow-tracing\n", "%pip install promptflow-evals\n", "%pip install python-dotenv" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Detailed Explanation of Parameters and Imports" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1733829168235 } }, "outputs": [], "source": [ "from pprint import pprint\n", "\n", "import pandas as pd\n", "import random\n", "import os\n", "\n", "# Load endpoint URLs, API keys and project details from the shared credentials file\n", "from dotenv import load_dotenv\n", "\n", "load_dotenv(\"../.credentials.env\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Target Application\n", "\n", "We will use the `evaluate` API provided by the azure-ai-evaluation SDK. It requires a target application or Python function that handles calls to LLMs and returns their responses.\n", "\n", "In this notebook, we use an application target called `ModelEndpoints` to obtain answers from multiple model endpoints for the provided questions, also known as prompts.\n", "\n", "This application target requires a set of model endpoints and their authentication keys. For simplicity, these are provided in the `env_var` dictionary, which is passed into the `__init__()` method of `ModelEndpoints`." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1733829170060 } }, "outputs": [], "source": [ "env_var = {\n", "    \"gpt-35-turbo\": {\n", "        \"endpoint\": os.environ.get(\"AZURE_OPENAI_GPT35_ENDPOINT\"),\n", "        \"key\": os.environ.get(\"AZURE_OPENAI_GPT35_API_KEY\"),\n", "    },\n", "    \"gpt-4\": {\n", "        \"endpoint\": os.environ.get(\"AZURE_OPENAI_GPT4_ENDPOINT\"),\n", "        \"key\": os.environ.get(\"AZURE_OPENAI_GPT4_API_KEY\"),\n", "    },\n", "    \"gpt-4o\": {\n", "        \"endpoint\": os.environ.get(\"AZURE_OPENAI_GPT4o_ENDPOINT\"),\n", "        \"key\": os.environ.get(\"AZURE_OPENAI_GPT4o_API_KEY\"),\n", "    },\n", "    \"gpt-4o-mini\": {\n", "        \"endpoint\": os.environ.get(\"AZURE_OPENAI_GPT4o-mini_ENDPOINT\"),\n", "        \"key\": os.environ.get(\"AZURE_OPENAI_GPT4o-mini_API_KEY\"),\n", "    },\n", "}" ] },
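{ "cell_type": "markdown", "metadata": {}, "source": [ "Optional sanity check (not part of the original lab flow): the cell below reports which entries in `env_var` resolved to a real endpoint and key from `../.credentials.env`. Comment out any model reported as missing, both in `env_var` above and in the `models` list used later in the notebook, so the evaluation loop does not fail for that model." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sanity check: flag models whose endpoint or key did not resolve from the environment\n", "for model_name, cfg in env_var.items():\n", "    missing = [k for k, v in cfg.items() if not v]\n", "    if missing:\n", "        print(f\"[WARN] {model_name}: missing {', '.join(missing)} - comment this model out before evaluating\")\n", "    else:\n", "        print(f\"[OK] {model_name}: endpoint and key found\")" ] },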
{ "cell_type": "markdown", "metadata": {}, "source": [ "Set the Azure AI Project details so that traces and evaluation results are logged to your project in Azure AI Studio." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1733829171729 } }, "outputs": [], "source": [ "# Initialize the Azure AI project connection with your environment variables\n", "azure_ai_project = {\n", "    \"subscription_id\": os.environ.get(\"AZURE_SUBSCRIPTION_ID\"),\n", "    \"resource_group_name\": os.environ.get(\"AZURE_RESOURCE_GROUP\"),\n", "    \"project_name\": os.environ.get(\"AZURE_PROJECT_NAME\"),\n", "}" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Azure AI Project:\", azure_ai_project)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Model Endpoints\n", "The target application (`target_ai_api/target_ai_api.py`, printed in full later in this notebook) calls the various model endpoints configured in `env_var` above. For any model in `env_var` that is not deployed in your AI project, please comment it out. If you wish to test a model that is not listed, add its type to the `__call__` function of the target class and create a helper function that calls the model's endpoint via REST, as in the sketch below." ] },
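{ "cell_type": "markdown", "metadata": {}, "source": [ "The next cell is a simplified, hypothetical sketch of such a target class, shown only to illustrate the shape the `evaluate` API expects: a callable that takes a query and returns a dictionary with a `response` key. It is **not** the implementation used in this lab; that lives in `target_ai_api/target_ai_api.py`. The sketch assumes each `env_var` entry holds the full chat-completions URL (deployment name and `api-version` included) plus an API key, and that the `requests` package is installed." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests\n", "\n", "\n", "# Hypothetical illustration only -- the lab's real target class is ModelEndpoints in target_ai_api/target_ai_api.py\n", "class SimpleModelEndpoint:\n", "    def __init__(self, env: dict, model_type: str):\n", "        # 'env' has the same shape as env_var above: {model_name: {\"endpoint\": ..., \"key\": ...}}\n", "        self.env = env\n", "        self.model_type = model_type\n", "\n", "    def _call_chat_completions(self, endpoint: str, key: str, query: str) -> str:\n", "        # Assumes 'endpoint' is the full chat/completions URL, including the api-version query string\n", "        payload = {\"messages\": [{\"role\": \"user\", \"content\": query}]}\n", "        headers = {\"Content-Type\": \"application/json\", \"api-key\": key}\n", "        response = requests.post(endpoint, headers=headers, json=payload, timeout=60)\n", "        response.raise_for_status()\n", "        return response.json()[\"choices\"][0][\"message\"][\"content\"]\n", "\n", "    def __call__(self, query: str, **kwargs) -> dict:\n", "        # Dispatch on model type; add a branch (and a new helper) here to support another model family\n", "        if self.model_type not in self.env:\n", "            raise ValueError(f\"Unknown model type: {self.model_type}\")\n", "        cfg = self.env[self.model_type]\n", "        answer = self._call_chat_completions(cfg[\"endpoint\"], cfg[\"key\"], query)\n", "        # The evaluators read the answer through the 'response' key (see the column mappings below)\n", "        return {\"query\": query, \"response\": answer}" ] },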
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Data\n", "\n", "The following code reads the JSON Lines file `ai_data.jsonl`, which contains the inputs for the application target. Each line includes a query (question), context, and ground truth." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1733829176661 } }, "outputs": [], "source": [ "df = pd.read_json(\"ai_data.jsonl\", lines=True)\n", "print(df.head())" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Configuration\n", "The AI-assisted evaluators (Relevance, Coherence, Groundedness, Fluency, and Similarity) use an Azure OpenAI model as a judge. Provide that judge model's connection details in the model configuration below." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1733829180194 } }, "outputs": [], "source": [ "model_config = {\n", "    \"azure_endpoint\": os.environ.get(\"AZURE_OPENAI_ENDPOINT\"),\n", "    \"api_key\": os.environ.get(\"AZURE_OPENAI_API_KEY\"),\n", "    \"azure_deployment\": os.environ.get(\"AZURE_OPENAI_DEPLOYMENT\"),\n", "    \"api_version\": os.environ.get(\"AZURE_OPENAI_API_VERSION\"),\n", "}" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Execute the Evaluation\n", "\n", "The following code calls the Evaluate API and uses the Relevance, Groundedness, Coherence, Fluency, and Similarity evaluators to assess the results from the various models.\n", "\n", "The Evaluate API requires the following parameters:\n", "\n", "+ Data file (prompts): the data file `ai_data.jsonl` in JSON Lines format. Each line contains a query, context, and ground truth for the evaluators.\n", "+ Application target: the Python class (`ModelEndpoints`) that routes calls to specific model endpoints using the model name in conditional logic.\n", "+ Model name: an identifier for the model, which the application target uses to pick the right endpoint URL and authentication key for the LLM call.\n", "+ Evaluators: the list of evaluators used to assess the given prompts (questions) as input and the outputs (answers) from the LLM models." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1733829183716 }, "jupyter": { "outputs_hidden": false, "source_hidden": false }, "nteract": { "transient": { "deleting": false } } }, "outputs": [], "source": [ "# Show the actual target application used in this lab\n", "with open(\"target_ai_api/target_ai_api.py\") as fin:\n", "    print(fin.read())" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1733829185720 }, "jupyter": { "outputs_hidden": false, "source_hidden": false }, "nteract": { "transient": { "deleting": false } } }, "outputs": [], "source": [ "from target_ai_api.target_ai_api import ModelEndpoints" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1733770426854 } }, "outputs": [], "source": [ "import pathlib\n", "import time\n", "\n", "from azure.ai.evaluation import evaluate\n", "from azure.ai.evaluation import (\n", "    RelevanceEvaluator,\n", "    CoherenceEvaluator,\n", "    GroundednessEvaluator,\n", "    FluencyEvaluator,\n", "    SimilarityEvaluator,\n", ")\n", "\n", "from target_ai_api.target_ai_api import ModelEndpoints\n", "\n", "# AI-assisted quality evaluators, all judged by the Azure OpenAI model in model_config\n", "relevance_evaluator = RelevanceEvaluator(model_config)\n", "coherence_evaluator = CoherenceEvaluator(model_config)\n", "groundedness_evaluator = GroundednessEvaluator(model_config)\n", "fluency_evaluator = FluencyEvaluator(model_config)\n", "similarity_evaluator = SimilarityEvaluator(model_config)\n", "\n", "# Comment out any model that is not configured in env_var\n", "models = [\"gpt-35-turbo\", \"gpt-4\", \"gpt-4o\", \"gpt-4o-mini\"]\n", "\n", "path = str(pathlib.Path(pathlib.Path.cwd())) + \"/ai_data.jsonl\"\n", "\n", "for model in models:\n", "    print(\" Evaluating AI-assisted metrics - \", model)\n", "    print(\"-----------------------------------\")\n", "    randomNum = random.randint(1111, 9999)\n", "    # Note: 'results' is overwritten on every iteration, so it ends up holding the last model's run\n", "    results = evaluate(\n", "        azure_ai_project=azure_ai_project,\n", "        evaluation_name=\"Eval_\" + model.title() + \"_Run-\" + str(randomNum),\n", "        data=path,\n", "        target=ModelEndpoints(env_var, model),\n", "        evaluators={\n", "            \"relevance\": relevance_evaluator,\n", "            \"groundedness\": groundedness_evaluator,\n", "            \"coherence\": coherence_evaluator,\n", "            \"fluency\": fluency_evaluator,\n", "            \"similarity\": similarity_evaluator,\n", "        },\n", "        evaluator_config={\n", "            \"relevance\": {\n", "                \"column_mapping\": {\n", "                    \"query\": \"${data.query}\",\n", "                    \"context\": \"${data.context}\",\n", "                    \"response\": \"${target.response}\",\n", "                }\n", "            },\n", "            \"groundedness\": {\n", "                \"column_mapping\": {\n", "                    \"query\": \"${data.query}\",\n", "                    \"context\": \"${data.context}\",\n", "                    \"response\": \"${target.response}\",\n", "                }\n", "            },\n", "            \"coherence\": {\n", "                \"column_mapping\": {\n", "                    \"query\": \"${data.query}\",\n", "                    \"context\": \"${data.context}\",\n", "                    \"response\": \"${target.response}\",\n", "                }\n", "            },\n", "            \"fluency\": {\n", "                \"column_mapping\": {\n", "                    \"query\": \"${data.query}\",\n", "                    \"context\": \"${data.context}\",\n", "                    \"response\": \"${target.response}\",\n", "                }\n", "            },\n", "            \"similarity\": {\n", "                \"column_mapping\": {\n", "                    \"ground_truth\": \"${data.ground_truth}\",\n", "                    \"context\": \"${data.context}\",\n", "                    \"query\": \"${data.query}\",\n", "                    \"response\": \"${target.response}\",\n", "                }\n", "            },\n", "        },\n", "    )\n", "    # time.sleep(60)  # Uncomment to avoid rate-limit throttling between runs" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "View the per-row results of the last evaluation run. Because `results` is overwritten on each iteration of the loop above, the table below reflects the last model in `models`." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "gather": { "logged": 1733770438910 } }, "outputs": [], "source": [ "pd.DataFrame(results[\"rows\"])" ] },
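{ "cell_type": "markdown", "metadata": {}, "source": [ "Besides the per-row results, the dictionary returned by `evaluate` also carries aggregate scores and, when an Azure AI project is supplied, a link to the run in Azure AI Studio. The optional cell below assumes the result exposes `metrics` and `studio_url` keys, as described in the azure-ai-evaluation documentation, and again reflects only the last model evaluated." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Aggregate metrics and studio link for the most recent evaluate() run\n", "pprint(results[\"metrics\"])  # mean score per evaluator across all rows\n", "print(results.get(\"studio_url\"))  # link to this evaluation run in Azure AI Studio, if available" ]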
} ], "metadata": { "kernel_info": { "name": "python310-sdkv2" }, "kernelspec": { "display_name": "aoai_eval", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.11" }, "microsoft": { "host": { "AzureML": { "notebookHasBeenCompleted": true } }, "ms_spell_check": { "ms_spell_check_language": "en" } }, "nteract": { "version": "nteract-front-end@1.0.0" } }, "nbformat": 4, "nbformat_minor": 2 }