Lab2_nlp_evaluation/nlp_base_model_evaluators.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Evaluate mutliple models in quantitative NLP evaluators\n",
"\n",
"## Objective\n",
"This notebook demonstrates how to use NLP-based evaluators to assess the quality of generated text by comparing it to reference text. By the end of this tutorial, you'll be able to:\n",
" - Understand different NLP evaluators such as `BleuScoreEvaluator`, `GleuScoreEvaluator`, `MeteorScoreEvaluator`, and `RougeScoreEvaluator`.\n",
" - Evaluate dataset using these evaluators.\n",
"\n",
"## Time\n",
"You should expect to spend about 10 minutes running this notebook.\n",
"\n",
"## Before you begin\n",
"\n",
"### Installation\n",
"Install the following packages required to execute this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1733832648838
}
},
"outputs": [],
"source": [
"# Install the packages\n",
"%pip install azure-ai-evaluation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1733832651097
},
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [],
"source": [
"import os\n",
"from pprint import pprint\n",
"from dotenv import load_dotenv\n",
"load_dotenv(\"../.credentials.env\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## NLP Evaluators"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Initialize Azure AI project and Azure OpenAI conncetion with your environment variables\n",
"azure_ai_project = {\n",
" \"subscription_id\": os.environ.get(\"AZURE_SUBSCRIPTION_ID\"),\n",
" \"resource_group_name\": os.environ.get(\"AZURE_RESOURCE_GROUP\"),\n",
" \"project_name\": os.environ.get(\"AZURE_PROJECT_NAME\"),\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Set up env vars for model endpoints and keys"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"env_var = { \n",
" \"gpt-35-turbo\": {\n",
" \"endpoint\": os.environ.get(\"AZURE_OPENAI_GPT35_ENDPOINT\"),\n",
" \"key\": os.environ.get(\"AZURE_OPENAI_GPT35_API_KEY\"),\n",
" },\n",
" \"gpt-4\": {\n",
" \"endpoint\": os.environ.get(\"AZURE_OPENAI_GPT4_ENDPOINT\"),\n",
" \"key\": os.environ.get(\"AZURE_OPENAI_GPT4_API_KEY\"),\n",
" },\n",
" \"gpt-4o\": {\n",
" \"endpoint\": os.environ.get(\"AZURE_OPENAI_GPT4o_ENDPOINT\"),\n",
" \"key\": os.environ.get(\"AZURE_OPENAI_GPT4o_API_KEY\"),\n",
" },\n",
" \"gpt-4o-mini\" : { \n",
" \"endpoint\" : os.environ.get(\"AZURE_OPENAI_GPT4o-mini_ENDPOINT\"), \n",
" \"key\" : os.environ.get(\"AZURE_OPENAI_GPT4o-mini_API_KEY\"), \n",
" }, \n",
"}"
]
},
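{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before running any evaluations, it is worth confirming that every endpoint and key was actually loaded. The check below is a minimal sketch: it only flags empty values in `env_var`; it does not validate the credentials themselves."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Flag any endpoint or key that was not found in the environment\n",
"for model_name, creds in env_var.items():\n",
"    for field, value in creds.items():\n",
"        if not value:\n",
"            print(f\"Missing {field} for {model_name} - check ../.credentials.env\")"
]
},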
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"with open(\"target_nlp_api/target_nlp_api.py\") as fin:\n",
" print(fin.read())"
]
},
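{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `evaluate` call later in this notebook passes a `target` callable. Its parameter names must match column names in `ai_data.jsonl`, and the keys of the dictionary it returns become the `${target.*}` columns referenced in the evaluator mappings. The toy function below only illustrates that contract; the `query` parameter is a hypothetical column name for illustration, not one confirmed by `ai_data.jsonl`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical sketch of the target contract used by evaluate():\n",
"# parameters (here `query`, an assumed column name) are filled from each\n",
"# dataset row, and returned keys (here `response`) become ${target.response}.\n",
"def toy_target(query: str) -> dict:\n",
"    return {\"response\": f\"echo: {query}\"}\n",
"\n",
"toy_target(query=\"What is BLEU?\")"
]
},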
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from target_nlp_api.target_nlp_api import ModelEndpoints"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.evaluation import BleuScoreEvaluator\n",
"from azure.ai.evaluation import GleuScoreEvaluator\n",
"from azure.ai.evaluation import MeteorScoreEvaluator\n",
"from azure.ai.evaluation import RougeScoreEvaluator, RougeType\n",
"\n",
"bleu = BleuScoreEvaluator()\n",
"gleu = GleuScoreEvaluator()\n",
"meteor = MeteorScoreEvaluator(alpha=0.9, beta=3.0, gamma=0.5)\n",
"rouge = RougeScoreEvaluator(rouge_type=RougeType.ROUGE_1)"
]
},
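{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each evaluator is a plain callable, so you can score a single response/ground-truth pair before launching a full run. The two strings below are made-up examples; each call returns a dictionary keyed by metric name (for example `bleu_score`)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Score one made-up response against a made-up reference\n",
"sample_response = \"The capital of France is Paris.\"\n",
"sample_ground_truth = \"Paris is the capital of France.\"\n",
"\n",
"pprint(bleu(response=sample_response, ground_truth=sample_ground_truth))\n",
"pprint(gleu(response=sample_response, ground_truth=sample_ground_truth))\n",
"pprint(meteor(response=sample_response, ground_truth=sample_ground_truth))\n",
"pprint(rouge(response=sample_response, ground_truth=sample_ground_truth))"
]
},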
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1733832746096
}
},
"outputs": [],
"source": [
"from azure.ai.evaluation import evaluate\n",
"import random\n",
"import pathlib\n",
"import sys\n",
"\n",
"from target_nlp_api.target_nlp_api import ModelEndpoints\n",
"\n",
"models = [\"gpt-35-turbo\",\"gpt-4\",\"gpt-4o\",\"gpt-4o-mini\"]\n",
"\n",
"for model in models:\n",
" print(\" Evaluating NLP metrics - \", model)\n",
" print(\"-----------------------------------\")\n",
" randomNum = random.randint(1111, 9999)\n",
" result = evaluate(\n",
" azure_ai_project=azure_ai_project, \n",
" data=\"ai_data.jsonl\",\n",
" evaluation_name = \"NLP-\" + model.title() + \"_Run-\" + str(randomNum),\n",
" target = ModelEndpoints(env_var, model),\n",
"\n",
" evaluators={\n",
" \"bleu\": bleu,\n",
" \"gleu\": gleu,\n",
" \"meteor\": meteor,\n",
" \"rouge\": rouge,\n",
" },\n",
" evaluator_config={\n",
" \"bleu\": {\n",
" \"column_mapping\": {\n",
" \"ground_truth\": \"${data.ground_truth}\",\n",
" \"response\": \"${target.response}\"}\n",
" },\n",
" }\n",
" )"
]
},
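{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because each run is stored in `results`, you can compare aggregate metrics across models side by side. This sketch assumes each result dictionary contains a `metrics` key, as returned by `azure.ai.evaluation.evaluate`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# One row of aggregate NLP metrics per model\n",
"pd.DataFrame({model: res[\"metrics\"] for model, res in results.items()}).T"
]
},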
{
"cell_type": "markdown",
"metadata": {},
"source": [
"View the results, Alternatively you can view the results in AI Foundry"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1733832746266
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"pd.DataFrame(result[\"rows\"])"
]
}
],
"metadata": {
"kernel_info": {
"name": "python310-sdkv2"
},
"kernelspec": {
"display_name": "model_eval",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
},
"microsoft": {
"host": {
"AzureML": {
"notebookHasBeenCompleted": true
}
},
"ms_spell_check": {
"ms_spell_check_language": "en"
}
},
"nteract": {
"version": "nteract-front-end@1.0.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}