Lab2_nlp_evaluation/nlp_base_model_evaluators.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Evaluate mutliple models in quantitative NLP evaluators\n",
"\n",
"## Objective\n",
"This notebook demonstrates how to use NLP-based evaluators to assess the quality of generated text by comparing it to reference text. By the end of this tutorial, you'll be able to:\n",
" - Understand different NLP evaluators such as `BleuScoreEvaluator`, `GleuScoreEvaluator`, `MeteorScoreEvaluator`, and `RougeScoreEvaluator`.\n",
" - Evaluate dataset using these evaluators.\n",
"\n",
"## Time\n",
"You should expect to spend about 10 minutes running this notebook.\n",
"\n",
"## Before you begin\n",
"\n",
"### Installation\n",
"Install the following packages required to execute this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1733832648838
}
},
"outputs": [],
"source": [
"# Install the packages\n",
"%pip install azure-ai-evaluation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1733832651097
},
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [],
"source": [
"import os\n",
"from pprint import pprint\n",
"from dotenv import load_dotenv\n",
"load_dotenv(\"../.credentials.env\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## NLP Evaluators"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Initialize Azure AI project and Azure OpenAI conncetion with your environment variables\n",
"azure_ai_project = {\n",
" \"subscription_id\": os.environ.get(\"AZURE_SUBSCRIPTION_ID\"),\n",
" \"resource_group_name\": os.environ.get(\"AZURE_RESOURCE_GROUP\"),\n",
" \"project_name\": os.environ.get(\"AZURE_PROJECT_NAME\"),\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Set up env vars for model endpoints and keys"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"env_var = { \n",
" \"gpt-35-turbo\": {\n",
" \"endpoint\": os.environ.get(\"AZURE_OPENAI_GPT35_ENDPOINT\"),\n",
" \"key\": os.environ.get(\"AZURE_OPENAI_GPT35_API_KEY\"),\n",
" },\n",
" \"gpt-4\": {\n",
" \"endpoint\": os.environ.get(\"AZURE_OPENAI_GPT4_ENDPOINT\"),\n",
" \"key\": os.environ.get(\"AZURE_OPENAI_GPT4_API_KEY\"),\n",
" },\n",
" \"gpt-4o\": {\n",
" \"endpoint\": os.environ.get(\"AZURE_OPENAI_GPT4o_ENDPOINT\"),\n",
" \"key\": os.environ.get(\"AZURE_OPENAI_GPT4o_API_KEY\"),\n",
" },\n",
" \"gpt-4o-mini\" : { \n",
" \"endpoint\" : os.environ.get(\"AZURE_OPENAI_GPT4o-mini_ENDPOINT\"), \n",
" \"key\" : os.environ.get(\"AZURE_OPENAI_GPT4o-mini_API_KEY\"), \n",
" }, \n",
"}"
]
},
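{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before running any evaluations, it is worth confirming that every endpoint and key was actually loaded. The check below is a minimal sketch: it only flags empty values in `env_var`; it does not validate the credentials themselves."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Flag any endpoint or key that was not found in the environment\n",
"for model_name, creds in env_var.items():\n",
"    for field, value in creds.items():\n",
"        if not value:\n",
"            print(f\"Missing {field} for {model_name} - check ../.credentials.env\")"
]
},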
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"with open(\"target_nlp_api/target_nlp_api.py\") as fin:\n",
" print(fin.read())"
]
},
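{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `evaluate` call later in this notebook passes a `target` callable. Its parameter names must match column names in `ai_data.jsonl`, and the keys of the dictionary it returns become the `${target.*}` columns referenced in the evaluator mappings. The toy function below only illustrates that contract; the `query` parameter is a hypothetical column name for illustration, not one confirmed by `ai_data.jsonl`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical sketch of the target contract used by evaluate():\n",
"# parameters (here `query`, an assumed column name) are filled from each\n",
"# dataset row, and returned keys (here `response`) become ${target.response}.\n",
"def toy_target(query: str) -> dict:\n",
"    return {\"response\": f\"echo: {query}\"}\n",
"\n",
"toy_target(query=\"What is BLEU?\")"
]
},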
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from target_nlp_api.target_nlp_api import ModelEndpoints"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.evaluation import BleuScoreEvaluator\n",
"from azure.ai.evaluation import GleuScoreEvaluator\n",
"from azure.ai.evaluation import MeteorScoreEvaluator\n",
"from azure.ai.evaluation import RougeScoreEvaluator, RougeType\n",
"\n",
"bleu = BleuScoreEvaluator()\n",
"gleu = GleuScoreEvaluator()\n",
"meteor = MeteorScoreEvaluator(alpha=0.9, beta=3.0, gamma=0.5)\n",
"rouge = RougeScoreEvaluator(rouge_type=RougeType.ROUGE_1)"
]
},
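{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each evaluator is a plain callable, so you can score a single response/ground-truth pair before launching a full run. The two strings below are made-up examples; each call returns a dictionary keyed by metric name (for example `bleu_score`)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Score one made-up response against a made-up reference\n",
"sample_response = \"The capital of France is Paris.\"\n",
"sample_ground_truth = \"Paris is the capital of France.\"\n",
"\n",
"pprint(bleu(response=sample_response, ground_truth=sample_ground_truth))\n",
"pprint(gleu(response=sample_response, ground_truth=sample_ground_truth))\n",
"pprint(meteor(response=sample_response, ground_truth=sample_ground_truth))\n",
"pprint(rouge(response=sample_response, ground_truth=sample_ground_truth))"
]
},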
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1733832746096
}
},
"outputs": [],
"source": [
"from azure.ai.evaluation import evaluate\n",
"import random\n",
"import pathlib\n",
"import sys\n",
"\n",
"from target_nlp_api.target_nlp_api import ModelEndpoints\n",
"\n",
"models = [\"gpt-35-turbo\",\"gpt-4\",\"gpt-4o\",\"gpt-4o-mini\"]\n",
"\n",
"for model in models:\n",
" print(\" Evaluating NLP metrics - \", model)\n",
" print(\"-----------------------------------\")\n",
" randomNum = random.randint(1111, 9999)\n",
" result = evaluate(\n",
" azure_ai_project=azure_ai_project, \n",
" data=\"ai_data.jsonl\",\n",
" evaluation_name = \"NLP-\" + model.title() + \"_Run-\" + str(randomNum),\n",
" target = ModelEndpoints(env_var, model),\n",
"\n",
" evaluators={\n",
" \"bleu\": bleu,\n",
" \"gleu\": gleu,\n",
" \"meteor\": meteor,\n",
" \"rouge\": rouge,\n",
" },\n",
" evaluator_config={\n",
" \"bleu\": {\n",
" \"column_mapping\": {\n",
" \"ground_truth\": \"${data.ground_truth}\",\n",
" \"response\": \"${target.response}\"}\n",
" },\n",
" }\n",
" )"
]
},
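{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because each run is stored in `results`, you can compare aggregate metrics across models side by side. This sketch assumes each result dictionary contains a `metrics` key, as returned by `azure.ai.evaluation.evaluate`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# One row of aggregate NLP metrics per model\n",
"pd.DataFrame({model: res[\"metrics\"] for model, res in results.items()}).T"
]
},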
{
"cell_type": "markdown",
"metadata": {},
"source": [
"View the results, Alternatively you can view the results in AI Foundry"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1733832746266
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"pd.DataFrame(result[\"rows\"])"
]
}
],
"metadata": {
"kernel_info": {
"name": "python310-sdkv2"
},
"kernelspec": {
"display_name": "model_eval",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
},
"microsoft": {
"host": {
"AzureML": {
"notebookHasBeenCompleted": true
}
},
"ms_spell_check": {
"ms_spell_check_language": "en"
}
},
"nteract": {
"version": "nteract-front-end@1.0.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}