Lab2_nlp_evaluation/nlp_evaluators.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Evaluate with quantitative NLP evaluators\n",
"\n",
"## Objective\n",
"This notebook demonstrates how to use NLP-based evaluators to assess the quality of generated text by comparing it to reference text. By the end of this tutorial, you'll be able to:\n",
" - Understand different NLP evaluators such as `BleuScoreEvaluator`, `GleuScoreEvaluator`, `MeteorScoreEvaluator`, and `RougeScoreEvaluator`.\n",
" - Evaluate dataset using these evaluators.\n",
"\n",
"## Time\n",
"You should expect to spend about 10 minutes running this notebook.\n",
"\n",
"## Before you begin\n",
"\n",
"### Installation\n",
"Install the following packages required to execute this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1733832648838
}
},
"outputs": [],
"source": [
"# Install the packages\n",
"%pip install azure-ai-evaluation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1733832651097
},
"jupyter": {
"outputs_hidden": false,
"source_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
}
},
"outputs": [],
"source": [
"import os\n",
"from pprint import pprint\n",
"from dotenv import load_dotenv\n",
"load_dotenv(\"../.credentials.env\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## NLP Evaluators"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### BleuScoreEvaluator\n",
"\n",
"BLEU (Bilingual Evaluation Understudy) score is commonly used in natural language processing (NLP) and machine\n",
"translation. It is widely used in text summarization and text generation use cases. It evaluates how closely the\n",
"generated text matches the reference text. The BLEU score ranges from 0 to 1, with higher scores indicating\n",
"better quality."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1733832657177
}
},
"outputs": [],
"source": [
"from azure.ai.evaluation import BleuScoreEvaluator\n",
"\n",
"bleu = BleuScoreEvaluator()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1733832659756
}
},
"outputs": [],
"source": [
"result = bleu(response=\"London is the capital of England.\", ground_truth=\"The capital of England is London.\")\n",
"\n",
"print(result)"
]
},
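{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional illustration, the sketch below computes plain unigram and bigram precision for the example pair above. This is the kind of n-gram overlap that BLEU aggregates; the evaluator additionally applies clipping, geometric averaging, and a brevity penalty, so these numbers will not match the BLEU score exactly."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative only: raw n-gram precision, not a reimplementation of the evaluator.\n",
"def ngram_precision(response, ground_truth, n=1):\n",
"    resp_tokens = response.lower().replace(\".\", \"\").split()\n",
"    ref_tokens = ground_truth.lower().replace(\".\", \"\").split()\n",
"    resp_ngrams = [tuple(resp_tokens[i : i + n]) for i in range(len(resp_tokens) - n + 1)]\n",
"    ref_ngrams = [tuple(ref_tokens[i : i + n]) for i in range(len(ref_tokens) - n + 1)]\n",
"    matches = sum(1 for g in resp_ngrams if g in ref_ngrams)\n",
"    return matches / len(resp_ngrams) if resp_ngrams else 0.0\n",
"\n",
"example_response = \"London is the capital of England.\"\n",
"example_ground_truth = \"The capital of England is London.\"\n",
"print(\"unigram precision:\", ngram_precision(example_response, example_ground_truth, n=1))\n",
"print(\"bigram precision:\", ngram_precision(example_response, example_ground_truth, n=2))"
]
},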
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### GleuScoreEvaluator\n",
"\n",
"The GLEU (Google-BLEU) score evaluator measures the similarity between generated and reference texts by\n",
"evaluating n-gram overlap, considering both precision and recall. This balanced evaluation, designed for\n",
"sentence-level assessment, makes it ideal for detailed analysis of translation quality. GLEU is well-suited for\n",
"use cases such as machine translation, text summarization, and text generation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1733832671954
}
},
"outputs": [],
"source": [
"from azure.ai.evaluation import GleuScoreEvaluator\n",
"\n",
"gleu = GleuScoreEvaluator()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1733832673153
}
},
"outputs": [],
"source": [
"result = gleu(response=\"London is the capital of England.\", ground_truth=\"The capital of England is London.\")\n",
"\n",
"print(result)"
]
},
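{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, compare how GLEU behaves on an exact match versus the word-order paraphrase used above. This reuses the `gleu` instance already created; the exact values are illustrative and may vary with the installed package version."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative comparison: an exact match vs. a word-order paraphrase.\n",
"examples = [\n",
"    (\"The capital of England is London.\", \"The capital of England is London.\"),\n",
"    (\"London is the capital of England.\", \"The capital of England is London.\"),\n",
"]\n",
"for example_response, example_ground_truth in examples:\n",
"    print(example_response, \"->\", gleu(response=example_response, ground_truth=example_ground_truth))"
]
},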
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### MeteorScoreEvaluator\n",
"\n",
"The METEOR (Metric for Evaluation of Translation with Explicit Ordering) score grader evaluates generated text by\n",
"comparing it to reference texts, focusing on precision, recall, and content alignment. It addresses limitations of\n",
"other metrics like BLEU by considering synonyms, stemming, and paraphrasing. METEOR score considers synonyms and\n",
"word stems to more accurately capture meaning and language variations. In addition to machine translation and\n",
"text summarization, paraphrase detection is an optimal use case for the METEOR score."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1733832676855
}
},
"outputs": [],
"source": [
"from azure.ai.evaluation import MeteorScoreEvaluator\n",
"\n",
"meteor = MeteorScoreEvaluator(alpha=0.9, beta=3.0, gamma=0.5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1733832678928
}
},
"outputs": [],
"source": [
"result = meteor(response=\"London is the capital of England.\", ground_truth=\"The capital of England is London.\")\n",
"\n",
"print(result)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### RougeScoreEvaluator\n",
"\n",
"ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate automatic\n",
"summarization and machine translation. It measures the overlap between generated text and reference summaries.\n",
"ROUGE focuses on recall-oriented measures to assess how well the generated text covers the reference text. Text\n",
"summarization and document comparison are among optimal use cases for ROUGE, particularly in scenarios where text\n",
"coherence and relevance are critical.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1733832695388
}
},
"outputs": [],
"source": [
"from azure.ai.evaluation import RougeScoreEvaluator, RougeType\n",
"\n",
"rouge = RougeScoreEvaluator(rouge_type=RougeType.ROUGE_1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1733832715092
}
},
"outputs": [],
"source": [
"result = rouge(response=\"London is the capital of England.\", ground_truth=\"The capital of England is London.\")\n",
"\n",
"print(result)"
]
},
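{
"cell_type": "markdown",
"metadata": {},
"source": [
"ROUGE comes in several variants: ROUGE-N measures n-gram overlap, while ROUGE-L is based on the longest common subsequence. The optional sketch below scores the same pair with two more variants; `RougeType.ROUGE_2` and `RougeType.ROUGE_L` are assumed to be available in the installed `azure-ai-evaluation` version."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative only: compare bigram (ROUGE-2) and longest-common-subsequence (ROUGE-L) variants.\n",
"for rouge_type in (RougeType.ROUGE_2, RougeType.ROUGE_L):\n",
"    evaluator = RougeScoreEvaluator(rouge_type=rouge_type)\n",
"    print(rouge_type, evaluator(response=\"London is the capital of England.\", ground_truth=\"The capital of England is London.\"))"
]
},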
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluate a Dataset using Math Evaluators\n",
"\n",
"The code below uses the Evaluate API with BLEU, GLEU, METEOR, and ROUGE evaluators to evaluate the results on a dataset."
]
},
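{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before running the evaluation, you can preview the dataset. Each line of `nlp_data.jsonl` is expected to be a JSON object whose columns match the evaluator inputs (here, `response` and `ground_truth`)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Preview the first few rows of the dataset; the evaluators above expect\n",
"# \"response\" and \"ground_truth\" columns.\n",
"with open(\"nlp_data.jsonl\") as f:\n",
"    for line in list(f)[:3]:\n",
"        print(line.strip())"
]
},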
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1733832746096
}
},
"outputs": [],
"source": [
"from azure.ai.evaluation import evaluate\n",
"import random\n",
"\n",
"randomNum = random.randint(1111, 9999)\n",
"result = evaluate(\n",
" data=\"nlp_data.jsonl\",\n",
" evaluation_name=\"NLP-demo-\" + str(randomNum),\n",
" evaluators={\n",
" \"bleu\": bleu,\n",
" \"gleu\": gleu,\n",
" \"meteor\": meteor,\n",
" \"rouge\": rouge,\n",
" },\n",
" # Optionally provide your AI Studio project information to track your evaluation results in your Azure AI Studio project\n",
" azure_ai_project = {\n",
" \"subscription_id\": os.environ.get(\"AZURE_SUBSCRIPTION_ID\"),\n",
" \"resource_group_name\": os.environ.get(\"AZURE_RESOURCE_GROUP\"),\n",
" \"project_name\": os.environ.get(\"AZURE_PROJECT_NAME\"),\n",
" },\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"View the results"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"gather": {
"logged": 1733832746266
}
},
"outputs": [],
"source": [
"from pprint import pprint\n",
"\n",
"pprint(result)"
]
}
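,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, view the row-level scores as a table. This assumes `pandas` is available and that the result dictionary exposes per-row scores under a `rows` key and aggregate scores under `metrics`; these key names may vary by SDK version."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Aggregate metrics and per-row scores; the key names are assumed from the current SDK output shape.\n",
"pprint(result.get(\"metrics\", {}))\n",
"pd.DataFrame(result.get(\"rows\", []))"
]
}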
],
"metadata": {
"kernel_info": {
"name": "python310-sdkv2"
},
"kernelspec": {
"display_name": "aoai_eval",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
},
"microsoft": {
"host": {
"AzureML": {
"notebookHasBeenCompleted": true
}
},
"ms_spell_check": {
"ms_spell_check_language": "en"
}
},
"nteract": {
"version": "nteract-front-end@1.0.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}