# Custom Evaluator with the Azure AI Evaluation SDK
This sample shows the basic way to create a custom evaluator for testing a generative AI application in your development environment with the Azure AI Evaluation SDK.

> ✨ ***Note*** <br>
> Please check the reference document before you get started - https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk

## 🔨 Current Support and Limitations (as of 2025-01-14)
- Check the region support for the Azure AI Evaluation SDK. https://learn.microsoft.com/en-us/azure/ai-studio/concepts/evaluation-metrics-built-in?tabs=warning#region-support

### Region support for evaluations
| Region              | Hate and Unfairness, Sexual, Violent, Self-Harm, XPIA, ECI (Text) | Groundedness (Text) | Protected Material (Text) | Hate and Unfairness, Sexual, Violent, Self-Harm, Protected Material (Image) |
|---------------------|------------------------------------------------------------------|---------------------|----------------------------|----------------------------------------------------------------------------|
| North Central US    | no                                                               | no                  | no                         | yes                                                                        |
| East US 2           | yes                                                              | yes                 | yes                        | yes                                                                        |
| Sweden Central      | yes                                                              | yes                 | yes                        | yes                                                                        |
| US North Central    | yes                                                              | no                  | yes                        | yes                                                                        |
| France Central      | yes                                                              | yes                 | yes                        | yes                                                                        |
| Switzerland West    | yes                                                              | no                  | no                         | yes                                                                        |

### Region support for adversarial simulation
| Region            | Adversarial Simulation (Text) | Adversarial Simulation (Image) |
|-------------------|-------------------------------|---------------------------------|
| UK South          | yes                           | no                              |
| East US 2         | yes                           | yes                             |
| Sweden Central    | yes                           | yes                             |
| US North Central  | yes                           | yes                             |
| France Central    | yes                           | no                              |


## ✔️ Pricing and billing
- As of January 14, 2025, Azure AI Safety Evaluations are no longer free in public preview. They are billed by consumption as follows:

| Service Name              | Safety Evaluations       | Price Per 1K Tokens (USD) |
|---------------------------|--------------------------|---------------------------|
| Azure Machine Learning    | Input pricing for 3P     | $0.02                     |
| Azure Machine Learning    | Output pricing for 3P    | $0.06                     |
| Azure Machine Learning    | Input pricing for 1P     | $0.012                    |
| Azure Machine Learning    | Output pricing for 1P    | $0.012                    |
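
To make the per-token rates concrete, here is a minimal sketch that estimates the cost of a run from token counts. The rates are copied from the table above; `estimate_safety_eval_cost` and the `tier` labels are hypothetical names used only for illustration, and actual billing may differ from this simple arithmetic.

```python
from decimal import Decimal

# USD per 1K tokens, copied from the pricing table above.
PRICE_PER_1K_TOKENS = {
    ("3P", "input"): Decimal("0.02"),
    ("3P", "output"): Decimal("0.06"),
    ("1P", "input"): Decimal("0.012"),
    ("1P", "output"): Decimal("0.012"),
}


def estimate_safety_eval_cost(input_tokens: int, output_tokens: int, tier: str = "3P") -> Decimal:
    """Estimate the cost in USD for the given token counts (hypothetical helper)."""
    input_cost = PRICE_PER_1K_TOKENS[(tier, "input")] * input_tokens / 1000
    output_cost = PRICE_PER_1K_TOKENS[(tier, "output")] * output_tokens / 1000
    return input_cost + output_cost


# 10,000 input tokens and 5,000 output tokens at the 3P rates
print(estimate_safety_eval_cost(10_000, 5_000, tier="3P"))
```

`Decimal` is used instead of `float` so that small prices do not pick up rounding noise when they are summed.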


In [2]:
import json
import os
import pathlib
from pprint import pprint

import pandas as pd
from azure.ai.evaluation import (
    CoherenceEvaluator,
    ContentSafetyEvaluator,
    F1ScoreEvaluator,
    FluencyEvaluator,
    GroundednessEvaluator,
    GroundednessProEvaluator,
    RelevanceEvaluator,
    RetrievalEvaluator,
    SimilarityEvaluator,
    evaluate,
)
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import (
    ApplicationInsightsConfiguration,
    ConnectionType,
    Dataset,
    Evaluation,
    EvaluationSchedule,
    EvaluatorConfiguration,
    RecurrenceTrigger,
)
from azure.identity import DefaultAzureCredential
from dotenv import load_dotenv



load_dotenv(override=True)

True

In [5]:
credential = DefaultAzureCredential()

# The connection string is ";"-separated: index 1 is the subscription ID,
# index 2 the resource group, and index 3 the project name.
# os.environ[...] raises KeyError right away if the variable is not set.
azure_ai_project_conn_str = os.environ["AZURE_AI_PROJECT_CONN_STR"]
conn_str_parts = azure_ai_project_conn_str.split(";")
subscription_id = conn_str_parts[1]
resource_group_name = conn_str_parts[2]
project_name = conn_str_parts[3]

azure_ai_project_dict = {
    "subscription_id": subscription_id,
    "resource_group_name": resource_group_name,
    "project_name": project_name,
}

# Reuse the credential created above instead of constructing a second one
azure_ai_project_client = AIProjectClient.from_connection_string(
    credential=credential,
    conn_str=azure_ai_project_conn_str
)


model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "api_key": os.environ.get("AZURE_OPENAI_API_KEY"),
    "azure_deployment": os.environ.get("AZURE_OPENAI_CHAT_DEPLOYMENT_NAME"),
    "api_version": os.environ.get("AZURE_OPENAI_API_VERSION"),
    "type": "azure_openai",
}
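
Because `os.environ.get` returns `None` for unset variables, a missing entry in the `.env` file would only surface later, when an evaluator is first called. A minimal sketch of an up-front check follows; `missing_settings` is a hypothetical helper (not part of the SDK), and the dictionary uses placeholder values rather than real settings.

```python
def missing_settings(config: dict) -> list:
    """Return the keys whose values are None or empty (hypothetical helper)."""
    return [key for key, value in config.items() if not value]


# Placeholder values for illustration only; "api_key" is deliberately unset.
example_config = {
    "azure_endpoint": "https://example.openai.azure.com",
    "api_key": None,
    "azure_deployment": "example-deployment",
    "type": "azure_openai",
}

unset = missing_settings(example_config)
if unset:
    print(f"Missing model settings: {unset}")
```

In the notebook, the same check could be run as `missing_settings(model_config)` right after the dictionary is built.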

In [None]:
input_path = "data/sythetic_evaluation_data.jsonl"
output_path = "data/custom_evaluation_output.json"


# https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/flow-evaluate-sdk
retrieval_evaluator = RetrievalEvaluator(model_config)
fluency_evaluator = FluencyEvaluator(model_config)
groundedness_evaluator = GroundednessEvaluator(model_config)
relevance_evaluator = RelevanceEvaluator(model_config)
coherence_evaluator = CoherenceEvaluator(model_config)
similarity_evaluator = SimilarityEvaluator(model_config)

column_mapping = {
    "query": "${data.query}",
    "ground_truth": "${data.ground_truth}",
    "response": "${data.response}",
    "context": "${data.context}",
}
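
Each `${data.<column>}` reference in `column_mapping` has to resolve to a field in the input JSONL file. A minimal sketch of a pre-flight check that reports rows lacking a mapped column; `referenced_columns` and `rows_missing_columns` are hypothetical helpers, and the two sample rows are made up for illustration.

```python
import json
import re

# Matches references of the form ${data.<column>}
_DATA_REF = re.compile(r"^\$\{data\.([^}]+)\}$")


def referenced_columns(column_mapping: dict) -> set:
    """Collect the dataset column names referenced as ${data.<column>}."""
    columns = set()
    for reference in column_mapping.values():
        match = _DATA_REF.match(reference)
        if match:
            columns.add(match.group(1))
    return columns


def rows_missing_columns(jsonl_lines, column_mapping):
    """Yield (line_number, missing_columns) for rows that lack a mapped column."""
    required = referenced_columns(column_mapping)
    for line_number, line in enumerate(jsonl_lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        missing = required - set(json.loads(line))
        if missing:
            yield line_number, sorted(missing)


column_mapping = {
    "query": "${data.query}",
    "ground_truth": "${data.ground_truth}",
    "response": "${data.response}",
    "context": "${data.context}",
}

# Made-up rows: the second one has no "context" field.
sample_rows = [
    '{"query": "q", "ground_truth": "g", "response": "r", "context": "c"}',
    '{"query": "q", "ground_truth": "g", "response": "r"}',
]
print(list(rows_missing_columns(sample_rows, column_mapping)))
```

For a real dataset, pass an open file handle for `input_path` instead of `sample_rows`.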

## 🧪 AI-assisted Groundedness evaluator
- The prompt-based Groundedness evaluator uses your own model deployment to output a score and an explanation for that score. It is currently supported in all regions.
- The Groundedness Pro evaluator uses the Azure AI Content Safety service (AACS) through its integration with Azure AI Foundry evaluations. No deployment is required, because a back-end service provides the models that output a score and reasoning. Groundedness Pro is currently supported in the East US 2 and Sweden Central regions.
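
A minimal sketch of how the Groundedness Pro evaluator could be set up, assuming the project lives in one of the two regions listed above. The region set is copied from the bullet; `supports_groundedness_pro` and `build_groundedness_pro_evaluator` are hypothetical helpers, and the SDK import is deferred so the region check runs even without the SDK installed.

```python
# Regions listed above as supporting Groundedness Pro, in normalized form.
GROUNDEDNESS_PRO_REGIONS = {"eastus2", "swedencentral"}


def supports_groundedness_pro(region: str) -> bool:
    """Check a region name such as 'East US 2' or 'eastus2' (hypothetical helper)."""
    return region.replace(" ", "").lower() in GROUNDEDNESS_PRO_REGIONS


def build_groundedness_pro_evaluator(azure_ai_project: dict, credential):
    """Create the evaluator from project details and a credential; no model_config is involved."""
    # Imported lazily so the region helper above works without the SDK.
    from azure.ai.evaluation import GroundednessProEvaluator

    return GroundednessProEvaluator(azure_ai_project=azure_ai_project, credential=credential)


print(supports_groundedness_pro("Sweden Central"))
```

With the variables defined earlier in this notebook, the call would look like `build_groundedness_pro_evaluator(azure_ai_project_dict, credential)`.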

In [None]:

# Initializing the prompt-based Groundedness evaluator
groundedness_eval = GroundednessEvaluator(model_config)
# GroundednessProEvaluator does not take a model_config; it uses the Azure AI project and a credential instead

query_response = dict(
    query="Which tent is the most waterproof?", # optional
    context="The Alpine Explorer Tent is the most water-proof of all tents available.",
    response="The Alpine Explorer Tent is the most waterproof."
)


# Korean-language variant of the same example (English glosses in the trailing comments)
# query_response = dict(
#     query="어떤 텐트가 방수 기능이 있어?", # optional; "Which tent is waterproof?"
#     context="알파인 익스플로러 텐트가 모든 텐트 중 가장 방수 기능이 뛰어남",  # "The Alpine Explorer Tent is the most waterproof of all tents"
#     response="알파인 익스플로러 텐트가 방수 기능이 있습니다."  # "The Alpine Explorer Tent is waterproof."
# )

# Running Groundedness Evaluator on a query and response pair
groundedness_score = groundedness_eval(
    **query_response
)
print(groundedness_score)

## 🧪 Customize prebuilt GroundednessEvaluator


In [None]:

import os
from typing_extensions import override


# The prebuilt evaluators are not designed to output boolean results, so use numbers to represent them:
# 1 for True and 0 for False


class CustomGroundednessEvaluator(GroundednessEvaluator):
    """
    Evaluates groundedness score for a given query (optional), response, and context or a multi-turn conversation,
    including reasoning.

    The groundedness measure assesses the correspondence between claims in an AI-generated answer and the source
    context, making sure that these claims are substantiated by the context. Even if the responses from LLM are
    factually correct, they'll be considered ungrounded if they can't be verified against the provided sources
    (such as your input source or your database). Use the groundedness metric when you need to verify that
    AI-generated responses align with and are validated by the provided context.

    Groundedness scores range from 0.0 to 1.0, with 0.0 being the least grounded and 1.0 being the most grounded.

    :param model_config: Configuration for the Azure OpenAI model.
    :type model_config: Union[~azure.ai.evaluation.AzureOpenAIModelConfiguration,
        ~azure.ai.evaluation.OpenAIModelConfiguration]

    .. admonition:: Example:

        .. literalinclude:: ../samples/evaluation_samples_evaluate.py
            :start-after: [START groundedness_evaluator]
            :end-before: [END groundedness_evaluator]
            :language: python
            :dedent: 8
            :caption: Initialize and call a GroundednessEvaluator.

    .. note::

        To align with our support of a diverse set of models, an output key without the `gpt_` prefix has been added.
        To maintain backwards compatibility, the old key with the `gpt_` prefix is still present in the output;
        however, it is recommended to use the new key moving forward as the old key will be deprecated in the future.
    """
  
    # Point both prompty paths at the custom file, because the parent's __call__ method still reads these attributes
    current_dir = os.getcwd()
    _PROMPTY_FILE_NO_QUERY = os.path.join(current_dir, "custom-groundedness.prompty")
    _PROMPTY_FILE_WITH_QUERY = os.path.join(current_dir, "custom-groundedness.prompty")

    @override
    def __init__(self, model_config):
        # Run GroundednessEvaluator's own initializer first so that its state is set up
        super().__init__(model_config)
        # Then call the initializer of the class above GroundednessEvaluator directly,
        # passing the custom prompty file and the custom result key
        prompty_path = os.path.join(os.getcwd(), "custom-groundedness.prompty")
        super(GroundednessEvaluator, self).__init__(
            model_config=model_config,
            prompty_file=prompty_path,
            result_key="custom-groundedness",
        )
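
Since the score stands in for a boolean (1 for True, 0 for False), downstream code may want to turn it back into one. A minimal sketch, assuming the result dictionary exposes the score under the `custom-groundedness` key (the `result_key` passed above) and that the custom prompty only ever emits 0 or 1; `as_boolean` is a hypothetical helper.

```python
def as_boolean(result: dict, key: str = "custom-groundedness") -> bool:
    """Map a 0/1 score back to False/True, rejecting anything else (hypothetical helper)."""
    score = result[key]
    if score not in (0, 1):
        # Covers unexpected values such as 0.5 or NaN
        raise ValueError(f"Expected 0 or 1 for {key!r}, got {score!r}")
    return bool(score)


# Made-up result dictionaries for illustration
print(as_boolean({"custom-groundedness": 1.0}))
print(as_boolean({"custom-groundedness": 0.0}))
```

Raising on unexpected values keeps a malformed model answer from being silently counted as grounded or ungrounded.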

In [None]:
# Initializing the custom groundedness evaluator
custom_groundedness_eval = CustomGroundednessEvaluator(model_config)

query_response = dict(
    query="Which tent is the most waterproof?", # optional
    context="The Alpine Explorer Tent is the most water-proof of all tents available.",
    response="The Alpine Explorer Tent is the most waterproof."
)


# Korean-language variant of the same example (English glosses in the trailing comments)
# query_response = dict(
#     query="어떤 텐트가 방수 기능이 있어?", # optional; "Which tent is waterproof?"
#     context="알파인 익스플로러 텐트가 모든 텐트 중 가장 방수 기능이 뛰어남",  # "The Alpine Explorer Tent is the most waterproof of all tents"
#     response="알파인 익스플로러 텐트가 방수 기능이 있습니다."  # "The Alpine Explorer Tent is waterproof."
# )

# Running the custom groundedness evaluator on a query and response pair
custom_groundedness_score = custom_groundedness_eval(
    **query_response
)
print(custom_groundedness_score)

## 🧪 Customize prebuilt RetrievalEvaluator

In [113]:
input_path = "./data/queries_responses_ada2_hybrid.jsonl"

# Use the document content of the first three records as the retrieved context
with open(input_path, "r", encoding="utf-8") as file:
    context_list = [json.loads(next(file))["document_content"] for _ in range(3)]

# English: "What is new and different about the <갤러그 S24 series>?"
query = "<갤러그 S24 시리즈>의 새로운 점과 다른 점은 무엇인가요?"
# context = "\n ".join(context_list)
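
The cell above calls `next(file)` exactly three times, so it fails if the file holds fewer than three records. A minimal sketch of a more forgiving reader built on `itertools.islice`; `first_document_contents` is a hypothetical helper, and the in-memory sample stands in for the real JSONL file.

```python
import io
import json
from itertools import islice


def first_document_contents(lines, limit=3, field="document_content"):
    """Return `field` from up to `limit` non-empty JSONL lines (hypothetical helper)."""
    non_empty = (line for line in lines if line.strip())
    return [json.loads(line)[field] for line in islice(non_empty, limit)]


# Made-up sample with only two records; a real run would pass an open file handle.
sample = io.StringIO(
    '{"document_content": "chunk A"}\n'
    '{"document_content": "chunk B"}\n'
)
print(first_document_contents(sample))
```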

In [115]:
from CustomRetrievalEvaluator._custom_retrieval import CustomRetrievalEvaluator

custom_retrieval_eval = CustomRetrievalEvaluator(model_config)

query_response = dict(
    query=query, 
    context=context_list
)

# Running the custom retrieval evaluator on a query and its retrieved context
retrieval_score = custom_retrieval_eval(
    **query_response
)
print(retrieval_score)

{'custom-retrieval': 5.0, 'gpt_custom-retrieval': 5.0, 'custom-retrieval_reason': 'The context chunks are highly relevant to the query, providing detailed information about the new features and differences of the 갤러그 S24 series. The most relevant information is presented at the top, making it easy for the reader to find the answers they are looking for.'}


## 🧪 Create New Custom Evaluator


In [None]:
import json
import os

from promptflow.client import load_flow


class FriendlinessEvaluator:
    """Prompt-based evaluator that runs friendliness.prompty against a response."""

    def __init__(self, model_config):
        current_dir = os.getcwd()
        prompty_path = os.path.join(current_dir, "friendliness.prompty")
        self._flow = load_flow(source=prompty_path, model={"configuration": model_config})

    def __call__(self, *, response: str, **kwargs):
        llm_response = self._flow(response=response)
        try:
            # If the model output is valid JSON, return the parsed value
            result = json.loads(llm_response)
        except Exception:
            # Otherwise fall back to the raw model output
            result = llm_response
        return result

In [None]:
friendliness_eval = FriendlinessEvaluator(model_config)

friendliness_score = friendliness_eval(response="I will not apologize for my behavior!")
print(friendliness_score)

In [None]:
friendliness_eval = FriendlinessEvaluator(model_config)

friendliness_score = friendliness_eval(response="I love you!")
print(friendliness_score)
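
Not every custom evaluator needs a prompty file or a model call. A minimal sketch of a code-based evaluator that follows the same calling convention as `FriendlinessEvaluator` (keyword-only inputs, a returned result); `ResponseLengthEvaluator` and its `response_length` key are hypothetical names chosen for illustration.

```python
class ResponseLengthEvaluator:
    """Code-based evaluator: a deterministic measurement with no model call."""

    def __call__(self, *, response: str, **kwargs):
        # Return a dictionary so the metric has a named key in the result
        return {"response_length": len(response)}


response_length_eval = ResponseLengthEvaluator()
print(response_length_eval(response="I love you!"))
```

Because it is an ordinary callable, an evaluator like this can be unit-tested locally before it is combined with the prompt-based evaluators.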