# Evaluating Model Groundedness with Azure AI Evaluation SDK

This notebook simulates query and response pairs against a model endpoint and evaluates their groundedness using the Azure AI Evaluation SDK. Groundedness is the extent to which a model's responses are supported by reliable, verifiable information, such as the context supplied with the query. Checking that a model's outputs are grounded is crucial for maintaining the accuracy and trustworthiness of AI systems.

In this notebook, we will:

1. Set up the Azure AI Evaluation SDK.
2. Define the dataset for evaluating groundedness, which will vary based on the specific use case of your model.
3. Simulate the model endpoint and generate responses.
4. Evaluate the groundedness of the model's responses using the Azure AI Evaluation SDK.

The dataset used for evaluating groundedness will be tailored to the particular application of your model. For instance, if your model is designed for customer support, the dataset might consist of common customer queries and the corresponding accurate responses. If your model is used for medical diagnosis, the dataset would include medical cases and verified diagnostic information.

By the end of this notebook, you will have a clear understanding of how to assess the groundedness of your model's outputs and ensure that they are based on solid and reliable information.

This tutorial uses the following Azure AI services:

- [azure-ai-evaluation](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk)

## Time

You should expect to spend 30 minutes running this sample. 

## About this example

This example demonstrates how to evaluate a model endpoint's responses to provided prompts using `azure-ai-evaluation`.

## Before you begin

### Installation

Install the following packages required to execute this notebook. 


In [None]:
%pip install azure-ai-evaluation --upgrade
%pip install promptflow-azure
%pip install azure-identity
%pip install openai python-dotenv

### Parameters and imports

Here we import the required modules, load credentials from `../.credentials.env`, and define the Azure AI project and model configuration used in the rest of the notebook.
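
The configuration cells that follow read several environment variables. As a minimal sketch, a helper such as the hypothetical `find_missing_env_vars` below can report unset variables before any client is created. The variable names are the ones used in this notebook; the helper itself is an illustration and is not part of the SDK.

```python
import os
from typing import Iterable, List, Mapping, Optional

# Variable names taken from the configuration cells in this notebook.
REQUIRED_ENV_VARS = (
    "AZURE_SUBSCRIPTION_ID",
    "AZURE_RESOURCE_GROUP",
    "AZURE_PROJECT_NAME",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_DEPLOYMENT",
    "AZURE_OPENAI_API_VERSION",
)


def find_missing_env_vars(
    names: Iterable[str] = REQUIRED_ENV_VARS,
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Hypothetical usage: report problems early instead of failing inside a client call.
missing = find_missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Running this right after `load_dotenv` makes a misnamed or empty entry in the credentials file visible immediately.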

In [2]:
import importlib.resources as pkg_resources
import json
import os
from pathlib import Path
from typing import Any, Dict, List, Optional

from azure.ai.evaluation import GroundednessEvaluator, GroundednessProEvaluator, evaluate
from azure.ai.evaluation.simulator import Simulator
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

In [None]:
import os
from dotenv import load_dotenv
load_dotenv("../.credentials.env")

In [4]:
azure_ai_project = {
    "subscription_id": os.environ.get("AZURE_SUBSCRIPTION_ID"),
    "resource_group_name": os.environ.get("AZURE_RESOURCE_GROUP"),
    "project_name": os.environ.get("AZURE_PROJECT_NAME"),
}

model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "api_key": os.environ.get("AZURE_OPENAI_API_KEY"),
    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
    "api_version": os.environ.get("AZURE_OPENAI_API_VERSION"),
}

In [None]:
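print(azure_ai_project)
# Mask the API key so it does not end up in the notebook output
print({key: ("***" if key == "api_key" else value) for key, value in model_config.items()})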

## Data
Here we load the data file, `grounding.json`, from which we will simulate query and response pairs to evaluate the groundedness of our model's responses. The data you use to evaluate groundedness may differ depending on your model's use case.
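
The cell below keeps only the first two items of the data and wraps each one in its own list, so that every item starts a separate conversation. The same selection can be sketched with `itertools.islice`. The helper name `first_n_as_turns` and the sample strings are hypothetical stand-ins; the actual contents of `grounding.json` are not shown here.

```python
from itertools import islice
from typing import Any, Iterable, List


def first_n_as_turns(items: Iterable[Any], n: int) -> List[List[Any]]:
    """Wrap each of the first `n` items in its own single-item conversation."""
    return [[item] for item in islice(items, n)]


# Hypothetical stand-in for the parsed contents of grounding.json.
sample_data = [
    "What is the return policy?",
    "How do I reset my password?",
    "Where is my order?",
]

sample_turns = first_n_as_turns(sample_data, 2)
print(sample_turns)
```

Raising `n` increases the number of simulated conversations, and with it the number of model calls.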

In [6]:
resource_name = "grounding.json"
package = "azure.ai.evaluation.simulator._data_sources"
conversation_turns = []

# Load the sample grounding data that ships with the SDK
with pkg_resources.path(package, resource_name) as grounding_file, grounding_file.open("r") as file:
    data = json.load(file)

# Keep only the first two items, each as its own conversation
for item in data:
    conversation_turns.append([item])
    if len(conversation_turns) == 2:
        break

## Target Endpoint

We will use the Evaluate API provided by the Azure AI Evaluation SDK. It requires a target application or Python function that handles the call to the LLM and retrieves the response.
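
The target function below passes the grounding context to the model through the system message. As a minimal sketch of that step in isolation, the hypothetical helper `build_messages` assembles the chat messages without calling any service. Replacing a missing context with an empty string is an assumption made for this sketch; the target function in this notebook interpolates the value as-is.

```python
from typing import Dict, List, Optional

# Same wording as the system prompt used by the target function in this notebook.
SYSTEM_TEMPLATE = (
    "You are a user assistant who helps answer questions based on some context."
    "\n\nContext: '{context}'"
)


def build_messages(query: str, context: Optional[str]) -> List[Dict[str, str]]:
    """Build the system and user messages for a single grounded query."""
    # Assumption for this sketch: a missing context becomes an empty string.
    system_content = SYSTEM_TEMPLATE.format(context=context if context is not None else "")
    return [
        {"role": "system", "content": system_content},
        {"role": "user", "content": query},
    ]


example_messages = build_messages(
    query="What is the capital of France?",
    context="Paris is the capital of France.",
)
for message in example_messages:
    print(message["role"], "->", message["content"])
```

Keeping message construction separate from the model call makes it possible to check the prompt in a unit test.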

In [7]:
def example_application_response(query: str, context: Optional[str]) -> str:
    deployment = os.environ.get("AZURE_OPENAI_DEPLOYMENT")
    endpoint = os.environ.get("AZURE_OPENAI_ENDPOINT")
    token_provider = get_bearer_token_provider(DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default")

    # Get a client handle for the AOAI model
    client = AzureOpenAI(
        azure_endpoint=endpoint,
        api_version=os.environ.get("AZURE_OPENAI_API_VERSION"),
        azure_ad_token_provider=token_provider,
    )

    # Prepare the messages
    messages = [
        {
            "role": "system",
            "content": f"You are a user assistant who helps answer questions based on some context.\n\nContext: '{context}'",
        },
        {"role": "user", "content": query},
    ]
    # Call the model
    completion = client.chat.completions.create(
        model=deployment,
        messages=messages,
        max_tokens=800,
        temperature=0.7,
        top_p=0.95,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None,
        stream=False,
    )

    message = completion.to_dict()["choices"][0]["message"]
    if isinstance(message, dict):
        message = message["content"]
    return message

## Run the simulator

The interactions between your endpoint (in this case, `example_application_response`) and the simulator are managed by a callback method, `custom_simulator_callback`, which formats the request sent to your endpoint and the response it returns.
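
To illustrate the shape of the data that flows through such a callback, here is a minimal sketch that swaps the real endpoint for a hypothetical stub, `stub_application_response`, so it runs without any Azure resources. The request dictionary is hand-written for illustration and only mirrors the fields the callback in this notebook reads.

```python
import asyncio
from typing import Any, Dict, List, Optional


def stub_application_response(query: str, context: Optional[str]) -> str:
    """Hypothetical stand-in for the real endpoint; no model is called."""
    return f"(stub) answer to: {query}"


async def stub_callback(
    messages: Dict[str, List[Dict[str, Any]]],
    stream: bool = False,
    session_state: Optional[str] = None,
    context: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    # Read the latest user message and the context attached to it
    latest_message = messages["messages"][-1]
    turn_context = latest_message.get("context")
    reply = {
        "role": "assistant",
        "content": stub_application_response(latest_message["content"], turn_context),
        "context": turn_context,
    }
    # Append the reply so the conversation history stays complete
    messages["messages"].append(reply)
    return {
        "messages": messages["messages"],
        "stream": stream,
        "session_state": session_state,
        "context": turn_context,
    }


# Hand-written request for illustration only.
request = {
    "messages": [
        {
            "role": "user",
            "content": "What is the capital of France?",
            "context": "Paris is the capital of France.",
        }
    ]
}

# In a notebook, use `await stub_callback(request)` instead of asyncio.run.
result = asyncio.run(stub_callback(request))
print(result["messages"][-1])
```

The context is carried over onto the assistant message so that it is still available when the conversation is later converted for evaluation.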

In [8]:
async def custom_simulator_callback(
    messages: Dict[str, List[Dict]],
    stream: bool = False,
    session_state: Optional[str] = None,
    context: Optional[Dict[str, Any]] = None,
) -> dict:
    messages_list = messages["messages"]
    # get last message
    latest_message = messages_list[-1]
    application_input = latest_message["content"]
    context = latest_message.get("context", None)
    # call your endpoint or ai application here
    response = example_application_response(query=application_input, context=context)
    # format the response to follow the OpenAI chat protocol
    message = {
        "content": response,
        "role": "assistant",
        "context": context,
    }
    messages["messages"].append(message)
    return {"messages": messages["messages"], "stream": stream, "session_state": session_state, "context": context}

In [None]:
custom_simulator = Simulator(model_config=model_config)

In [None]:
outputs = await custom_simulator(
    target=custom_simulator_callback,
    conversation_turns=conversation_turns,
    max_conversation_turns=1,
    concurrent_async_tasks=10,
)

### Convert the outputs to a format that can be evaluated

In [11]:
output_file = "ground_sim_output.jsonl"
with open(output_file, "w") as file:
    for output in outputs:
        file.write(output.to_eval_qr_json_lines())
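
Before running the evaluation, it can help to confirm that every line of the JSONL output parses as JSON and carries the fields the evaluators read. The sketch below works on an in-memory string. The key names `query`, `response`, and `context` are assumptions about the output format, and the sample records are made up for illustration.

```python
import json
from typing import Dict, Iterable, List

# Assumed field names; adjust them to match the lines in your output file.
REQUIRED_KEYS = ("query", "response", "context")


def load_jsonl(text: str) -> List[Dict]:
    """Parse JSON Lines text into a list of dictionaries, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def rows_missing_keys(rows: List[Dict], required: Iterable[str] = REQUIRED_KEYS) -> List[int]:
    """Return the indices of rows that lack at least one required key."""
    required = tuple(required)
    return [index for index, row in enumerate(rows) if any(key not in row for key in required)]


# Made-up records for illustration; the second one has no context.
sample_text = (
    json.dumps({"query": "Q1", "response": "R1", "context": "C1"}) + "\n"
    + json.dumps({"query": "Q2", "response": "R2"}) + "\n"
)

sample_rows = load_jsonl(sample_text)
print("Rows with missing fields:", rows_missing_keys(sample_rows))
```

To check the real file, pass the result of `Path(output_file).read_text()` to `load_jsonl`.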


## Run the evaluation

In this section, we will run the evaluation using the `GroundednessEvaluator`, the `GroundednessProEvaluator`, and the `evaluate` function from the Azure AI Evaluation SDK. The evaluation assesses the groundedness of the model's responses in the dataset produced by the `Simulator` above.

In [None]:
from azure.identity import DefaultAzureCredential
credential = DefaultAzureCredential()

groundedness_evaluator = GroundednessEvaluator(model_config=model_config)
groundedness_pro_evaluator = GroundednessProEvaluator(azure_ai_project=azure_ai_project, credential=credential)

eval_output = evaluate(
    data=output_file,
    evaluators={
        "groundedness": groundedness_evaluator,
        "groundedness_pro": groundedness_pro_evaluator,
    },
    azure_ai_project=azure_ai_project,
)

print(eval_output)

In [None]:
from pprint import pprint 
pprint(eval_output)
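
Once the result has been printed, a natural next step is to pick out the responses that scored poorly. The sketch below assumes that the result is a dictionary with a `rows` list and that each row stores its score under a column such as `outputs.groundedness.groundedness`. Both the layout and the column names are assumptions, so check them against the printed `eval_output` first. The mock result is made up for illustration.

```python
from typing import Any, Dict, List


def low_scoring_rows(result: Dict[str, Any], score_column: str, threshold: float) -> List[Dict[str, Any]]:
    """Return rows whose numeric score in `score_column` is below `threshold`."""
    flagged = []
    for row in result.get("rows", []):
        score = row.get(score_column)
        # Skip rows where the score is missing or not numeric
        if isinstance(score, (int, float)) and score < threshold:
            flagged.append(row)
    return flagged


# Made-up result with an assumed layout; real column names may differ.
mock_result = {
    "metrics": {"groundedness.groundedness": 3.0},
    "rows": [
        {"inputs.query": "Q1", "outputs.groundedness.groundedness": 5},
        {"inputs.query": "Q2", "outputs.groundedness.groundedness": 1},
    ],
}

flagged_rows = low_scoring_rows(mock_result, "outputs.groundedness.groundedness", threshold=3)
for row in flagged_rows:
    print(row["inputs.query"], row["outputs.groundedness.groundedness"])
```

Reviewing the flagged rows alongside their context usually shows whether the response was unsupported or the context itself was incomplete.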