## Using Phi-3 as a relevance judge

In this notebook we will use [Phi-3](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) to judge whether a document is relevant to a query.

## Requirements

For this notebook, you will need a working Python environment (**Python 3.10.x** or later) and some Python dependencies:
- `torch`, to use PyTorch as our backend
- Hugging Face's `transformers` library
- `accelerate` and `bitsandbytes` for quantization support (a GPU is required to enable quantization)
- `scikit-learn` for metrics computation
- `pandas` for generic data handling

## Installing packages

Let's start by installing the necessary Python libraries (preferably in a virtual environment):


In [None]:
!pip install -U torch transformers accelerate bitsandbytes scikit-learn pandas

---

## Implementation

First, the necessary imports:

In [None]:
from collections import Counter
from functools import partial
from typing import Any, Iterable, List, Optional
import json
import re

from sklearn.metrics import f1_score, precision_score, recall_score
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    pipeline,
)
import pandas as pd
import torch

Now, let's create a class that will be responsible for loading the `Phi-3` model and performing inference on its inputs. A few notes before we dive into the code:
* Phi-3 is a small language model (SLM) with 3.8B parameters. We additionally load it with 4-bit quantization (when a CUDA GPU is available), which makes it a good choice even for consumer-grade GPUs
* Following the example code on the model's [HF page](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct), we use a text generation pipeline to perform inference. More optimized setups are possible but out of scope for this notebook
* Regular expressions are used to extract the answer from the LLM output. The `response_types` argument defines the set of acceptable classes (e.g. `Relevant`, `Not Relevant`)
* There are two options for decoding:
    * `greedy decoding`, where sampling is disabled and the outputs are (more or less) deterministic
    * `beam decoding`, where sampling is enabled, several responses are generated for the same input and the results are aggregated through majority voting. In the code below `iterations` is the number of responses requested per input, used together with an appropriate setting for the `temperature` (e.g. 0.5)
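
The majority-voting step described in the list above can be sketched in isolation, without loading any model. This is a minimal illustration that mirrors the aggregation logic of the class defined in the next cell; the function name `majority_vote` is introduced here for illustration only and is not part of the notebook's code.

```python
from collections import Counter


def majority_vote(labels, unparsable="Unparsable"):
    """Return (winning_label, agreement) for a list of parsed labels."""
    # Drop responses from which no label could be extracted
    clean = [label for label in labels if label != unparsable]
    if not clean:
        return unparsable, -1.0
    winner, votes = Counter(clean).most_common(1)[0]
    # Agreement is the share of parsable responses that chose the winner
    return winner, round(votes / len(clean), 2)


print(majority_vote(["Relevant", "Relevant", "Not Relevant"]))
print(majority_vote(["Unparsable", "Unparsable"]))
```

With greedy decoding there is a single response per input, so the agreement is always 1.0 unless the response cannot be parsed.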



In [None]:
def get_device_map():
    """Retrieve the backend"""
    if torch.cuda.is_available():
        return "cuda"

    if torch.backends.mps.is_available():
        return "mps"

    return "auto"


class Phi3Evaluator:
    """Evaluator based on the Phi-3 model"""

    def __init__(
        self,
        model_name_or_path: str,
        response_types: List[str],
        iterations: int = 1,
        temperature: float = 0.0,
    ):

        # set 4-bit quantization
        quant_config = BitsAndBytesConfig(load_in_4bit=True)
        model = AutoModelForCausalLM.from_pretrained(
            model_name_or_path,
            torch_dtype="auto",
            trust_remote_code=True,
            device_map=get_device_map(),
            quantization_config=quant_config if get_device_map() == "cuda" else None,
        )
        tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

        # defining the text generation pipeline
        self.pipeline = pipeline(
            "text-generation",
            model=model,
            tokenizer=tokenizer,
            device_map=get_device_map(),
        )
        # define an appropriate regex for the expected outputs
        self.regex_str = (
            r"(" + r"|".join(response_types) + r")" if response_types else None
        )
        # temperature setting
        self._temperature = temperature
        # number of LLM calls
        self._iterations = iterations

    def _get_generation_args(self):
        """Arguments for the text generation pipeline"""
        if self._temperature > 0.0:
            return {
                "return_full_text": False,
                "temperature": self._temperature,
                "do_sample": True,
                "num_return_sequences": self._iterations,
                "num_beams": 2 * self._iterations,
            }
        return {
            "return_full_text": False,
            "do_sample": False,
            "num_return_sequences": 1,
        }

    def __call__(
        self, prompts: Iterable[str], max_output_tokens: int, batch_size: int = 8
    ) -> Iterable[tuple]:
        """Generate responses to the given prompts.

        Yields one `(raw_responses, winning_label, agreement)` tuple per prompt.
        """

        # set args for the text generation pipeline
        gen_args = self._get_generation_args()
        gen_args.update({"batch_size": batch_size, "max_new_tokens": max_output_tokens})
        inputs_for_pipeline = [
            [{"role": "user", "content": prompt}] for prompt in prompts
        ]

        for output in self.pipeline(inputs_for_pipeline, **gen_args):
            output = [output] if isinstance(output, dict) else output
            llm_outputs = [item["generated_text"] for item in output]
            parsed_outputs = [
                self._postprocess_output(llm_output) for llm_output in llm_outputs
            ]
            clean_outputs = [
                output for output in parsed_outputs if output != "Unparsable"
            ]
            if clean_outputs:
                winning_label = Counter(clean_outputs).most_common(1)
                yield llm_outputs, winning_label[0][0], round(
                    winning_label[0][1] / len(clean_outputs), 2
                )
            else:
                yield llm_outputs, "Unparsable", -1.0

    def _postprocess_output(self, llm_out: str) -> str:
        """Cleans the output from the LLM"""
        if not self.regex_str:
            return llm_out

        re_search = re.search(self.regex_str, llm_out, re.IGNORECASE)
        if re_search:
            return re_search.group()

        return "Unparsable"
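
To see how the regex-based extraction behaves, here is a small standalone sketch. It builds the same kind of alternation pattern from the response types and, as an assumption that goes beyond the class above, maps the matched text back to its canonical spelling so that a lowercase `relevant` is reported as `Relevant`. The names `PATTERN`, `CANONICAL` and `extract_label` are hypothetical and used only for illustration.

```python
import re

RESPONSE_TYPES = ["Relevant", "Not Relevant"]

# re.escape is a precaution added in this sketch; these labels contain no regex metacharacters
PATTERN = "(" + "|".join(re.escape(label) for label in RESPONSE_TYPES) + ")"
# Lookup from lowercase text to the canonical label spelling (an addition of this sketch)
CANONICAL = {label.lower(): label for label in RESPONSE_TYPES}


def extract_label(llm_out: str) -> str:
    """Return the first matching label in canonical spelling, or "Unparsable"."""
    match = re.search(PATTERN, llm_out, re.IGNORECASE)
    if match is None:
        return "Unparsable"
    return CANONICAL[match.group().lower()]


for text in ["Not Relevant", "relevant.", "I cannot tell"]:
    print(repr(text), "->", extract_label(text))
```

Because `re.search` returns the leftmost match, `Not Relevant` is found before the `Relevant` it contains. One limitation of this simple approach is that a word such as "irrelevant" also contains "relevant" and would therefore be mapped to `Relevant`.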

---

## Prompts

In this section we define the prompts that we will use later for LLM inference.

There are three types of prompt templates:
* `pointwise`
* `pointwise` with chain-of-thought
* `pairwise`


In [None]:
QA_POINTWISE_RELEVANCE_PROMPT_TEMPLATE = (
    "You are an expert in information retrieval and your task is to estimate the relevance of a retrieved document to a"
    " query.\n"
    "More specifically, you will be provided with two pieces of information:\n"
    "- Query, which is the question we want to answer\n"
    "- Retrieved Document, which is the document we want to evaluate.\n\n"
    'Your task is to predict "Relevant" if the Retrieved Document contains the information required to provide an '
    'answer to the Query, otherwise you should print "Not Relevant" \n'
    "#####\n"
    "Here are your inputs:\n"
    "Query: {query_text}\n"
    "Retrieved Document: {retrieved_text}\n"
    "#####\n\n"
    "Take a step back and reflect carefully on how best to solve your task\n"
    'You should provide your answer in the form of a boolean value: "Relevant" or "Not Relevant"\n'
)

CHAIN_OF_THOUGHT_PROMPT_TEMPLATE = (
    "You are an expert in information retrieval and your task is to decide if a retrieved "
    "document is relevant to a query or not. You will be provided with two pieces of information:\n"
    "- QUERY, which is a web search\n"
    "- DOCUMENT, which is a web page snippet\n"
    'Your task is to predict "Relevant" if the DOCUMENT contains the information required '
    'to provide an answer to the QUERY, otherwise you should predict "Not Relevant".\n'
    "Solve this task step by step and use the following examples for help.\n\n"
    "####\n"
    "Example 1\n"
    "QUERY: define interconnected\n"
    "DOCUMENT: Matching (adjective) corresponding in pattern, colour, or design; complementary.\n"
    'Intent: The query is looking for the dictionary definition of the term "interconnected".\n'
    'Key Information: The document provides a definition of the term "matching".\n'
    'Explanation: The term "matching" is closely related to the word "interconnected". However, the query is '
    'asking for the dictionary definition of "interconnected", not "matching".\n'
    'Answer: "Not Relevant"\n\n'
    "Example 2\n"
    "QUERY: what are some good chocolate cake recipes\n"
    "DOCUMENT: Here are some of the best chocolate cakes to make yourself.\n"
    "Intent: The query is looking for chocolate cake recipes.\n"
    "Key Information: The document provides a list of chocolate cakes you can make yourself.\n"
    "Explanation: The document provides information about chocolate cakes you can make yourself. This makes it "
    "very likely that it provides their recipes as well.\n"
    'Answer: "Relevant"\n\n'
    "Example 3\n"
    "QUERY: enable parental controls on Netflix\n"
    "DOCUMENT: Here's how to turn on parental controls on Amazon Prime Video:\n"
    "Intent: The query is looking for instructions on how to enable parental controls on Netflix.\n"
    "Key Information: The document provides instructions on how to turn on parental controls on Amazon Prime Video.\n"
    "Explanation: The document provides instructions for how to turn on parental controls for Amazon Prime Video, "
    "not Netflix.\n"
    'Answer: "Not Relevant"\n\n'
    "####\n\n"
    "To solve your task, first check if any of the examples are useful, then think carefully if "
    "Key Information matches the Intent. Use the following format:\n"
    "Intent: [the intent behind QUERY]\n"
    "Key Information: [the key information contained in DOCUMENT]\n"
    "Explanation: [your explanation]\n"
    "Answer: [your answer]\n"
    '[your answer] should be either "Relevant" or "Not Relevant".\n'
    "Here are the QUERY and DOCUMENT for you to evaluate:\n"
    "QUERY: {query_text}\n"
    "DOCUMENT: {retrieved_text}\n"
)


QA_PAIRWISE_RELEVANCE_PROMPT_TEMPLATE = (
    "You are an expert in information retrieval and your task is to estimate the relevance of a retrieved document to a query.\n"
    "More specifically, you will be provided with three pieces of information:\n"
    "- Query, which is the question we want to answer\n"
    "- Positive Document, which is a document that contains the correct answer to the query\n"
    "- Retrieved Document, which is the document we want to evaluate\n"
    'Your task is to predict "Relevant" if the Retrieved Document contains the information required to provide an answer to the Query, otherwise you should print "Not Relevant" \n'
    "You can take advantage of the information in the Positive Document to identify the correct answer to the Query and then verify that the Retrieved Document contains that piece of information\n"
    "#####\n"
    "Here are your inputs:\n"
    "Query: {query_text}\n"
    "Positive Document: {positive_text}\n"
    "Retrieved Document: {retrieved_text}\n"
    "#####\n\n"
    "Take a step back and reflect carefully on how best to solve your task\n"
    'You should provide your answer in the form of a boolean value: "Relevant" or "Not Relevant"\n'
    "Good luck!"
)
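
The templates above are plain Python format strings, so a prompt is produced by filling the named placeholders from a data record. The sketch below shows the mechanism with a shortened, hypothetical template and a made-up record; neither is part of the notebook's data.

```python
# Hypothetical, shortened template used only for illustration
TOY_TEMPLATE = (
    "Query: {query_text}\n"
    "Retrieved Document: {retrieved_text}\n"
    'Answer "Relevant" or "Not Relevant".\n'
)

# Made-up record; extra keys such as "qid" are simply not passed to the template
record = {
    "qid": "q1",
    "query_text": "define preventive",
    "retrieved_text": "Preventive: serving to prevent or hinder.",
}

prompt_inputs = ["query_text", "retrieved_text"]
prompt = TOY_TEMPLATE.format(**{key: record[key] for key in prompt_inputs})
print(prompt)
```

Note that `str.format` treats `{` and `}` in the template itself as placeholder delimiters, so a template that needs literal braces has to double them (`{{` and `}}`). Braces inside the substituted values are left untouched.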

We also define a helper structure that, for each task type, contains:
* `prompt_inputs`, the list of attributes that need to be set in the prompt template. These attributes have the same names in the input data
* `prompt_template`, the prompt template to use
* `response_types`, the names of the expected output classes
* `metadata`, the extra attributes that need to be preserved
* `max_output_tokens`, the maximum number of tokens that the LLM may generate


In [None]:
POOL = {
    "qa_pointwise": {
        "prompt_inputs": ["query_text", "retrieved_text"],
        "prompt_template": QA_POINTWISE_RELEVANCE_PROMPT_TEMPLATE,
        "response_types": ["Relevant", "Not Relevant"],
        "metadata": ["qid", "retrieved_doc_id", "human_judgment"],
        "max_output_tokens": 4,
    },
    "qa_pairwise": {
        "prompt_inputs": ["query_text", "positive_text", "retrieved_text"],
        "prompt_template": QA_PAIRWISE_RELEVANCE_PROMPT_TEMPLATE,
        "response_types": ["Relevant", "Not Relevant"],
        "metadata": ["qid", "retrieved_doc_id", "human_judgment"],
        "max_output_tokens": 4,
    },
    "chain_of_thought": {
        "prompt_inputs": ["query_text", "retrieved_text"],
        "prompt_template": CHAIN_OF_THOUGHT_PROMPT_TEMPLATE,
        "response_types": ['Answer: "Relevant"', 'Answer: "Not Relevant"'],
        "metadata": ["qid", "retrieved_doc_id", "human_judgment"],
        "max_output_tokens": 250,
    },
}


def get_llm_evaluator(
    model_name_or_path: str, task_type: str, iterations: int, temperature: float
):
    """Helper function that returns the evaluator"""
    # quick sanity check
    if task_type not in POOL:
        raise ValueError(
            f"Task type {task_type} not supported, please select one of {list(POOL.keys())}"
        )
    task_type_def = POOL[task_type]

    return Phi3Evaluator(
        model_name_or_path,
        response_types=task_type_def["response_types"],
        iterations=iterations,
        temperature=temperature,
    )

We are now ready to define the parameters of our run:
* `MODEL_NAME`, the name of the language model
* `BATCH_SIZE`, the batch size to use for inference
* `TASK_TYPE`, one of `qa_pointwise`, `qa_pairwise`, `chain_of_thought`
* `TEMPERATURE` and `ITERATIONS`, the decoding options explained in the Implementation section

In [None]:
MODEL_NAME = "microsoft/Phi-3-mini-4k-instruct"
BATCH_SIZE = 2
TASK_TYPE = "qa_pointwise"
TEMPERATURE = 0.0  # values > 0 will activate sampling, in that case you should also increase the number of iterations (> 1)
ITERATIONS = 1

Next, we create an instance of our evaluator:

In [None]:
llm_evaluator = get_llm_evaluator(
    model_name_or_path=MODEL_NAME,
    task_type=TASK_TYPE,
    iterations=ITERATIONS,
    temperature=TEMPERATURE,
)

---

## Running the pipeline

Let's execute the pipeline by first adding a few test data points:

In [None]:
SAMPLE_DATA = [
    {
        "qid": "155234",
        "query_text": "do bigger tires affect gas mileage",
        "positive_text": " Tire Width versus Gas Mileage. Tire width is one of the only tire size factors that can influence gas mileage in a positive way. For example, a narrow tire will have less wind resistance, rolling resistance, and weight; thus increasing gas mileage.",
        "retrieved_text": " Bigger tires are an excellent upgrade for off road driving but they decrease gas mileage on road. It\u2019s not uncommon to experience a 2-4 mpg reduction in gas mileage with your off road tires. Why is this? 1. They are HEAVY. Most weigh 20+ lbs more than stock tires. 2. Increased Rolling Resistance from aggressive tread. 3.",
        "retrieved_doc_id": "1048111",
        "human_judgment": "Relevant",
    },
    {
        "qid": "300674",
        "query_text": "how many years did william bradford serve as governor of plymouth colony?",
        "positive_text": " http://en.wikipedia.org/wiki/William_Bradford_(Plymouth_Colony_governor) William Bradford (c.1590 \u2013 1657) was an English Separatist leader in Leiden, Holland and in Plymouth Colony was a signatory to the Mayflower Compact. He served as Plymouth Colony Governor five times covering about thirty years between 1621 and 1657.",
        "retrieved_text": " William Bradford was the governor of Plymouth Colony for 30 years. The colony was founded by people called Puritans. They were some of the first people from England to settle in what is now the United States. Bradford helped make Plymouth the first lasting colony in New England.",
        "retrieved_doc_id": "2495763",
        "human_judgment": "Relevant",
    },
    {
        "qid": "125705",
        "query_text": "define preventive",
        "positive_text": " Adjective[edit] preventive (comparative more preventive, superlative most preventive) 1  Preventing, hindering, or acting as an obstacle to.  Carried out to deter military aggression.",
        "retrieved_text": " The Prevention Institute defines prevention as a systematic process that promotes safe and healthy environments and behaviors, reducing the likelihood or frequency of an incident, injury or condition occurring (2007).",
        "retrieved_doc_id": "6464885",
        "human_judgment": "Not Relevant",
    },
    {
        "qid": "1101276",
        "query_text": "do spiders eat other animals",
        "positive_text": " Spiders are animals that have 8 legs and use their fangs to inject venom into other animals and sometimes humans. But what do spiders eat? This post will answer that question, and also look at some interesting facts about spiders. What do spiders eat? Different species of spiders eat different things. Most species trap small insects and other spiders in their webs and eat them. A few large species of spiders prey on small birds and lizards. One species is vegetarian, feeding on acacia trees. Some baby spiders eat plant nectar. In captivity, spiders have been known to eat egg yoke, bananas, marmalade, milk and sausages. Interesting Facts About Spiders",
        "retrieved_text": " Home > Spider FAQ > What do spiders eat? Virtually all spiders are predatory on other animals, especially insects and other spiders. Very large spiders are capable of preying on small vertebrate animals such as lizards, frogs, fish, tadpoles, or even small snakes or baby rodents. Large orb weavers have been observed to occasionally ensnare small birds or bats.",
        "retrieved_doc_id": "1596582",
        "human_judgment": "Relevant",
    },
    {
        "qid": "89786",
        "query_text": "central city definition",
        "positive_text": " Definition of central city. : a city that constitutes the densely populated center of a metropolitan area.",
        "retrieved_text": " Central City (DC Comics) For other uses, see Central City (disambiguation). Central City is a fictional American city appearing in comic books published by DC Comics. It is the home of the Silver Age version of the Flash (Barry Allen), and first appeared in Showcase #4 in September\u2013October 1956.",
        "retrieved_doc_id": "213222",
        "human_judgment": "Not Relevant",
    },
]

Each item in the list is a dictionary with the following keys:
* `qid`: the query id in the original MSMARCO dataset
* `query_text`: the text of the query
* `positive_text`: the text of the document that has been marked as relevant in the original `qrels` file
* `retrieved_doc_id`: the id of the retrieved document (after reranking) which is being judged for relevance
* `retrieved_text`: the text of the retrieved document
* `human_judgment`: the result of the human annotation, here either "Relevant" or "Not Relevant"
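
If the records live in a file instead of being defined inline, one option is to store them as JSON Lines (one JSON object per line) and validate the keys while reading. The sketch below is only an illustration under that assumption: the helper `parse_records` and the file name in the usage comment are hypothetical and not part of the notebook.

```python
import json

REQUIRED_KEYS = (
    "qid",
    "query_text",
    "positive_text",
    "retrieved_doc_id",
    "retrieved_text",
    "human_judgment",
)


def parse_records(lines):
    """Parse an iterable of JSON lines and check that each record has the expected keys."""
    records = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"Line {line_number} is missing keys: {missing}")
        records.append(record)
    return records


# Usage with a hypothetical file (an open file handle is an iterable of lines):
# with open("sample_data.jsonl", encoding="utf-8") as handle:
#     SAMPLE_DATA = parse_records(handle)
```

Failing early on a missing key gives a clearer error than a `KeyError` raised later while formatting a prompt.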

Let's also add two helper functions that allow us to iterate over the data:

In [None]:
def generate_prompts(data: list, task_type: str):
    """Generates prompts"""
    necessary_keys = POOL[task_type]["prompt_inputs"] + POOL[task_type]["metadata"]
    for line in data:
        assert all(
            key in line for key in necessary_keys
        ), f"Missing keys in line: {line}"

        prompt_inputs = {key: line[key] for key in POOL[task_type]["prompt_inputs"]}

        prompt_template = POOL[task_type]["prompt_template"]
        prompt = prompt_template.format(**prompt_inputs)

        yield prompt


def generate_data(data: list, task_type: str):
    """Iterates over the input data"""
    necessary_keys = POOL[task_type]["prompt_inputs"] + POOL[task_type]["metadata"]
    for line in data:
        assert all(
            key in line for key in necessary_keys
        ), f"Missing keys in line: {line}"

        yield line

In [None]:
gen_prompts = partial(generate_prompts, data=SAMPLE_DATA, task_type=TASK_TYPE)
gen_data = partial(generate_data, data=SAMPLE_DATA, task_type=TASK_TYPE)

We are now ready to execute the pipeline and store the results:

In [None]:
outputs = []

for (raw_responses, mapped_response, agreement), data_tuple in zip(
    llm_evaluator(
        gen_prompts(),
        max_output_tokens=POOL[TASK_TYPE]["max_output_tokens"],
        batch_size=BATCH_SIZE,
    ),
    gen_data(),
):

    # create a json structure to hold the data
    json_out = {
        key: data_tuple[key]
        for key in POOL[TASK_TYPE]["metadata"] + POOL[TASK_TYPE]["prompt_inputs"]
    }

    json_out["LLM_raw_response"] = raw_responses
    json_out["LLM_mapped_response"] = mapped_response
    json_out["LLM_agreement"] = agreement

    outputs.append(json_out)

Collect outputs into a Pandas dataframe

In [None]:
df = pd.DataFrame.from_dict(outputs)

Quick scan of the outputs

In [None]:
df.head()

And finally, let's measure the performance of the LLM.

First, we compute the micro-F1 score, which takes both classes into account:

In [None]:
f1_score(y_true=df["human_judgment"], y_pred=df["LLM_mapped_response"], average="micro")
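
When every example has exactly one true label and one predicted label, the micro-averaged F1 pools true positives, false positives and false negatives over all classes, and the result coincides with plain accuracy. The pure-Python sketch below illustrates this on made-up labels; the function `micro_f1` is written here for illustration and is not scikit-learn's implementation.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 computed from counts pooled over all labels."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            tp += truth == label and pred == label
            fp += truth != label and pred == label
            fn += truth == label and pred != label
    return 2 * tp / (2 * tp + fp + fn)


# Made-up labels, for illustration only
y_true = ["Relevant", "Relevant", "Not Relevant", "Relevant", "Not Relevant"]
y_pred = ["Relevant", "Not Relevant", "Not Relevant", "Relevant", "Relevant"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(micro_f1(y_true, y_pred), accuracy)
```

Each wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class), which is why the pooled score reduces to the fraction of correct predictions.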

Alternatively, we can focus on the `Relevant` class.

Precision

In [None]:
precision_score(
    y_true=df["human_judgment"], y_pred=df["LLM_mapped_response"], pos_label="Relevant"
)

Recall

In [None]:
recall_score(
    y_true=df["human_judgment"], y_pred=df["LLM_mapped_response"], pos_label="Relevant"
)

Binary F1

In [None]:
f1_score(
    y_true=df["human_judgment"],
    y_pred=df["LLM_mapped_response"],
    average="binary",
    pos_label="Relevant",
)
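
For reference, the three scores for the `Relevant` class can be written out from their definitions. The sketch below is a plain-Python illustration on made-up labels, not scikit-learn's implementation; the function name `binary_scores` and the choice to return 0.0 when a denominator is zero are assumptions of this sketch.

```python
def binary_scores(y_true, y_pred, pos_label="Relevant"):
    """Precision, recall and F1 for a single positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)
    # Fall back to 0.0 when a score is undefined (assumption of this sketch)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels, for illustration only
y_true = ["Relevant", "Relevant", "Not Relevant", "Relevant", "Not Relevant"]
y_pred = ["Relevant", "Not Relevant", "Not Relevant", "Relevant", "Relevant"]
print(binary_scores(y_true, y_pred))
```

In this sketch any prediction other than `Relevant`, including `Unparsable`, simply counts as a non-positive prediction. With scikit-learn, a third label such as `Unparsable` in the predictions means the data is no longer strictly binary, and `average="binary"` is expected to be rejected; filtering or remapping those rows first is one way to handle it.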