# General-purpose Text Embeddings on the IPU

This notebook describes how to use the supported embeddings models to generate state-of-the-art (SOTA) text embeddings on the IPU. You can use the following:
* [E5 model](https://arxiv.org/pdf/2212.03533.pdf) (Emb**E**ddings from bidir**E**ctional **E**ncoder r**E**presentations).
* [Sentence Transformers MPNet Base V2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2), which is an embeddings model based on the MPNet base model.
* [Sentence-T5](https://arxiv.org/abs/2108.08877), which runs on a pre-trained T5 model encoder.

## Environment setup

The best way to run this demo is on Paperspace Gradient's cloud IPUs because everything is already set up for you.

[![Run on Gradient](https://assets.paperspace.io/img/gradient-badge.svg)](https://ipu.dev/cPfXNO)

To run the demo using other IPU hardware, you need to have the Poplar SDK enabled and a PopTorch wheel installed. Refer to the [Getting Started guide for your system](https://docs.graphcore.ai/en/latest/getting-started.html) for details on how to do this. Also refer to the Jupyter Quick Start guide to set up Jupyter so that you can run this notebook on a remote IPU machine.

First, install the requirements for running this notebook:

In [None]:
# Install Optimum Graphcore if it is not in your environment
! pip install optimum-graphcore==0.7.1 sentence-transformers==2.2.2 graphcore-cloud-tools[logger]@git+https://github.com/graphcore/graphcore-cloud-tools

%load_ext graphcore_cloud_tools.notebook_logging.gc_logger

In order to improve usability and support for future users, Graphcore would like to collect information about the applications and code being run in this notebook. The following information will be anonymised before being sent to Graphcore:

- User progression through the notebook
- Notebook details: number of cells, code being run and the output of the cells
- Environment details

You can disable logging at any time by running `%unload_ext graphcore_cloud_tools.notebook_logging.gc_logger` from any cell.

Import the required modules for the notebook:

In [2]:
import os
import torch
import poptorch
import numpy as np
from tqdm.notebook import tqdm
import logging

We need to instantiate some global parameters that will be used to run the models.

The **micro batch size** (the number of samples processed in a single forward pass on one replica) is set to 2. This is smaller than usual because larger micro batches use more device memory.

We use on-IPU loops (**device iterations**), which process a number of micro batches sequentially on the device within a single dataloader call, to extend the batch size. This yields greater throughput because it is more efficient than loading smaller batches from the host a large number of times.

Data parallelism is controlled by the **replication factor**, which specifies how many copies of the model run in parallel, each processing its own batches. This value is set to `None` by default so that it is determined automatically from the `pod_type` of the machine being used. By default, the model itself requires 1 IPU to run, so on an IPU-POD4 (4 IPUs) the replication factor is set to 4. Similarly, on an IPU-POD16 the replication factor is set to 16. You can override this with a different value if needed: with `replication_factor=N`, the model is replicated `N` times as long as `N * ipus_per_replica <= n_ipu`, where `ipus_per_replica` is the number of IPUs a single instance of the model uses and `n_ipu` is the total number of available IPUs.

The total effective batch size for inference is calculated by:
```
effective_batch_size = replication_factor * device_iterations * micro_batch_size
```
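
As a worked example of the formula above, the sketch below plugs in the values this notebook uses on a 4-IPU machine. `batching_plan` is a hypothetical helper written only for illustration, and the default of `n_ipu // ipus_per_replica` replicas is an assumption about how the replication factor is derived from the machine size, not a description of the actual configuration code.

```python
from typing import Optional


def batching_plan(
    n_ipu: int,
    ipus_per_replica: int,
    micro_batch_size: int,
    device_iterations: int,
    replication_factor: Optional[int] = None,
) -> dict:
    """Return the replication factor and the effective inference batch size."""
    if replication_factor is None:
        # Assumption: by default, use as many replicas as fit on the machine.
        replication_factor = n_ipu // ipus_per_replica
    if replication_factor < 1 or replication_factor * ipus_per_replica > n_ipu:
        raise ValueError(
            f"{replication_factor} replicas of {ipus_per_replica} IPU(s) "
            f"do not fit on {n_ipu} IPU(s)"
        )
    return {
        "replication_factor": replication_factor,
        "effective_batch_size": replication_factor * device_iterations * micro_batch_size,
    }


# 4 IPUs, 1 IPU per model instance, device_iterations = 512 // 4 = 128
pod4 = batching_plan(n_ipu=4, ipus_per_replica=1, micro_batch_size=2, device_iterations=512 // 4)
print(pod4)  # {'replication_factor': 4, 'effective_batch_size': 1024}
```

With these settings, each dataloader call hands 1024 samples to the IPUs.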

The model itself can also be split over 4 IPUs through model pipelining (by setting `ipus_per_replica` to 4), in which case the replication factor is adjusted accordingly. Spreading the model over more IPUs reduces the memory the model consumes on each IPU, which allows higher batch sizes to be used. For example, with 4 IPUs each IPU computes far fewer layers, while with 1 IPU all model layers are on a single IPU. This is particularly beneficial on an IPU-POD16 machine, as the 4-IPU pipelined version of the model can be run at a higher effective batch size (with a higher micro batch size) and achieve an even higher overall batched throughput.
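
To make the "fewer layers per IPU" point concrete, here is a minimal sketch of an even split of encoder layers across pipeline stages. `split_layers` is a hypothetical helper and the layer count of 24 is only illustrative. The real placement is decided by the IPU configuration and may be uneven, for example to leave room for the embedding layers on the first IPU.

```python
from typing import List


def split_layers(num_layers: int, n_stages: int) -> List[int]:
    """Distribute num_layers over n_stages as evenly as possible."""
    if n_stages < 1 or num_layers < n_stages:
        raise ValueError("need at least one layer per pipeline stage")
    base, extra = divmod(num_layers, n_stages)
    # The first `extra` stages take one additional layer each.
    return [base + 1 if stage < extra else base for stage in range(n_stages)]


print(split_layers(24, 1))  # [24]
print(split_layers(24, 4))  # [6, 6, 6, 6]
```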

In [3]:
logger = logging.getLogger("")

n_ipu = int(os.getenv("NUM_AVAILABLE_IPU", 4))
ipus_per_replica = 1
micro_batch_size = 2
device_iterations = 512//n_ipu
replication_factor = None

random_seed = 42

To run embeddings models, we will set up a generic IPU embeddings class which loads the pre-trained model onto the IPU and runs the embedding pooling and normalisation stages in the forward pass along with the model. You may want to change the internal pooling (`pool(...)`) function in the class to support other pooling methods. The class currently supports averaging and classification using the encoder output state by passing `pool_type` when calling the model. 
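
The pooling and normalisation steps can be illustrated on a single sequence without any tensors. The sketch below is a plain-Python stand-in, not the class used in this notebook: `masked_mean_pool`, `cls_pool` and `l2_normalise` are hypothetical names, and the hidden states are made-up numbers.

```python
import math
from typing import List


def masked_mean_pool(hidden: List[List[float]], mask: List[int]) -> List[float]:
    """Average the token vectors whose mask is 1, ignoring padding positions."""
    n_real = sum(mask)
    dim = len(hidden[0])
    return [
        sum(tok[d] for tok, m in zip(hidden, mask) if m) / n_real
        for d in range(dim)
    ]


def cls_pool(hidden: List[List[float]]) -> List[float]:
    """Use the vector of the first token as the sequence embedding."""
    return list(hidden[0])


def l2_normalise(vec: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


# Three token positions with 2-dimensional hidden states; the last one is padding.
hidden = [[3.0, 0.0], [3.0, 8.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(hidden, mask)  # [3.0, 4.0]
embedding = l2_normalise(pooled)         # approximately [0.6, 0.8]
print(pooled, embedding, cls_pool(hidden))
```

The padding vector has no effect on the result, which is why the attention mask is applied before averaging.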

In [4]:
import logging
from typing import Optional, List

from transformers import AutoModel
from optimum.graphcore.modeling_utils import to_pipelined
from optimum.graphcore import IPUConfig

logger = logging.getLogger("e5")

class IPUEmbeddingsModel(torch.nn.Module):
    def __init__(self, model, ipu_config: IPUConfig, fp16=True):
        super().__init__()
        self.encoder = to_pipelined(model, ipu_config)
        self.encoder = self.encoder.parallelize()
        if fp16:
            self.encoder = self.encoder.half()

    def pool(self, last_hidden_states: torch.Tensor, attention_mask: torch.Tensor, pool_type: str) -> torch.Tensor:
        # Zero out the hidden states of padding tokens so they do not contribute to the pooled embedding
        last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)

        if pool_type == "avg":
            # Mean over the non-padding tokens of each sequence
            emb = last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
        elif pool_type == "cls":
            # Hidden state of the first token of each sequence
            emb = last_hidden[:, 0]
        else:
            raise ValueError(f"pool_type {pool_type} not supported")

        return emb

    def forward(self, pool_type: str = "avg", **kwargs) -> torch.Tensor:
        outputs = self.encoder(**kwargs)

        embeds = self.pool(outputs.last_hidden_state, kwargs["attention_mask"], pool_type=pool_type)
        # L2-normalise the pooled embeddings to unit length
        embeds = torch.nn.functional.normalize(embeds, p=2, dim=-1)

        return embeds

Before setting up and running each of the models, let's create a simple `infer` function which handles loading the batches from a dataloader and generating the embeddings from any model:

In [5]:
import time

def infer(model, dataloader):
    encoded_embeds = []
    with torch.no_grad():
        for batch_dict in tqdm(dataloader, desc='encoding'):
            lat = time.time()
            outputs = model(**batch_dict)
            lat = time.time() - lat
            
            encoded_embeds.append(outputs)
            print(f"batch len: {len(batch_dict['input_ids'])} | batch latency: {lat}s | per_sample: {lat/len(batch_dict['input_ids'])}s | throughput: {len(batch_dict['input_ids'])/lat} samples/s")
    
    return torch.cat(encoded_embeds, dim=0)

## Embeddings with E5-Large

First, use `AutoConfig` from Transformers to load the model config for the E5 large model. E5 uses a bidirectional encoder, essentially the encoder stage of a BERT model, to generate the trained embeddings. The config will define the architecture of the model, such as the number of encoder layers and size of the hidden dimension within the model. The sequence length for the model is set by default to the maximum defined sequence length in the model config (`max_position_embeddings`) and can be adjusted by changing the `e5_seq_len` parameter, with a maximum value of 512.

We also need to tokenize the dataset. For this we define a custom transform function which applies the pre-trained tokenization for each model to the dataset. We will call this function when loading the dataset, to avoid holding multiple tokenized datasets in memory at the same time.
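
The transform functions in this notebook call the tokenizer with `padding="max_length"` and `truncation=True`, so every sequence comes out with the same length. The sketch below shows the idea on a list of token IDs. `pad_or_truncate` is a hypothetical helper, the token IDs and the pad ID of 0 are made up, and a real tokenizer additionally keeps its special tokens in place when truncating.

```python
from typing import Dict, List


def pad_or_truncate(token_ids: List[int], max_length: int, pad_id: int = 0) -> Dict[str, List[int]]:
    """Cut or pad a token sequence to exactly max_length and build its attention mask."""
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    return {
        "input_ids": kept + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


short = pad_or_truncate([101, 7592, 102], max_length=5)
long = pad_or_truncate([101, 7592, 2088, 2003, 2307, 102], max_length=5)
print(short)  # {'input_ids': [101, 7592, 102, 0, 0], 'attention_mask': [1, 1, 1, 0, 0]}
print(long)   # {'input_ids': [101, 7592, 2088, 2003, 2307], 'attention_mask': [1, 1, 1, 1, 1]}
```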

We define some IPU-specific configurations to get the most out of the model. The `get_ipu_config` function will set up the IPU config according to the model config, taking into consideration the defined number of IPUs for model parallelism, the number of IPUs available and batching configurations.

In [6]:
from config import get_ipu_config
from transformers import AutoConfig, AutoTokenizer, BatchEncoding

In [7]:
e5_model_name = 'intfloat/e5-large'
e5_tokenizer = AutoTokenizer.from_pretrained(e5_model_name)
e5_model_config = AutoConfig.from_pretrained(e5_model_name)
e5_model = AutoModel.from_pretrained(e5_model_name, config=e5_model_config)

e5_seq_len = e5_model_config.max_position_embeddings

def e5_transform_func(example) -> BatchEncoding:
    return e5_tokenizer(
        example['text'],
        max_length=e5_seq_len,
        padding="max_length",
        truncation=True
    )

e5_ipu_config = get_ipu_config(
    e5_model_config, n_ipu, ipus_per_replica, device_iterations, replication_factor, random_seed)

## Embeddings with All-MPNet

For MPNet, we follow the same steps with its pre-trained model, tokenizer and config. The maximum value for the sequence length is 512.

In [8]:
mpnet_model_name = 'sentence-transformers/all-mpnet-base-v2'
mpnet_tokenizer = AutoTokenizer.from_pretrained(mpnet_model_name)
mpnet_model_config = AutoConfig.from_pretrained(mpnet_model_name)
mpnet_model = AutoModel.from_pretrained(mpnet_model_name, config=mpnet_model_config)

mpnet_seq_len = mpnet_model_config.max_position_embeddings

def mpnet_transform_func(example) -> BatchEncoding:
    return mpnet_tokenizer(
        example['text'],
        max_length=mpnet_seq_len,
        padding="max_length",
        truncation=True
    )
    
mpnet_ipu_config = get_ipu_config(
    mpnet_model_config, n_ipu, ipus_per_replica, device_iterations, replication_factor, random_seed)

## Embeddings with Sentence-T5

Note that for T5, we need to use `T5EncoderModel` instead of `AutoModel`. We must manually specify the encoder as T5 is an encoder-decoder model, and we don't want to load the decoder for embeddings generation. 

Transformers `AutoModel` supports a 1-to-1 mapping of architecture definitions to model types, and it will load the `T5Model` class by default. We can override this by directly importing and loading the pre-trained model using `T5EncoderModel`. For T5 the sequence length is determined by the `n_positions` parameter in the model config. The expected maximum sequence length for T5 is also 512.

In [None]:
from transformers.models.t5.modeling_t5 import T5EncoderModel

t5_model_name = 'sentence-transformers/sentence-t5-base'
t5_tokenizer = AutoTokenizer.from_pretrained(t5_model_name)
t5_model_config = AutoConfig.from_pretrained(t5_model_name)
t5_model = T5EncoderModel.from_pretrained(t5_model_name, config=t5_model_config)

t5_seq_len = t5_model_config.n_positions

def t5_transform_func(example) -> BatchEncoding:
    return t5_tokenizer(
        example['text'],
        max_length=t5_seq_len,
        padding="max_length",
        truncation=True
    )

t5_ipu_config = get_ipu_config(
    t5_model_config, n_ipu, ipus_per_replica, device_iterations * 4, replication_factor, random_seed)


## Creating the embeddings model

We'll wrap this behaviour into a simple function so we can run all three models in turn and initialise `poptorch.DataLoader` to create an IPU-ready batched dataloader. We make an initial call to the model with the first batch to ensure that the model executable has been compiled (or that an already compiled executable has been loaded).

The function goes through the model and dataset setup for a given model and:
1. Initialises the `IPUEmbeddingsModel` class with the loaded model and IPU config.
2. Converts the IPU config into an IPU options object and passes this to a `poptorch.inferenceModel` wrapper to prepare the model for the IPU.
3. Initialises [`poptorch.DataLoader`](https://docs.graphcore.ai/projects/poptorch-user-guide/en/latest/batching.html) to batch the data according to the IPU options and the defined micro batch size.
4. Runs the model once with a batch to compile or load the compiled executable. 
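
One consequence of step 3 is worth keeping in mind: the dataloader in the next cell is created with `drop_last=True`, so a final incomplete batch is discarded and the number of embeddings can be smaller than the dataset. The sketch below estimates how many samples are kept. `batches_with_drop_last` is a hypothetical helper, and the dataset size of 8530 and the combined batch size of 1024 are assumed values used for illustration.

```python
from typing import Tuple


def batches_with_drop_last(n_samples: int, combined_batch_size: int) -> Tuple[int, int, int]:
    """Return (full batches, samples embedded, samples dropped) when drop_last=True."""
    n_batches = n_samples // combined_batch_size
    n_embedded = n_batches * combined_batch_size
    return n_batches, n_embedded, n_samples - n_embedded


# Assumed values: 8530 samples, combined batch size 4 * 128 * 2 = 1024
n_batches, n_embedded, n_dropped = batches_with_drop_last(8530, 1024)
print(n_batches, n_embedded, n_dropped)  # 8 8192 338
```

Because the dataloader also uses `shuffle=False`, row `i` of the resulting embeddings still corresponds to row `i` of the dataset, which is what allows search results to be mapped back to the original text later on.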

In [10]:
from transformers import default_data_collator as data_collator

def create_model(model, ipu_config, dataset, micro_batch_size):
    model = IPUEmbeddingsModel(model, ipu_config)

    ipu_options = ipu_config.to_options(for_inference=True)
    model = poptorch.inferenceModel(model, ipu_options)

    dataloader = poptorch.DataLoader(
        ipu_options,
        dataset['train'],
        batch_size=micro_batch_size,
        shuffle=False,
        drop_last=True,
        num_workers=2,
        collate_fn=data_collator
    )

    model(**next(iter(dataloader)))
    return model, dataloader

Let's load a dataset we'll use to try out the models. Using the Hugging Face `datasets` library we can load a pre-existing dataset from the Hugging Face Hub. In this case, let's use the `rotten_tomatoes` film review dataset. Later in the notebook, we will use this dataset to create a basic semantic search functionality.

In [None]:
from datasets import Dataset, load_dataset
dataset = load_dataset("rotten_tomatoes")
print(dataset)

## Run the E5 model

The dataset first needs to be tokenized using the pre-trained tokenizer for each model. We can use the `map()` method to tokenize each of the inputs of the dataset using the model-specific transform function. Then we can convert the Hugging Face Arrow format dataset to a PyTorch-ready dataset with `set_format`, which converts the tokenized inputs into tensors.

To run the model, simply call the `infer` function we created earlier to generate embeddings for the full dataset. 

In [None]:
tokenized_dataset = dataset.map(e5_transform_func, batched=True)
tokenized_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])
print(e5_ipu_config)

model, dataloader = create_model(e5_model, e5_ipu_config, tokenized_dataset, micro_batch_size)

e5_data_embeddings = infer(model, dataloader)

model.detachFromDevice()

## Run the All-MPNet model

In [None]:
tokenized_dataset = dataset.map(mpnet_transform_func, batched=True)
tokenized_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])

model, dataloader = create_model(mpnet_model, mpnet_ipu_config, tokenized_dataset, micro_batch_size)

mpnet_data_embeddings = infer(model, dataloader)

model.detachFromDevice()

## Run the Sentence-T5 model

In [None]:
tokenized_dataset = dataset.map(t5_transform_func, batched=True)
tokenized_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])

print(t5_ipu_config)
print(micro_batch_size)

model, dataloader = create_model(t5_model, t5_ipu_config, tokenized_dataset, micro_batch_size)

t5_data_embeddings = infer(model, dataloader)

model.detachFromDevice()

The embeddings for a single sequence are low-dimensional numerical representations of the word-level and sentence-level context of its tokens. These pre-trained embeddings can be used in applications like embedding retrieval for recommender systems, or semantic search for query matching using cosine similarity. Both of these use cases take advantage of the generated embedding space by comparing the embeddings of the user input sequence with the other embeddings using some proximity metric.

We'll use the open source `sentence_transformers` library which provides utilities for embeddings tasks to perform a semantic search on a user query to retrieve the sequences from the dataset that are most similar to the query. This is a helpful utility for making, for example, more responsive FAQs.

## Semantic search with generated embeddings

Using the `rotten_tomatoes` dataset, let's create a simple similarity search engine using the `sentence_transformers` semantic search function, which uses cosine similarity to retrieve the sentences from a given set of embeddings that are closest to a given query. We have already generated embeddings for the dataset, so the next step is to do the same with a given query and perform the search.
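
To show what a cosine-similarity search does, here is a minimal plain-Python sketch of top-k retrieval. It is not the `sentence_transformers` implementation: `cosine` and `top_k` are hypothetical helpers and the 2-dimensional vectors are made up.

```python
import math
from typing import List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: List[float], corpus: List[List[float]], k: int) -> List[Tuple[int, float]]:
    """Return (corpus index, score) pairs for the k corpus vectors most similar to the query."""
    scores = [(idx, cosine(query, vec)) for idx, vec in enumerate(corpus)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


corpus = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
query = [0.8, 0.6]

hits = top_k(query, corpus, k=2)
for idx, score in hits:
    print(f"corpus_id={idx} score={score:.2f}")
```

Here the second corpus vector ranks first (score about 0.96) and the first corpus vector ranks second (score about 0.80). Since the embeddings produced in this notebook are already L2-normalised, the cosine similarity between two of them reduces to their dot product.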

First, to process the query, we need to tokenize it and convert it to a single-batch input for the model. This has been wrapped into a simple function which tokenizes and prepares a dictionary of model inputs (`input_ids`, `attention_mask`, ...) to which we just need to pass a string.

In [None]:
def prepare_query(query: str):
    t_query = mpnet_tokenizer(
        query,
        max_length=mpnet_model_config.max_position_embeddings,
        padding="max_length",
        truncation=True
    )

    # Wrap each field in a list to add a batch dimension of size 1
    return {k: torch.as_tensor([t_query[k]]) for k in t_query}

Next, to perform inference with a single input (an effective batch size of 1), we re-instantiate the model with the device iterations, replication factor and micro batch size all set to 1, and recompile it. For this example, we use the All-MPNet model. The change in batch size necessitates a recompilation because the input shape to the model has changed. We follow the same steps to initialise the model as earlier in the notebook; the only change is that `get_ipu_config` is called with all batching turned off.

In [None]:
mpnet_infer_ipu_config = get_ipu_config(
    mpnet_model_config, n_ipu, ipus_per_replica=1, device_iterations=1, replication_factor=1, random_seed=random_seed)

model = IPUEmbeddingsModel(mpnet_model, mpnet_infer_ipu_config)
model = poptorch.inferenceModel(model, mpnet_infer_ipu_config.to_options(for_inference=True))

o = model(**prepare_query("Running once to compile"))

Finally, we can use the model to embed a single query, and perform a semantic search across the full dataset embeddings to retrieve highly relevant reviews to the query.

In [None]:
from sentence_transformers.util import semantic_search

query = "Strongly disliked this action movie"

query_embeddings = model(**prepare_query(query))
hits = semantic_search(query_embeddings.float(), mpnet_data_embeddings.float(), top_k=10)

print(f"\n SEARCH QUERY: {query}")
for n, res in enumerate(hits[0]):
    print(f"\n Result (rank {n+1}) | Score: {res['score']} | Text: {dataset['train']['text'][res['corpus_id']]} ")

In [None]:
model.detachFromDevice()