# Quantization of a Text Embedding Model from the Sentence Transformers Library

In [None]:
# %pip install "optimum-intel[openvino]" evaluate

## Statically quantize the model to 8-bit with NNCF via the Optimum-Intel API

The code snippet below shows how to use the Optimum-Intel [Model Optimization API](https://huggingface.co/docs/optimum/en/intel/openvino/optimization#static-quantization) to quantize the model statically. It leverages [NNCF](https://github.com/openvinotoolkit/nncf) capabilities for static quantization of Transformer models, combining a dedicated quantization scheme with the SmoothQuant and Bias Correction methods to keep accuracy close to that of the original model.
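
To give some intuition for the SmoothQuant step mentioned above, here is a minimal pure-Python sketch of its core idea: activation outliers are scaled down per input channel and the matching weight rows are scaled up, so the layer output stays the same while the activation range becomes easier to quantize. The toy matrices, the helper names `smooth_scales` and `matmul`, and `alpha=0.5` are illustrative assumptions; this is not how NNCF implements the method.

```python
# Illustrative sketch of the SmoothQuant idea; not NNCF's implementation.
def smooth_scales(act_absmax, weight_absmax, alpha=0.5):
    """Per-input-channel scales s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)."""
    return [a**alpha / w ** (1.0 - alpha) for a, w in zip(act_absmax, weight_absmax)]


def matmul(x, w):
    """Plain matrix product for matrices stored as lists of rows."""
    return [[sum(xi * wj for xi, wj in zip(row, col)) for col in zip(*w)] for row in x]


# Toy activations (2 tokens x 3 channels); channel 1 holds an outlier.
X = [[0.5, 40.0, -1.0], [-0.25, -32.0, 2.0]]
# Toy weights (3 input channels x 2 output channels).
W = [[0.2, -0.4], [0.1, 0.3], [-0.5, 0.25]]

act_absmax = [max(abs(row[j]) for row in X) for j in range(3)]
weight_absmax = [max(abs(v) for v in W[j]) for j in range(3)]
s = smooth_scales(act_absmax, weight_absmax)

# Divide each activation channel by s_j and multiply the matching weight row by s_j.
X_smooth = [[v / s[j] for j, v in enumerate(row)] for row in X]
W_smooth = [[v * s[j] for v in W[j]] for j in range(3)]

Y = matmul(X, W)
Y_smooth = matmul(X_smooth, W_smooth)

print("largest |activation| before:", max(act_absmax))
print("largest |activation| after: ", max(abs(v) for row in X_smooth for v in row))
```

`Y` and `Y_smooth` agree up to floating-point error, while the largest activation magnitude shrinks, which is what makes the smoothed activations friendlier to 8-bit quantization.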

Static quantization requires data to estimate the quantization parameters of activations, so a calibration dataset must be provided. The `OVQuantizer` class used for quantization provides the `.get_calibration_dataset()` method to build such a dataset.
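
As a rough illustration of why calibration data is needed, the sketch below derives an 8-bit scale and zero point from the minimum and maximum of a few observed activation values. It is a generic min/max scheme written for this explanation; the function names and the sample values are made up, and NNCF's actual statistics collection and quantization scheme are more elaborate.

```python
def estimate_qparams(samples, num_bits=8):
    """Derive an asymmetric (scale, zero_point) pair from observed activation values."""
    lo = min(0.0, min(samples))  # widen the range so 0.0 stays exactly representable
    hi = max(0.0, max(samples))
    qmax = 2**num_bits - 1
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, num_bits=8):
    """Map a float to an unsigned integer code, clamping values outside the calibrated range."""
    q = round(x / scale) + zero_point
    return max(0, min(2**num_bits - 1, q))


def dequantize(q, scale, zero_point):
    """Map an integer code back to its approximate float value."""
    return (q - zero_point) * scale


# Made-up activation values standing in for what a layer produces on calibration samples.
calibration_activations = [-1.0, 0.2, 3.0, 1.5, -0.3]
scale, zero_point = estimate_qparams(calibration_activations)

roundtrip = [
    dequantize(quantize(x, scale, zero_point), scale, zero_point)
    for x in calibration_activations
]
for x, x_hat in zip(calibration_activations, roundtrip):
    print(f"{x:+.3f} -> {x_hat:+.3f}")
```

Values inside the calibrated range are recovered to within half a quantization step, whereas values far outside it are clamped. This is why the calibration samples should resemble the inputs the model will see in practice.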

In [2]:
from functools import partial

from transformers import AutoTokenizer

from optimum.intel import OVConfig, OVModelForFeatureExtraction, OVQuantizationConfig, OVQuantizer


MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
base_model_path = "all-MiniLM-L6-v2"
int8_ptq_model_path = "all-MiniLM-L6-v2_int8"

model = OVModelForFeatureExtraction.from_pretrained(MODEL_ID)
model.save_pretrained(base_model_path)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.save_pretrained(base_model_path)


quantizer = OVQuantizer.from_pretrained(model)

def preprocess_function(examples, tokenizer):
    return tokenizer(examples["sentence"], padding="max_length", max_length=384, truncation=True)


calibration_dataset = quantizer.get_calibration_dataset(
    "glue",
    dataset_config_name="sst2",
    preprocess_function=partial(preprocess_function, tokenizer=tokenizer),
    num_samples=300,
    dataset_split="train",
)

ov_config = OVConfig(quantization_config=OVQuantizationConfig())

quantizer.quantize(ov_config=ov_config, calibration_dataset=calibration_dataset, save_directory=int8_ptq_model_path)
tokenizer.save_pretrained(int8_ptq_model_path)

No OpenVINO files were found for sentence-transformers/all-MiniLM-L6-v2, setting `export=True` to convert the model to the OpenVINO IR. Don't forget to save the resulting model with `.save_pretrained()`
Framework not specified. Using pt to export the model.
Using framework PyTorch: 2.4.1+cpu
Overriding 1 configuration item(s)
	- use_cache -> False
Compiling the model to CPU ...


Configuration saved in all-MiniLM-L6-v2_int8/openvino_config.json


('all-MiniLM-L6-v2_int8/tokenizer_config.json',
 'all-MiniLM-L6-v2_int8/special_tokens_map.json',
 'all-MiniLM-L6-v2_int8/vocab.txt',
 'all-MiniLM-L6-v2_int8/added_tokens.json',
 'all-MiniLM-L6-v2_int8/tokenizer.json')

## Benchmark model accuracy on GLUE STSB task

Here we estimate the accuracy impact of quantization. We evaluate both the baseline and the quantized model on STS-B, a GLUE task different from the one used for calibration (SST-2).
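
The GLUE `stsb` metric used in this section reports a Pearson correlation (the evaluation code reads its `pearson` key). The pure-Python sketch below shows what that number measures and that it does not change when the human scores are rescaled linearly. The `pearson` helper and all numbers are illustrative assumptions, not values from this notebook.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up numbers for illustration only.
human_scores = [0.0, 1.5, 3.0, 4.2, 5.0]  # STS-B style similarity labels in [0, 5]
predicted = [0.10, 0.35, 0.55, 0.80, 0.95]  # cosine similarities from a model

r_raw = pearson(predicted, human_scores)
r_rescaled = pearson(predicted, [s / 5 for s in human_scores])

print("pearson with raw labels:     ", r_raw)
print("pearson with rescaled labels:", r_rescaled)
```

Because the correlation is invariant to positive linear rescaling, the exact range the reference labels are mapped to does not affect the reported score; only how well the predicted similarities track the human judgments matters.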

In [3]:
import torch
import torch.nn.functional as F
from transformers import Pipeline


# copied from the model card "sentence-transformers/all-MiniLM-L6-v2"
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


class SentenceEmbeddingPipeline(Pipeline):
    def _sanitize_parameters(self, **kwargs):
        # we don't have any hyperparameters to sanitize
        preprocess_kwargs = {}
        return preprocess_kwargs, {}, {}

    def preprocess(self, inputs):
        encoded_inputs = self.tokenizer(inputs, padding=True, truncation=True, return_tensors="pt")
        return encoded_inputs

    def _forward(self, model_inputs):
        outputs = self.model(**model_inputs)
        return {"outputs": outputs, "attention_mask": model_inputs["attention_mask"]}

    def postprocess(self, model_outputs):
        # Perform pooling
        sentence_embeddings = mean_pooling(model_outputs["outputs"], model_outputs["attention_mask"])
        # Normalize embeddings
        sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
        return sentence_embeddings

In [4]:
model = OVModelForFeatureExtraction.from_pretrained(base_model_path)
vanilla_emb = SentenceEmbeddingPipeline(model=model, tokenizer=tokenizer)

q_model = OVModelForFeatureExtraction.from_pretrained(int8_ptq_model_path)
q8_emb = SentenceEmbeddingPipeline(model=q_model, tokenizer=tokenizer)

Compiling the model to CPU ...


Compiling the model to CPU ...


In [5]:
from datasets import load_dataset
from evaluate import load


eval_dataset = load_dataset("glue", "stsb", split="validation")
metric = load("glue", "stsb")

In [6]:
def compute_sentence_similarity(sentence_1, sentence_2, pipeline):
    embedding_1 = pipeline(sentence_1)
    embedding_2 = pipeline(sentence_2)
    # compute cosine similarity between two sentences
    return torch.nn.functional.cosine_similarity(embedding_1, embedding_2, dim=1)


def evaluate_stsb(example):
    default = compute_sentence_similarity(example["sentence1"], example["sentence2"], vanilla_emb)
    quantized = compute_sentence_similarity(example["sentence1"], example["sentence2"], q8_emb)
    return {
        "reference": example["label"] / 5,  # STS-B labels lie in [0, 5]; rescale to [0, 1]
        "default": float(default),
        "quantized": float(quantized),
    }


result = eval_dataset.map(evaluate_stsb)



In [7]:
default_acc = metric.compute(predictions=result["default"], references=result["reference"])
quantized = metric.compute(predictions=result["quantized"], references=result["reference"])

print("vanilla model: pearson=", default_acc["pearson"])
print("quantized model: pearson=", quantized["pearson"])
print(
    "The quantized model achieves ",
    round(quantized["pearson"] / default_acc["pearson"], 2) * 100,
    "% accuracy of the fp32 model",
)

vanilla model: pearson= 0.869619439095004
quantized model: pearson= 0.869415534480936
The quantized model achieves  100.0 % accuracy of the fp32 model


## Compare performance of the baseline and INT8 models

We use the OpenVINO `benchmark_app` with the static input shape `[1,384]` for performance benchmarking. This should reflect application performance when the tokenizer pads or truncates the input sequence to `max_length=384`.
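
The `-shape` argument passed to `benchmark_app` lists one static shape per model input. As a small convenience sketch, the hypothetical helper `build_benchmark_cmd` below assembles the same command line that the following cells run by hand. It uses only the flags that appear in this notebook (`-m`, `-shape`, `-api`, `-niter`) and does not execute anything.

```python
import shlex


def build_benchmark_cmd(model_xml, input_names, batch=1, seq_len=384, niter=200):
    """Assemble a benchmark_app command line with one static shape per model input."""
    shape = ",".join(f"{name}[{batch},{seq_len}]" for name in input_names)
    args = ["benchmark_app", "-m", model_xml, "-shape", shape, "-api", "sync", "-niter", str(niter)]
    return shlex.join(args)  # quotes the shape string so the shell passes it as one argument


cmd = build_benchmark_cmd(
    "all-MiniLM-L6-v2/openvino_model.xml",
    ["input_ids", "attention_mask", "token_type_ids"],
)
print(cmd)
```

Building the shape string from the input names keeps the FP32 and INT8 runs on identical settings, so only the model path differs between the two measurements.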

In [8]:
# FP32 baseline model
!benchmark_app -m all-MiniLM-L6-v2/openvino_model.xml -shape "input_ids[1,384],attention_mask[1,384],token_type_ids[1,384]" -api sync -niter 200

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


[Step 1/11] Parsing and validating input arguments
[ INFO ] Parsing input parameters
[Step 2/11] Loading OpenVINO Runtime
[ INFO ] OpenVINO:
[ INFO ] Build ................................. 2024.4.1-16618-643f23d1318-releases/2024/4
[ INFO ] 
[ INFO ] Device info:
[ INFO ] CPU
[ INFO ] Build ................................. 2024.4.1-16618-643f23d1318-releases/2024/4
[ INFO ] 
[ INFO ] 
[Step 3/11] Setting device configuration
[Step 4/11] Reading model files
[ INFO ] Loading model files
[ INFO ] Read model took 10.17 ms
[ INFO ] Original model I/O parameters:
[ INFO ] Model inputs:
[ INFO ]     input_ids (node: input_ids) : i64 / [...] / [?,?]
[ INFO ]     attention_mask (node: attention_mask) : i64 / [...] / [?,?]
[ INFO ]     token_type_ids (node: token_type_ids) : i64 / [...] / [?,?]
[ INFO ] Model outputs:
[ INFO ]     last_hidden_state (node: __module.encoder.layer.5.output.LayerNorm/aten::layer_norm/Add) : f32 / [...] / [?,?,384]
[Step 5/11] Resizing model to match image sizes an

In [9]:
# INT8 counterpart
!benchmark_app -m all-MiniLM-L6-v2_int8/openvino_model.xml -shape "input_ids[1,384],attention_mask[1,384],token_type_ids[1,384]" -api sync -niter 200

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


[Step 1/11] Parsing and validating input arguments
[ INFO ] Parsing input parameters
[Step 2/11] Loading OpenVINO Runtime
[ INFO ] OpenVINO:
[ INFO ] Build ................................. 2024.4.1-16618-643f23d1318-releases/2024/4
[ INFO ] 
[ INFO ] Device info:
[ INFO ] CPU
[ INFO ] Build ................................. 2024.4.1-16618-643f23d1318-releases/2024/4
[ INFO ] 
[ INFO ] 
[Step 3/11] Setting device configuration
[Step 4/11] Reading model files
[ INFO ] Loading model files
[ INFO ] Read model took 20.87 ms
[ INFO ] Original model I/O parameters:
[ INFO ] Model inputs:
[ INFO ]     input_ids (node: input_ids) : i64 / [...] / [?,?]
[ INFO ]     attention_mask (node: attention_mask) : i64 / [...] / [?,?]
[ INFO ]     token_type_ids (node: token_type_ids) : i64 / [...] / [?,?]
[ INFO ] Model outputs:
[ INFO ]     last_hidden_state (node: __module.encoder.layer.5.output.LayerNorm/aten::layer_norm/Add) : f32 / [...] / [?,?,384]
[Step 5/11] Resizing model to match image sizes an