# Optimum

This notebook demonstrates how to use Optimum to quantize models hosted on the Hugging Face Hub with the ONNX Runtime quantization tool.

## Setup Optimum

First, install Optimum and import the required modules.

In [None]:
# The onnxruntime extra pulls in ONNX Runtime, which optimum.onnxruntime requires
%pip install "optimum[onnxruntime]"

In [None]:
from optimum.onnxruntime import ORTQuantizer, ORTModelForImageClassification
from functools import partial
from optimum.onnxruntime.configuration import AutoQuantizationConfig, AutoCalibrationConfig
from onnxruntime.quantization import QuantType
from transformers import AutoFeatureExtractor
from PIL import Image
import requests

## Load ONNX Runtime Model

Load the ONNX Runtime model from the Hugging Face Hub. We will use the Vision Transformer `vit-base-patch16-224`.

In [None]:
preprocessor = AutoFeatureExtractor.from_pretrained("optimum/vit-base-patch16-224")
model = ORTModelForImageClassification.from_pretrained("optimum/vit-base-patch16-224")
model.save_pretrained("models/vit-base-patch16-224")

## Dynamic Quantization

As in ONNX Runtime, dynamic quantization computes the quantization parameters (scale and zero point) for activations on the fly at inference time. This tends to preserve accuracy better, but the extra computation adds latency.
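
To make the idea concrete, here is a minimal pure-Python sketch of asymmetric 8-bit quantization in which the range is derived from the tensor being quantized. It is an illustration only, not ONNX Runtime's implementation; the helper names `quant_params`, `quantize`, and `dequantize` are hypothetical.

```python
def quant_params(values, qmin=0, qmax=255):
    """Return (scale, zero_point) mapping the observed range onto [qmin, qmax]."""
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo = min(0.0, min(values))
    hi = max(0.0, max(values))
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=0, qmax=255):
    """Map floats to integers in [qmin, qmax], clamping anything out of range."""
    return [min(qmax, max(qmin, int(round(v / scale)) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map quantized integers back to approximate floats."""
    return [(q - zero_point) * scale for q in q_values]


# "Dynamic": the range is recomputed from each activation tensor at inference time.
activations = [-1.0, 0.0, 0.5, 3.0]
scale, zero_point = quant_params(activations)
q = quantize(activations, scale, zero_point)
restored = dequantize(q, scale, zero_point)
print(q, restored)
```

Because the range always matches the data at hand, no value is clipped; the cost is the extra pass over each tensor to find its minimum and maximum.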

To perform dynamic quantization, first create a quantizer with the `ORTQuantizer` class and define the configuration with `AutoQuantizationConfig`, then call the `quantize()` method to quantize the model.

In [None]:
quantizer = ORTQuantizer.from_pretrained(model)
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
dqconfig.weights_dtype = QuantType.QUInt8
model_quantized_path = quantizer.quantize(
    save_dir="models/vit-base-patch16-224-quantized-dynamic",
    quantization_config=dqconfig,
)

## Check Model Size

Compare the size of the original model and the quantized model.

Size of original model

In [None]:
%ls -lh models/vit-base-patch16-224

Size of quantized model

In [None]:
%ls -lh models/vit-base-patch16-224-quantized-dynamic
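
The `%ls` magic depends on the host shell. As a portable alternative, the sizes can be computed in Python. This is a small sketch; `dir_size_mb` is a hypothetical helper, and the paths are the save directories used earlier in this notebook.

```python
import os


def dir_size_mb(path):
    """Sum the sizes of all files under `path`, in mebibytes."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / (1024 * 1024)


for label, path in [
    ("original", "models/vit-base-patch16-224"),
    ("dynamic", "models/vit-base-patch16-224-quantized-dynamic"),
]:
    # Skip directories that have not been created yet.
    if os.path.isdir(path):
        print(f"{label}: {dir_size_mb(path):.1f} MB")
```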

## Check Model Result

Next, validate the quantized model by comparing its prediction with that of the original model.
The function below runs inference given a model, a processor, and an image, and returns the classification result as an ImageNet label.

In [None]:

def infer_ImageNet(classification_model, processor, image):
    inputs = processor(images=image, return_tensors="pt")
    outputs = classification_model(**inputs)
    logits = outputs.logits
    predicted_class_idx = logits.argmax(-1).item()
    return classification_model.config.id2label[predicted_class_idx]

# Get sample image
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

res = infer_ImageNet(model, preprocessor, image)
print("Original model prediction:", res)

quantized_model = ORTModelForImageClassification.from_pretrained(model_quantized_path)
dq_res = infer_ImageNet(quantized_model, preprocessor, image)
print("Quantized model prediction:", dq_res)
display(image)
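
Comparing only the top-1 label can hide smaller shifts in the output distribution. The sketch below compares the top-k classes of two logit vectors instead. It uses made-up logits, and `softmax`, `top_k`, and `top_k_overlap` are hypothetical helpers, not part of Optimum.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, k=5):
    """Return the k most probable (class_index, probability) pairs."""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in order[:k]]


def top_k_overlap(logits_a, logits_b, k=5):
    """Fraction of the top-k classes that the two logit vectors share."""
    a = {i for i, _ in top_k(logits_a, k)}
    b = {i for i, _ in top_k(logits_b, k)}
    return len(a & b) / k


# Made-up logits standing in for the original and quantized model outputs.
reference_logits = [2.0, 1.0, 0.1, -1.0]
quantized_logits = [1.9, 1.1, -0.9, 0.0]
print(top_k_overlap(reference_logits, quantized_logits, k=2))
print(top_k_overlap(reference_logits, quantized_logits, k=3))
```

In this notebook, the logits could be taken from `model(**inputs).logits[0].tolist()` for each model, assuming the inputs are built with `return_tensors="pt"` as in `infer_ImageNet`.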

## Static Quantization

Static quantization, as in ONNX Runtime, computes the quantization parameters for activations ahead of time from a calibration dataset. Inference is typically faster than with dynamic quantization, but accuracy can be lower.

With Optimum, the calibration dataset can be created using the `quantizer.get_calibration_dataset()` method, which accepts any dataset from the Hugging Face Hub or a local folder. Once the calibration dataset is created and the calibration configuration is defined using `AutoCalibrationConfig.minmax()`, perform calibration by calling the `quantizer.fit()` method, then quantize the model using the `quantizer.quantize()` method.
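
Conceptually, min-max calibration records the smallest and largest activation seen across the calibration samples and then freezes the quantization parameters. The pure-Python sketch below illustrates that idea under simplifying assumptions (a single tensor, unsigned 8-bit). It is not the Optimum or ONNX Runtime implementation, and all function names are hypothetical.

```python
def calibrate_minmax(batches):
    """Track the global min and max over all calibration batches."""
    lo, hi = float("inf"), float("-inf")
    for batch in batches:
        lo = min(lo, min(batch))
        hi = max(hi, max(batch))
    return lo, hi


def static_params(lo, hi, qmin=0, qmax=255):
    """Turn a calibrated range into a fixed (scale, zero_point) pair."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # keep 0.0 exactly representable
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=0, qmax=255):
    """Quantize with fixed parameters; out-of-range values saturate."""
    return [min(qmax, max(qmin, int(round(v / scale)) + zero_point)) for v in values]


# Made-up activations from three calibration samples.
calibration_batches = [[-0.5, 0.2, 1.0], [-1.0, 0.0, 0.7], [0.1, 1.5, 3.0]]
lo, hi = calibrate_minmax(calibration_batches)
scale, zero_point = static_params(lo, hi)

# At inference the parameters are reused as-is; 5.0 lies outside the calibrated range.
print(quantize([5.0, -1.0, 0.0], scale, zero_point))
```

The saturation of out-of-range values is one reason static quantization can lose accuracy when the calibration data is not representative of real inputs.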

In [None]:
quantizer = ORTQuantizer.from_pretrained(model)
static_qconfig = AutoQuantizationConfig.arm64(is_static=True, per_channel=False)

# Create the calibration dataset
def preprocess_fn(ex, processor):
    return processor(ex["image"])

calibration_dataset = quantizer.get_calibration_dataset(
    "zh-plus/tiny-imagenet",
    preprocess_function=partial(preprocess_fn, processor=preprocessor),
    num_samples=50,
    dataset_split="train",
)

# Create the calibration configuration containing the parameters related to calibration.
calibration_config = AutoCalibrationConfig.minmax(calibration_dataset)

# Perform the calibration step: computes the activations quantization ranges
ranges = quantizer.fit(
    dataset=calibration_dataset,
    calibration_config=calibration_config,
    operators_to_quantize=static_qconfig.operators_to_quantize,
)

# Apply static quantization on the model
model_quantized_path_static = quantizer.quantize(
    save_dir="models/vit-base-patch16-224-quantized-static",
    calibration_tensors_range=ranges,
    quantization_config=static_qconfig,
)

## Check Model Size

Again, check the size of the quantized model.

In [None]:
%ls -lh models/vit-base-patch16-224-quantized-static

## Validate Quantized Model

Finally, validate the quantized model by comparing the result of the original model and the quantized model.

In [None]:
static_quantized_model = ORTModelForImageClassification.from_pretrained(model_quantized_path_static)
sq_res = infer_ImageNet(static_quantized_model, preprocessor, image)
print("Quantized model prediction (static):", sq_res)
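
The latency trade-offs described above can be checked empirically. Below is a minimal timing sketch using only the standard library; `benchmark` is a hypothetical helper, and the workload shown is a stand-in so the cell runs on its own.

```python
import statistics
import time


def benchmark(fn, warmup=3, runs=20):
    """Call `fn` repeatedly and report wall-clock latency in milliseconds."""
    # Warm-up calls are excluded from the timings.
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.mean(timings),
        "median_ms": statistics.median(timings),
        "runs": runs,
    }


# Stand-in workload; replace with a model call to time real inference.
stats = benchmark(lambda: sum(i * i for i in range(10_000)))
print(stats)
```

To time the models from this notebook, one could prepare `inputs = preprocessor(images=image, return_tensors="pt")` once and pass, for example, `lambda: static_quantized_model(**inputs)` as `fn`. Results depend heavily on the hardware and on which quantized operators the execution provider supports.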