# Speech Transcription on IPUs using Whisper - Quantized Inference

This notebook demonstrates speech transcription on the IPU using the [Whisper implementation in the 🤗 Transformers library](https://huggingface.co/spaces/openai/whisper) alongside [Optimum Graphcore](https://github.com/huggingface/optimum-graphcore), using INT4 **group quantization**.

Whisper is a versatile speech recognition model that can transcribe speech as well as perform multi-lingual translation and recognition tasks.
It was trained on diverse datasets to give human-level speech recognition performance without the need for fine-tuning.

This notebook demonstrates the use of group quantization in Whisper inference to compress the weights from FP16 to INT4. Group quantization is a common scheme: it divides each weight matrix into groups of 16 elements and, for each group, stores the maximum and minimum values as FP16. It then divides the range between the minimum and maximum of each group into 16 intervals and encodes each element as an INT4 value according to the interval it falls into. This gives a compression of about 3.5x. During the forward pass, the weights are decompressed on the fly back to FP16 for computation. There is a small loss of accuracy when using these compressed values, but it typically amounts to only about 0.1% in word error rate (WER).
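
The scheme described above can be sketched in a few lines of plain Python. This is only an illustration, not the Optimum Graphcore implementation: the names `quantize` and `dequantize` are hypothetical, Python floats stand in for FP16 values, and rounding each element to the nearest of 16 evenly spaced levels is an assumption about how elements are assigned to codes.

```python
from typing import List, Tuple

GROUP_SIZE = 16  # elements per group, as described above
LEVELS = 16      # a 4-bit code can take 16 distinct values (0..15)

Group = Tuple[float, float, List[int]]  # (minimum, maximum, 4-bit codes)


def quantize(weights: List[float]) -> List[Group]:
    """Split `weights` into groups and encode each as (min, max, codes)."""
    groups = []
    for start in range(0, len(weights), GROUP_SIZE):
        chunk = weights[start:start + GROUP_SIZE]
        lo, hi = min(chunk), max(chunk)
        # Spacing between adjacent levels; fall back to 1.0 for a constant group.
        step = (hi - lo) / (LEVELS - 1) or 1.0
        codes = [round((w - lo) / step) for w in chunk]
        groups.append((lo, hi, codes))
    return groups


def dequantize(groups: List[Group]) -> List[float]:
    """Rebuild approximate weights from each group's range and codes."""
    restored = []
    for lo, hi, codes in groups:
        step = (hi - lo) / (LEVELS - 1)
        restored.extend(lo + code * step for code in codes)
    return restored
```

In this sketch, each restored value differs from the original by at most half a level spacing within its group, which is why groups with a narrow value range lose very little precision.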


[🤗 Optimum Graphcore](https://github.com/huggingface/optimum-graphcore) is the interface between the [🤗 Transformers library](https://huggingface.co/docs/transformers/index) and [Graphcore IPUs](https://www.graphcore.ai/products/ipu).
It provides a set of tools for model parallelization and loading on IPUs, and for training and fine-tuning on all the tasks already supported by 🤗 Transformers, while remaining compatible with the 🤗 Hub and every model available on it out of the box.

> **Hardware requirements:** All the Whisper models from `whisper-tiny` to `whisper-large-v2` can run in inference mode on the smallest machine, an IPU-POD4.

[![Join our Slack Community](https://img.shields.io/badge/Slack-Join%20Graphcore's%20Community-blue?style=flat-square&logo=slack)](https://www.graphcore.ai/join-community)

## Environment setup

The best way to run this demo is on Paperspace Gradient's cloud IPUs because everything is already set up for you.

[![Run on Gradient](https://assets.paperspace.io/img/gradient-badge.svg)](https://ipu.dev/Djq2SC)

To run the demo using other IPU hardware, you need to have the Poplar SDK enabled and a PopTorch wheel installed. Refer to the [Getting Started guide for your system](https://docs.graphcore.ai/en/latest/getting-started.html) for details on how to do this. Also refer to the Jupyter Quick Start guide for how to set up Jupyter to be able to run this notebook on a remote IPU machine.

## Dependencies

Install the dependencies the notebook needs.

In [None]:
# Install a pinned release of Optimum Graphcore plus the other dependencies
!pip install optimum-graphcore==0.7.1 transformers librosa matplotlib graphcore-cloud-tools[logger]@git+https://github.com/graphcore/graphcore-cloud-tools
%load_ext graphcore_cloud_tools.notebook_logging.gc_logger

In order to improve usability and support for future users, Graphcore would like to collect information about the applications and code being run in this notebook. The following information will be anonymised before being sent to Graphcore:

- User progression through the notebook
- Notebook details: number of cells, code being run and the output of the cells
- Environment details

You can disable logging at any time by running `%unload_ext graphcore_cloud_tools.notebook_logging.gc_logger` from any cell.

IPU Whisper with group quantization requires features from Poplar SDK version 3.3 or later. The following code checks whether these features can be enabled.

In [None]:
import warnings
from transformers.utils.versions import require_version

try:
    require_version("poptorch>=3.3")
    enable_sdk_features = True
    print("SDK check passed.")
except Exception:
    enable_sdk_features = False
    warnings.warn("SDK versions earlier than 3.3 do not support the functionality in this notebook. We recommend that you relaunch the Paperspace Notebook with the PyTorch SDK 3.3 image. You can use https://hub.docker.com/r/graphcore/pytorch-early-access")


## Running Whisper on the IPU

We start by importing the required modules, some of which are needed to configure the IPU.


In [None]:
# Generic imports
import os
from datasets import load_dataset
import matplotlib.pyplot as plt
import librosa
import IPython
import random


# IPU-specific imports
from optimum.graphcore import IPUConfig
from optimum.graphcore.modeling_utils import to_pipelined
from optimum.graphcore.models.whisper import WhisperProcessorTorch

# HF-related imports
from transformers import WhisperForConditionalGeneration

This notebook demonstrates how to run all sizes of Whisper. All sizes will fit on an IPU-POD4:

- `whisper-tiny`, `base` and `small` only require 1 IPU
- `whisper-medium` requires 2 IPUs
- `whisper-large` requires 4 IPUs

The Whisper model is available on Hugging Face in several sizes, from `whisper-tiny` with 39M parameters to `whisper-large` with 1550M parameters.

The [Whisper architecture](https://openai.com/research/whisper) is an encoder-decoder Transformer, with the audio split into 30-second chunks.
- For `whisper-tiny`, `small` and `base`, both the encoder and the decoder fit on 1 IPU.
- For `whisper-medium`, one IPU is used for the encoder and another for the decoder.
- For `whisper-large`, two IPUs are used for the encoder and two others for the decoder.

The `IPUConfig` object helps to configure the model to be pipelined across the IPUs.
The number of transformer layers per IPU can be adjusted by using `layers_per_ipu`.

In [None]:
num_available_ipus=int(os.getenv("NUM_AVAILABLE_IPU", 4))
cache_dir = os.getenv("POPLAR_EXECUTABLE_CACHE_DIR", "./exe_cache")

configs = {
    "tiny": ("openai/whisper-tiny.en", 
        IPUConfig(executable_cache_dir=cache_dir,
                  ipus_per_replica=1,
                  explicit_ir_inference=True,
                 )),
    
    "base": ("openai/whisper-base.en", 
        IPUConfig(executable_cache_dir=cache_dir,
                  ipus_per_replica=1,
                  explicit_ir_inference=True,
                 )),

    "small": ("openai/whisper-small.en", 
        IPUConfig(executable_cache_dir=cache_dir,
                  ipus_per_replica=1,
                  explicit_ir_inference=True,
                 )),
    
    "medium": ("openai/whisper-medium.en",
        IPUConfig(executable_cache_dir=cache_dir,
                  ipus_per_replica=2,
                  explicit_ir_inference=True,
                 )),

    "large": ("openai/whisper-large-v2", 
        IPUConfig(executable_cache_dir=cache_dir,
                  ipus_per_replica=4,
                  layers_per_ipu=[-1, -1, 14, 18],
                  matmul_proportion=0.1,
                  inference_projection_serialization_factor=5,
                  explicit_ir_inference=True,
                )),
}


def select_whisper_config(size: str, custom_checkpoint: str = None):
    model_checkpoint, ipu_config = configs[size]
    if custom_checkpoint is not None:
        model_checkpoint = custom_checkpoint

    print(f"Using whisper-{size} config with the checkpoint '{model_checkpoint}'.")
    return model_checkpoint, ipu_config 
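
The `large` configuration above passes `layers_per_ipu=[-1, -1, 14, 18]`. One way to read this, assuming `-1` acts as a wildcard meaning "share the remaining layers evenly", is sketched below. The helper `expand_layers_per_ipu` is hypothetical and is not part of Optimum Graphcore; the library's actual placement logic may differ. The total of 64 layers assumes the 32 encoder and 32 decoder layers of `whisper-large`.

```python
from typing import List


def expand_layers_per_ipu(layers_per_ipu: List[int], total_layers: int) -> List[int]:
    """Replace each -1 wildcard with an even share of the unassigned layers."""
    fixed = sum(n for n in layers_per_ipu if n != -1)
    wildcards = layers_per_ipu.count(-1)
    remaining = total_layers - fixed
    if remaining < 0 or (wildcards == 0 and remaining != 0):
        raise ValueError("layers_per_ipu does not match the total number of layers")
    if wildcards == 0:
        return list(layers_per_ipu)

    # Spread the remaining layers over the wildcards; earlier IPUs take any extras.
    base, extra = divmod(remaining, wildcards)
    expanded = []
    seen = 0
    for n in layers_per_ipu:
        if n == -1:
            expanded.append(base + (1 if seen < extra else 0))
            seen += 1
        else:
            expanded.append(n)
    return expanded


# Illustrative reading of the `large` configuration (64 layers in total).
print(expand_layers_per_ipu([-1, -1, 14, 18], total_layers=64))
```

Under these assumptions the first two IPUs each hold 16 encoder layers and the last two hold 14 and 18 decoder layers, which matches the two-plus-two encoder/decoder split described earlier.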

Select the Whisper size below. Try `"tiny"`, `"base"`, `"small"`, `"medium"` or `"large"`.

In [None]:
model_checkpoint, ipu_config = select_whisper_config("tiny", custom_checkpoint=None) 

You can also use a custom checkpoint from Hugging Face Hub using the argument `custom_checkpoint` above. In this case, you have to make sure that `size` matches the checkpoint model size.

Two features of Optimum Graphcore are demonstrated below:
1. `use_cond_encoder` : This enables putting the Whisper encoder and decoder on a single IPU and switching between them using a compiled `cond` operation. This is only available if `ipus_per_replica == 1`.
2. `use_group_quantized_linears` : This enables compressing all the weights of the transformer block's linear layers to INT4 using group quantization.

In [None]:
# Instantiate processor and model
processor = WhisperProcessorTorch.from_pretrained(model_checkpoint)
model = WhisperForConditionalGeneration.from_pretrained(model_checkpoint)
num_beams = 1

# Adapt whisper to run on the IPU

pipelined_model = to_pipelined(model, ipu_config)
pipelined_model = pipelined_model.parallelize(
    for_generation=True, 
    use_cache=True, 
    batch_size=1,
    num_beams=num_beams,
    max_length=448,
    on_device_generation_steps=16,
    use_encoder_output_buffer=ipu_config.ipus_per_replica > 1,
    use_cond_encoder=ipu_config.ipus_per_replica == 1,
    use_group_quantized_linears=True # Enables quantization!
).half()

Now we can load the dataset and process an example audio file.
If precompiled models are not available, then the first run of the model triggers two graph compilations.
This means that our first test transcription could take a minute or two to run, but subsequent runs will be much faster.

In [None]:
# load the dataset and read an example sound file
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
test_sample = ds[2]
sample_rate = test_sample['audio']['sampling_rate']

def transcribe(data, rate):
    input_features = processor(data, return_tensors="pt", sampling_rate=rate).input_features.half()

    # This triggers a compilation, unless a precompiled model is available.
    sample_output = pipelined_model.generate(
        input_features,
        use_cache=True,
        num_beams=num_beams,
        max_length=448, 
        min_length=3)
    transcription = processor.batch_decode(sample_output, skip_special_tokens=True)[0]
    return transcription

test_transcription = transcribe(test_sample["audio"]["array"], sample_rate)

In the next cell, we compare the expected text from the dataset with the transcribed result from the model.
There will typically be some small differences, but even `whisper-tiny` does a great job! It even adds punctuation.

You can listen to the audio and compare the model result yourself using the controls below.

In [None]:
print(f"Expected: {test_sample['text']}\n")
print(f"Transcribed: {test_transcription}")

plt.figure(figsize=(14, 5))
librosa.display.waveshow(test_sample["audio"]["array"], sr=sample_rate)
IPython.display.Audio(test_sample["audio"]["array"], rate=sample_rate)
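
To put a number on the differences between the expected and transcribed text, a word error rate can be computed. The following is a minimal sketch, not the metric used to produce the WER figure quoted in the introduction: `normalize` and `word_error_rate` are hypothetical helpers, and the normalization (lowercasing and dropping punctuation) is an assumption chosen because the dataset text is uppercase and unpunctuated while Whisper output is not.

```python
import re
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, replace punctuation with spaces, and split into words."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Dynamic programming over one row of the edit-distance table at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

With the variables from the cell above, `word_error_rate(test_sample["text"], test_transcription)` would give the error rate for the example sample.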

The model only needs to be compiled once. Subsequent inferences will be much faster.
In the cell below, we repeat the exercise but with a random example from the dataset.

You might like to re-run this next cell multiple times to get different comparisons.

In [None]:
idx = random.randint(0, ds.num_rows - 1)
data = ds[idx]["audio"]["array"]

print(f"Example #{idx}\n")
print(f"Expected: {ds[idx]['text']}\n")
print(f"Transcribed: {transcribe(data, sample_rate)}")

plt.figure(figsize=(14, 5))
librosa.display.waveshow(data, sr=sample_rate)
IPython.display.Audio(data, rate=sample_rate, autoplay=True)
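
To see the difference between the first, compiling run and later runs, the calls can be timed. This is a small sketch using only the standard library; `time_calls` is a hypothetical helper, not something provided by Optimum Graphcore.

```python
import time
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def time_calls(fn: Callable[[], T], repeats: int = 3) -> Tuple[List[float], T]:
    """Call `fn` `repeats` times; return the durations in seconds and the last result."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        durations.append(time.perf_counter() - start)
    return durations, result
```

For example, `durations, text = time_calls(lambda: transcribe(data, sample_rate))` would record one duration per call. If the model has not yet been compiled, the first entry should be noticeably larger than the rest.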

Finally, we detach the process from the IPUs when we are done to make the IPUs available to other users.

In [None]:
pipelined_model.detachFromDevice()

## Next Steps

The `whisper-tiny` model used here is very fast for inference and so cheap to run, but its accuracy can be improved.
The `whisper-base`, `whisper-small` and `whisper-medium` models have 74M, 244M and 769M parameters respectively (compared to just 39M for `whisper-tiny`). You can try them out by changing `select_whisper_config("tiny", custom_checkpoint=None)` (in the model selection cell earlier in this notebook) to one of:
- `select_whisper_config("base", custom_checkpoint=None)`
- `select_whisper_config("small", custom_checkpoint=None)`
- `select_whisper_config("medium", custom_checkpoint=None)`

Larger models and multilingual models are also available.
To access the multilingual models, remove the `.en` from the checkpoint name. Note, however, that the multilingual models are slightly less accurate for this English transcription task, but they can be used for transcribing other languages or for translating to English. The largest model, `whisper-large`, has 1550M parameters and requires a 4-IPU pipeline.
You can try it by setting `select_whisper_config("large", custom_checkpoint=None)`.
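
Removing the `.en` suffix can be done programmatically. The helper below is a hypothetical convenience, not part of any library, and it assumes the checkpoint follows the `openai/whisper-<size>.en` naming used in the `configs` dictionary.

```python
def multilingual_checkpoint(checkpoint: str) -> str:
    """Drop a trailing '.en' to get the multilingual checkpoint name."""
    suffix = ".en"
    if checkpoint.endswith(suffix):
        return checkpoint[: -len(suffix)]
    # Already multilingual (for example "openai/whisper-large-v2"): leave unchanged.
    return checkpoint


print(multilingual_checkpoint("openai/whisper-tiny.en"))
```

The resulting name can then be passed as the `custom_checkpoint` argument of `select_whisper_config`, keeping `size` matched to the checkpoint.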

You can also try using beam search by setting `num_beams>1` in the calls to `parallelize` and `generate` above. `whisper-small` will fit on 1 IPU with `num_beams=5`.

For `whisper-medium` with `num_beams>1`, the model needs 4 IPUs to fit. For `whisper-large` with `num_beams>1`, you will need more than the 4 IPUs in an IPU-POD4. On Paperspace, you can use either an IPU-POD16 or a Bow Pod16 machine, each with 16 IPUs. Please contact Graphcore if you need assistance running these larger models.


## Conclusion

In this notebook, we demonstrated using Whisper and group quantization for speech recognition and transcription on the IPU.
We used the Optimum Graphcore package to interface between the IPU and the 🤗 Transformers library. This meant that only a few lines of code were needed to get this state-of-the-art automated speech recognition model running on IPUs.