notebooks/whisper_finetuning.ipynb
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "fb7a2f86",
"metadata": {},
"source": [
"# Multi-lingual ASR Transcription on IPUs using Whisper - Fine-tuning\n",
"\n",
"This notebook demonstrates fine-tuning for multi-lingual speech transcription on the IPU using the [Whisper implementation in the 🤗 Transformers library](https://huggingface.co/spaces/openai/whisper) alongside [Optimum Graphcore](https://github.com/huggingface/optimum-graphcore). We will be using the Catalan subset of the [OpenSLR dataset](https://huggingface.co/datasets/openslr).\n",
"\n",
"Whisper is a versatile speech recognition model that can transcribe speech as well as perform multi-lingual translation and recognition tasks.\n",
"It was trained on diverse datasets to give human-level speech recognition performance without the need for fine-tuning. \n",
"\n",
"[🤗 Optimum Graphcore](https://github.com/huggingface/optimum-graphcore) is the interface between the [🤗 Transformers library](https://huggingface.co/docs/transformers/index) and [Graphcore IPUs](https://www.graphcore.ai/products/ipu).\n",
"It provides a set of tools that enables model parallelization, loading on IPUs, training and fine-tuning on all the tasks already supported by Transformers. Optimum Graphcore is also compatible with the 🤗 Hub and every model available on it out of the box.\n",
"\n",
"| Domain | Tasks | Model | Datasets | Workflow | Number of IPUs | Execution time |\n",
"|---------|-------|-------|----------|----------|--------------|--------------|\n",
"| Automatic Speech Recognition | Transcription | whisper-small | OpenSLR (SLR69) | Fine-tuning | 4 or 16 | 33 mins total (18 mins training, or 6 mins on POD16) |\n",
"\n",
"[](https://www.graphcore.ai/join-community)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "56a3e2a5",
"metadata": {},
"source": [
"## Environment setup\n",
"\n",
"The best way to run this demo is on Paperspace Gradient's cloud IPUs because everything is already set up for you.\n",
"\n",
"To run the demo using other IPU hardware, you need to have the Poplar SDK enabled. Refer to the [Getting Started guide](https://docs.graphcore.ai/en/latest/getting-started.html#getting-started) for your system for details on how to enable the Poplar SDK. Also refer to the [Jupyter Quick Start guide](https://docs.graphcore.ai/projects/jupyter-notebook-quick-start/en/latest/index.html) for how to set up Jupyter to be able to run this notebook on a remote IPU machine."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "4be57731",
"metadata": {},
"source": [
"## Dependencies and imports\n",
"\n",
"In order to improve usability and support for future users, Graphcore would like to collect information about the\n",
"applications and code being run in this notebook. The following information will be anonymised before being sent to Graphcore:\n",
"\n",
"- User progression through the notebook\n",
"- Notebook details: number of cells, code being run and the output of the cells\n",
"- Environment details\n",
"\n",
"You can disable logging at any time by running `%unload_ext graphcore_cloud_tools.notebook_logging.gc_logger` from any cell."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "070d9b99",
"metadata": {},
"source": [
"Install the dependencies the notebook needs."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fde99b10-e2d2-4787-877f-fb120e327ccb",
"metadata": {},
"outputs": [],
"source": [
"# Install optimum-graphcore from source \n",
"!pip install \"optimum-graphcore==0.7.1\" \"soundfile\" \"librosa\" \"evaluate\" \"jiwer\"\n",
"%pip install \"graphcore-cloud-tools[logger] @ git+https://github.com/graphcore/graphcore-cloud-tools\"\n",
"%load_ext graphcore_cloud_tools.notebook_logging.gc_logger"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "afdb91a5",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"n_ipu = int(os.getenv(\"NUM_AVAILABLE_IPU\", 4))\n",
"executable_cache_dir = os.getenv(\"POPLAR_EXECUTABLE_CACHE_DIR\", \"/tmp/exe_cache/\") + \"/whisper\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3872977f",
"metadata": {},
"outputs": [],
"source": [
"# Generic imports\n",
"from dataclasses import dataclass\n",
"from typing import Any, Dict, List, Union\n",
"\n",
"import evaluate\n",
"import numpy as np\n",
"import torch\n",
"from datasets import load_dataset, Audio, Dataset, DatasetDict\n",
"\n",
"# IPU-specific imports\n",
"from optimum.graphcore import (\n",
" IPUConfig, \n",
" IPUSeq2SeqTrainer, \n",
" IPUSeq2SeqTrainingArguments, \n",
")\n",
"from optimum.graphcore.models.whisper import WhisperProcessorTorch\n",
"\n",
"# HF-related imports\n",
"from transformers import WhisperForConditionalGeneration"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "fb7631ad",
"metadata": {},
"source": [
"## Load Dataset"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "44c7ce56",
"metadata": {},
"source": [
"Common Voice datasets consist of recordings of speakers reading text from Wikipedia in different languages. 🤗 Datasets enables us to easily download and prepare the training and evaluation splits.\n",
"\n",
"First, ensure you have accepted the terms of use on the 🤗 Hub: [mozilla-foundation/common_voice_13_0](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0). Once you have accepted the terms, you will have full access to the dataset and be able to download the data locally."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "873bf3af",
"metadata": {},
"outputs": [],
"source": [
"dataset = DatasetDict()\n",
"split_dataset = Dataset.train_test_split(\n",
" load_dataset(\"openslr\", \"SLR69\", split=\"train\", token=False), test_size=0.2, seed=0\n",
")\n",
"dataset[\"train\"] = split_dataset[\"train\"]\n",
"dataset[\"eval\"] = split_dataset[\"test\"]\n",
"print(dataset)"
]
},
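{
"attachments": {},
"cell_type": "markdown",
"id": "a0c1d2e3",
"metadata": {},
"source": [
"As an optional sanity check, we can look at one raw example before any processing. This is purely for inspection; `sentence` and `audio` are the column names provided by the OpenSLR loader."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a0c1d2e4",
"metadata": {},
"outputs": [],
"source": [
"# Optional: inspect one raw training example (transcription, sampling rate, number of audio samples)\n",
"sample = dataset[\"train\"][0]\n",
"print(sample[\"sentence\"])\n",
"print(sample[\"audio\"][\"sampling_rate\"], len(sample[\"audio\"][\"array\"]))"
]
},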
{
"attachments": {},
"cell_type": "markdown",
"id": "4ddfbefc",
"metadata": {},
"source": [
"The columns of interest are:\n",
"* `audio`: the raw audio samples\n",
"* `sentence`: the corresponding ground truth transcription. \n",
"\n",
"We drop the `path` column."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "42d261f4",
"metadata": {},
"outputs": [],
"source": [
"dataset = dataset.remove_columns([\"path\"])"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "536162dd",
"metadata": {},
"source": [
"Since Whisper was pre-trained on audio sampled at 16 kHz, we must ensure the Common Voice samples are downsampled accordingly."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "07aebab0",
"metadata": {},
"outputs": [],
"source": [
"dataset = dataset.cast_column(\"audio\", Audio(sampling_rate=16000))"
]
},
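{
"attachments": {},
"cell_type": "markdown",
"id": "b1d2e3f4",
"metadata": {},
"source": [
"Casting the column does not rewrite the audio files; 🤗 Datasets resamples each clip on the fly when it is accessed. The optional check below loads one example to confirm it is now reported at 16 kHz."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b1d2e3f5",
"metadata": {},
"outputs": [],
"source": [
"# Optional: accessing a sample after the cast should report a 16 kHz sampling rate\n",
"resampled = dataset[\"train\"][0][\"audio\"]\n",
"print(resampled[\"sampling_rate\"], resampled[\"array\"].shape)"
]
},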
{
"attachments": {},
"cell_type": "markdown",
"id": "c938925f",
"metadata": {},
"source": [
"## Prepare Dataset"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "28331a5c",
"metadata": {},
"source": [
"We prepare the datasets by extracting features from the raw audio inputs and injecting labels which are simply transcriptions with some basic processing.\n",
"\n",
"The feature extraction is provided by 🤗 Transformers `WhisperFeatureExtractor`. To decode generated tokens into text after running the model, we will similarly require a tokenizer, `WhisperTokenizer`. Both of these are wrapped by an instance of `WhisperProcessor`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "60013423",
"metadata": {},
"outputs": [],
"source": [
"MODEL_NAME = \"openai/whisper-small\"\n",
"LANGUAGE = \"spanish\"\n",
"TASK = \"transcribe\"\n",
"MAX_LENGTH = 224\n",
"\n",
"processor = WhisperProcessorTorch.from_pretrained(MODEL_NAME, language=LANGUAGE, task=TASK)\n",
"processor.tokenizer.pad_token = processor.tokenizer.eos_token\n",
"processor.tokenizer.max_length = MAX_LENGTH\n",
"processor.tokenizer.set_prefix_tokens(language=LANGUAGE, task=TASK)"
]
},
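{
"attachments": {},
"cell_type": "markdown",
"id": "c2e3f4a5",
"metadata": {},
"source": [
"To see what the language and task settings do, we can decode the special prefix tokens that the tokenizer forces at the start of generation. This optional sketch uses the tokenizer's `get_decoder_prompt_ids` method; the same IDs are set on the model's generation config later in the notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c2e3f4a6",
"metadata": {},
"outputs": [],
"source": [
"# Optional: decode the forced prefix tokens (language, task and timestamp markers)\n",
"prompt_ids = processor.tokenizer.get_decoder_prompt_ids(language=LANGUAGE, task=TASK)\n",
"print(processor.tokenizer.decode([token_id for _, token_id in prompt_ids]))"
]
},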
{
"cell_type": "code",
"execution_count": null,
"id": "ba3633b3",
"metadata": {},
"outputs": [],
"source": [
"def prepare_dataset(batch, processor):\n",
" inputs = processor.feature_extractor(\n",
" raw_speech=batch[\"audio\"][\"array\"],\n",
" sampling_rate=batch[\"audio\"][\"sampling_rate\"],\n",
" )\n",
" batch[\"input_features\"] = inputs.input_features[0].astype(np.float16)\n",
"\n",
" transcription = batch[\"sentence\"]\n",
" batch[\"labels\"] = processor.tokenizer(text=transcription).input_ids\n",
" return batch\n",
"\n",
"columns_to_remove = dataset.column_names[\"train\"]\n",
"dataset = dataset.map(\n",
" lambda elem: prepare_dataset(elem, processor),\n",
" remove_columns=columns_to_remove,\n",
" num_proc=1,\n",
")\n",
"\n",
"train_dataset = dataset[\"train\"]\n",
"eval_dataset = dataset[\"eval\"]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "c3101275",
"metadata": {},
"source": [
"Lastly, we pre-process the labels by padding them with values that will be ignored during fine-tuning. This padding is to ensure tensors of static shape are provided to the model. We do this on the fly via the data collator below."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bbd5a4ff",
"metadata": {},
"outputs": [],
"source": [
"@dataclass\n",
"class DataCollatorSpeechSeq2SeqWithLabelProcessing:\n",
" processor: Any\n",
"\n",
" def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:\n",
" batch = {}\n",
" batch[\"input_features\"] = torch.tensor([feature[\"input_features\"] for feature in features])\n",
" \n",
" label_features = [{\"input_ids\": feature[\"labels\"]} for feature in features]\n",
" labels_batch = self.processor.tokenizer.pad(label_features, return_tensors=\"pt\", padding=\"longest\", pad_to_multiple_of=MAX_LENGTH)\n",
" labels = labels_batch[\"input_ids\"].masked_fill(labels_batch.attention_mask.ne(1), -100)\n",
"\n",
" batch[\"labels\"] = labels\n",
"\n",
" return batch"
]
},
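{
"attachments": {},
"cell_type": "markdown",
"id": "d3f4a5b6",
"metadata": {},
"source": [
"As an optional check, we can run the collator on a couple of prepared examples. The batched `input_features` should be the fixed-size log-mel spectrograms, and `labels` should be padded to a multiple of `MAX_LENGTH`, with `-100` in the padded positions."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d3f4a5b7",
"metadata": {},
"outputs": [],
"source": [
"# Optional: check the collator output shapes on a small batch of prepared examples\n",
"collator = DataCollatorSpeechSeq2SeqWithLabelProcessing(processor)\n",
"test_batch = collator([train_dataset[i] for i in range(2)])\n",
"print(test_batch[\"input_features\"].shape, test_batch[\"labels\"].shape)"
]
},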
{
"attachments": {},
"cell_type": "markdown",
"id": "9e4c40dc",
"metadata": {},
"source": [
"## Define metrics"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "9ec3f989",
"metadata": {},
"source": [
"The performance of our fine-tuned model will be evaluated using word error rate (WER)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6a7e7c8d",
"metadata": {},
"outputs": [],
"source": [
"metric = evaluate.load(\"wer\")\n",
"\n",
"\n",
"def compute_metrics(pred, tokenizer):\n",
" pred_ids = pred.predictions\n",
" label_ids = pred.label_ids\n",
"\n",
" # replace -100 with the pad_token_id\n",
" pred_ids = np.where(pred_ids != -100, pred_ids, tokenizer.pad_token_id)\n",
" label_ids = np.where(label_ids != -100, label_ids, tokenizer.pad_token_id)\n",
"\n",
" pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)\n",
" label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)\n",
"\n",
" normalized_pred_str = [tokenizer._normalize(pred).strip() for pred in pred_str]\n",
" normalized_label_str = [tokenizer._normalize(label).strip() for label in label_str]\n",
"\n",
" wer = 100 * metric.compute(predictions=pred_str, references=label_str)\n",
" normalized_wer = 100 * metric.compute(predictions=normalized_pred_str, references=normalized_label_str)\n",
"\n",
" return {\"wer\": wer, \"normalized_wer\": normalized_wer}"
]
},
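{
"attachments": {},
"cell_type": "markdown",
"id": "e4a5b6c7",
"metadata": {},
"source": [
"As a tiny worked example of what WER measures: one substituted word in a three-word reference gives a WER of 1/3, roughly 33%. The strings below are arbitrary illustrations, not taken from the dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e4a5b6c8",
"metadata": {},
"outputs": [],
"source": [
"# Worked example: one substitution out of three reference words -> WER of about 33%\n",
"example_wer = metric.compute(\n",
"    predictions=[\"hello big world\"],\n",
"    references=[\"hello wide world\"],\n",
")\n",
"print(f\"{100 * example_wer:.1f}%\")"
]
},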
{
"attachments": {},
"cell_type": "markdown",
"id": "a68f25e8",
"metadata": {},
"source": [
"## Load pre-trained model"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5d3b91c1",
"metadata": {},
"outputs": [],
"source": [
"model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "54d6828e",
"metadata": {},
"outputs": [],
"source": [
"model.config.max_length = MAX_LENGTH\n",
"model.generation_config.max_length = MAX_LENGTH"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "ad2db080",
"metadata": {},
"source": [
"Ensure language-appropriate tokens, if any, are set for generation. We set them on both the `config` and the `generation_config` to ensure they are used correctly during generation."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8bf2531a",
"metadata": {},
"outputs": [],
"source": [
"model.config.forced_decoder_ids = processor.tokenizer.get_decoder_prompt_ids(\n",
" language=LANGUAGE, task=TASK\n",
")\n",
"model.config.suppress_tokens = []\n",
"model.generation_config.forced_decoder_ids = processor.tokenizer.get_decoder_prompt_ids(\n",
" language=LANGUAGE, task=TASK\n",
")\n",
"model.generation_config.suppress_tokens = []"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "08888a86",
"metadata": {},
"source": [
"## Fine-tuning Whisper on the IPU"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "995852d7",
"metadata": {},
"source": [
"The model can be directly fine-tuned on the IPU using the `IPUSeq2SeqTrainer` class. \n",
"\n",
"The `IPUConfig` object specifies how the model will be pipelined across the IPUs. \n",
"\n",
"For fine-tuning, we place the encoder on two IPUs, and the decoder on two IPUs.\n",
"\n",
"For inference, the encoder is placed on one IPU, and the decoder on a different IPU."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1b2586d7",
"metadata": {},
"outputs": [],
"source": [
"replication_factor = n_ipu // 4\n",
"ipu_config = IPUConfig.from_dict(\n",
" {\n",
" \"optimizer_state_offchip\": True,\n",
" \"recompute_checkpoint_every_layer\": True,\n",
" \"enable_half_partials\": True,\n",
" \"executable_cache_dir\": executable_cache_dir,\n",
" \"gradient_accumulation_steps\": 16,\n",
" \"replication_factor\": replication_factor,\n",
" \"layers_per_ipu\": [5, 7, 5, 7],\n",
" \"matmul_proportion\": [0.2, 0.2, 0.6, 0.6],\n",
" \"projection_serialization_factor\": 5,\n",
" \"inference_replication_factor\": 1,\n",
" \"inference_layers_per_ipu\": [12, 12],\n",
" \"inference_parallelize_kwargs\": {\n",
" \"use_cache\": True,\n",
" \"use_encoder_output_buffer\": True,\n",
" \"on_device_generation_steps\": 16,\n",
" }\n",
" }\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "9634f147",
"metadata": {},
"source": [
"Lastly, we specify the arguments controlling the training process."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aaad9843",
"metadata": {},
"outputs": [],
"source": [
"total_steps = 1000 // replication_factor\n",
"training_args = IPUSeq2SeqTrainingArguments(\n",
" output_dir=\"./whisper-small-ipu-checkpoints\",\n",
" do_train=True,\n",
" do_eval=True,\n",
" predict_with_generate=True,\n",
" learning_rate=1e-5 * replication_factor,\n",
" warmup_steps=total_steps // 4,\n",
" evaluation_strategy=\"steps\",\n",
" eval_steps=total_steps,\n",
" max_steps=total_steps,\n",
" save_strategy=\"steps\",\n",
" save_steps=total_steps,\n",
" logging_steps=25,\n",
" dataloader_num_workers=16,\n",
" dataloader_drop_last=True,\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "3d64aa7d",
"metadata": {},
"source": [
"Then, we just need to pass all of this together with our datasets to the `IPUSeq2SeqTrainer` class:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "436ef768",
"metadata": {},
"outputs": [],
"source": [
"trainer = IPUSeq2SeqTrainer(\n",
" model=model,\n",
" ipu_config=ipu_config,\n",
" args=training_args,\n",
" train_dataset=train_dataset,\n",
" eval_dataset=eval_dataset,\n",
" data_collator=DataCollatorSpeechSeq2SeqWithLabelProcessing(processor),\n",
" compute_metrics=lambda x: compute_metrics(x, processor.tokenizer),\n",
" tokenizer=processor.feature_extractor,\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "b3a72cf6",
"metadata": {},
"source": [
"To gauge the improvement in WER, we run an evaluation step before fine-tuning."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "070c6d6c",
"metadata": {},
"outputs": [],
"source": [
"trainer.evaluate()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "9383944a",
"metadata": {},
"source": [
"All that remains is to fine-tune the model! The fine-tuning process should take between 6 and 18 minutes, depending on how many replicas are used, and achieve a final WER of around 10%."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4a88f9cb",
"metadata": {},
"outputs": [],
"source": [
"trainer.train()"
]
},
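{
"attachments": {},
"cell_type": "markdown",
"id": "f5b6c7d8",
"metadata": {},
"source": [
"Optionally, you can keep the fine-tuned weights for later use. The sketch below assumes the standard 🤗 Transformers `Trainer.save_model` behaviour applies to `IPUSeq2SeqTrainer`; the output directory name is arbitrary."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f5b6c7d9",
"metadata": {},
"outputs": [],
"source": [
"# Optional: save the fine-tuned model and processor locally (directory name is arbitrary)\n",
"save_dir = \"./whisper-small-finetuned\"\n",
"trainer.save_model(save_dir)\n",
"processor.save_pretrained(save_dir)"
]
},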
{
"attachments": {},
"cell_type": "markdown",
"id": "21ff7629",
"metadata": {},
"source": [
"## Conclusion\n",
"\n",
"In this notebook, we demonstrated how to fine-tune Whisper for multi-lingual speech recognition and transcription on the IPU. We used a single replica on a total of four IPUs. To reduce the fine-tuning time, more than one replica, hence more IPUs are required. On Paperspace, you can use either an IPU-POD16 or a Bow-POD16, both with 16 IPUs. Please contact Graphcore if you need assistance running on larger platforms.\n",
"\n",
"For all available notebooks, check [IPU-powered Jupyter Notebooks](https://www.graphcore.ai/ipu-jupyter-notebooks) to see how IPUs perform on other tasks.\n",
"\n",
"Have a question? Please contact us on our [Graphcore community channel](https://www.graphcore.ai/join-community).\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "sdk33",
"language": "python",
"name": "sdk33"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}