notebooks/whisper-example.ipynb

{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "fb7a2f86", "metadata": {}, "source": [ "# Speech Transcription on IPUs using Whisper - Inference\n", "\n", "This notebook demonstrates speech transcription on the IPU using the [Whisper implementation in the Hugging Face Transformers library](https://huggingface.co/spaces/openai/whisper) alongside [Optimum Graphcore](https://github.com/huggingface/optimum-graphcore).\n", "\n", "Whisper is a versatile speech recognition model that can transcribe speech as well as perform multi-lingual translation and recognition tasks.\n", "It was trained on diverse datasets to give human-level speech recognition performance without the need for fine-tuning. \n", "\n", "[🤗 Optimum Graphcore](https://github.com/huggingface/optimum-graphcore) is the interface between the [🤗 Transformers library](https://huggingface.co/docs/transformers/index) and [Graphcore IPUs](https://www.graphcore.ai/products/ipu).\n", "It provides a set of tools enabling model parallelization and loading on IPUs, training and fine-tuning on all the tasks already supported by 🤗 Transformers while being compatible with the 🤗 Hub and every model available on it out of the box.\n", "\n", "> **Hardware requirements:** The Whisper models `whisper-tiny`, `whisper-base` and `whisper-small` can run two replicas on the smallest IPU-POD4 machine. The most capable model, `whisper-large`, will need to use either an IPU-POD16 or a Bow Pod16 machine. Please contact Graphcore if you'd like assistance running model sizes that don't work in this simple example notebook.\n", "\n", "[![Join our Slack Community](https://img.shields.io/badge/Slack-Join%20Graphcore's%20Community-blue?style=flat-square&logo=slack)](https://www.graphcore.ai/join-community)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "44e8c3fc", "metadata": {}, "source": [ "## Environment setup\n", "\n", "The best way to run this demo is on Paperspace Gradient's cloud IPUs because everything is already set up for you.\n", "\n", "To run the demo using other IPU hardware, you need to have the Poplar SDK enabled. Refer to the [Getting Started guide](https://docs.graphcore.ai/en/latest/getting-started.html#getting-started) for your system for details on how to enable the Poplar SDK. Also refer to the [Jupyter Quick Start guide](https://docs.graphcore.ai/projects/jupyter-notebook-quick-start/en/latest/index.html) for how to set up Jupyter to be able to run this notebook on a remote IPU machine." ] }, { "attachments": {}, "cell_type": "markdown", "id": "4be57731", "metadata": {}, "source": [ "## Dependencies" ] }, { "attachments": {}, "cell_type": "markdown", "id": "e6c99e95", "metadata": {}, "source": [ "In order to improve usability and support for future users, Graphcore would like to collect information about the\n", "applications and code being run in this notebook. The following information will be anonymised before being sent to Graphcore:\n", "\n", "- User progression through the notebook\n", "- Notebook details: number of cells, code being run and the output of the cells\n", "- Environment details\n", "\n", "You can disable logging at any time by running `%unload_ext graphcore_cloud_tools.notebook_logging.gc_logger` from any cell." ] }, { "attachments": {}, "cell_type": "markdown", "id": "070d9b99", "metadata": {}, "source": [ "Install the dependencies the notebook needs." 
] }, { "cell_type": "code", "execution_count": null, "id": "fde99b10-e2d2-4787-877f-fb120e327ccb", "metadata": {}, "outputs": [], "source": [ "# Install optimum from source \n", "!pip install git+https://github.com/huggingface/optimum-graphcore.git@1f13c9279921bd064a0a857b044d9c18f7fbca13 \"tokenizers<0.13\" \"transformers==4.25.1\" \"soundfile\" \"librosa\" \"matplotlib\"\n", "%pip install \"graphcore-cloud-tools[logger] @ git+https://github.com/graphcore/graphcore-cloud-tools\"\n", "%load_ext graphcore_cloud_tools.notebook_logging.gc_logger" ] }, { "attachments": {}, "cell_type": "markdown", "id": "08888a86", "metadata": {}, "source": [ "## Running Whisper on the IPU\n", "\n", "We start by importing the required modules, some of which are needed to configure the IPU.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "ea6efd44", "metadata": {}, "outputs": [], "source": [ "# Generic imports\n", "import os\n", "from datasets import load_dataset\n", "import matplotlib.pyplot as plt\n", "import librosa\n", "import IPython\n", "import random\n", "\n", "\n", "# IPU-specific imports\n", "from optimum.graphcore import IPUConfig\n", "from optimum.graphcore.modeling_utils import to_pipelined\n", "from optimum.graphcore.models.whisper import WhisperProcessorTorch\n", "\n", "# HF-related imports\n", "from transformers import WhisperForConditionalGeneration" ] }, { "attachments": {}, "cell_type": "markdown", "id": "c7a7484f", "metadata": {}, "source": [ "This notebook demonstrates how to run all sizes of Whisper, assuming you meet the IPU hardware requirements:\n", "\n", "- `whisper-tiny`, `base` and `small` only requires 2 IPUs (IPU-POD4)\n", "- `whisper-medium` requires 4 IPUs (IPU-POD4)\n", "- `whisper-large` requires 8 IPUs (IPU-POD16 or a Bow Pod16)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "734d8d54", "metadata": {}, "source": [ "The Whisper model is available on Hugging Face in several sizes, from `whisper-tiny` with 39M parameters to `whisper-large` with 1550M parameters.\n", "\n", "The [Whisper architecture](https://openai.com/research/whisper) is an encoder-decoder Transformer, with the audio split into 30-second chunks.\n", "- For `whisper-tiny`, `small` and `base`, one IPU is used for the encoder part of the graph and another for the decoder part.\n", "- For `whisper-medium `, two IPUs are used to place the encoder part and two others for the decoder part.\n", "- For `whisper-large `, four IPUs are used to place the encoder part and four others for the decoder part.\n", "\n", "The `IPUConfig` object helps to configure the model to be pipelined across the IPUs.\n", "The number of transformer layers per IPU can be adjusted by using `layers_per_ipu`." 
] }, { "cell_type": "code", "execution_count": null, "id": "1cffea71", "metadata": {}, "outputs": [], "source": [ "num_available_ipus=int(os.getenv(\"NUM_AVAILABLE_IPU\", 4))\n", "cache_dir = os.getenv(\"POPLAR_EXECUTABLE_CACHE_DIR\", \"./exe_cache\") + \"/whisper\"\n", "\n", "default_ipu_config = IPUConfig(executable_cache_dir=cache_dir,\n", " ipus_per_replica=2)\n", "\n", "medium_ipu_config = IPUConfig(executable_cache_dir=cache_dir,\n", " ipus_per_replica=4,\n", " layers_per_ipu=[12, 12, 13, 11])\n", "\n", "large_ipu_config = IPUConfig(executable_cache_dir=cache_dir,\n", " ipus_per_replica=8,\n", " layers_per_ipu=[8, 8, 8, 8, 6, 9, 9, 8])\n", "\n", "configs = {\n", " \"tiny\": (\"openai/whisper-tiny.en\", \n", " default_ipu_config),\n", " \n", " \"base\": (\"openai/whisper-base.en\", \n", " default_ipu_config),\n", "\n", " \"small\": (\"openai/whisper-small.en\", \n", " default_ipu_config),\n", " \n", " \"medium\": (\"openai/whisper-medium.en\",\n", " medium_ipu_config),\n", "\n", " \"large\": (\"openai/whisper-large-v2\", \n", " large_ipu_config),\n", "}\n", "\n", "\n", "def select_whisper_config(size: str, custom_checkpoint: str):\n", " auto_sizes = {4: \"tiny\", 16: \"large\"}\n", " if size == \"auto\":\n", " selected_size = auto_sizes[num_available_ipus]\n", " elif size in configs.keys():\n", " if size == \"large\" and num_available_ipus < 8:\n", " raise ValueError(\"Error: You need at least 8 IPUs to run whisper-large \"\n", " f\"but your current environment has {num_available_ipus} IPUs available.\")\n", " selected_size = size\n", " else:\n", " raise ValueError(f\"{size} is not a valid size for Whisper\")\n", " \n", " model_checkpoint, ipu_config = configs[selected_size]\n", " if custom_checkpoint is not None:\n", " model_checkpoint = custom_checkpoint\n", "\n", " print(f\"Using whisper-{selected_size} config with the checkpoint '{model_checkpoint}'.\")\n", " return model_checkpoint, ipu_config " ] }, { "attachments": {}, "cell_type": "markdown", "id": "f0d22c55", "metadata": {}, "source": [ "Select the Whisper size bellow, try `\"tiny\"`,`\"base\"`, `\"small\"`, `\"medium\"`, `\"large\"` or let the `\"auto\"` mode choose for you." ] }, { "cell_type": "code", "execution_count": null, "id": "f919498b", "metadata": {}, "outputs": [], "source": [ "model_checkpoint, ipu_config = select_whisper_config(\"auto\", custom_checkpoint=None) " ] }, { "attachments": {}, "cell_type": "markdown", "id": "eff8bcf6", "metadata": {}, "source": [ "You can also use a custom checkpoint from Hugging Face Hub using the argument `custom_checkpoint` above. In this case, you have to make sure that `size` matches the checkpoint model size." 
] }, { "cell_type": "code", "execution_count": null, "id": "8c5d72f3-cbd6-462f-9741-1726d412c4eb", "metadata": {}, "outputs": [], "source": [ "# Instantiate processor and model\n", "processor = WhisperProcessorTorch.from_pretrained(model_checkpoint)\n", "model = WhisperForConditionalGeneration.from_pretrained(model_checkpoint)\n", "\n", "# Adapt whisper-tiny to run on the IPU\n", "\n", "pipelined_model = to_pipelined(model, ipu_config)\n", "pipelined_model = pipelined_model.parallelize(\n", " for_generation=True, \n", " use_cache=True, \n", " batch_size=1, \n", " max_length=448,\n", " on_device_generation_steps=16, \n", " use_encoder_output_buffer=True).half()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "9e99620b", "metadata": {}, "source": [ "Now we can load the dataset and process an example audio file.\n", "If precompiled models are not available, then the first run of the model triggers two graph compilations.\n", "This means that our first test transcription could take a minute or two to run, but subsequent runs will be much faster." ] }, { "cell_type": "code", "execution_count": null, "id": "8ab692b6", "metadata": {}, "outputs": [], "source": [ "# load the dataset and read an example sound file\n", "ds = load_dataset(\"hf-internal-testing/librispeech_asr_dummy\", \"clean\", split=\"validation\")\n", "test_sample = ds[2]\n", "sample_rate = test_sample['audio']['sampling_rate']\n", "\n", "def transcribe(data, rate):\n", " input_features = processor(data, return_tensors=\"pt\", sampling_rate=rate).input_features.half()\n", "\n", " # This triggers a compilation, unless a precompiled model is available.\n", " sample_output = pipelined_model.generate(\n", " input_features,\n", " use_cache=True,\n", " do_sample=False,\n", " max_length=448, \n", " min_length=3)\n", " transcription = processor.batch_decode(sample_output, skip_special_tokens=True)[0]\n", " return transcription\n", "\n", "test_transcription = transcribe(test_sample[\"audio\"][\"array\"], sample_rate)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "aa59411d", "metadata": {}, "source": [ "In the next cell, we compare the expected text from the dataset with the transcribed result from the model.\n", "There will typically be some small differences, but even `whisper-tiny` does a great job! It even adds punctuation.\n", "\n", "You can listen to the audio and compare the model result yourself using the controls below." ] }, { "cell_type": "code", "execution_count": null, "id": "17947b7c", "metadata": {}, "outputs": [], "source": [ "print(f\"Expected: {test_sample['text']}\\n\")\n", "print(f\"Transcribed: {test_transcription}\")\n", "\n", "plt.figure(figsize=(14, 5))\n", "librosa.display.waveshow(test_sample[\"audio\"][\"array\"], sr=sample_rate)\n", "IPython.display.Audio(test_sample[\"audio\"][\"array\"], rate=sample_rate)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "217f7821-1ddb-425e-995e-a9f084c7ff0b", "metadata": {}, "source": [ "The model only needs to be compiled once. Subsequent inferences will be much faster.\n", "In the cell below, we repeat the exercise but with a random example from the dataset.\n", "\n", "You might like to re-run this next cell multiple times to get different comparisons." 
] }, { "cell_type": "code", "execution_count": null, "id": "8c8e9ca3-a932-4e66-97c7-8ffe98d00bf1", "metadata": {}, "outputs": [], "source": [ "idx = random.randint(0, ds.num_rows - 1)\n", "data = ds[idx][\"audio\"][\"array\"]\n", "\n", "print(f\"Example #{idx}\\n\")\n", "print(f\"Expected: {ds[idx]['text']}\\n\")\n", "print(f\"Transcribed: {transcribe(data, sample_rate)}\")\n", "\n", "plt.figure(figsize=(14, 5))\n", "librosa.display.waveshow(data, sr=sample_rate)\n", "IPython.display.Audio(data, rate=sample_rate, autoplay=True)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "3625a6bf", "metadata": {}, "source": [ "Finally, we detach the process from the IPUs when we are done to make the IPUs available to other users." ] }, { "cell_type": "code", "execution_count": null, "id": "24c1175a", "metadata": {}, "outputs": [], "source": [ "pipelined_model.detachFromDevice()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "08b4e580-703e-4329-9dba-808a3a1096c8", "metadata": {}, "source": [ "## Next Steps\n", "\n", "The `whisper-tiny` model used here is very fast for inference and so cheap to run, but its accuracy can be improved.\n", "The `whisper-base`, `whisper-small` and `whisper-medium` models have 74M, 244M and 769 M parameters respectively (compared to just 39M for `whisper-tiny`). You can try out `whisper-base`, `whisper-small` and `whisper-medium` by changing `select_whisper_config(\"auto\")` (at the beginning of this notebook) to:\n", "- `select_whisper_config(\"base\")`\n", "- `select_whisper_config(\"small\")`\n", "- `select_whisper_config(\"medium\")` respectively.\n", "\n", "Larger models and multilingual models are also available.\n", "To access the multilingual models, remove the `.en` from the checkpoint name. Note however that the multilingual models are slightly less accurate for this English transcription task but they can be used for transcribing other languages or for translating to English.\n", "\n", "The largest model `whisper-large` has 1550M parameters and requires a 8-IPUs pipeline.\n", "You can try it by setting `select_whisper_config(\"large\")`\n", "To run it you will need more than the IPU-POD4. On Paperspace, this is available using either an IPU-POD16 or a Bow Pod16 machine. Please contact Graphcore if you need assistance running these larger models.\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "21ff7629", "metadata": {}, "source": [ "## Conclusion\n", "\n", "In this notebook, we demonstrated using Whisper for speech recognition and transcription on the IPU.\n", "We used the Optimum Graphcore package to interface between the IPU and the Hugging Face Transformers library. This meant that only a few lines of code were needed to get this state-of-the-art automated speech recognition model running on IPUs." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 5 }