notebooks/external_model.ipynb

{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Train an External Language Model on IPUs\n", "\n", "In this notebook, we'll see how to train a model on a language modelling task when that model is not supported by [🤗 Optimum Graphcore](https://github.com/huggingface/optimum-graphcore) or by [🤗 Transformers](https://github.com/huggingface/transformers) .\n", "\n", "We will see how to easily load and preprocess the dataset for each of the tasks, and how to use the `IPUTrainer` API to train a model on it.\n", "\n", "This notebook assumes you have trained a tokenizer on the corpus you are using. Refer to the [How to train a tokenizer](https://github.com/huggingface/notebooks/blob/master/examples/tokenizer_training.ipynb) notebook for more details." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "| Domain | Tasks | Model | Datasets | Workflow | Number of IPUs | Execution time |\n", "|---------|-------|-------|----------|----------|--------------|--------------|\n", "| Natural language processing | Text generation / Causal language model (predicting the next token) | GPT2 | Wikitext-2| Training | 4 or 16 | ~40 min on POD4, ~20 min on POD16 |" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "[![Join our Slack Community](https://img.shields.io/badge/Slack-Join%20Graphcore's%20Community-blue?style=flat-square&logo=slack)](https://www.graphcore.ai/join-community)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Environment setup\n", "\n", "The best way to run this demo is on Paperspace Gradient's cloud IPUs because everything is already set up for you.\n", "\n", "[![Run on Gradient](https://assets.paperspace.io/img/gradient-badge.svg)](https://ipu.dev/3xwTmHM)\n", "\n", "To run the demo using other IPU hardware, you need to have the Poplar SDK enabled. Refer to the [Getting Started guide](https://docs.graphcore.ai/en/latest/getting-started.html#getting-started) for your system for details on how to enable the Poplar SDK. Also refer to the [Jupyter Quick Start guide](https://docs.graphcore.ai/projects/jupyter-notebook-quick-start/en/latest/index.html) for how to set up Jupyter to be able to run this notebook on a remote IPU machine." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Dependencies and configuration\n", "\n", "In order to improve usability and support for future users, Graphcore would like to collect information about the\n", "applications and code being run in this notebook. The following information will be anonymised before being sent to Graphcore:\n", "\n", "- User progression through the notebook\n", "- Notebook details: number of cells, code being run and the output of the cells\n", "- Environment details\n", "\n", "You can disable logging at any time by running `%unload_ext graphcore_cloud_tools.notebook_logging.gc_logger` from any cell." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Install the dependencies for this notebook." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "%pip install \"optimum-graphcore==0.7\"\n", "%pip install graphcore-cloud-tools[logger]@git+https://github.com/graphcore/graphcore-cloud-tools\n", "%load_ext graphcore_cloud_tools.notebook_logging.gc_logger" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The cache directories can be configured through environment variables or directly in the notebook:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "import os\n", "\n", "executable_cache_dir = os.getenv(\"POPLAR_EXECUTABLE_CACHE_DIR\", \"/tmp/exe_cache/\") + \"/external_model\"" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "1r_n9OWV3l-Q" }, "source": [ "## Preparing the dataset" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "kswRMhPc3l-Q" }, "source": [ "For each of the tasks, we will use the Wikitext-2 dataset as an example. You can load it very easily with the [🤗 Datasets library](https://huggingface.co/docs/datasets/index)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "n2ZRs1cL3l-R", "outputId": "11151c56-be90-4d11-e7df-db85e745ca5c", "tags": [] }, "outputs": [], "source": [ "from datasets import load_dataset\n", "datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "JEA1ju653l-p" }, "source": [ "## Causal language modelling" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "5io6fY_d3l-u" }, "source": [ "To tokenize all our text with the same vocabulary that was used when training the model, we could download a pre-trained tokenizer. Even though we plan to define our own model, here we borrow GPT2's tokenizer. This is all done with the `AutoTokenizer` class:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "iAYlS40Z3l-v", "tags": [] }, "outputs": [], "source": [ "from transformers import AutoTokenizer\n", " \n", "tokenizer_checkpoint = \"gpt2\"\n", "tokenizer = AutoTokenizer.from_pretrained(tokenizer_checkpoint)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "rpOiBrJ13l-y" }, "source": [ "We can now call the tokenizer on all our text. This is very simple with the [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) method from the Datasets library. First we define a function that calls the tokenizer on our text:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "lS2m25YM3l-z", "tags": [] }, "outputs": [], "source": [ "def tokenize_function(examples):\n", " return tokenizer(examples[\"text\"])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "M9xVAa3s3l-2" }, "source": [ "We can now apply `tokenize_function` to all the splits in our `datasets` object." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "NVAO0H8u3l-3", "outputId": "30d88b8a-e353-4e13-f709-8e5e06ef747b", "scrolled": true, "tags": [] }, "outputs": [], "source": [ "tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=[\"text\"])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "obvgcXda3l--" }, "source": [ "We next set the maximum length our model was pre-trained with." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "DVHs5aCA3l-_", "tags": [] }, "outputs": [], "source": [ "block_size = 128" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "RpNfGiMw3l_A" }, "source": [ "Then we write the preprocessing function that will group our text:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "iaAJy5Hu3l_B", "tags": [] }, "outputs": [], "source": [ "def group_texts(examples):\n", " # Concatenate all text.\n", " concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}\n", " total_length = len(concatenated_examples[list(examples.keys())[0]])\n", " # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can\n", " # customize this part to your needs.\n", " total_length = (total_length // block_size) * block_size\n", " # Split by chunks of max_len.\n", " result = {\n", " k: [t[i : i + block_size] for i in range(0, total_length, block_size)]\n", " for k, t in concatenated_examples.items()\n", " }\n", " result[\"labels\"] = result[\"input_ids\"].copy()\n", " return result" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "LGJWXtNv3l_C" }, "source": [ "Now we apply the `group_texts` function to all the splits in our `datasets` object." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "gXUSfBrq3l_C", "outputId": "34e55885-3d8f-4f05-cbdb-706ce56a25f8", "scrolled": true, "tags": [] }, "outputs": [], "source": [ "lm_datasets = tokenized_datasets.map(\n", " group_texts,\n", " batched=True,\n", " batch_size=1000,\n", " num_proc=4,\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Let's define a customized model, which is just a simple implementation GPT2. Note that there is nothing IPU-specific or 🤗 Transformers-related in this model." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "import torch\n", "import torch.nn as nn\n", "import torch.nn.functional as F\n", "from torch.nn import TransformerEncoder, TransformerEncoderLayer\n", "\n", "class TransformerModel(nn.Module):\n", "\n", " def __init__(self, block_size, vocab_size, d_model, nhead, dim_feedforward, nlayers, dropout=0.1, embd_pdrop=0.1):\n", " super(TransformerModel, self).__init__()\n", " self.block_size = block_size\n", " self.word_embeddings = nn.Embedding(vocab_size, d_model)\n", " self.position_embeddings = nn.Embedding(block_size, d_model)\n", " self.drop = nn.Dropout(embd_pdrop)\n", " encoder_layer = TransformerEncoderLayer(d_model, nhead, dim_feedforward, dropout, batch_first=True)\n", " self.transformer_encoder = TransformerEncoder(encoder_layer, nlayers)\n", " self.lm_head = nn.Linear(d_model, vocab_size)\n", "\n", " self.init_weights()\n", " self.tie_weights(self.lm_head, self.word_embeddings)\n", "\n", "\n", " def tie_weights(self, output_embeddings, input_embeddings):\n", " output_embeddings.weight = input_embeddings.weight\n", " output_embeddings.bias.data = nn.functional.pad(\n", " output_embeddings.bias.data,\n", " (\n", " 0,\n", " output_embeddings.weight.shape[0] - output_embeddings.bias.shape[0],\n", " ),\n", " \"constant\",\n", " 0,\n", " )\n", " output_embeddings.out_features = input_embeddings.num_embeddings\n", "\n", " def _generate_square_subsequent_mask(self, sz):\n", " mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)\n", " mask = mask.half().masked_fill(mask == 0, -10000.0).masked_fill(mask == 1, float(0.0))\n", " return mask\n", "\n", " def init_weights(self):\n", " initrange = 0.1\n", " nn.init.uniform_(self.word_embeddings.weight, -initrange, initrange)\n", " nn.init.uniform_(self.position_embeddings.weight, -initrange, initrange)\n", "\n", " def forward(self, input_ids, attention_mask=None, labels=None):\n", " device = input_ids.device\n", " input_shape = input_ids.size()\n", "\n", " mask = self._generate_square_subsequent_mask(self.block_size).to(device)\n", "\n", " inputs_embeds = self.word_embeddings(input_ids)\n", " position_ids = torch.arange(0, input_shape[-1], dtype=torch.long, device=device)\n", " position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1])\n", " position_embeds = self.position_embeddings(position_ids)\n", " hidden_states = inputs_embeds + position_embeds\n", " hidden_states = self.drop(hidden_states)\n", "\n", " hidden_states = self.transformer_encoder(hidden_states, mask)\n", " lm_logits = self.lm_head(hidden_states)\n", "\n", " return lm_logits" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We then subclass the model to inherit from `PipelineMixin`, so that the model will have the `parallelize` and `deparallelize` methods. Here we override the `parallelize` method to customize the optimization. Note that if the model is small and no customized optimization is needed for the model, there is no need to override `parallelize`. The optimizations we apply here and later are just for demonstration, so some of them are actually not necessary for such a relatively small model with `block_size` set to 128.\n", "\n", "Another change we do here is to override the `forward` method. This is because an external model usually just returns logits, but we need to respect the return format of 🤗 Transformers." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "import poptorch\n", "from optimum.graphcore.modeling_utils import PipelineMixin, get_layer_ipu, recomputation_checkpoint, register\n", "from optimum.utils import logging\n", "logger = logging.get_logger(__name__)\n", "\n", "\n", "class IPUTransformerModel(TransformerModel, PipelineMixin):\n", " def parallelize(self):\n", " super().parallelize()\n", " logger.info(\"---------- Device Allocation -----------\")\n", " logger.info(\"Embedding --> IPU 0\")\n", " self.word_embeddings = poptorch.BeginBlock(self.word_embeddings, \"word_embeddings\", ipu_id=0)\n", " self.position_embeddings = poptorch.BeginBlock(self.position_embeddings, \"position_embeddings\", ipu_id=0)\n", "\n", " layer_ipu = get_layer_ipu(self.ipu_config, self.transformer_encoder.layers)\n", " for index, layer in enumerate(self.transformer_encoder.layers):\n", " if self.ipu_config.recompute_checkpoint_every_layer:\n", " # Put checkpoints on every encoder layer\n", " h = recomputation_checkpoint(layer)\n", " self._hooks.append(h)\n", " ipu = layer_ipu[index]\n", " logger.info(f\"Encoder {index:<2} --> IPU {ipu}\")\n", " self.transformer_encoder.layers[index] = poptorch.BeginBlock(layer, f\"Encoder{index}\", ipu_id=ipu)\n", "\n", " logger.info(f\"Head --> IPU 0\")\n", " logger.info(\"---------------------------------------\")\n", " self.lm_head = poptorch.BeginBlock(self.lm_head, \"lm_head\", ipu_id=0)\n", " return self\n", "\n", " def forward(self, input_ids, attention_mask=None, labels=None):\n", " lm_logits = super().forward(input_ids, attention_mask=attention_mask, labels=labels)\n", "\n", " loss = None\n", " if labels is not None:\n", " # Shift so that tokens < n predict n. Use roll() + ignore_index instead of slicing for better efficiency on IPUs.\n", " labels = torch.roll(labels, -1, 1)\n", " # By default the ignore_index of CrossEntropyLoss is -100\n", " labels[:, -1] = -100\n", " loss_fct = nn.CrossEntropyLoss()\n", " loss = loss_fct(lm_logits.view(-1, lm_logits.size(-1)), labels.view(-1))\n", "\n", " output = (lm_logits,)\n", " return (loss,) if loss is not None else output" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We are now ready to instantiate the model." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "model = IPUTransformerModel(\n", " block_size=block_size,\n", " vocab_size=tokenizer.vocab_size,\n", " d_model=768,\n", " nhead=12,\n", " dim_feedforward=768 * 4,\n", " nlayers=12,\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "iEmeQ7Xm3l_H" }, "source": [ "To instantiate the `IPUTrainer` class, we first define `IPUConfig`, which is a class that specifies attributes and configuration parameters to compile and put the model on the device. We usually initialize `IPUConfig` with one config name or a path to a JSON file. 
, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The other thing we need to define is `IPUTrainingArguments`, a class that contains all the attributes used to customize the training. It requires one folder name, which will be used to save the checkpoints of the model. All other arguments are optional:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "YbSwEhQ63l_L", "tags": [] }, "outputs": [], "source": [ "micro_batch_size = 1\n", "gradient_accumulation_steps = 64\n", "\n", "training_args = IPUTrainingArguments(\n", "    \"mymodel-wikitext2\",\n", "    learning_rate=2e-5,\n", "    weight_decay=0.01,\n", "    per_device_train_batch_size=micro_batch_size,\n", "    per_device_eval_batch_size=micro_batch_size,\n", "    gradient_accumulation_steps=gradient_accumulation_steps,\n", "    n_ipu=4,\n", "    num_train_epochs=10,\n", "    loss_scaling=16384,\n", "    warmup_ratio=0.1,\n", "    dataloader_drop_last=True,\n", "    dataloader_num_workers=64,\n", "    logging_steps=10,\n", ")" ] }
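, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The number of samples consumed per weight update (the global batch size) follows from these settings. Below is a small illustrative computation of our own; it assumes that the `gradient_accumulation_steps=64` passed to `IPUTrainingArguments` takes precedence over the value in the `IPUConfig` dict, and that `n_ipu=4` with `ipus_per_replica=4` gives a single replica:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "replicas = 4 // ipu_config_dict[\"ipus_per_replica\"]  # n_ipu // ipus_per_replica\n", "global_batch_size = micro_batch_size * gradient_accumulation_steps * ipu_config_dict[\"device_iterations\"] * replicas\n", "print(global_batch_size)" ] }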
, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "sZRbT9ui3l_N" }, "source": [ "Finally, we pass all of these to the `IPUTrainer` class:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "OEuqwIra3l_N", "scrolled": true, "tags": [] }, "outputs": [], "source": [ "trainer = IPUTrainer(\n", "    model=model,\n", "    ipu_config=ipu_config,\n", "    args=training_args,\n", "    train_dataset=lm_datasets[\"train\"],\n", "    eval_dataset=lm_datasets[\"validation\"],\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "6Vvz34Td3l_O" }, "source": [ "And we can train our model:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "NyZvu_MF3l_P", "outputId": "b69d0931-7f1f-4f2d-fdb8-09d37c7418bb", "tags": [] }, "outputs": [], "source": [ "trainer.train()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "3APq-vUc3l_R" }, "source": [ "Once the training is completed, we can evaluate our model and get its perplexity on the validation set like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "diKZnB1I3l_R", "outputId": "9b3ac725-0117-4830-f380-a555ee57c8cf", "tags": [] }, "outputs": [], "source": [ "import math\n", "\n", "eval_results = trainer.evaluate()\n", "print(f\"Perplexity: {math.exp(eval_results['eval_loss']):.2f}\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The perplexity is still quite high because we only trained on a small dataset for a small number of epochs. For real language model training, you would need a larger dataset and more epochs.\n", "\n", "If you want to resume training from a checkpoint, you can do so like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "trainer.train(resume_from_checkpoint='mymodel-wikitext2/checkpoint-500')" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Next steps\n", "\n", "Check out the full list of [IPU-powered Jupyter Notebooks](https://www.graphcore.ai/ipu-jupyter-notebooks) to get more of a feel for how IPUs perform on other tasks." ] } ], "metadata": { "colab": { "name": "Train a language model", "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "vscode": { "interpreter": { "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" } } }, "nbformat": 4, "nbformat_minor": 4 }