notebooks/language_modelling_from_scratch.ipynb (1,102 lines of code) (raw):

{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "a3KD3WXU3l-O" }, "source": [ "# Language Modelling on IPUs - Training" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "JAscNNUD3l-P" }, "source": [ "In this notebook, we'll see how to train a [🤗 Transformers](https://github.com/huggingface/transformers) model on a language modelling task. We will cover two types of language modelling tasks:\n", "\n", "- Causal language modelling: The model has to predict the next token in the sentence (so the labels are the same as the inputs, shifted by one position). To make sure the model does not cheat, it gets an attention mask that will prevent it from accessing the tokens after token `i` when trying to predict token `i+1` in the sentence.\n", "\n", "![Widget inference representing the causal language modelling task](images/causal_language_modeling.png)\n", "\n", "- Masked language modelling: The model has to predict some tokens that are masked in the input. It still has access to the whole sentence, so it can use the tokens before and after the tokens that have been masked to predict their value.\n", "\n", "![Widget inference representing the masked language modelling task](images/masked_language_modeling.png)\n", "\n", "We will see how to easily load and preprocess the dataset for each of these tasks, and how to use the `IPUTrainer` API to train a model on it.\n", "\n", "This notebook assumes you have trained a tokenizer on the corpus you are using (see the [How to train a tokenizer](https://github.com/huggingface/notebooks/blob/master/examples/tokenizer_training.ipynb) notebook for details)."
] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "| Domain | Tasks | Model | Datasets | Workflow | Number of IPUs | Execution time |\n", "|---------|-------|-------|----------|----------|--------------|--------------|\n", "| Natural language processing | Causal language modelling and Masked language modelling | gpt2 and bert-base-cased | Wikitext 2 | Training | 4 or 16 | 28 min on POD4, 15 min on POD16 |" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "[![Join our Slack Community](https://img.shields.io/badge/Slack-Join%20Graphcore's%20Community-blue?style=flat-square&logo=slack)](https://www.graphcore.ai/join-community)\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Environment setup\n", "\n", "The best way to run this demo is on Paperspace Gradient's cloud IPUs because everything is already set up for you.\n", "\n", "[![Run on Gradient](https://assets.paperspace.io/img/gradient-badge.svg)](https://ipu.dev/414XiNp)\n", "\n", "To run the demo using other IPU hardware, you need to have the Poplar SDK enabled. Refer to the [Getting Started guide](https://docs.graphcore.ai/en/latest/getting-started.html#getting-started) for your system for details on how to enable the Poplar SDK. Also refer to the [Jupyter Quick Start guide](https://docs.graphcore.ai/projects/jupyter-notebook-quick-start/en/latest/index.html) for how to set up Jupyter to be able to run this notebook on a remote IPU machine." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Dependencies and configuration\n", "\n", "In order to improve usability and support for future users, Graphcore would like to collect information about the\n", "applications and code being run in this notebook. 
The following information will be anonymised before being sent to Graphcore:\n", "\n", "- User progression through the notebook\n", "- Notebook details: number of cells, code being run and the output of the cells\n", "- Environment details\n", "\n", "You can disable logging at any time by running `%unload_ext graphcore_cloud_tools.notebook_logging.gc_logger` from any cell." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Install the dependencies for this notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install \"optimum-graphcore==0.7\"\n", "%pip install graphcore-cloud-tools[logger]@git+https://github.com/graphcore/graphcore-cloud-tools\n", "%load_ext graphcore_cloud_tools.notebook_logging.gc_logger" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The cache directories can be configured through environment variables or directly in the notebook:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "import os\n", "\n", "executable_cache_dir = os.getenv(\"POPLAR_EXECUTABLE_CACHE_DIR\", \"/tmp/exe_cache/\") + \"/language_modelling_from_scratch\"" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Sharing your model with the community" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "You can share your model with the 🤗 community. You do this by completing the following steps:\n", "\n", "1. Store your authentication token from the 🤗 website. [Sign up to 🤗](https://huggingface.co/join) if you haven't already.\n", "2. 
Execute the following cell and enter your authentication token:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from huggingface_hub import notebook_login\n", "\n", "notebook_login()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Then you need to install Git-LFS to manage large files:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!apt install git-lfs" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "1r_n9OWV3l-Q" }, "source": [ "## Preparing the dataset" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "kswRMhPc3l-Q" }, "source": [ "For each of the tasks, we will use the Wikitext 2 dataset as an example. You can load it easily with the 🤗 Datasets library." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "n2ZRs1cL3l-R", "outputId": "11151c56-be90-4d11-e7df-db85e745ca5c" }, "outputs": [], "source": [ "from datasets import load_dataset\n", "datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "f1-9jepM3l-W" }, "source": [ "You can replace the dataset above with any dataset hosted on [🤗 Datasets](https://huggingface.co/datasets).\n", "\n", "You can also use your own data. Just uncomment the following cell and replace the paths shown with the paths to your files:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "uxSaGa_l3l-W" }, "outputs": [], "source": [ "# datasets = load_dataset(\"text\", data_files={\"train\": \"path_to_train.txt\", \"validation\": \"path_to_validation.txt\"})" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "jY1SwIrY3l-a" }, "source": [ "You can also load datasets from a CSV or a JSON file. 
See the Datasets documentation for [loading datasets from local files](https://huggingface.co/docs/datasets/loading_datasets.html#from-local-files) for more information." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "u3EtYfeHIrIz" }, "source": [ "To access an actual element, you need to select a split (\"train\" in the example) and specify an index:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "X6HrpprwIrIz", "outputId": "d7670bc0-42e4-4c09-8a6a-5c018ded7d95" }, "outputs": [], "source": [ "datasets[\"train\"][10]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "WHUmphG3IrI3" }, "source": [ "We want to get a sense of what the data looks like, so we define the `show_random_elements` function to display some examples picked randomly from the dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ur5sNUcZ3l-g" }, "outputs": [], "source": [ "from datasets import ClassLabel\n", "import random\n", "import pandas as pd\n", "from IPython.display import display, HTML\n", "\n", "def show_random_elements(dataset, num_examples=10):\n", "    assert num_examples <= len(dataset), \"Can't pick more elements than there are in the dataset.\"\n", "    picks = []\n", "    for _ in range(num_examples):\n", "        pick = random.randint(0, len(dataset)-1)\n", "        while pick in picks:\n", "            pick = random.randint(0, len(dataset)-1)\n", "        picks.append(pick)\n", "\n", "    df = pd.DataFrame(dataset[picks])\n", "    for column, typ in dataset.features.items():\n", "        if isinstance(typ, ClassLabel):\n", "            df[column] = df[column].transform(lambda i: typ.names[i])\n", "    display(HTML(df.to_html()))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "1Uk8NROQ3l-k", "outputId": "a822dcec-51e3-4dba-c73c-dba9e0301726" }, "outputs": [], "source": [ "show_random_elements(datasets[\"train\"])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "CKerdF353l-o" }, "source": [ "As 
we can see, some of the text samples are a full paragraph of a Wikipedia article while others are just titles or empty lines." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "JEA1ju653l-p" }, "source": [ "## Causal language modelling" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "v5GTGKZS3l-q" }, "source": [ "For causal language modelling (CLM) we are going to tokenize all the text samples in our dataset and concatenate the results. Then we will split the concatenated sequence into samples of a fixed sequence length. This means that the model will receive chunks of contiguous text that may look like:\n", "```\n", "part of text 1\n", "```\n", "or \n", "```\n", "end of text 1 [BOS_TOKEN] beginning of text 2\n", "```\n", "depending on whether a sample spans several of the original text samples in the dataset or not. The labels will be the same as the inputs, shifted to the left.\n", "\n", "We will use the [`gpt2`](https://huggingface.co/gpt2) architecture for this example. You can pick any of the [🤗 models for causal language modelling](https://huggingface.co/models?filter=causal-lm) as long as that model is supported by Optimum Graphcore. The IPU config files of the supported models are available in Graphcore's [🤗 account](https://huggingface.co/Graphcore). You can also create your own IPU config file locally. For the tokenizer, you can replace the checkpoint with the one you trained yourself.\n", "\n", "In this notebook, we are using both data parallelism and pipeline parallelism (see the [tutorial on efficient data loading](https://github.com/graphcore/examples/tree/master/tutorials/tutorials/pytorch/efficient_data_loading) for more information). 
Therefore the global batch size, which is the actual number of samples used for the weight update, is determined from three factors:\n", "- global batch size = micro batch size * gradient accumulation steps * replication factor\n", "\n", "The replication factor is determined by the type of IPU Pod used, which will be used as a key to select the replication factor from a dictionary defined in the IPU config file. For example, the dictionary in the IPU config file [Graphcore/gpt2-small-ipu](https://huggingface.co/Graphcore/gpt2-small-ipu/blob/main/ipu_config.json) looks like this:\n", "- \"replication_factor\": {\"pod4\": 1, \"pod8\": 2, \"pod16\": 4, \"pod32\": 8, \"pod64\": 16, \"default\": 1}\n", "\n", "Depending on your model and the IPU Pod you are using, you might need to adjust these three batch-size-related arguments." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "-WGBCO343l-q" }, "outputs": [], "source": [ "model_checkpoint = \"gpt2\"\n", "tokenizer_checkpoint = \"sgugger/gpt2-like-tokenizer\"\n", "\n", "ipu_config_name = \"Graphcore/gpt2-small-ipu\"\n", "micro_batch_size = 1\n", "gradient_accumulation_steps = 64\n", "dataloader_workers = 64" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "5io6fY_d3l-u" }, "source": [ "To tokenize all our text samples with the same vocabulary that was used when training the model, we have to download a pre-trained tokenizer. This is all done by the `AutoTokenizer` class:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "iAYlS40Z3l-v" }, "outputs": [], "source": [ "from transformers import AutoTokenizer\n", " \n", "tokenizer = AutoTokenizer.from_pretrained(tokenizer_checkpoint)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "rpOiBrJ13l-y" }, "source": [ "We can now call the tokenizer on all our text samples. 
This is very simple, using the [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) method from the Datasets library. First we define a function that calls the tokenizer on our texts:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "lS2m25YM3l-z" }, "outputs": [], "source": [ "def tokenize_function(examples):\n", " return tokenizer(examples[\"text\"])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "M9xVAa3s3l-2" }, "source": [ "Then we apply it to all the splits in our `datasets` object, using `batched=True` and 4 processes to speed up the preprocessing. We won't need the `text` column afterward, so we discard it." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "NVAO0H8u3l-3", "outputId": "30d88b8a-e353-4e13-f709-8e5e06ef747b", "scrolled": true }, "outputs": [], "source": [ "tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=[\"text\"])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "8qik3J_C3l-7" }, "source": [ "If we now look at an element of our datasets, we will see the text has been replaced with `input_ids` that the model will need:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "nYv_mcKk3l-7", "outputId": "8334734c-0f86-4e18-ec17-4216a2d5dd18" }, "outputs": [], "source": [ "tokenized_datasets[\"train\"][1]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "obvgcXda3l--" }, "source": [ "Now we need to concatenate all our text samples together then split the result into small chunks of a certain block size (`block_size`). To do this, we will use the `map` method again, with the option `batched=True`. This option lets us change the number of samples in the datasets by returning a different number of samples than we originally had. 
This means that we can create a new set of samples from an existing set of samples.\n", "\n", "We can read the maximum length our model was pre-trained with (available as `tokenizer.model_max_length`), but since that value might be too large to fit in IPU memory, we set it to 128." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "DVHs5aCA3l-_" }, "outputs": [], "source": [ "# block_size = tokenizer.model_max_length\n", "block_size = 128" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "RpNfGiMw3l_A" }, "source": [ "Then we write the preprocessing function that will group our text samples:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "iaAJy5Hu3l_B" }, "outputs": [], "source": [ "def group_texts(examples):\n", "    # Concatenate all texts.\n", "    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}\n", "    total_length = len(concatenated_examples[list(examples.keys())[0]])\n", "    # We drop the small remainder. We could pad instead if the model supported it;\n", "    # customize this part to your needs.\n", "    total_length = (total_length // block_size) * block_size\n", "    # Split into chunks of block_size.\n", "    result = {\n", "        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]\n", "        for k, t in concatenated_examples.items()\n", "    }\n", "    result[\"labels\"] = result[\"input_ids\"].copy()\n", "    return result" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "LGJWXtNv3l_C" }, "source": [ "Note that we duplicate the inputs for our labels. This is because the 🤗 Transformers models shift the labels by one position internally, so we don't need to do it manually.\n", "\n", "Also note that by default, the `map` method will send a batch of 1,000 examples to be treated by the preprocessing function. So here, we will drop the remainder to make the concatenated tokenized text samples a multiple of `block_size` every 1,000 examples. 
You can adjust this behaviour by passing a larger batch size (which will also take longer to be processed). You can also speed up the preprocessing by using multiprocessing:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "gXUSfBrq3l_C", "outputId": "34e55885-3d8f-4f05-cbdb-706ce56a25f8", "scrolled": true }, "outputs": [], "source": [ "lm_datasets = tokenized_datasets.map(\n", " group_texts,\n", " batched=True,\n", " batch_size=1000,\n", " num_proc=4,\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "6n84V8Gc3l_G" }, "source": [ "We can check our datasets have changed. Now the samples contain chunks of `block_size` contiguous tokens, potentially spanning over several of our original text samples." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "hTeGCLl_3l_G", "outputId": "ab381a07-f92e-4b14-f7b6-e4af5513d7c4" }, "outputs": [], "source": [ "tokenizer.decode(lm_datasets[\"train\"][1][\"input_ids\"])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "iEmeQ7Xm3l_H" }, "source": [ "To instantiate `IPUTrainer`, we will need to define:\n", "* `IPUConfig`, which is a class that specifies attributes and configuration parameters to compile and put the model on the device.\n", "* A model.\n", "* `IPUTrainingArguments`, which is a class that contains all the attributes to customize the training.\n", "\n", "We initialize `IPUConfig` with one config name or a path, which we set earlier. We also get the model configuration from the model name set earlier and initialize our model using that config." 
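] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Before building the trainer, it can help to work out the global batch size implied by these settings. The following is a minimal sketch, not part of the Optimum Graphcore API: `global_batch_size` is a hypothetical helper, and the replication factors are copied from the `Graphcore/gpt2-small-ipu` dictionary quoted earlier in this notebook.\n", "\n", "```python\n", "# Replication factors quoted from the Graphcore/gpt2-small-ipu IPU config.\n", "replication_factor = {\"pod4\": 1, \"pod8\": 2, \"pod16\": 4, \"pod32\": 8, \"pod64\": 16, \"default\": 1}\n", "\n", "\n", "def global_batch_size(micro_batch_size, gradient_accumulation_steps, pod_type):\n", "    # Hypothetical helper: number of samples that contribute to one weight update.\n", "    replicas = replication_factor.get(pod_type, replication_factor[\"default\"])\n", "    return micro_batch_size * gradient_accumulation_steps * replicas\n", "\n", "\n", "# Values used for the causal language modelling example in this notebook.\n", "for pod_type in (\"pod4\", \"pod16\"):\n", "    print(pod_type, global_batch_size(1, 64, pod_type))\n", "```\n", "\n", "With a micro batch size of 1 and 64 gradient accumulation steps, this gives a global batch size of 64 on a POD4 and 256 on a POD16."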
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "sPqQA3TT3l_I" }, "outputs": [], "source": [ "from transformers import AutoConfig, AutoModelForCausalLM\n", "from optimum.graphcore import IPUConfig, IPUTrainer, IPUTrainingArguments\n", "\n", "ipu_config = IPUConfig.from_pretrained(ipu_config_name, executable_cache_dir=executable_cache_dir)\n", "\n", "config = AutoConfig.from_pretrained(model_checkpoint)\n", "config.update({'activation_function':'gelu'})\n", "model = AutoModelForCausalLM.from_config(config)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "`IPUTrainingArguments` requires one folder name, which will be used to save the checkpoints of the model. All other arguments are optional:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "YbSwEhQ63l_L" }, "outputs": [], "source": [ "training_args = IPUTrainingArguments(\n", " f\"{model_checkpoint}-wikitext2\",\n", " learning_rate=2e-5,\n", " weight_decay=0.01,\n", " per_device_train_batch_size=micro_batch_size,\n", " per_device_eval_batch_size=micro_batch_size,\n", " gradient_accumulation_steps=gradient_accumulation_steps,\n", " num_train_epochs=10,\n", " loss_scaling=16384,\n", " n_ipu=4,\n", " warmup_ratio=0.1,\n", " dataloader_drop_last=True,\n", " dataloader_num_workers=dataloader_workers,\n", " logging_steps=10,\n", " push_to_hub=False,\n", " # hub_model_id=f\"username-or-organization/{model_checkpoint}-wikitext2\",\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "`push_to_hub` and `hub_model_id` in `IPUTrainingArguments` are necessary if we want to push the model to the [🤗 Models Hub](https://huggingface.co/models) regularly during training. You can remove them if you didn't follow the installation steps at the beginning of this notebook. 
If you want to save your model locally under a name that is different from the name of the repository it will be pushed to, or if you want to push your model under an organization and not your namespace, use the `hub_model_id` argument to set the repo name (it needs to be the full name, including your namespace: for instance `\"sgugger/gpt-finetuned-wikitext2\"` or `\"huggingface/gpt-finetuned-wikitext2\"`)." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "sZRbT9ui3l_N" }, "source": [ "Finally, we pass along all of these to the `IPUTrainer` class:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "OEuqwIra3l_N", "scrolled": true }, "outputs": [], "source": [ "from transformers import default_data_collator\n", "\n", "trainer = IPUTrainer(\n", "    model=model,\n", "    ipu_config=ipu_config,\n", "    args=training_args,\n", "    train_dataset=lm_datasets[\"train\"],\n", "    eval_dataset=lm_datasets[\"validation\"],\n", "    tokenizer=tokenizer,\n", "    data_collator=default_data_collator,\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "6Vvz34Td3l_O" }, "source": [ "We are now ready to train our model:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "NyZvu_MF3l_P", "outputId": "b69d0931-7f1f-4f2d-fdb8-09d37c7418bb" }, "outputs": [], "source": [ "trainer.train()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "3APq-vUc3l_R" }, "source": [ "Once the training is complete, we can evaluate our model and get its perplexity on the validation set like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "diKZnB1I3l_R", "outputId": "9b3ac725-0117-4830-f380-a555ee57c8cf" }, "outputs": [], "source": [ "import math\n", "eval_results = trainer.evaluate()\n", "print(f\"Perplexity: {math.exp(eval_results['eval_loss']):.2f}\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The perplexity is still quite high since we 
only trained on a small dataset for a small number of epochs. To train a real language model, you would need a larger dataset and more epochs." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "wY82caEX3l_i" }, "source": [ "You can now upload the result of the training to the 🤗 Hub:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# trainer.push_to_hub()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "You can now share this model, and other users can load it with the identifier `\"your-username/the-name-you-picked\"`, for instance:\n", "\n", "```python\n", "from transformers import AutoModelForCausalLM\n", "\n", "model = AutoModelForCausalLM.from_pretrained(\"sgugger/my-awesome-model\")\n", "```" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "q-EIELH43l_T" }, "source": [ "## Masked language modelling" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "LWk97-Ny3l_T" }, "source": [ "For masked language modelling (MLM) we are going to use the same preprocessing as before for our dataset with one additional step: we will randomly mask some tokens (by replacing them with `[MASK]`) and the labels will be adjusted to only include the masked tokens (we don't have to predict the non-masked tokens). If you use a tokenizer you trained yourself, make sure the `[MASK]` token is among the special tokens you passed during training!\n", "\n", "We will use the [`bert-base-cased`](https://huggingface.co/bert-base-cased) model for this example. You can pick any of the [🤗 models for masked language modelling](https://huggingface.co/models?filter=masked-lm) as long as that model is supported by Optimum Graphcore. The IPU config files of the supported models are available in Graphcore's [🤗 account](https://huggingface.co/Graphcore). You can also create your own IPU config file locally. 
For the tokenizer, replace the checkpoint with the one you trained." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "QRTpmyCc3l_T" }, "outputs": [], "source": [ "model_checkpoint = \"bert-base-cased\"\n", "tokenizer_checkpoint = \"sgugger/bert-like-tokenizer\"\n", "\n", "ipu_config_name = \"Graphcore/bert-base-ipu\"" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "12F1ulgT3l_V" }, "source": [ "We can apply the same tokenization function as before; we just need to update our tokenizer to use the tokenizer checkpoint we just picked:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "h8RCYcvr3l_V", "outputId": "a5ffeb0a-71da-4b27-e57a-c62f1927562e", "scrolled": true }, "outputs": [], "source": [ "tokenizer = AutoTokenizer.from_pretrained(tokenizer_checkpoint)\n", "tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=[\"text\"])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "MTuy8UUs3l_X" }, "source": [ "As with the causal language modelling example, we group the text samples together and create chunks of length `block_size`. You can skip that step if your dataset is composed of individual sentences."
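] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "To make the grouping step concrete, here is a toy sketch of the same concatenate-then-chunk logic on plain Python lists. `group_into_blocks` is a hypothetical stand-in for `group_texts`, and the token IDs are made up for illustration.\n", "\n", "```python\n", "def group_into_blocks(token_lists, block_size):\n", "    # Hypothetical stand-in for group_texts that handles a single list of token lists.\n", "    concatenated = [token for tokens in token_lists for token in tokens]\n", "    # Drop the remainder so that every block has exactly block_size tokens.\n", "    total_length = (len(concatenated) // block_size) * block_size\n", "    return [concatenated[i : i + block_size] for i in range(0, total_length, block_size)]\n", "\n", "\n", "# Three toy tokenized samples holding nine tokens in total.\n", "toy_samples = [[1, 2, 3], [4, 5, 6, 7], [8, 9]]\n", "print(group_into_blocks(toy_samples, block_size=4))\n", "# prints [[1, 2, 3, 4], [5, 6, 7, 8]]; token 9 is the dropped remainder\n", "```\n", "\n", "Note how the first block spans two of the original samples, which is exactly what happens to the Wikitext paragraphs."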
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "LVYPMwEs3l_X", "outputId": "e71ed7f1-b182-4643-a8fb-3d731c70e40b", "scrolled": true }, "outputs": [], "source": [ "lm_datasets = tokenized_datasets.map(\n", " group_texts,\n", " batched=True,\n", " batch_size=1000,\n", " num_proc=4,\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "nFJ49iHJ3l_Z" }, "source": [ "The rest is very similar to what we did for causal language modelling, with two exceptions:\n", "* We need a model suitable for masked language modelling.\n", "* We need a special data collator.\n", "\n", "First, we use a model suitable for masked language modelling:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "PM10A9Za3l_Z", "outputId": "fff2d5bb-397d-4d5d-9aa9-933090cb6680" }, "outputs": [], "source": [ "from transformers import AutoConfig, AutoModelForMaskedLM\n", "from optimum.graphcore import IPUConfig, IPUTrainer, IPUTrainingArguments\n", "\n", "ipu_config = IPUConfig.from_pretrained(ipu_config_name, executable_cache_dir=executable_cache_dir)\n", "\n", "config = AutoConfig.from_pretrained(model_checkpoint)\n", "model = AutoModelForMaskedLM.from_config(config)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We redefine the `IPUTrainingArguments` class:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "YbSwEhQ63l_L" }, "outputs": [], "source": [ "training_args = IPUTrainingArguments(\n", " f\"{model_checkpoint}-wikitext2-test-mlm\",\n", " learning_rate=2e-5,\n", " weight_decay=0.01,\n", " per_device_train_batch_size=micro_batch_size,\n", " per_device_eval_batch_size=micro_batch_size,\n", " gradient_accumulation_steps=gradient_accumulation_steps,\n", " num_train_epochs=10,\n", " dataloader_drop_last=True,\n", " dataloader_num_workers=dataloader_workers,\n", " warmup_ratio=0.1,\n", " logging_steps=10,\n", " n_ipu=4,\n", " push_to_hub=False,\n", " # 
hub_model_id=f\"username-or-organization/{model_checkpoint}-wikitext2-test-mlm\",\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "As before, the last two arguments in `IPUTrainingArguments` are needed if we want to push the model to the [🤗 Models Hub](https://huggingface.co/models) at the end of training. Remove these two arguments if you didn't follow the installation steps at the top of the notebook. If you want to save your model locally under a name that is different from the name of the repository it will be pushed to, or if you want to push your model under an organization and not your namespace, use the `hub_model_id` argument to set the repo name (it needs to be the full name, including your namespace: for instance `\"sgugger/gpt-finetuned-wikitext2\"` or `\"huggingface/gpt-finetuned-wikitext2\"`)." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "z6uuUnvz3l_b" }, "source": [ "Finally, we use a special data collator. The data collator is a function that is responsible for taking samples and batching them into tensors. In the causal language modelling example, we didn't need anything special, so we just used the default data collator. Here we want to randomly mask the data. We could do it as a pre-processing step (like with the tokenization) but then the tokens would always be masked the same way at each epoch. 
By doing this step inside the data collator, we ensure this random masking is done in a new way each time we go over the data.\n", "\n", "To do this masking, we use `DataCollatorForLanguageModeling`, which lets us adjust the probability of the masking:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "nRZ-5v_P3l_b" }, "outputs": [], "source": [ "from transformers import DataCollatorForLanguageModeling\n", "data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "bqHnWcYC3l_d" }, "source": [ "Then we just have to pass everything to `IPUTrainer` and begin training:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "V-Y3gNqV3l_d" }, "outputs": [], "source": [ "trainer = IPUTrainer(\n", "    model=model,\n", "    args=training_args,\n", "    ipu_config=ipu_config,\n", "    train_dataset=lm_datasets[\"train\"],\n", "    eval_dataset=lm_datasets[\"validation\"],\n", "    data_collator=data_collator,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Y9TFqDG_3l_e", "outputId": "2e0c8bca-0e04-4b4f-ad06-8dd320af6c37", "scrolled": true }, "outputs": [], "source": [ "trainer.train()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "KDBi0reX3l_g" }, "source": [ "As before, we can evaluate our model on the validation set. The perplexity is much lower than for the CLM objective because for the MLM objective, we only have to make predictions for the masked tokens (which represent 15% of the total here) while having access to the rest of the tokens. It's thus an easier task for the model."
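] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The evaluation cell turns `eval_loss` into a perplexity with `math.exp`. As a minimal sketch of that relationship, assuming the evaluation loss is the mean cross-entropy over the scored tokens (for MLM, only the masked positions), the example below computes perplexity from the probabilities a model assigns to the correct tokens. `perplexity` is a hypothetical helper and the probabilities are made up.\n", "\n", "```python\n", "import math\n", "\n", "\n", "def perplexity(correct_token_probs):\n", "    # Mean negative log-likelihood (cross-entropy) over the scored tokens.\n", "    mean_loss = -sum(math.log(p) for p in correct_token_probs) / len(correct_token_probs)\n", "    # Perplexity is the exponential of the mean cross-entropy loss.\n", "    return math.exp(mean_loss)\n", "\n", "\n", "# A model that spreads its guess evenly over 4 candidate tokens.\n", "print(round(perplexity([0.25, 0.25, 0.25]), 2))  # prints 4.0\n", "# A more confident model gets a perplexity closer to 1.\n", "print(round(perplexity([0.9, 0.8, 0.95]), 2))\n", "```\n", "\n", "A perplexity of 4 can be read as the model being as uncertain as if it were choosing uniformly among four tokens at each scored position."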
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "4hSaANqj3l_g", "outputId": "eeeb8727-2e27-4aeb-ac71-c98123214661" }, "outputs": [], "source": [ "eval_results = trainer.evaluate()\n", "print(f\"Perplexity: {math.exp(eval_results['eval_loss']):.2f}\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The perplexity is still quite high since we only trained on a small dataset for a small number of epochs. For a real language model training, you would need a larger dataset and more epochs." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "wY82caEX3l_i" }, "source": [ "You can now upload the result of the training to the 🤗 Hub:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# trainer.push_to_hub()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "You can also share this model and other users can load it with the identifier `\"your-username/the-name-you-picked\"` so for instance:\n", "\n", "```python\n", "from transformers import AutoModelForMaskedLM\n", "\n", "model = AutoModelForMaskedLM.from_pretrained(\"sgugger/my-awesome-model\")\n", "```" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Next steps\n", "\n", "Check out the full list of [IPU-powered Jupyter Notebooks](https://www.graphcore.ai/ipu-jupyter-notebooks) to get more of a feel for how IPUs perform on other tasks." 
] } ], "metadata": { "colab": { "name": "Train a language model", "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "vscode": { "interpreter": { "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" } } }, "nbformat": 4, "nbformat_minor": 4 }