{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "rEJBSTyZIrIb"
},
"source": [
"# Translation on IPU using BART-Base - Fine-tuning"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "kTCFado4IrIc"
},
"source": [
"In this notebook, we will see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) models for a translation task. We will use the [WMT dataset](http://www.statmt.org/wmt16/), a machine translation dataset composed of a collection of various sources, including news commentaries and parliament proceedings.\n",
"\n",
"\n",
"\n",
"We will see how to easily load the dataset for this task using 🤗 Datasets and how to fine-tune a model on it using the `IPUSeq2SeqTrainer` API."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"| Domain | Tasks | Model | Datasets | Workflow | Number of IPUs | Execution time |\n",
"|---------|-------|-------|----------|----------|--------------|--------------|\n",
"| Natural language processing | Translation | bart-base | WMT dataset | Fine-tuning | 4 | 68min |\n",
"\n",
"[](https://www.graphcore.ai/join-community)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Environment setup\n",
"\n",
"The best way to run this demo is on Paperspace Gradient's cloud IPUs because everything is already set up for you.\n",
"\n",
"[]()\n",
"\n",
"To run the demo using other IPU hardware, you need to have the Poplar SDK enabled. Refer to the [Getting Started guide](https://docs.graphcore.ai/en/latest/getting-started.html#getting-started) for your system for details on how to enable the Poplar SDK. Also refer to the [Jupyter Quick Start guide](https://docs.graphcore.ai/projects/jupyter-notebook-quick-start/en/latest/index.html) for how to set up Jupyter to be able to run this notebook on a remote IPU machine."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dependencies and configuration\n",
"\n",
"In order to improve usability and support for future users, Graphcore would like to collect information about the\n",
"applications and code being run in this notebook. The following information will be anonymised before being sent to Graphcore:\n",
"\n",
"- User progression through the notebook\n",
"- Notebook details: number of cells, code being run and the output of the cells\n",
"- Environment details\n",
"\n",
"You can disable logging at any time by running `%unload_ext graphcore_cloud_tools.notebook_logging.gc_logger` from any cell."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Install the dependencies for this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install \"optimum-graphcore==0.7\" sacrebleu\n",
"%pip install graphcore-cloud-tools[logger]@git+https://github.com/graphcore/graphcore-cloud-tools\n",
"%load_ext graphcore_cloud_tools.notebook_logging.gc_logger"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "4RRkXuteIrIh"
},
"source": [
"This notebook is built to run with any model checkpoint from the [🤗 Models Hub](https://huggingface.co/models) as long as that model has a sequence-to-sequence version in the Transformers library and is supported by Optimum Graphcore. Here we picked the [`facebook/bart-base`](https://huggingface.co/facebook/bart-base) model checkpoint. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model_checkpoint = \"facebook/bart-base\""
]
},
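{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to experiment with a different sequence-to-sequence checkpoint, you can swap it in here. For example (both of these are handled by the model-specific cells later in this notebook):\n",
"\n",
"```python\n",
"# model_checkpoint = \"t5-small\"\n",
"# model_checkpoint = \"facebook/mbart-large-cc25\"\n",
"```"
]
},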
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Values for machine size and cache directories can be configured through environment variables or directly in the notebook:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"n_ipu = int(os.getenv(\"NUM_AVAILABLE_IPU\", 4))\n",
"executable_cache_dir = os.getenv(\"POPLAR_EXECUTABLE_CACHE_DIR\", \"/tmp/exe_cache/\") + \"/translation\""
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sharing your model with the community"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"You can share your model with the 🤗 community. You do this by completing the following steps:\n",
"1. Store your authentication token from the 🤗 website. [Sign up to 🤗](https://huggingface.co/join) if you haven't already.\n",
"2. Execute the following cell and enter your token."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from huggingface_hub import notebook_login\n",
"\n",
"notebook_login()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Then you need to install Git-LFS to manage large files:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!apt install git-lfs"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "whPRbBNbIrIl"
},
"source": [
"## Loading the dataset"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "W7QYTpxXIrIl"
},
"source": [
"We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we will use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`. We use the English/Romanian part of the WMT dataset here."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "IreSlFmlIrIm"
},
"outputs": [],
"source": [
"from datasets import load_dataset, load_metric\n",
"\n",
"raw_datasets = load_dataset(\"wmt16\", \"ro-en\")\n",
"metric = load_metric(\"sacrebleu\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "RzfPtOMoIrIu"
},
"source": [
"The `dataset` object itself is [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key for the training, validation and test sets:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "GWiVUF0jIrIv",
"outputId": "35e3ea43-f397-4a54-c90c-f2cf8d36873e"
},
"outputs": [],
"source": [
"raw_datasets"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "u3EtYfeHIrIz"
},
"source": [
"To access an actual element, you need to select a split (\"train\" in the example) and then specify an index:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "X6HrpprwIrIz",
"outputId": "d7670bc0-42e4-4c09-8a6a-5c018ded7d95"
},
"outputs": [],
"source": [
"raw_datasets[\"train\"][0]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "WHUmphG3IrI3"
},
"source": [
"To get a sense of what the data looks like, the following function will show some samples picked randomly from the dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "i3j8APAoIrI3"
},
"outputs": [],
"source": [
"import datasets\n",
"import random\n",
"import pandas as pd\n",
"from IPython.display import display, HTML\n",
"\n",
"def show_random_elements(dataset, num_examples=5):\n",
" assert num_examples <= len(dataset), \"Can't pick more elements than there are in the dataset.\"\n",
" picks = []\n",
" for _ in range(num_examples):\n",
" pick = random.randint(0, len(dataset)-1)\n",
" while pick in picks:\n",
" pick = random.randint(0, len(dataset)-1)\n",
" picks.append(pick)\n",
" \n",
" df = pd.DataFrame(dataset[picks])\n",
" for column, typ in dataset.features.items():\n",
" if isinstance(typ, datasets.ClassLabel):\n",
" df[column] = df[column].transform(lambda i: typ.names[i])\n",
" display(HTML(df.to_html()))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "SZy5tRB_IrI7",
"outputId": "ba8f2124-e485-488f-8c0c-254f34f24f13"
},
"outputs": [],
"source": [
"show_random_elements(raw_datasets[\"train\"])"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "lnjDIuQ3IrI-"
},
"source": [
"The metric is an instance of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "5o4rUteaIrI_",
"outputId": "18038ef5-554c-45c5-e00a-133b02ec10f1"
},
"outputs": [],
"source": [
"metric"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "jAWdqcUBIrJC"
},
"source": [
"You can call its `compute` method with your predictions and labels, which need to be list of decoded strings (a list of lists for the labels):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "6XN1Rq0aIrJC",
"outputId": "a4405435-a8a9-41ff-9f79-a13077b587c7"
},
"outputs": [],
"source": [
"fake_preds = [\"hello there\", \"general kenobi\"]\n",
"fake_labels = [[\"hello there\"], [\"general kenobi\"]]\n",
"metric.compute(predictions=fake_preds, references=fake_labels)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "n9qywopnIrJH"
},
"source": [
"## Preprocessing the data"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "YVx71GdAIrJH"
},
"source": [
"Before we can feed the text samples to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer` which will tokenize the inputs (including converting the tokens to their corresponding IDs in the pre-trained vocabulary) and put it in a format the model expects, as well as generate the other inputs that the model requires.\n",
"\n",
"To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:\n",
"\n",
"- We get a tokenizer that corresponds to the model architecture we want to use.\n",
"- We download the vocabulary used when pre-training this specific checkpoint.\n",
"\n",
"That vocabulary will be cached, so it's not downloaded again the next time we run the cell."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "eXNLu_-nIrJI"
},
"outputs": [],
"source": [
"from transformers import AutoTokenizer\n",
" \n",
"tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"For the `mBART` tokenizer, we need to set the source and target languages (so the text samples are preprocessed properly). You can check the language codes for supported languages on this [🤗 `mBart` Model Card](https://huggingface.co/facebook/mbart-large-cc25) if you want to use this notebook on a different pair of languages."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"if \"mbart\" in model_checkpoint:\n",
" tokenizer.src_lang = \"en-XX\"\n",
" tokenizer.tgt_lang = \"ro-RO\""
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "Vl6IidfdIrJK"
},
"source": [
"By default, the call above will use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library."
]
},
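{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"You can quickly confirm this by checking the tokenizer's `is_fast` attribute:\n",
"\n",
"```python\n",
"tokenizer.is_fast  # True when a Rust-backed \"fast\" tokenizer was loaded\n",
"```"
]
},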
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "rowT4iCLIrJK"
},
"source": [
"You can directly call this tokenizer on one sentence or a pair of sentences:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "a5hBlsrHIrJL",
"outputId": "acdaa98a-a8cd-4a20-89b8-cc26437bbe90"
},
"outputs": [],
"source": [
"tokenizer(\"Hello, this one sentence!\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "qo_0B1M2IrJM"
},
"source": [
"Depending on the model you selected, you will see different keys in the dictionary returned by the cell above. They don't matter much for what we're doing here, but they are required by the model we will instantiate later. You can learn more about them in this [tutorial on preprocessing](https://huggingface.co/transformers/preprocessing.html).\n",
"\n",
"Instead of one sentence, we can also pass a list of sentences:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tokenizer([\"Hello, this one sentence!\", \"This is another sentence.\"])"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"To prepare the targets for our model, we need to tokenize them inside the `as_target_tokenizer` context manager. This will make sure the tokenizer uses the special tokens corresponding to the targets:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"with tokenizer.as_target_tokenizer():\n",
" print(tokenizer([\"Hello, this one sentence!\", \"This is another sentence.\"]))"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "2C0hcmp9IrJQ"
},
"source": [
"If you are using one of the five T5 checkpoints that require a special prefix inserted before the inputs, you should adapt `prefix` in the following cell to the prefix you need:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"if model_checkpoint in [\"t5-small\", \"t5-base\", \"t5-larg\", \"t5-3b\", \"t5-11b\"]:\n",
" prefix = \"translate English to Romanian: \"\n",
"else:\n",
" prefix = \"\""
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"We can then write the function that will preprocess our samples. We just feed the tokenizer with three arguments. `padding=\"max_length\"` will ensure that an input shorter than the maximum length will be padded to the maximum length. `truncation=True` will ensure that an input longer than the maximum length will be truncated to the maximum length. `max_length=max_input/target_length` sets the maximum length of a sequence.\n",
"\n",
"Note that it is necessary to pad all the sentences to the same length since currently Graphcore's PyTorch implementation only runs in static mode."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "vc0BSBLIIrJQ"
},
"outputs": [],
"source": [
"max_input_length = 128\n",
"max_target_length = 128\n",
"source_lang = \"en\"\n",
"target_lang = \"ro\"\n",
"\n",
"def preprocess_function(examples):\n",
" inputs = [prefix + ex[source_lang] for ex in examples[\"translation\"]]\n",
" targets = [ex[target_lang] for ex in examples[\"translation\"]]\n",
" model_inputs = tokenizer(inputs, max_length=max_input_length, padding=\"max_length\", truncation=True)\n",
"\n",
" # Setup the tokenizer for targets\n",
" with tokenizer.as_target_tokenizer():\n",
" labels = tokenizer(targets, max_length=max_target_length, padding=\"max_length\", truncation=True)\n",
" \n",
" # Since we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore\n",
" # padding in the loss.\n",
" labels[\"input_ids\"] = [\n",
" [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels[\"input_ids\"]\n",
" ]\n",
"\n",
" model_inputs[\"labels\"] = labels[\"input_ids\"]\n",
" return model_inputs"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "0lm8ozrJIrJR"
},
"source": [
"This function works with one or several samples. In the case of several samples, the tokenizer will return a list of lists for each key:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "-b70jh26IrJS",
"outputId": "acd3a42d-985b-44ee-9daa-af5d944ce1d9"
},
"outputs": [],
"source": [
"preprocess_function(raw_datasets['train'][:2])"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "zS-6iXTkIrJT"
},
"source": [
"To apply this function on all the pairs of sentences in our dataset, we just use the `map` method of the `dataset` object we created earlier. This will apply the function to all the elements of all the splits in `dataset`. This means our training, validation and testing data will be preprocessed in a single command."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "DDtsaJeVIrJT",
"outputId": "aa4734bf-4ef5-4437-9948-2c16363da719"
},
"outputs": [],
"source": [
"tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "voWiw8C7IrJV"
},
"source": [
"Even better, the results are automatically cached by the 🤗 Datasets library to avoid spending time on this step the next time you run your notebook. The 🤗 Datasets library is able to detect when the function you pass to `map` has changed (and thus to not use the cached data). For instance, it will detect if you change the task in the first cell and rerun the notebook. 🤗 Datasets warns you when it uses cached files. You can pass `load_from_cache_file=False` in the call to `map` to not use the cached files and force the preprocessing to be applied again.\n",
"\n",
"Note that we passed `batched=True` to encode the text samples together into samples. This is to leverage the full benefit of the fast tokenizer we loaded earlier, which will use multi-threading to treat the text samples in a batch concurrently."
]
},
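{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, to bypass the cache and run the preprocessing again from scratch, you would call `map` like this:\n",
"\n",
"```python\n",
"# Force the preprocessing to run again instead of loading the cached result\n",
"tokenized_datasets = raw_datasets.map(\n",
"    preprocess_function, batched=True, load_from_cache_file=False\n",
")\n",
"```"
]
},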
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "545PP3o8IrJV"
},
"source": [
"## Fine-tuning the model"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "FBiW8UpKIrJW"
},
"source": [
"Now that our data is ready, we can download the pre-trained model and fine-tune it. Since our task is of the sequence-to-sequence kind, we use the `AutoModelForSeq2SeqLM` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "TlqNaB8jIrJW",
"outputId": "84916cf3-6e6c-47f3-d081-032ec30a4132"
},
"outputs": [],
"source": [
"from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq\n",
"from optimum.graphcore import IPUConfig, IPUSeq2SeqTrainer, IPUSeq2SeqTrainingArguments\n",
"\n",
"model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "CczA5lJlIrJX"
},
"source": [
"Note that we don't get a warning like in our [text classification notebook](https://github.com/huggingface/optimum-graphcore/blob/main/notebooks/text_classification.ipynb). This means we used all the weights of the pre-trained model and there is no randomly initialized head in this case."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"To instantiate a `IPUSeq2SeqTrainer`, we will need to define: \n",
"* `IPUConfig`, which is a class that specifies attributes and configuration parameters to compile and put the model on the device.\n",
"* `IPUSeq2SeqTrainingArguments`, which is a class that contains all the attributes to customize the training.\n",
"* Data collator.\n",
"* How to compute the metrics from the predictions.\n",
"\n",
"We initialize `IPUConfig` with a config name or path, which we defined earlier in the notebook:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ipu_config_name = 'Graphcore/bart-base-ipu'\n",
"ipu_config = IPUConfig.from_pretrained(\n",
" ipu_config_name,\n",
" executable_cache_dir=executable_cache_dir,\n",
" # -1 wildcard, \n",
" # split encoder and decoder layers evenly across IPUs \n",
" # for inference \n",
" inference_layers_per_ipu=[-1]\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "_N8urzhyIrJY"
},
"source": [
"`IPUSeq2SeqTrainingArguments` requires a folder name, which will be used to save the checkpoints of the model. All other arguments are optional."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Bliy8zgjIrJY"
},
"outputs": [],
"source": [
"micro_batch_size = 1\n",
"gradient_accumulation_steps = 128\n",
"\n",
"model_name = model_checkpoint.split(\"/\")[-1]\n",
"args = IPUSeq2SeqTrainingArguments(\n",
" f\"{model_name}-finetuned-{source_lang}-to-{target_lang}\",\n",
" evaluation_strategy = \"epoch\",\n",
" learning_rate=2e-5,\n",
" per_device_train_batch_size=micro_batch_size,\n",
" per_device_eval_batch_size=micro_batch_size,\n",
" gradient_accumulation_steps=gradient_accumulation_steps,\n",
" n_ipu=n_ipu,\n",
" weight_decay=0.01,\n",
" save_total_limit=3,\n",
" num_train_epochs=1,\n",
" predict_with_generate=True,\n",
" generation_max_length=max_target_length,\n",
" dataloader_drop_last=True,\n",
" logging_steps=20,\n",
" push_to_hub=False\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "km3pGVdTIrJc"
},
"source": [
"Here we set the evaluation to be done at the end of each epoch, tweak the learning rate, use the three batch-size-related arguments, namely `micro_batch_size`, `gradient_accumulation_steps` and `n_ipu` defined at the top of the cell and customize the weight decay. Since `IPUSeq2SeqTrainer` will save the model regularly and our dataset is quite large, we tell it to make a maximum of three.\n",
"\n",
"`push_to_hub` in `IPUSeq2SeqTrainer` is necessary if we want to push the model to the [🤗 Models Hub](https://huggingface.co/models) regularly during training. You can remove them if you didn't follow the installation steps at the beginning of this notebook. If you want to save your model locally to a name that is different to the name of the repository it will be pushed to, or if you want to push your model under an organization and not your name space, use the `hub_model_id` argument to set the repo name (it needs to be the full name, including your namespace: for instance `\"sgugger/marian-finetuned-en-to-ro\"` or `\"huggingface/marian-finetuned-en-to-ro\"`).\n",
"\n",
"Then, we need a special kind of data collator, which will prepare `decoder_input_ids`:"
]
},
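{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, the number of samples consumed per weight update is the product of the micro batch size and the gradient accumulation steps (a sketch that assumes a single model replica spread across the IPUs; data-parallel replication would multiply this further):\n",
"\n",
"```python\n",
"samples_per_step = micro_batch_size * gradient_accumulation_steps\n",
"print(f\"Samples per weight update: {samples_per_step}\")  # 1 * 128 = 128\n",
"```"
]
},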
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)"
]
},
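{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what the collator adds, you can run it on a couple of preprocessed samples (a minimal sketch; we keep only the tokenized columns, since the collator cannot pad the raw `translation` field):\n",
"\n",
"```python\n",
"keys = [\"input_ids\", \"attention_mask\", \"labels\"]\n",
"features = [{k: tokenized_datasets[\"train\"][i][k] for k in keys} for i in range(2)]\n",
"batch = data_collator(features)\n",
"print(batch.keys())  # includes the prepared decoder_input_ids\n",
"```"
]
},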
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "7sZOdRlRIrJd"
},
"source": [
"The last thing to define for our `IPUSeq2SeqTrainer` is how to compute the metrics from the predictions. We need to define a function for this, which will just use `metric`, which we loaded earlier. We have to do a bit of pre-processing to decode the predictions into text samples:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "UmvbnJ9JIrJd"
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def postprocess_text(preds, labels):\n",
" preds = [pred.strip() for pred in preds]\n",
" labels = [[label.strip()] for label in labels]\n",
"\n",
" return preds, labels\n",
"\n",
"def compute_metrics(eval_preds):\n",
" preds, labels = eval_preds\n",
" if isinstance(preds, tuple):\n",
" preds = preds[0]\n",
" # Replace -100 in the labels as we can't decode them.\n",
" preds = np.where(preds != -100, preds, tokenizer.pad_token_id)\n",
" decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)\n",
" labels = np.where(labels != -100, labels, tokenizer.pad_token_id)\n",
" decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)\n",
"\n",
" # Some simple post-processing\n",
" decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)\n",
"\n",
" result = metric.compute(predictions=decoded_preds, references=decoded_labels)\n",
" result = {\"bleu\": result[\"score\"]}\n",
"\n",
" prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]\n",
" result[\"gen_len\"] = np.mean(prediction_lens)\n",
" result = {k: round(v, 4) for k, v in result.items()}\n",
" return result"
]
},
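{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"You can sanity-check this function on a toy pair where predictions and labels match exactly (a quick sketch; a perfect match should give a BLEU score of 100):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"toy_ids = np.array(tokenizer([\"hello there\"], padding=\"max_length\", max_length=8)[\"input_ids\"])\n",
"compute_metrics((toy_ids, toy_ids))  # identical preds and labels => BLEU of 100\n",
"```"
]
},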
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "rXuFTAzDIrJe",
"tags": []
},
"source": [
"Then we just need to pass all of this together with our datasets to the `IPUSeq2SeqTrainer` class:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "imY1oC3SIrJf"
},
"outputs": [],
"source": [
"trainer = IPUSeq2SeqTrainer(\n",
" model,\n",
" ipu_config,\n",
" args,\n",
" train_dataset=tokenized_datasets[\"train\"],\n",
" eval_dataset=tokenized_datasets[\"validation\"],\n",
" data_collator=data_collator,\n",
" compute_metrics=compute_metrics\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "CdzABDVcIrJg"
},
"source": [
"We now fine-tune our model by calling the `train` method:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "uNx5pyRlIrJh",
"outputId": "077e661e-d36c-469b-89b8-7ff7f73541ec"
},
"outputs": [],
"source": [
"trainer.model.generation_config.use_cache=False\n",
"trainer.model.config.use_cache=False\n",
"trainer.train()"
]
},
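{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Once training has finished, you can sanity-check the fine-tuned weights with a quick generation on CPU (a minimal sketch, assuming you first save the model to a local directory of your choice):\n",
"\n",
"```python\n",
"from transformers import AutoModelForSeq2SeqLM\n",
"\n",
"trainer.save_model(\"bart-base-finetuned-en-to-ro\")  # hypothetical local path\n",
"finetuned = AutoModelForSeq2SeqLM.from_pretrained(\"bart-base-finetuned-en-to-ro\")\n",
"\n",
"inputs = tokenizer(prefix + \"The weather is nice today.\", return_tensors=\"pt\")\n",
"outputs = finetuned.generate(**inputs, max_length=max_target_length)\n",
"print(tokenizer.batch_decode(outputs, skip_special_tokens=True))\n",
"```"
]
},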
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"You can upload the result of the training to the 🤗 Hub:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# trainer.push_to_hub()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also share this model and other users can load it with the identifier \"your-username/the-name-you-picked\" so for instance:\n",
"\n",
"```python\n",
"from transformers import AutoModelForSeq2SeqLM\n",
"\n",
"model = AutoModelForSeq2SeqLM.from_pretrained(\"sgugger/my-awesome-model\")\n",
"```"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Next steps\n",
"\n",
"Try out the other [IPU-powered Jupyter Notebooks](https://www.graphcore.ai/ipu-jupyter-notebooks) to see how how IPUs perform on other tasks."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"colab": {
"name": "Translation",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
},
"vscode": {
"interpreter": {
"hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}