# How to Train PyTorch Hugging Face Transformers on Cloud TPUs

Over the past several months, the Hugging Face and Google [`pytorch/xla`](https://github.com/pytorch/xla) teams have been collaborating to bring first-class support for training Hugging Face transformers on Cloud TPUs, with significant speedups.

In this Colab, we walk you through fine-tuning [RoBERTa](https://arxiv.org/abs/1907.11692) with Masked Language Modeling (MLM) on the [WikiText-2 dataset](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) using the free TPUs provided by Colab.

Last Updated: February 8th, 2021

### Install and clone dependencies

In [None]:
!pip install transformers==4.2.2 \
  torch==1.7.0 \
  cloud-tpu-client==0.10 \
  datasets==1.2.1 \
  https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.7-cp37-cp37m-linux_x86_64.whl
!git clone -b v4.2.2 https://github.com/huggingface/transformers

### Train the model

All Cloud TPU training functionality has been built into [`trainer.py`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) and so we'll use the [`run_mlm.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_mlm.py) script under `examples/language-modeling` to finetune our RoBERTa model on the WikiText-2 dataset.

Note that in the following command we use [`xla_spawn.py`](https://github.com/huggingface/transformers/blob/master/examples/xla_spawn.py) to spawn 8 processes, one for each of the 8 cores of a single v2-8/v3-8 Cloud TPU system (Cloud TPU Pods can scale all the way up to 2048 cores). All `xla_spawn.py` does is call [`xmp.spawn`](https://github.com/pytorch/xla/blob/master/torch_xla/distributed/xla_multiprocessing.py#L350), which sets up the required environment metadata and calls `torch.multiprocessing.start_processes`.
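
To make the launcher's role concrete, here is a simplified sketch of what such a script might look like. It is not the source of `xla_spawn.py`: `patched_argv` and `launch` are hypothetical names, and the sketch assumes the training script exposes an `_mp_fn(index)` entry point and accepts a `--tpu_num_cores` argument.

```python
import importlib
import sys
from pathlib import Path


def patched_argv(training_script, script_args, num_cores):
    """Build the argv that the training script sees in each spawned process."""
    # Assumption: the training script understands `--tpu_num_cores`.
    return [training_script, *script_args, "--tpu_num_cores", str(num_cores)]


def launch(training_script, script_args, num_cores=8):
    """Import the training script and run it once per TPU core."""
    # Imported lazily so that `patched_argv` is usable without torch_xla installed.
    import torch_xla.distributed.xla_multiprocessing as xmp

    # Make the script importable as a module so its entry point can be called.
    script_path = Path(training_script)
    sys.path.append(str(script_path.parent.resolve()))
    module = importlib.import_module(script_path.stem)

    # The spawned processes parse their arguments from sys.argv.
    sys.argv = patched_argv(training_script, script_args, num_cores)

    # Assumption: the script defines `_mp_fn(index)`, called with the process index.
    xmp.spawn(module._mp_fn, args=(), nprocs=num_cores)
```

Calling `launch("run_mlm.py", ["--do_train"], num_cores=8)` on a TPU host would, under these assumptions, start 8 training processes that all receive the same command-line arguments.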

The command below spawns 8 processes, each of which drives one TPU core. We've set `per_device_train_batch_size=4` and `per_device_eval_batch_size=4`, which means that the global batch size will be `32` (`4 examples/device * 8 devices/Colab TPU = 32 examples/Colab TPU`). You can also append the `--tpu_metrics_debug` flag for additional debug metrics (e.g., how long it took to compile, execute one step, etc.).
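
The batch-size arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: `global_batch_size` is a hypothetical helper, not part of `transformers`, and it assumes gradient accumulation is left at one step, since the command does not set it.

```python
def global_batch_size(per_device_batch_size: int, num_cores: int) -> int:
    """Examples consumed per optimizer step across all TPU cores."""
    return per_device_batch_size * num_cores


# Values from the command in this Colab: 4 examples per core, 8 cores.
print(global_batch_size(per_device_batch_size=4, num_cores=8))  # prints 32
```

If you change `--num_cores` or the per-device batch size, the global batch size scales accordingly, so you may want to revisit the learning rate as well.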

The following cell should take around 10 to 15 minutes to run.

In [None]:
!python transformers/examples/xla_spawn.py \
    --num_cores 8 \
    transformers/examples/language-modeling/run_mlm.py \
    --model_name_or_path roberta-base \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --max_seq_length 512 \
    --pad_to_max_length \
    --logging_dir tensorboard \
    --num_train_epochs 3 \
    --do_train \
    --do_eval \
    --output_dir output \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --logging_steps=50 \
    --save_steps=5000

### Visualize Tensorboard Metrics

In [None]:
%load_ext tensorboard
%tensorboard --logdir tensorboard

## 🎉🎉🎉 **Done Training!** 🎉🎉🎉


## Run inference on finetuned model

In [None]:
import torch_xla.core.xla_model as xm
from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline

# Load the fine-tuned checkpoint saved in `output` and move the model to the TPU.
tpu_device = xm.xla_device()
model = AutoModelForMaskedLM.from_pretrained('output').to(tpu_device)
tokenizer = AutoTokenizer.from_pretrained('output')

# Build the pipeline, then point it at the TPU so input tensors are moved there too.
fill_mask = FillMaskPipeline(model, tokenizer)
fill_mask.device = tpu_device

In [None]:
fill_mask('TPUs are much faster than <mask>!')
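
The call above returns a list of candidate completions. As a rough sketch of how you might display them, the helper below sorts candidates by score and formats the top few. `format_predictions` is a hypothetical helper, the sample data is made up for illustration, and the sketch assumes each prediction is a dict with `score` and `sequence` keys.

```python
def format_predictions(predictions, top_k=3):
    """Return the `top_k` highest-scoring predictions as printable lines."""
    # Assumption: each prediction is a dict with "score" and "sequence" keys.
    best = sorted(predictions, key=lambda p: p["score"], reverse=True)[:top_k]
    return [f"{p['score']:.3f}  {p['sequence']}" for p in best]


# Made-up example data standing in for the pipeline's output.
sample = [
    {"sequence": "TPUs are much faster than CPUs!", "score": 0.25},
    {"sequence": "TPUs are much faster than GPUs!", "score": 0.5},
]
for line in format_predictions(sample):
    print(line)
```

In the notebook you would pass the pipeline's result directly, for example `format_predictions(fill_mask('TPUs are much faster than <mask>!'))`.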

And just like that, you've used Cloud TPUs to both fine-tune your model and run predictions! 🎉