# Image Classification on IPU with ViT - Fine-tuning

In this notebook, we will see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) vision models on an image classification dataset.

Given an image, the goal is to predict an appropriate class for it, for example "tiger". The screenshot below is taken from [ViT fine-tuned on ImageNet-1k](https://huggingface.co/google/vit-base-patch16-224). You can try out the inference widget!

This notebook shows how to fine-tune any pre-trained vision model for image classification on an IPU. The idea is to add a randomly initialized classification head on top of a pre-trained encoder, and fine-tune the model altogether on a labelled dataset.

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tiger_image.png" alt="drawing" width="600"/>


| Domain | Tasks | Model | Datasets | Workflow | Number of IPUs | Execution time |
|--------|-------|-------|----------|----------|----------------|----------------|
| Vision | Image classification | ViT | EuroSAT | Fine-tuning | | |

[![Join our Slack Community](https://img.shields.io/badge/Slack-Join%20Graphcore's%20Community-blue?style=flat-square&logo=slack)](https://www.graphcore.ai/join-community)

## Environment setup

The best way to run this demo is on Paperspace Gradient's cloud IPUs because everything is already set up for you.

[![Run on Gradient](https://assets.paperspace.io/img/gradient-badge.svg)](https://ipu.dev/3QxTCyU)

To run the demo using other IPU hardware, you need to have the Poplar SDK enabled. Refer to the [Getting Started guide](https://docs.graphcore.ai/en/latest/getting-started.html#getting-started) for your system for details on how to enable the Poplar SDK. Also refer to the [Jupyter Quick Start guide](https://docs.graphcore.ai/projects/jupyter-notebook-quick-start/en/latest/index.html) for how to set up Jupyter to be able to run this notebook on a remote IPU machine.

## Dependencies and Configuration

In order to improve usability and support for future users, Graphcore would like to collect information about the
applications and code being run in this notebook. The following information will be anonymised before being sent to Graphcore:

- User progression through the notebook
- Notebook details: number of cells, code being run and the output of the cells
- Environment details

You can disable logging at any time by running `%unload_ext graphcore_cloud_tools.notebook_logging.gc_logger` from any cell.

Install the dependencies for this notebook.

In [None]:
%pip install "optimum-graphcore==0.7" scikit-learn torchvision==0.15.2+cpu -f https://download.pytorch.org/whl/torch_stable.html
%pip install graphcore-cloud-tools[logger]@git+https://github.com/graphcore/graphcore-cloud-tools
%load_ext graphcore_cloud_tools.notebook_logging.gc_logger

This notebook leverages the [ImageFolder](https://huggingface.co/docs/datasets/v2.0.0/en/image_process#imagefolder) feature to easily run the notebook on a custom dataset ([EuroSAT](https://github.com/phelber/EuroSAT) in this case). You can either load a dataset from local folders or from local or remote files, like zip or tar.

This notebook is built to run on any image classification dataset with any vision model checkpoint from the [🤗 Models Hub](https://huggingface.co/) as long as that model has a version with an image classification head and is supported by [🤗 Optimum Graphcore](https://github.com/huggingface/optimum-graphcore). The IPU config files of the supported models are available in Graphcore's [Hugging Face account](https://huggingface.co/Graphcore). You can also create your own IPU config file locally.
Currently supported models:
* [ViT](https://huggingface.co/docs/transformers/model_doc/vit#transformers.ViTForImageClassification)
* [ConvNeXT](https://huggingface.co/docs/transformers/master/en/model_doc/convnext#transformers.ConvNextForImageClassification)

In this notebook, we are using both data parallelism and pipeline parallelism (see this [tutorial on efficient data loading](https://github.com/graphcore/examples/tree/master/tutorials/pytorch/tut2_efficient_data_loading) for more details). Therefore the global batch size, which is the actual number of samples used for the weight update, is determined from three factors:
- global batch size = micro batch size * gradient accumulation steps * replication factor

The replication factor is determined by the IPU Pod type, which will be used as a key to select the replication factor from a dictionary defined in the IPU config file. For example, the dictionary in the IPU config file [Graphcore/roberta-base-ipu](https://huggingface.co/Graphcore/roberta-base-ipu/blob/main/ipu_config.json) looks like this:
- "replication_factor": {"pod4": 1, "pod8": 2, "pod16": 4, "pod32": 8, "pod64": 16, "default": 1}

Depending on your model and the IPU Pod you are using, you might need to adjust these three batch-size-related arguments.
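As a minimal sketch of the arithmetic above, the snippet below looks up a replication factor by Pod type and multiplies the three factors together. The dictionary is the one quoted from `Graphcore/roberta-base-ipu`; the helper name `global_batch_size` and the example argument values are illustrative assumptions, and the values in the IPU config you actually load may differ.

```python
# Replication factors keyed by Pod type, copied from the
# Graphcore/roberta-base-ipu example quoted above (other configs may differ).
replication_factor = {"pod4": 1, "pod8": 2, "pod16": 4, "pod32": 8, "pod64": 16, "default": 1}


def global_batch_size(micro_batch_size, gradient_accumulation_steps, pod_type):
    """Hypothetical helper: number of samples contributing to one weight update."""
    # Fall back to the "default" entry when the Pod type is not listed.
    replicas = replication_factor.get(pod_type, replication_factor["default"])
    return micro_batch_size * gradient_accumulation_steps * replicas


print(global_batch_size(1, 32, "pod4"))   # 1 * 32 * 1
print(global_batch_size(1, 32, "pod16"))  # 1 * 32 * 4
```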

In this notebook, we'll fine-tune from the [google/vit-base-patch16-224-in21k](https://huggingface.co/google/vit-base-patch16-224-in21k) checkpoint, but note that there are many more available on the [🤗 Models Hub](https://huggingface.co/models?other=vision).

In [None]:
model_checkpoint = "google/vit-base-patch16-224-in21k" # pre-trained model from which to fine-tune

ipu_config_name = "Graphcore/vit-base-ipu" # config specific to the IPU
micro_batch_size = 1 # micro batch size for training and evaluation
gradient_accumulation_steps = 32

Values for machine size and cache directories can be configured through environment variables or directly in the notebook:

In [None]:
import os

n_ipu = int(os.getenv("NUM_AVAILABLE_IPU", 4))
executable_cache_dir = os.getenv("POPLAR_EXECUTABLE_CACHE_DIR", "/tmp/exe_cache/") + "/image_classification"
dataset_dir = os.getenv("DATASETS_DIR", "./")

### Sharing your model with the community

You can share your model with the 🤗 community by completing the following steps:

1. Store your authentication token from the 🤗 website. [Sign up to 🤗](https://huggingface.co/join) if you haven't already.
2. Execute the following cell and enter your authentication token when prompted.

In [None]:
from huggingface_hub import notebook_login

notebook_login()


Then you need to install Git-LFS to upload your model checkpoints:

In [None]:
!apt -qq install git-lfs
!git config --global credential.helper store

## Loading the dataset

We will use the [ImageFolder](https://huggingface.co/docs/datasets/v2.0.0/en/image_process#imagefolder) feature from the [🤗 Datasets](https://github.com/huggingface/datasets) library to download our custom dataset into a `DatasetDict`.

In this case, the EuroSAT dataset is hosted remotely, so we provide the `data_files` argument. Alternatively, if you have local folders with images, you can load them using the `data_dir` argument. 

In [None]:
from datasets import load_dataset 
from pathlib import Path

# load a custom dataset from local/remote files or folders using the ImageFolder feature

# option 1: local/remote files (supporting the following formats: tar, gzip, zip, xz, rar, zstd)
url = "https://madm.dfki.de/files/sentinel/EuroSAT.zip"
files = list(Path(dataset_dir).rglob("EuroSAT.zip"))
dataset = load_dataset("imagefolder", data_files=str(files[0]) if files else url)

# note that you can also provide several splits:
# dataset = load_dataset("imagefolder", data_files={"train": ["path/to/file1", "path/to/file2"], "test": ["path/to/file3", "path/to/file4"]})

# note that you can push your dataset to the hub very easily (and reload afterwards using load_dataset)!
# dataset.push_to_hub("nielsr/eurosat")
# dataset.push_to_hub("nielsr/eurosat", private=True)

# option 2: local folder
# dataset = load_dataset("imagefolder", data_dir="path_to_folder")

# option 3: just load any existing dataset from the hub, like CIFAR-10, FashionMNIST ...
# dataset = load_dataset("cifar10")

Let us also load the accuracy metric, which we'll use to evaluate our model both during and after training.

In [None]:
from datasets import load_metric

metric = load_metric("accuracy")

The `dataset` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key for each split of the data. In this case, there is only a "train" split.

In [None]:
dataset

To access an actual element, you need to select the split and then specify an index:

In [None]:
example = dataset["train"][10]
example

Each example consists of an image and a corresponding label. We can also verify this by checking the features of the dataset:

In [None]:
dataset["train"].features

Because the 'image' field is an [Image feature](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Image), we can view the image directly:

In [None]:
example['image']

Let's resize the image to make it larger as the images in the EuroSAT dataset are of a low resolution (64x64 pixels):

In [None]:
example['image'].resize((200, 200))

Let's print the corresponding label:

In [None]:
example['label']

As you can see, the `label` field is not an actual string label. By default the `ClassLabel` fields are encoded into integers for convenience:

In [None]:
dataset["train"].features["label"]

Let's create an `id2label` dictionary to decode them back to strings and see what they are. The inverse `label2id` dictionary will be useful later when we load the model.

In [None]:
labels = dataset["train"].features["label"].names
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = i
    id2label[i] = label

id2label[2]

## Preprocessing the data

Before we can feed these images to our model, we need to preprocess them. 

Preprocessing images typically comes down to (1) resizing them to a particular size and (2) normalizing the color channels (R, G, B) using a mean and standard deviation. These are referred to as **image transformations**.
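To make the normalization step concrete, here is a minimal pure-Python sketch of what happens to a single RGB pixel. It assumes the usual two-stage convention (scale 8-bit values to [0, 1], then subtract a per-channel mean and divide by a per-channel standard deviation). The mean and standard deviation of 0.5 are placeholder values, and `normalize_pixel` is a hypothetical helper, not part of any library.

```python
def normalize_pixel(rgb, mean, std):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return [(value / 255.0 - m) / s for value, m, s in zip(rgb, mean, std)]


# Placeholder statistics; in practice they come from the model's preprocessing config.
mean = [0.5, 0.5, 0.5]
std = [0.5, 0.5, 0.5]

print(normalize_pixel((255, 0, 127.5), mean, std))  # [1.0, -1.0, 0.0]
```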

In addition, you typically perform **data augmentation** during training (like random cropping and flipping) to make the model more robust and achieve a higher accuracy. Data augmentation is also a great technique to increase the size of the training data.

We will use `torchvision.transforms` for the image transformations (including data augmentation) in this notebook. Note that you can use any other package, for example [albumentations](https://albumentations.ai/), [imgaug](https://github.com/aleju/imgaug) or [Kornia](https://kornia.readthedocs.io/en/latest/) to perform these transformations.

To make sure we (1) resize to the appropriate size and (2) use the appropriate image mean and standard deviation for the model architecture we are going to use, we instantiate a *feature extractor* with the `AutoFeatureExtractor.from_pretrained` method.

This feature extractor is a minimal preprocessor that can be used to prepare images for inference.

In [None]:
from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained(model_checkpoint)
feature_extractor

The Datasets library is made for processing data very easily. We can write custom functions, which can then be applied to an entire dataset (either using the [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=map#datasets.Dataset.map) or [`set_transform`](https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=set_transform#datasets.Dataset.set_transform) functions).

Here we define two separate functions, one for training (which includes data augmentation) and one for validation (which only includes resizing, center cropping and normalizing). 

In [None]:
import torch
from torchvision.transforms import (
    CenterCrop,
    Compose,
    Normalize,
    RandomHorizontalFlip,
    RandomResizedCrop,
    Resize,
    ToTensor,
)

size = (feature_extractor.size["height"], feature_extractor.size["width"])
normalize = Normalize(mean=feature_extractor.image_mean, std=feature_extractor.image_std)
train_transforms = Compose(
        [
            RandomResizedCrop(size),
            RandomHorizontalFlip(),
            ToTensor(),
            normalize,
            lambda tensor: tensor.half(),
        ]
    )

val_transforms = Compose(
        [
            Resize(size),
            CenterCrop(size),
            ToTensor(),
            normalize,
            lambda tensor: tensor.half(),
        ]
    )

def preprocess_train(example_batch):
    """Apply train_transforms across a batch."""
    example_batch["pixel_values"] = [
        train_transforms(image.convert("RGB")) for image in example_batch["image"]
    ]
    return example_batch

def preprocess_val(example_batch):
    """Apply val_transforms across a batch."""
    example_batch["pixel_values"] = [val_transforms(image.convert("RGB")) for image in example_batch["image"]]
    return example_batch

Next, we can preprocess our dataset by applying these functions. We will use the `set_transform` functionality, which allows us to apply the functions above on-the-fly (meaning that they will only be applied when the images are loaded in RAM).

In [None]:
# split up training into training + validation
splits = dataset["train"].train_test_split(test_size=0.1)
train_ds = splits['train']
val_ds = splits['test']
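Conceptually, the split shuffles the example indices and holds out a fraction of them for validation. The following plain-Python sketch shows the idea on 20 hypothetical indices; `split_indices` and the fixed seed are assumptions for illustration, and the library's own shuffling and rounding may differ.

```python
import random


def split_indices(num_examples, test_size=0.1, seed=0):
    """Shuffle indices and hold out a `test_size` fraction for validation."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    num_test = round(num_examples * test_size)
    # The first `num_test` shuffled indices form the validation set.
    return indices[num_test:], indices[:num_test]


train_idx, val_idx = split_indices(20)
print(len(train_idx), len(val_idx))  # 18 2
```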

In [None]:
train_ds.set_transform(preprocess_train)
val_ds.set_transform(preprocess_val)
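The on-the-fly behaviour can be illustrated with a small stand-in class. This is only a conceptual sketch in plain Python, not how 🤗 Datasets is implemented: `LazyDataset` and `double` are hypothetical names, and the toy transform receives a single row for brevity, whereas the functions passed to the real `set_transform` receive batches, as `preprocess_train` does. The point is simply that the transform runs when an item is accessed, not when it is registered.

```python
class LazyDataset:
    """Toy stand-in for a dataset whose transform runs only on access."""

    def __init__(self, rows):
        self.rows = rows
        self.transform = None
        self.calls = 0  # how many times the transform has actually run

    def set_transform(self, transform):
        # Registering the transform does not touch any data yet.
        self.transform = transform

    def __getitem__(self, index):
        row = dict(self.rows[index])  # copy so the stored row stays untouched
        if self.transform is not None:
            self.calls += 1
            row = self.transform(row)
        return row


def double(row):
    row["doubled"] = row["value"] * 2
    return row


ds = LazyDataset([{"value": 1}, {"value": 2}, {"value": 3}])
ds.set_transform(double)
print(ds.calls)  # 0
print(ds[1])     # {'value': 2, 'doubled': 4}
print(ds.calls)  # 1
```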

Let's access an element to see that we've added a "pixel_values" feature:

In [None]:
train_ds[0]

## Training the model

Now that our data is ready, we can download the pre-trained model and fine-tune it. For classification we use the `AutoModelForImageClassification` class. Calling the `from_pretrained` method on it will download and cache the weights for us. As the label IDs and the number of labels are dataset dependent, we pass `label2id`, and `id2label` alongside the `model_checkpoint` here. This will make sure a custom classification head will be created (with a custom number of output neurons).

Note: If you're planning to fine-tune an already fine-tuned checkpoint, like [facebook/convnext-tiny-224](https://huggingface.co/facebook/convnext-tiny-224) (which has already been fine-tuned on ImageNet-1k), then you need to provide the additional argument `ignore_mismatched_sizes=True` to the `from_pretrained` method. This will make sure the output head (with 1000 output neurons) is thrown away and replaced with a new, randomly initialized classification head that includes a custom number of output neurons. You don't need to specify this argument if the pre-trained model doesn't include a head. 

In [None]:
from transformers import AutoModelForImageClassification
from optimum.graphcore import IPUTrainingArguments, IPUTrainer, IPUConfig

model = AutoModelForImageClassification.from_pretrained(
    model_checkpoint, 
    label2id=label2id,
    id2label=id2label,
    ignore_mismatched_sizes=True,  # provide this in case you're planning to fine-tune an already fine-tuned checkpoint
)

The warning tells us that we are throwing away some weights (the weights and bias of the `classifier` layer) and randomly initializing some others (the weights and biases of a new `classifier` layer). This is expected in this case, because we are adding a new head for which we don't have pre-trained weights, so the library warns us we should fine-tune this model before using it for inference, which is exactly what we are going to do.

To instantiate an `IPUTrainer` class, we will need to define the training configuration and the evaluation metric. 

The most important is the `IPUTrainingArguments` class, which contains all the attributes to customize the training. It requires a folder name, which will be used to save the checkpoints of the model.

Most of the training arguments are pretty self-explanatory, but one that is quite important here is `remove_unused_columns=False`. When `remove_unused_columns` is `True` (the default), the trainer drops any features not used by the model's `call` function, which usually makes it easier to unpack inputs into that function. In our case, however, we need the unused features ('image' in particular) in order to create 'pixel_values', so we set it to `False`.
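The column filtering can be sketched roughly as follows. This is an approximation for illustration, not the trainer's actual code: `forward` is a hypothetical stand-in for the model's call signature, `kept_columns` is a made-up helper, and the assumption is that a column is kept when its name matches a parameter of the call function or a common label name.

```python
import inspect


def forward(pixel_values=None, labels=None):
    """Hypothetical stand-in for an image classification model's call signature."""


def kept_columns(dataset_columns, call_fn, label_names=("label", "label_ids")):
    """Return the columns that would survive if unused ones were removed."""
    accepted = set(inspect.signature(call_fn).parameters) | set(label_names)
    return [column for column in dataset_columns if column in accepted]


# The raw dataset only has "image" and "label"; "pixel_values" is created later,
# on the fly, from "image".
print(kept_columns(["image", "label"], forward))  # ['label']
```

With 'image' removed, the on-the-fly transform would have nothing to convert into 'pixel_values', which is why column removal is switched off here.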

In [None]:
model_name = model_checkpoint.split("/")[-1]

args = IPUTrainingArguments(
    f"{model_name}-finetuned-eurosat",
    remove_unused_columns=False,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=micro_batch_size,
    per_device_eval_batch_size=micro_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    num_train_epochs=3,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    n_ipu=n_ipu,
    dataloader_drop_last=True,
    push_to_hub=False,
    # hub_model_id=f"username-or-organization/{model_name}-finetuned-eurosat",
)

Here we set the evaluation to be done at the end of each epoch, tweak the learning rate and its warmup, and customize the number of epochs for training. We use `micro_batch_size`, `gradient_accumulation_steps` and `n_ipu`, which we defined earlier in the notebook, to determine the global batch size. Since the best model might not be the one at the end of training, we ask `IPUTrainer` to load the best model it saved (according to `metric_for_best_model`) at the end of training.

Setting `push_to_hub=True` is necessary if we want to push the model to the [🤗 Models Hub](https://huggingface.co/models) regularly during training. Keep it set to `False` if you didn't follow the login and Git-LFS steps earlier in this notebook. If you want to save your model locally under a name that differs from the name of the repository it will be pushed to, or if you want to push your model under an organization rather than your own namespace, use the `hub_model_id` argument to set the repo name (it needs to be the full name, including your namespace: for instance `"nielsr/vit-finetuned-cifar10"` or `"huggingface/vit-finetuned-cifar10"`).

We also need to define the `IPUConfig` class, which specifies attributes and configuration parameters to compile and put the model on the device. We initialize it with a config name or path, which we set earlier:

In [None]:
ipu_config = IPUConfig.from_pretrained(ipu_config_name, executable_cache_dir=executable_cache_dir)

Next, we need to define a function for how to compute the metrics from the predictions, which will use `metric`, which we loaded earlier. The only preprocessing we have to do is to take the argmax of our predicted logits:

In [None]:
import numpy as np

# the compute_metrics function takes a Named Tuple as input:
# predictions, which are the logits of the model as Numpy arrays,
# and label_ids, which are the ground-truth labels as Numpy arrays.
def compute_metrics(eval_pred):
    """Computes accuracy on a batch of predictions"""
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)
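For intuition, the same computation can be written in plain Python on a toy batch. The logits and labels below are made-up values, and `argmax` and `accuracy` are hypothetical helpers that mimic what `np.argmax(..., axis=1)` and the accuracy metric do here.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, label_ids):
    """Fraction of rows whose highest-scoring class matches the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, label_ids))
    return correct / len(label_ids)


toy_logits = [
    [0.1, 2.3, -1.0],  # predicted class 1
    [1.5, 0.2, 0.3],   # predicted class 0
    [0.0, 0.4, 0.9],   # predicted class 2
    [0.7, 0.6, 0.1],   # predicted class 0
]
toy_labels = [1, 0, 2, 1]

print(accuracy(toy_logits, toy_labels))  # 0.75
```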

We also define `collate_fn`, which will be used to batch examples together.
Each batch consists of two keys, namely `pixel_values` and `labels`.

In [None]:
def collate_fn(examples):
    pixel_values = torch.stack([example["pixel_values"] for example in examples])
    labels = torch.tensor([example["label"] for example in examples])
    return {"pixel_values": pixel_values, "labels": labels}

Then we pass all of this together with our datasets to `IPUTrainer`:

In [None]:
trainer = IPUTrainer(
    model,
    ipu_config,
    args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    tokenizer=feature_extractor,
    compute_metrics=compute_metrics,
    data_collator=collate_fn,
)

You might wonder why we pass `feature_extractor` as a tokenizer when we have already preprocessed our data. This is only to make sure the feature extractor configuration file (stored as JSON) will also be uploaded to the repo on the [🤗 Models Hub](https://huggingface.co/models).

Now we can fine-tune our model by calling the `train` method:

In [None]:
train_results = trainer.train()
# the rest is optional but nice to have
trainer.save_model()
trainer.log_metrics("train", train_results.metrics)
trainer.save_metrics("train", train_results.metrics)
trainer.save_state()

We can check with the `evaluate` method that `IPUTrainer` did reload the best model properly (if it was not the last one):

In [None]:
metrics = trainer.evaluate()
# some nice to haves:
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)


You can upload the result of the training to the 🤗 Hub. Note that `IPUTrainer` will automatically create a model card as well as TensorBoard logs - see the "Training metrics" tab:

In [None]:
# trainer.push_to_hub()

Once the model is shared, other users can load it with the identifier "your-username/the-name-you-picked", for instance:

```python
from transformers import AutoModelForImageClassification, AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("nielsr/my-awesome-model")
model = AutoModelForImageClassification.from_pretrained("nielsr/my-awesome-model")

```

## Inference

Let's say you have a new image, on which you'd like to make a prediction. Let's load a satellite image of a forest (that's not part of the EuroSAT dataset), and see how the model does.

In [None]:
from PIL import Image
import requests

url = 'https://huggingface.co/nielsr/convnext-tiny-finetuned-eurostat/resolve/main/forest.png'
image = Image.open(requests.get(url, stream=True).raw)
image

We'll load the feature extractor and model from the [🤗 Transformers Hub](https://huggingface.co/transformers). In this case, we use [Auto Classes](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForImageClassification), which make sure the appropriate classes are loaded automatically (based on the `config.json` and `preprocessor_config.json` files of the repo on the Hub). Simply put your Hugging Face user or organization name into the string below to load your fine-tuned model:

In [None]:
from transformers import AutoModelForImageClassification, AutoFeatureExtractor

repo_name = f"username-or-organization/{model_name}-finetuned-eurosat"

try:
    # if the model was pushed to the hub it can be downloaded
    feature_extractor = AutoFeatureExtractor.from_pretrained(repo_name)
    model = AutoModelForImageClassification.from_pretrained(repo_name)
except Exception:
    # otherwise we use the local folder where the model was saved after training
    feature_extractor = AutoFeatureExtractor.from_pretrained(trainer.args.output_dir)
    model = AutoModelForImageClassification.from_pretrained(trainer.args.output_dir)

In [None]:
# prepare image for the model
encoding = feature_extractor(image.convert("RGB"), return_tensors="pt")
print(encoding.pixel_values.shape)

In [None]:
import torch

# forward pass
with torch.no_grad():
    outputs = model(**encoding)
    logits = outputs.logits

In [None]:
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
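If you also want a confidence score for each class, the logits can be turned into probabilities with a softmax. The sketch below uses plain Python with made-up logits and a hypothetical three-class `toy_id2label` mapping; with the real model you would apply the same idea to `logits[0]` and `model.config.id2label`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Made-up values for illustration only.
toy_logits = [0.2, 3.1, -0.5]
toy_id2label = {0: "ClassA", 1: "ClassB", 2: "ClassC"}

probabilities = softmax(toy_logits)
# Class indices sorted from most to least likely.
ranked = sorted(range(len(probabilities)), key=probabilities.__getitem__, reverse=True)
for index in ranked:
    print(f"{toy_id2label[index]}: {probabilities[index]:.3f}")
```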

Looks like our model got it right! 

## Next steps

Try out the other [IPU-powered Jupyter Notebooks](https://www.graphcore.ai/ipu-jupyter-notebooks) to see how IPUs perform on other tasks.