notebooks/fl-with-flower.ipynb
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
},
"accelerator": "GPU",
"gpuClass": "standard"
},
"cells": [
{
"cell_type": "markdown",
"source": [
"# Federated Learning using Hugging Face and Flower\n",
"\n",
"This tutorial will show how to leverage Hugging Face to federate the training of language models over multiple clients using [Flower](https://flower.dev/). More specifically, we will fine-tune a pre-trained Transformer model (alBERT) for sequence classification over a dataset of IMDB ratings. The end goal is to detect if a movie rating is positive or negative.\n"
],
"metadata": {
"id": "ESpKTVP3F_Xt"
}
},
{
"cell_type": "markdown",
"source": [
"## Dependencies\n",
"\n",
"For this tutorial we will need `datasets`, `flwr['simulation']`(here we use the extra 'simulation' dependencies from Flower as we will simulated the federated setting inside Google Colab), `torch`, and `transformers`."
],
"metadata": {
"id": "hcUWBC4ih-mp"
}
},
{
"cell_type": "code",
"source": [
"!pip install datasets evaluate flwr[\"simulation\"] torch transformers"
],
"metadata": {
"id": "zBuj5kSif2yt"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"We can now import the relevant modules."
],
"metadata": {
"id": "Q5I0ZUC4hpua"
}
},
{
"cell_type": "code",
"source": [
"from collections import OrderedDict\n",
"import os\n",
"import random\n",
"import warnings\n",
"\n",
"import flwr as fl\n",
"import torch\n",
"\n",
"from torch.utils.data import DataLoader\n",
"\n",
"from datasets import load_dataset\n",
"from evaluate import load as load_metric\n",
"\n",
"from transformers import AutoTokenizer, DataCollatorWithPadding\n",
"from transformers import AutoModelForSequenceClassification\n",
"from transformers import AdamW\n",
"from transformers import logging"
],
"metadata": {
"id": "IhNwuY-Oefau"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Next we will set some global variables and disable some of the logging to clear out our output."
],
"metadata": {
"id": "J-gZqELEhsun"
}
},
{
"cell_type": "code",
"source": [
"warnings.filterwarnings(\"ignore\", category=UserWarning)\n",
"warnings.filterwarnings(\"ignore\", category=DeprecationWarning)\n",
"logging.set_verbosity(logging.ERROR)\n",
"os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'\n",
"warnings.simplefilter('ignore')\n",
"\n",
"DEVICE = torch.device(\"cpu\")\n",
"CHECKPOINT = \"albert-base-v2\" # transformer model checkpoint\n",
"NUM_CLIENTS = 2\n",
"NUM_ROUNDS = 3"
],
"metadata": {
"id": "AH0Sx53Rehjc"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Standard Hugging Face workflow\n",
"\n",
"### Handling the data\n",
"\n",
"To fetch the IMDB dataset, we will use Hugging Face's `datasets` library. We then need to tokenize the data and create `PyTorch` dataloaders, this is all done in the `load_data` function:"
],
"metadata": {
"id": "aI21VQX-GRSb"
}
},
{
"cell_type": "code",
"source": [
"def load_data():\n",
" \"\"\"Load IMDB data (training and eval)\"\"\"\n",
" raw_datasets = load_dataset(\"imdb\")\n",
" raw_datasets = raw_datasets.shuffle(seed=42)\n",
"\n",
" # remove unnecessary data split\n",
" del raw_datasets[\"unsupervised\"]\n",
"\n",
" tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)\n",
"\n",
" def tokenize_function(examples):\n",
" return tokenizer(examples[\"text\"], truncation=True)\n",
"\n",
" # Select 20 random samples to reduce the computation cost\n",
" train_population = random.sample(range(len(raw_datasets[\"train\"])), 20)\n",
" test_population = random.sample(range(len(raw_datasets[\"test\"])), 20)\n",
"\n",
" tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)\n",
" tokenized_datasets[\"train\"] = tokenized_datasets[\"train\"].select(train_population)\n",
" tokenized_datasets[\"test\"] = tokenized_datasets[\"test\"].select(test_population)\n",
"\n",
" tokenized_datasets = tokenized_datasets.remove_columns(\"text\")\n",
" tokenized_datasets = tokenized_datasets.rename_column(\"label\", \"labels\")\n",
"\n",
" data_collator = DataCollatorWithPadding(tokenizer=tokenizer)\n",
" trainloader = DataLoader(\n",
" tokenized_datasets[\"train\"],\n",
" shuffle=True,\n",
" batch_size=32,\n",
" collate_fn=data_collator,\n",
" )\n",
"\n",
" testloader = DataLoader(\n",
" tokenized_datasets[\"test\"], batch_size=32, collate_fn=data_collator\n",
" )\n",
"\n",
" return trainloader, testloader"
],
"metadata": {
"id": "06-OMJtvekAB"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Training and testing the model\n",
"\n",
"Once we have a way of creating our trainloader and testloader, we can take care of the training and testing. This is very similar to any `PyTorch` training or testing loop:"
],
"metadata": {
"id": "s1UtfzMFGVKF"
}
},
{
"cell_type": "code",
"source": [
"def train(net, trainloader, epochs):\n",
" optimizer = AdamW(net.parameters(), lr=5e-5)\n",
" net.train()\n",
" for _ in range(epochs):\n",
" for batch in trainloader:\n",
" batch = {k: v.to(DEVICE) for k, v in batch.items()}\n",
" outputs = net(**batch)\n",
" loss = outputs.loss\n",
" loss.backward()\n",
" optimizer.step()\n",
" optimizer.zero_grad()\n",
"\n",
"\n",
"def test(net, testloader):\n",
" metric = load_metric(\"accuracy\")\n",
" loss = 0\n",
" net.eval()\n",
" for batch in testloader:\n",
" batch = {k: v.to(DEVICE) for k, v in batch.items()}\n",
" with torch.no_grad():\n",
" outputs = net(**batch)\n",
" logits = outputs.logits\n",
" loss += outputs.loss.item()\n",
" predictions = torch.argmax(logits, dim=-1)\n",
" metric.add_batch(predictions=predictions, references=batch[\"labels\"])\n",
" loss /= len(testloader.dataset)\n",
" accuracy = metric.compute()[\"accuracy\"]\n",
" return loss, accuracy"
],
"metadata": {
"id": "szd1PmUbem1v"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### Creating the model itself\n",
"\n",
"To create the model itself, we will just load the pre-trained alBERT model using Hugging Face’s `AutoModelForSequenceClassification` :"
],
"metadata": {
"id": "rVbWtgQLGhFB"
}
},
{
"cell_type": "code",
"source": [
"net = AutoModelForSequenceClassification.from_pretrained(\n",
" CHECKPOINT, num_labels=2\n",
").to(DEVICE)"
],
"metadata": {
"id": "qeiueaYKGiBf"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Federating the example\n",
"\n",
"The idea behind Federated Learning is to train a model between multiple clients and a server without having to share any data. This is done by letting each client train the model locally on its data and send its parameters back to the server, which then aggregates all the clients’ parameters together using a predefined strategy. This process is made very simple by using the [Flower](https://github.com/adap/flower) framework. If you want a more complete overview, be sure to check out this guide: [What is Federated Learning?](https://flower.dev/docs/tutorial/Flower-0-What-is-FL.html)\n",
"\n",
"### Creating the IMDBClient\n",
"\n",
"To federate our example to multiple clients, we first need to write our Flower client class (inheriting from `flwr.client.NumPyClient`). This is very easy, as our model is a standard `PyTorch` model:"
],
"metadata": {
"id": "Mx95k0TUGtSG"
}
},
{
"cell_type": "code",
"source": [
"class IMDBClient(fl.client.NumPyClient):\n",
" def __init__(self, net, trainloader, testloader):\n",
" self.net = net\n",
" self.trainloader = trainloader\n",
" self.testloader = testloader\n",
"\n",
" def get_parameters(self, config):\n",
" return [val.cpu().numpy() for _, val in self.net.state_dict().items()]\n",
"\n",
" def set_parameters(self, parameters):\n",
" params_dict = zip(self.net.state_dict().keys(), parameters)\n",
" state_dict = OrderedDict({k: torch.Tensor(v) for k, v in params_dict})\n",
" self.net.load_state_dict(state_dict, strict=True)\n",
"\n",
" def fit(self, parameters, config):\n",
" self.set_parameters(parameters)\n",
" print(\"Training Started...\")\n",
" train(self.net, self.trainloader, epochs=1)\n",
" print(\"Training Finished.\")\n",
" return self.get_parameters(config={}), len(self.trainloader), {}\n",
"\n",
" def evaluate(self, parameters, config):\n",
" self.set_parameters(parameters)\n",
" loss, accuracy = test(self.net, self.testloader)\n",
" return float(loss), len(self.testloader), {\"accuracy\": float(accuracy), \"loss\": float(loss)}"
],
"metadata": {
"id": "-sSuLWYzeuPC"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"The `get_parameters` function lets the server get the client's parameters. Inversely, the `set_parameters` function allows the server to send its parameters to the client. Finally, the `fit` function trains the model locally for the client, and the `evaluate` function tests the model locally and returns the relevant metrics. "
],
"metadata": {
"id": "PTdzUBpkG3jE"
}
},
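{
"cell_type": "markdown",
"source": [
"To see these methods in action before running the full simulation, we can exercise a single client by hand. This is purely illustrative and optional (note that running it will already update `net` by one local epoch); during the simulation below, the Flower server drives these calls for us."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Illustrative sketch only: drive one client manually to see the NumPyClient\n",
"# methods in action. During the simulation, the Flower server makes these calls\n",
"# for us. The `demo_*` names are ours and not part of the original tutorial.\n",
"demo_trainloader, demo_testloader = load_data()\n",
"demo_client = IMDBClient(net, demo_trainloader, demo_testloader)\n",
"\n",
"initial_params = demo_client.get_parameters(config={})  # the server pulls the current weights\n",
"updated_params, _, _ = demo_client.fit(initial_params, config={})  # one round of local training\n",
"loss, _, metrics = demo_client.evaluate(updated_params, config={})  # local evaluation\n",
"print(metrics)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},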
{
"cell_type": "markdown",
"source": [
"### Generating the clients\n",
"\n",
"In order to simulate the federated setting we need to provide a way to instantiate clients for our simulation. Here, it is very simple as every client will hold the same piece of data (this is not realistic, it is just used here for simplicity sakes)."
],
"metadata": {
"id": "kZDZ3KUaGapq"
}
},
{
"cell_type": "code",
"source": [
"trainloader, testloader = load_data()\n",
"def client_fn(cid):\n",
" return IMDBClient(net, trainloader, testloader)"
],
"metadata": {
"id": "y9A11kmafSwX"
},
"execution_count": null,
"outputs": []
},
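{
"cell_type": "markdown",
"source": [
"As noted above, sharing one dataset across all clients keeps the example simple but is not realistic. Below is a minimal sketch of how each client could instead receive its own partition of IMDB; the helper name `load_partitioned_data` and the even split via `Dataset.shard` are illustrative choices, not part of the original tutorial."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Sketch: give every simulated client its own slice of IMDB instead of a shared copy.\n",
"# `load_partitioned_data` is a hypothetical helper; an even split via `shard` is just\n",
"# one simple choice among many possible partitioning schemes.\n",
"def load_partitioned_data(num_clients):\n",
"    raw_datasets = load_dataset(\"imdb\").shuffle(seed=42)\n",
"    del raw_datasets[\"unsupervised\"]  # remove the unused split\n",
"\n",
"    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)\n",
"\n",
"    def tokenize_function(examples):\n",
"        return tokenizer(examples[\"text\"], truncation=True)\n",
"\n",
"    # Tokenizing the full dataset is slow; in practice you may subsample first, as above\n",
"    tokenized = raw_datasets.map(tokenize_function, batched=True)\n",
"    tokenized = tokenized.remove_columns(\"text\").rename_column(\"label\", \"labels\")\n",
"    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)\n",
"\n",
"    loaders = []\n",
"    for cid in range(num_clients):\n",
"        # `shard` deals every `num_clients`-th example to client `cid`\n",
"        train_shard = tokenized[\"train\"].shard(num_shards=num_clients, index=cid)\n",
"        test_shard = tokenized[\"test\"].shard(num_shards=num_clients, index=cid)\n",
"        loaders.append((\n",
"            DataLoader(train_shard, shuffle=True, batch_size=32, collate_fn=data_collator),\n",
"            DataLoader(test_shard, batch_size=32, collate_fn=data_collator),\n",
"        ))\n",
"    return loaders\n",
"\n",
"# With per-client loaders, `client_fn` would index into them by client id, e.g.:\n",
"# partitions = load_partitioned_data(NUM_CLIENTS)\n",
"# def client_fn(cid):\n",
"#     trainloader, testloader = partitions[int(cid)]\n",
"#     return IMDBClient(net, trainloader, testloader)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},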
{
"cell_type": "markdown",
"source": [
"## Starting the simulation\n",
"\n",
"We now have all the elements to start our simulation. The `weighted_average` function is there to provide a way to aggregate the metrics distributed amongst the clients (basically to display a nice average accuracy at the end of the training). We then define our strategy (here `FedAvg`, which will aggregate the clients weights by doing an average). \n",
"\n",
"Finally, `start_simulation` is used to start the training."
],
"metadata": {
"id": "Y7dcCPKDjaFt"
}
},
{
"cell_type": "code",
"source": [
"def weighted_average(metrics):\n",
" accuracies = [num_examples * m[\"accuracy\"] for num_examples, m in metrics]\n",
" losses = [num_examples * m[\"loss\"] for num_examples, m in metrics]\n",
" examples = [num_examples for num_examples, _ in metrics]\n",
" return {\"accuracy\": sum(accuracies) / sum(examples), \"loss\": sum(losses) / sum(examples)}\n",
"\n",
"strategy = fl.server.strategy.FedAvg(\n",
" fraction_fit=1.0,\n",
" fraction_evaluate=1.0,\n",
" evaluate_metrics_aggregation_fn=weighted_average,\n",
")\n",
"\n",
"fl.simulation.start_simulation(\n",
" client_fn=client_fn,\n",
" num_clients=NUM_CLIENTS,\n",
" config=fl.server.ServerConfig(num_rounds=NUM_ROUNDS),\n",
" strategy=strategy,\n",
" client_resources={\"num_cpus\": 1, \"num_gpus\": 0},\n",
" ray_init_args={\"log_to_driver\": False, \"num_cpus\": 1, \"num_gpus\": 0}\n",
")"
],
"metadata": {
"id": "s6Jsw70Qe_yA"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Note that this is a very basic example, and a lot can be added or modified, it was just to showcase how simply we could federate a Hugging Face workflow using Flower. The number of clients and the data samples are intentionally very small in order to quickly run inside Colab, but keep in mind that everything can be tweaked and extended."
],
"metadata": {
"id": "YaIbuJ_xmsxk"
}
}
]
}