subtitles/en/30_supercharge-your-pytorch-training-loop-with-accelerate.srt
1
00:00:00,225 --> 00:00:02,892
(air whooshing)
2
00:00:05,460 --> 00:00:07,470
- Supercharge your PyTorch training loop
3
00:00:07,470 --> 00:00:08,943
with Hugging Face Accelerate.
4
00:00:11,340 --> 00:00:12,600
There are multiple setups
5
00:00:12,600 --> 00:00:14,580
on which you can run your training:
6
00:00:14,580 --> 00:00:17,910
it could be on CPU, GPUs, TPUs,
7
00:00:17,910 --> 00:00:20,610
distributed on one machine
with several devices,
8
00:00:20,610 --> 00:00:23,220
or even several machines,
often called nodes,
9
00:00:23,220 --> 00:00:25,173
each with multiple devices.
10
00:00:26,340 --> 00:00:28,200
On top of that, there are new tweaks
11
00:00:28,200 --> 00:00:30,810
to make your training
faster or more efficient,
12
00:00:30,810 --> 00:00:32,763
like mixed precision and DeepSpeed.
13
00:00:33,840 --> 00:00:36,600
Each of those setups or training tweaks
14
00:00:36,600 --> 00:00:38,760
requires you to change the
code of your training loop
15
00:00:38,760 --> 00:00:41,733
in one way or another
and to learn a new API.
16
00:00:43,260 --> 00:00:45,940
All those setups are
handled by the Trainer API,
17
00:00:45,940 --> 00:00:49,590
and there are several third-party
libraries that can help.
18
00:00:49,590 --> 00:00:50,760
The problem with those
19
00:00:50,760 --> 00:00:53,100
is that they can feel like a black box
20
00:00:53,100 --> 00:00:55,320
and that it might not be
easy to implement the tweak
21
00:00:55,320 --> 00:00:56,820
you need in your training loop.
22
00:00:57,840 --> 00:00:59,760
Accelerate has been designed specifically
23
00:00:59,760 --> 00:01:02,790
to let you retain full control
over your training loop
24
00:01:02,790 --> 00:01:04,833
and be as non-intrusive as possible.
25
00:01:05,760 --> 00:01:08,760
With just four lines of code
to add to your training loop,
26
00:01:08,760 --> 00:01:11,733
shown here on the example
from the training loop video,
27
00:01:12,630 --> 00:01:14,730
Accelerate will handle all the setups
28
00:01:14,730 --> 00:01:17,180
and training tweaks
mentioned on the first slide.
29
00:01:18,630 --> 00:01:20,400
It's only one API to learn and master
30
00:01:20,400 --> 00:01:21,933
instead of 10 different ones.
31
00:01:23,340 --> 00:01:25,980
More specifically, you have
to import and instantiate
32
00:01:25,980 --> 00:01:27,360
an Accelerator object,
33
00:01:27,360 --> 00:01:29,100
that will handle all the necessary code
34
00:01:29,100 --> 00:01:30,300
for your specific setup.
35
00:01:31,380 --> 00:01:33,780
Then you have to send it the model,
36
00:01:33,780 --> 00:01:36,000
optimizer and dataloaders you are using
37
00:01:36,000 --> 00:01:39,633
in the prepare method, which
is the main method to remember.
38
00:01:40,860 --> 00:01:42,870
Accelerate handles device placement,
39
00:01:42,870 --> 00:01:44,370
so you don't need to put your batch
40
00:01:44,370 --> 00:01:46,980
on the specific device you are using.
41
00:01:46,980 --> 00:01:50,640
Finally, you have to replace
the loss.backward() line
42
00:01:50,640 --> 00:01:54,300
with accelerator.backward(loss),
43
00:01:54,300 --> 00:01:55,500
and that's all you need!
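Here is a minimal sketch of such a loop. The model, dataset, and hyperparameters are illustrative placeholders (bert-base-uncased, a hypothetical tokenized_train_dataset, a learning rate of 5e-5), not the exact code from the video.

```python
# Minimal sketch of an Accelerate-powered training loop.
# `tokenized_train_dataset` is a hypothetical, already-tokenized PyTorch dataset.
from accelerate import Accelerator
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification

accelerator = Accelerator()  # 1. instantiate the accelerator
# (mixed precision can be requested here, e.g. Accelerator(mixed_precision="fp16"))

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
optimizer = AdamW(model.parameters(), lr=5e-5)
train_dataloader = DataLoader(tokenized_train_dataset, shuffle=True, batch_size=8)

# 2. let Accelerate wrap the objects for the current setup (CPU, GPU, TPU, multi-device...)
model, optimizer, train_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader
)

model.train()
for batch in train_dataloader:
    # 3. no manual `batch = {k: v.to(device) for k, v in batch.items()}`:
    #    Accelerate handles device placement.
    outputs = model(**batch)
    loss = outputs.loss
    accelerator.backward(loss)  # 4. replaces loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```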
44
00:01:58,410 --> 00:02:01,710
Accelerate also handles
distributed evaluation.
45
00:02:01,710 --> 00:02:04,020
You can still use a
classic evaluation loop
46
00:02:04,020 --> 00:02:06,750
such as the one we saw in
the training loop video,
47
00:02:06,750 --> 00:02:08,280
in which case all processes
48
00:02:08,280 --> 00:02:10,083
will perform the full evaluation.
49
00:02:11,340 --> 00:02:13,530
To use distributed evaluation,
50
00:02:13,530 --> 00:02:16,380
you just have to adapt your
evaluation loop like this:
51
00:02:16,380 --> 00:02:17,657
pass along the evaluation dataloader
52
00:02:17,657 --> 00:02:21,093
to the accelerator.prepare
method, like for training.
53
00:02:22,170 --> 00:02:23,430
Then you can dismiss the line
54
00:02:23,430 --> 00:02:26,160
that places the batch
on the proper device,
55
00:02:26,160 --> 00:02:27,870
and just before passing your predictions
56
00:02:27,870 --> 00:02:31,110
and labels to your metric,
use accelerator.gather
57
00:02:31,110 --> 00:02:33,300
to gather together the predictions
58
00:02:33,300 --> 00:02:34,803
and labels from each process.
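A hedged sketch of that adapted evaluation loop follows; accelerator, model, eval_dataloader, and metric (for instance one loaded with the evaluate library) are assumed to already exist as in the loop above.

```python
import torch

# Prepare the evaluation dataloader too, like for training.
eval_dataloader = accelerator.prepare(eval_dataloader)

model.eval()
for batch in eval_dataloader:
    # No manual device placement needed here either.
    with torch.no_grad():
        outputs = model(**batch)
    predictions = outputs.logits.argmax(dim=-1)
    # Gather predictions and labels from every process before updating the metric.
    metric.add_batch(
        predictions=accelerator.gather(predictions),
        references=accelerator.gather(batch["labels"]),
    )
print(metric.compute())
```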
59
00:02:36,420 --> 00:02:37,890
A distributed training script
60
00:02:37,890 --> 00:02:41,040
has to be launched several
times on different processes,
61
00:02:41,040 --> 00:02:43,203
for instance, one per GPU you are using.
62
00:02:44,070 --> 00:02:46,350
You can use the PyTorch tools to do that
63
00:02:46,350 --> 00:02:48,210
if you are familiar with them,
64
00:02:48,210 --> 00:02:50,520
but Accelerate also provides an easy API
65
00:02:50,520 --> 00:02:53,523
to configure your setup and
launch your training script.
66
00:02:54,540 --> 00:02:57,270
In a terminal, run accelerate config
67
00:02:57,270 --> 00:02:58,650
and answer the small questionnaire
68
00:02:58,650 --> 00:03:00,330
to generate a configuration file
69
00:03:00,330 --> 00:03:02,073
with all the relevant information,
70
00:03:03,240 --> 00:03:05,790
then you can just run accelerate launch,
71
00:03:05,790 --> 00:03:08,580
followed by the path to
your training script.
72
00:03:08,580 --> 00:03:12,000
In a notebook, you can use
the notebook_launcher function
73
00:03:12,000 --> 00:03:13,233
to launch your training.
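As a sketch, launching from a notebook can look like the snippet below; training_function is a placeholder wrapping the training loop shown earlier, and num_processes=2 is purely an illustrative value. From a terminal, you would instead run accelerate config once, then accelerate launch followed by the path to your training script.

```python
from accelerate import notebook_launcher


def training_function():
    ...  # the Accelerate training loop shown earlier


# Spawns the training function on several processes (2 here, purely illustrative).
notebook_launcher(training_function, num_processes=2)
```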
74
00:03:15,186 --> 00:03:17,853
(air whooshing)