- Supercharge your PyTorch training loop with Hugging Face Accelerate.

There are multiple setups on which you can run your training: it could be on CPU, GPUs, or TPUs, distributed on one machine with several devices, or even on several machines, often called nodes, each with multiple devices. On top of that, there are new tweaks to make your training faster or more efficient, like mixed precision and DeepSpeed.

Each of those setups or training tweaks requires you to change the code of your training loop in one way or another and to learn a new API.

All those setups are handled by the Trainer API, and there are several third-party libraries that can also help. The problem with those is that they can feel like a black box, and it might not be easy to implement the tweak to the training loop that you need.

Accelerate has been designed specifically to let you retain full control over your training loop and to be as non-intrusive as possible. With just four lines of code added to your training loop, shown here on the example from the training loop video, Accelerate will handle all the setups and training tweaks mentioned on the first slide. It's only one API to learn and master instead of ten different ones.

More specifically, you have to import and instantiate an Accelerator object, which will handle all the necessary code for your specific setup. Then you send the model, optimizer, and dataloaders you are using to its prepare method, which is the main method to remember. Accelerate handles device placement, so you don't need to put your batch on the specific device you are using. Finally, you replace the loss.backward line with accelerator.backward(loss), and that's all you need!

Accelerate also handles distributed evaluation. You can still use a classic evaluation loop such as the one we saw in the training loop video, in which case all processes will perform the full evaluation.
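Here is a minimal sketch of those four changes on a toy training loop; the tiny linear model, random dataset, and hyperparameters are placeholders for illustration, not the ones from the course video.

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Toy model and data, just to keep the sketch self-contained.
model = torch.nn.Linear(16, 2)
dataset = TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,)))
train_dataloader = DataLoader(dataset, batch_size=8, shuffle=True)
optimizer = AdamW(model.parameters(), lr=5e-5)
loss_fn = torch.nn.CrossEntropyLoss()

accelerator = Accelerator()  # change 1: instantiate the accelerator

# change 2: let Accelerate wrap everything for the current setup
model, optimizer, train_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader
)

model.train()
for inputs, labels in train_dataloader:
    # change 3: no .to(device) here, Accelerate already handled device placement
    outputs = model(inputs)
    loss = loss_fn(outputs, labels)
    accelerator.backward(loss)  # change 4: replaces loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```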
To use distributed evaluation, you just have to adapt your evaluation loop like this: pass the evaluation dataloader to the accelerator.prepare method, as for training. Then you can remove the line that places the batch on the proper device, and just before passing your predictions and labels to your metric, use accelerator.gather to gather together the predictions and labels from each process.

A distributed training script has to be launched several times on different processes, for instance, one per GPU you are using. You can use the PyTorch tools to do that if you are familiar with them, but Accelerate also provides an easy API to configure your setup and launch your training script.

In a terminal, run accelerate config and answer the small questionnaire to generate a configuration file with all the relevant information, then just run accelerate launch followed by the path to your training script. In a notebook, you can use the notebook_launcher function to launch your training.
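Continuing the same sketch, distributed evaluation would look roughly like this; the plain accuracy computation stands in for whatever metric you actually use, and eval_dataset is another toy placeholder.

```python
# Continuing the sketch above: prepare the evaluation dataloader too.
eval_dataset = TensorDataset(torch.randn(32, 16), torch.randint(0, 2, (32,)))
eval_dataloader = DataLoader(eval_dataset, batch_size=8)
eval_dataloader = accelerator.prepare(eval_dataloader)

model.eval()
correct, total = 0, 0
for inputs, labels in eval_dataloader:
    # No .to(device) needed here either.
    with torch.no_grad():
        predictions = model(inputs).argmax(dim=-1)
    # Gather predictions and labels from every process before computing the metric.
    all_predictions = accelerator.gather(predictions)
    all_labels = accelerator.gather(labels)
    correct += (all_predictions == all_labels).sum().item()
    total += all_labels.numel()
print("accuracy:", correct / total)
```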
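And for launching from a notebook, a hedged sketch of notebook_launcher; training_function is a placeholder name for whatever function wraps your whole training loop, and num_processes should match the number of devices you actually have.

```python
from accelerate import notebook_launcher

def training_function():
    # Your whole training loop goes here, including the Accelerator setup above.
    ...

# Assumption: 2 devices available; adjust num_processes to your hardware.
notebook_launcher(training_function, num_processes=2)
```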