subtitles/zh-CN/30_supercharge-your-pytorch-training-loop-with-accelerate.srt
1
00:00:00,225 --> 00:00:02,892
(空气呼啸)
(air whooshing)
2
00:00:05,460 --> 00:00:07,470
- 增强你的 PyTorch 训练循环
- Supercharge your PyTorch training loop
3
00:00:07,470 --> 00:00:08,943
使用 Hugging Face Accelerate
with Hugging Face Accelerate.
4
00:00:11,340 --> 00:00:12,600
有多个设置
There are multiple setups
5
00:00:12,600 --> 00:00:14,580
你可以在其上进行训练:
on which you can run your training:
6
00:00:14,580 --> 00:00:17,910
它可以在 CPU、GPU、TPU 上,
it could be on CPU, GPUs, TPUs,
7
00:00:17,910 --> 00:00:20,610
分布在具有多个设备的一台机器上,
distributed on one machine with several devices,
8
00:00:20,610 --> 00:00:23,220
甚至几台机器,通常称为节点 (node) ,
or even several machines, often called nodes,
9
00:00:23,220 --> 00:00:25,173
每个都有多个设备。
each with multiple devices.
10
00:00:26,340 --> 00:00:28,200
在这之上,还有新的调整
On top of that, there are new tweaks
11
00:00:28,200 --> 00:00:30,810
让你的训练更快或更高效,
to make your training faster or more efficient,
12
00:00:30,810 --> 00:00:32,763
比如混合精度和 DeepSpeed 。
like mixed precision and DeepSpeed.
13
00:00:33,840 --> 00:00:36,600
每一种设置或训练调整
Each of those setups or training tweaks
14
00:00:36,600 --> 00:00:38,760
都要求你更改训练循环的代码
requires you to change the code of your training loop
15
00:00:38,760 --> 00:00:41,733
以某种方式,并学习新的 API。
in one way or another and to learn a new API.
16
00:00:43,260 --> 00:00:45,940
所有这些设置都由 Trainer API 处理,
All those setups are handled by the Trainer API,
17
00:00:45,940 --> 00:00:49,590
并且有几个第三方库可以提供帮助。
and there are several third-party libraries that can help.
18
00:00:49,590 --> 00:00:50,760
它们的问题
The problem with those
19
00:00:50,760 --> 00:00:53,100
是它们可能让人感觉像个黑盒子
is that they can feel like a black box
20
00:00:53,100 --> 00:00:55,320
并且可能不容易把调整应用
and that it might not be easy to implement the tweak
21
00:00:55,320 --> 00:00:56,820
到你需要的训练循环上。
to the training loop you need.
22
00:00:57,840 --> 00:00:59,760
Accelerate 是专门设计的
Accelerate has been designed specifically
23
00:00:59,760 --> 00:01:02,790
以让你保持对训练循环的完全控制
to let you retain full control over your training loop
24
00:01:02,790 --> 00:01:04,833
并尽可能不具侵入性。
and be as non-intrusive as possible.
25
00:01:05,760 --> 00:01:08,760
只需在你的训练循环中添加四行代码,
With just four lines of code to add to your training loop,
26
00:01:08,760 --> 00:01:11,733
这里以训练循环视频中的例子来展示,
here shown on the example of the training loop video,
27
00:01:12,630 --> 00:01:14,730
Accelerate 将处理所有设置
Accelerate will handle all the setups
28
00:01:14,730 --> 00:01:17,180
和第一张幻灯片中提到的训练调整。
and training tweaks mentioned on the first slide.
29
00:01:18,630 --> 00:01:20,400
只需学习和掌握一个 API
It's only one API to learn and master
30
00:01:20,400 --> 00:01:21,933
而不是 10 个不同的。
instead of 10 different ones.
31
00:01:23,340 --> 00:01:25,980
更具体地说,你必须导入和实例化
More specifically, you have to import and instantiate
32
00:01:25,980 --> 00:01:27,360
一个 accelerator 对象,
an accelerator object,
33
00:01:27,360 --> 00:01:29,100
这将处理所有必要的代码
that will handle all the necessary code
34
00:01:29,100 --> 00:01:30,300
为你的特定设置。
for your specific setup.
35
00:01:31,380 --> 00:01:33,780
然后你必须把模型、
Then you have to send it the model,
36
00:01:33,780 --> 00:01:36,000
你正在使用的优化器和数据加载器
optimizer and dataloaders you are using
37
00:01:36,000 --> 00:01:39,633
传入它的 prepare 方法,这是要记住的主要方法。
in the prepare method, which is the main method to remember.
38
00:01:40,860 --> 00:01:42,870
Accelerate 会处理设备放置,
Accelerate handles device placement,
39
00:01:42,870 --> 00:01:44,370
所以你不需要把你的批次数据
so you don't need to put your batch
40
00:01:44,370 --> 00:01:46,980
放到你使用的特定设备上。
on the specific device you are using.
41
00:01:46,980 --> 00:01:50,640
最后,你必须把 loss.backward 这一行
Finally, you have to replace the loss.backward line
42
00:01:50,640 --> 00:01:54,300
替换成 accelerator.backward(loss),
by accelerator.backward(loss),
43
00:01:54,300 --> 00:01:55,500
这就是你所需要的!
and that's all you need!
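A minimal sketch of those four additions, assuming model, optimizer and train_dataloader were already created as in the training loop video (names are illustrative):

from accelerate import Accelerator

accelerator = Accelerator()

# prepare() adapts the objects to whatever setup you are running on
# (CPU, single or multiple GPUs, TPU, ...)
model, optimizer, train_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader
)

for batch in train_dataloader:
    optimizer.zero_grad()
    outputs = model(**batch)      # no manual batch.to(device) needed
    loss = outputs.loss
    accelerator.backward(loss)    # replaces loss.backward()
    optimizer.step()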
44
00:01:58,410 --> 00:02:01,710
Accelerate 还处理分布式评估。
Accelerate also handles distributed evaluation.
45
00:02:01,710 --> 00:02:04,020
你仍然可以使用经典的评估循环
You can still use a classic evaluation loop
46
00:02:04,020 --> 00:02:06,750
比如我们在训练循环视频中看到的那个,
such as the one we saw in the training loop video,
47
00:02:06,750 --> 00:02:08,280
在这种情况下所有进程
in which case all processes
48
00:02:08,280 --> 00:02:10,083
都会执行完整的评估。
will perform the full evaluation.
49
00:02:11,340 --> 00:02:13,530
要使用分布式评估,
To use a distributed evaluation,
50
00:02:13,530 --> 00:02:16,380
你只需要像这样调整你的评估循环:
you just have to adapt your evaluation loop like this:
51
00:02:16,380 --> 00:02:17,657
传递评估数据加载器
pass along the evaluation dataloader
52
00:02:17,657 --> 00:02:21,093
到 accelerator.prepare 方法,就像训练时那样。
to the accelerator.prepare method, like for training.
53
00:02:22,170 --> 00:02:23,430
然后你可以去掉
Then you can dismiss the line
54
00:02:23,430 --> 00:02:26,160
把批次数据放到对应设备上的那行代码,
that places the batch on the proper device,
55
00:02:26,160 --> 00:02:27,870
并在把你的预测
and just before passing your predictions
56
00:02:27,870 --> 00:02:31,110
和标签传给指标之前,使用 accelerator.gather
and labels to your metric, use accelerator.gather
57
00:02:31,110 --> 00:02:33,300
来收集每个进程的预测
to gather together the predictions
58
00:02:33,300 --> 00:02:34,803
和标签。
and labels from each process.
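A rough sketch of that distributed evaluation loop, assuming accelerator, model, eval_dataloader and a metric object (for instance from the evaluate library) already exist:

import torch

eval_dataloader = accelerator.prepare(eval_dataloader)

model.eval()
for batch in eval_dataloader:
    with torch.no_grad():
        outputs = model(**batch)
    predictions = outputs.logits.argmax(dim=-1)
    # gather predictions and labels from every process before feeding the metric
    metric.add_batch(
        predictions=accelerator.gather(predictions),
        references=accelerator.gather(batch["labels"]),
    )
print(metric.compute())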
59
00:02:36,420 --> 00:02:37,890
分布式训练脚本
A distributed training script
60
00:02:37,890 --> 00:02:41,040
必须在不同的进程中多次启动,
has to be launched several times on different processes,
61
00:02:41,040 --> 00:02:43,203
例如,你使用的每个 GPU 对应一个进程。
for instance, one per GPU you are using.
62
00:02:44,070 --> 00:02:46,350
你可以使用 PyTorch 工具来做到这一点
You can use the PyTorch tools to do that
63
00:02:46,350 --> 00:02:48,210
如果你熟悉它们,
if you are familiar with them,
64
00:02:48,210 --> 00:02:50,520
但 Accelerate 还提供了一个简单的 API
but Accelerate also provides an easy API
65
00:02:50,520 --> 00:02:53,523
配置你的设置并启动你的训练脚本。
to configure your setup and launch your training script.
66
00:02:54,540 --> 00:02:57,270
在终端中,运行 accelerate config
In a terminal, run accelerate config
67
00:02:57,270 --> 00:02:58,650
并回答简短的问卷
and answer the small questionnaire
68
00:02:58,650 --> 00:03:00,330
以生成一个配置文件
to generate a configuration file
69
00:03:00,330 --> 00:03:02,073
其中包含所有相关信息,
with all the relevant information,
70
00:03:03,240 --> 00:03:05,790
然后你可以直接运行 accelerate launch,
then you can just run accelerate launch,
71
00:03:05,790 --> 00:03:08,580
后面跟上你的训练脚本的路径。
followed by the path to your training script.
72
00:03:08,580 --> 00:03:12,000
在 notebook 中,你可以使用 notebook 启动器函数
In a notebook, you can use the notebook launcher function
73
00:03:12,000 --> 00:03:13,233
开始你的训练。
to launch your training.
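As a sketch of both launch options, where train.py and training_function are placeholder names for your own script and training function:

# In a terminal:
#   accelerate config
#   accelerate launch train.py

# In a notebook:
from accelerate import notebook_launcher

notebook_launcher(training_function, num_processes=2)  # num_processes shown as an example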
74
00:03:15,186 --> 00:03:17,853
(空气呼啸)
(air whooshing)