Supercharge your PyTorch training loop with Hugging Face Accelerate.

There are multiple setups on which you can run your training: it could be on CPU, GPUs, or TPUs, distributed on one machine with several devices, or even on several machines, often called nodes, each with multiple devices. On top of that, there are new tweaks to make your training faster or more efficient, like mixed precision and DeepSpeed. Each of those setups or training tweaks requires you to change the code of your training loop in one way or another and to learn a new API.

All those setups are handled by the Trainer API, and there are several third-party libraries that can help. The problem with those is that they can feel like a black box, and it might not be easy to implement the tweak you need in your training loop.

Accelerate has been designed specifically to let you retain full control over your training loop and to be as non-intrusive as possible. With just four lines of code added to your training loop, shown here on the example from the training loop video, Accelerate will handle all the setups and training tweaks mentioned on the first slide. It's only one API to learn and master instead of ten different ones.

More specifically, you have to import and instantiate an Accelerator object, which will handle all the necessary code for your specific setup. Then you have to pass the model, optimizer, and dataloaders you are using to the prepare method, which is the main method to remember. Accelerate handles device placement, so you don't need to put your batch on the specific device you are using.
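Here is a minimal sketch of what those four additions can look like on a plain PyTorch loop. The toy dataset, model, and optimizer below are placeholders just to keep the snippet self-contained; in practice you would use the ones from the training loop video.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator  # addition 1: the import

# Placeholder data, model and optimizer so the sketch runs on its own
train_dataset = TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,)))
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
model = torch.nn.Linear(16, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

accelerator = Accelerator()  # addition 2: instantiate the accelerator

# addition 3: prepare() wraps everything for the current setup; keep the returned objects
model, optimizer, train_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader
)

model.train()
for inputs, labels in train_dataloader:
    # no inputs.to(device) / labels.to(device): Accelerate already placed the batch
    loss = torch.nn.functional.cross_entropy(model(inputs), labels)
    accelerator.backward(loss)  # addition 4: replaces loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```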
Finally, as in the sketch above, you have to replace the loss.backward() line with accelerator.backward(loss), and that's all you need!

Accelerate also handles distributed evaluation. You can still use a classic evaluation loop such as the one we saw in the training loop video, in which case all processes will perform the full evaluation. To use distributed evaluation, you just have to adapt your evaluation loop like this: pass the evaluation dataloader to the accelerator.prepare method, like for training. Then you can dismiss the line that places the batch on the proper device, and, just before passing your predictions and labels to your metric, use accelerator.gather to gather together the predictions and labels from each process.

A distributed training script has to be launched several times on different processes, for instance, one per GPU you are using. You can use the PyTorch tools to do that if you are familiar with them, but Accelerate also provides an easy API to configure your setup and launch your training script. In a terminal, run accelerate config and answer the small questionnaire to generate a configuration file with all the relevant information, then you can just run accelerate launch followed by the path to your training script. In a notebook, you can use the notebook_launcher function to launch your training.
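As a rough sketch of what that distributed evaluation looks like (again with a placeholder toy model and dataset so the snippet is self-contained; a real script would reuse the prepared model from training and feed the gathered tensors to your metric of choice):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Placeholder evaluation data and model so the sketch runs on its own
eval_dataset = TensorDataset(torch.randn(32, 16), torch.randint(0, 2, (32,)))
eval_dataloader = DataLoader(eval_dataset, batch_size=8)
model = torch.nn.Linear(16, 2)

accelerator = Accelerator()
model, eval_dataloader = accelerator.prepare(model, eval_dataloader)

model.eval()
all_predictions, all_labels = [], []
for inputs, labels in eval_dataloader:
    # no .to(device) line here either: Accelerate handles placement
    with torch.no_grad():
        predictions = model(inputs).argmax(dim=-1)
    # gather predictions and labels from every process before computing the metric
    all_predictions.append(accelerator.gather(predictions))
    all_labels.append(accelerator.gather(labels))

accuracy = (torch.cat(all_predictions) == torch.cat(all_labels)).float().mean().item()
print(f"accuracy: {accuracy:.3f}")
```

To actually run such a script on several processes, the terminal commands are the ones from the video: accelerate config once to answer the questionnaire, then accelerate launch followed by the path to your script. In a notebook, the notebook_launcher function from accelerate plays the same role, for instance notebook_launcher(training_function, num_processes=2), where training_function is whatever function wraps your training loop (a hypothetical name here).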