- Supercharge your PyTorch training loop with Hugging Face Accelerate.

There are multiple setups on which you can run your training: it could be on CPU, GPUs, or TPUs, distributed on one machine with several devices, or even on several machines, often called nodes, each with multiple devices. On top of that, there are new tweaks to make your training faster or more efficient, like mixed precision and DeepSpeed.

Each of those setups or training tweaks requires you to change the code of your training loop in one way or another and to learn a new API.

All those setups are handled by the Trainer API, and there are several third-party libraries that can also help. The problem with those is that they can feel like a black box, and it might not be easy to implement the tweak to the training loop that you need.

Accelerate has been designed specifically to let you retain full control over your training loop and to be as non-intrusive as possible. With just four lines of code added to your training loop, shown here on the example from the training loop video, Accelerate will handle all the setups and training tweaks mentioned on the first slide. It's only one API to learn and master instead of ten different ones.

More specifically, you have to import and instantiate an Accelerator object, which will handle all the necessary code for your specific setup. Then you send the model, optimizer, and dataloaders you are using to its prepare method, which is the main method to remember. Accelerate handles device placement, so you don't need to put your batch on the specific device you are using. Finally, you replace the loss.backward line with accelerator.backward(loss), and that's all you need!

Accelerate also handles distributed evaluation. You can still use a classic evaluation loop such as the one we saw in the training loop video, in which case all processes will perform the full evaluation.
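Here is a minimal sketch of those four changes on a toy training loop; the tiny linear model, random dataset, and hyperparameters are placeholders for illustration, not the ones from the course video.

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Toy model and data, just to keep the sketch self-contained.
model = torch.nn.Linear(16, 2)
dataset = TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,)))
train_dataloader = DataLoader(dataset, batch_size=8, shuffle=True)
optimizer = AdamW(model.parameters(), lr=5e-5)
loss_fn = torch.nn.CrossEntropyLoss()

accelerator = Accelerator()  # change 1: instantiate the accelerator

# change 2: let Accelerate wrap everything for the current setup
model, optimizer, train_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader
)

model.train()
for inputs, labels in train_dataloader:
    # change 3: no .to(device) here, Accelerate already handled device placement
    outputs = model(inputs)
    loss = loss_fn(outputs, labels)
    accelerator.backward(loss)  # change 4: replaces loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```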
To use distributed evaluation, you just have to adapt your evaluation loop like this: pass the evaluation dataloader to the accelerator.prepare method, as for training. Then you can remove the line that places the batch on the proper device, and just before passing your predictions and labels to your metric, use accelerator.gather to gather together the predictions and labels from each process.

A distributed training script has to be launched several times on different processes, for instance, one per GPU you are using. You can use the PyTorch tools to do that if you are familiar with them, but Accelerate also provides an easy API to configure your setup and launch your training script.

In a terminal, run accelerate config and answer the small questionnaire to generate a configuration file with all the relevant information, then just run accelerate launch followed by the path to your training script. In a notebook, you can use the notebook_launcher function to launch your training.
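Continuing the same sketch, distributed evaluation would look roughly like this; the plain accuracy computation stands in for whatever metric you actually use, and eval_dataset is another toy placeholder.

```python
# Continuing the sketch above: prepare the evaluation dataloader too.
eval_dataset = TensorDataset(torch.randn(32, 16), torch.randint(0, 2, (32,)))
eval_dataloader = DataLoader(eval_dataset, batch_size=8)
eval_dataloader = accelerator.prepare(eval_dataloader)

model.eval()
correct, total = 0, 0
for inputs, labels in eval_dataloader:
    # No .to(device) needed here either.
    with torch.no_grad():
        predictions = model(inputs).argmax(dim=-1)
    # Gather predictions and labels from every process before computing the metric.
    all_predictions = accelerator.gather(predictions)
    all_labels = accelerator.gather(labels)
    correct += (all_predictions == all_labels).sum().item()
    total += all_labels.numel()
print("accuracy:", correct / total)
```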
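And for launching from a notebook, a hedged sketch of notebook_launcher; training_function is a placeholder name for whatever function wraps your whole training loop, and num_processes should match the number of devices you actually have.

```python
from accelerate import notebook_launcher

def training_function():
    # Your whole training loop goes here, including the Accelerator setup above.
    ...

# Assumption: 2 devices available; adjust num_processes to your hardware.
notebook_launcher(training_function, num_processes=2)
```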