subtitles/en/29_write-your-training-loop-in-pytorch.srt

1
00:00:00,298 --> 00:00:01,511
(air whooshing)

2
00:00:01,511 --> 00:00:02,769
(smiley face popping)

3
00:00:02,769 --> 00:00:05,460
(air whooshing)

4
00:00:05,460 --> 00:00:08,486
- Write your own training loop with PyTorch.

5
00:00:08,486 --> 00:00:09,960
In this video, we'll look at

6
00:00:09,960 --> 00:00:12,750
how we can do the same fine-tuning as in the Trainer video,

7
00:00:12,750 --> 00:00:14,760
but without relying on that class.

8
00:00:14,760 --> 00:00:17,790
This way, you'll be able to easily customize each step

9
00:00:17,790 --> 00:00:20,310
of the training loop to your needs.

10
00:00:20,310 --> 00:00:21,660
This is also very useful

11
00:00:21,660 --> 00:00:22,740
to manually debug something

12
00:00:22,740 --> 00:00:24,590
that went wrong with the Trainer API.

13
00:00:26,220 --> 00:00:28,020
Before we dive into the code,

14
00:00:28,020 --> 00:00:30,481
here is a sketch of a training loop.

15
00:00:30,481 --> 00:00:33,381
We take a batch of training data and feed it to the model.

16
00:00:34,223 --> 00:00:36,960
With the labels, we can then compute a loss.

17
00:00:36,960 --> 00:00:39,316
That number is not useful on its own;

18
00:00:39,316 --> 00:00:40,260
it is used to compute

19
00:00:40,260 --> 00:00:42,150
the gradients of our model weights,

20
00:00:42,150 --> 00:00:43,440
that is, the derivative of the loss

21
00:00:44,610 --> 00:00:47,160
with respect to each model weight.

22
00:00:47,160 --> 00:00:49,800
Those gradients are then used by the optimizer

23
00:00:49,800 --> 00:00:51,210
to update the model weights,

24
00:00:51,210 --> 00:00:53,550
and make them a little bit better.

25
00:00:53,550 --> 00:00:54,510
We then repeat the process

26
00:00:54,510 --> 00:00:56,880
with a new batch of training data.

27
00:00:56,880 --> 00:00:58,620
If any of this isn't clear,

28
00:00:58,620 --> 00:01:00,270
don't hesitate to take a refresher

29
00:01:00,270 --> 00:01:02,170
on your favorite deep learning course.

30
00:01:03,210 --> 00:01:06,000
We'll use the GLUE MRPC dataset here again,

31
00:01:06,000 --> 00:01:07,680
and we've seen how to preprocess the data

32
00:01:07,680 --> 00:01:11,130
using the Datasets library with dynamic padding.

33
00:01:11,130 --> 00:01:12,630
Check out the videos linked below

34
00:01:12,630 --> 00:01:14,280
if you haven't seen them already.

35
00:01:15,480 --> 00:01:18,930
With this done, we only have to define PyTorch DataLoaders,

36
00:01:18,930 --> 00:01:20,610
which will be responsible for converting

37
00:01:20,610 --> 00:01:23,253
the elements of our dataset into batches.

38
00:01:24,450 --> 00:01:27,960
We use our DataCollatorWithPadding as a collate function,

39
00:01:27,960 --> 00:01:29,460
and shuffle the training set

40
00:01:29,460 --> 00:01:31,080
to make sure we don't go over the samples

41
00:01:31,080 --> 00:01:33,870
in the same order at each epoch.

42
00:01:33,870 --> 00:01:36,390
To check that everything works as intended,

43
00:01:36,390 --> 00:01:38,883
we try to grab a batch of data, and inspect it.

44
00:01:40,080 --> 00:01:43,050
Like our dataset elements, it's a dictionary,

45
00:01:43,050 --> 00:01:46,260
but this time the values are not a single list of integers,

46
00:01:46,260 --> 00:01:49,053
but a tensor of shape batch size by sequence length.

47
00:01:50,460 --> 00:01:53,580
The next step is to send the training data to our model.

48
00:01:53,580 --> 00:01:56,730
For that, we'll need to actually create a model.
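The subtitles do not include the code shown on screen, so here is a rough, hedged sketch of the data preparation just described. The checkpoint name and the batch size of 8 are assumptions, not values stated in the narration.

    from datasets import load_dataset
    from torch.utils.data import DataLoader
    from transformers import AutoTokenizer, DataCollatorWithPadding

    checkpoint = "bert-base-uncased"  # assumption: any sequence-classification checkpoint works
    raw_datasets = load_dataset("glue", "mrpc")
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)

    def tokenize_function(examples):
        # Tokenize sentence pairs; padding is left to the collator (dynamic padding).
        return tokenizer(examples["sentence1"], examples["sentence2"], truncation=True)

    tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
    tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
    tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
    tokenized_datasets.set_format("torch")

    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
    train_dataloader = DataLoader(
        tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
    )
    eval_dataloader = DataLoader(
        tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
    )

    # Grab one batch to check everything works: each value is a tensor
    # of shape (batch size, sequence length).
    batch = next(iter(train_dataloader))
    print({k: v.shape for k, v in batch.items()})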
49
00:01:56,730 --> 00:01:58,740
As seen in the Model API video,

50
00:01:58,740 --> 00:02:00,540
we use the from_pretrained method,

51
00:02:00,540 --> 00:02:03,270
and adjust the number of labels to the number of classes

52
00:02:03,270 --> 00:02:06,810
we have on this dataset, here two.

53
00:02:06,810 --> 00:02:08,940
Again, to be sure everything is going well,

54
00:02:08,940 --> 00:02:11,100
we pass the batch we grabbed to our model,

55
00:02:11,100 --> 00:02:13,320
and check there is no error.

56
00:02:13,320 --> 00:02:14,940
If the labels are provided,

57
00:02:14,940 --> 00:02:16,590
the models of the Transformers library

58
00:02:16,590 --> 00:02:18,273
always return a loss directly.

59
00:02:19,525 --> 00:02:21,090
We will be able to do loss.backward()

60
00:02:21,090 --> 00:02:22,860
to compute all the gradients,

61
00:02:22,860 --> 00:02:26,460
and will then need an optimizer to do the training step.

62
00:02:26,460 --> 00:02:28,860
We use the AdamW optimizer here,

63
00:02:28,860 --> 00:02:31,440
which is a variant of Adam with proper weight decay,

64
00:02:31,440 --> 00:02:33,840
but you can pick any PyTorch optimizer you like.

65
00:02:34,830 --> 00:02:36,150
Using the previous loss,

66
00:02:36,150 --> 00:02:39,060
and computing the gradients with loss.backward(),

67
00:02:39,060 --> 00:02:41,130
we check that we can do the optimizer step

68
00:02:41,130 --> 00:02:42,030
without any error.

69
00:02:43,380 --> 00:02:45,870
Don't forget to zero your gradients afterwards,

70
00:02:45,870 --> 00:02:46,890
or at the next step,

71
00:02:46,890 --> 00:02:49,343
they will get added to the gradients you computed.

72
00:02:50,490 --> 00:02:52,080
We could already write our training loop,

73
00:02:52,080 --> 00:02:53,220
but we will add two more things

74
00:02:53,220 --> 00:02:55,620
to make it as good as it can be.

75
00:02:55,620 --> 00:02:57,690
The first one is a learning rate scheduler,

76
00:02:57,690 --> 00:03:00,140
to progressively decay our learning rate to zero.

77
00:03:01,195 --> 00:03:04,590
The get_scheduler function from the Transformers library

78
00:03:04,590 --> 00:03:06,150
is just a convenience function

79
00:03:06,150 --> 00:03:07,800
to easily build such a scheduler.

80
00:03:08,850 --> 00:03:09,683
You can again use

81
00:03:09,683 --> 00:03:11,860
any PyTorch learning rate scheduler instead.

82
00:03:13,110 --> 00:03:14,850
Finally, if we want our training

83
00:03:14,850 --> 00:03:17,610
to take a couple of minutes instead of a few hours,

84
00:03:17,610 --> 00:03:19,530
we will need to use a GPU.

85
00:03:19,530 --> 00:03:21,270
The first step is to get one,

86
00:03:21,270 --> 00:03:23,283
for instance by using a Colab notebook.

87
00:03:24,180 --> 00:03:26,040
Then you need to actually send your model

88
00:03:26,040 --> 00:03:28,923
and training data to it by using a torch device.

89
00:03:29,790 --> 00:03:30,840
Double-check that the following lines

90
00:03:30,840 --> 00:03:32,340
print a CUDA device for you,

91
00:03:32,340 --> 00:03:35,640
or be prepared for your training to last more than an hour.

92
00:03:35,640 --> 00:03:37,390
We can now put everything together.

93
00:03:38,550 --> 00:03:40,860
First, we put our model in training mode,

94
00:03:40,860 --> 00:03:42,240
which will activate the training behavior

95
00:03:42,240 --> 00:03:44,790
for some layers, like Dropout.

96
00:03:44,790 --> 00:03:46,860
Then we go through the number of epochs we picked,

97
00:03:46,860 --> 00:03:50,070
and all the data in our training dataloader.
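Again, the on-screen code is not in the subtitles; the following hedged sketch covers the setup steps just described, reusing checkpoint, batch, and train_dataloader from the previous sketch. The learning rate of 5e-5 and the 3 epochs are assumptions, not values given in the audio.

    import torch
    from torch.optim import AdamW
    from transformers import AutoModelForSequenceClassification, get_scheduler

    # Two labels, matching the two classes of MRPC.
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

    # Sanity check: with "labels" in the batch, the model returns a loss directly.
    outputs = model(**batch)
    print(outputs.loss, outputs.logits.shape)

    # AdamW with a learning rate of 5e-5 (an assumed, common default).
    optimizer = AdamW(model.parameters(), lr=5e-5)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()  # otherwise the next backward pass adds to these gradients

    # Learning rate scheduler decaying linearly to zero over training.
    num_epochs = 3
    num_training_steps = num_epochs * len(train_dataloader)
    lr_scheduler = get_scheduler(
        "linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
    )

    # Send the model to the GPU if one is available; this should print a CUDA device.
    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
    model.to(device)
    print(device)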
98
00:03:50,070 --> 00:03:52,410
Then we go through all the steps we have seen already:

99
00:03:52,410 --> 00:03:54,240
send the data to the GPU,

100
00:03:54,240 --> 00:03:55,560
compute the model outputs,

101
00:03:55,560 --> 00:03:57,720
and in particular the loss.

102
00:03:57,720 --> 00:03:59,850
Use the loss to compute gradients,

103
00:03:59,850 --> 00:04:02,880
then make a training step with the optimizer.

104
00:04:02,880 --> 00:04:04,500
Update the learning rate in our scheduler

105
00:04:04,500 --> 00:04:05,970
for the next iteration,

106
00:04:05,970 --> 00:04:07,763
and zero the gradients of the optimizer.

107
00:04:09,240 --> 00:04:10,500
Once this is finished,

108
00:04:10,500 --> 00:04:12,150
we can evaluate our model very easily

109
00:04:12,150 --> 00:04:14,283
with a metric from the Datasets library.

110
00:04:15,180 --> 00:04:17,880
First, we put our model in evaluation mode

111
00:04:17,880 --> 00:04:20,550
to deactivate layers like Dropout,

112
00:04:20,550 --> 00:04:23,850
then go through all the data in the evaluation dataloader.

113
00:04:23,850 --> 00:04:25,530
As we have seen in the Trainer video,

114
00:04:25,530 --> 00:04:26,850
the model outputs logits,

115
00:04:26,850 --> 00:04:28,530
and we need to apply the argmax function

116
00:04:28,530 --> 00:04:30,213
to convert them into predictions.

117
00:04:31,350 --> 00:04:33,420
The metric object then has an add_batch method

118
00:04:33,420 --> 00:04:36,810
we can use to send it those intermediate predictions.

119
00:04:36,810 --> 00:04:38,700
Once the evaluation loop is finished,

120
00:04:38,700 --> 00:04:40,320
we just have to call the compute method

121
00:04:40,320 --> 00:04:42,180
to get our final results.

122
00:04:42,180 --> 00:04:44,490
Congratulations, you have now fine-tuned a model

123
00:04:44,490 --> 00:04:45,633
all by yourself.

124
00:04:47,253 --> 00:04:49,920
(air whooshing)
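To close, here is a hedged sketch of the training and evaluation loops described above, again reusing the names from the previous sketches. The metric loading call is an assumption based on the Datasets library mentioned in the narration; in newer versions the same metric lives in the separate evaluate library.

    import torch
    from datasets import load_metric  # newer versions: use the evaluate library instead

    metric = load_metric("glue", "mrpc")

    # Training loop: training mode, then loop over epochs and batches.
    model.train()
    for epoch in range(num_epochs):
        for batch in train_dataloader:
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            loss = outputs.loss
            loss.backward()        # compute gradients
            optimizer.step()       # training step with the optimizer
            lr_scheduler.step()    # update the learning rate for the next iteration
            optimizer.zero_grad()  # zero the gradients of the optimizer

    # Evaluation loop: evaluation mode, no gradients needed.
    model.eval()
    for batch in eval_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = model(**batch)
        predictions = outputs.logits.argmax(dim=-1)
        metric.add_batch(predictions=predictions, references=batch["labels"])

    print(metric.compute())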