subtitles/en/73_debugging-the-training-pipeline-(pytorch).srt

1
00:00:06,210 --> 00:00:08,760
- In this video, we will see how to debug an error

2
00:00:08,760 --> 00:00:11,896
you encounter when running Trainer.train.

3
00:00:11,896 --> 00:00:15,066
As an example, we will use this script that fine-tunes

4
00:00:15,066 --> 00:00:17,760
a BERT model on the GLUE MNLI dataset.

5
00:00:17,760 --> 00:00:19,470
Check out the videos linked below

6
00:00:19,470 --> 00:00:21,840
to see how we came to such a script.

7
00:00:21,840 --> 00:00:24,540
Here we want to learn how to debug the problems in it.

8
00:00:25,470 --> 00:00:28,110
Running the script gives us an error pretty quickly.

9
00:00:28,110 --> 00:00:29,040
It happens at the line

10
00:00:29,040 --> 00:00:30,990
where we feed the inputs to the model,

11
00:00:30,990 --> 00:00:32,850
according to the traceback.

12
00:00:32,850 --> 00:00:34,702
That tells us there is a problem there,

13
00:00:34,702 --> 00:00:37,881
but the problem could come from many different causes.

14
00:00:37,881 --> 00:00:39,330
To debug an error in training,

15
00:00:39,330 --> 00:00:41,760
you need to make sure each step of the training pipeline

16
00:00:41,760 --> 00:00:43,440
works as intended.

17
00:00:43,440 --> 00:00:45,780
This means checking that the inputs of your dataset

18
00:00:45,780 --> 00:00:47,040
are correct,

19
00:00:47,040 --> 00:00:48,720
that you can batch them together,

20
00:00:48,720 --> 00:00:50,790
feed them through the model to get a loss,

21
00:00:50,790 --> 00:00:52,500
then compute the gradients of that loss

22
00:00:52,500 --> 00:00:54,303
before performing an optimizer step.

23
00:00:55,470 --> 00:00:57,810
So let's start by looking at the training dataset

24
00:00:57,810 --> 00:00:59,043
this Trainer is using.

25
00:00:59,910 --> 00:01:02,190
There is definitely a problem here.

26
00:01:02,190 --> 00:01:04,293
We see texts and not numbers.

27
00:01:05,130 --> 00:01:06,660
The error message was telling us the model

28
00:01:06,660 --> 00:01:08,220
did not get input IDs,

29
00:01:08,220 --> 00:01:11,100
and indeed we do not have those in the dataset.

30
00:01:11,100 --> 00:01:12,660
Looking back at our code,

31
00:01:12,660 --> 00:01:14,400
we can see we made a mistake

32
00:01:14,400 --> 00:01:17,400
and passed the wrong datasets to the Trainer.

33
00:01:17,400 --> 00:01:19,173
So let's fix that and run again.

34
00:01:20,490 --> 00:01:21,840
Now we have a new error.

35
00:01:21,840 --> 00:01:23,130
Inspecting the traceback

36
00:01:23,130 --> 00:01:25,860
tells us it happens when we try to create a batch,

37
00:01:25,860 --> 00:01:28,743
specifically when grouping the features into a tensor.

38
00:01:29,700 --> 00:01:32,610
We can confirm this by asking the Trainer to get us a batch

39
00:01:32,610 --> 00:01:34,230
from the training data loader,

40
00:01:34,230 --> 00:01:35,913
which reproduces the same error.

41
00:01:36,780 --> 00:01:39,064
Either by inspecting the inputs or debugging,

42
00:01:39,064 --> 00:01:42,870
we can then see they are not all of the same size.

43
00:01:42,870 --> 00:01:45,120
This is because we have not passed a data collator

44
00:01:45,120 --> 00:01:46,890
to the Trainer to do the padding,

45
00:01:46,890 --> 00:01:49,443
and didn't pad when preprocessing the data either.

46
00:01:50,430 --> 00:01:52,710
Padding inside the Trainer is normally the default,

47
00:01:52,710 --> 00:01:55,380
but only if you provide your tokenizer to the Trainer,

48
00:01:55,380 --> 00:01:57,270
and we forgot to do that.

49
00:01:57,270 --> 00:01:59,120
So let's fix the issue and run again.
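
[Editor's note: a minimal Python sketch of the checks described above, assuming a Trainer built as in the script this video debugs; names such as model, training_args, tokenizer, and tokenized_datasets come from that script and are illustrative.]

from transformers import Trainer, DataCollatorWithPadding

# 1. Look at what the Trainer will actually train on: we expect input_ids,
#    attention_mask, and labels, not raw text columns.
print(trainer.train_dataset[0])

# 2. Ask the Trainer for one batch from the training data loader; this
#    reproduces the collation error if the samples cannot be stacked.
batch = next(iter(trainer.get_train_dataloader()))

# 3. Fix: pass the preprocessed datasets and the tokenizer (or an explicit
#    data collator) so samples get padded to the same length in each batch.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation_matched"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)
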
50
00:02:00,510 --> 00:02:02,883
This time we get a nasty CUDA error.

51
00:02:03,765 --> 00:02:06,285
They are very difficult to debug because, for one,

52
00:02:06,285 --> 00:02:10,530
they put your kernel in a state that is not recoverable,

53
00:02:10,530 --> 00:02:13,260
so you have to restart your notebook from the beginning,

54
00:02:13,260 --> 00:02:16,950
and two, the traceback is completely useless for those.

55
00:02:16,950 --> 00:02:19,230
Here the traceback tells us the error happens

56
00:02:19,230 --> 00:02:22,500
when we do the gradient computation with loss.backward,

57
00:02:22,500 --> 00:02:25,113
but as we will see later on, that is not the case.

58
00:02:26,520 --> 00:02:28,920
This is because everything that happens on the GPU

59
00:02:28,920 --> 00:02:30,720
is done asynchronously.

60
00:02:30,720 --> 00:02:32,880
When you execute the model call,

61
00:02:32,880 --> 00:02:34,457
what the program does is just stack it

62
00:02:34,457 --> 00:02:36,600
in the queue of the GPU,

63
00:02:36,600 --> 00:02:39,856
then, if the GPU didn't have any current job to do,

64
00:02:39,856 --> 00:02:41,850
the work will start on the GPU at the same time

65
00:02:41,850 --> 00:02:45,000
as the CPU moves to the next instruction.

66
00:02:45,000 --> 00:02:47,040
Continuing with the extraction of the loss,

67
00:02:47,040 --> 00:02:49,170
this is stacked into the GPU queue

68
00:02:49,170 --> 00:02:51,953
while the CPU moves to the instruction loss.backward.

69
00:02:51,953 --> 00:02:54,180
But the GPU still hasn't finished

70
00:02:54,180 --> 00:02:55,710
the forward pass of the model,

71
00:02:55,710 --> 00:02:57,603
since all that took no time at all.

72
00:02:58,440 --> 00:03:00,210
The CPU stops moving forward,

73
00:03:00,210 --> 00:03:03,240
because loss.backward is an instruction telling it to wait

74
00:03:03,240 --> 00:03:04,830
for the GPU to be finished,

75
00:03:04,830 --> 00:03:06,780
to make sure the gradients are correct.

76
00:03:07,650 --> 00:03:09,570
When the GPU encounters an error,

77
00:03:09,570 --> 00:03:13,140
it gives it back to the CPU with a cryptic message,

78
00:03:13,140 --> 00:03:15,423
which raises the error at the wrong place.

79
00:03:16,350 --> 00:03:18,720
So to debug this, we will need to execute the next steps

80
00:03:18,720 --> 00:03:21,211
of the training pipeline on the CPU.

81
00:03:21,211 --> 00:03:22,380
It is very easy to do,

82
00:03:22,380 --> 00:03:25,350
and we get a traceback we can trust this time.

83
00:03:25,350 --> 00:03:26,520
As we said before,

84
00:03:26,520 --> 00:03:28,620
the error actually happens during the forward pass

85
00:03:28,620 --> 00:03:29,453
of the model,

86
00:03:29,453 --> 00:03:30,993
and not in loss.backward.

87
00:03:31,920 --> 00:03:33,680
It's an index error.

88
00:03:33,680 --> 00:03:34,950
With a bit of debugging,

89
00:03:34,950 --> 00:03:37,410
we see we have labels ranging from 0 to 2,

90
00:03:37,410 --> 00:03:39,000
so three different values,

91
00:03:39,000 --> 00:03:42,191
but our outputs have a shape of batch size by 2.

92
00:03:42,191 --> 00:03:45,600
It looks like our model has the wrong number of labels.

93
00:03:45,600 --> 00:03:47,190
We can indeed confirm that,

94
00:03:47,190 --> 00:03:49,860
and now that we know, it's easy to fix in the code

95
00:03:49,860 --> 00:03:53,969
by adding num_labels=3 when we create the model.

96
00:03:53,969 --> 00:03:56,883
Now the training script will run to completion.
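
[Editor's note: a sketch of the CPU round trip and label check described above, again assuming the trainer and batch from the previous snippet; the checkpoint name is only a placeholder for whatever the script uses.]

from transformers import AutoModelForSequenceClassification

# Move one batch and the model to CPU so the error is raised synchronously,
# at the operation that actually fails.
cpu_batch = {k: v.cpu() for k, v in batch.items()}
cpu_model = trainer.model.cpu()

print(cpu_batch["labels"].unique())   # tensor([0, 1, 2]): three classes

# The forward pass is where the real IndexError is raised, because the
# classification head only has two outputs for three possible labels.
outputs = cpu_model(**cpu_batch)

# Fix: recreate the model with the right number of labels for MNLI.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased",  # placeholder checkpoint name
    num_labels=3,
)
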
97
00:03:58,440 --> 00:03:59,430
We did not need it yet,

98
00:03:59,430 --> 00:04:00,960
but here is how we would debug the next steps

99
00:04:00,960 --> 00:04:02,944
of the pipeline: gradient computation,

100
00:04:02,944 --> 00:04:05,850
as well as the optimizer step.

101
00:04:05,850 --> 00:04:08,823
With all of this, good luck debugging your own training runs!
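
[Editor's note: a sketch of how those last two steps could be checked by hand, assuming the fixed trainer and one batch from its data loader; the optimizer here (AdamW with an arbitrary learning rate) is only an example, not necessarily what the Trainer itself would build.]

from torch.optim import AdamW

# Run a manual forward and backward pass on one batch
# to check that gradient computation works.
device = next(trainer.model.parameters()).device
batch = {k: v.to(device) for k, v in batch.items()}
loss = trainer.model(**batch).loss
loss.backward()

# Then check that an optimizer step goes through.
optimizer = AdamW(trainer.model.parameters(), lr=5e-5)
optimizer.step()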