subtitles/en/74_debugging-the-training-pipeline-(tensorflow).srt

1 00:00:00,212 --> 00:00:02,879 (air whooshing) 2 00:00:04,680 --> 00:00:08,130 - Some bugs in your code are very straightforward. 3 00:00:08,130 --> 00:00:11,580 You try running it, you get a syntax error somewhere, 4 00:00:11,580 --> 00:00:14,490 Python tells you exactly where, and you fix it. 5 00:00:14,490 --> 00:00:17,760 This is great, it's simple and it's satisfying. 6 00:00:17,760 --> 00:00:20,310 Sometimes, though, things crash 7 00:00:20,310 --> 00:00:23,670 and the error is impossible to understand. 8 00:00:23,670 --> 00:00:26,700 This happens a lot in machine learning for a few reasons, 9 00:00:26,700 --> 00:00:29,310 you're working with big data structures, 10 00:00:29,310 --> 00:00:31,440 you're using these big, complex libraries 11 00:00:31,440 --> 00:00:33,420 with a lot of moving parts, 12 00:00:33,420 --> 00:00:35,310 and also you're doing a lot of GPU computing, 13 00:00:35,310 --> 00:00:38,490 and that in general is much more difficult to debug. 14 00:00:38,490 --> 00:00:40,260 In Keras there's the additional problem 15 00:00:40,260 --> 00:00:43,140 that your models are often compiled before execution, 16 00:00:43,140 --> 00:00:44,400 which is great for performance 17 00:00:44,400 --> 00:00:47,430 but it makes debugging them very difficult as well. 18 00:00:47,430 --> 00:00:50,370 So, this is going to be a video about what to do 19 00:00:50,370 --> 00:00:52,410 when you run into one of those nightmare bugs 20 00:00:52,410 --> 00:00:55,210 and you just have no idea where to begin with fixing it. 21 00:00:56,370 --> 00:00:58,920 So, to give you some intuitions for 22 00:00:58,920 --> 00:01:01,530 the most common things that go wrong 23 00:01:01,530 --> 00:01:03,573 and cause these weird issues, 24 00:01:04,800 --> 00:01:07,530 and show you where to look for the sources of bugs 25 00:01:07,530 --> 00:01:10,560 that you encounter, let's use this example script. 26 00:01:10,560 --> 00:01:12,900 So, I'll show it to you here in two parts. 
27 00:01:12,900 --> 00:01:16,410 First, we do all our imports, we load a dataset, 28 00:01:16,410 --> 00:01:20,280 we create our tokenizer and we tokenize the dataset. 29 00:01:20,280 --> 00:01:23,640 Next, we convert our datasets to TensorFlow datasets, 30 00:01:23,640 --> 00:01:26,100 so that's tf.data.Dataset, 31 00:01:26,100 --> 00:01:28,500 and that's so that we can run fit on them, 32 00:01:28,500 --> 00:01:31,170 and then we load our model from a pretrained checkpoint, 33 00:01:31,170 --> 00:01:33,870 we compile it and we fit it with those datasets. 34 00:01:33,870 --> 00:01:35,970 So, this seems straightforward enough, 35 00:01:35,970 --> 00:01:38,220 it's similar to what we've done in the course before. 36 00:01:38,220 --> 00:01:40,650 But beware, this is spooky code 37 00:01:40,650 --> 00:01:43,590 and hides many dark and mysterious secrets. 38 00:01:43,590 --> 00:01:46,050 So, what happens when we run it? 39 00:01:46,050 --> 00:01:48,840 Well, it's not great. 40 00:01:48,840 --> 00:01:52,320 So, we get this error message, but what does it mean? 41 00:01:52,320 --> 00:01:55,470 We tried to train on our data, but we got no gradient? 42 00:01:55,470 --> 00:01:59,130 It's pretty perplexing, I mean, how do we even begin 43 00:01:59,130 --> 00:02:01,500 to debug not getting a gradient? 44 00:02:01,500 --> 00:02:03,930 So, when the error you get doesn't immediately suggest 45 00:02:03,930 --> 00:02:06,630 where the problem is, the best solution 46 00:02:06,630 --> 00:02:09,180 is often to walk through things in sequence, 47 00:02:09,180 --> 00:02:12,900 making sure at each stage that the outputs look right, 48 00:02:12,900 --> 00:02:15,300 that everything looks okay at that point. 49 00:02:15,300 --> 00:02:17,730 And, of course, that means the place to start 50 00:02:17,730 --> 00:02:19,473 is always to check your data. 
51 00:02:20,670 --> 00:02:22,050 So, the best way to make sure 52 00:02:22,050 --> 00:02:24,480 that the data you're giving the model is good, 53 00:02:24,480 --> 00:02:27,690 is to grab a batch from the tf.data.Dataset 54 00:02:27,690 --> 00:02:29,520 that your model is training on, 55 00:02:29,520 --> 00:02:31,560 and that's because it's right at the end 56 00:02:31,560 --> 00:02:33,990 of the data pipeline. 57 00:02:33,990 --> 00:02:36,990 And so that means that if those outputs are good, 58 00:02:36,990 --> 00:02:39,990 you're guaranteed that your data pipeline is working well. 59 00:02:39,990 --> 00:02:42,600 So, we can do that by looping over the dataset 60 00:02:42,600 --> 00:02:44,790 for one iteration and then breaking, 61 00:02:44,790 --> 00:02:46,980 and that gives us a single batch. 62 00:02:46,980 --> 00:02:49,443 So, what do we get when we inspect that batch? 63 00:02:50,460 --> 00:02:52,500 We'll see that we're not getting any gradient 64 00:02:52,500 --> 00:02:55,530 because we're not passing labels to Keras. 65 00:02:55,530 --> 00:02:57,510 So, our labels are in the batch, 66 00:02:57,510 --> 00:02:59,670 but they're a key in the input dictionary 67 00:02:59,670 --> 00:03:02,340 and they're not a separate label as Keras expects, 68 00:03:02,340 --> 00:03:04,830 so this is one of the most common issues you'll encounter 69 00:03:04,830 --> 00:03:07,590 when training Transformers models with TensorFlow. 70 00:03:07,590 --> 00:03:10,980 Our models can all compute loss internally, 71 00:03:10,980 --> 00:03:13,140 but to use that loss for training 72 00:03:13,140 --> 00:03:15,960 the labels need to be passed in the input dictionary, 73 00:03:15,960 --> 00:03:17,940 where the model can see them. 74 00:03:17,940 --> 00:03:20,280 This internal loss is the loss that we use 75 00:03:20,280 --> 00:03:23,760 when we don't specify a loss when we call compile, 76 00:03:23,760 --> 00:03:25,660 when we don't specify a loss argument. 
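The batch-grabbing trick described above can be sketched in plain Python (toy stand-ins here, not the real tf.data.Dataset API — a tf.data.Dataset is simply iterable in the same way):

```python
def first_batch(dataset):
    # Loop over the dataset for one iteration and break, as described above,
    # so we can inspect exactly what the model will see during training.
    for batch in dataset:
        return batch
    return None

# Hypothetical toy "dataset": a list of batch dictionaries standing in
# for what a tokenized tf.data.Dataset would yield.
toy_dataset = [
    {"input_ids": [[101, 2023, 102]], "attention_mask": [[1, 1, 1]], "labels": [2]},
    {"input_ids": [[101, 2003, 102]], "attention_mask": [[1, 1, 1]], "labels": [0]},
]

batch = first_batch(toy_dataset)
```

Inspecting `batch` at this point is what reveals, in the video's example, that the labels are sitting inside the input dictionary.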
77 00:03:26,520 --> 00:03:27,870 So, Keras, on the other hand, 78 00:03:27,870 --> 00:03:30,570 usually expects labels to be passed separately 79 00:03:30,570 --> 00:03:32,130 from the input dictionary, 80 00:03:32,130 --> 00:03:34,110 and not to be visible to the model, 81 00:03:34,110 --> 00:03:36,600 and loss computations will usually fail 82 00:03:36,600 --> 00:03:38,220 if you don't do that. 83 00:03:38,220 --> 00:03:40,380 So we need to choose one or the other, 84 00:03:40,380 --> 00:03:42,780 either we use the model's internal loss 85 00:03:42,780 --> 00:03:44,940 and keep the labels where they are, 86 00:03:44,940 --> 00:03:46,980 or we keep using Keras losses 87 00:03:46,980 --> 00:03:50,520 but we move the labels to the place Keras expects them. 88 00:03:50,520 --> 00:03:53,310 So, for simplicity here, let's fix this issue 89 00:03:53,310 --> 00:03:55,860 by using the model's internal losses, 90 00:03:55,860 --> 00:03:57,900 and we do that by removing the loss argument 91 00:03:57,900 --> 00:03:59,343 from the call to compile. 92 00:04:00,540 --> 00:04:03,000 So, what happens if we try training now? 93 00:04:03,000 --> 00:04:08,000 So we recompile with that, we call model.fit, what happens? 94 00:04:08,220 --> 00:04:13,050 Well, it runs this time but now we get a loss of NaN. 95 00:04:13,050 --> 00:04:16,440 So, that's not good, NaN means not a number 96 00:04:16,440 --> 00:04:19,140 and it's not a good loss to have in general. 97 00:04:19,140 --> 00:04:21,000 In fact, if we inspect our model now, 98 00:04:21,000 --> 00:04:23,970 we'll see that not only are all the outputs NaN, 99 00:04:23,970 --> 00:04:27,600 all the weights are NaN as well, as well as the loss.
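The two options just described can be sketched in plain Python (a toy stand-in, not the actual Keras or transformers API): either leave the "labels" key inside the batch dictionary so the model's internal loss can see it, or split it out into the separate (features, labels) pair that Keras losses expect.

```python
def split_labels(batch):
    # Move "labels" out of the input dictionary into the separate position
    # Keras losses expect: a (features, labels) pair.
    features = {k: v for k, v in batch.items() if k != "labels"}
    return features, batch["labels"]

# Hypothetical tokenized batch with labels inside the input dictionary.
batch = {"input_ids": [[101, 2023, 102]], "attention_mask": [[1, 1, 1]], "labels": [1]}
features, labels = split_labels(batch)
```

With the model's internal loss, you would instead keep `batch` as-is and drop the loss argument from compile, which is the route the video takes.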
100 00:04:27,600 --> 00:04:30,810 So once a single NaN creeps into your computations, 101 00:04:30,810 --> 00:04:34,530 it tends to spread, because it propagates from the loss 102 00:04:34,530 --> 00:04:36,420 and once it's in the loss, 103 00:04:36,420 --> 00:04:37,530 it gets to the gradient, 104 00:04:37,530 --> 00:04:38,910 and then once it's in the gradient 105 00:04:38,910 --> 00:04:41,280 it enters the weight updates, 106 00:04:41,280 --> 00:04:43,980 and then all your weight updates end up as NaN as well. 107 00:04:43,980 --> 00:04:46,950 So NaN just completely destroyed our model here, 108 00:04:46,950 --> 00:04:49,560 but where did it creep in first? 109 00:04:49,560 --> 00:04:52,140 So to find out, we need to go back to a point 110 00:04:52,140 --> 00:04:53,490 before the model was destroyed, 111 00:04:53,490 --> 00:04:55,440 we need to re-initialize the model 112 00:04:55,440 --> 00:04:58,590 and look at the outputs for just the first batch. 113 00:04:58,590 --> 00:04:59,850 And when we do that, 114 00:04:59,850 --> 00:05:02,790 we see that NaN first appears in the loss, 115 00:05:02,790 --> 00:05:04,980 but only in some samples. 116 00:05:04,980 --> 00:05:06,540 So you can see this in more detail 117 00:05:06,540 --> 00:05:09,090 in the accompanying section of the course notes, 118 00:05:09,090 --> 00:05:11,220 I am moving fairly quickly here, 119 00:05:11,220 --> 00:05:13,500 but we find that if we look at the labels, 120 00:05:13,500 --> 00:05:17,790 the samples with a loss of NaN all have a label of two.
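The NaN-spreading behaviour described here is just ordinary floating-point arithmetic, easy to check in plain Python without any TensorFlow at all:

```python
import math

loss = float("nan")                # one bad sample poisons the loss...
gradient = loss * 0.5              # ...any gradient computed from it...
weight = 0.3 - 0.001 * gradient    # ...and then the weight update itself.

# Every value downstream of a NaN is also NaN.
assert math.isnan(gradient)
assert math.isnan(weight)
```

This is why, once a single NaN appears, you have to re-initialize the model rather than keep training: the corrupted weights cannot recover.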
121 00:05:17,790 --> 00:05:19,950 So this gives us a very strong clue, 122 00:05:19,950 --> 00:05:24,060 if we check the model with model.config.num_labels, 123 00:05:24,060 --> 00:05:26,760 we see that the model thinks there's only two labels, 124 00:05:26,760 --> 00:05:28,950 but if we see a value of two, 125 00:05:28,950 --> 00:05:31,200 that means there's at least three labels 126 00:05:31,200 --> 00:05:33,630 because 0 is a label as well. 127 00:05:33,630 --> 00:05:35,070 So we got a loss of NaN 128 00:05:35,070 --> 00:05:37,887 because we got an "impossible" label in our label set, 129 00:05:37,887 --> 00:05:41,010 and to fix that we need to go back and set the model 130 00:05:41,010 --> 00:05:43,650 to expect the right number of labels, 131 00:05:43,650 --> 00:05:45,870 so we can set num_labels=3 132 00:05:45,870 --> 00:05:48,540 when we initialize the model with from_pretrained, 133 00:05:48,540 --> 00:05:51,450 and now hopefully we can avoid this issue. 134 00:05:51,450 --> 00:05:54,660 So, now we think our data is good and our model is good 135 00:05:54,660 --> 00:05:56,220 and so training should work 136 00:05:56,220 --> 00:06:00,510 but if we try running model.fit, we, well... 137 00:06:00,510 --> 00:06:02,040 I mean, we do get a loss, 138 00:06:02,040 --> 00:06:03,930 it is a number and it is going down 139 00:06:03,930 --> 00:06:06,090 but it's not going down very quickly 140 00:06:06,090 --> 00:06:07,770 and if we keep running this out, 141 00:06:07,770 --> 00:06:10,980 we'll find that it stalls at a fairly high loss value. 142 00:06:10,980 --> 00:06:12,450 So, what's going on? 143 00:06:12,450 --> 00:06:14,130 Well, when things are mostly working, 144 00:06:14,130 --> 00:06:16,620 but training is just slow or a bit odd, 145 00:06:16,620 --> 00:06:19,470 that can often be a good time to look at your optimizer 146 00:06:19,470 --> 00:06:22,020 and your training hyperparameters.
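A quick sanity check along these lines (a hypothetical helper, not part of transformers) can catch "impossible" labels before training: for num_labels classes, valid class ids run from 0 to num_labels - 1, since 0 counts as a label too.

```python
def invalid_labels(labels, num_labels):
    # Valid class ids are 0 .. num_labels - 1; anything outside that range
    # will produce a NaN loss with sparse categorical cross-entropy.
    return [label for label in labels if not 0 <= label < num_labels]

# With num_labels=2, a label of 2 is out of range; with num_labels=3 it is fine.
bad = invalid_labels([0, 1, 2], num_labels=2)
ok = invalid_labels([0, 1, 2], num_labels=3)
```

Running a check like this over your dataset before calling fit is much cheaper than discovering the mismatch through a NaN loss.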
147 00:06:22,020 --> 00:06:23,460 And this is where I want to mention 148 00:06:23,460 --> 00:06:25,320 one of the most common sources of issues 149 00:06:25,320 --> 00:06:27,000 when you're working with Keras, 150 00:06:27,000 --> 00:06:30,870 you can name things like optimizers with strings, 151 00:06:30,870 --> 00:06:33,180 so Keras supports that and it's very convenient, 152 00:06:33,180 --> 00:06:35,460 but if you do that all of the options 153 00:06:35,460 --> 00:06:38,400 get silently set to their default values. 154 00:06:38,400 --> 00:06:41,190 So we specified our optimizer as Adam, 155 00:06:41,190 --> 00:06:43,110 but in the process we invisibly got 156 00:06:43,110 --> 00:06:46,260 the default learning rate, which is 1e-3, 157 00:06:46,260 --> 00:06:48,630 or 10 to the power of -3. 158 00:06:48,630 --> 00:06:50,550 So this learning rate is way too high 159 00:06:50,550 --> 00:06:52,530 for training transformer models, 160 00:06:52,530 --> 00:06:55,620 we should go back and specify the learning rate directly, 161 00:06:55,620 --> 00:06:57,060 not using a string. 162 00:06:57,060 --> 00:07:01,290 So, good values here are between 1e-5 and 1e-4 163 00:07:01,290 --> 00:07:04,233 so let's split the difference and pick 5e-5. 164 00:07:05,310 --> 00:07:06,990 So if you recompile with that, 165 00:07:06,990 --> 00:07:09,840 you'll find that training actually works, at last. 166 00:07:09,840 --> 00:07:11,700 The loss goes down efficiently 167 00:07:11,700 --> 00:07:14,070 and it converges to a lower value. 
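The pitfall is easy to mimic in plain Python (a toy registry, not the real Keras internals): passing a bare string silently picks up every default, including the 1e-3 learning rate, whereas building the configuration yourself keeps your chosen 5e-5.

```python
# Toy illustration of name-based lookup: a string resolves to defaults only.
OPTIMIZER_DEFAULTS = {"adam": {"learning_rate": 1e-3, "beta_1": 0.9}}

def resolve_optimizer(spec):
    # A string gives you a copy of the defaults; an explicit config
    # keeps the settings you actually chose.
    if isinstance(spec, str):
        return dict(OPTIMIZER_DEFAULTS[spec])
    return spec

by_name = resolve_optimizer("adam")              # silently lr = 1e-3
explicit = resolve_optimizer({"learning_rate": 5e-5})
```

In real Keras code the fix is the same idea: pass an optimizer object constructed with an explicit learning_rate instead of the string "adam".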
168 00:07:14,070 --> 00:07:16,410 So, again, I did go through this quite quickly 169 00:07:16,410 --> 00:07:18,720 and I strongly recommend checking out the course notes 170 00:07:18,720 --> 00:07:20,040 to see this in more detail, 171 00:07:20,040 --> 00:07:21,600 and to experiment with the code yourself 172 00:07:21,600 --> 00:07:23,490 and see what the errors look like 173 00:07:23,490 --> 00:07:25,380 and how you can approach them, 174 00:07:25,380 --> 00:07:27,930 but I hope I've given you here a quick summary 175 00:07:27,930 --> 00:07:30,510 of the most common bugs 176 00:07:30,510 --> 00:07:32,880 and maybe the most common debugging approaches 177 00:07:32,880 --> 00:07:33,960 to dealing with them. 178 00:07:33,960 --> 00:07:37,020 So, good luck, and remember to take plenty of breaks 179 00:07:37,020 --> 00:07:38,970 if your code is giving you a hard time. 180 00:07:39,805 --> 00:07:42,472 (air whooshing)