1
00:00:00,212 --> 00:00:02,879
(空气呼啸)
(air whooshing)
2
00:00:04,680 --> 00:00:08,130
- 你代码中的一些错误非常简单。
- Some bugs in your code are very straightforward.
3
00:00:08,130 --> 00:00:11,580
你尝试运行它,某处出现语法错误,
You try running it, you get a syntax error somewhere,
4
00:00:11,580 --> 00:00:14,490
Python 准确地告诉你在哪里,然后你修复它。
Python tells you exactly where, and you fix it.
5
00:00:14,490 --> 00:00:17,760
这很棒,简单而且令人满意。
This is great, it's simple and it's satisfying.
6
00:00:17,760 --> 00:00:20,310
但有时,事情会崩溃
Sometimes, though, things crash
7
00:00:20,310 --> 00:00:23,670
而错误信息却让人无法理解。
and the error is impossible to understand.
8
00:00:23,670 --> 00:00:26,700
由于一些原因,这种情况在机器学习中经常发生,
This happens a lot in machine learning for a few reasons,
9
00:00:26,700 --> 00:00:29,310
你正在处理大数据结构,
you're working with big data structures,
10
00:00:29,310 --> 00:00:31,440
你正在使用这些大而复杂的库
you're using these big, complex libraries
11
00:00:31,440 --> 00:00:33,420
它们有很多活动部件,
with a lot of moving parts,
12
00:00:33,420 --> 00:00:35,310
而且你正在做大量的 GPU 计算,
and also you're doing a lot of GPU computing,
13
00:00:35,310 --> 00:00:38,490
而且一般来说调试起来要困难得多。
and that in general is much more difficult to debug.
14
00:00:38,490 --> 00:00:40,260
在 Keras 中还有一个问题
In Keras there's the additional problem
15
00:00:40,260 --> 00:00:43,140
你的模型通常在执行前编译,
that your models are often compiled before execution,
16
00:00:43,140 --> 00:00:44,400
这对性能很有帮助
which is great for performance
17
00:00:44,400 --> 00:00:47,430
但这也使调试它们变得非常困难。
but it makes debugging them very difficult as well.
18
00:00:47,430 --> 00:00:50,370
所以,这将是一个关于该怎么做的视频
So, this is going to be a video about what to do
19
00:00:50,370 --> 00:00:52,410
当你遇到其中一个噩梦般的错误时
when you run into one of those nightmare bugs
20
00:00:52,410 --> 00:00:55,210
而且你只是不知道从哪里开始修复它。
and you just have no idea where to begin with fixing it.
21
00:00:56,370 --> 00:00:58,920
所以,为了让你直观地了解
So, to give you some intuitions for
22
00:00:58,920 --> 00:01:01,530
最常见的出错情况
the most common things that go wrong
23
00:01:01,530 --> 00:01:03,573
以及它们如何导致这些奇怪的问题,
and cause these weird issues,
24
00:01:04,800 --> 00:01:07,530
并告诉你去哪里寻找你所遇到的
and show you where to look for the sources of bugs
25
00:01:07,530 --> 00:01:10,560
那些错误的来源,让我们使用这个示例脚本。
that you encounter, let's use this example script.
26
00:01:10,560 --> 00:01:12,900
因此,我将在这里分两部分向你展示。
So, I'll show it to you here in two parts.
27
00:01:12,900 --> 00:01:16,410
首先,我们做所有的导入,我们加载一个数据集,
First, we do all our imports, we load a dataset,
28
00:01:16,410 --> 00:01:20,280
我们创建了 tokenizer 并对数据集进行分词。
we create our tokenizer and we tokenize the dataset.
29
00:01:20,280 --> 00:01:23,640
接下来,我们将数据集转换为 TensorFlow 数据集,
Next, we convert our datasets to TensorFlow datasets,
30
00:01:23,640 --> 00:01:26,100
这就是 tf.data.Dataset,
so that's tf.data.Dataset,
31
00:01:26,100 --> 00:01:28,500
这样我们就可以对它们调用 fit,
and that's so that we can run fit on them,
32
00:01:28,500 --> 00:01:31,170
然后我们从预训练的 checkpoint 加载我们的模型,
and then we load our model from a pretrained checkpoint,
33
00:01:31,170 --> 00:01:33,870
我们对它进行编译,并用这些数据集来拟合它。
we compile it and we fit it with those datasets.
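(As a rough sketch, the script described so far might look like the Python below. The dataset, checkpoint, and column names are illustrative assumptions, not necessarily the video's exact choices; the real script is in the course notes.)

from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorWithPadding,
    TFAutoModelForSequenceClassification,
)

# Assumed dataset and checkpoint, for illustration only
raw_datasets = load_dataset("glue", "mnli")
checkpoint = "distilbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_fn(examples):
    return tokenizer(examples["premise"], examples["hypothesis"], truncation=True)

tokenized = raw_datasets.map(tokenize_fn, batched=True)
tokenized = tokenized.rename_column("label", "labels")
collator = DataCollatorWithPadding(tokenizer, return_tensors="np")

# Convert to a tf.data.Dataset; note the labels end up as a key
# *inside* the input dictionary, not as a separate label tensor
train_ds = tokenized["train"].to_tf_dataset(
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=True,
    batch_size=16,
    collate_fn=collator,
)

model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(train_ds)  # this is where the trouble starts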
34
00:01:33,870 --> 00:01:35,970
所以,这看起来很简单,
So, this seems straightforward enough,
35
00:01:35,970 --> 00:01:38,220
这与我们之前在课程中所做的类似。
it's similar to what we've done in the course before.
36
00:01:38,220 --> 00:01:40,650
但请注意,这是令人毛骨悚然的代码
But beware, this is spooky code
37
00:01:40,650 --> 00:01:43,590
并隐藏着许多黑暗而神秘的秘密。
and hides many dark and mysterious secrets.
38
00:01:43,590 --> 00:01:46,050
那么,当我们运行它时会发生什么?
So, what happens when we run it?
39
00:01:46,050 --> 00:01:48,840
好吧,这不是很好。
Well, it's not great.
40
00:01:48,840 --> 00:01:52,320
所以,我们收到了这个错误信息,但它是什么意思呢?
So, we get this error message, but what does it mean?
41
00:01:52,320 --> 00:01:55,470
我们试着用我们的数据进行训练,却没有得到梯度?
We tried to train on our data, but we got no gradient?
42
00:01:55,470 --> 00:01:59,130
这很令人困惑,我的意思是,我们从何开始
It's pretty perplexing, I mean, how do we even begin
43
00:01:59,130 --> 00:02:01,500
来调试“得不到梯度”这个问题呢?
to debug not getting a gradient?
44
00:02:01,500 --> 00:02:03,930
所以,当你得到的错误没有立即提示
So, when the error you get doesn't immediately suggest
45
00:02:03,930 --> 00:02:06,630
问题出在哪里,最好的解决办法
where the problem is, the best solution
46
00:02:06,630 --> 00:02:09,180
通常是按顺序遍历事物,
is often to walk through things in sequence,
47
00:02:09,180 --> 00:02:12,900
确保在每个阶段输出看起来都是正确的,
making sure at each stage that the outputs look right,
48
00:02:12,900 --> 00:02:15,300
那时一切看起来都很好。
that everything looks okay at that point.
49
00:02:15,300 --> 00:02:17,730
而且,当然,这意味着开始的地方
And, of course, that means the place to start
50
00:02:17,730 --> 00:02:19,473
总是检查你的数据。
is always to check your data.
51
00:02:20,670 --> 00:02:22,050
所以,最好的办法是确保
So, the best way to make sure
52
00:02:22,050 --> 00:02:24,480
你给模型的数据是好的,
that the data you're giving the model is good,
53
00:02:24,480 --> 00:02:27,690
是从 tf.data.Dataset 中抓取一批
is to grab a batch from the tf.data.Dataset
54
00:02:27,690 --> 00:02:29,520
也就是你的模型训练时所用的那个,
that your model is training on,
55
00:02:29,520 --> 00:02:31,560
这是因为它正处于
and that's because it's right at the end
56
00:02:31,560 --> 00:02:33,990
数据 pipeline 的最末端。
of the data pipeline.
57
00:02:33,990 --> 00:02:36,990
所以这意味着如果这些输出是好的,
And so that means that if those outputs are good,
58
00:02:36,990 --> 00:02:39,990
你可以保证你的数据 pipeline 运行良好。
you're guaranteed that your data pipeline is working well.
59
00:02:39,990 --> 00:02:42,600
所以,我们可以通过遍历数据集来做到这一点
So, we can do that by looping over the dataset
60
00:02:42,600 --> 00:02:44,790
进行一次迭代然后中断,
for one iteration and then breaking,
61
00:02:44,790 --> 00:02:46,980
这就给了我们单独的一个批次。
and that gives us a single batch.
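(A minimal sketch of that check, reusing the hypothetical train_ds from the sketch above:)

# Grab a single batch from the tf.data.Dataset the model trains on
for batch in train_ds:
    break
print(batch)  # the labels show up as a key inside the input dict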
62
00:02:46,980 --> 00:02:49,443
那么,当我们检查该批次时,我们得到了什么?
So, what do we get when we inspect that batch?
63
00:02:50,460 --> 00:02:52,500
我们会看到我们没有得到任何梯度
We'll see that we're not getting any gradient
64
00:02:52,500 --> 00:02:55,530
因为我们没有将标签传递给 Keras。
because we're not passing labels to Keras.
65
00:02:55,530 --> 00:02:57,510
所以,我们的标签在批次中,
So, our labels are in the batch,
66
00:02:57,510 --> 00:02:59,670
但它们是输入字典中的键
but they're a key in the input dictionary
67
00:02:59,670 --> 00:03:02,340
而不是像 Keras 期望的那样作为单独的标签,
and they're not a separate label as Keras expects,
68
00:03:02,340 --> 00:03:04,830
所以这是你会遇到的最常见的问题之一
so this is one of the most common issues you'll encounter
69
00:03:04,830 --> 00:03:07,590
使用 TensorFlow 训练 Transformers 模型时。
when training Transformers models with TensorFlow.
70
00:03:07,590 --> 00:03:10,980
我们的模型都可以在内部计算损失,
Our models can all compute loss internally,
71
00:03:10,980 --> 00:03:13,140
但要将损失用于训练
but to use that loss for training
72
00:03:13,140 --> 00:03:15,960
标签需要在输入字典中传递,
the labels need to be passed in the input dictionary,
73
00:03:15,960 --> 00:03:17,940
这样模型才能看到它们。
where the model can see them.
74
00:03:17,940 --> 00:03:20,280
这种内部损失是我们使用的损失
This internal loss is the loss that we use
75
00:03:20,280 --> 00:03:23,760
即当我们调用 compile 而没有指定损失时,
when we don't specify a loss when we call compile,
76
00:03:23,760 --> 00:03:25,660
当我们不指定损失参数时。
when we don't specify a loss argument.
77
00:03:26,520 --> 00:03:27,870
而另一方面,Keras
So, Keras, on the other hand,
78
00:03:27,870 --> 00:03:30,570
通常期望标签在输入字典之外
usually expects labels to be passed separately
79
00:03:30,570 --> 00:03:32,130
单独传递,
from the input dictionary,
80
00:03:32,130 --> 00:03:34,110
并且对模型不可见,
and not to be visible to the model,
81
00:03:34,110 --> 00:03:36,600
损失计算通常会失败
and loss computations will usually fail
82
00:03:36,600 --> 00:03:38,220
如果你不那样做。
if you don't do that.
83
00:03:38,220 --> 00:03:40,380
所以我们需要选择其中之一,
So we need to choose one or the other,
84
00:03:40,380 --> 00:03:42,780
要么我们使用模型的内部损失
either we use the model's internal loss
85
00:03:42,780 --> 00:03:44,940
并将标签保留在原处,
and keep the labels where they are,
86
00:03:44,940 --> 00:03:46,980
或者我们继续使用 Keras 损失
or we keep using Keras losses
87
00:03:46,980 --> 00:03:50,520
但我们将标签移动到 Keras 期望的地方。
but we move the labels to the place Keras expects them.
88
00:03:50,520 --> 00:03:53,310
所以,为简单起见,让我们解决这个问题
So, for simplicity here, let's fix this issue
89
00:03:53,310 --> 00:03:55,860
通过使用模型的内部损失,
by using the model's internal losses,
90
00:03:55,860 --> 00:03:57,900
我们通过移除 loss 参数来做到这一点,
and we do that by removing the loss argument
91
00:03:57,900 --> 00:03:59,343
也就是从对 compile 的调用中移除。
from the call to compile.
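(A sketch of that fix, with the assumed names from above: dropping the loss argument so the model's internal loss is used.)

# No loss argument: fall back to the model's internally computed loss
model.compile(optimizer="adam")
model.fit(train_ds)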
92
00:04:00,540 --> 00:04:03,000
那么,如果我们现在尝试训练会发生什么?
So, what happens if we try training now?
93
00:04:03,000 --> 00:04:08,000
所以我们用它重新编译,我们调用 model.fit,会发生什么?
So we recompile with that, we call model.fit, what happens?
94
00:04:08,220 --> 00:04:13,050
好吧,这次它运行了,但现在损失值变成了 NaN。
Well, it runs this time but now we get a loss of NaN.
95
00:04:13,050 --> 00:04:16,440
所以,这不太妙,NaN 的意思是“不是数字”,
So, that's not good, NaN means not a number
96
00:04:16,440 --> 00:04:19,140
总的来说,这可不是一个理想的损失值。
and it's not a good loss to have in general.
97
00:04:19,140 --> 00:04:21,000
事实上,如果我们现在检查我们的模型,
In fact, if we inspect our model now,
98
00:04:21,000 --> 00:04:23,970
我们会看到不仅所有的输出都是 NaN,
we'll see that not only are all the outputs NaN,
99
00:04:23,970 --> 00:04:27,600
所有的权重和损失也是 NaN。
all the weights are NaN as well, as well as the loss.
100
00:04:27,600 --> 00:04:30,810
所以一旦一个 NaN 悄悄爬进你的计算,
So once a single NaN creeps into your computations,
101
00:04:30,810 --> 00:04:34,530
它倾向于传播,因为它从损失传播
it tends to spread, because it propagates from the loss
102
00:04:34,530 --> 00:04:36,420
一旦它出现在损失里,它就会进入梯度,
and once it's at the loss it's at the gradient,
103
00:04:36,420 --> 00:04:37,530
它达到梯度,
it gets to the gradient,
104
00:04:37,530 --> 00:04:38,910
然后一旦它处于梯度中
and then once it's in the gradient
105
00:04:38,910 --> 00:04:41,280
它进入权重更新,
it enters the weight updates,
106
00:04:41,280 --> 00:04:43,980
然后你所有的权重更新最终也都是 NaN 。
and then all your weight updates end up as NaN as well.
107
00:04:43,980 --> 00:04:46,950
所以 NaN 在这里完全破坏了我们的模型,
So NaN just completely destroyed our model here,
108
00:04:46,950 --> 00:04:49,560
但它首先潜入何处?
but where did it creep in first?
109
00:04:49,560 --> 00:04:52,140
所以要找出答案,我们需要回到
So to find out, we need to go back to a point
110
00:04:52,140 --> 00:04:53,490
模型被摧毁之前的那个时间点,
before the model was destroyed,
111
00:04:53,490 --> 00:04:55,440
我们需要重新初始化模型
we need to re-initialize the model
112
00:04:55,440 --> 00:04:58,590
并查看第一批的输出。
and look at the outputs for just the first batch.
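(Sketched with the hypothetical names from above; this relies on the TF models returning one unreduced loss value per sample.)

# Fresh, undamaged model; run a single forward pass on one batch
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(batch)
print(outputs.loss)  # per-sample losses; some of them come out as NaN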
113
00:04:58,590 --> 00:04:59,850
当我们这样做时,
And when we do that,
114
00:04:59,850 --> 00:05:02,790
我们看到 NaN 首先出现在损失中,
we see that NaN first appears in the loss,
115
00:05:02,790 --> 00:05:04,980
但仅在某些样本中。
but only in some samples.
116
00:05:04,980 --> 00:05:06,540
所以你可以更详细地看到这个
So you can see this in more detail
117
00:05:06,540 --> 00:05:09,090
在课程笔记的随附部分中,
in the accompanying section of the course notes,
118
00:05:09,090 --> 00:05:11,220
我在这里讲得比较快,
I am moving fairly quickly here,
119
00:05:11,220 --> 00:05:13,500
但我们发现如果我们看一下标签,
but we find that if we look at the labels,
120
00:05:13,500 --> 00:05:17,790
损失为 NaN 的样本的标签均为 2。
the samples with a loss of NaN all have a label of two.
121
00:05:17,790 --> 00:05:19,950
所以这给了我们一个非常有力的线索,
So this gives us a very strong clue,
122
00:05:19,950 --> 00:05:24,060
如果我们使用 model.config.num_labels 检查模型,
if we check the model with model.config.num_labels,
123
00:05:24,060 --> 00:05:26,760
我们看到模型认为只有两个标签,
we see that the model thinks there's only two labels,
124
00:05:26,760 --> 00:05:28,950
但如果我们看到值为二,
but if we see a value of two,
125
00:05:28,950 --> 00:05:31,200
这意味着至少有三个标签
that means there's at least three labels
126
00:05:31,200 --> 00:05:33,630
因为 0 也是一个标签。
because 0 is a label as well.
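(A sketch of that label check, again with the assumed names from above:)

import numpy as np

# Which samples have a NaN loss, and what labels do they carry?
per_sample_loss = outputs.loss.numpy()
bad_indices = np.flatnonzero(np.isnan(per_sample_loss))
print(batch["labels"].numpy()[bad_indices])  # all 2 in this example
print(model.config.num_labels)  # but the model only expects 2 classes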
127
00:05:33,630 --> 00:05:35,070
所以我们得到了大量的 NaN
So we got a lot of NaNs
128
00:05:35,070 --> 00:05:37,887
因为我们的标签集中有一个 “不可能” 的标签,
because we got an "impossible" label in our label set,
129
00:05:37,887 --> 00:05:41,010
要修复这个问题,我们需要回去把模型设置为
and to fix that we need to go back and set the model
130
00:05:41,010 --> 00:05:43,650
期待正确数量的标签,
to expect the right number of labels,
131
00:05:43,650 --> 00:05:45,870
所以我们可以设置 num_labels=3
so we can set num_labels=3
132
00:05:45,870 --> 00:05:48,540
当我们用 from_pretrained 初始化模型时,
when we initialize the model with from_pretrained,
133
00:05:48,540 --> 00:05:51,450
现在希望我们可以避免这个问题。
and now hopefully we can avoid this issue.
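(The fix, sketched with the same assumed names:)

# Tell the model the true number of classes when loading it
model = TFAutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=3
)
model.compile(optimizer="adam")
model.fit(train_ds)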
134
00:05:51,450 --> 00:05:54,660
所以,现在我们认为我们的数据很好,我们的模型也很好
So, now we think our data is good and our model is good
135
00:05:54,660 --> 00:05:56,220
所以训练应该有效
and so training should work
136
00:05:56,220 --> 00:06:00,510
但是如果我们尝试运行 model.fit,我们,嗯……
but if we try running model.fit, we, well...
137
00:06:00,510 --> 00:06:02,040
我的意思是,我们确实得到了一个损失值,
I mean, we do get a loss,
138
00:06:02,040 --> 00:06:03,930
这是一个数字,它正在下降
it is a number and it is going down
139
00:06:03,930 --> 00:06:06,090
但它下降得并不快
but it's not going down very quickly
140
00:06:06,090 --> 00:06:07,770
如果我们一直运行这个,
and if we keep running this out,
141
00:06:07,770 --> 00:06:10,980
我们会发现它停在相当高的损失值处。
we'll find that it stalls at a fairly high loss value.
142
00:06:10,980 --> 00:06:12,450
那么,到底发生了什么?
So, what's going on?
143
00:06:12,450 --> 00:06:14,130
好吧,当一切基本正常,
Well, when things are mostly working,
144
00:06:14,130 --> 00:06:16,620
但训练只是很慢或有点奇怪,
but training is just slow or a bit odd,
145
00:06:16,620 --> 00:06:19,470
这通常是查看优化器的好时机
that can often be a good time to look at your optimizer
146
00:06:19,470 --> 00:06:22,020
和你的训练超参数。
and your training hyperparameters.
147
00:06:22,020 --> 00:06:23,460
这就是我想提的地方
And this is where I want to mention
148
00:06:23,460 --> 00:06:25,320
最常见的问题来源之一
one of the most common sources of issues
149
00:06:25,320 --> 00:06:27,000
当你使用 Keras 时,
when you're working with Keras,
150
00:06:27,000 --> 00:06:30,870
你可以用字符串来指定优化器之类的东西,
you can name things like optimizers with strings,
151
00:06:30,870 --> 00:06:33,180
所以 Keras 支持它,而且非常方便,
so Keras supports that and it's very convenient,
152
00:06:33,180 --> 00:06:35,460
但如果你这样做,所有的选项
but if you do that all of the options
153
00:06:35,460 --> 00:06:38,400
都会被默默地设置为默认值。
get silently set to their default values.
154
00:06:38,400 --> 00:06:41,190
所以我们将优化器指定为 Adam,
So we specified our optimizer as Adam,
155
00:06:41,190 --> 00:06:43,110
但在这个过程中我们无形中得到了
but in the process we invisibly got
156
00:06:43,110 --> 00:06:46,260
默认学习率,即 1e-3,
the default learning rate, which is 1e-3,
157
00:06:46,260 --> 00:06:48,630
或 10 的 -3 次方。
or 10 to the power of -3.
158
00:06:48,630 --> 00:06:50,550
所以这个学习率太高了
So this learning rate is way too high
159
00:06:50,550 --> 00:06:52,530
用于训练 transformer 模型,
for training transformer models,
160
00:06:52,530 --> 00:06:55,620
我们应该回去直接指定学习率,
we should go back and specify the learning rate directly,
161
00:06:55,620 --> 00:06:57,060
不使用字符串。
not using a string.
162
00:06:57,060 --> 00:07:01,290
所以,这里的好值在 1e-5 和 1e-4 之间
So, good values here are between 1e-5 and 1e-4
163
00:07:01,290 --> 00:07:04,233
所以让我们折中一下,选择 5e-5。
so let's split the difference and pick 5e-5.
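(A sketch of the final fix: pass a real optimizer object instead of the string "adam", so the learning rate is explicit.)

from tensorflow.keras.optimizers import Adam

model.compile(optimizer=Adam(learning_rate=5e-5))  # not the 1e-3 default
model.fit(train_ds)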
164
00:07:05,310 --> 00:07:06,990
所以如果你用那个重新编译,
So if you recompile with that,
165
00:07:06,990 --> 00:07:09,840
你最终会发现训练确实有效。
you'll find that training actually works, at last.
166
00:07:09,840 --> 00:07:11,700
损失有效减少
The loss goes down efficiently
167
00:07:11,700 --> 00:07:14,070
并且收敛到一个较低的值。
and it converges to a lower value.
168
00:07:14,070 --> 00:07:16,410
所以,再说一次,我这里确实讲得很快
So, again, I did go through this quite quickly
169
00:07:16,410 --> 00:07:18,720
我强烈建议查看课程笔记
and I strongly recommend checking out the course notes
170
00:07:18,720 --> 00:07:20,040
要更详细地了解这一点,
to see this in more detail,
171
00:07:20,040 --> 00:07:21,600
并自己试验代码
and to experiment with the code yourself
172
00:07:21,600 --> 00:07:23,490
看看错误是什么样的
and see what the errors look like
173
00:07:23,490 --> 00:07:25,380
以及你可以如何应对它们,
and how you can approach them,
174
00:07:25,380 --> 00:07:27,930
但我希望我在这里给了你一个简短的总结
but I hope I've given you here a quick summary
175
00:07:27,930 --> 00:07:30,510
最常见的错误
of the most common bugs
176
00:07:30,510 --> 00:07:32,880
也许是最常见的调试方法
and maybe the most common debugging approaches
177
00:07:32,880 --> 00:07:33,960
来应对它们。
to dealing with them.
178
00:07:33,960 --> 00:07:37,020
所以,祝你好运,记得多休息
So, good luck, and remember to take plenty of breaks
179
00:07:37,020 --> 00:07:38,970
如果你的代码给你带来困难。
if your code is giving you a hard time.
180
00:07:39,805 --> 00:07:42,472
(空气呼啸)
(air whooshing)