(air whooshing)

Some bugs in your code are very straightforward. You try running it, you get a syntax error somewhere, Python tells you exactly where, and you fix it. This is great: it's simple and it's satisfying. Sometimes, though, things crash and the error is impossible to understand. This happens a lot in machine learning, for a few reasons: you're working with big data structures, you're using big, complex libraries with a lot of moving parts, and you're doing a lot of GPU computing, which in general is much more difficult to debug. In Keras there's the additional problem that your models are often compiled before execution, which is great for performance but makes debugging them very difficult as well. So, this is going to be a video about what to do when you run into one of those nightmare bugs and you just have no idea where to begin fixing it.

To give you some intuition for the most common things that go wrong and cause these weird issues, and to show you where to look for the sources of the bugs you encounter, let's use this example script. I'll show it to you here in two parts. First, we do all our imports, we load a dataset, we create our tokenizer and we tokenize the dataset. Next, we convert our datasets to TensorFlow datasets (that's tf.data.Dataset) so that we can run fit on them, and then we load our model from a pretrained checkpoint, we compile it and we fit it with those datasets. This seems straightforward enough; it's similar to what we've done in the course before. But beware: this is spooky code, and it hides many dark and mysterious secrets. So, what happens when we run it?
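The script itself isn't reproduced in the transcript, so here is a minimal sketch of what it describes, with the bugs we're about to hit baked in. The dataset (GLUE MNLI), the checkpoint (bert-base-cased), the batch size, and the padding collator are my assumptions, not necessarily what the video uses:

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorWithPadding,
    TFAutoModelForSequenceClassification,
)

# Part 1: imports, dataset loading and tokenization.
# MNLI is assumed here; it is a three-class task, which matters later.
raw_datasets = load_dataset("glue", "mnli")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["premise"], examples["hypothesis"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")

# Part 2: convert to a tf.data.Dataset, then load, compile and fit the model.
# Note that "labels" ends up as a key *inside* the input dictionary.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
train_dataset = tokenized_datasets["train"].to_tf_dataset(
    columns=["input_ids", "attention_mask", "token_type_ids", "labels"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

# The spooky part: num_labels silently defaults to 2, "adam" silently uses
# the default learning rate, and the Keras loss below never sees the labels.
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased")
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(train_dataset)
```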
Well, it's not great. We get this error message, but what does it mean? We tried to train on our data, but we got no gradient? It's pretty perplexing; I mean, how do we even begin to debug not getting a gradient?

When the error you get doesn't immediately suggest where the problem is, the best solution is often to walk through things in sequence, making sure at each stage that the outputs look right and that everything looks okay at that point. And, of course, that means the place to start is always to check your data.

The best way to make sure that the data you're giving the model is good is to grab a batch from the tf.data.Dataset that your model is training on, because it's right at the end of the data pipeline. That means that if those outputs are good, you're guaranteed that your data pipeline is working well. We can do that by looping over the dataset for one iteration and then breaking, which gives us a single batch (see the sketch below). So, what do we get when we inspect that batch?

We'll see that we're not getting any gradient because we're not passing labels to Keras: our labels are in the batch, but they're a key in the input dictionary, not a separate label as Keras expects. This is one of the most common issues you'll encounter when training Transformers models with TensorFlow. Our models can all compute loss internally, but to use that loss for training, the labels need to be passed in the input dictionary, where the model can see them. This internal loss is the loss that we use when we don't specify a loss argument when we call compile.
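Concretely, that grab-and-inspect step might look like this (a sketch continuing the assumed script above):

```python
# Loop once and break to grab a single batch from the end of the pipeline
for batch in train_dataset:
    break

print(batch.keys())
# e.g. dict_keys(['input_ids', 'attention_mask', 'token_type_ids', 'labels'])
#
# The whole batch is one dictionary: "labels" is a key inside the model
# inputs, not the separate (features, labels) pair a Keras loss expects,
# so compile(loss=...) never sees any labels and no gradient can flow.
```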
Keras, on the other hand, usually expects labels to be passed separately from the input dictionary and not to be visible to the model, and loss computations will usually fail if you don't do that. So we need to choose one or the other: either we use the model's internal loss and keep the labels where they are, or we keep using Keras losses but move the labels to the place Keras expects them. For simplicity here, let's fix this issue by using the model's internal loss, which we do by removing the loss argument from the call to compile (sketched after this passage).

So, what happens if we try training now? We recompile with that, we call model.fit, and what happens? Well, it runs this time, but now we get a loss of NaN. That's not good: NaN means "not a number", and it's not a good loss to have in general. In fact, if we inspect our model now, we'll see that not only are all the outputs NaN, all the weights are NaN as well, as well as the loss. Once a single NaN creeps into your computations, it tends to spread: it propagates from the loss into the gradient, and once it's in the gradient it enters the weight updates, and then all your weight updates end up as NaN as well. So NaN just completely destroyed our model here, but where did it creep in first?

To find out, we need to go back to a point before the model was destroyed: we need to re-initialize the model and look at the outputs for just the first batch. And when we do that, we see that NaN first appears in the loss, but only in some samples.
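Here is the compile fix just described, sketched against the assumed script from earlier:

```python
# Use the model's internal loss: with no loss argument, the model computes
# its loss itself from the "labels" key it finds in the input dictionary
model.compile(optimizer="adam")
model.fit(train_dataset)
```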
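And here is a sketch of that re-initialize-and-inspect step. One assumption to flag: TF Transformers models return an unreduced, per-sample loss when labels are passed in the inputs, which is what makes the per-sample NaNs visible here:

```python
import numpy as np

# A fresh model, from before training turned every weight into NaN
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased")

# Forward pass on the single batch we grabbed earlier; with "labels" in the
# inputs, the returned loss is a vector with one entry per sample
outputs = model(batch)
per_sample_loss = outputs.loss.numpy()
print(per_sample_loss)  # NaN for some samples, finite for the rest

# Inspect the labels of the offending samples
print(batch["labels"].numpy()[np.isnan(per_sample_loss)])
```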
You can see this in more detail in the accompanying section of the course notes; I'm moving fairly quickly here. But we find that if we look at the labels, the samples with a loss of NaN all have a label of 2. This gives us a very strong clue: if we check the model with model.config.num_labels, we see that the model thinks there are only two labels, but if we see a label value of 2, that means there are at least three labels, because 0 is a label as well. So we got a lot of NaNs because we had an "impossible" label in our label set, and to fix that we need to go back and set the model to expect the right number of labels: we can set num_labels=3 when we initialize the model with from_pretrained (see the first sketch after this passage), and now hopefully we can avoid this issue.

So, now we think our data is good and our model is good, so training should work. But if we try running model.fit, we... well, I mean, we do get a loss, it is a number, and it is going down, but it's not going down very quickly, and if we keep running this out, we'll find that it stalls at a fairly high loss value. So, what's going on? Well, when things are mostly working but training is just slow or a bit odd, that can often be a good time to look at your optimizer and your training hyperparameters. And this is where I want to mention one of the most common sources of issues when you're working with Keras: you can name things like optimizers with strings. Keras supports that and it's very convenient, but if you do that, all of the options get silently set to their default values (see the second sketch below).
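First, the label-count diagnosis and fix from a moment ago, as a sketch (batch and checkpoint as assumed before):

```python
import numpy as np

print(model.config.num_labels)  # 2: the model was built with two classes
print(np.unique(batch["labels"].numpy()))  # includes 2, so at least three classes

# Re-initialize the model expecting the right number of labels, then recompile
model = TFAutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=3
)
model.compile(optimizer="adam")
model.fit(train_dataset)
```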
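And second, the optimizer fix this is leading up to: construct the optimizer explicitly rather than passing a string. The 5e-5 value is the one the video settles on next:

```python
import tensorflow as tf

# An explicit optimizer object instead of the string "adam": now the
# learning rate is visible and under our control (5e-5, in the usual
# 1e-5 to 1e-4 range for fine-tuning transformers)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5))
model.fit(train_dataset)
```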
So we specified our optimizer as Adam, but in the process we invisibly got the default learning rate, which is 1e-3, or 10 to the power of -3. This learning rate is way too high for training transformer models; we should go back and specify the learning rate directly, not using a string. Good values here are between 1e-5 and 1e-4, so let's split the difference and pick 5e-5. If you recompile with that, you'll find that training actually works, at last: the loss goes down efficiently and converges to a lower value.

So, again, I did go through this quite quickly, and I strongly recommend checking out the course notes to see this in more detail, and experimenting with the code yourself to see what the errors look like and how you can approach them. But I hope I've given you here a quick summary of the most common bugs, and maybe the most common debugging approaches to dealing with them. So, good luck, and remember to take plenty of breaks if your code is giving you a hard time.

(air whooshing)