1
00:00:00,212 --> 00:00:02,879
(空气呼啸)
(air whooshing)
2
00:00:04,680 --> 00:00:08,130
- 你代码中的一些错误非常简单。
- Some bugs in your code are very straightforward.
3
00:00:08,130 --> 00:00:11,580
你尝试运行它,某处出现语法错误,
You try running it, you get a syntax error somewhere,
4
00:00:11,580 --> 00:00:14,490
Python 准确地告诉你在哪里,然后你修复它。
Python tells you exactly where, and you fix it.
5
00:00:14,490 --> 00:00:17,760
这很棒,简单而且令人满意。
This is great, it's simple and it's satisfying.
6
00:00:17,760 --> 00:00:20,310
但有时,事情会崩溃
Sometimes, though, things crash
7
00:00:20,310 --> 00:00:23,670
而错误信息却让人无法理解。
and the error is impossible to understand.
8
00:00:23,670 --> 00:00:26,700
由于一些原因,这种情况在机器学习中经常发生,
This happens a lot in machine learning for a few reasons,
9
00:00:26,700 --> 00:00:29,310
你正在处理大数据结构,
you're working with big data structures,
10
00:00:29,310 --> 00:00:31,440
你正在使用这些大而复杂的库
you're using these big, complex libraries
11
00:00:31,440 --> 00:00:33,420
它们有很多活动部件,
with a lot of moving parts,
12
00:00:33,420 --> 00:00:35,310
而且你正在做大量的 GPU 计算,
and also you're doing a lot of GPU computing,
13
00:00:35,310 --> 00:00:38,490
而且一般来说调试起来要困难得多。
and that in general is much more difficult to debug.
14
00:00:38,490 --> 00:00:40,260
在 Keras 中还有一个问题
In Keras there's the additional problem
15
00:00:40,260 --> 00:00:43,140
你的模型通常在执行前编译,
that your models are often compiled before execution,
16
00:00:43,140 --> 00:00:44,400
这对性能很有帮助
which is great for performance
17
00:00:44,400 --> 00:00:47,430
但这也使调试它们变得非常困难。
but it makes debugging them very difficult as well.
18
00:00:47,430 --> 00:00:50,370
所以,这将是一个关于该怎么做的视频
So, this is going to be a video about what to do
19
00:00:50,370 --> 00:00:52,410
当你遇到其中一个噩梦般的错误时
when you run into one of those nightmare bugs
20
00:00:52,410 --> 00:00:55,210
而且你只是不知道从哪里开始修复它。
and you just have no idea where to begin with fixing it.
21
00:00:56,370 --> 00:00:58,920
所以,为了让你直观地了解
So, to give you some intuitions for
22
00:00:58,920 --> 00:01:01,530
最常见的出错情况
the most common things that go wrong
23
00:01:01,530 --> 00:01:03,573
以及它们如何导致这些奇怪的问题,
and cause these weird issues,
24
00:01:04,800 --> 00:01:07,530
并告诉你去哪里寻找你所遇到的
and show you where to look for the sources of bugs
25
00:01:07,530 --> 00:01:10,560
那些错误的来源,让我们使用这个示例脚本。
that you encounter, let's use this example script.
26
00:01:10,560 --> 00:01:12,900
因此,我将在这里分两部分向你展示。
So, I'll show it to you here in two parts.
27
00:01:12,900 --> 00:01:16,410
首先,我们做所有的导入,我们加载一个数据集,
First, we do all our imports, we load a dataset,
28
00:01:16,410 --> 00:01:20,280
我们创建了 tokenizer 并对数据集进行分词。
we create our tokenizer and we tokenize the dataset.
29
00:01:20,280 --> 00:01:23,640
接下来,我们将数据集转换为 TensorFlow 数据集,
Next, we convert our datasets to TensorFlow datasets,
30
00:01:23,640 --> 00:01:26,100
这就是 tf.data.Dataset,
so that's tf.data.Dataset,
31
00:01:26,100 --> 00:01:28,500
这样我们就可以对它们调用 fit,
and that's so that we can run fit on them,
32
00:01:28,500 --> 00:01:31,170
然后我们从预训练的 checkpoint 加载我们的模型,
and then we load our model from a pretrained checkpoint,
33
00:01:31,170 --> 00:01:33,870
我们对它进行编译,并用这些数据集来拟合它。
we compile it and we fit it with those datasets.
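(As a rough sketch, the script described so far might look like the Python below. The dataset, checkpoint, and column names are illustrative assumptions, not necessarily the video's exact choices; the real script is in the course notes.)

from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorWithPadding,
    TFAutoModelForSequenceClassification,
)

# Assumed dataset and checkpoint, for illustration only
raw_datasets = load_dataset("glue", "mnli")
checkpoint = "distilbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_fn(examples):
    return tokenizer(examples["premise"], examples["hypothesis"], truncation=True)

tokenized = raw_datasets.map(tokenize_fn, batched=True)
tokenized = tokenized.rename_column("label", "labels")
collator = DataCollatorWithPadding(tokenizer, return_tensors="np")

# Convert to a tf.data.Dataset; note the labels end up as a key
# *inside* the input dictionary, not as a separate label tensor
train_ds = tokenized["train"].to_tf_dataset(
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=True,
    batch_size=16,
    collate_fn=collator,
)

model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(train_ds)  # this is where the trouble starts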
34
00:01:33,870 --> 00:01:35,970
所以,这看起来很简单,
So, this seems straightforward enough,
35
00:01:35,970 --> 00:01:38,220
这与我们之前在课程中所做的类似。
it's similar to what we've done in the course before.
36
00:01:38,220 --> 00:01:40,650
但请注意,这是令人毛骨悚然的代码
But beware, this is spooky code
37
00:01:40,650 --> 00:01:43,590
并隐藏着许多黑暗而神秘的秘密。
and hides many dark and mysterious secrets.
38
00:01:43,590 --> 00:01:46,050
那么,当我们运行它时会发生什么?
So, what happens when we run it?
39
00:01:46,050 --> 00:01:48,840
好吧,这不是很好。
Well, it's not great.
40
00:01:48,840 --> 00:01:52,320
所以,我们收到了这个错误信息,但它是什么意思呢?
So, we get this error message, but what does it mean?
41
00:01:52,320 --> 00:01:55,470
我们试着用我们的数据进行训练,却没有得到梯度?
We tried to train on our data, but we got no gradient?
42
00:01:55,470 --> 00:01:59,130
这很令人困惑,我的意思是,我们从何开始
It's pretty perplexing, I mean, how do we even begin
43
00:01:59,130 --> 00:02:01,500
来调试“得不到梯度”这个问题呢?
to debug not getting a gradient?
44
00:02:01,500 --> 00:02:03,930
所以,当你得到的错误没有立即提示
So, when the error you get doesn't immediately suggest
45
00:02:03,930 --> 00:02:06,630
问题出在哪里,最好的解决办法
where the problem is, the best solution
46
00:02:06,630 --> 00:02:09,180
通常是按顺序遍历事物,
is often to walk through things in sequence,
47
00:02:09,180 --> 00:02:12,900
确保在每个阶段输出看起来都是正确的,
making sure at each stage that the outputs look right,
48
00:02:12,900 --> 00:02:15,300
那时一切看起来都很好。
that everything looks okay at that point.
49
00:02:15,300 --> 00:02:17,730
而且,当然,这意味着开始的地方
And, of course, that means the place to start
50
00:02:17,730 --> 00:02:19,473
总是检查你的数据。
is always to check your data.
51
00:02:20,670 --> 00:02:22,050
所以,最好的办法是确保
So, the best way to make sure
52
00:02:22,050 --> 00:02:24,480
你给模型的数据是好的,
that the data you're giving the model is good,
53
00:02:24,480 --> 00:02:27,690
是从 tf.data.Dataset 中抓取一批
is to grab a batch from the tf.data.Dataset
54
00:02:27,690 --> 00:02:29,520
也就是你的模型训练时所用的那个,
that your model is training on,
55
00:02:29,520 --> 00:02:31,560
这是因为它正处于
and that's because it's right at the end
56
00:02:31,560 --> 00:02:33,990
数据 pipeline 的最末端。
of the data pipeline.
57
00:02:33,990 --> 00:02:36,990
所以这意味着如果这些输出是好的,
And so that means that if those outputs are good,
58
00:02:36,990 --> 00:02:39,990
你可以保证你的数据 pipeline 运行良好。
you're guaranteed that your data pipeline is working well.
59
00:02:39,990 --> 00:02:42,600
所以,我们可以通过遍历数据集来做到这一点
So, we can do that by looping over the dataset
60
00:02:42,600 --> 00:02:44,790
进行一次迭代然后中断,
for one iteration and then breaking,
61
00:02:44,790 --> 00:02:46,980
这就给了我们单独的一个批次。
and that gives us a single batch.
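(A minimal sketch of that check, reusing the hypothetical train_ds from the sketch above:)

# Grab a single batch from the tf.data.Dataset the model trains on
for batch in train_ds:
    break
print(batch)  # the labels show up as a key inside the input dict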
62
00:02:46,980 --> 00:02:49,443
那么,当我们检查该批次时,我们得到了什么?
So, what do we get when we inspect that batch?
63
00:02:50,460 --> 00:02:52,500
我们会看到我们没有得到任何梯度
We'll see that we're not getting any gradient
64
00:02:52,500 --> 00:02:55,530
因为我们没有将标签传递给 Keras。
because we're not passing labels to Keras.
65
00:02:55,530 --> 00:02:57,510
所以,我们的标签在批次中,
So, our labels are in the batch,
66
00:02:57,510 --> 00:02:59,670
但它们是输入字典中的键
but they're a key in the input dictionary
67
00:02:59,670 --> 00:03:02,340
而不是像 Keras 期望的那样作为单独的标签,
and they're not a separate label as Keras expects,
68
00:03:02,340 --> 00:03:04,830
所以这是你会遇到的最常见的问题之一
so this is one of the most common issues you'll encounter
69
00:03:04,830 --> 00:03:07,590
使用 TensorFlow 训练 Transformers 模型时。
when training Transformers models with TensorFlow.
70
00:03:07,590 --> 00:03:10,980
我们的模型都可以在内部计算损失,
Our models can all compute loss internally,
71
00:03:10,980 --> 00:03:13,140
但要将损失用于训练
but to use that loss for training
72
00:03:13,140 --> 00:03:15,960
标签需要在输入字典中传递,
the labels need to be passed in the input dictionary,
73
00:03:15,960 --> 00:03:17,940
这样模型才能看到它们。
where the model can see them.
74
00:03:17,940 --> 00:03:20,280
这种内部损失是我们使用的损失
This internal loss is the loss that we use
75
00:03:20,280 --> 00:03:23,760
即当我们调用 compile 而没有指定损失时,
when we don't specify a loss when we call compile,
76
00:03:23,760 --> 00:03:25,660
当我们不指定损失参数时。
when we don't specify a loss argument.
77
00:03:26,520 --> 00:03:27,870
而另一方面,Keras
So, Keras, on the other hand,
78
00:03:27,870 --> 00:03:30,570
通常期望标签在输入字典之外
usually expects labels to be passed separately
79
00:03:30,570 --> 00:03:32,130
单独传递,
from the input dictionary,
80
00:03:32,130 --> 00:03:34,110
并且对模型不可见,
and not to be visible to the model,
81
00:03:34,110 --> 00:03:36,600
损失计算通常会失败
and loss computations will usually fail
82
00:03:36,600 --> 00:03:38,220
如果你不那样做。
if you don't do that.
83
00:03:38,220 --> 00:03:40,380
所以我们需要选择其中之一,
So we need to choose one or the other,
84
00:03:40,380 --> 00:03:42,780
要么我们使用模型的内部损失
either we use the model's internal loss
85
00:03:42,780 --> 00:03:44,940
并将标签保留在原处,
and keep the labels where they are,
86
00:03:44,940 --> 00:03:46,980
或者我们继续使用 Keras 损失
or we keep using Keras losses
87
00:03:46,980 --> 00:03:50,520
但我们将标签移动到 Keras 期望的地方。
but we move the labels to the place Keras expects them.
88
00:03:50,520 --> 00:03:53,310
所以,为简单起见,让我们解决这个问题
So, for simplicity here, let's fix this issue
89
00:03:53,310 --> 00:03:55,860
通过使用模型的内部损失,
by using the model's internal losses,
90
00:03:55,860 --> 00:03:57,900
我们通过移除 loss 参数来做到这一点,
and we do that by removing the loss argument
91
00:03:57,900 --> 00:03:59,343
也就是从对 compile 的调用中移除。
from the call to compile.
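(A sketch of that fix, with the assumed names from above: dropping the loss argument so the model's internal loss is used.)

# No loss argument: fall back to the model's internally computed loss
model.compile(optimizer="adam")
model.fit(train_ds)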
92
00:04:00,540 --> 00:04:03,000
那么,如果我们现在尝试训练会发生什么?
So, what happens if we try training now?
93
00:04:03,000 --> 00:04:08,000
所以我们用它重新编译,我们调用 model.fit,会发生什么?
So we recompile with that, we call model.fit, what happens?
94
00:04:08,220 --> 00:04:13,050
好吧,这次它运行了,但现在损失值变成了 NaN。
Well, it runs this time but now we get a loss of NaN.
95
00:04:13,050 --> 00:04:16,440
所以,这不太妙,NaN 的意思是“不是数字”,
So, that's not good, NaN means not a number
96
00:04:16,440 --> 00:04:19,140
总的来说,这可不是一个理想的损失值。
and it's not a good loss to have in general.
97
00:04:19,140 --> 00:04:21,000
事实上,如果我们现在检查我们的模型,
In fact, if we inspect our model now,
98
00:04:21,000 --> 00:04:23,970
我们会看到不仅所有的输出都是 NaN,
we'll see that not only are all the outputs NaN,
99
00:04:23,970 --> 00:04:27,600
所有的权重和损失也是 NaN。
all the weights are NaN as well, as well as the loss.
100
00:04:27,600 --> 00:04:30,810
所以一旦一个 NaN 悄悄爬进你的计算,
So once a single NaN creeps into your computations,
101
00:04:30,810 --> 00:04:34,530
它倾向于传播,因为它从损失传播
it tends to spread, because it propagates from the loss
102
00:04:34,530 --> 00:04:36,420
一旦它出现在损失里,它就会进入梯度,
and once it's at the loss it's at the gradient,
103
00:04:36,420 --> 00:04:37,530
它达到梯度,
it gets to the gradient,
104
00:04:37,530 --> 00:04:38,910
然后一旦它处于梯度中
and then once it's in the gradient
105
00:04:38,910 --> 00:04:41,280
它进入权重更新,
it enters the weight updates,
106
00:04:41,280 --> 00:04:43,980
然后你所有的权重更新最终也都是 NaN 。
and then all your weight updates end up as NaN as well.
107
00:04:43,980 --> 00:04:46,950
所以 NaN 在这里完全破坏了我们的模型,
So NaN just completely destroyed our model here,
108
00:04:46,950 --> 00:04:49,560
但它首先潜入何处?
but where did it creep in first?
109
00:04:49,560 --> 00:04:52,140
所以要找出答案,我们需要回到
So to find out, we need to go back to a point
110
00:04:52,140 --> 00:04:53,490
模型被摧毁之前的那个时间点,
before the model was destroyed,
111
00:04:53,490 --> 00:04:55,440
我们需要重新初始化模型
we need to re-initialize the model
112
00:04:55,440 --> 00:04:58,590
并查看第一批的输出。
and look at the outputs for just the first batch.
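(Sketched with the hypothetical names from above; this relies on the TF models returning one unreduced loss value per sample.)

# Fresh, undamaged model; run a single forward pass on one batch
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(batch)
print(outputs.loss)  # per-sample losses; some of them come out as NaN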
113
00:04:58,590 --> 00:04:59,850
当我们这样做时,
And when we do that,
114
00:04:59,850 --> 00:05:02,790
我们看到 NaN 首先出现在损失中,
we see that NaN first appears in the loss,
115
00:05:02,790 --> 00:05:04,980
但仅在某些样本中。
but only in some samples.
116
00:05:04,980 --> 00:05:06,540
所以你可以更详细地看到这个
So you can see this in more detail
117
00:05:06,540 --> 00:05:09,090
在课程笔记的随附部分中,
in the accompanying section of the course notes,
118
00:05:09,090 --> 00:05:11,220
我在这里讲得比较快,
I am moving fairly quickly here,
119
00:05:11,220 --> 00:05:13,500
但我们发现如果我们看一下标签,
but we find that if we look at the labels,
120
00:05:13,500 --> 00:05:17,790
损失为 NaN 的样本的标签均为 2。
the samples with a loss of NaN all have a label of two.
121
00:05:17,790 --> 00:05:19,950
所以这给了我们一个非常有力的线索,
So this gives us a very strong clue,
122
00:05:19,950 --> 00:05:24,060
如果我们使用 model.config.num_labels 检查模型,
if we check the model with model.config.num_labels,
123
00:05:24,060 --> 00:05:26,760
我们看到模型认为只有两个标签,
we see that the model thinks there's only two labels,
124
00:05:26,760 --> 00:05:28,950
但如果我们看到值为二,
but if we see a value of two,
125
00:05:28,950 --> 00:05:31,200
这意味着至少有三个标签
that means there's at least three labels
126
00:05:31,200 --> 00:05:33,630
因为 0 也是一个标签。
because 0 is a label as well.
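(A sketch of that label check, again with the assumed names from above:)

import numpy as np

# Which samples have a NaN loss, and what labels do they carry?
per_sample_loss = outputs.loss.numpy()
bad_indices = np.flatnonzero(np.isnan(per_sample_loss))
print(batch["labels"].numpy()[bad_indices])  # all 2 in this example
print(model.config.num_labels)  # but the model only expects 2 classes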
127
00:05:33,630 --> 00:05:35,070
所以我们得到了大量的 NaN
So we got a lot of NaNs
128
00:05:35,070 --> 00:05:37,887
因为我们的标签集中有一个 “不可能” 的标签,
because we got an "impossible" label in our label set,
129
00:05:37,887 --> 00:05:41,010
要修复这个问题,我们需要回去把模型设置为
and to fix that we need to go back and set the model
130
00:05:41,010 --> 00:05:43,650
期待正确数量的标签,
to expect the right number of labels,
131
00:05:43,650 --> 00:05:45,870
所以我们可以设置 num_labels=3
so we can set num_labels=3
132
00:05:45,870 --> 00:05:48,540
当我们用 from_pretrained 初始化模型时,
when we initialize the model with from_pretrained,
133
00:05:48,540 --> 00:05:51,450
现在希望我们可以避免这个问题。
and now hopefully we can avoid this issue.
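(The fix, sketched with the same assumed names:)

# Tell the model the true number of classes when loading it
model = TFAutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=3
)
model.compile(optimizer="adam")
model.fit(train_ds)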
134
00:05:51,450 --> 00:05:54,660
所以,现在我们认为我们的数据很好,我们的模型也很好
So, now we think our data is good and our model is good
135
00:05:54,660 --> 00:05:56,220
所以训练应该有效
and so training should work
136
00:05:56,220 --> 00:06:00,510
但是如果我们尝试运行 model.fit,我们,嗯……
but if we try running model.fit, we, well...
137
00:06:00,510 --> 00:06:02,040
我的意思是,我们确实得到了一个损失值,
I mean, we do get a loss,
138
00:06:02,040 --> 00:06:03,930
这是一个数字,它正在下降
it is a number and it is going down
139
00:06:03,930 --> 00:06:06,090
但它下降得并不快
but it's not going down very quickly
140
00:06:06,090 --> 00:06:07,770
如果我们一直运行这个,
and if we keep running this out,
141
00:06:07,770 --> 00:06:10,980
我们会发现它停在相当高的损失值处。
we'll find that it stalls at a fairly high loss value.
142
00:06:10,980 --> 00:06:12,450
那么,到底发生了什么?
So, what's going on?
143
00:06:12,450 --> 00:06:14,130
好吧,当一切基本正常,
Well, when things are mostly working,
144
00:06:14,130 --> 00:06:16,620
但训练只是很慢或有点奇怪,
but training is just slow or a bit odd,
145
00:06:16,620 --> 00:06:19,470
这通常是查看优化器的好时机
that can often be a good time to look at your optimizer
146
00:06:19,470 --> 00:06:22,020
和你的训练超参数。
and your training hyperparameters.
147
00:06:22,020 --> 00:06:23,460
这就是我想提的地方
And this is where I want to mention
148
00:06:23,460 --> 00:06:25,320
最常见的问题来源之一
one of the most common sources of issues
149
00:06:25,320 --> 00:06:27,000
当你使用 Keras 时,
when you're working with Keras,
150
00:06:27,000 --> 00:06:30,870
你可以用字符串来指定优化器之类的东西,
you can name things like optimizers with strings,
151
00:06:30,870 --> 00:06:33,180
所以 Keras 支持它,而且非常方便,
so Keras supports that and it's very convenient,
152
00:06:33,180 --> 00:06:35,460
但如果你这样做,所有的选项
but if you do that all of the options
153
00:06:35,460 --> 00:06:38,400
都会被默默地设置为默认值。
get silently set to their default values.
154
00:06:38,400 --> 00:06:41,190
所以我们将优化器指定为 Adam,
So we specified our optimizer as Adam,
155
00:06:41,190 --> 00:06:43,110
但在这个过程中我们无形中得到了
but in the process we invisibly got
156
00:06:43,110 --> 00:06:46,260
默认学习率,即 1e-3,
the default learning rate, which is 1e-3,
157
00:06:46,260 --> 00:06:48,630
或 10 的 -3 次方。
or 10 to the power of -3.
158
00:06:48,630 --> 00:06:50,550
所以这个学习率太高了
So this learning rate is way too high
159
00:06:50,550 --> 00:06:52,530
用于训练 transformer 模型,
for training transformer models,
160
00:06:52,530 --> 00:06:55,620
我们应该回去直接指定学习率,
we should go back and specify the learning rate directly,
161
00:06:55,620 --> 00:06:57,060
不使用字符串。
not using a string.
162
00:06:57,060 --> 00:07:01,290
所以,这里的好值在 1e-5 和 1e-4 之间
So, good values here are between 1e-5 and 1e-4
163
00:07:01,290 --> 00:07:04,233
所以让我们折中一下,选择 5e-5。
so let's split the difference and pick 5e-5.
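(A sketch of the final fix: pass a real optimizer object instead of the string "adam", so the learning rate is explicit.)

from tensorflow.keras.optimizers import Adam

model.compile(optimizer=Adam(learning_rate=5e-5))  # not the 1e-3 default
model.fit(train_ds)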
164
00:07:05,310 --> 00:07:06,990
所以如果你用那个重新编译,
So if you recompile with that,
165
00:07:06,990 --> 00:07:09,840
你最终会发现训练确实有效。
you'll find that training actually works, at last.
166
00:07:09,840 --> 00:07:11,700
损失有效减少
The loss goes down efficiently
167
00:07:11,700 --> 00:07:14,070
并且收敛到一个较低的值。
and it converges to a lower value.
168
00:07:14,070 --> 00:07:16,410
所以,再说一次,我这里确实讲得很快
So, again, I did go through this quite quickly
169
00:07:16,410 --> 00:07:18,720
我强烈建议查看课程笔记
and I strongly recommend checking out the course notes
170
00:07:18,720 --> 00:07:20,040
要更详细地了解这一点,
to see this in more detail,
171
00:07:20,040 --> 00:07:21,600
并自己试验代码
and to experiment with the code yourself
172
00:07:21,600 --> 00:07:23,490
看看错误是什么样的
and see what the errors look like
173
00:07:23,490 --> 00:07:25,380
以及你可以如何应对它们,
and how you can approach them,
174
00:07:25,380 --> 00:07:27,930
但我希望我在这里给了你一个简短的总结
but I hope I've given you here a quick summary
175
00:07:27,930 --> 00:07:30,510
最常见的错误
of the most common bugs
176
00:07:30,510 --> 00:07:32,880
也许是最常见的调试方法
and maybe the most common debugging approaches
177
00:07:32,880 --> 00:07:33,960
来应对它们。
to dealing with them.
178
00:07:33,960 --> 00:07:37,020
所以,祝你好运,记得多休息
So, good luck, and remember to take plenty of breaks
179
00:07:37,020 --> 00:07:38,970
如果你的代码给你带来困难。
if your code is giving you a hard time.
180
00:07:39,805 --> 00:07:42,472
(空气呼啸)
(air whooshing)