1
00:00:06,210 --> 00:00:08,760
- 在本视频中,我们将了解如何调试
- In this video, we will see how to debug an error
2
00:00:08,760 --> 00:00:11,896
运行 Trainer.train 时遇到的错误。
you encounter when running Trainer.train
3
00:00:11,896 --> 00:00:15,066
作为示例,我们将使用这个脚本
As an example, we will use this script that finetunes
4
00:00:15,066 --> 00:00:17,760
在 GLUE MNLI 数据集上微调一个 BERT 模型。
a BERT model on the GLUE MNLI dataset.
5
00:00:17,760 --> 00:00:19,470
查看下面链接的视频
Check out the videos linked below
6
00:00:19,470 --> 00:00:21,840
了解我们是如何得到这个脚本的。
to see how we came to such a script.
7
00:00:21,840 --> 00:00:24,540
这里我们要学习如何调试其中的问题。
Here we want to learn how to debug the problems in it.
8
00:00:25,470 --> 00:00:28,110
运行脚本很快就会给我们一个错误。
Running the script gives us an error pretty quickly.
9
00:00:28,110 --> 00:00:29,040
错误发生在
It happens at the line
10
00:00:29,040 --> 00:00:30,990
我们将输入传给模型的那一行,
where we feed the inputs to the model,
11
00:00:30,990 --> 00:00:32,850
这是回溯信息告诉我们的。
according to the traceback.
12
00:00:32,850 --> 00:00:34,702
这告诉我们那里出了问题,
That tells us there is a problem there,
13
00:00:34,702 --> 00:00:37,881
但问题可能有许多不同的原因。
but the problem could come from many different causes.
14
00:00:37,881 --> 00:00:39,330
要调试训练中的错误,
To debug an error in a training,
15
00:00:39,330 --> 00:00:41,760
你需要确保训练 pipeline 的每一步
you need to make sure each step of the training pipeline
16
00:00:41,760 --> 00:00:43,440
按预期工作。
works as intended.
17
00:00:43,440 --> 00:00:45,780
这意味着要检查数据集的输入
This means checking that the inputs of your dataset
18
00:00:45,780 --> 00:00:47,040
是否正确,
are correct,
19
00:00:47,040 --> 00:00:48,720
能把它们组合成一个批次,
you can batch them together,
20
00:00:48,720 --> 00:00:50,790
将它们输入模型以计算损失,
feed them through the model to get a loss,
21
00:00:50,790 --> 00:00:52,500
然后计算该损失的梯度
then compute the gradients of that loss
22
00:00:52,500 --> 00:00:54,303
再执行优化器步骤。
before performing an optimizer step.
23
00:00:55,470 --> 00:00:57,810
因此,让我们先来看看
So let's start by looking at the training dataset
24
00:00:57,810 --> 00:00:59,043
这个 Trainer 使用的训练数据集。
this Trainer is using.
25
00:00:59,910 --> 00:01:02,190
这里肯定有问题。
There is definitely a problem here.
26
00:01:02,190 --> 00:01:04,293
我们看到的是文字而不是数字。
We see texts and not numbers.
27
00:01:05,130 --> 00:01:06,660
错误消息告诉我们模型
The error message was telling us the model
28
00:01:06,660 --> 00:01:08,220
没有收到 input IDs,
did not get input IDs
29
00:01:08,220 --> 00:01:11,100
而我们的数据集中确实没有它们。
and we do not have those in the dataset indeed.
30
00:01:11,100 --> 00:01:12,660
回顾我们的代码,
Looking back at our code,
31
00:01:12,660 --> 00:01:14,400
可以看到我们犯了一个错误,
we can see we made a mistake
32
00:01:14,400 --> 00:01:17,400
把错误的数据集传给了 Trainer。
and passed the wrong datasets to the Trainer.
33
00:01:17,400 --> 00:01:19,173
所以让我们修复它并再次运行。
So let's fix that and run again.
34
00:01:20,490 --> 00:01:21,840
现在我们有一个新的错误。
Now we have a new error.
35
00:01:21,840 --> 00:01:23,130
检查回溯信息
Inspecting the traceback
36
00:01:23,130 --> 00:01:25,860
可以发现,错误发生在我们尝试创建批次的时候,
tells us it happens when we try to create a batch,
37
00:01:25,860 --> 00:01:28,743
具体来说,是在把特征组合进一个张量的时候。
specifically to group the features in a tensor.
38
00:01:29,700 --> 00:01:32,610
我们可以让 Trainer 从训练数据加载器中
We can confirm this by asking the Trainer to get us a batch
39
00:01:32,610 --> 00:01:34,230
取出一个批次来确认这一点,
of the training data loader,
40
00:01:34,230 --> 00:01:35,913
这会复现同样的错误。
which reproduces the same error.
41
00:01:36,780 --> 00:01:39,064
无论是检查输入还是进行调试,
Either by inspecting the inputs or debugging,
42
00:01:39,064 --> 00:01:42,870
然后我们可以看到它们的大小并不相同。
we can then see they are not all of the same size.
43
00:01:42,870 --> 00:01:45,120
这是因为我们没有把负责填充的
This is because we have not passed a data collator
44
00:01:45,120 --> 00:01:46,890
数据整理器传给 Trainer,
to do the padding to the Trainer
45
00:01:46,890 --> 00:01:49,443
而且在预处理数据时也没有做填充。
and didn't pad when preprocessing the data either.
46
00:01:50,430 --> 00:01:52,710
在 Trainer 内部进行填充通常是默认行为,
Padding inside the Trainer is normally the default,
47
00:01:52,710 --> 00:01:55,380
但前提是你把 tokenizer 提供给 Trainer,
but only if you provide your tokenizer to the Trainer,
48
00:01:55,380 --> 00:01:57,270
而我们忘了这样做。
and we forgot to do that.
49
00:01:57,270 --> 00:01:59,120
因此,让我们解决问题并再次运行。
So let's fix the issue and run again.
50
00:02:00,510 --> 00:02:02,883
这次我们得到了一个讨厌的 CUDA 错误。
This time we get a nasty CUDA error.
51
00:02:03,765 --> 00:02:06,285
这类错误很难调试,因为首先,
They are very difficult to debug because for one,
52
00:02:06,285 --> 00:02:10,530
它们会让你的内核进入无法恢复的状态,
they put your kernel in a state that is not recoverable
53
00:02:10,530 --> 00:02:13,260
所以你必须从头开始重启你的笔记本
so you have to restart your notebook from the beginning
54
00:02:13,260 --> 00:02:16,950
其次,回溯信息对这类错误完全没用。
and two, the traceback is completely useless for those.
55
00:02:16,950 --> 00:02:19,230
这里的回溯信息告诉我们,错误发生在
Here the traceback tells us the error happens
56
00:02:19,230 --> 00:02:22,500
我们用 loss.backward 计算梯度的时候,
when we do the gradient computation with loss.backward,
57
00:02:22,500 --> 00:02:25,113
但正如我们稍后将看到的那样,情况并非如此。
but as we will see later on that is not the case.
58
00:02:26,520 --> 00:02:28,920
这是因为在 GPU 上发生的一切
This is because everything that happens on the GPU
59
00:02:28,920 --> 00:02:30,720
都是异步完成的。
is done asynchronously.
60
00:02:30,720 --> 00:02:32,880
当你执行模型调用时,
When you execute the model call,
61
00:02:32,880 --> 00:02:34,457
程序所做的只是把这个调用放进
what the program does is just stacking that
62
00:02:34,457 --> 00:02:36,600
GPU 的队列里,
in the queue of the GPU,
63
00:02:36,600 --> 00:02:39,856
这时如果 GPU 没有其他任务在执行,
then if the GPU didn't have any current job to do,
64
00:02:39,856 --> 00:02:41,850
这项工作就会在 GPU 上开始执行,
the work will start on the GPU at the same time
65
00:02:41,850 --> 00:02:45,000
与此同时 CPU 会继续执行下一条指令。
as the CPU moves to the next instruction.
66
00:02:45,000 --> 00:02:47,040
接着是提取损失的操作,
Continuing with the extraction of the loss,
67
00:02:47,040 --> 00:02:49,170
它也被放进 GPU 的队列中,
this is stacked into the GPU queue
68
00:02:49,170 --> 00:02:51,953
而 CPU 则继续执行到 loss.backward 这条指令。
while the CPU moves to the instruction loss.backward.
69
00:02:51,953 --> 00:02:54,180
但是 GPU 还没有完成
But the GPU still hasn't finished
70
00:02:54,180 --> 00:02:55,710
模型的前向传播
the forward pass of the model
71
00:02:55,710 --> 00:02:57,603
因为前面这些操作几乎不花时间。
since all that took no time at all.
72
00:02:58,440 --> 00:03:00,210
这时 CPU 会停下来,
The CPU stops moving forward,
73
00:03:00,210 --> 00:03:03,240
因为 loss.backward 这条指令要求它等待
because loss.backward is an instruction telling it to wait
74
00:03:03,240 --> 00:03:04,830
GPU 完成全部计算,
for the GPUs to be finished,
75
00:03:04,830 --> 00:03:06,780
以确保梯度是正确的。
to make sure the gradients are correct.
76
00:03:07,650 --> 00:03:09,570
当 GPU 遇到错误时,
When the GPU encounters an error,
77
00:03:09,570 --> 00:03:13,140
它会把错误连同一条晦涩的消息返回给 CPU,
it gives it back to the CPU with a cryptic message
78
00:03:13,140 --> 00:03:15,423
而 CPU 会在错误的位置抛出这个错误。
which raises the error at the wrong place.
79
00:03:16,350 --> 00:03:18,720
所以要调试这个问题,我们需要
So to debug this, we will need to execute
80
00:03:18,720 --> 00:03:21,211
在 CPU 上执行训练 pipeline 的后续步骤。
the next steps of the training pipeline on the CPU.
81
00:03:21,211 --> 00:03:22,380
这很容易做到,
It is very easy to do,
82
00:03:22,380 --> 00:03:25,350
而且这次我们得到的回溯信息是可信的。
and we get a traceback we can trust this time.
83
00:03:25,350 --> 00:03:26,520
正如我们之前所说,
As we said before,
84
00:03:26,520 --> 00:03:28,620
错误实际上发生在模型的
the error actually happens during the forward pass
85
00:03:28,620 --> 00:03:29,453
前向传播过程中,
of the model,
86
00:03:29,453 --> 00:03:30,993
而不是 loss.backward。
and not loss.backward.
87
00:03:31,920 --> 00:03:33,680
这是一个索引错误。
It's an index error.
88
00:03:33,680 --> 00:03:34,950
经过一些调试,
With a bit of debugging,
89
00:03:34,950 --> 00:03:37,410
我们发现标签的取值范围是 0 到 2,
we see we have labels ranging from 0 to 2,
90
00:03:37,410 --> 00:03:39,000
也就是三个不同的值,
so three different values,
91
00:03:39,000 --> 00:03:42,191
但我们输出的形状是 batch size 乘 2。
but our outputs have a shape of batch size by 2.
92
00:03:42,191 --> 00:03:45,600
看来我们模型的标签数量设置错了。
It looks like our model has the wrong number of labels.
93
00:03:45,600 --> 00:03:47,190
我们确实可以确认,
We can indeed confirm that,
94
00:03:47,190 --> 00:03:49,860
现在我们知道了原因,在代码中修复它很容易:
and now that we know it's easy to fix it in the code
95
00:03:49,860 --> 00:03:53,969
在创建模型时加上 num_labels=3 即可。
by adding num_labels=3 when we create the model.
96
00:03:53,969 --> 00:03:56,883
现在训练脚本就能一直运行到结束了。
Now the training script will run to completion.
97
00:03:58,440 --> 00:03:59,430
虽然这次还没用到它,
We did not need it yet,
98
00:03:59,430 --> 00:04:00,960
但这里展示了我们会如何调试 pipeline 的
but here is how we would debug the next step
99
00:04:00,960 --> 00:04:02,944
下一步,也就是梯度计算,
of the pipeline, gradient computation,
100
00:04:02,944 --> 00:04:05,850
以及优化器步骤。
as well as the optimizer step.
101
00:04:05,850 --> 00:04:08,823
有了所有这些,祝你在调试自己的训练时好运!
With all of this, good luck debugging your own trainings!