subtitles/zh-CN/73_debugging-the-training-pipeline-(pytorch).srt

1
00:00:06,210 --> 00:00:08,760
- 在本视频中,我们将了解如何调试错误
- In this video, we will see how to debug an error

2
00:00:08,760 --> 00:00:11,896
也就是你在运行 Trainer.train 时会遇到的错误。
you encounter when running Trainer.train

3
00:00:11,896 --> 00:00:15,066
作为例子,我们将使用这个脚本来微调
As an example, we will use this script that finetunes

4
00:00:15,066 --> 00:00:17,760
在 GLUE MNLI 数据集上的一个 BERT 模型。
a BERT model on the GLUE MNLI dataset.

5
00:00:17,760 --> 00:00:19,470
查看下面链接的视频
Check out the videos linked below

6
00:00:19,470 --> 00:00:21,840
来看看我们是如何得到这样一个脚本的。
to see how we came to such a script.

7
00:00:21,840 --> 00:00:24,540
这里我们要学习的是如何调试其中的问题。
Here we want to learn how to debug the problems in it.

8
00:00:25,470 --> 00:00:28,110
运行这个脚本很快就会报错。
Running the script gives us an error pretty quickly.

9
00:00:28,110 --> 00:00:29,040
它发生在这一行,
It happens at the line

10
00:00:29,040 --> 00:00:30,990
也就是我们把输入送入模型的地方,
where we feed the inputs to the model,

11
00:00:30,990 --> 00:00:32,850
这是回溯信息告诉我们的。
according to the traceback.

12
00:00:32,850 --> 00:00:34,702
这告诉我们问题出在那里,
That tells us there is a problem there,

13
00:00:34,702 --> 00:00:37,881
但问题可能来自许多不同的原因。
but the problem could come from many different causes.

14
00:00:37,881 --> 00:00:39,330
要调试训练中的错误,
To debug an error in a training,

15
00:00:39,330 --> 00:00:41,760
你需要确保训练流水线的每一步
you need to make sure each step of the training pipeline

16
00:00:41,760 --> 00:00:43,440
都按预期工作。
works as intended.

17
00:00:43,440 --> 00:00:45,780
这意味着要检查你的数据集的输入
This means checking that the inputs of your dataset

18
00:00:45,780 --> 00:00:47,040
是正确的,
are correct,

19
00:00:47,040 --> 00:00:48,720
检查你能把它们组成批次,
you can batch them together,

20
00:00:48,720 --> 00:00:50,790
能把它们送入模型并得到损失,
feed them through the model to get a loss,

21
00:00:50,790 --> 00:00:52,500
然后能计算该损失的梯度,
then compute the gradients of that loss

22
00:00:52,500 --> 00:00:54,303
再执行优化器步骤。
before performing an optimizer step.

23
00:00:55,470 --> 00:00:57,810
因此,让我们先来看看
So let's start by looking at the training dataset

24
00:00:57,810 --> 00:00:59,043
这个 Trainer 正在使用的训练数据集。
this Trainer is using.

25
00:00:59,910 --> 00:01:02,190
这里显然有问题。
There is definitely a problem here.

26
00:01:02,190 --> 00:01:04,293
我们看到的是文本而不是数字。
We see texts and not numbers.

27
00:01:05,130 --> 00:01:06,660
错误消息告诉我们,模型
The error message was telling us the model

28
00:01:06,660 --> 00:01:08,220
没有得到 input IDs,
did not get input IDs

29
00:01:08,220 --> 00:01:11,100
而我们的数据集中确实没有它们。
and we do not have those in the dataset indeed.

30
00:01:11,100 --> 00:01:12,660
回头看我们的代码,
Looking back at our code,

31
00:01:12,660 --> 00:01:14,400
可以看到我们犯了一个错误,
we can see we made a mistake

32
00:01:14,400 --> 00:01:17,400
把错误的数据集传给了 Trainer。
and passed the wrong datasets to the Trainer.

33
00:01:17,400 --> 00:01:19,173
所以让我们修复它并再次运行。
So let's fix that and run again.

34
00:01:20,490 --> 00:01:21,840
现在我们遇到了一个新的错误。
Now we have a new error.

35
00:01:21,840 --> 00:01:23,130
检查回溯信息
Inspecting the traceback

36
00:01:23,130 --> 00:01:25,860
可以知道错误发生在我们尝试创建批次的时候,
tells us it happens when we try to create a batch,

37
00:01:25,860 --> 00:01:28,743
具体来说是在把特征组合成一个张量的时候。
specifically to group the features in a tensor.

38
00:01:29,700 --> 00:01:32,610
我们可以让 Trainer 给我们取一个批次来确认这一点,
We can confirm this by asking the Trainer to get us a batch

39
00:01:32,610 --> 00:01:34,230
这个批次来自训练数据加载器,
of the training data loader,

40
00:01:34,230 --> 00:01:35,913
这样就复现了相同的错误。
which reproduces the same error.
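下面是对上面两项检查的一个最小示意:查看 Trainer 实际使用的训练数据集,以及让它构建一个批次。这里假设 `trainer` 就是视频脚本中构建的那个对象(变量名仅作示意),`Trainer.get_train_dataloader` 是 transformers 中重建训练数据加载器的方法。
A minimal sketch of the two checks described above, assuming `trainer` is the object built in the video's script (the variable name is illustrative); `Trainer.get_train_dataloader` is the transformers method that rebuilds the training dataloader.

```py
# 查看 Trainer 将要训练的数据:如果这里看到的是原始文本而不是 input_ids,
# 说明传给 Trainer 的数据集不对。
# Look at what the Trainer will actually train on: raw text here instead of
# input_ids means the wrong dataset was passed.
print(trainer.train_dataset[0])

# 像 Trainer 一样构建一个批次,在完整训练循环之外复现整理(collation)错误。
# Build one batch the way the Trainer would, reproducing collation errors
# outside of the full training loop.
batch = next(iter(trainer.get_train_dataloader()))
print({k: v.shape for k, v in batch.items()})
```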
41
00:01:36,780 --> 00:01:39,064
通过检查输入或者调试,
Either by inspecting the inputs or debugging,

42
00:01:39,064 --> 00:01:42,870
我们可以看到它们的大小并不相同。
we can then see they are not all of the same size.

43
00:01:42,870 --> 00:01:45,120
这是因为我们没有把数据整理器
This is because we have not passed a data collator

44
00:01:45,120 --> 00:01:46,890
传给 Trainer 来做填充,
to do the padding to the Trainer

45
00:01:46,890 --> 00:01:49,443
而且在预处理数据时也没有做填充。
and didn't pad when preprocessing the data either.

46
00:01:50,430 --> 00:01:52,710
在 Trainer 内部做填充通常是默认行为,
Padding inside the Trainer is normally the default,

47
00:01:52,710 --> 00:01:55,380
但前提是你把 tokenizer 提供给了 Trainer,
but only if you provide your tokenizer to the Trainer,

48
00:01:55,380 --> 00:01:57,270
而我们忘了这样做。
and we forgot to do that.

49
00:01:57,270 --> 00:01:59,120
因此,让我们修复这个问题并再次运行。
So let's fix the issue and run again.

50
00:02:00,510 --> 00:02:02,883
这次我们得到了一个讨厌的 CUDA 错误。
This time we get a nasty CUDA error.

51
00:02:03,765 --> 00:02:06,285
这类错误很难调试,因为首先,
They are very difficult to debug because for one,

52
00:02:06,285 --> 00:02:10,530
它们会把你的内核置于无法恢复的状态,
they put your kernel in a state that is not recoverable

53
00:02:10,530 --> 00:02:13,260
所以你必须从头重启你的 notebook;
so you have to restart your notebook from the beginning

54
00:02:13,260 --> 00:02:16,950
其次,这种情况下回溯信息完全没用。
and two, the traceback is completely useless for those.

55
00:02:16,950 --> 00:02:19,230
这里的回溯告诉我们,错误发生在
Here the traceback tells us the error happens

56
00:02:19,230 --> 00:02:22,500
我们用 loss.backward 计算梯度的时候,
when we do the gradient computation with loss.backward,

57
00:02:22,500 --> 00:02:25,113
但正如我们稍后会看到的,事实并非如此。
but as we will see later on that is not the case.

58
00:02:26,520 --> 00:02:28,920
这是因为在 GPU 上发生的一切
This is because everything that happens on the GPU

59
00:02:28,920 --> 00:02:30,720
都是异步完成的。
is done asynchronously.

60
00:02:30,720 --> 00:02:32,880
当你执行模型调用时,
When you execute the model call,

61
00:02:32,880 --> 00:02:34,457
程序所做的只是把这个调用
what the program does is just stacking that

62
00:02:34,457 --> 00:02:36,600
放进 GPU 的队列里,
in the queue of the GPU,

63
00:02:36,600 --> 00:02:39,856
然后,如果 GPU 当前没有别的工作要做,
then if the GPU didn't have any current job to do,

64
00:02:39,856 --> 00:02:41,850
这项工作就会在 GPU 上开始,
the work will start on the GPU at the same time

65
00:02:41,850 --> 00:02:45,000
与此同时 CPU 已经转到下一条指令。
as the CPU moves to the next instruction.

66
00:02:45,000 --> 00:02:47,040
接下来是提取损失这一步,
Continuing with the extraction of the loss,

67
00:02:47,040 --> 00:02:49,170
它也被放进 GPU 的队列,
this is stacked into the GPU queue

68
00:02:49,170 --> 00:02:51,953
同时 CPU 转到 loss.backward 这条指令。
while the CPU moves to the instruction loss.backward.

69
00:02:51,953 --> 00:02:54,180
但是 GPU 还没有完成
But the GPU still hasn't finished

70
00:02:54,180 --> 00:02:55,710
模型的前向传播,
the forward pass of the model

71
00:02:55,710 --> 00:02:57,603
因为前面这些操作几乎不花时间。
since all that took no time at all.

72
00:02:58,440 --> 00:03:00,210
这时 CPU 停止继续前进,
The CPU stops moving forward,

73
00:03:00,210 --> 00:03:03,240
因为 loss.backward 这条指令要求它等待
because loss.backward is an instruction telling it to wait

74
00:03:03,240 --> 00:03:04,830
GPU 完成工作,
for the GPUs to be finished,

75
00:03:04,830 --> 00:03:06,780
以确保梯度是正确的。
to make sure the gradients are correct.

76
00:03:07,650 --> 00:03:09,570
当 GPU 遇到错误时,
When the GPU encounters an error,

77
00:03:09,570 --> 00:03:13,140
它会把错误连同一条难以理解的消息返回给 CPU,
it gives it back to the CPU with a cryptic message

78
00:03:13,140 --> 00:03:15,423
而 CPU 就在错误的位置抛出了这个错误。
who raises the error at the wrong place.
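这里是上面提到的 CPU 调试技巧的一个示意(基于同样的假设):取出一个批次,把模型和数据都移到 CPU 上重新跑前向和反向,这样回溯信息就会指向真正出错的那一行。
A sketch of the CPU debugging trick mentioned above, under the same assumptions: pull out one batch, move the model and data to the CPU, and rerun the forward and backward passes there so the traceback points at the real failing line.

```py
# 取出一个批次并移到 CPU。 / Grab one batch and move it to the CPU.
batch = next(iter(trainer.get_train_dataloader()))
batch = {k: v.cpu() for k, v in batch.items()}

# 把模型也移到 CPU:在这里,前向传播和损失会给出可读的错误,
# 而不是神秘的 CUDA 报错。
# Move the model to the CPU too; the forward pass and the loss now fail with
# a readable error instead of a cryptic CUDA one.
model = trainer.model.cpu()
outputs = model(**batch)
outputs.loss.backward()
```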
79
00:03:16,350 --> 00:03:18,720
所以要调试这个问题,我们需要
So to debug this, we will need to execute

80
00:03:18,720 --> 00:03:21,211
在 CPU 上执行训练流水线的后续步骤。
the next steps of the training pipeline on the CPU.

81
00:03:21,211 --> 00:03:22,380
这很容易做到,
It is very easy to do,

82
00:03:22,380 --> 00:03:25,350
而且这次我们得到了可以信赖的回溯信息。
and we get a traceback we can trust this time.

83
00:03:25,350 --> 00:03:26,520
正如我们之前所说,
As we said before,

84
00:03:26,520 --> 00:03:28,620
错误实际上发生在模型的
the error actually happens during the forward pass

85
00:03:28,620 --> 00:03:29,453
前向传播过程中,
of the model,

86
00:03:29,453 --> 00:03:30,993
而不是在 loss.backward 里。
and not loss.backward.

87
00:03:31,920 --> 00:03:33,680
这是一个索引错误。
It's an index error.

88
00:03:33,680 --> 00:03:34,950
经过一些调试,
With a bit of debugging,

89
00:03:34,950 --> 00:03:37,410
我们看到标签的取值从 0 到 2,
we see we have labels ranging from 0 to 2,

90
00:03:37,410 --> 00:03:39,000
也就是三个不同的值,
so three different values,

91
00:03:39,000 --> 00:03:42,191
但我们输出的形状是 batch size 乘 2。
but our outputs have a shape of batch size by 2.

92
00:03:42,191 --> 00:03:45,600
看起来我们的模型的标签数量设置错了。
It looks like our model has the wrong number of labels.

93
00:03:45,600 --> 00:03:47,190
我们确实可以确认这一点,
We can indeed confirm that,

94
00:03:47,190 --> 00:03:49,860
而既然知道了原因,在代码中修复它就很容易:
and now that we know it's easy to fix it in the code

95
00:03:49,860 --> 00:03:53,969
只需在创建模型时加上 num_labels=3。
by adding num_labels=3 when we create the model.

96
00:03:53,969 --> 00:03:56,883
现在训练脚本就可以一直运行到结束了。
Now the training script will run to completion.

97
00:03:58,440 --> 00:03:59,430
虽然这次我们还没有用到,
We did not need it yet,

98
00:03:59,430 --> 00:04:00,960
但这里展示了我们会如何调试流水线的下一步,
but here is how we would debug the next step

99
00:04:00,960 --> 00:04:02,944
也就是梯度计算,
of the pipeline, gradient computation,

100
00:04:02,944 --> 00:04:05,850
以及优化器步骤。
as well as the optimizer step.

101
00:04:05,850 --> 00:04:08,823
掌握了这些,祝你在调试自己的训练时好运!
With all of this, good luck debugging your own trainings!
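最后是对上面两处的示意:用 num_labels=3 修复模型,以及手动执行梯度计算和优化器步骤。其中 `checkpoint` 假定为脚本中使用的模型名称,`Trainer.create_optimizer` 是 transformers 提供的构建优化器的辅助方法;这只是一种可能的写法,并非唯一做法。
Finally, a sketch of the fix and of the last two checks above: creating the model with num_labels=3, then exercising the gradient computation and the optimizer step by hand. `checkpoint` is assumed to be the model name used in the script, and `Trainer.create_optimizer` is the transformers helper that builds the optimizer; this is one possible way to do it, not the only one.

```py
from transformers import AutoModelForSequenceClassification

# MNLI 有三个类别,所以要告诉分类头。 / MNLI has three classes, so tell the head.
# (在脚本中,这个修复后的模型会被传给 Trainer。
#  In the script, this fixed model is then passed to the Trainer.)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

# 手动执行流水线剩下的步骤:梯度计算和优化器更新。
# Manually run the remaining pipeline steps: gradient computation and optimizer update.
batch = next(iter(trainer.get_train_dataloader()))
batch = {k: v.to(trainer.model.device) for k, v in batch.items()}
outputs = trainer.model(**batch)
outputs.loss.backward()           # 梯度计算 / gradient computation

trainer.create_optimizer()        # 构建 trainer.optimizer / builds trainer.optimizer
trainer.optimizer.step()          # 优化器步骤 / optimizer step
```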