subtitles/zh-CN/29_write-your-training-loop-in-pytorch.srt

1
00:00:00,298 --> 00:00:01,511
(空气呼啸)
(air whooshing)

2
00:00:01,511 --> 00:00:02,769
(笑脸弹出)
(smiley face popping)

3
00:00:02,769 --> 00:00:05,460
(空气呼啸)
(air whooshing)

4
00:00:05,460 --> 00:00:08,486
使用 PyTorch 编写你自己的训练循环。
Write your own training loop with PyTorch.

5
00:00:08,486 --> 00:00:09,960
在本视频中,我们将了解
In this video, we'll look at

6
00:00:09,960 --> 00:00:12,750
我们如何进行与训练器视频中相同的微调,
how we can do the same fine-tuning as in the Trainer video,

7
00:00:12,750 --> 00:00:14,760
但不依赖那个类。
but without relying on that class.

8
00:00:14,760 --> 00:00:17,790
这样,你就可以根据你的需要轻松自定义
This way, you'll be able to easily customize each step

9
00:00:17,790 --> 00:00:20,310
训练循环的每个步骤。
of the training loop to your needs.

10
00:00:20,310 --> 00:00:21,660
这个也很有用
This is also very useful

11
00:00:21,660 --> 00:00:22,740
对于手动调试
to manually debug something

12
00:00:22,740 --> 00:00:24,590
Trainer API 出现的问题。
that went wrong with the Trainer API.

13
00:00:26,220 --> 00:00:28,020
在我们深入研究代码之前,
Before we dive into the code,

14
00:00:28,020 --> 00:00:30,481
这是训练循环的草图。
here is a sketch of a training loop.

15
00:00:30,481 --> 00:00:33,381
我们获取一批训练数据并将其提供给模型。
We take a batch of training data and feed it to the model.

16
00:00:34,223 --> 00:00:36,960
有了标签,我们就可以计算损失。
With the labels, we can then compute a loss.

17
00:00:36,960 --> 00:00:39,316
这个数字本身没有用,
That number is not useful on its own,

18
00:00:39,316 --> 00:00:40,260
它是用于计算
but is used to compute

19
00:00:40,260 --> 00:00:42,150
我们模型权重的梯度,
the gradients of our model weights,

20
00:00:42,150 --> 00:00:43,440
即,损失关于
that is the derivative of the loss

21
00:00:44,610 --> 00:00:47,160
每个模型权重的导数。
with respect to each model weight.

22
00:00:47,160 --> 00:00:49,800
然后优化器使用这些梯度
Those gradients are then used by the optimizer

23
00:00:49,800 --> 00:00:51,210
更新模型权重,
to update the model weights,

24
00:00:51,210 --> 00:00:53,550
让它们变得更好一点。
and make them a little bit better.

25
00:00:53,550 --> 00:00:54,510
然后我们重复这个过程
We then repeat the process

26
00:00:54,510 --> 00:00:56,880
使用一批新的训练数据。
with a new batch of training data.

27
00:00:56,880 --> 00:00:58,620
如果有任何不清楚的地方,
If any of this isn't clear,

28
00:00:58,620 --> 00:01:00,270
不要犹豫,复习一下
don't hesitate to take a refresher

29
00:01:00,270 --> 00:01:02,170
你最喜欢的深度学习课程。
on your favorite deep learning course.

30
00:01:03,210 --> 00:01:06,000
我们将在这里再次使用 GLUE MRPC 数据集,
We'll use the GLUE MRPC dataset here again,

31
00:01:06,000 --> 00:01:07,680
我们已经看到了如何预处理数据
and we've seen how to preprocess the data

32
00:01:07,680 --> 00:01:11,130
使用 Datasets 库和动态填充。
using the Datasets library with dynamic padding.

33
00:01:11,130 --> 00:01:12,630
查看下面链接的视频
Check out the videos linked below

34
00:01:12,630 --> 00:01:14,280
如果你还没有看过它们。
if you haven't seen them already.

35
00:01:15,480 --> 00:01:18,930
完成后,我们只需要定义 PyTorch DataLoaders
With this done, we only have to define PyTorch DataLoaders

36
00:01:18,930 --> 00:01:20,610
它们将负责转换
which will be responsible to convert

37
00:01:20,610 --> 00:01:23,253
我们数据集的元素到批次数据中。
the elements of our dataset into batches.

38
00:01:24,450 --> 00:01:27,960
我们将 DataCollatorWithPadding 用作整理函数,
We use our DataCollatorWithPadding as a collate function,

39
00:01:27,960 --> 00:01:29,460
并打乱训练集的次序
and shuffle the training set

40
00:01:29,460 --> 00:01:31,080
确保我们不会在每个纪元
to make sure we don't go over the samples

41
00:01:31,080 --> 00:01:33,870
都以相同的顺序遍历样本。
in the same order at each epoch.
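A minimal sketch of the DataLoader setup described in the cues above, assuming the GLUE MRPC dataset has already been tokenized as in the earlier videos; the names `checkpoint` and `tokenized_datasets` are assumptions used only for illustration:

# Assumed setup: `tokenized_datasets` is the tokenized GLUE MRPC dataset from the
# preprocessing videos, with the columns the model does not expect already removed.
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, DataCollatorWithPadding

checkpoint = "bert-base-uncased"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)  # dynamic padding per batch

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)

# Grab one batch to check everything works: each value should be a tensor
# of shape (batch_size, sequence_length).
batch = next(iter(train_dataloader))
print({k: v.shape for k, v in batch.items()})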
42
00:01:33,870 --> 00:01:36,390
要检查一切是否按预期工作,
To check that everything works as intended,

43
00:01:36,390 --> 00:01:38,883
我们尝试获取一批数据,并对其进行检查。
we try to grab a batch of data, and inspect it.

44
00:01:40,080 --> 00:01:43,050
就像我们的数据集元素一样,它是一个字典,
Like our dataset elements, it's a dictionary,

45
00:01:43,050 --> 00:01:46,260
但这次的值不是一个整数列表
but this time the values are not a single list of integers

46
00:01:46,260 --> 00:01:49,053
而是形状为批量大小乘以序列长度的张量。
but a tensor of shape batch size by sequence length.

47
00:01:50,460 --> 00:01:53,580
下一步是将训练数据送入我们的模型。
The next step is to send the training data in our model.

48
00:01:53,580 --> 00:01:56,730
为此,我们需要实际创建一个模型。
For that, we'll need to actually create a model.

49
00:01:56,730 --> 00:01:58,740
如模型 API 视频中所示,
As seen in the Model API video,

50
00:01:58,740 --> 00:02:00,540
我们使用 from_pretrained 方法,
we use the from_pretrained method,

51
00:02:00,540 --> 00:02:03,270
并将标签数量调整为这个数据集
and adjust the number of labels to the number of classes

52
00:02:03,270 --> 00:02:06,810
拥有的类别数量,这里是 2。
we have on this dataset, here two.

53
00:02:06,810 --> 00:02:08,940
再次确保一切顺利,
Again to be sure everything is going well,

54
00:02:08,940 --> 00:02:11,100
我们将我们抓取的批次传递给我们的模型,
we pass the batch we grabbed to our model,

55
00:02:11,100 --> 00:02:13,320
并检查没有错误。
and check there is no error.

56
00:02:13,320 --> 00:02:14,940
如果提供了标签,
If the labels are provided,

57
00:02:14,940 --> 00:02:16,590
Transformers 库的模型
the models of the Transformers library

58
00:02:16,590 --> 00:02:18,273
总是直接返回损失。
always return a loss directly.

59
00:02:19,525 --> 00:02:21,090
我们将能够做损失的反向传播
We will be able to do loss.backward()

60
00:02:21,090 --> 00:02:22,860
以计算所有梯度,
to compute all the gradients,

61
00:02:22,860 --> 00:02:26,460
然后需要一个优化器来完成训练步骤。
and will then need an optimizer to do the training step.

62
00:02:26,460 --> 00:02:28,860
我们在这里使用 AdamW 优化器,
We use the AdamW optimizer here,

63
00:02:28,860 --> 00:02:31,440
这是具有适当权重衰减的 Adam 变体,
which is a variant of Adam with proper weight decay,

64
00:02:31,440 --> 00:02:33,840
但你可以选择任何你喜欢的 PyTorch 优化器。
but you can pick any PyTorch optimizer you like.

65
00:02:34,830 --> 00:02:36,150
使用之前的损失,
Using the previous loss,

66
00:02:36,150 --> 00:02:39,060
并使用 loss.backward() 计算梯度,
and computing the gradients with loss.backward(),

67
00:02:39,060 --> 00:02:41,130
我们检查我们是否可以无误地
we check that we can do the optimizer step

68
00:02:41,130 --> 00:02:42,030
执行优化器步骤。
without any error.

69
00:02:43,380 --> 00:02:45,870
之后不要忘记将梯度归零,
Don't forget to zero your gradient afterwards,

70
00:02:45,870 --> 00:02:46,890
否则在下一步,
or at the next step,

71
00:02:46,890 --> 00:02:49,343
它们将被添加到你计算的梯度中。
they will get added to the gradients you computed.

72
00:02:50,490 --> 00:02:52,080
我们已经可以编写我们的训练循环,
We could already write our training loop,

73
00:02:52,080 --> 00:02:53,220
但我们还要再做两件事
but we will add two more things

74
00:02:53,220 --> 00:02:55,620
使它尽可能好。
to make it as good as it can be.

75
00:02:55,620 --> 00:02:57,690
第一个是学习率调度器,
The first one is a learning rate scheduler,

76
00:02:57,690 --> 00:03:00,140
逐步将我们的学习率降低到零。
to progressively decay our learning rate to zero.

77
00:03:01,195 --> 00:03:04,590
Transformers 库中的 get_scheduler 函数
The get_scheduler function from the Transformers library

78
00:03:04,590 --> 00:03:06,150
只是一个便捷函数,
is just a convenience function

79
00:03:06,150 --> 00:03:07,800
用来轻松构建这样的调度器。
to easily build such a scheduler.
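Continuing the sketch above (reusing `checkpoint`, `batch` and `train_dataloader`), roughly how the model, optimizer, and scheduler described in these cues could be set up; the learning rate and number of epochs are illustrative choices, not values prescribed by the video:

from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, get_scheduler

# Two labels/classes for MRPC (equivalent / not equivalent).
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Sanity check: with labels in the batch, the model returns the loss directly.
outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)

optimizer = AdamW(model.parameters(), lr=5e-5)

# One trial step to check that backward + optimizer run without error.
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()  # otherwise these gradients get added to the next ones

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)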
80
00:03:08,850 --> 00:03:09,683
同样,你也可以改用
You can again use

81
00:03:09,683 --> 00:03:11,860
任何 PyTorch 学习率调度器。
any PyTorch learning rate scheduler instead.

82
00:03:13,110 --> 00:03:14,850
最后,如果我们想要我们的训练
Finally, if we want our training

83
00:03:14,850 --> 00:03:17,610
花几分钟而不是几个小时,
to take a couple of minutes instead of a few hours,

84
00:03:17,610 --> 00:03:19,530
我们需要使用 GPU。
we will need to use a GPU.

85
00:03:19,530 --> 00:03:21,270
第一步是得到一个,
The first step is to get one,

86
00:03:21,270 --> 00:03:23,283
例如通过使用 Colab 笔记本。
for instance by using a Colab notebook.

87
00:03:24,180 --> 00:03:26,040
然后你需要实际发送你的模型
Then you need to actually send your model,

88
00:03:26,040 --> 00:03:28,923
和训练数据,即使用 torch device 将它们放到 GPU 上。
and training data on it by using a torch device.

89
00:03:29,790 --> 00:03:30,840
仔细检查以下代码行
Double-check the following lines

90
00:03:30,840 --> 00:03:32,340
是否为你打印出了 CUDA 设备,
print a CUDA device for you,

91
00:03:32,340 --> 00:03:35,640
否则就要做好训练耗时超过一个小时的准备。
or be prepared for your training to last more than an hour.

92
00:03:35,640 --> 00:03:37,390
我们现在可以把所有东西放在一起。
We can now put everything together.

93
00:03:38,550 --> 00:03:40,860
首先,我们将模型置于训练模式
First, we put our model in training mode

94
00:03:40,860 --> 00:03:42,240
这将为某些层激活训练行为,
which will activate the training behavior

95
00:03:42,240 --> 00:03:44,790
例如 Dropout。
for some layers, like Dropout.

96
00:03:44,790 --> 00:03:46,860
然后遍历我们选择的纪元数,
Then go through the number of epochs we picked,

97
00:03:46,860 --> 00:03:50,070
以及我们训练数据加载器中的所有数据。
and all the data in our training dataloader.

98
00:03:50,070 --> 00:03:52,410
然后我们执行我们已经看到的所有步骤;
Then we go through all the steps we have seen already;

99
00:03:52,410 --> 00:03:54,240
将数据发送到 GPU,
send the data to the GPU,

100
00:03:54,240 --> 00:03:55,560
计算模型输出,
compute the model outputs,

101
00:03:55,560 --> 00:03:57,720
尤其是损失。
and in particular the loss.

102
00:03:57,720 --> 00:03:59,850
使用损失来计算梯度,
Use the loss to compute gradients,

103
00:03:59,850 --> 00:04:02,880
然后使用优化器执行一个训练步骤。
then make a training step with the optimizer.

104
00:04:02,880 --> 00:04:04,500
在我们的调度器中更新学习率
Update the learning rate in our scheduler

105
00:04:04,500 --> 00:04:05,970
以用于下一次迭代,
for the next iteration,

106
00:04:05,970 --> 00:04:07,763
并将优化器的梯度归零。
and zero the gradients of the optimizer.

107
00:04:09,240 --> 00:04:10,500
一旦完成,
Once this is finished,

108
00:04:10,500 --> 00:04:12,150
我们可以很容易地评估我们的模型
we can evaluate our model very easily

109
00:04:12,150 --> 00:04:14,283
使用 Datasets 库中的指标。
with a metric from the Datasets library.

110
00:04:15,180 --> 00:04:17,880
首先,我们将模型置于评估模式,
First, we put our model in evaluation mode,

111
00:04:17,880 --> 00:04:20,550
以停用像 Dropout 这样的层,
to deactivate layers like Dropout,

112
00:04:20,550 --> 00:04:23,850
然后遍历评估数据加载器中的所有数据。
then go through all the data in the evaluation data loader.

113
00:04:23,850 --> 00:04:25,530
正如我们在训练器视频中看到的那样,
As we have seen in the Trainer video,

114
00:04:25,530 --> 00:04:26,850
模型输出 logits,
the model outputs logits,

115
00:04:26,850 --> 00:04:28,530
我们需要应用 argmax 函数
and we need to apply the argmax function

116
00:04:28,530 --> 00:04:30,213
将它们转化为预测。
to convert them into predictions.

117
00:04:31,350 --> 00:04:33,420
然后度量对象有一个 add_batch 方法,
The metric object then has an add_batch method,

118
00:04:33,420 --> 00:04:36,810
我们可以用来向它发送那些中间预测。
we can use to send it those intermediate predictions.
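Putting these cues together, a compact version of the training loop might look like the following (reusing the objects defined in the sketches above; the progress bar shown in the video is omitted):

import torch

# Use a GPU if one is available; expect "cuda" here, otherwise training will be slow.
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(device)
model.to(device)

model.train()  # activates the training behavior of layers like Dropout
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}  # send the data to the GPU
        outputs = model(**batch)   # forward pass; the loss comes from the labels
        outputs.loss.backward()    # compute the gradients
        optimizer.step()           # optimizer step to update the weights
        lr_scheduler.step()        # update the learning rate for the next iteration
        optimizer.zero_grad()      # zero the gradients of the optimizer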
119
00:04:36,810 --> 00:04:38,700
一旦评估循环完成,
Once the evaluation loop is finished,

120
00:04:38,700 --> 00:04:40,320
我们只需要调用 compute 方法
we just have to call the compute method

121
00:04:40,320 --> 00:04:42,180
得到我们的最终结果。
to get our final results.

122
00:04:42,180 --> 00:04:44,490
恭喜,你现在已经微调了一个模型
Congratulations, you have now fine-tuned a model

123
00:04:44,490 --> 00:04:45,633
全靠你自己。
all by yourself.

124
00:04:47,253 --> 00:04:49,920
(空气呼啸)
(air whooshing)
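A sketch of the evaluation loop from the last cues, continuing from the code above. The video uses a metric from the Datasets library; this sketch assumes the current `evaluate` package, which exposes the same `add_batch` / `compute` interface:

import evaluate

metric = evaluate.load("glue", "mrpc")

model.eval()  # deactivates the training behavior of layers like Dropout
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)
    predictions = outputs.logits.argmax(dim=-1)  # turn the logits into predictions
    metric.add_batch(predictions=predictions, references=batch["labels"])

print(metric.compute())  # final results, e.g. accuracy and F1 for MRPC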