In our other videos, we talked about the basics of fine-tuning a language model with TensorFlow, and as always, when I refer to videos I'll link them below. Still, can we do better?

So here's the code from our model fine-tuning video, and while it works, we could definitely tweak a couple of things. By far the most important thing is the learning rate. In this video we'll talk about how to change it, which will make your training much more consistently successful.

In fact, there are really two things we want to change about the default learning rate for Adam. The first thing we want to change is that it's way too high for our models: by default, Adam uses a learning rate of 10 to the minus 3, 1e-3, and that's very high for training transformer models. We're going to start at 5 times 10 to the minus 5, 5e-5, which is 20 times lower than the default.

And secondly, we don't just want a constant learning rate: we can get even better performance if we decay the learning rate down to a tiny value, or even to zero, over the course of training. That's what this PolynomialDecay schedule here is doing. I'll show you what that decay looks like in a second, but first we need to tell the scheduler how long training is going to be, so that it decays at the right speed, and that's what this code here is doing.
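The code shown on screen isn't included in the transcript, so here is a minimal sketch of what that setup might look like, assuming a batched tf.data.Dataset named train_dataset with known length and a num_epochs variable (both names are assumptions for illustration):

```python
from tensorflow.keras.optimizers.schedules import PolynomialDecay

# Total optimizer steps over the whole run:
# batches per epoch (length of the batched dataset) times the number of epochs.
batches_per_epoch = len(train_dataset)
total_train_steps = batches_per_epoch * num_epochs

# Start at 5e-5 and decay linearly down to zero over the full training run.
lr_scheduler = PolynomialDecay(
    initial_learning_rate=5e-5,
    end_learning_rate=0.0,
    decay_steps=total_train_steps,
)
```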
We're computing how many minibatches the model is going to see over its entire training run, which is the number of batches in the training set, and then we multiply that by the number of epochs to get the total number of batches across the whole training run. Once we know how many training steps we're taking, we just pass all that information to the scheduler and we're ready to go.

What does the polynomial decay schedule look like? Well, it looks like this: it starts at 5e-5, which means 5 times 10 to the minus 5, and then decays down at a constant rate until it hits zero right at the very end of training.

So hold on, I can already hear you yelling at your monitor, and yes, I know, this is actually a constant-rate, linear decay, and I know the name is "polynomial", and you're feeling cheated that, you know, you were promised a polynomial and haven't gotten it. Calm down though, it's okay, because, of course, linear functions are just first-order special cases of the general polynomial functions, and if you tweak the options to this class, you can get a truly polynomial, higher-order decay schedule. But this linear schedule will work fine for us for now; we don't actually need all those fancy tweaks and fancy gadgets.

So coming back, how do we actually use this learning rate schedule once we've created it? It's simple: we just pass it to Adam. The first time we compiled the model, we just passed the string "adam" to get our optimizer.
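For reference, that earlier default-settings compile looked roughly like the sketch below; the loss shown is an assumption for a model that outputs logits, as the Hugging Face TensorFlow models do:

```python
import tensorflow as tf

# Passing the string "adam" gives Keras' default Adam optimizer,
# with its default learning rate of 1e-3, which is too high for transformers.
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
```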
Keras recognizes the names of common optimizers and loss functions if you pass them as strings, so it saves time to do that if you only want the default settings. But now we're professional machine learners, and, you know, that salary review is upcoming, so we've got our very own learning rate schedule, and we're gonna do things properly.

So the first thing we do is import the optimizer, and then we initialize it with the scheduler, which gets passed to the learning rate argument of that optimizer. Now we compile the model with this new optimizer, and again, whatever loss function you want; this is going to be sparse categorical crossentropy if you're following along from the fine-tuning video. And then we're ready to go: now we have a high-performance model, ready for training. All that remains is to fit the model just like we did before.

Remember, because we compiled the model with the new optimizer and the new learning rate schedule, we actually don't need to change anything at all when we call fit. We just call it again, with exactly the same command as before, but now we get a beautiful training run, with a nice, smooth learning rate decay, starting from a good value and decaying down to zero.
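Putting the pieces together, a hedged end-to-end sketch, reusing the assumed lr_scheduler, train_dataset, and num_epochs names from above (validation_dataset is likewise an assumed name):

```python
import tensorflow as tf
from tensorflow.keras.optimizers import Adam

# Pass the schedule object where a fixed learning rate would normally go.
optimizer = Adam(learning_rate=lr_scheduler)

# Compile with the new optimizer; use whatever loss fits your task,
# here sparse categorical crossentropy on logits as in the fine-tuning video.
model.compile(
    optimizer=optimizer,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# The fit call itself is unchanged; only the optimizer behind it is new.
model.fit(train_dataset, validation_data=validation_dataset, epochs=num_epochs)
```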