subtitles/zh-CN/27_learning-rate-scheduling-with-tensorflow.srt
1
00:00:00,288 --> 00:00:02,639
(画面沙沙作响)
(screen swishing)
2
00:00:02,639 --> 00:00:05,190
(文字嗖嗖)
(text swishing)
3
00:00:05,190 --> 00:00:06,780
在我们的其他视频中,
In our other videos,
4
00:00:06,780 --> 00:00:08,280
我们讨论了
we talked about the basics
5
00:00:08,280 --> 00:00:11,610
使用 TensorFlow 微调语言模型的基础知识,
of fine-tuning a language model with TensorFlow,
6
00:00:11,610 --> 00:00:15,030
和往常一样,当我提到视频时,我会在下面链接它们。
and as always, when I refer to videos I'll link them below.
7
00:00:15,030 --> 00:00:17,610
不过,我们可以做得更好吗?
Still, can we do better?
8
00:00:17,610 --> 00:00:20,700
这是我们模型微调视频中的代码,
So here's the code from our model fine-tuning video,
9
00:00:20,700 --> 00:00:21,600
虽然这段代码可以正常运行,
and while it works,
10
00:00:21,600 --> 00:00:24,390
但我们肯定还可以调整几个地方。
we could definitely tweak a couple of things.
11
00:00:24,390 --> 00:00:27,540
其中最重要的就是学习率。
By far the most important thing is the learning rate.
12
00:00:27,540 --> 00:00:29,940
在本视频中,我们将讨论如何更改它,
In this video we'll talk about how to change it,
13
00:00:29,940 --> 00:00:31,080
这将使你的训练
which will make your training
14
00:00:31,080 --> 00:00:33,303
更加稳定地取得成功。
much more consistently successful.
15
00:00:34,440 --> 00:00:37,320
其实,关于 Adam 的默认学习率,
In fact, really there are two things
16
00:00:37,320 --> 00:00:40,530
我们有两点想要修改。
we want to change about the default learning rate for Adam.
17
00:00:40,530 --> 00:00:42,720
所以我们首先要改变的一点
So the first thing we want to change
18
00:00:42,720 --> 00:00:45,630
是它对我们的模型来说太高了,
is that it's way too high for our models,
19
00:00:45,630 --> 00:00:48,030
默认情况下,Adam 使用的学习率
by default, Adam uses a learning rate
20
00:00:48,030 --> 00:00:51,540
是 10 的负 3 次方,即 1e-3,
of 10 to the minus 3, 1e-3,
21
00:00:51,540 --> 00:00:54,660
这对于训练 transformer 模型来说非常高。
and that's very high for training transformer models.
22
00:00:54,660 --> 00:00:58,200
我们将从 5 乘以 10 的负 5 次方开始,
We're going to start at 5 by 10 to the minus 5,
23
00:00:58,200 --> 00:01:02,700
即 5e-5,是默认值的二十分之一。
5e-5, which is 20 times lower than the default.
24
00:01:02,700 --> 00:01:06,330
其次,我们不只是想要一个恒定的学习率,
And secondly, we don't just want a constant learning rate,
25
00:01:06,330 --> 00:01:07,950
我们可以获得更好的性能
we can get even better performance
26
00:01:07,950 --> 00:01:11,160
只要我们把学习率逐渐衰减到一个很小的值,
if we decay the learning rate down to a tiny value,
27
00:01:11,160 --> 00:01:13,920
甚至在训练过程中一路衰减到零。
or even to zero, over the course of training.
28
00:01:13,920 --> 00:01:15,510
这就是这里的东西,
So that's what this thing here,
29
00:01:15,510 --> 00:01:18,540
也就是这个 Polynomial Decay 调度(schedule)在做的事情。
this Polynomial Decay schedule thing is doing.
30
00:01:18,540 --> 00:01:21,570
等会儿我会告诉你衰减是什么样子的,
So I'll show you what that decay looks like in a second,
31
00:01:21,570 --> 00:01:23,160
但首先我们需要告诉调度器
but first we need to tell the scheduler
32
00:01:23,160 --> 00:01:25,290
训练将持续多长时间,
how long training is going to be,
33
00:01:25,290 --> 00:01:27,450
以便它以正确的速度衰减,
so that it decays at the right speed,
34
00:01:27,450 --> 00:01:29,450
这就是这里的代码所做的。
and that's what this code here is doing.
35
00:01:30,300 --> 00:01:32,280
我们先计算模型在整个训练过程中
We're computing how many minibatches
36
00:01:32,280 --> 00:01:35,520
总共会看到多少个小批量(minibatch),
the model is going to see over its entire training run,
37
00:01:35,520 --> 00:01:37,950
也就是训练集的大小(按 batch 数计),
which is the size of the training set,
38
00:01:37,950 --> 00:01:39,570
然后我们将它乘以
and then we multiply that
39
00:01:39,570 --> 00:01:41,220
epoch 的数量,
by the number of epochs
40
00:01:41,220 --> 00:01:42,930
就得到了整个训练过程中
to get the total number of batches
41
00:01:42,930 --> 00:01:45,060
的 batch 总数。
across the whole training run.
42
00:01:45,060 --> 00:01:47,880
一旦知道了总共要进行多少训练步数,
Once we know how many training steps we're taking,
43
00:01:47,880 --> 00:01:50,580
我们只需把这些信息传给调度器,
we just pass all that information to the scheduler
44
00:01:50,580 --> 00:01:51,783
我们准备好了。
and we're ready to go.
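下面给出一段最小的示意代码,按照上面的描述计算训练步数并创建衰减调度;其中 tf_train_dataset、num_epochs 等名称只是假设的占位符,并非视频中的原始变量名。
A minimal sketch of computing the training steps and creating the decay schedule as described above; names like tf_train_dataset and num_epochs are assumed placeholders, not necessarily the exact variables from the video.

from tensorflow.keras.optimizers.schedules import PolynomialDecay

# 假设 tf_train_dataset 是一个已按 batch 划分的 tf.data.Dataset,
# len() 给出每个 epoch 的 batch 数(变量名仅为示意)
num_epochs = 3
num_train_steps = len(tf_train_dataset) * num_epochs

# 从 5e-5 开始,默认 power=1.0,即线性衰减到 0
lr_schedule = PolynomialDecay(
    initial_learning_rate=5e-5,
    end_learning_rate=0.0,
    decay_steps=num_train_steps,
)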
45
00:01:53,110 --> 00:01:57,510
那么,多项式衰减调度是什么样的呢?
What does the polynomial decay schedule look like?
46
00:01:57,510 --> 00:01:59,610
嗯,看起来像这样,
Well, it looks like this,
47
00:01:59,610 --> 00:02:02,160
它从 5e-5 开始,
it starts at 5e-5,
48
00:02:02,160 --> 00:02:05,490
也就是 5 乘以 10 的负 5 次方,
which means 5 times 10 to the minus 5,
49
00:02:05,490 --> 00:02:08,190
然后以恒定速率衰减
and then decays down at a constant rate
50
00:02:08,190 --> 00:02:11,310
直到它在训练结束时达到零。
until it hits zero right at the very end of training.
51
00:02:11,310 --> 00:02:13,200
等一下,我仿佛已经听到你
So hold on, I can already hear you
52
00:02:13,200 --> 00:02:14,640
在对着显示器大喊大叫了,
yelling at your monitor, though,
53
00:02:14,640 --> 00:02:16,020
是的,我知道,
and yes, I know,
54
00:02:16,020 --> 00:02:18,690
这实际上是常数或线性衰减,
this is actually constant or a linear decay,
55
00:02:18,690 --> 00:02:20,310
我知道这个名字有 "多项式",
and I know the name is polynomial,
56
00:02:20,310 --> 00:02:21,870
你感觉被骗了,你知道的,
and you're feeling cheated that, you know,
57
00:02:21,870 --> 00:02:24,390
你被许诺了一个多项式但还没有得到它,
you were promised a polynomial and haven't gotten it,
58
00:02:24,390 --> 00:02:26,550
所以冷静下来,没关系,
so calm down though, it's okay,
59
00:02:26,550 --> 00:02:28,830
因为,当然,线性函数只是
because, of course, linear functions are just
60
00:02:28,830 --> 00:02:30,480
一般多项式函数的
first-order special cases
61
00:02:30,480 --> 00:02:32,850
一阶特例,
of the general polynomial functions,
62
00:02:32,850 --> 00:02:36,180
如果你调整这个类的选项,
and if you tweak the options to this class,
63
00:02:36,180 --> 00:02:38,130
你就可以得到一个真正多项式的、
you can get a truly polynomial,
64
00:02:38,130 --> 00:02:40,170
更高阶的衰减调度,
a higher-order decay schedule,
65
00:02:40,170 --> 00:02:43,140
但就目前而言,这个线性调度对我们来说已经够用了,
but this linear schedule will work fine for us for now,
66
00:02:43,140 --> 00:02:45,210
我们实际上并不需要所有这些
we don't actually need all those
67
00:02:45,210 --> 00:02:47,610
花哨的调整和花哨的小工具。
fancy tweaks and fancy gadgets.
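顺带一提,如果确实想要更高阶的多项式衰减,同一个 PolynomialDecay 类支持 power 参数;下面是一个示意,其中 power=2.0 的取值只是假设,并非视频中的设置,num_train_steps 沿用上面示例中假设的名称。
As an aside, if you do want a truly higher-order polynomial decay, the same PolynomialDecay class takes a power argument; a sketch below, where power=2.0 is an assumed value rather than a setting from the video, and num_train_steps reuses the assumed name from the earlier example.

# power > 1 时按更高阶的多项式曲线衰减,而不是线性衰减
quadratic_schedule = PolynomialDecay(
    initial_learning_rate=5e-5,
    end_learning_rate=0.0,
    decay_steps=num_train_steps,
    power=2.0,  # 二次衰减,仅作示例
)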
68
00:02:47,610 --> 00:02:49,770
那么回到正题,
So coming back,
69
00:02:49,770 --> 00:02:51,990
创建好这个学习率调度之后,
how do we actually use this learning rate schedule
70
00:02:51,990 --> 00:02:53,460
我们要如何实际使用它呢?
once we've created it?
71
00:02:53,460 --> 00:02:55,650
所以很简单,我们只需将其传递给 Adam。
So it's simple, we just pass it to Adam.
72
00:02:55,650 --> 00:02:58,560
所以我们第一次编译模型时,
So the first time we compiled the model,
73
00:02:58,560 --> 00:03:00,840
我们只是传入了字符串 Adam,
we just passed the string Adam,
74
00:03:00,840 --> 00:03:02,250
得到我们的优化器。
to get our optimizer.
75
00:03:02,250 --> 00:03:05,340
Keras 能够识别常见优化器
So Keras recognizes the names of common optimizers
76
00:03:05,340 --> 00:03:07,920
和损失函数的名称,只要你以字符串形式传入,
and loss functions if you pass them as strings,
77
00:03:07,920 --> 00:03:09,480
所以如果你只需要默认设置,
so it saves time to do that
78
00:03:09,480 --> 00:03:11,460
这样做可以省点事。
if you only want the default settings.
79
00:03:11,460 --> 00:03:13,320
但现在我们是专业的机器学习者,
But now we're professional machine learners,
80
00:03:13,320 --> 00:03:15,720
而且,你知道,薪资审查即将到来,
and, you know, that salary review is upcoming,
81
00:03:15,720 --> 00:03:17,790
所以我们有了自己专属的学习率调度,
so we've got our very own learning rate schedule,
82
00:03:17,790 --> 00:03:19,770
我们会把事情做好。
and we're gonna do things properly.
83
00:03:19,770 --> 00:03:22,830
所以我们首先要做的是导入优化器,
So the first thing we do is import the optimizer,
84
00:03:22,830 --> 00:03:24,960
然后用这个调度器来初始化它,
and then we initialize it with a scheduler,
85
00:03:24,960 --> 00:03:27,540
也就是把这个调度器传给
which is getting passed to the learning rate argument
86
00:03:27,540 --> 00:03:29,100
该优化器的学习率参数。
of that optimizer.
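把调度传给优化器的写法大致如下,仅作示意,沿用上面示例中假设的 lr_schedule 名称。
Passing the schedule to the optimizer looks roughly like this; a sketch only, reusing the assumed lr_schedule name from the earlier example.

from tensorflow.keras.optimizers import Adam

# 把前面创建的学习率调度传给 learning_rate 参数
opt = Adam(learning_rate=lr_schedule)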
87
00:03:29,100 --> 00:03:32,190
现在我们用这个新的优化器编译模型,
And now we compile the model with this new optimizer,
88
00:03:32,190 --> 00:03:34,140
同样地,再配上你想用的损失函数,
and again, whatever loss function you want,
89
00:03:34,140 --> 00:03:37,050
也就是稀疏分类交叉熵(sparse categorical crossentropy),
so this is going to be sparse categorical crossentropy
90
00:03:37,050 --> 00:03:39,840
如果你是跟着微调(fine-tuning)视频一路做下来的话。
if you're following along from the fine-tuning video.
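编译的示意代码如下,损失函数以微调视频中用到的稀疏分类交叉熵为例;model 等名称为假设,from_logits=True 也是基于 transformers 模型输出 logits 的假设。
A compile sketch, using the sparse categorical crossentropy loss from the fine-tuning video; names like model are assumptions, and from_logits=True assumes the model outputs logits.

import tensorflow as tf

# 用新的优化器和损失函数编译模型(假设模型输出的是 logits)
model.compile(
    optimizer=opt,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)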
91
00:03:39,840 --> 00:03:41,370
然后,我们准备好了,
And then, we're ready to go,
92
00:03:41,370 --> 00:03:43,710
现在我们有了一个高性能模型,
now we have a high-performance model,
93
00:03:43,710 --> 00:03:44,970
并准备接受训练。
and ready for training.
94
00:03:44,970 --> 00:03:46,830
剩下的就是拟合模型
All that remains is to fit the model
95
00:03:46,830 --> 00:03:48,363
就像我们以前做的一样。
just like we did before.
96
00:03:49,350 --> 00:03:51,600
记住,因为我们是用新的优化器
Remember, because we compiled the model
97
00:03:51,600 --> 00:03:54,300
和新的学习率调度来编译模型的,
with the new optimizer and the new learning rate schedule,
98
00:03:54,300 --> 00:03:56,190
所以在调用 fit 的时候
we actually don't need to change anything at all
99
00:03:56,190 --> 00:03:57,360
我们其实完全不需要改动任何东西,
when we call fit,
100
00:03:57,360 --> 00:03:58,290
我们只是再次调用它,
we just call it again,
101
00:03:58,290 --> 00:04:00,540
使用与之前完全相同的命令,
with exactly the same command as before,
102
00:04:00,540 --> 00:04:02,400
但现在我们的训练过程非常漂亮,
but now we get a beautiful training,
103
00:04:02,400 --> 00:04:04,740
有着很好的、平滑的学习率衰减,
with a nice, smooth learning rate decay,
104
00:04:04,740 --> 00:04:06,330
从一个合适的值开始,
starting from a good value,
105
00:04:06,330 --> 00:04:07,713
并衰减到零。
and decaying down to zero.
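最后的 fit 调用和之前完全一样,大致如下;tf_train_dataset、tf_validation_dataset、num_epochs 等名称为假设的占位符。
The final fit call is unchanged from before, roughly as follows; names like tf_train_dataset, tf_validation_dataset, and num_epochs are assumed placeholders.

# 和之前一样直接调用 fit,学习率调度会在训练过程中自动生效
model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=num_epochs)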
106
00:04:08,867 --> 00:04:13,395
(画面沙沙作响)
(screen swishing)