In our other videos, we talked about the basics of fine-tuning a language model with TensorFlow, and as always, when I refer to videos I'll link them below. Still, can we do better?

So here's the code from our model fine-tuning video, and while it works, we could definitely tweak a couple of things. By far the most important thing is the learning rate. In this video we'll talk about how to change it, which will make your training much more consistently successful.

In fact, there are really two things we want to change about the default learning rate for Adam. The first is that it's way too high for our models: by default, Adam uses a learning rate of 10 to the minus 3 (1e-3), and that's very high for training transformer models. We're going to start at 5 times 10 to the minus 5 (5e-5), which is 20 times lower than the default.

And secondly, we don't just want a constant learning rate: we can get even better performance if we decay the learning rate down to a tiny value, or even to zero, over the course of training. That's what this PolynomialDecay schedule is doing. I'll show you what that decay looks like in a second, but first we need to tell the scheduler how long training is going to be, so that it decays at the right speed, and that's what this code here is doing.

We're computing how many minibatches the model is going to see over its entire training run, which is the number of batches in the training set, and then we multiply that by the number of epochs to get the total number of batches across the whole training run. Once we know how many training steps we're taking, we just pass all that information to the scheduler and we're ready to go.
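A rough sketch of what that scheduler setup might look like (the dataset name tf_train_dataset and the epoch count are placeholders, not taken from the video):

```python
from tensorflow.keras.optimizers.schedules import PolynomialDecay

num_epochs = 3  # placeholder: however many epochs you plan to train for

# len() of a batched tf.data.Dataset is the number of batches per epoch,
# so batches per epoch times epochs gives the total number of training steps.
num_train_steps = len(tf_train_dataset) * num_epochs

# Linear decay (power=1.0 is the default) from 5e-5 down to zero over training.
lr_schedule = PolynomialDecay(
    initial_learning_rate=5e-5,
    end_learning_rate=0.0,
    decay_steps=num_train_steps,
)
```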
So what does the polynomial decay schedule look like? Well, it looks like this: it starts at 5e-5, which means 5 times 10 to the minus 5, and then decays down at a constant rate until it hits zero right at the very end of training.

Now hold on, I can already hear you yelling at your monitor, and yes, I know: this is actually a constant, linear decay, and I know the name is polynomial, and you're feeling cheated that you were promised a polynomial and haven't gotten one. Calm down, though, it's okay, because linear functions are just first-order special cases of general polynomial functions, and if you tweak the options to this class, you can get a truly polynomial, higher-order decay schedule. But this linear schedule will work fine for us for now; we don't actually need all those fancy tweaks and gadgets.

So, coming back: how do we actually use this learning rate schedule once we've created it? It's simple, we just pass it to Adam. The first time we compiled the model, we just passed the string "adam" to get our optimizer. Keras recognizes the names of common optimizers and loss functions if you pass them as strings, so it saves time to do that if you only want the default settings. But now we're professional machine learners, and, you know, that salary review is upcoming, so we've got our very own learning rate schedule, and we're going to do things properly.

So the first thing we do is import the optimizer, and then we initialize it with the scheduler, which gets passed to the learning_rate argument of that optimizer. Now we compile the model with this new optimizer, and again, whatever loss function you want; this is going to be sparse categorical crossentropy if you're following along from the fine-tuning video. And then we're ready to go: now we have a high-performance model, ready for training.
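A minimal sketch of that compile step, assuming model is the TF Transformers model from the fine-tuning video and lr_schedule is the scheduler created above (the from_logits=True flag reflects that Transformers models output logits rather than probabilities):

```python
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy

# Pass the schedule object instead of a fixed float learning rate.
optimizer = Adam(learning_rate=lr_schedule)

# The model outputs logits, so tell the loss not to expect probabilities.
model.compile(
    optimizer=optimizer,
    loss=SparseCategoricalCrossentropy(from_logits=True),
)
```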
All that remains is to fit the model just like we did before. Remember, because we compiled the model with the new optimizer and the new learning rate schedule, we don't need to change anything at all when we call fit; we just call it again, with exactly the same command as before, but now we get a beautiful training run, with a nice, smooth learning rate decay, starting from a good value and decaying down to zero.
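For completeness, the fit call is unchanged; something like the following, where the dataset names are assumed placeholders:

```python
# Exactly the same call as before: the new optimizer and schedule were
# baked in at compile time, so fit() itself needs no changes.
model.fit(
    tf_train_dataset,
    validation_data=tf_validation_dataset,
    epochs=num_epochs,
)
```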