1
00:00:00,253 --> 00:00:02,920
(air whooshing)
2
00:00:06,060 --> 00:00:08,070
- In this video, we're going to see
3
00:00:08,070 --> 00:00:11,430
how to load and fine-tune
a pre-trained model.
4
00:00:11,430 --> 00:00:12,510
It's very quick.
5
00:00:12,510 --> 00:00:14,490
And if you've watched our pipeline videos,
6
00:00:14,490 --> 00:00:18,150
which I'll link below, the
process is very similar.
7
00:00:18,150 --> 00:00:20,940
This time, though, we're going
to be using transfer learning
8
00:00:20,940 --> 00:00:23,040
and doing some training ourselves,
9
00:00:23,040 --> 00:00:26,400
rather than just loading a
model and using it as is.
10
00:00:26,400 --> 00:00:28,710
So to learn more about transfer learning,
11
00:00:28,710 --> 00:00:31,320
head to the 'What is
transfer learning?' video,
12
00:00:31,320 --> 00:00:33,420
and we'll link that below as well.
13
00:00:33,420 --> 00:00:35,610
But for now, let's look at this code.
14
00:00:35,610 --> 00:00:38,730
To start, we pick which
model we want to start with.
15
00:00:38,730 --> 00:00:40,920
In this case, we're
going to use the famous,
16
00:00:40,920 --> 00:00:42,060
the original BERT,
17
00:00:42,060 --> 00:00:44,850
as the foundation for our training today.
18
00:00:44,850 --> 00:00:46,770
But what is this monstrosity of a line,
19
00:00:46,770 --> 00:00:48,797
this
'TFAutoModelForSequenceClassification'?
20
00:00:49,860 --> 00:00:51,180
What does that mean?
21
00:00:51,180 --> 00:00:53,130
Well, the TF stands for TensorFlow.
22
00:00:53,130 --> 00:00:54,660
And the rest means,
23
00:00:54,660 --> 00:00:55,950
take a language model,
24
00:00:55,950 --> 00:00:58,380
and stick a sequence
classification head onto it
25
00:00:58,380 --> 00:01:00,750
if it doesn't have one already.
26
00:01:00,750 --> 00:01:02,880
So this line of code loads BERT,
27
00:01:02,880 --> 00:01:05,040
which is a general purpose language model,
28
00:01:05,040 --> 00:01:07,650
it loads its weights, architecture, and all
29
00:01:07,650 --> 00:01:10,920
and then adds a new sequence
classification head onto it
30
00:01:10,920 --> 00:01:13,440
with randomly initialized weights.
31
00:01:13,440 --> 00:01:15,870
So this method needs to know two things.
32
00:01:15,870 --> 00:01:18,270
Firstly, it needs to know
the name of the model
33
00:01:18,270 --> 00:01:21,060
you want to load the
architecture and weights for.
34
00:01:21,060 --> 00:01:23,940
And secondly, it needs
to know how many classes
35
00:01:23,940 --> 00:01:26,693
your problem has, because
that will determine the size,
36
00:01:26,693 --> 00:01:29,610
the number of neurons in the output head.
37
00:01:29,610 --> 00:01:31,530
So if you want to follow
along with the data
38
00:01:31,530 --> 00:01:34,500
from our datasets videos,
which I'll link below,
39
00:01:34,500 --> 00:01:37,440
then you'll have two classes,
positive and negative,
40
00:01:37,440 --> 00:01:39,723
and thus num_labels equals two.
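As a quick reference, here's a minimal sketch of that loading step in code. The "bert-base-cased" checkpoint and the two-class setup are assumptions for illustration; swap in whichever checkpoint and label count your problem needs.

from transformers import TFAutoModelForSequenceClassification

# Assumed checkpoint; any Hub model with TensorFlow weights works here.
checkpoint = "bert-base-cased"

# Loads the pre-trained body (architecture and weights) and adds a new,
# randomly initialized classification head sized for num_labels classes.
model = TFAutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=2
)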
41
00:01:40,830 --> 00:01:43,230
But what about this compile line?
42
00:01:43,230 --> 00:01:44,970
Well, if you're familiar with Keras,
43
00:01:44,970 --> 00:01:46,920
you've probably seen this already.
44
00:01:46,920 --> 00:01:49,800
But if not, this is one of
the core methods in Keras
45
00:01:49,800 --> 00:01:51,450
that you're gonna see again and again.
46
00:01:51,450 --> 00:01:54,900
You always need to compile
your model before you train it.
47
00:01:54,900 --> 00:01:57,870
And compile needs to know two things.
48
00:01:57,870 --> 00:02:00,090
Firstly, it needs to
know the loss function,
49
00:02:00,090 --> 00:02:02,340
which is what you're trying to optimize.
50
00:02:02,340 --> 00:02:05,910
So here, we import the
SparseCategoricalCrossentropy
51
00:02:05,910 --> 00:02:07,260
loss function.
52
00:02:07,260 --> 00:02:09,930
So that's a mouthful, but it's
the standard loss function
53
00:02:09,930 --> 00:02:13,260
for any neural network that's
doing a classification task.
54
00:02:13,260 --> 00:02:14,970
It basically encourages the network
55
00:02:14,970 --> 00:02:17,730
to output large values
for the right class,
56
00:02:17,730 --> 00:02:20,910
and low values for the wrong classes.
57
00:02:20,910 --> 00:02:24,150
Note that you can specify the
loss function as a string,
58
00:02:24,150 --> 00:02:26,010
like we did with the optimizer.
59
00:02:26,010 --> 00:02:27,600
But there's a risk there,
60
00:02:27,600 --> 00:02:30,090
there's a very common
trap people fall into,
61
00:02:30,090 --> 00:02:32,580
which is that by default,
this loss assumes
62
00:02:32,580 --> 00:02:36,510
the output is probabilities
after a softmax layer.
63
00:02:36,510 --> 00:02:38,310
But what our model has actually output
64
00:02:38,310 --> 00:02:40,770
is the values before the softmax,
65
00:02:40,770 --> 00:02:43,800
often called the logits, sometimes logits.
66
00:02:43,800 --> 00:02:46,110
No one's quite sure how
to pronounce that one.
67
00:02:46,110 --> 00:02:47,790
But you've probably seen these before
68
00:02:47,790 --> 00:02:49,950
in the video about pipelines.
69
00:02:49,950 --> 00:02:52,320
So if you get this wrong,
your model won't train
70
00:02:52,320 --> 00:02:54,723
and it'll be very annoying
to figure out why.
71
00:02:55,590 --> 00:02:57,540
In future videos, we're gonna see
72
00:02:57,540 --> 00:03:00,540
how to use the model's
internal loss computations,
73
00:03:00,540 --> 00:03:02,910
so that you don't have to
specify the loss yourself
74
00:03:02,910 --> 00:03:05,340
and you don't have to
worry about these details.
75
00:03:05,340 --> 00:03:09,480
But for now, remember to
set from_logits to true.
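To make the from_logits point concrete, here's a minimal sketch of the compile call as described so far; the optimizer is discussed next, and "adam" here is just the string shortcut.

from tensorflow.keras.losses import SparseCategoricalCrossentropy

# from_logits=True tells the loss that the model outputs raw logits,
# not probabilities from a softmax layer.
model.compile(
    optimizer="adam",
    loss=SparseCategoricalCrossentropy(from_logits=True),
)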
76
00:03:09,480 --> 00:03:11,430
The second thing compile needs to know
77
00:03:11,430 --> 00:03:13,230
is the optimizer you want.
78
00:03:13,230 --> 00:03:15,120
In our case, we use adam,
79
00:03:15,120 --> 00:03:16,830
which is sort of the standard optimizer
80
00:03:16,830 --> 00:03:18,720
for deep learning these days.
81
00:03:18,720 --> 00:03:20,520
The one thing you might want to change
82
00:03:20,520 --> 00:03:21,780
is the learning rate.
83
00:03:21,780 --> 00:03:24,630
And to do that, we'll need to
import the actual optimizer
84
00:03:24,630 --> 00:03:26,910
rather than just calling it by string.
85
00:03:26,910 --> 00:03:28,680
But we'll talk about
that in another video,
86
00:03:28,680 --> 00:03:30,090
which I'll link below.
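If you do want to set the learning rate yourself right away, a rough sketch looks like this; the 5e-5 value is only an illustrative choice, not a recommendation from this video.

from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.optimizers import Adam

# Import the optimizer class so we can pass a custom learning rate,
# instead of relying on the "adam" string and its defaults.
model.compile(
    optimizer=Adam(learning_rate=5e-5),
    loss=SparseCategoricalCrossentropy(from_logits=True),
)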
87
00:03:30,090 --> 00:03:33,360
For now, let's just
try training the model.
88
00:03:33,360 --> 00:03:35,580
Well, so how do you train the model?
89
00:03:35,580 --> 00:03:37,950
Again, if you've used Keras before,
90
00:03:37,950 --> 00:03:40,350
this is all going to be
very familiar to you.
91
00:03:40,350 --> 00:03:42,210
But if not, let's very quickly look
92
00:03:42,210 --> 00:03:43,710
at what we're doing here.
93
00:03:43,710 --> 00:03:47,010
fit is pretty much the central
method for Keras models.
94
00:03:47,010 --> 00:03:49,983
It tells the model to train
on the data we're passing in.
95
00:03:50,820 --> 00:03:52,920
So here we pass the datasets we made
96
00:03:52,920 --> 00:03:54,510
in the previous section.
97
00:03:54,510 --> 00:03:57,990
The dataset contains both
our inputs and our labels.
98
00:03:57,990 --> 00:04:00,420
So we don't need to
specify separate labels
99
00:04:00,420 --> 00:04:01,570
when we're calling fit.
100
00:04:02,490 --> 00:04:05,340
Then we do the same thing
with the validation_data.
101
00:04:05,340 --> 00:04:08,190
And then, if we want,
we can specify details,
102
00:04:08,190 --> 00:04:09,900
like the number of epochs for training
103
00:04:09,900 --> 00:04:12,420
or some of the other
arguments you can pass to fit.
104
00:04:12,420 --> 00:04:15,240
But in the end, you just
pass all of this to model.fit
105
00:04:15,240 --> 00:04:16,440
and you let it run.
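Put together, the training call might look like this sketch, where tf_train_dataset and tf_validation_dataset stand in for the tf.data datasets built in the datasets video (the names and the epoch count are placeholders, not fixed by this video).

# The datasets bundle inputs and labels, so fit only needs the data,
# optional validation data, and how long to train for.
model.fit(
    tf_train_dataset,
    validation_data=tf_validation_dataset,
    epochs=3,
)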
106
00:04:16,440 --> 00:04:17,520
If everything works out,
107
00:04:17,520 --> 00:04:19,320
you should see a little training bar
108
00:04:19,320 --> 00:04:21,300
progressing along as your loss goes down.
109
00:04:21,300 --> 00:04:22,290
And that's it.
110
00:04:22,290 --> 00:04:23,123
While that's running,
111
00:04:23,123 --> 00:04:25,380
you know, you can call
your boss and tell them
112
00:04:25,380 --> 00:04:27,810
you're a senior NLP machine
learning engineer now
113
00:04:27,810 --> 00:04:30,900
and you're gonna want a
salary review next quarter.
114
00:04:30,900 --> 00:04:32,880
These few lines of code
are really all it takes
115
00:04:32,880 --> 00:04:34,500
to apply the power of a massive
116
00:04:34,500 --> 00:04:36,510
pre-trained language problem,
117
00:04:36,510 --> 00:04:38,250
massive pre-trained
language model, excuse me,
118
00:04:38,250 --> 00:04:40,080
to your NLP problem.
119
00:04:40,080 --> 00:04:42,150
But could we do better than this?
120
00:04:42,150 --> 00:04:43,920
I mean, we certainly could.
121
00:04:43,920 --> 00:04:45,720
With a few more advanced Keras features
122
00:04:45,720 --> 00:04:47,730
like a tuned, scheduled learning rate,
123
00:04:47,730 --> 00:04:49,290
we can get an even lower loss
124
00:04:49,290 --> 00:04:51,990
and an even more accurate,
more useful model.
125
00:04:51,990 --> 00:04:54,120
And what do we do with our
model after we train it?
126
00:04:54,120 --> 00:04:55,950
So all of this is going to
be covered in the videos
127
00:04:55,950 --> 00:04:57,963
that are coming up, so stay tuned.
128
00:04:59,220 --> 00:05:01,887
(air whooshing)