subtitles/en/03_what-is-transfer-learning.srt
1
00:00:00,189 --> 00:00:02,856
(air whooshing)
2
00:00:05,550 --> 00:00:07,293
- What is transfer learning?
3
00:00:09,480 --> 00:00:10,920
The idea of transfer learning
4
00:00:10,920 --> 00:00:12,570
is to leverage the knowledge acquired
5
00:00:12,570 --> 00:00:15,543
by a model trained with lots
of data on another task.
6
00:00:16,410 --> 00:00:20,130
Model A will be trained
specifically for task A.
7
00:00:20,130 --> 00:00:22,200
Now let's say you want to train a model B
8
00:00:22,200 --> 00:00:23,970
for a different task.
9
00:00:23,970 --> 00:00:27,330
One option would be to train
the model from scratch.
10
00:00:27,330 --> 00:00:30,633
This could take lots of
computation, time and data.
11
00:00:31,470 --> 00:00:34,260
Instead, we could initialize model B
12
00:00:34,260 --> 00:00:36,570
with the same weights as model A,
13
00:00:36,570 --> 00:00:39,213
transferring the knowledge
of model A to task B.
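A minimal sketch of this idea, assuming the Hugging Face Transformers library used in the course; the checkpoint name is only an example standing in for model A:

    from transformers import AutoConfig, AutoModel

    checkpoint = "bert-base-uncased"  # example checkpoint playing the role of model A

    # Training from scratch: same architecture, randomly initialized weights.
    config = AutoConfig.from_pretrained(checkpoint)
    model_from_scratch = AutoModel.from_config(config)

    # Transfer learning: model B starts from model A's pretrained weights
    # before being fine-tuned on task B.
    model_b = AutoModel.from_pretrained(checkpoint)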
14
00:00:41,040 --> 00:00:42,690
When training from scratch,
15
00:00:42,690 --> 00:00:45,870
all the model's weights
are initialized randomly.
16
00:00:45,870 --> 00:00:48,870
In this example, we are
training a BERT model
17
00:00:48,870 --> 00:00:50,220
on the task of recognizing
18
00:00:50,220 --> 00:00:52,203
if two sentences are similar or not.
19
00:00:54,116 --> 00:00:56,730
On the left, it's trained from scratch,
20
00:00:56,730 --> 00:01:00,000
and on the right, it's
fine-tuned from a pretrained model.
21
00:01:00,000 --> 00:01:02,220
As we can see, using transfer learning
22
00:01:02,220 --> 00:01:05,160
and the pretrained model
yields better results.
23
00:01:05,160 --> 00:01:07,140
And it doesn't matter if we train longer.
24
00:01:07,140 --> 00:01:10,620
Training from scratch is
capped at around 70% accuracy
25
00:01:10,620 --> 00:01:13,293
while the pretrained model
easily beats 86%.
26
00:01:14,460 --> 00:01:16,140
This is because pretrained models
27
00:01:16,140 --> 00:01:18,420
are usually trained on
large amounts of data
28
00:01:18,420 --> 00:01:21,000
that provide the model with
a statistical understanding
29
00:01:21,000 --> 00:01:23,413
of the language used during pretraining.
30
00:01:24,450 --> 00:01:25,950
In computer vision,
31
00:01:25,950 --> 00:01:28,080
transfer learning has
been applied successfully
32
00:01:28,080 --> 00:01:30,060
for almost ten years.
33
00:01:30,060 --> 00:01:32,850
Models are frequently
pretrained on ImageNet,
34
00:01:32,850 --> 00:01:36,153
a dataset containing 1.2
million images.
35
00:01:37,170 --> 00:01:41,130
Each image is associated
with one of 1,000 labels.
36
00:01:41,130 --> 00:01:44,010
Training like this, on labeled data,
37
00:01:44,010 --> 00:01:45,663
is called supervised learning.
38
00:01:47,340 --> 00:01:49,140
In Natural Language Processing,
39
00:01:49,140 --> 00:01:51,870
transfer learning is a bit more recent.
40
00:01:51,870 --> 00:01:54,480
A key difference with ImageNet
is that the pretraining
41
00:01:54,480 --> 00:01:56,460
is usually self-supervised,
42
00:01:56,460 --> 00:01:58,770
which means it doesn't
require human annotations
43
00:01:58,770 --> 00:01:59,673
for the labels.
44
00:02:00,780 --> 00:02:02,700
A very common pretraining objective
45
00:02:02,700 --> 00:02:05,310
is to guess the next word in a sentence,
46
00:02:05,310 --> 00:02:07,710
which only requires lots and lots of text.
47
00:02:07,710 --> 00:02:10,710
GPT-2, for instance,
was pretrained this way
48
00:02:10,710 --> 00:02:12,900
using the content of 45 million links
49
00:02:12,900 --> 00:02:14,673
posted by users on Reddit.
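As an illustration of this objective, assuming the Transformers library (the prompt is arbitrary), a model pretrained to guess the next word can continue a piece of text:

    from transformers import pipeline

    # GPT-2 was pretrained on next-word prediction, so it can extend a prompt.
    generator = pipeline("text-generation", model="gpt2")
    print(generator("Transfer learning is", max_new_tokens=10))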
50
00:02:16,560 --> 00:02:19,590
Another example of self-supervised
pretraining objective
51
00:02:19,590 --> 00:02:22,470
is to predict the value
of randomly masked words,
52
00:02:22,470 --> 00:02:24,540
which is similar to
fill-in-the-blank tests
53
00:02:24,540 --> 00:02:26,760
you may have done in school.
54
00:02:26,760 --> 00:02:29,880
BERT was pretrained this way
using the English Wikipedia
55
00:02:29,880 --> 00:02:31,893
and 11,000 unpublished books.
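As a small illustration, assuming the Transformers library (the sentence is arbitrary), a model pretrained this way can fill in a masked word:

    from transformers import pipeline

    # BERT was pretrained to predict randomly masked words.
    fill_mask = pipeline("fill-mask", model="bert-base-uncased")
    print(fill_mask("Transfer learning reuses a [MASK] model."))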
56
00:02:33,120 --> 00:02:36,450
In practice, transfer learning
is applied on a given model
57
00:02:36,450 --> 00:02:39,090
by throwing away its head, that is,
58
00:02:39,090 --> 00:02:42,150
its last layers focused on
the pretraining objective,
59
00:02:42,150 --> 00:02:45,360
and replacing it with a new,
randomly initialized head
60
00:02:45,360 --> 00:02:46,860
suitable for the task at hand.
61
00:02:47,970 --> 00:02:51,570
For instance, when we
fine-tuned a BERT model earlier,
62
00:02:51,570 --> 00:02:54,060
we removed the head that
classified masked words
63
00:02:54,060 --> 00:02:56,790
and replaced it with a
classifier with two outputs,
64
00:02:56,790 --> 00:02:58,563
since our task had two labels.
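A minimal sketch of this head replacement, assuming the Transformers library (the checkpoint name is only an example):

    from transformers import AutoModelForSequenceClassification

    # The pretrained body is kept; the masked-word head is dropped and replaced
    # by a new, randomly initialized classification head with two outputs.
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2
    )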
65
00:02:59,700 --> 00:03:02,490
To be as efficient as possible,
the pretrained model used
66
00:03:02,490 --> 00:03:03,770
should be as similar as possible
67
00:03:03,770 --> 00:03:06,270
to the task it's fine-tuned on.
68
00:03:06,270 --> 00:03:08,190
For instance, if the problem
69
00:03:08,190 --> 00:03:10,860
is to classify German sentences,
70
00:03:10,860 --> 00:03:13,053
it's best to use a
German pretrained model.
71
00:03:14,370 --> 00:03:16,649
But with the good comes the bad.
72
00:03:16,649 --> 00:03:19,380
The pretrained model does not
only transfer its knowledge,
73
00:03:19,380 --> 00:03:21,693
but also any bias it may contain.
74
00:03:22,530 --> 00:03:24,300
ImageNet mostly contains images
75
00:03:24,300 --> 00:03:26,850
coming from the United
States and Western Europe.
76
00:03:26,850 --> 00:03:28,020
So models fine-tuned with it
77
00:03:28,020 --> 00:03:31,710
will usually perform better on
images from these countries.
78
00:03:31,710 --> 00:03:33,690
OpenAI also studied the bias
79
00:03:33,690 --> 00:03:36,120
in the predictions of its GPT-3 model
80
00:03:36,120 --> 00:03:36,953
which was pretrained
81
00:03:36,953 --> 00:03:38,750
using the "guess the next word" objective.
82
00:03:39,720 --> 00:03:41,040
Changing the gender of the prompt
83
00:03:41,040 --> 00:03:44,250
from "He was very" to "She was very"
84
00:03:44,250 --> 00:03:47,550
changed the predictions from
mostly neutral adjectives
85
00:03:47,550 --> 00:03:49,233
to almost only physical ones.
86
00:03:50,400 --> 00:03:52,367
In the model card of the GPT-2 model,
87
00:03:52,367 --> 00:03:54,990
OpenAI also acknowledges its bias
88
00:03:54,990 --> 00:03:56,730
and discourages its use
89
00:03:56,730 --> 00:03:58,803
in systems that interact with humans.
90
00:04:01,040 --> 00:04:03,707
(air whooshing)