subtitles/en/05_transformer-models-encoders.srt
1
00:00:00,253 --> 00:00:03,003
(intro striking)
2
00:00:04,440 --> 00:00:07,830
- In this video, we'll study
the encoder architecture.
3
00:00:07,830 --> 00:00:11,070
An example of a popular encoder-only
architecture is BERT,
4
00:00:11,070 --> 00:00:13,323
which is the most popular
model of its kind.
5
00:00:14,550 --> 00:00:16,950
Let's first start by
understanding how it works.
6
00:00:18,360 --> 00:00:20,910
We'll use a small example
using three words.
7
00:00:20,910 --> 00:00:23,823
We use these as inputs and
pass them through the encoder.
8
00:00:25,290 --> 00:00:28,173
We retrieve a numerical
representation of each word.
9
00:00:29,970 --> 00:00:32,700
Here, for example, the encoder
converts those three words,
10
00:00:32,700 --> 00:00:37,350
"Welcome to NYC", into these
three sequences of numbers.
11
00:00:37,350 --> 00:00:40,350
The encoder outputs exactly
one sequence of numbers
12
00:00:40,350 --> 00:00:41,493
per input word.
13
00:00:42,330 --> 00:00:44,880
This numerical representation
can also be called
14
00:00:44,880 --> 00:00:47,163
a feature vector, or a feature tensor.
15
00:00:49,080 --> 00:00:51,030
Let's dive into this representation.
16
00:00:51,030 --> 00:00:52,740
It contains one vector per word
17
00:00:52,740 --> 00:00:54,540
that was passed through the encoder.
18
00:00:56,130 --> 00:00:58,620
Each of these vectors is a
numerical representation
19
00:00:58,620 --> 00:01:00,033
of the word in question.
20
00:01:01,080 --> 00:01:03,300
The dimension of that vector is defined
21
00:01:03,300 --> 00:01:05,520
by the architecture of the model.
22
00:01:05,520 --> 00:01:08,703
For the base BERT model, it is 768.
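To make this concrete, here is a minimal sketch with the 🤗 Transformers library (the checkpoint name is an assumption, and the exact sequence length depends on tokenization, since the tokenizer adds special tokens and may split words):

```python
# Minimal sketch: pass "Welcome to NYC" through a BERT encoder and
# inspect the output shape; checkpoint name assumed.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Welcome to NYC", return_tensors="pt")
outputs = model(**inputs)

# One 768-dimensional vector per token; the tokenizer also adds the
# special [CLS] and [SEP] tokens, so seq_len may exceed three.
print(outputs.last_hidden_state.shape)  # torch.Size([1, seq_len, 768])
```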
23
00:01:10,650 --> 00:01:13,230
These representations
contain the value of a word,
24
00:01:13,230 --> 00:01:15,240
but contextualized.
25
00:01:15,240 --> 00:01:18,570
For example, the vector
attributed to the word "to"
26
00:01:18,570 --> 00:01:22,290
isn't the representation
of only the "to" word.
27
00:01:22,290 --> 00:01:25,650
It also takes into account
the words around it
28
00:01:25,650 --> 00:01:27,363
which we call the context.
29
00:01:28,650 --> 00:01:30,780
That is, it looks at the left context,
30
00:01:30,780 --> 00:01:32,970
the words on the left of
the one we're studying,
31
00:01:32,970 --> 00:01:34,980
here the word "Welcome",
32
00:01:34,980 --> 00:01:37,497
and the context on the
right, here the word "NYC",
33
00:01:38,348 --> 00:01:42,000
and it outputs a value for
the word given its context.
34
00:01:42,000 --> 00:01:45,420
It is therefore a contextualized value.
35
00:01:45,420 --> 00:01:48,810
One could say that the
vector of 768 values
36
00:01:48,810 --> 00:01:51,993
holds the meaning of the
word within the text.
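A quick way to see this contextualization in action (a sketch; the second sentence is a made-up example, and it assumes the word survives tokenization as a single token):

```python
# Sketch: the vector for "to" depends on the words around it.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_for(sentence, word):
    inputs = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    hidden = model(**inputs).last_hidden_state[0]
    return hidden[tokens.index(word)]  # the word's contextualized vector

a = vector_for("Welcome to NYC", "to")
b = vector_for("I want to sleep", "to")

# Same word, different contexts: the two vectors are related but not equal.
print(torch.cosine_similarity(a, b, dim=0))
```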
37
00:01:53,310 --> 00:01:56,073
It does this thanks to the
self-attention mechanism.
38
00:01:57,240 --> 00:02:00,630
The self-attention mechanism
relates different positions,
39
00:02:00,630 --> 00:02:02,850
or different words in a single sequence
40
00:02:02,850 --> 00:02:06,003
in order to compute a
representation of that sequence.
41
00:02:07,200 --> 00:02:09,000
As we've seen before, this means that
42
00:02:09,000 --> 00:02:11,130
the resulting representation of a word
43
00:02:11,130 --> 00:02:13,983
has been affected by other
words in the sequence.
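As a rough illustration of the mechanism (a simplified sketch with no learned query/key/value projections and a single head, not BERT's actual implementation):

```python
# Bare-bones scaled dot-product self-attention over one sequence.
import numpy as np

def self_attention(x):
    # x: (seq_len, d) array, one row per token.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                    # position-to-position affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ x                               # each output row mixes all positions

x = np.random.randn(3, 768)     # three tokens, 768 dimensions each
print(self_attention(x).shape)  # (3, 768)
```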
44
00:02:15,840 --> 00:02:18,030
We won't dive into the specifics here
45
00:02:18,030 --> 00:02:19,680
but we'll offer some further readings
46
00:02:19,680 --> 00:02:21,330
if you want to get a better understanding
47
00:02:21,330 --> 00:02:22,953
of what happens under the hood.
48
00:02:25,050 --> 00:02:27,480
So why should one use an encoder?
49
00:02:27,480 --> 00:02:29,370
Encoders can be used as stand-alone models
50
00:02:29,370 --> 00:02:31,263
in a wide variety of tasks.
51
00:02:32,100 --> 00:02:33,360
For example, BERT,
52
00:02:33,360 --> 00:02:35,670
arguably the most famous
transformer model,
53
00:02:35,670 --> 00:02:37,590
is a standalone encoder model,
54
00:02:37,590 --> 00:02:38,820
and at the time of release,
55
00:02:38,820 --> 00:02:40,440
it was the state of the art
56
00:02:40,440 --> 00:02:42,780
in many sequence classification tasks,
57
00:02:42,780 --> 00:02:44,190
question answering tasks,
58
00:02:44,190 --> 00:02:46,743
and masked language modeling,
to cite only a few.
59
00:02:48,150 --> 00:02:50,460
The idea is that encoders
are very powerful
60
00:02:50,460 --> 00:02:52,470
at extracting vectors that carry
61
00:02:52,470 --> 00:02:55,350
meaningful information about a sequence.
62
00:02:55,350 --> 00:02:57,870
These vectors can then be
handled down the road
63
00:02:57,870 --> 00:03:00,070
by additional neurons
to make sense of them.
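For instance, a small classification head can be placed on top of the encoder's output (a hypothetical, untrained sketch; the checkpoint name and the use of the [CLS] vector are assumptions):

```python
# Sketch: feed the encoder's feature vectors to a tiny classification head.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
head = torch.nn.Linear(768, 2)  # untrained: maps one 768-dim vector to 2 classes

inputs = tokenizer("Welcome to NYC", return_tensors="pt")
features = encoder(**inputs).last_hidden_state[:, 0]  # the [CLS] token's vector
print(head(features))  # raw scores for the 2 classes
```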
64
00:03:01,380 --> 00:03:02,850
Let's take a look at some examples
65
00:03:02,850 --> 00:03:04,563
where encoders really shine.
66
00:03:06,210 --> 00:03:09,900
First of all, Masked
Language Modeling, or MLM.
67
00:03:09,900 --> 00:03:11,970
It's the task of predicting a hidden word
68
00:03:11,970 --> 00:03:13,590
in a sequence of words.
69
00:03:13,590 --> 00:03:15,630
Here, for example, we have hidden the word
70
00:03:15,630 --> 00:03:17,247
between "My" and "is".
71
00:03:18,270 --> 00:03:21,120
This is one of the objectives
with which BERT was trained.
72
00:03:21,120 --> 00:03:24,393
It was trained to predict
hidden words in a sequence.
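This objective can be tried directly with the fill-mask pipeline (a sketch; the checkpoint name is an assumption, and BERT expects its special [MASK] token for the hidden word):

```python
# Sketch: ask a BERT checkpoint to fill in the hidden word.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

for prediction in unmasker("My [MASK] is Sylvain."):
    print(prediction["token_str"], prediction["score"])
```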
73
00:03:25,230 --> 00:03:27,930
Encoders shine in this
scenario in particular
74
00:03:27,930 --> 00:03:31,140
as bi-directional
information is crucial here.
75
00:03:31,140 --> 00:03:32,947
If we didn't have the words on the right,
76
00:03:32,947 --> 00:03:34,650
"is", "Sylvain" and the ".",
77
00:03:34,650 --> 00:03:35,940
then there is very little chance
78
00:03:35,940 --> 00:03:38,580
that BERT would have been
able to identify "name"
79
00:03:38,580 --> 00:03:40,500
as the correct word.
80
00:03:40,500 --> 00:03:42,270
The encoder needs to
have a good understanding
81
00:03:42,270 --> 00:03:45,360
of the sequence in order
to predict a masked word
82
00:03:45,360 --> 00:03:48,840
as even if a word is
grammatically correct,
83
00:03:48,840 --> 00:03:50,610
it does not necessarily make sense
84
00:03:50,610 --> 00:03:52,413
in the context of the sequence.
85
00:03:55,230 --> 00:03:56,580
As mentioned earlier,
86
00:03:56,580 --> 00:03:59,520
encoders are good at doing
sequence classification.
87
00:03:59,520 --> 00:04:02,883
Sentiment analysis is an example
of sequence classification.
88
00:04:04,410 --> 00:04:09,410
The model's aim is to identify
the sentiment of a sequence.
89
00:04:09,540 --> 00:04:11,280
It can range from giving a sequence
90
00:04:11,280 --> 00:04:12,960
a rating from one to five stars
91
00:04:12,960 --> 00:04:15,900
if doing review analysis,
to giving a positive
92
00:04:15,900 --> 00:04:17,820
or negative rating to a sequence
93
00:04:17,820 --> 00:04:19,220
which is what is shown here.
94
00:04:20,280 --> 00:04:22,950
For example, here,
given the two sequences,
95
00:04:22,950 --> 00:04:25,860
we use the model to compute a prediction,
96
00:04:25,860 --> 00:04:27,420
and to classify the sequences
97
00:04:27,420 --> 00:04:30,393
among these two classes,
positive and negative.
98
00:04:31,230 --> 00:04:33,450
While the two sequences are very similar,
99
00:04:33,450 --> 00:04:35,220
containing the same words,
100
00:04:35,220 --> 00:04:37,170
the meaning is entirely different,
101
00:04:37,170 --> 00:04:40,143
and the encoder model is able
to grasp that difference.
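A sketch of that classification with the sentiment-analysis pipeline (the two example sequences here are stand-ins for the ones shown on screen, and the default checkpoint is whatever the library ships):

```python
# Sketch: classify two near-identical sequences as positive or negative.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier([
    "I love this movie, it is not bad at all.",
    "I do not love this movie, it is bad.",
]))
# Same words, opposite meaning: the encoder-based model tells them apart.
```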
102
00:04:41,404 --> 00:04:44,154
(outro striking)