subtitles/en/07_transformer-models-encoder-decoders.srt
1
00:00:00,520 --> 00:00:02,603
(swoosh)
2
00:00:04,230 --> 00:00:05,063
- In this video,
3
00:00:05,063 --> 00:00:07,638
we'll study the
encoder-decoder architecture.
4
00:00:07,638 --> 00:00:12,243
An example of a popular
encoder-decoder model is T5.
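For reference, here is a minimal sketch of loading T5 with the transformers library (the "t5-small" checkpoint is an assumption, not one named in the video); it shows that the checkpoint really is built from the two parts this video discusses:

```python
# Minimal sketch: load T5, a popular encoder-decoder model.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# The checkpoint is an encoder-decoder (sequence-to-sequence) model.
print(model.config.is_encoder_decoder)  # True
encoder = model.get_encoder()  # the encoder half
decoder = model.get_decoder()  # the decoder half
```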
5
00:00:13,770 --> 00:00:16,980
In order to understand how
the encoder-decoder works,
6
00:00:16,980 --> 00:00:18,630
we recommend you check out the videos
7
00:00:18,630 --> 00:00:22,590
on encoders and decoders
as standalone models.
8
00:00:22,590 --> 00:00:24,990
Understanding how they work individually
9
00:00:24,990 --> 00:00:28,323
will help you understand how
an encoder-decoder works.
10
00:00:30,510 --> 00:00:33,390
Let's start with what we've
seen about the encoder.
11
00:00:33,390 --> 00:00:36,240
The encoder takes words as inputs,
12
00:00:36,240 --> 00:00:38,520
passes them through the encoder,
13
00:00:38,520 --> 00:00:40,800
and retrieves a numerical representation
14
00:00:40,800 --> 00:00:42,663
for each word passed through it.
15
00:00:43,560 --> 00:00:46,470
We now know that this
numerical representation
16
00:00:46,470 --> 00:00:49,473
holds information about the
meaning of the sequence.
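As an illustration, here is a minimal sketch of that numerical representation, again assuming the t5-small checkpoint (the video does not name a model at this point): the encoder returns one vector of numbers per input token.

```python
# Minimal sketch: inspect the encoder's output representation.
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small")  # encoder only, no decoder

inputs = tokenizer("Welcome to NYC", return_tensors="pt")
representation = encoder(**inputs).last_hidden_state
# One vector per input token: (batch_size, sequence_length, hidden_size)
print(representation.shape)
```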
17
00:00:51,090 --> 00:00:54,243
Let's put this aside and add
the decoder to the diagram.
18
00:00:56,610 --> 00:00:57,510
In this scenario,
19
00:00:57,510 --> 00:00:59,190
we're using the decoder in a manner
20
00:00:59,190 --> 00:01:00,960
that we haven't seen before.
21
00:01:00,960 --> 00:01:04,173
We're passing the outputs of
the encoder directly to it.
22
00:01:05,356 --> 00:01:07,770
In addition to the encoder outputs,
23
00:01:07,770 --> 00:01:10,800
we also give the decoder a sequence.
24
00:01:10,800 --> 00:01:12,840
When prompting the decoder for an output
25
00:01:12,840 --> 00:01:14,190
with no initial sequence,
26
00:01:14,190 --> 00:01:16,140
we can give it the value that indicates
27
00:01:16,140 --> 00:01:18,060
the start of a sequence.
28
00:01:18,060 --> 00:01:20,919
And that's where the
encoder-decoder magic happens.
29
00:01:20,919 --> 00:01:24,082
The encoder accepts a sequence as input.
30
00:01:24,082 --> 00:01:25,980
It computes a prediction,
31
00:01:25,980 --> 00:01:28,858
and outputs a numerical representation.
32
00:01:28,858 --> 00:01:33,120
Then, it sends that over to the decoder.
33
00:01:33,120 --> 00:01:36,300
It has, in a sense, encoded that sequence.
34
00:01:36,300 --> 00:01:38,130
And the decoder, in turn,
35
00:01:38,130 --> 00:01:40,847
using this input alongside
its usual sequence input,
36
00:01:40,847 --> 00:01:43,906
will take a stab at decoding the sequence.
37
00:01:43,906 --> 00:01:46,530
The decoder decodes the sequence,
38
00:01:46,530 --> 00:01:48,360
and outputs a word.
39
00:01:48,360 --> 00:01:51,300
For now, we don't need
to make sense of that word,
40
00:01:51,300 --> 00:01:53,100
but we can understand that the decoder
41
00:01:53,100 --> 00:01:56,103
is essentially decoding
what the encoder has output.
42
00:01:57,008 --> 00:02:00,000
The start-of-sequence word here
43
00:02:00,000 --> 00:02:02,871
indicates that it should
start decoding the sequence.
44
00:02:02,871 --> 00:02:06,870
Now that we have both the
encoder numerical representation
45
00:02:06,870 --> 00:02:09,570
and an initial generated word,
46
00:02:09,570 --> 00:02:11,343
we don't need the encoder anymore.
47
00:02:12,269 --> 00:02:15,540
As we have seen before with the decoder,
48
00:02:15,540 --> 00:02:18,720
it can act in an auto-regressive manner.
49
00:02:18,720 --> 00:02:22,933
The word it has just output
can now be used as an input.
50
00:02:22,933 --> 00:02:26,188
This, in combination with
the numerical representation
51
00:02:26,188 --> 00:02:28,560
output by the encoder,
52
00:02:28,560 --> 00:02:31,203
can now be used to generate a second word.
53
00:02:33,040 --> 00:02:35,910
Please note that the
first word is still here,
54
00:02:35,910 --> 00:02:37,770
as the model still outputs it.
55
00:02:37,770 --> 00:02:39,240
However, we have grayed it out
56
00:02:39,240 --> 00:02:40,940
as we have no need for it anymore.
57
00:02:41,880 --> 00:02:44,070
We can continue on and on, for example,
58
00:02:44,070 --> 00:02:46,320
until the decoder outputs a value
59
00:02:46,320 --> 00:02:48,540
that we consider a stopping value,
60
00:02:48,540 --> 00:02:51,093
like a dot meaning the end of a sequence.
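The loop just described can be sketched by hand. This is a hedged illustration assuming the t5-small checkpoint; in practice model.generate() does this for you:

```python
# Minimal sketch: greedy, auto-regressive decoding that reuses the
# encoder output at every step and stops at the end-of-sequence token.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

enc = tokenizer("translate English to French: Welcome to NYC", return_tensors="pt")
encoder_outputs = model.get_encoder()(**enc)  # the encoder runs only once
decoder_ids = torch.tensor([[model.config.decoder_start_token_id]])

for _ in range(20):
    logits = model(
        encoder_outputs=encoder_outputs,
        attention_mask=enc["attention_mask"],
        decoder_input_ids=decoder_ids,
    ).logits
    next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # most likely word
    decoder_ids = torch.cat([decoder_ids, next_id], dim=-1)  # feed it back in
    if next_id.item() == model.config.eos_token_id:  # the stopping value
        break

print(tokenizer.decode(decoder_ids[0], skip_special_tokens=True))
```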
61
00:02:53,580 --> 00:02:55,926
Here, we've seen the full mechanism
62
00:02:55,926 --> 00:02:57,540
of the encoder-decoder transformer.
63
00:02:57,540 --> 00:02:59,280
Let's go over it one more time.
64
00:02:59,280 --> 00:03:02,773
We have an initial sequence
that is sent to the encoder.
65
00:03:02,773 --> 00:03:06,450
That encoder output is
then sent to the decoder
66
00:03:06,450 --> 00:03:07,563
for it to be decoded.
67
00:03:08,760 --> 00:03:12,450
While we can now discard the
encoder after a single use,
68
00:03:12,450 --> 00:03:14,427
the decoder will be used several times
69
00:03:14,427 --> 00:03:17,763
until we have generated
every word that we need.
70
00:03:19,288 --> 00:03:21,510
So let's see a concrete example
71
00:03:21,510 --> 00:03:23,460
with translation language modeling,
72
00:03:23,460 --> 00:03:24,930
also called transduction,
73
00:03:24,930 --> 00:03:28,200
which is the act of
translating a sequence.
74
00:03:28,200 --> 00:03:30,577
Here, we would like to
translate this English sequence
75
00:03:30,577 --> 00:03:33,067
"Welcome to NYC" in French.
76
00:03:33,067 --> 00:03:35,460
We're using a transformer model
77
00:03:35,460 --> 00:03:38,070
that is trained for that task explicitly.
78
00:03:38,070 --> 00:03:40,560
We use the encoder to
create a representation
79
00:03:40,560 --> 00:03:42,240
of the English sentence.
80
00:03:42,240 --> 00:03:44,730
We pass this to the decoder and,
81
00:03:44,730 --> 00:03:46,620
using the
start-of-sequence word,
82
00:03:46,620 --> 00:03:49,173
we ask it to output the first word.
83
00:03:50,029 --> 00:03:53,607
It outputs "bienvenue", which means welcome.
84
00:03:53,607 --> 00:03:56,640
And we then use "bienvenue"
85
00:03:56,640 --> 00:03:59,283
as the input sequence for the decoder.
86
00:04:00,188 --> 00:04:04,470
This, alongside the encoder
numerical representation,
87
00:04:04,470 --> 00:04:07,440
allows the decoder to
predict the second word, "à",
88
00:04:07,440 --> 00:04:09,240
which means "to" in English.
89
00:04:09,240 --> 00:04:13,590
Finally, we ask the decoder
to predict a third word
90
00:04:13,590 --> 00:04:15,330
It predicts NYC, which is correct.
91
00:04:15,330 --> 00:04:18,288
We've translated the sentence.
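For reference, here is a minimal sketch of this translation example with the transformers pipeline; the checkpoint below is an assumption, not the model used in the video:

```python
# Minimal sketch: English-to-French translation with an
# encoder-decoder (sequence-to-sequence) model.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
result = translator("Welcome to NYC")
print(result[0]["translation_text"])  # along the lines of "Bienvenue à NYC"
```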
92
00:04:18,288 --> 00:04:20,760
Where the encoder-decoder really shines
93
00:04:20,760 --> 00:04:23,550
is that we have an encoder and a decoder,
94
00:04:23,550 --> 00:04:25,323
which often do not share weights.
95
00:04:26,256 --> 00:04:29,460
Therefore, we have an
entire block, the encoder,
96
00:04:29,460 --> 00:04:31,650
that can be trained to
understand the sequence
97
00:04:31,650 --> 00:04:34,290
and extract the relevant information.
98
00:04:34,290 --> 00:04:36,450
For the translation
scenario we've seen earlier,
99
00:04:36,450 --> 00:04:38,760
for example, this would mean parsing
100
00:04:38,760 --> 00:04:42,003
and understanding what was
said in the English language.
101
00:04:42,900 --> 00:04:45,960
It would mean extracting
information from that language,
102
00:04:45,960 --> 00:04:49,413
and putting all of that into
an information-dense vector.
103
00:04:50,361 --> 00:04:53,370
On the other hand, we have the decoder,
104
00:04:53,370 --> 00:04:56,850
whose sole purpose is to decode
the numerical representation
105
00:04:56,850 --> 00:04:58,203
output by the encoder.
106
00:04:59,460 --> 00:05:01,170
This decoder can be specialized
107
00:05:01,170 --> 00:05:02,970
in a completely different language,
108
00:05:02,970 --> 00:05:05,403
or even a modality, like images or speech.
109
00:05:07,170 --> 00:05:10,473
Encoder-decoders are
special for several reasons.
110
00:05:11,310 --> 00:05:15,570
Firstly, they're able to manage
sequence-to-sequence tasks,
111
00:05:15,570 --> 00:05:18,358
like the translation we have just seen.
112
00:05:18,358 --> 00:05:20,940
Secondly, the weights between the encoder
113
00:05:20,940 --> 00:05:24,540
and the decoder parts are
not necessarily shared.
114
00:05:24,540 --> 00:05:27,172
Let's take another example of translation.
115
00:05:27,172 --> 00:05:30,810
Here we're translating
"Transformers are powerful"
116
00:05:30,810 --> 00:05:32,048
into French.
117
00:05:32,048 --> 00:05:35,258
Firstly, this means that from
a sequence of three words,
118
00:05:35,258 --> 00:05:39,030
we're able to generate a
sequence of four words.
119
00:05:39,030 --> 00:05:42,480
One could argue that this
could be handled with a decoder
120
00:05:42,480 --> 00:05:44,160
that would generate the translation
121
00:05:44,160 --> 00:05:46,260
in an auto-regressive manner,
122
00:05:46,260 --> 00:05:47,460
and they would be right.
123
00:05:49,980 --> 00:05:51,930
Another example of where
sequence-to-sequence
124
00:05:51,930 --> 00:05:54,810
transformers shine is in summarization.
125
00:05:54,810 --> 00:05:58,379
Here we have a very long
sequence, generally a full text,
126
00:05:58,379 --> 00:06:01,020
and we want to summarize it.
127
00:06:01,020 --> 00:06:04,020
Since the encoder and
decoder are separate,
128
00:06:04,020 --> 00:06:06,300
we can have different context lengths.
129
00:06:06,300 --> 00:06:08,910
For example, a very long
context for the encoder,
130
00:06:08,910 --> 00:06:10,230
which handles the text,
131
00:06:10,230 --> 00:06:12,210
and a smaller context for the decoder
132
00:06:12,210 --> 00:06:14,223
which handles the summarized sequence.
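Here is a minimal sketch of that summarization use case, assuming a placeholder input text and an example checkpoint; note the generated summary is capped well below the input length:

```python
# Minimal sketch: summarization with an encoder-decoder model.
from transformers import pipeline

summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
long_text = "..."  # replace with the full text you want to summarize
summary = summarizer(long_text, max_length=60, min_length=10)  # short output
print(summary[0]["summary_text"])
```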
133
00:06:16,470 --> 00:06:18,840
There are a lot of
sequence-to-sequence models.
134
00:06:18,840 --> 00:06:20,310
Here are a few examples
135
00:06:20,310 --> 00:06:22,500
of popular encoder-decoder models
136
00:06:22,500 --> 00:06:24,400
available in the transformers library.
137
00:06:25,829 --> 00:06:29,940
Additionally, you can load
an encoder and a decoder
138
00:06:29,940 --> 00:06:32,130
inside an encoder-decoder model.
139
00:06:32,130 --> 00:06:35,190
Therefore, depending on the
specific task you are targeting,
140
00:06:35,190 --> 00:06:38,700
you may choose to use specific
encoders and decoders,
141
00:06:38,700 --> 00:06:42,613
which have proven their worth
on these specific tasks.
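As a hedged illustration of that last point, the transformers library lets you compose an encoder-decoder model from two standalone pretrained checkpoints (the ones below are just examples):

```python
# Minimal sketch: build an encoder-decoder from two pretrained models.
from transformers import EncoderDecoderModel

model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased",  # the encoder half
    "bert-base-uncased",  # the decoder half; cross-attention layers are added
)
# The combined model still needs fine-tuning on the target
# sequence-to-sequence task before it is useful.
```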
142
00:06:42,613 --> 00:06:44,696
(swoosh)