1
00:00:00,000 --> 00:00:02,750
(logo whooshing)
2
00:00:05,010 --> 00:00:07,323
- Let's study the
transformer architecture.
3
00:00:09,150 --> 00:00:12,030
This video introduces
the encoders,
4
00:00:12,030 --> 00:00:15,510
decoders, and encoder-decoder
series of videos.
5
00:00:15,510 --> 00:00:16,343
In this series,
6
00:00:16,343 --> 00:00:18,900
we'll try to understand what
makes up a transformer network,
7
00:00:18,900 --> 00:00:22,770
and we'll try to explain it
in simple, high-level terms.
8
00:00:22,770 --> 00:00:25,800
No advanced understanding of
neural networks is necessary,
9
00:00:25,800 --> 00:00:29,343
but an understanding of basic
vectors and tensors may help.
10
00:00:32,250 --> 00:00:33,270
To get started,
11
00:00:33,270 --> 00:00:34,530
we'll take a look at this diagram
12
00:00:34,530 --> 00:00:36,630
from the original transformer paper,
13
00:00:36,630 --> 00:00:40,140
entitled "Attention Is All
You Need", by Vaswani et al.
14
00:00:40,140 --> 00:00:41,010
As we'll see here,
15
00:00:41,010 --> 00:00:42,780
we can leverage only some parts of it,
16
00:00:42,780 --> 00:00:44,630
depending on what we're trying to do.
17
00:00:45,480 --> 00:00:47,610
We won't dive into the specific layers
18
00:00:47,610 --> 00:00:48,990
building up that architecture,
19
00:00:48,990 --> 00:00:51,390
but we'll try to understand
the different ways
20
00:00:51,390 --> 00:00:52,893
this architecture can be used.
21
00:00:55,170 --> 00:00:56,003
Let's first start
22
00:00:56,003 --> 00:00:58,260
by splitting that
architecture into two parts.
23
00:00:58,260 --> 00:00:59,910
On the left we have the encoder,
24
00:00:59,910 --> 00:01:01,980
and on the right, the decoder.
25
00:01:01,980 --> 00:01:03,330
These two can be used together,
26
00:01:03,330 --> 00:01:05,330
but they can also be used independently.
27
00:01:06,180 --> 00:01:08,610
Let's understand how these work.
28
00:01:08,610 --> 00:01:11,460
The encoder accepts inputs
that represent text.
29
00:01:11,460 --> 00:01:13,620
It converts this text, these words,
30
00:01:13,620 --> 00:01:15,675
into numerical representations.
31
00:01:15,675 --> 00:01:17,400
These numerical representations
32
00:01:17,400 --> 00:01:20,460
can also be called
embeddings, or features.
33
00:01:20,460 --> 00:01:23,100
We'll see that it uses the
self-attention mechanism
34
00:01:23,100 --> 00:01:24,483
as its main component.
35
00:01:25,500 --> 00:01:27,120
We recommend you check out the video
36
00:01:27,120 --> 00:01:29,700
on encoders specifically to understand
37
00:01:29,700 --> 00:01:31,680
what this numerical representation is,
38
00:01:31,680 --> 00:01:33,690
as well as how it works.
39
00:01:33,690 --> 00:01:36,660
We'll study the self-attention
mechanism in more detail,
40
00:01:36,660 --> 00:01:38,913
as well as its bi-directional properties.
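A minimal sketch, assuming the Hugging Face transformers library and the example encoder-only checkpoint bert-base-uncased, of how an encoder turns text into these numerical representations:

```python
# Sketch only: get the per-token features ("embeddings") an encoder produces.
# "bert-base-uncased" is just one example of an encoder-only checkpoint.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Let's study the transformer architecture.", return_tensors="pt")
outputs = encoder(**inputs)

# One feature vector per input token, computed with bi-directional self-attention.
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```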
41
00:01:40,650 --> 00:01:42,780
The decoder is similar to the encoder.
42
00:01:42,780 --> 00:01:45,630
It can also accept text inputs.
43
00:01:45,630 --> 00:01:48,210
It uses a similar
mechanism to the encoder's,
44
00:01:48,210 --> 00:01:51,150
which is masked
self-attention.
45
00:01:51,150 --> 00:01:52,590
It differs from the encoder
46
00:01:52,590 --> 00:01:54,990
due to its uni-directional property
47
00:01:54,990 --> 00:01:58,590
and is traditionally used in
an auto-regressive manner.
48
00:01:58,590 --> 00:02:01,650
Here too, we recommend you
check out the video on decoders
49
00:02:01,650 --> 00:02:04,000
specifically to understand
how all of this works.
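A minimal sketch, assuming the transformers library and the example decoder-only checkpoint gpt2, of this auto-regressive use of a decoder:

```python
# Sketch only: auto-regressive text generation with a decoder-only model.
# "gpt2" is just one example checkpoint; each new token is predicted from
# the tokens produced so far (masked, uni-directional self-attention).
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
decoder = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The transformer architecture", return_tensors="pt")
generated = decoder.generate(**inputs, max_new_tokens=20)

print(tokenizer.decode(generated[0], skip_special_tokens=True))
```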
50
00:02:06,810 --> 00:02:07,890
Combining the two parts
51
00:02:07,890 --> 00:02:10,200
results in what is known
as an encoder-decoder,
52
00:02:10,200 --> 00:02:12,720
or a sequence-to-sequence transformer.
53
00:02:12,720 --> 00:02:14,280
The encoder accepts inputs
54
00:02:14,280 --> 00:02:17,850
and computes a high-level
representation of those inputs.
55
00:02:17,850 --> 00:02:20,252
These outputs are then
passed to the decoder.
56
00:02:20,252 --> 00:02:22,860
The decoder uses the encoder's output,
57
00:02:22,860 --> 00:02:26,370
alongside other inputs,
to generate a prediction.
58
00:02:26,370 --> 00:02:27,900
It then predicts an output,
59
00:02:27,900 --> 00:02:30,248
which it will re-use in future iterations,
60
00:02:30,248 --> 00:02:32,662
hence the term auto-regressive.
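A minimal sketch, assuming the transformers library and the example sequence-to-sequence checkpoint t5-small, of the full encoder-decoder pipeline:

```python
# Sketch only: an encoder-decoder (sequence-to-sequence) model. "t5-small" is
# just one example checkpoint: the encoder computes a high-level representation
# of the input, and the decoder consumes it, together with the tokens it has
# already produced, to predict the output auto-regressively.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

inputs = tokenizer(
    "translate English to French: Let's study the transformer architecture.",
    return_tensors="pt",
)
outputs = model.generate(**inputs, max_new_tokens=40)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```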
61
00:02:32,662 --> 00:02:34,740
Finally, to get an understanding
62
00:02:34,740 --> 00:02:36,690
of the encoder-decoders as a whole,
63
00:02:36,690 --> 00:02:39,670
we recommend you check out
the video on encoder-decoders.
64
00:02:39,670 --> 00:02:42,420
(logo whooshing)