(logo whooshing)

- Let's study the transformer architecture. This video is the introduction to the encoders, decoders, and encoder-decoder series of videos. In this series, we'll try to understand what makes a transformer network, and we'll explain it in simple, high-level terms. No advanced understanding of neural networks is necessary, but a basic understanding of vectors and tensors may help.

To get started, we'll take up this diagram from the original transformer paper, "Attention Is All You Need" by Vaswani et al. As we'll see here, we can leverage only some parts of it, according to what we're trying to do. We won't dive into the specific layers building up that architecture, but we'll try to understand the different ways this architecture can be used.

Let's first start by splitting that architecture into two parts. On the left we have the encoder, and on the right, the decoder. These two can be used together, but they can also be used independently. Let's understand how they work.

The encoder accepts inputs that represent text. It converts this text, these words, into numerical representations. These numerical representations can also be called embeddings, or features. We'll see that it uses the self-attention mechanism as its main component. We recommend you check out the video on encoders specifically to understand what this numerical representation is, as well as how it works. There, we'll study the self-attention mechanism in more detail, as well as its bi-directional properties.

The decoder is similar to the encoder: it can also accept text inputs. It uses a similar mechanism to the encoder, namely masked self-attention. It differs from the encoder due to its uni-directional property, and it is traditionally used in an auto-regressive manner. Here too, we recommend you check out the video on decoders to understand how all of this works.
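To make the encoder side concrete, here is a minimal sketch of turning text into numerical representations (features). It assumes the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint purely as an illustrative example; any encoder-only checkpoint would behave the same way.

```python
# Minimal sketch: extract features with an encoder-only model.
# Assumes the `transformers` library and the `bert-base-uncased`
# checkpoint (chosen only as an illustrative example).
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Let's study the transformer architecture.", return_tensors="pt")
outputs = model(**inputs)

# One vector per input token, computed with bi-directional self-attention:
# shape is (batch_size, sequence_length, hidden_size).
print(outputs.last_hidden_state.shape)
```

Each input token ends up with a feature vector that takes the whole sentence into account, which is what the bi-directional self-attention mechanism provides.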
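The decoder side can be sketched in a similar way. A decoder-only model such as `gpt2` (again, just an example checkpoint, not the one used in the video) relies on masked self-attention, so each position only sees the tokens before it, and generation proceeds one token at a time, re-using previous outputs.

```python
# Minimal sketch: auto-regressive generation with a decoder-only model.
# Assumes the `transformers` library and the `gpt2` checkpoint
# (chosen only as an illustrative example).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Transformers are", return_tensors="pt")

# Each newly generated token is appended to the input and fed back in,
# which is what "auto-regressive" refers to.
output_ids = model.generate(
    **inputs,
    max_new_tokens=10,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token by default
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```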
Combining the two parts results in what is known as an encoder-decoder, or a sequence-to-sequence transformer. The encoder accepts inputs and computes a high-level representation of those inputs. These outputs are then passed to the decoder. The decoder uses the encoder's output, alongside other inputs, to generate a prediction. It then predicts an output, which it will re-use in future iterations, hence the term auto-regressive. Finally, to get an understanding of encoder-decoders as a whole, we recommend you check out the video on encoder-decoders.

(logo whooshing)
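As a final illustration, here is a minimal sketch of the full encoder-decoder combination. It assumes the `transformers` library and the `t5-small` checkpoint (an arbitrary example of a sequence-to-sequence model): the encoder computes a representation of the input, and the decoder generates the output auto-regressively while attending to that representation.

```python
# Minimal sketch: sequence-to-sequence generation with an encoder-decoder model.
# Assumes the `transformers` library and the `t5-small` checkpoint
# (chosen only as an illustrative example; also needs `sentencepiece`).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# The encoder reads the full input sentence...
inputs = tokenizer("translate English to French: Welcome to the course.", return_tensors="pt")

# ...and the decoder generates the output one token at a time,
# re-using its previous predictions at each step.
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```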