subtitles/zh-CN/06_transformer-models-decoders.srt
1
00:00:03,750 --> 00:00:07,140
- 在本视频中,我们将研究解码器架构。
- In this video, we'll study the decoder architecture.
2
00:00:07,140 --> 00:00:07,973
一个流行的仅解码器(decoder-only)架构
An example
3
00:00:07,973 --> 00:00:11,338
的例子是 GPT-2。
of a popular decoder-only architecture is GPT-2.
4
00:00:11,338 --> 00:00:14,160
为了了解解码器的工作原理
In order to understand how decoders work
5
00:00:14,160 --> 00:00:17,430
我们建议您观看有关编码器的视频。
we recommend taking a look at the video regarding encoders.
6
00:00:17,430 --> 00:00:19,980
它们与解码器极为相似。
They're extremely similar to decoders.
7
00:00:19,980 --> 00:00:21,210
可以使用解码器
One can use a decoder
8
00:00:21,210 --> 00:00:23,760
执行与编码器相同的大多数任务
for most of the same tasks as an encoder
9
00:00:23,760 --> 00:00:27,330
尽管通常会有一点性能损失。
albeit with generally a little loss of performance.
10
00:00:27,330 --> 00:00:28,890
让我们采用相同的方法
Let's take the same approach we have taken
11
00:00:28,890 --> 00:00:30,300
就像处理编码器那样,试着
with the encoder to try
12
00:00:30,300 --> 00:00:32,670
并理解编码器和解码器之间
and understand the architectural differences
13
00:00:32,670 --> 00:00:34,803
的架构差异。
between an encoder and decoder.
14
00:00:35,777 --> 00:00:38,910
我们将用一个包含三个单词的小例子。
We'll use a small example using three words.
15
00:00:38,910 --> 00:00:41,050
我们将它们传入解码器。
We pass them through the decoder.
16
00:00:41,050 --> 00:00:44,793
我们检索每个单词的数值表示。
We retrieve a numerical representation for each word.
17
00:00:46,410 --> 00:00:49,350
这里举例来说,解码器将这三个单词
Here, for example, the decoder converts the three words
18
00:00:49,350 --> 00:00:53,545
Welcome to NYC 转换为这三个数字序列。
Welcome to NYC, into these three sequences of numbers.
19
00:00:53,545 --> 00:00:56,040
解码器针对每个输入词汇
The decoder outputs exactly one sequence
20
00:00:56,040 --> 00:00:58,740
只输出一个数字序列。
of numbers per input word.
21
00:00:58,740 --> 00:01:00,630
这个数值表示也可以
This numerical representation can also
22
00:01:00,630 --> 00:01:03,783
称为特征向量(feature vector)或特征张量(feature tensor)。
be called a feature vector or a feature tensor.
23
00:01:04,920 --> 00:01:07,200
让我们深入了解这种表示。
Let's dive into this representation.
24
00:01:07,200 --> 00:01:08,490
它包含了每个通过解码器
It contains one vector
25
00:01:08,490 --> 00:01:11,340
传递的单词的一个向量。
per word that was passed through the decoder.
26
00:01:11,340 --> 00:01:14,250
这些向量中的每一个
Each of these vectors is a numerical representation
27
00:01:14,250 --> 00:01:15,573
都是对应单词的数值表示。
of the word in question.
28
00:01:16,920 --> 00:01:18,562
这个向量的维度
The dimension of that vector is defined
29
00:01:18,562 --> 00:01:20,703
由模型的架构所决定。
by the architecture of the model.
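
下面是一段最小的示意代码(假设已安装 transformers 和 PyTorch,并能下载 "gpt2" 检查点),用来验证上述形状:
A minimal sketch (assuming the transformers library, PyTorch, and access to the "gpt2" checkpoint) to check the shapes described above:

import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

inputs = tokenizer("Welcome to NYC", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One feature vector per input token; the hidden size (768 for GPT-2)
# is fixed by the model's architecture.
print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, 3, 768])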
30
00:01:22,860 --> 00:01:26,040
解码器与编码器的主要区别在于
Where the decoder differs from the encoder is principally
31
00:01:26,040 --> 00:01:28,200
其自注意力机制。
with its self-attention mechanism.
32
00:01:28,200 --> 00:01:30,843
它使用的是所谓的掩蔽自注意力(masked self-attention)。
It's using what is called masked self-attention.
33
00:01:31,860 --> 00:01:34,650
例如,在这里,如果我们关注 “to” 这个词
Here, for example, if we focus on the word "to"
34
00:01:34,650 --> 00:01:37,620
我们会发现它的向量
we'll see that its vector is absolutely unmodified
35
00:01:37,620 --> 00:01:39,690
完全没有被 NYC 这个单词修改。
by the NYC word.
36
00:01:39,690 --> 00:01:41,731
那是因为右侧的所有单词,也就是
That's because all the words on the right, also known
37
00:01:41,731 --> 00:01:45,276
该单词的右侧上下文,都被屏蔽了
as the right context of the word, are masked, rather
38
00:01:45,276 --> 00:01:49,230
而不能从左右两侧的所有单词中受益,
than benefiting from all the words on the left and right,
39
00:01:49,230 --> 00:01:51,600
也就是双向上下文。
that is, the bidirectional context.
40
00:01:51,600 --> 00:01:55,020
解码器只能访问单侧上下文
Decoders only have access to a single context
41
00:01:55,020 --> 00:01:58,203
可以是左上下文或右上下文。
which can be the left context or the right context.
42
00:01:59,539 --> 00:02:03,356
掩蔽自注意力机制不同于
The masked self-attention mechanism differs
43
00:02:03,356 --> 00:02:04,320
自注意力机制
from the self-attention mechanism
44
00:02:04,320 --> 00:02:07,110
它使用额外的掩码来隐藏单词
by using an additional mask to hide the context
45
00:02:07,110 --> 00:02:09,390
某一侧的上下文:
on either side of the word:
46
00:02:09,390 --> 00:02:12,810
这样,单词的数值表示就不会
the word's numerical representation will not be affected
47
00:02:12,810 --> 00:02:14,853
受到被隐藏上下文中单词的影响。
by the words in the hidden context.
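
下面用 PyTorch 给出掩蔽自注意力的一个最小示意(纯演示,并非 GPT-2 的实际实现):
A minimal PyTorch sketch of masked self-attention (purely illustrative, not GPT-2's actual implementation):

import torch

seq_len = 3  # three positions: "Welcome", "to", "NYC"
scores = torch.randn(seq_len, seq_len)  # stand-in attention scores

# Mask everything above the diagonal: position i may not attend
# to positions j > i, i.e. its right context.
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
weights = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)

print(weights)  # the row for "to" puts zero weight on "NYC"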
48
00:02:16,260 --> 00:02:18,330
那么什么时候应该使用解码器呢?
So when should one use a decoder?
49
00:02:18,330 --> 00:02:22,380
与编码器一样,解码器可以用作独立模型
Decoders, like encoders, can be used as standalone models
50
00:02:22,380 --> 00:02:25,020
因为它们生成数值表示。
as they generate a numerical representation.
51
00:02:25,020 --> 00:02:28,320
它们还可以用于各种各样的任务。
They can also be used in a wide variety of tasks.
52
00:02:28,320 --> 00:02:31,260
然而,解码器的优势在于
However, the strength of a decoder lies in the way
53
00:02:31,260 --> 00:02:34,530
单词只能访问其左侧上下文这一特点。
a word can only have access to its left context.
54
00:02:34,530 --> 00:02:36,690
由于只能访问左侧上下文,
Having only access to their left context,
55
00:02:36,690 --> 00:02:39,120
它们天生擅长文本生成,
they're inherently good at text generation,
56
00:02:39,120 --> 00:02:41,010
即在已知的词序列基础上生成一个单词
the ability to generate a word
57
00:02:41,010 --> 00:02:45,000
或单词序列的能力。
or a sequence of words given a known sequence of words.
58
00:02:45,000 --> 00:02:45,833
这被称为
This is known
59
00:02:45,833 --> 00:02:49,083
因果语言建模或自然语言生成。
as causal language modeling or natural language generation.
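
在实践中体验因果语言建模的一种方式(假设已安装 transformers)是 text-generation pipeline:
One way to try causal language modeling in practice (assuming the transformers library) is the text-generation pipeline:

from transformers import pipeline

# Generates text left to right, one token at a time.
generator = pipeline("text-generation", model="gpt2")
print(generator("My name", max_new_tokens=10))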
60
00:02:50,430 --> 00:02:53,520
下面是一个展示因果语言模型的工作原理的示例。
Here's an example of how causal language modeling works.
61
00:02:53,520 --> 00:02:56,410
我们从一个初始单词 “my” 开始,
We start with an initial word, which is “my”
62
00:02:57,339 --> 00:02:59,973
我们将其用作解码器的输入。
we use this as input for the decoder.
63
00:03:00,810 --> 00:03:04,260
该模型输出一个数字向量
The model outputs a vector of numbers
64
00:03:04,260 --> 00:03:07,230
这个向量包含有关序列的信息
and this vector contains information about the sequence
65
00:03:07,230 --> 00:03:08,733
这里的序列是一个单词。
which is here a single word.
66
00:03:09,780 --> 00:03:11,430
我们对该向量
We apply a small transformation
67
00:03:11,430 --> 00:03:13,110
应用一个小的转换
to that vector so that it maps
68
00:03:13,110 --> 00:03:16,500
使其映射到模型已知的所有单词
to all the words known by the model, which is a mapping
69
00:03:16,500 --> 00:03:19,890
这个映射我们稍后会看到,称为语言建模头(language modeling head)。
that we'll see later called a language modeling head.
70
00:03:19,890 --> 00:03:21,930
我们发现模型认为
We identify that the model believes
71
00:03:21,930 --> 00:03:25,053
接下来最有可能的单词是 “name”。
that the most probable following word is “name”.
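
下面是单步解码的一个最小示意(同样假设可以使用 "gpt2" 检查点;实际预测出的词未必与视频中的 “name” 相同):
A minimal sketch of a single decoding step (again assuming the "gpt2" checkpoint; the actual prediction may differ from the video's “name”):

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("My", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (batch, sequence, vocabulary)

# The language modeling head maps the last position to a score for
# every word the model knows; pick the most probable one.
next_id = int(logits[0, -1].argmax())
print(tokenizer.decode([next_id]))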
72
00:03:26,250 --> 00:03:28,710
然后我们取出这个新单词,把它添加
We then take that new word and add it
73
00:03:28,710 --> 00:03:33,480
到初始序列后面。从 “my” 开始,我们现在得到了 “my name”。
to the initial sequence. From “my”, we are now at “my name”.
74
00:03:33,480 --> 00:03:36,870
这就是自回归(auto-regressive)的作用所在。
This is where the auto-regressive aspect comes in.
75
00:03:36,870 --> 00:03:38,490
自回归模型
Auto-regressive models
76
00:03:38,490 --> 00:03:42,513
在后续步骤中使用它们过去的输出作为输入。
use their past outputs as inputs in the following steps.
77
00:03:43,452 --> 00:03:46,980
我们再次执行完全相同的操作。
Once again, we do the exact same operation.
78
00:03:46,980 --> 00:03:49,500
我们将该序列传入解码器
We pass that sequence through the decoder
79
00:03:49,500 --> 00:03:51,993
并检索最有可能的后续词。
and retrieve the most probable following word.
80
00:03:52,978 --> 00:03:57,978
本例中就是 “is” 这个词。我们重复这一操作
In this case, it is the word “is”. We repeat the operation
81
00:03:58,230 --> 00:04:02,040
直到我们满意为止。从一个单词开始,
until we're satisfied. Starting from a single word,
82
00:04:02,040 --> 00:04:04,590
我们现在已经生成了一个完整的句子。
we've now generated a full sentence.
83
00:04:04,590 --> 00:04:07,890
我们决定就此打住,但我们也可以继续一段时间。
We decide to stop there, but we could continue for a while.
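
把上面的单步操作放进循环,就得到自回归生成的一个简单示意(贪心解码,假设同上):
Putting the single step above in a loop gives a simple sketch of auto-regressive generation (greedy decoding, same assumptions as above):

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

ids = tokenizer("My", return_tensors="pt").input_ids
for _ in range(8):  # stop whenever we're satisfied
    with torch.no_grad():
        logits = model(ids).logits
    next_id = logits[0, -1].argmax().reshape(1, 1)
    ids = torch.cat([ids, next_id], dim=1)  # past output becomes input

print(tokenizer.decode(ids[0]))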
84
00:04:07,890 --> 00:04:12,890
例如,GPT-2 的最大上下文大小为 1,024。
GPT-2, for example, has a maximum context size of 1,024.
85
00:04:13,170 --> 00:04:16,830
我们最终可以生成多达 1,024 个单词
We could eventually generate up to 1,024 words
86
00:04:16,830 --> 00:04:19,050
并且解码器仍然会对这个序列的前几个单词
and the decoder would still have some memory
87
00:04:19,050 --> 00:04:21,003
有一些记忆。
of the first words in this sequence.
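
GPT-2 的最大上下文大小保存在其配置中,可以这样查看(假设已安装 transformers):
GPT-2's maximum context size is stored in its configuration and can be checked like this (assuming the transformers library):

from transformers import GPT2Config

config = GPT2Config.from_pretrained("gpt2")
print(config.n_positions)  # 1024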