subtitles/zh-CN/07_transformer-models-encoder-decoders.srt
1
00:00:00,520 --> 00:00:02,603
(嗖嗖)
(swoosh)
2
00:00:04,230 --> 00:00:05,063
- 在这个视频中,
- In this video,
3
00:00:05,063 --> 00:00:07,638
我们将研究编码器-解码器架构。
we'll study the encoder-decoder architecture.
4
00:00:07,638 --> 00:00:12,243
流行的编码器-解码器模型的一个例子是 T5。
An example of a popular encoder-decoder model is T5.
5
00:00:13,770 --> 00:00:16,980
为了理解编码器-解码器是如何工作的,
In order to understand how the encoder-decoder works,
6
00:00:16,980 --> 00:00:18,630
我们建议您查看
we recommend you check out the videos
7
00:00:18,630 --> 00:00:22,590
关于编码器和解码器作为独立模型的视频。
on encoders and decoders as standalone models.
8
00:00:22,590 --> 00:00:24,990
了解它们如何单独工作
Understanding how they work individually
9
00:00:24,990 --> 00:00:28,323
将有助于理解编码器-解码器的工作原理。
will help understanding how an encoder-decoder works.
10
00:00:30,510 --> 00:00:33,390
让我们从我们已了解的编码器开始。
Let's start from what we've seen about the encoder.
11
00:00:33,390 --> 00:00:36,240
编码器将单词作为输入,
The encoder takes words as inputs,
12
00:00:36,240 --> 00:00:38,520
将它们送入编码器,
casts them through the encoder,
13
00:00:38,520 --> 00:00:40,800
并检索每个单词的
and retrieves a numerical representation
14
00:00:40,800 --> 00:00:42,663
数值表示。
for each word cast through it.
15
00:00:43,560 --> 00:00:46,470
我们现在知道这个数值表示
We now know that this numerical representation
16
00:00:46,470 --> 00:00:49,473
包含关于序列意义的信息。
holds information about the meaning of the sequence.
17
00:00:51,090 --> 00:00:54,243
让我们把这个放在一边,将解码器添加到图中。
Let's put this aside and add the decoder to the diagram.
18
00:00:56,610 --> 00:00:57,510
在这种情况下,
In this scenario,
19
00:00:57,510 --> 00:00:59,190
我们正以一种以前没见过的方式
we're using the decoder in a manner
20
00:00:59,190 --> 00:01:00,960
使用解码器。
that we haven't seen before.
21
00:01:00,960 --> 00:01:04,173
我们将编码器的输出直接传递给它。
We're passing the outputs of the encoder directly to it.
22
00:01:05,356 --> 00:01:07,770
除了编码器的输出之外,
Additionally to the encoder outputs,
23
00:01:07,770 --> 00:01:10,800
我们还会给解码器一个序列。
we also give the decoder a sequence.
24
00:01:10,800 --> 00:01:12,840
在不给定初始序列的情况下
When prompting the decoder for an output
25
00:01:12,840 --> 00:01:14,190
向解码器提示输出时,
with no initial sequence,
26
00:01:14,190 --> 00:01:16,140
我们可以给它一个
we can give it the value that indicates
27
00:01:16,140 --> 00:01:18,060
表示序列开头的值。
the start of a sequence.
28
00:01:18,060 --> 00:01:20,919
这就是编码器-解码器发挥魔力的地方。
And that's where the encoder-decoder magic happens.
29
00:01:20,919 --> 00:01:24,082
编码器接受一个序列作为输入。
The encoder accepts a sequence as input.
30
00:01:24,082 --> 00:01:25,980
它计算一个预测,
It computes a prediction,
31
00:01:25,980 --> 00:01:28,858
并输出一个数值表示。
and outputs a numerical representation.
32
00:01:28,858 --> 00:01:33,120
然后,它将其发送到解码器。
Then, it sends that over to the decoder.
33
00:01:33,120 --> 00:01:36,300
从某种意义上说,它编码了那个序列。
It has, in a sense, encoded that sequence.
34
00:01:36,300 --> 00:01:38,130
反过来,解码器
And the decoder, in turn,
35
00:01:38,130 --> 00:01:40,847
将此输入与其通常的序列输入一起使用,
using this input alongside its usual sequence input,
36
00:01:40,847 --> 00:01:43,906
将尝试解码序列。
will take a stab at decoding the sequence.
37
00:01:43,906 --> 00:01:46,530
解码器解码序列,
The decoder decodes the sequence,
38
00:01:46,530 --> 00:01:48,360
并输出一个词。
and outputs a word.
39
00:01:48,360 --> 00:01:51,300
到目前为止,我们不需要理解这个词,
As of now, we don't need to make sense of that word,
40
00:01:51,300 --> 00:01:53,100
但我们可以理解解码器
but we can understand that the decoder
41
00:01:53,100 --> 00:01:56,103
本质上是解码编码器输出的内容。
is essentially decoding what the encoder has output.
42
00:01:57,008 --> 00:02:00,000
这里的序列起始词
The start of sequence word here
43
00:02:00,000 --> 00:02:02,871
表示它应该开始解码序列。
indicates that it should start decoding the sequence.
44
00:02:02,871 --> 00:02:06,870
现在我们有了编码器的数值表示
Now that we have both the encoder numerical representation
45
00:02:06,870 --> 00:02:09,570
和一个初始生成的词,
and an initial generated word,
46
00:02:09,570 --> 00:02:11,343
我们不再需要编码器了。
we don't need the encoder anymore.
47
00:02:12,269 --> 00:02:15,540
正如我们之前在讲解码器时看到的那样,
As we have seen before with the decoder,
48
00:02:15,540 --> 00:02:18,720
它可以以自回归的方式起作用。
it can act in an auto-regressive manner.
49
00:02:18,720 --> 00:02:22,933
它刚刚输出的单词现在可以用作输入。
The word it has just output can now be used as an input.
50
00:02:22,933 --> 00:02:26,188
这个词与编码器输出的
This, in combination with the numerical representation
51
00:02:26,188 --> 00:02:28,560
数值表示相结合,
output by the encoder,
52
00:02:28,560 --> 00:02:31,203
可以被用于生成第二个单词。
can now be used to generate a second word.
53
00:02:33,040 --> 00:02:35,910
请注意,第一个词仍然在这里,
Please note that the first word is still here,
54
00:02:35,910 --> 00:02:37,770
因为模型仍然输出它。
as the model still outputs it.
55
00:02:37,770 --> 00:02:39,240
但是,我们已将其变灰
However, we have grayed it out
56
00:02:39,240 --> 00:02:40,940
因为我们不再需要它了。
as we have no need for it anymore.
57
00:02:41,880 --> 00:02:44,070
我们可以继续下去,例如,
We can continue on and on, for example,
58
00:02:44,070 --> 00:02:46,320
直到解码器输出一个
until the decoder outputs a value
59
00:02:46,320 --> 00:02:48,540
我们认为是停止值的数值,
that we consider a stopping value,
60
00:02:48,540 --> 00:02:51,093
比如句号表示序列的结束。
like a dot meaning the end of a sequence.
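下面用一段示意性的 Python 草图总结这个自回归循环(纯属示意:greedy_decode、encoder、decoder、sos_id、eos_id 等名称都是为说明而假设的,并非某个真实库的 API):编码器只运行一次,解码器则反复运行,直到输出停止值为止。

```python
# 仅为示意的草图:encoder 和 decoder 是假设的可调用对象
def greedy_decode(encoder, decoder, input_ids, sos_id, eos_id, max_len=50):
    # 编码器只对输入序列运行一次,得到数值表示
    encoder_states = encoder(input_ids)
    # 以“序列起始”值作为解码器的初始输入
    generated = [sos_id]
    for _ in range(max_len):
        # 解码器同时使用编码器输出和此前已生成的序列
        next_id = decoder(generated, encoder_states)
        generated.append(next_id)
        # 遇到停止值(例如表示序列结束的句号)就停止
        if next_id == eos_id:
            break
    return generated
```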
61
00:02:53,580 --> 00:02:55,926
在这里,我们已经看到了编码器-解码器 transformer
Here, we've seen the full mechanism
62
00:02:55,926 --> 00:02:57,540
完整的机制。
of the encoder-decoder transformer.
63
00:02:57,540 --> 00:02:59,280
让我们再看一遍。
Let's go over it one more time.
64
00:02:59,280 --> 00:03:02,773
我们有一个初始序列被送到编码器中。
We have an initial sequence that is sent to the encoder.
65
00:03:02,773 --> 00:03:06,450
然后编码器的输出被发送到解码器
That encoder output is then sent to the decoder
66
00:03:06,450 --> 00:03:07,563
以便对其进行解码。
for it to be decoded.
67
00:03:08,760 --> 00:03:12,450
虽然在一次使用后可以丢弃编码器,
While it can now discard the encoder after a single use,
68
00:03:12,450 --> 00:03:14,427
但解码器将被多次使用
the decoder will be used several times
69
00:03:14,427 --> 00:03:17,763
直到我们生成了所需要的每一个词。
until we have generated every word that we need.
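顺带一提,前面说到的“序列起始值”和“停止值”,在 transformers 库中对应模型配置里的特殊标记。下面的小例子展示了如何查看它们(这里假设使用 t5-small 检查点;不同模型的特殊标记可能不同):

```python
from transformers import AutoModelForSeq2SeqLM

# 假设使用 t5-small;加载后可在配置中查看特殊标记
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# 解码器开始解码时使用的“序列起始值”
print(model.config.decoder_start_token_id)
# 表示“序列结束”的停止值
print(model.config.eos_token_id)
```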
70
00:03:19,288 --> 00:03:21,510
那么让我们结合翻译语言建模
So let's see a concrete example
71
00:03:21,510 --> 00:03:23,460
看一个具体的例子。
with Translation Language Modeling.
72
00:03:23,460 --> 00:03:24,930
也被称为转导(transduction),
Also called transduction,
73
00:03:24,930 --> 00:03:28,200
这是翻译序列的行为。
which is the act of translating a sequence.
74
00:03:28,200 --> 00:03:30,577
在这里,我们想把这个英文序列
Here, we would like to translate this English sequence
75
00:03:30,577 --> 00:03:33,067
“Welcome to NYC”翻译成法语。
"Welcome to NYC" in French.
76
00:03:33,067 --> 00:03:35,460
我们使用的是一个 transformer 模型,
We're using a transformer model
77
00:03:35,460 --> 00:03:38,070
它是专门针对该任务训练的。
that is trained for that task explicitly.
78
00:03:38,070 --> 00:03:40,560
我们使用编码器来创建英语句子
We use the encoder to create a representation
79
00:03:40,560 --> 00:03:42,240
的表示。
of the English sentence.
80
00:03:42,240 --> 00:03:44,730
我们将其传递给解码器,
We cast this to the decoder,
81
00:03:44,730 --> 00:03:46,620
并借助序列起始词,
with the use of the start of sequence word,
82
00:03:46,620 --> 00:03:49,173
要求它输出第一个单词。
we ask it to output the first word.
83
00:03:50,029 --> 00:03:53,607
它输出 bienvenue,意思是欢迎。
It outputs bienvenue, which means welcome.
84
00:03:53,607 --> 00:03:56,640
然后我们使用 bienvenue
And we then use bienvenue
85
00:03:56,640 --> 00:03:59,283
作为解码器的输入序列。
as the input sequence for the decoder.
86
00:04:00,188 --> 00:04:04,470
它与编码器的数值表示一起,
This, alongside the encoder numerical representation,
87
00:04:04,470 --> 00:04:07,440
使解码器能够预测出第二个词 à,
allows the decoder to predict the second word, à,
88
00:04:07,440 --> 00:04:09,240
它在英语中是 to 的意思。
which is to in English.
89
00:04:09,240 --> 00:04:13,590
最后,我们要求解码器预测第三个词
Finally, we ask the decoder to predict a third word
90
00:04:13,590 --> 00:04:15,330
它预测 NYC,这是正确的。
It predicts NYC, which is correct.
91
00:04:15,330 --> 00:04:18,288
我们已经翻译了这句话。
We've translated the sentence.
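如果想亲手复现上面的翻译过程,下面是一个基于 transformers 库的最小草图(假设使用 t5-small 检查点;实际输出未必与视频中的逐词演示完全一致):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# 假设使用 t5-small 检查点;T5 需要在输入前加任务前缀
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# 编码器一次性读取整句英文
inputs = tokenizer("translate English to French: Welcome to NYC",
                   return_tensors="pt")

# generate() 在内部以自回归方式运行解码器:
# 从序列起始标记开始,遇到序列结束标记即停止
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```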
92
00:04:18,288 --> 00:04:20,760
编码器-解码器真正出彩的地方,
Where the encoder-decoder really shines,
93
00:04:20,760 --> 00:04:23,550
是我们有一个编码器和一个解码器,
is that we have an encoder and a decoder,
94
00:04:23,550 --> 00:04:25,323
通常不共享权重。
which often do not share weights.
95
00:04:26,256 --> 00:04:29,460
因此,我们有一个完整的块,编码器,
Therefore, we have an entire block, the encoder,
96
00:04:29,460 --> 00:04:31,650
可以被训练,从而理解序列
that can be trained to understand the sequence
97
00:04:31,650 --> 00:04:34,290
并提取相关信息。
and extract the relevant information.
98
00:04:34,290 --> 00:04:36,450
对于我们之前看到的翻译场景,
For the translation scenario we've seen earlier,
99
00:04:36,450 --> 00:04:38,760
例如,这意味着解析
for example, this would mean parsing
100
00:04:38,760 --> 00:04:42,003
并理解用英语说的内容。
and understanding what was said in the English language.
101
00:04:42,900 --> 00:04:45,960
这意味着从该语言中提取信息,
It would mean extracting information from that language,
102
00:04:45,960 --> 00:04:49,413
并将所有这些放在一个信息密集的向量中。
and putting all of that in a vector dense in information.
103
00:04:50,361 --> 00:04:53,370
另一方面,我们有解码器,
On the other hand, we have the decoder,
104
00:04:53,370 --> 00:04:56,850
它的唯一目的是解码
whose sole purpose is to decode the numerical representation
105
00:04:56,850 --> 00:04:58,203
编码器输出的数值表示。
output by the encoder.
106
00:04:59,460 --> 00:05:01,170
这个解码器可以专门处理
This decoder can be specialized
107
00:05:01,170 --> 00:05:02,970
一种完全不同的语言,
in a completely different language,
108
00:05:02,970 --> 00:05:05,403
甚至像图像或语音这样的模态。
or even modality like images or speech.
109
00:05:07,170 --> 00:05:10,473
编码器-解码器之所以特殊,有几个原因。
Encoders-decoders are special for several reasons.
110
00:05:11,310 --> 00:05:15,570
首先,它们能够处理序列到序列的任务,
Firstly, they're able to manage sequence to sequence tasks,
111
00:05:15,570 --> 00:05:18,358
就像我们刚刚看到的翻译一样。
like translation that we have just seen.
112
00:05:18,358 --> 00:05:20,940
其次,编码器和解码器之间的权重
Secondly, the weights between the encoder
113
00:05:20,940 --> 00:05:24,540
并不一定共享。
and the decoder parts are not necessarily shared.
114
00:05:24,540 --> 00:05:27,172
再举一个翻译的例子。
Let's take another example of translation.
115
00:05:27,172 --> 00:05:30,810
这里我们要将“Transformers are powerful”
Here we're translating "Transformers are powerful"
116
00:05:30,810 --> 00:05:32,048
翻译成法语。
in French.
117
00:05:32,048 --> 00:05:35,258
首先,这意味着从三个单词的序列中,
Firstly, this means that from a sequence of three words,
118
00:05:35,258 --> 00:05:39,030
我们能够生成一个包含四个单词的序列。
we're able to generate a sequence of four words.
119
00:05:39,030 --> 00:05:42,480
有人可能会说,这可以只用一个解码器来完成,
One could argue that this could be handled with a decoder
120
00:05:42,480 --> 00:05:44,160
让它以自回归的方式
that would generate the translation
121
00:05:44,160 --> 00:05:46,260
生成翻译结果,
in an auto-regressive manner,
122
00:05:46,260 --> 00:05:47,460
这种说法是对的。
and they would be right.
123
00:05:49,980 --> 00:05:51,930
另一个能让序列到序列
Another example of where sequence to sequence
124
00:05:51,930 --> 00:05:54,810
transformer 模型大放异彩的例子是文本摘要。
transformers shine is in summarization.
125
00:05:54,810 --> 00:05:58,379
这里我们有一个很长的序列,通常是全文,
Here we have a very long sequence, generally a full text,
126
00:05:58,379 --> 00:06:01,020
而我们想对它做摘要。
and we want to summarize it.
127
00:06:01,020 --> 00:06:04,020
由于编码器和解码器是分开的,
Since the encoder and decoders are separated,
128
00:06:04,020 --> 00:06:06,300
我们可以有不同的上下文长度。
we can have different context lengths.
129
00:06:06,300 --> 00:06:08,910
例如,给编码器一个非常长的上下文,
For example, a very long context for the encoder,
130
00:06:08,910 --> 00:06:10,230
用来处理全文,
which handles the text,
131
00:06:10,230 --> 00:06:12,210
而解码器的上下文则较小,
and a smaller context for the decoder
132
00:06:12,210 --> 00:06:14,223
用来处理摘要后的序列。
which handles the summarized sequence.
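下面是摘要场景的一个小草图,用的是 transformers 的 pipeline 接口(这里的检查点 sshleifer/distilbart-cnn-12-6 只是一个常见的编码器-解码器摘要模型,并非视频指定的模型):

```python
from transformers import pipeline

# 假设使用一个常见的编码器-解码器摘要检查点
summarizer = pipeline("summarization",
                      model="sshleifer/distilbart-cnn-12-6")

# 编码器处理较长的输入文本
long_text = (
    "Transformers are a family of neural network architectures. "
    "Encoder-decoder variants read a long input with the encoder "
    "and generate a much shorter output with the decoder, which "
    "makes them a natural fit for summarization."
)

# 解码器生成较短的摘要序列
print(summarizer(long_text, max_length=40, min_length=5)[0]["summary_text"])
```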
133
00:06:16,470 --> 00:06:18,840
有很多序列到序列的模型。
There are a lot of sequence to sequence models.
134
00:06:18,840 --> 00:06:20,310
这里列出了 transformers 库中
This contains a few examples
135
00:06:20,310 --> 00:06:22,500
几个受欢迎的
of popular encoder-decoder models
136
00:06:22,500 --> 00:06:24,400
编码器-解码器模型的示例。
available in the transformers library.
137
00:06:25,829 --> 00:06:29,940
此外,您可以在编码器-解码器模型中
Additionally, you can load an encoder and a decoder
138
00:06:29,940 --> 00:06:32,130
加载编码器和解码器。
inside an encoder-decoder model.
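下面是一个最小草图,演示如何用 transformers 库中的 EncoderDecoderModel 类把两个独立的检查点组合成一个编码器-解码器模型(这里选用 bert-base-uncased 仅作示例):

```python
from transformers import EncoderDecoderModel

# 将两个独立的预训练检查点组合成一个编码器-解码器模型;
# 两部分不共享权重;解码器一侧会加上交叉注意力层
# (交叉注意力层是新初始化的,使用前需要进一步训练)
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased",  # 编码器检查点
    "bert-base-uncased",  # 解码器检查点(仅作示例)
)
```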
139
00:06:32,130 --> 00:06:35,190
因此,根据您要解决的具体任务,
Therefore, according to the specific task you are targeting,
140
00:06:35,190 --> 00:06:38,700
您可能会选择使用在这些具体任务上证明
you may choose to use specific encoders and decoders,
141
00:06:38,700 --> 00:06:42,613
其价值的特定编码器和解码器。
which have proven their worth on these specific tasks.
142
00:06:42,613 --> 00:06:44,696
(嗖嗖)
(swoosh)