subtitles/zh-CN/05_transformer-models-encoders.srt
1
00:00:00,253 --> 00:00:03,003
(引人注目的介绍)
(intro striking)
2
00:00:04,440 --> 00:00:07,830
- 在本视频中,我们将研究编码器架构。
- In this video, we'll study the encoder architecture.
3
00:00:07,830 --> 00:00:11,070
一个流行的仅编码器(encoder-only)架构的例子是 BERT
An example of a popular encoder-only architecture is BERT
4
00:00:11,070 --> 00:00:13,323
它是同类模型中最受欢迎的。
which is the most popular model of its kind.
5
00:00:14,550 --> 00:00:16,950
让我们首先了解它是如何工作的。
Let's first start by understanding how it works.
6
00:00:18,360 --> 00:00:20,910
我们将使用一个三个单词的小例子。
We'll use a small example using three words.
7
00:00:20,910 --> 00:00:23,823
我们使用这些作为输入传递给编码器。
We use these as inputs and pass them through the encoder.
8
00:00:25,290 --> 00:00:28,173
得到了每个单词的数值表示。
We retrieve a numerical representation of each word.
9
00:00:29,970 --> 00:00:32,700
例如,在这里,编码器将这三个词,
Here, for example, the encoder converts those three words,
10
00:00:32,700 --> 00:00:37,350
welcome to NYC,转换为这三个数字序列。
welcome to NYC, into these three sequences of numbers.
11
00:00:37,350 --> 00:00:40,350
编码器对于每个输入单词
The encoder outputs exactly one sequence of numbers
12
00:00:40,350 --> 00:00:41,493
精确输出一个数字序列。
per input word.
13
00:00:42,330 --> 00:00:44,880
这种数值表示也可以称为
This numerical representation can also be called
14
00:00:44,880 --> 00:00:47,163
特征向量(feature vector)或特征张量(feature tensor)。
a feature vector, or a feature tensor.
15
00:00:49,080 --> 00:00:51,030
让我们深入研究这种表示。
Let's dive into this representation.
16
00:00:51,030 --> 00:00:52,740
每个词包含一个向量
It contains one vector per word
17
00:00:52,740 --> 00:00:54,540
这是通过编码器传递的。
that was passed through the encoder.
18
00:00:56,130 --> 00:00:58,620
每个向量都是
Each of these vectors is a numerical representation
19
00:00:58,620 --> 00:01:00,033
该词的数字表示。
of the word in question.
20
00:01:01,080 --> 00:01:03,300
该向量的维度由
The dimension of that vector is defined
21
00:01:03,300 --> 00:01:05,520
模型的架构所决定。
by the architecture of the model.
22
00:01:05,520 --> 00:01:08,703
对于基本 BERT 模型,它是 768。
For the base BERT model, it is 768.
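To make the shape described here concrete, below is a minimal sketch (a toy stand-in, not real BERT: the function name and the zero-valued vectors are illustrative assumptions). It only shows that an encoder emits one fixed-size vector per input word, 768-dimensional for base BERT.

```python
# Illustrative sketch only: a real encoder (e.g. BERT) computes these vectors
# with learned weights and self-attention; this stand-in just shows the shape.
HIDDEN_SIZE = 768  # feature-vector dimension of the base BERT model

def toy_encode(words, hidden_size=HIDDEN_SIZE):
    # one hidden_size-dimensional vector per input word (all zeros here)
    return [[0.0] * hidden_size for _ in words]

features = toy_encode("Welcome to NYC".split())
# three input words -> three vectors of length 768
```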
23
00:01:10,650 --> 00:01:13,230
这些表示包含一个词的值,
These representations contain the value of a word,
24
00:01:13,230 --> 00:01:15,240
但是结合了上下文的。
but contextualized.
25
00:01:15,240 --> 00:01:18,570
例如,与单词 "to" 相关联的向量
For example, the vector attributed to the word "to"
26
00:01:18,570 --> 00:01:22,290
不只是 “to” 这个词的表示。
isn't the representation of only the "to" word.
27
00:01:22,290 --> 00:01:25,650
它还考虑了它周围的词
It also takes into account the words around it
28
00:01:25,650 --> 00:01:27,363
我们称之为上下文。
which we call the context.
29
00:01:28,650 --> 00:01:30,780
也就是说,它会查看左侧的上下文,
As in it looks to the left context,
30
00:01:30,780 --> 00:01:32,970
我们正在学习的左边的单词,
the words on the left of the one we're studying,
31
00:01:32,970 --> 00:01:34,980
这里是 “Welcome” 这个词,
here the word "Welcome",
32
00:01:34,980 --> 00:01:37,497
和右边的上下文,这里是 “NYC” 这个词,
and the context on the right, here the word "NYC",
33
00:01:38,348 --> 00:01:42,000
并在给定上下文的情况下输出单词的值。
and it outputs a value for the word given its context.
34
00:01:42,000 --> 00:01:45,420
因此,它是一个上下文化的值。
It is therefore a contextualized value.
35
00:01:45,420 --> 00:01:48,810
可以说 768 个值的向量
One could say that the vector of 768 values
36
00:01:48,810 --> 00:01:51,993
保留文本中单词的含义。
holds the meaning of the word within the text.
37
00:01:53,310 --> 00:01:56,073
由于自注意力机制,它做到了这一点。
It does this thanks to the self-attention mechanism.
38
00:01:57,240 --> 00:02:00,630
自注意力机制将单个序列中的不同位置
The self-attention mechanism relates different positions,
39
00:02:00,630 --> 00:02:02,850
或不同单词关联起来
or different words in a single sequence
40
00:02:02,850 --> 00:02:06,003
以计算该序列的表示形式。
in order to compute a representation of that sequence.
41
00:02:07,200 --> 00:02:09,000
正如我们之前所见,这意味着
As we've seen before, this means that
42
00:02:09,000 --> 00:02:11,130
一个词的结果表示
the resulting representation of a word
43
00:02:11,130 --> 00:02:13,983
已被序列中的其他词影响。
has been affected by other words in the sequence.
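The mechanism just described can be sketched in a toy form. The tiny 4-dimensional one-hot word vectors below, and the use of the raw vectors directly as queries, keys, and values, are simplifying assumptions; real BERT applies learned projections over 768-dimensional vectors.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def toy_self_attention(vectors):
    # Simplified self-attention: each word's output is a weighted average of
    # every word's vector, so each representation depends on its context.
    # Real encoders apply learned query/key/value projections first.
    d = len(vectors[0])
    outputs = []
    for query in vectors:
        # scaled dot-product score of this word against every word
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in vectors]
        weights = softmax(scores)  # attention weights sum to 1
        outputs.append([sum(w * vec[i] for w, vec in zip(weights, vectors))
                        for i in range(d)])
    return outputs

# toy 4-dimensional vectors for the words "Welcome", "to", "NYC"
word_vectors = [[1.0, 0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 0.0]]
contextualized = toy_self_attention(word_vectors)
```

Each output row mixes in a bit of every other word's vector, which is exactly why the resulting representation is "contextualized".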
44
00:02:15,840 --> 00:02:18,030
我们不会在这里深入细节
We won't dive into the specifics here
45
00:02:18,030 --> 00:02:19,680
我们会提供一些进一步的阅读资料
but we'll offer some further readings
46
00:02:19,680 --> 00:02:21,330
如果您想对底层发生了什么
if you want to get a better understanding
47
00:02:21,330 --> 00:02:22,953
有更好的理解。
of what happens under the hood.
48
00:02:25,050 --> 00:02:27,480
那么为什么要使用编码器呢?
So why should one use an encoder?
49
00:02:27,480 --> 00:02:29,370
编码器可用作独立模型
Encoders can be used as stand-alone models
50
00:02:29,370 --> 00:02:31,263
在各种各样的任务中。
in a wide variety of tasks.
51
00:02:32,100 --> 00:02:33,360
例如,BERT,
For example, BERT,
52
00:02:33,360 --> 00:02:35,670
可以说是最著名的 transformer 模型,
arguably the most famous transformer model,
53
00:02:35,670 --> 00:02:37,590
它是一个独立的编码器模型,
is a standalone encoder model,
54
00:02:37,590 --> 00:02:38,820
并且在发布时,
and at the time of release,
55
00:02:38,820 --> 00:02:40,440
它是许多
it was the state of the art
56
00:02:40,440 --> 00:02:42,780
序列分类任务
in many sequence classification tasks,
57
00:02:42,780 --> 00:02:44,190
问答任务,
question answering tasks,
58
00:02:44,190 --> 00:02:46,743
和掩码语言建模等任务中的最先进技术。
and masked language modeling, to cite only a few.
59
00:02:48,150 --> 00:02:50,460
编码器非常擅长
The idea is that encoders are very powerful
60
00:02:50,460 --> 00:02:52,470
提取包含有意义信息的
at extracting vectors that carry
61
00:02:52,470 --> 00:02:55,350
关于序列的向量。
meaningful information about a sequence.
62
00:02:55,350 --> 00:02:57,870
这个向量可以被传递给后续的神经元来进一步处理
This vector can then be handled down the road
63
00:02:57,870 --> 00:03:00,070
以便理解其中包含的信息。
by additional neurons to make sense of it.
64
00:03:01,380 --> 00:03:02,850
让我们看一些例子
Let's take a look at some examples
65
00:03:02,850 --> 00:03:04,563
编码器真正闪耀的地方。
where encoders really shine.
66
00:03:06,210 --> 00:03:09,900
首先,掩码语言建模或 MLM。
First of all, Masked Language Modeling, or MLM.
67
00:03:09,900 --> 00:03:11,970
这是在一个单词序列中
It's the task of predicting a hidden word
68
00:03:11,970 --> 00:03:13,590
预测隐藏词的任务。
in a sequence of words.
69
00:03:13,590 --> 00:03:15,630
在这里,例如,我们在 “My” 和 “is” 之间
Here, for example, we have hidden the word
70
00:03:15,630 --> 00:03:17,247
隐藏了这个词。
between "My" and "is".
71
00:03:18,270 --> 00:03:21,120
这是训练 BERT 的目标之一。
This is one of the objectives with which BERT was trained.
72
00:03:21,120 --> 00:03:24,393
它被训练来预测序列中的隐藏单词。
It was trained to predict hidden words in a sequence.
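As a toy illustration of this fill-in-the-blank objective (the candidate words and their scores below are invented for the example; a trained BERT computes such scores from both the left and right context):

```python
# Hypothetical candidate scores for the hidden word in "My [MASK] is Sylvain."
# A trained model would compute these from the surrounding context.
candidate_scores = {"name": 0.92, "dog": 0.05, "car": 0.01}

def fill_mask(left, right, scores):
    # pick the highest-scoring candidate for the masked position
    best = max(scores, key=scores.get)
    return f"{left} {best} {right}"

sentence = fill_mask("My", "is Sylvain.", candidate_scores)
```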
73
00:03:25,230 --> 00:03:27,930
编码器在这种情况下尤其大放异彩
Encoders shine in this scenario in particular
74
00:03:27,930 --> 00:03:31,140
因为双向信息在这里至关重要。
as bi-directional information is crucial here.
75
00:03:31,140 --> 00:03:32,947
如果我们没有右边的话,
If we didn't have the words on the right,
76
00:03:32,947 --> 00:03:34,650
“is”、“Sylvain” 和 “.”,
"is", "Sylvain" and the ".",
77
00:03:34,650 --> 00:03:35,940
那么 BERT 能够
then there is very little chance
78
00:03:35,940 --> 00:03:38,580
将 "name" 识别为正确的词
that BERT would have been able to identify "name"
79
00:03:38,580 --> 00:03:40,500
的机会就很小。
as the correct word.
80
00:03:40,500 --> 00:03:42,270
为了预测一个掩码单词
The encoder needs to have a good understanding
81
00:03:42,270 --> 00:03:45,360
编码器需要对序列有很好的理解
of the sequence in order to predict a masked word
82
00:03:45,360 --> 00:03:48,840
即使文本在语法上是正确的,
as even if the text is grammatically correct,
83
00:03:48,840 --> 00:03:50,610
但不一定符合
it does not necessarily make sense
84
00:03:50,610 --> 00:03:52,413
序列的上下文。
in the context of the sequence.
85
00:03:55,230 --> 00:03:56,580
如前面提到的,
As mentioned earlier,
86
00:03:56,580 --> 00:03:59,520
编码器擅长做序列分类。
encoders are good at doing sequence classification.
87
00:03:59,520 --> 00:04:02,883
情感分析是序列分类的一个例子。
Sentiment analysis is an example of sequence classification.
88
00:04:04,410 --> 00:04:09,410
该模型的目的是识别序列的情绪。
The model's aim is to identify the sentiment of a sequence.
89
00:04:09,540 --> 00:04:11,280
它的范围可以从给一个序列
It can range from giving a sequence
90
00:04:11,280 --> 00:04:12,960
打一到五颗星的评分,
a rating from one to five stars,
91
00:04:12,960 --> 00:04:15,900
(如果是做评论分析),到给一个序列打正面
if doing review analysis, to giving a positive
92
00:04:15,900 --> 00:04:17,820
或负面的评分,
or negative rating to a sequence,
93
00:04:17,820 --> 00:04:19,220
这就是这里显示的内容。
which is what is shown here.
94
00:04:20,280 --> 00:04:22,950
例如,在这里,给定两个序列,
For example, here, given the two sequences,
95
00:04:22,950 --> 00:04:25,860
我们使用模型来计算预测,
we use the model to compute a prediction,
96
00:04:25,860 --> 00:04:27,420
并对序列进行分类
and to classify the sequences
97
00:04:27,420 --> 00:04:30,393
在这两个类别中,正面和负面。
among these two classes, positive and negative.
98
00:04:31,230 --> 00:04:33,450
虽然这两个序列非常相似
While the two sequences are very similar
99
00:04:33,450 --> 00:04:35,220
包含相同的词,
containing the same words,
100
00:04:35,220 --> 00:04:37,170
意义却完全不同,
the meaning is entirely different,
101
00:04:37,170 --> 00:04:40,143
并且编码器模型能够掌握这种差异。
and the encoder model is able to grasp that difference.
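To show what the "additional neurons" on top of the encoder might look like for this task, here is a toy classification head. The single scalar feature and the weight/bias values are invented; a real head is a learned linear layer over the full 768-dimensional feature vector.

```python
import math

def classify_sentiment(feature, weight=1.0, bias=0.0):
    # Toy classification head: squash one encoder feature through a sigmoid;
    # a real model learns a linear layer over all 768 encoder features.
    p_positive = 1.0 / (1.0 + math.exp(-(weight * feature + bias)))
    return "positive" if p_positive >= 0.5 else "negative"

# pretend encoder features for two very similar sequences with opposite meaning
results = [classify_sentiment(2.3), classify_sentiment(-1.7)]
```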
102
00:04:41,404 --> 00:04:44,154
(引人注目的结尾)
(outro striking)