subtitles/zh-CN/05_transformer-models-encoders.srt

1
00:00:00,253 --> 00:00:03,003
(引人注目的介绍)
(intro striking)

2
00:00:04,440 --> 00:00:07,830
- 在本视频中,我们将研究编码器架构。
- In this video, we'll study the encoder architecture.

3
00:00:07,830 --> 00:00:11,070
一个流行的仅编码器架构的例子是 BERT,
An example of a popular encoder-only architecture is BERT,

4
00:00:11,070 --> 00:00:13,323
它是同类模型中最受欢迎的。
which is the most popular model of its kind.

5
00:00:14,550 --> 00:00:16,950
让我们首先了解它是如何工作的。
Let's first start by understanding how it works.

6
00:00:18,360 --> 00:00:20,910
我们将使用一个三个单词的小例子。
We'll use a small example using three words.

7
00:00:20,910 --> 00:00:23,823
我们把它们作为输入传递给编码器。
We use these as inputs and pass them through the encoder.

8
00:00:25,290 --> 00:00:28,173
我们得到了每个单词的数值表示。
We retrieve a numerical representation of each word.

9
00:00:29,970 --> 00:00:32,700
例如,在这里,编码器将这三个词,
Here, for example, the encoder converts those three words,

10
00:00:32,700 --> 00:00:37,350
Welcome to NYC,转换为这三个数字序列。
Welcome to NYC, into these three sequences of numbers.

11
00:00:37,350 --> 00:00:40,350
编码器对于每个输入单词
The encoder outputs exactly one sequence of numbers

12
00:00:40,350 --> 00:00:41,493
精确输出一个数字序列。
per input word.

13
00:00:42,330 --> 00:00:44,880
这种数值表示也可以称为
This numerical representation can also be called

14
00:00:44,880 --> 00:00:47,163
特征向量(feature vector)或特征张量(feature tensor)。
a feature vector, or a feature tensor.

15
00:00:49,080 --> 00:00:51,030
让我们深入研究这种表示。
Let's dive into this representation.

16
00:00:51,030 --> 00:00:52,740
它为每个通过编码器的词
It contains one vector per word

17
00:00:52,740 --> 00:00:54,540
都包含一个向量。
that was passed through the encoder.

18
00:00:56,130 --> 00:00:58,620
每个向量都是
Each of these vectors is a numerical representation

19
00:00:58,620 --> 00:01:00,033
该词的数值表示。
of the word in question.

20
00:01:01,080 --> 00:01:03,300
该向量的维度由
The dimension of that vector is defined

21
00:01:03,300 --> 00:01:05,520
模型的架构所决定。
by the architecture of the model.

22
00:01:05,520 --> 00:01:08,703
对于基本 BERT 模型,它是 768。
For the base BERT model, it is 768.

23
00:01:10,650 --> 00:01:13,230
这些表示包含一个词的值,
These representations contain the value of a word,

24
00:01:13,230 --> 00:01:15,240
但是是上下文化的。
but contextualized.

25
00:01:15,240 --> 00:01:18,570
例如,与单词 "to" 相关联的向量
For example, the vector attributed to the word "to"

26
00:01:18,570 --> 00:01:22,290
不只是 "to" 这个词本身的表示。
isn't the representation of only the "to" word.

27
00:01:22,290 --> 00:01:25,650
它还考虑了它周围的词,
It also takes into account the words around it,

28
00:01:25,650 --> 00:01:27,363
我们称之为上下文。
which we call the context.

29
00:01:28,650 --> 00:01:30,780
也就是说,它会查看左侧的上下文,
That is, it looks at the left context,

30
00:01:30,780 --> 00:01:32,970
即我们正在研究的词左边的单词,
the words on the left of the one we're studying,

31
00:01:32,970 --> 00:01:34,980
这里是 "Welcome" 这个词,
here the word "Welcome",

32
00:01:34,980 --> 00:01:37,497
以及右侧的上下文,这里是 "NYC" 这个词,
and the context on the right, here the word "NYC",

33
00:01:38,348 --> 00:01:42,000
并在给定上下文的情况下输出该词的值。
and it outputs a value for the word given its context.

34
00:01:42,000 --> 00:01:45,420
因此,它是一个上下文化的值。
It is therefore a contextualized value.

35
00:01:45,420 --> 00:01:48,810
可以说,这个包含 768 个值的向量
One could say that the vector of 768 values

36
00:01:48,810 --> 00:01:51,993
保存了该词在文本中的含义。
holds the meaning of the word within the text.

37
00:01:53,310 --> 00:01:56,073
它依靠自注意力机制做到了这一点。
It does this thanks to the self-attention mechanism.
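[Editor's note: the following sketch is not part of the video. It reproduces the step just described with the Hugging Face Transformers library, assuming the bert-base-uncased checkpoint for the base BERT model; it shows that the encoder returns one 768-dimensional feature vector per token.]

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Pass the three words through the encoder
inputs = tokenizer("Welcome to NYC", return_tensors="pt")
outputs = model(**inputs)

# One feature vector per token, each of dimension 768 for base BERT.
# The tokenizer also adds the special [CLS] and [SEP] tokens, so the
# sequence length is larger than the three input words.
print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, 5, 768])
```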
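[Editor's note: for intuition, here is a deliberately simplified sketch of the self-attention computation described in the next cues. A real transformer layer uses learned query/key/value projections and multiple heads, which this toy self_attention function omits.]

```python
import numpy as np

def self_attention(X):
    """Toy self-attention: every word attends to every other word,
    using left and right context alike."""
    d = X.shape[-1]
    # A real transformer derives queries, keys and values from X with
    # learned projections; we use X directly to keep the sketch short.
    scores = X @ X.T / np.sqrt(d)                   # pairwise word similarities
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ X                              # blend each word with its context

# Three toy word vectors ("Welcome", "to", "NYC"); dimension 4 for readability
X = np.random.randn(3, 4)
print(self_attention(X).shape)  # (3, 4): one contextualized vector per word
```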
38
00:01:57,240 --> 00:02:00,630
自注意力机制将单个序列中的不同位置,
The self-attention mechanism relates different positions,

39
00:02:00,630 --> 00:02:02,850
或者说不同的单词,关联起来,
or different words, in a single sequence,

40
00:02:02,850 --> 00:02:06,003
以计算该序列的表示。
in order to compute a representation of that sequence.

41
00:02:07,200 --> 00:02:09,000
正如我们之前所见,这意味着
As we've seen before, this means that

42
00:02:09,000 --> 00:02:11,130
一个词最终得到的表示
the resulting representation of a word

43
00:02:11,130 --> 00:02:13,983
已经受到了序列中其他词的影响。
has been affected by other words in the sequence.

44
00:02:15,840 --> 00:02:18,030
我们不会在这里深入探讨细节,
We won't dive into the specifics here,

45
00:02:18,030 --> 00:02:19,680
但会提供一些进一步的阅读资料,
but will offer some further readings

46
00:02:19,680 --> 00:02:21,330
如果您想对底层发生的事情
if you want to get a better understanding

47
00:02:21,330 --> 00:02:22,953
有更好的理解。
of what happens under the hood.

48
00:02:25,050 --> 00:02:27,480
那么为什么要使用编码器呢?
So why should one use an encoder?

49
00:02:27,480 --> 00:02:29,370
编码器可以作为独立模型
Encoders can be used as stand-alone models

50
00:02:29,370 --> 00:02:31,263
用于各种各样的任务。
in a wide variety of tasks.

51
00:02:32,100 --> 00:02:33,360
例如,BERT,
For example, BERT,

52
00:02:33,360 --> 00:02:35,670
可以说是最著名的 transformer 模型,
arguably the most famous transformer model,

53
00:02:35,670 --> 00:02:37,590
它是一个独立的编码器模型,
is a standalone encoder model,

54
00:02:37,590 --> 00:02:38,820
并且在发布时,
and at the time of release,

55
00:02:38,820 --> 00:02:40,440
它是许多
it was the state of the art

56
00:02:40,440 --> 00:02:42,780
序列分类任务、
in many sequence classification tasks,

57
00:02:42,780 --> 00:02:44,190
问答任务
question answering tasks,

58
00:02:44,190 --> 00:02:46,743
和掩码语言建模等任务中的最先进技术。
and masked language modeling, to cite only a few.

59
00:02:48,150 --> 00:02:50,460
关键在于,编码器非常擅长
The idea is that encoders are very powerful

60
00:02:50,460 --> 00:02:52,470
提取包含有意义信息的、
at extracting vectors that carry

61
00:02:52,470 --> 00:02:55,350
关于序列的向量。
meaningful information about a sequence.

62
00:02:55,350 --> 00:02:57,870
这个向量随后可以交给后续的神经元,
This vector can then be handled further down the road

63
00:02:57,870 --> 00:03:00,070
以便理解其中包含的信息。
by additional neurons to make sense of it.

64
00:03:01,380 --> 00:03:02,850
让我们看一些
Let's take a look at some examples

65
00:03:02,850 --> 00:03:04,563
编码器真正大放异彩的例子。
where encoders really shine.

66
00:03:06,210 --> 00:03:09,900
首先,掩码语言建模,即 MLM。
First of all, Masked Language Modeling, or MLM.

67
00:03:09,900 --> 00:03:11,970
这是在一个单词序列中
It's the task of predicting a hidden word

68
00:03:11,970 --> 00:03:13,590
预测隐藏词的任务。
in a sequence of words.

69
00:03:13,590 --> 00:03:15,630
例如,在这里,我们隐藏了
Here, for example, we have hidden the word

70
00:03:15,630 --> 00:03:17,247
"My" 和 "is" 之间的词。
between "My" and "is".

71
00:03:18,270 --> 00:03:21,120
这是训练 BERT 的目标之一。
This is one of the objectives with which BERT was trained.

72
00:03:21,120 --> 00:03:24,393
它被训练来预测序列中的隐藏单词。
It was trained to predict hidden words in a sequence.

73
00:03:25,230 --> 00:03:27,930
编码器在这种场景下尤其大放异彩,
Encoders shine in this scenario in particular,

74
00:03:27,930 --> 00:03:31,140
因为双向信息在这里至关重要。
as bidirectional information is crucial here.
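[Editor's note: the masked-word example above can be tried with the fill-mask pipeline from the Transformers library; the bert-base-uncased checkpoint is an assumed choice, and the exact predictions and scores may differ from what the video shows.]

```python
from transformers import pipeline

# BERT was pre-trained to predict the hidden word using both the left
# context ("My") and the right context ("is Sylvain.") at once.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("My [MASK] is Sylvain."):
    print(prediction["token_str"], round(prediction["score"], 3))
# "name" should rank among the top candidates thanks to the
# bidirectional context.
```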
75
00:03:31,140 --> 00:03:32,947
如果我们没有右边的词,
If we didn't have the words on the right,

76
00:03:32,947 --> 00:03:34,650
即 "is"、"Sylvain" 和 "."、
"is", "Sylvain" and the ".",

77
00:03:34,650 --> 00:03:35,940
那么 BERT 能把 "name"
then there is very little chance

78
00:03:35,940 --> 00:03:38,580
识别为正确单词的
that BERT would have been able to identify "name"

79
00:03:38,580 --> 00:03:40,500
机会就非常小了。
as the correct word.

80
00:03:40,500 --> 00:03:42,270
编码器需要对序列
The encoder needs to have a good understanding

81
00:03:42,270 --> 00:03:45,360
有很好的理解才能预测掩码单词,
of the sequence in order to predict a masked word,

82
00:03:45,360 --> 00:03:48,840
因为即使文本在语法上是正确的,
as even if the text is grammatically correct,

83
00:03:48,840 --> 00:03:50,610
它也不一定在序列的
it does not necessarily make sense

84
00:03:50,610 --> 00:03:52,413
上下文中有意义。
in the context of the sequence.

85
00:03:55,230 --> 00:03:56,580
如前面所提到的,
As mentioned earlier,

86
00:03:56,580 --> 00:03:59,520
编码器擅长做序列分类。
encoders are good at doing sequence classification.

87
00:03:59,520 --> 00:04:02,883
情感分析就是序列分类的一个例子。
Sentiment analysis is an example of sequence classification.

88
00:04:04,410 --> 00:04:09,410
该模型的目的是识别一个序列的情感。
The model's aim is to identify the sentiment of a sequence.

89
00:04:09,540 --> 00:04:11,280
这可以是给一个序列
It can range from giving a sequence

90
00:04:11,280 --> 00:04:12,960
打一到五星的评级
a rating from one to five stars,

91
00:04:12,960 --> 00:04:15,900
(如果在做评论分析),到给一个序列
if doing review analysis, to giving a positive

92
00:04:15,900 --> 00:04:17,820
打出积极或消极的评级,
or negative rating to a sequence,

93
00:04:17,820 --> 00:04:19,220
也就是这里展示的内容。
which is what is shown here.

94
00:04:20,280 --> 00:04:22,950
例如,在这里,给定这两个序列,
For example, here, given the two sequences,

95
00:04:22,950 --> 00:04:25,860
我们使用模型来计算预测,
we use the model to compute a prediction,

96
00:04:25,860 --> 00:04:27,420
并将这两个序列
and to classify the sequences

97
00:04:27,420 --> 00:04:30,393
分类到正面和负面这两个类别中。
among these two classes, positive and negative.

98
00:04:31,230 --> 00:04:33,450
虽然这两个序列非常相似,
While the two sequences are very similar,

99
00:04:33,450 --> 00:04:35,220
包含相同的词,
containing the same words,

100
00:04:35,220 --> 00:04:37,170
意义却完全不同,
the meaning is entirely different,

101
00:04:37,170 --> 00:04:40,143
并且编码器模型能够掌握这种差异。
and the encoder model is able to grasp that difference.

102
00:04:41,404 --> 00:04:44,154
(引人注目的结尾)
(outro striking)
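[Editor's note: the sentiment-classification example above can be tried with the sentiment-analysis pipeline from the Transformers library. The two example sentences below are assumptions for illustration, not the ones shown in the video; the pipeline's default checkpoint is an encoder-based model (DistilBERT fine-tuned on SST-2).]

```python
from transformers import pipeline

# The default checkpoint for this pipeline is an encoder-only model
classifier = pipeline("sentiment-analysis")

# Two very similar sequences, containing the same words,
# whose meanings are opposite
print(classifier(["I love this movie.", "I don't love this movie."]))
# Expected: one POSITIVE and one NEGATIVE label
```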