1
00:00:00,000 --> 00:00:03,083
(graphics whooshing)

2
00:00:05,370 --> 00:00:07,413
- How to pre-process pairs of sentences.

3
00:00:09,150 --> 00:00:11,340
We have seen how to tokenize single sentences

4
00:00:11,340 --> 00:00:12,877
and batch them together in the

5
00:00:12,877 --> 00:00:15,810
"Batching inputs together" video.

6
00:00:15,810 --> 00:00:18,330
If this code looks unfamiliar to you,

7
00:00:18,330 --> 00:00:20,030
be sure to check that video again.

8
00:00:21,330 --> 00:00:24,543
Here we will focus on tasks that classify pairs of sentences.

9
00:00:25,620 --> 00:00:28,470
For instance, we may want to classify whether two texts

10
00:00:28,470 --> 00:00:30,360
are paraphrases or not.

11
00:00:30,360 --> 00:00:32,880
Here is an example taken from the Quora Question Pairs dataset,

12
00:00:32,880 --> 00:00:37,530
which focuses on identifying duplicate questions.

13
00:00:37,530 --> 00:00:40,650
In the first pair, the two questions are duplicates;

14
00:00:40,650 --> 00:00:42,000
in the second, they are not.

15
00:00:43,283 --> 00:00:45,540
Another pair classification problem is

16
00:00:45,540 --> 00:00:47,400
when we want to know if two sentences are

17
00:00:47,400 --> 00:00:49,590
logically related or not,

18
00:00:49,590 --> 00:00:53,970
a problem called natural language inference, or NLI.

19
00:00:53,970 --> 00:00:57,000
In this example, taken from the MultiNLI dataset,

20
00:00:57,000 --> 00:00:59,880
we have a pair of sentences for each possible label:

21
00:00:59,880 --> 00:01:02,490
contradiction, neutral, or entailment,

22
00:01:02,490 --> 00:01:04,680
which is a fancy way of saying the first sentence

23
00:01:04,680 --> 00:01:05,793
implies the second.

24
00:01:06,930 --> 00:01:08,820
So classifying pairs of sentences is a problem

25
00:01:08,820 --> 00:01:10,260
worth studying.

26
00:01:10,260 --> 00:01:12,630
In fact, in the GLUE benchmark,

27
00:01:12,630 --> 00:01:15,750
which is an academic benchmark for text classification,

28
00:01:15,750 --> 00:01:17,910
8 of the 10 datasets are focused

29
00:01:17,910 --> 00:01:19,953
on tasks using pairs of sentences.

30
00:01:20,910 --> 00:01:22,560
That's why models like BERT

31
00:01:22,560 --> 00:01:25,320
are often pre-trained with a dual objective.

32
00:01:25,320 --> 00:01:27,660
On top of the language modeling objective,

33
00:01:27,660 --> 00:01:31,230
they often have an objective related to sentence pairs.

34
00:01:31,230 --> 00:01:34,320
For instance, during pretraining, BERT is shown

35
00:01:34,320 --> 00:01:36,810
pairs of sentences and must predict both

36
00:01:36,810 --> 00:01:39,930
the value of randomly masked tokens and whether the second

37
00:01:39,930 --> 00:01:41,830
sentence follows from the first or not.

38
00:01:43,084 --> 00:01:45,930
Fortunately, the tokenizer from the Transformers library

39
00:01:45,930 --> 00:01:49,170
has a nice API to deal with pairs of sentences.
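[The on-screen code is not captured in the subtitles. A minimal sketch of the tokenizer call for one pair of sentences, assuming the bert-base-uncased checkpoint; the checkpoint name and example sentences are stand-ins, not necessarily those shown in the video.]

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The two sentences of the pair are passed as two separate arguments.
inputs = tokenizer("My name is Sylvain.", "I work at Hugging Face.")

# Besides input_ids and attention_mask, the output contains
# token_type_ids: 0 for the first sentence, 1 for the second.
print(inputs["token_type_ids"])

# Decoding the IDs back to tokens shows the special tokens the
# tokenizer added: [CLS] sentence1 [SEP] sentence2 [SEP]
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"]))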
40
00:01:49,170 --> 00:01:51,270
You just have to pass them as two arguments

41
00:01:51,270 --> 00:01:52,120
to the tokenizer.

42
00:01:53,430 --> 00:01:55,470
On top of the input IDs and the attention mask

43
00:01:55,470 --> 00:01:56,970
we studied already,

44
00:01:56,970 --> 00:01:59,910
it returns a new field called token type IDs,

45
00:01:59,910 --> 00:02:01,790
which tells the model which tokens belong

46
00:02:01,790 --> 00:02:03,630
to the first sentence

47
00:02:03,630 --> 00:02:05,943
and which ones belong to the second sentence.

48
00:02:07,290 --> 00:02:09,840
Zooming in a little bit, here are the input IDs

49
00:02:09,840 --> 00:02:12,180
aligned with the tokens they correspond to,

50
00:02:12,180 --> 00:02:15,213
their respective token type IDs, and the attention mask.

51
00:02:16,080 --> 00:02:19,260
We can see the tokenizer also added special tokens,

52
00:02:19,260 --> 00:02:22,620
so we have a CLS token, the tokens from the first sentence,

53
00:02:22,620 --> 00:02:25,770
a SEP token, the tokens from the second sentence,

54
00:02:25,770 --> 00:02:27,003
and a final SEP token.

55
00:02:28,500 --> 00:02:30,570
If we have several pairs of sentences,

56
00:02:30,570 --> 00:02:32,840
we can tokenize them together by passing the list

57
00:02:32,840 --> 00:02:36,630
of first sentences, then the list of second sentences,

58
00:02:36,630 --> 00:02:39,300
and all the keyword arguments we studied already,

59
00:02:39,300 --> 00:02:40,353
like padding=True.

60
00:02:41,940 --> 00:02:43,140
Zooming in on the result,

61
00:02:43,140 --> 00:02:45,030
we can see how the tokenizer added padding

62
00:02:45,030 --> 00:02:48,090
to the second pair of sentences to make the two outputs

63
00:02:48,090 --> 00:02:51,360
the same length, and properly dealt with the token type IDs

64
00:02:51,360 --> 00:02:53,643
and attention masks for the two sentences.

65
00:02:54,900 --> 00:02:57,573
This is then all ready to pass through our model.
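[A minimal sketch of the batched version described above, again assuming the bert-base-uncased checkpoint and stand-in sentences; the lists of first and second sentences and the padding=True argument follow what the narration describes.]

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Pass the list of first sentences, then the list of second
# sentences, plus the usual keyword arguments like padding=True.
batch = tokenizer(
    ["My name is Sylvain.", "Going to the cinema."],
    ["I work at Hugging Face.", "This movie is great."],
    padding=True,
)

# The shorter pair is padded so both outputs have the same length;
# token_type_ids and attention_mask are handled per pair.
for ids in batch["input_ids"]:
    print(tokenizer.decode(ids))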