subtitles/zh-CN/22_preprocessing-sentence-pairs-(tensorflow).srt

1
00:00:00,225 --> 00:00:02,892
(air whooshing)

2
00:00:05,578 --> 00:00:09,180
- How to preprocess pairs of sentences?

3
00:00:09,180 --> 00:00:11,490
We have seen how to tokenize single sentences

4
00:00:11,490 --> 00:00:13,020
and batch them together

5
00:00:13,020 --> 00:00:15,660
in the "Batching inputs together" video.

6
00:00:15,660 --> 00:00:18,060
If this code looks unfamiliar to you,

7
00:00:18,060 --> 00:00:19,760
be sure to check that video again!

8
00:00:21,101 --> 00:00:22,110
Here, we will focus on tasks

9
00:00:22,110 --> 00:00:24,033
that classify pairs of sentences.

10
00:00:24,900 --> 00:00:27,030
For instance, we may want to classify

11
00:00:27,030 --> 00:00:29,820
whether two texts are paraphrases or not.

12
00:00:29,820 --> 00:00:30,900
Here is an example taken

13
00:00:30,900 --> 00:00:33,180
from the Quora Question Pairs dataset,

14
00:00:33,180 --> 00:00:36,033
which focuses on identifying duplicate questions.

15
00:00:37,110 --> 00:00:40,650
In the first pair, the two questions are duplicates;

16
00:00:40,650 --> 00:00:43,620
in the second, they are not.

17
00:00:43,620 --> 00:00:44,730
Another classification problem

18
00:00:44,730 --> 00:00:46,980
is when we want to know if two sentences

19
00:00:46,980 --> 00:00:49,290
are logically related or not,

20
00:00:49,290 --> 00:00:52,173
a problem called Natural Language Inference or NLI.

21
00:00:53,100 --> 00:00:55,830
In this example taken from the MultiNLI dataset,

22
00:00:55,830 --> 00:00:59,460
we have a pair of sentences for each possible label:

23
00:00:59,460 --> 00:01:02,400
contradiction, neutral or entailment,

24
00:01:02,400 --> 00:01:04,680
which is a fancy way of saying the first sentence

25
00:01:04,680 --> 00:01:05,853
implies the second.

26
00:01:07,140 --> 00:01:09,000
So classifying pairs of sentences

27
00:01:09,000 --> 00:01:10,533
is a problem worth studying.

28
00:01:11,370 --> 00:01:13,770
In fact, in the GLUE benchmark,

29
00:01:13,770 --> 00:01:16,830
which is an academic benchmark for text classification,

30
00:01:16,830 --> 00:01:19,680
eight of the 10 datasets are focused on tasks

31
00:01:19,680 --> 00:01:20,973
using pairs of sentences.

32
00:01:22,110 --> 00:01:24,720
That's why models like BERT are often pretrained

33
00:01:24,720 --> 00:01:26,520
with a dual objective:

34
00:01:26,520 --> 00:01:28,890
on top of the language modeling objective,

35
00:01:28,890 --> 00:01:32,010
they often have an objective related to sentence pairs.

36
00:01:32,010 --> 00:01:34,560
For instance, during pretraining,

37
00:01:34,560 --> 00:01:36,690
BERT is shown pairs of sentences

38
00:01:36,690 --> 00:01:39,900
and must predict both the value of randomly masked tokens

39
00:01:39,900 --> 00:01:41,250
and whether the second sentence

40
00:01:41,250 --> 00:01:42,903
follows from the first or not.

41
00:01:44,070 --> 00:01:47,100
Fortunately, the tokenizer from the Transformers library

42
00:01:47,100 --> 00:01:50,550
has a nice API to deal with pairs of sentences:

43
00:01:50,550 --> 00:01:52,650
you just have to pass them as two arguments

44
00:01:52,650 --> 00:01:53,613
to the tokenizer.

45
00:01:54,900 --> 00:01:56,040
On top of the input IDs

46
00:01:56,040 --> 00:01:58,440
and the attention mask we studied already,

47
00:01:58,440 --> 00:02:01,530
it returns a new field called token type IDs,

48
00:02:01,530 --> 00:02:03,210
which tells the model which tokens

49
00:02:03,210 --> 00:02:05,100
belong to the first sentence

50
00:02:05,100 --> 00:02:07,350
and which ones belong to the second sentence.

51
00:02:08,670 --> 00:02:11,430
Zooming in a little bit, here are the input IDs,

52
00:02:11,430 --> 00:02:13,710
aligned with the tokens they correspond to,

53
00:02:13,710 --> 00:02:17,193
their respective token type ID and attention mask.

54
00:02:18,540 --> 00:02:21,300
We can see the tokenizer also added special tokens

55
00:02:21,300 --> 00:02:25,230
so we have a CLS token, the tokens from the first sentence,

56
00:02:25,230 --> 00:02:28,590
a SEP token, the tokens from the second sentence,

57
00:02:28,590 --> 00:02:30,153
and a final SEP token.

58
00:02:31,680 --> 00:02:33,720
If we have several pairs of sentences,

59
00:02:33,720 --> 00:02:35,640
we can tokenize them together

60
00:02:35,640 --> 00:02:38,280
by passing the list of first sentences,

61
00:02:38,280 --> 00:02:40,710
then the list of second sentences

62
00:02:40,710 --> 00:02:43,050
and all the keyword arguments we studied already,

63
00:02:43,050 --> 00:02:44,133
like padding=True.

64
00:02:45,510 --> 00:02:46,770
Zooming in at the result,

65
00:02:46,770 --> 00:02:49,050
we can see how the tokenizer added padding

66
00:02:49,050 --> 00:02:50,940
to the second pair of sentences,

67
00:02:50,940 --> 00:02:53,490
to make the two outputs the same length.

68
00:02:53,490 --> 00:02:55,620
It also properly dealt with token type IDs

69
00:02:55,620 --> 00:02:57,720
and attention masks for the two sentences.

70
00:02:59,010 --> 00:03:01,460
This is then all ready to pass through our model!

71
00:03:03,799 --> 00:03:06,466
(air whooshing)
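For reference, below is a minimal sketch of the tokenizer calls described in cues 41-70: tokenizing one pair, tokenizing a batch of pairs with padding=True, and passing the result to a TensorFlow model. The checkpoint and the example sentences are placeholders (not taken from the video), and the code assumes the Transformers library with TensorFlow installed.

from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

# Assumed checkpoint; the checkpoint used in the video may differ.
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# A single pair of sentences: just pass them as two arguments.
# The output contains input_ids, token_type_ids (0 for tokens of the
# first sentence, 1 for tokens of the second) and attention_mask.
inputs = tokenizer("My name is Sylvain.", "I work at Hugging Face.")
print(inputs)

# Several pairs at once: a list of first sentences, then a list of
# second sentences, plus the usual keyword arguments like padding=True.
batch = tokenizer(
    ["My name is Sylvain.", "Going to the cinema."],
    ["I work at Hugging Face.", "The movie was great."],
    padding=True,
    return_tensors="tf",
)

# The padded batch, with its token type IDs and attention masks,
# is then ready to go through a TensorFlow model.
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**batch)
print(outputs.logits.shape)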