subtitles/zh-CN/47_inside-the-question-answering-pipeline-(pytorch).srt

1
00:00:04,230 --> 00:00:07,699
- Let's have a look inside the question answering pipeline.

2
00:00:07,699 --> 00:00:10,680
The question answering pipeline can extract answers

3
00:00:10,680 --> 00:00:14,190
to questions from a given context or passage of text,

4
00:00:14,190 --> 00:00:16,540
like this part of the transformers repo README.

5
00:00:18,060 --> 00:00:20,310
It also works for very long contexts,

6
00:00:20,310 --> 00:00:23,850
even if the answer is at the very end, like in this example.

7
00:00:23,850 --> 00:00:25,400
In this video, we will see why.

8
00:00:26,820 --> 00:00:29,460
The question answering pipeline follows the same steps

9
00:00:29,460 --> 00:00:31,050
as the other pipelines:

10
00:00:31,050 --> 00:00:34,200
the question and context are tokenized as a sentence pair,

11
00:00:34,200 --> 00:00:37,955
fed to the model, then some post-processing is applied.

12
00:00:37,955 --> 00:00:41,730
The tokenization and model steps should be familiar.

13
00:00:41,730 --> 00:00:44,610
We use the auto class suitable for question answering

14
00:00:44,610 --> 00:00:47,070
instead of sequence classification,

15
00:00:47,070 --> 00:00:49,392
but one key difference with text classification

16
00:00:49,392 --> 00:00:52,980
is that our model outputs two tensors, named start logits

17
00:00:52,980 --> 00:00:54,570
and end logits.

18
00:00:54,570 --> 00:00:55,830
Why is that?

19
00:00:55,830 --> 00:00:57,930
Well, this is the way the model finds the answer

20
00:00:57,930 --> 00:00:58,803
to the question.

21
00:00:59,790 --> 00:01:02,130
First, let's have a look at the model inputs.

22
00:01:02,130 --> 00:01:04,350
It's the numbers associated with the tokenization

23
00:01:04,350 --> 00:01:06,843
of the question followed by the context,

24
00:01:06,843 --> 00:01:09,723
with the usual CLS and SEP special tokens.

25
00:01:10,620 --> 00:01:13,320
The answer is a part of those tokens.

26
00:01:13,320 --> 00:01:15,510
So we ask the model to predict which token starts

27
00:01:15,510 --> 00:01:17,373
the answer and which token ends the answer.

28
00:01:18,548 --> 00:01:19,650
For our two logit outputs,

29
00:01:19,650 --> 00:01:22,803
the theoretical labels are the pink and purple vectors.

30
00:01:24,300 --> 00:01:26,430
To convert those logits into probabilities,

31
00:01:26,430 --> 00:01:28,436
we will need to apply a SoftMax,

32
00:01:28,436 --> 00:01:30,360
like in the text classification pipeline.

33
00:01:30,360 --> 00:01:33,390
We just mask the tokens that are not part of the context

34
00:01:33,390 --> 00:01:36,855
before doing that, leaving the initial CLS token unmasked,

35
00:01:36,855 --> 00:01:39,303
as we use it to predict an impossible answer.

36
00:01:40,267 --> 00:01:43,500
This is what it looks like in terms of code.

37
00:01:43,500 --> 00:01:45,870
We use a large negative number for the masking,

38
00:01:45,870 --> 00:01:48,957
since its exponential will then be zero.
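The code shown on screen at this point is not part of the subtitles. Below is a minimal sketch of the tokenization, model, and masking steps described above, assuming the distilbert-base-cased-distilled-squad checkpoint and a made-up question/context pair of our own:

```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Assumed checkpoint for illustration (a model fine-tuned on SQuAD).
model_checkpoint = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

# Hypothetical example inputs.
question = "Which deep learning libraries back Transformers?"
context = "Transformers is backed by the three most popular deep learning libraries."

# The question and context are tokenized together as a sentence pair:
# [CLS] question tokens [SEP] context tokens [SEP]
inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
start_logits = outputs.start_logits  # shape (1, sequence_length)
end_logits = outputs.end_logits      # shape (1, sequence_length)

# Mask every token that is not part of the context (sequence id 1),
# but leave the initial CLS token unmasked, since it is used to
# predict an impossible answer.
sequence_ids = inputs.sequence_ids()
mask = [i != 1 for i in sequence_ids]
mask[0] = False
mask = torch.tensor(mask)[None]

# A large negative logit has an exponential of (near) zero in the SoftMax.
start_logits[mask] = -10000
end_logits[mask] = -10000

start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)[0]
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)[0]
```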
39
00:01:48,957 --> 00:01:50,580
Now, for each start

40
00:01:50,580 --> 00:01:53,550
and end position corresponding to a possible answer,

41
00:01:53,550 --> 00:01:55,050
we give a score that is the

42
00:01:55,050 --> 00:01:57,630
product of the start probability and end probability

43
00:01:57,630 --> 00:01:58,803
at those positions.

44
00:02:00,120 --> 00:02:02,670
Of course, a start index greater than an end index

45
00:02:02,670 --> 00:02:04,503
corresponds to an impossible answer.

46
00:02:05,430 --> 00:02:07,080
Here is the code to find the best score

47
00:02:07,080 --> 00:02:08,820
for a possible answer.

48
00:02:08,820 --> 00:02:11,430
Once we have the start and end positions of the tokens,

49
00:02:11,430 --> 00:02:14,130
we use the offset mappings provided by our tokenizer

50
00:02:14,130 --> 00:02:16,950
to find the span of characters in the initial context,

51
00:02:16,950 --> 00:02:17,900
and get our answer.

52
00:02:19,470 --> 00:02:21,900
Now, when the context is long, it might get truncated

53
00:02:21,900 --> 00:02:22,750
by the tokenizer.

54
00:02:23,760 --> 00:02:26,220
This might result in part of the answer, or worse,

55
00:02:26,220 --> 00:02:28,113
the whole answer, being truncated.

56
00:02:29,100 --> 00:02:31,050
So we don't discard the truncated tokens

57
00:02:31,050 --> 00:02:33,330
but build new features with them.

58
00:02:33,330 --> 00:02:35,994
Each of those features contains the question,

59
00:02:35,994 --> 00:02:39,240
then a chunk of text from the context.

60
00:02:39,240 --> 00:02:41,430
If we take disjoint chunks of text,

61
00:02:41,430 --> 00:02:43,530
we might end up with the answer being split

62
00:02:43,530 --> 00:02:45,330
between two features.

63
00:02:45,330 --> 00:02:48,060
So instead, we take overlapping chunks of text,

64
00:02:48,060 --> 00:02:50,640
to make sure at least one of the chunks will fully contain

65
00:02:50,640 --> 00:02:51,990
the answer to the question.

66
00:02:52,830 --> 00:02:55,260
The tokenizer does all of this for us automatically

67
00:02:55,260 --> 00:02:58,170
with the return_overflowing_tokens option.

68
00:02:58,170 --> 00:02:59,700
The stride argument controls

69
00:02:59,700 --> 00:03:02,070
the number of overlapping tokens.

70
00:03:02,070 --> 00:03:04,020
Here is how our very long context gets truncated

71
00:03:04,020 --> 00:03:05,850
into two features with some overlap.

72
00:03:05,850 --> 00:03:07,950
By applying the same post-processing we saw before

73
00:03:07,950 --> 00:03:10,636
to each feature, we get an answer with a score

74
00:03:10,636 --> 00:03:12,453
for each of them,

75
00:03:12,453 --> 00:03:14,910
and we take the answer with the best score

76
00:03:14,910 --> 00:03:16,203
as the final solution.
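Continuing the sketch from the previous block, here is a rough rendering of the scoring, answer-extraction, and long-context steps described above; the max_length and stride values are illustrative, not the pipeline's exact defaults:

```python
# Score each (start, end) pair as the product of the two probabilities.
scores = start_probabilities[:, None] * end_probabilities[None, :]

# A start index greater than the end index is an impossible answer:
# keep only the upper triangle of the score matrix.
scores = torch.triu(scores)

# Find the best-scoring (start, end) pair of token positions.
max_index = scores.argmax().item()
start_index = max_index // scores.shape[1]
end_index = max_index % scores.shape[1]

# Use the offset mappings to map those token positions back to a
# character span in the initial context.
inputs_with_offsets = tokenizer(question, context, return_offsets_mapping=True)
offsets = inputs_with_offsets["offset_mapping"]
start_char, _ = offsets[start_index]
_, end_char = offsets[end_index]
answer = context[start_char:end_char]

# For a long context, chunk instead of truncating; stride controls how
# many tokens overlap between consecutive features, so that at least
# one feature fully contains the answer.
long_context = context * 100  # stand-in for a genuinely long passage
inputs = tokenizer(
    question,
    long_context,
    truncation="only_second",  # only the context may ever be truncated
    max_length=384,            # illustrative chunk size
    stride=128,                # illustrative number of overlapping tokens
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    padding=True,
    return_tensors="pt",
)
```

Applying the same masking and scoring to each resulting feature, then keeping the answer with the best overall score, yields the pipeline's final result.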