(light transition music)

- Let's have a look inside the question answering pipeline.

The question answering pipeline can extract answers to questions from a given context or passage of text, like this part of the Transformers repo README. It also works for very long contexts, even if the answer is at the very end, like in this example. In this video, we'll see why.

The question answering pipeline follows the same steps as the other pipelines: the question and the context are tokenized as a sentence pair, fed to the model, then some post-processing is applied. The tokenization and model steps should be familiar. We use the auto class suitable for question answering instead of sequence classification, but one key difference from text classification is that our model outputs two tensors, named start logits and end logits. Why is that? Well, this is the way the model finds the answer to the question.

First, let's have a look at the model inputs: the numbers associated with the tokenization of the question, followed by the context, with the usual CLS and SEP special tokens. The answer is a part of those tokens, so we ask the model to predict which token starts the answer and which token ends it. For our two logit outputs, the theoretical labels are the pink and purple vectors.

To convert those logits into probabilities, we need to apply a softmax, as in the text classification pipeline. Before doing that, we mask the tokens that are not part of the context, leaving the initial CLS token unmasked, as we use it to predict an impossible answer.
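Here is a minimal sketch of those steps, up to the masking and softmax. The checkpoint name, question, and context are placeholder assumptions for illustration (the video's slides may use different ones); any extractive QA checkpoint works the same way.

    import torch
    from transformers import AutoTokenizer, AutoModelForQuestionAnswering

    # Assumed checkpoint, for illustration only.
    checkpoint = "distilbert-base-cased-distilled-squad"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)

    question = "Which deep learning libraries back Transformers?"
    context = """Transformers is backed by the three most popular deep learning
    libraries (Jax, PyTorch and TensorFlow) with seamless integration between them."""

    # The question and context are tokenized together as a sentence pair.
    inputs = tokenizer(question, context, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    start_logits = outputs.start_logits  # shape (1, sequence_length)
    end_logits = outputs.end_logits

    # Mask every token that is not part of the context (sequence id 1),
    # but leave the CLS token at index 0 unmasked: it predicts "no answer".
    sequence_ids = inputs.sequence_ids()
    mask = [i != 1 for i in sequence_ids]
    mask[0] = False
    mask = torch.tensor(mask)[None]

    # A large negative number, so the softmax maps these positions to ~0.
    start_logits[mask] = -10000
    end_logits[mask] = -10000

    start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)[0]
    end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)[0]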
That is what the masking looks like in terms of code: we use a large negative number for the masked logits, since its exponential will then be zero.

Now, the probability that a pair of start and end positions corresponds to a possible answer is the product of the start probability and the end probability at those positions. Of course, a start index greater than an end index corresponds to an impossible answer. From there, the code finds the best score for a possible answer (see the sketches below). Once we have the start and end positions of the tokens, we use the offset mappings provided by our tokenizer to find the span of characters in the initial context, and we get our answer.

Now, when the context is long, it might get truncated by the tokenizer. This might result in part of the answer, or worse, the whole answer, being truncated. So we don't discard the truncated tokens, but build new features with them. Each of those features contains the question, followed by a chunk of text from the context. If we took disjoint chunks of text, we might end up with the answer being split between two features. So instead, we take overlapping chunks of text, to make sure at least one of the chunks fully contains the answer to the question. The tokenizer does all of this for us automatically with the return_overflowing_tokens option; the stride argument controls the number of overlapping tokens.

Here is how our very long context gets split into two features with some overlap. By applying the same post-processing we saw before to each feature, we get an answer with a score for each of them, and we take the answer with the best score as the final solution.
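A minimal sketch of that scoring and post-processing, continuing from the start_probabilities and end_probabilities computed in the previous sketch (variable names are carried over from there):

    # Score every (start, end) candidate as the product of the two probabilities;
    # torch.triu zeroes out candidates whose start index is after their end index.
    scores = start_probabilities[:, None] * end_probabilities[None, :]
    scores = torch.triu(scores)

    # Pick the highest-scoring candidate.
    max_index = scores.argmax().item()
    start_index = max_index // scores.shape[1]
    end_index = max_index % scores.shape[1]

    # Use the offset mapping to go from token indices back to characters
    # in the initial context.
    inputs_with_offsets = tokenizer(question, context, return_offsets_mapping=True)
    offsets = inputs_with_offsets["offset_mapping"]
    start_char, _ = offsets[start_index]
    _, end_char = offsets[end_index]

    answer = context[start_char:end_char]
    print(answer, scores[start_index, end_index].item())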
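And here is how the long-context case can be handled. The max_length and stride values are illustrative assumptions, and long_context stands for any context string longer than max_length tokens:

    # With return_overflowing_tokens, the tokenizer splits a long context into
    # several overlapping features instead of silently truncating it.
    inputs = tokenizer(
        question,
        long_context,
        max_length=384,
        stride=128,  # number of overlapping tokens between consecutive chunks
        truncation="only_second",  # only truncate the context, never the question
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
    )

    # Each feature is the question followed by one chunk of the context.
    for ids in inputs["input_ids"]:
        print(tokenizer.decode(ids))

    # Apply the same mask/softmax/score post-processing to each feature,
    # then keep the answer with the best score across all features.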
(light transition music)