subtitles/zh-CN/47_inside-the-question-answering-pipeline-(pytorch).srt
1
00:00:04,230 --> 00:00:07,699
- 让我们来看看问答管线的内部情况。
- Let's have a look inside the question answering pipeline.
2
00:00:07,699 --> 00:00:10,680
问答管线可以从给定的上下文
The question answering pipeline can extract answers
3
00:00:10,680 --> 00:00:14,190
或文本段落中提取问题的答案,
to questions from a given context or passage of text,
4
00:00:14,190 --> 00:00:16,540
就像 transformers 仓库的 README 文档的这部分。
like this part of the transformers repo README.
5
00:00:18,060 --> 00:00:20,310
它也适用于很长的上下文,
It also works for very long contexts,
6
00:00:20,310 --> 00:00:23,850
即使答案很靠后,就像这个例子一样。
even if the answer is at the very end, like in this example.
7
00:00:23,850 --> 00:00:25,400
在本视频中,我们将了解原因。
In this video, we will see why.
8
00:00:26,820 --> 00:00:29,460
问答管线遵循的步骤
The question answering pipeline follows the same steps
9
00:00:29,460 --> 00:00:31,050
与其他管线相同:
as the other pipelines:
10
00:00:31,050 --> 00:00:34,200
问题和上下文被作为一个句子对进行分词,
the question and context are tokenized as a sentence pair,
11
00:00:34,200 --> 00:00:37,955
提供给模型,然后应用一些后处理。
fed to the model, then some post-processing is applied.
12
00:00:37,955 --> 00:00:41,730
分词化和模型步骤应该很熟悉。
The tokenization and model steps should be familiar.
13
00:00:41,730 --> 00:00:44,610
我们使用适合问答的 auto 类
We use the auto class suitable for question answering
14
00:00:44,610 --> 00:00:47,070
而不是序列分类,
instead of sequence classification,
15
00:00:47,070 --> 00:00:49,392
但与文本分类的一个关键区别
but one key difference with text classification
16
00:00:49,392 --> 00:00:52,980
是我们的模型输出两个张量, 名为 start logits
is that our model outputs two tensors named start logits
17
00:00:52,980 --> 00:00:54,570
和 end logits。
and end logits.
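
A minimal sketch of these two steps in code, assuming the distilbert-base-cased-distilled-squad checkpoint (the default for this pipeline) and illustrative question/context strings:

from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

# Checkpoint and example strings are illustrative assumptions
checkpoint = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)

question = "Which deep learning libraries back Transformers?"
context = "Transformers is backed by the three most popular deep learning libraries."

inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

start_logits = outputs.start_logits  # one logit per token for the answer start
end_logits = outputs.end_logits      # one logit per token for the answer end
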
18
00:00:54,570 --> 00:00:55,830
这是为什么?
Why is that?
19
00:00:55,830 --> 00:00:57,930
嗯,这就是模型找到
Well, this is the way the model finds the answer
20
00:00:57,930 --> 00:00:58,803
这个问题的答案的方式。
to the question.
21
00:00:59,790 --> 00:01:02,130
首先,让我们看一下模型输入。
First, let's have a look at the model inputs.
22
00:01:02,130 --> 00:01:04,350
它是与问题分词化相关联的数字,
It's the numbers associated with the tokenization
23
00:01:04,350 --> 00:01:06,843
后面跟着上下文,
of the question, followed by the context,
24
00:01:06,843 --> 00:01:09,723
以及通常的 CLS 和 SEP 特殊 token。
with the usual CLS and SEP special tokens.
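
Decoding the input IDs makes that layout visible (a sketch reusing the inputs from the earlier snippet):

print(tokenizer.decode(inputs["input_ids"][0]))
# [CLS] question tokens [SEP] context tokens [SEP]
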
25
00:01:10,620 --> 00:01:13,320
答案是那些 token 的一部分。
The answer is a part of those tokens.
26
00:01:13,320 --> 00:01:15,510
所以我们要求模型预测答案从哪个 token 开始
So we ask the model to predict which token starts
27
00:01:15,510 --> 00:01:17,373
以及到哪个 token 结束。
the answer and which ends the answer.
28
00:01:18,548 --> 00:01:19,650
对于我们的两个 logit 输出,
For our two logit outputs,
29
00:01:19,650 --> 00:01:22,803
理论上的标签是粉色和紫色的向量。
the theoretical labels are the pink and purple vectors.
30
00:01:24,300 --> 00:01:26,430
要将这些 logits 转换为概率,
To convert those logits into probabilities,
31
00:01:26,430 --> 00:01:28,436
我们需要应用 SoftMax,
we will need to apply a SoftMax,
32
00:01:28,436 --> 00:01:30,360
就像在文本分类管线中一样。
like in the text classification pipeline.
33
00:01:30,360 --> 00:01:33,390
我们只需先掩蔽不属于上下文的 token,
We just mask the tokens that are not part of the context
34
00:01:33,390 --> 00:01:36,855
同时保留初始 CLS token 不被掩蔽,
before doing that, leaving the initial CLS token unmasked
35
00:01:36,855 --> 00:01:39,303
因为我们用它来预测一个不可能的答案。
as we use it to predict an impossible answer.
36
00:01:40,267 --> 00:01:43,500
这就是它在代码中的样子。
This is what it looks like in terms of code.
37
00:01:43,500 --> 00:01:45,870
我们使用一个大的负数作为掩码,
We use a large negative number for the masking,
38
00:01:45,870 --> 00:01:48,957
因为它的指数将为零。
since its exponential will then be zero.
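
A sketch of that masking, building on the tensors above; sequence_ids() returns None for special tokens, 0 for question tokens and 1 for context tokens:

sequence_ids = inputs.sequence_ids()
# Mask every token that is not part of the context...
mask = [i != 1 for i in sequence_ids]
# ...except the CLS token at index 0, kept to predict an impossible answer
mask[0] = False
mask = torch.tensor(mask)[None]

# A large negative logit has an exponential of (nearly) zero after the softmax
start_logits[mask] = -10000
end_logits[mask] = -10000

start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)[0]
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)[0]
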
39
00:01:48,957 --> 00:01:50,580
现在,对于对应着可能答案的
Now, for each start and end position
40
00:01:50,580 --> 00:01:53,550
每一对开始和结束位置,
corresponding to a possible answer,
41
00:01:53,550 --> 00:01:55,050
我们给的分数就是
we give a score that is the
42
00:01:55,050 --> 00:01:57,630
在那些位置上的开始概率
product of the start probabilities and end probabilities
43
00:01:57,630 --> 00:01:58,803
和结束概率的乘积。
at those positions.
44
00:02:00,120 --> 00:02:02,670
当然,开始索引大于结束索引
Of course, a start index greater than an end index
45
00:02:02,670 --> 00:02:04,503
对应一个不可能的答案。
corresponds to an impossible answer.
46
00:02:05,430 --> 00:02:07,080
这是为一个可能的答案
Here is the code to find the best score
47
00:02:07,080 --> 00:02:08,820
找到最佳分数的代码。
for a possible answer.
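
A sketch of that search, using the probabilities computed above; torch.triu keeps only the pairs where the start index is less than or equal to the end index:

# Score every (start, end) pair as the product of the two probabilities
scores = start_probabilities[:, None] * end_probabilities[None, :]
# Zero out impossible answers where the start comes after the end
scores = torch.triu(scores)

max_index = scores.argmax().item()
start_index = max_index // scores.shape[1]
end_index = max_index % scores.shape[1]
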
48
00:02:08,820 --> 00:02:11,430
一旦我们有了这些 token 的开始和结束位置,
Once we have the start and end positions of the tokens,
49
00:02:11,430 --> 00:02:14,130
我们使用分词器提供的偏移量映射
we use the offset mappings provided by our tokenizer
50
00:02:14,130 --> 00:02:16,950
找到初始上下文中的字符范围,
to find the span of characters in the initial context,
51
00:02:16,950 --> 00:02:17,900
并得到我们的答案。
and get our answer.
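
A sketch of that conversion, assuming the same tokenizer and the indices found above:

inputs_with_offsets = tokenizer(question, context, return_offsets_mapping=True)
offsets = inputs_with_offsets["offset_mapping"]

# Each offset is a (start_char, end_char) pair into the original text
start_char, _ = offsets[start_index]
_, end_char = offsets[end_index]
answer = context[start_char:end_char]
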
52
00:02:19,470 --> 00:02:21,900
现在,当上下文很长时,它可能会
Now, when the context is long, it might get truncated
53
00:02:21,900 --> 00:02:22,750
被分词器截断。
by the tokenizer.
54
00:02:23,760 --> 00:02:26,220
这可能会导致部分答案,或者更糟的是,
This might result in part of the answer, or worse,
55
00:02:26,220 --> 00:02:28,113
整个答案被截断。
the whole answer, being truncated.
56
00:02:29,100 --> 00:02:31,050
所以我们不丢弃被截断的 token,
So we don't discard the truncated tokens
57
00:02:31,050 --> 00:02:33,330
而是用它们构建新的特征。
but build new features with them.
58
00:02:33,330 --> 00:02:35,994
这些特征中的每一个都包含问题,
Each of those features contains the question,
59
00:02:35,994 --> 00:02:39,240
然后是上下文中的一个文本块。
then a chunk of text in the context.
60
00:02:39,240 --> 00:02:41,430
如果我们采用不相交的文本块,
If we take disjoint chunks of texts,
61
00:02:41,430 --> 00:02:43,530
我们最终可能会发现答案被拆分
we might end up with the answer being split
62
00:02:43,530 --> 00:02:45,330
到两个特征之间。
between two features.
63
00:02:45,330 --> 00:02:48,060
因此,我们取而代之的是重叠的文本块,
So instead, we take overlapping chunks of texts,
64
00:02:48,060 --> 00:02:50,640
确保至少一个块将完全包含
to make sure at least one of the chunks will fully contain
65
00:02:50,640 --> 00:02:51,990
问题的答案。
the answer to the question.
66
00:02:52,830 --> 00:02:55,260
分词器自动为我们完成所有这些
The tokenizers do all of this for us automatically
67
00:02:55,260 --> 00:02:58,170
使用 return_overflowing_tokens 选项。
with the return_overflowing_tokens option.
68
00:02:58,170 --> 00:02:59,700
stride 参数控制
The stride argument controls
69
00:02:59,700 --> 00:03:02,070
重叠 token 的数量。
the number of overlapping tokens.
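
A sketch of that tokenizer call; max_length=384 and stride=128 are typical values rather than requirements, and long_context stands for the full passage:

inputs = tokenizer(
    question,
    long_context,
    max_length=384,
    truncation="only_second",  # only the context may be truncated, never the question
    stride=128,                # number of tokens shared between consecutive chunks
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)
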
70
00:03:02,070 --> 00:03:04,020
这是我们非常长的上下文被截断成
Here is how our very long context gets truncated
71
00:03:04,020 --> 00:03:05,850
两个有一些重叠的特征的方式。
in two features with some overlap.
72
00:03:05,850 --> 00:03:07,950
通过应用我们之前看到的相同的后处理
By applying the same post-processing we saw before
73
00:03:07,950 --> 00:03:10,636
对于每个特征,我们都得到一个带分数的答案
for each feature, we get the answer with a score
74
00:03:10,636 --> 00:03:12,453
对于它们中的每一个,
for each of them,
75
00:03:12,453 --> 00:03:14,910
我们选择得分最高的答案
and we take the answer with the best score
76
00:03:14,910 --> 00:03:16,203
作为最终解决方案。
as a final solution.
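
A minimal sketch of that last step, assuming the per-feature post-processing produced a hypothetical list of (answer, score) pairs:

# candidates = [(answer_text, score), ...], one entry per feature (assumed)
best_answer, best_score = max(candidates, key=lambda pair: pair[1])
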