1
00:00:00,000 --> 00:00:03,417
(轻过渡音乐)
(light transition music)
2
00:00:05,490 --> 00:00:08,440
- 让我们来看看问答 pipeline(管线) 的内部情况。
- Let's have a look inside the question answering pipeline.
3
00:00:09,780 --> 00:00:11,370
问答管线
The question answering pipeline
4
00:00:11,370 --> 00:00:13,710
可以从给定的上下文或文本段落中
can extract answers to questions
5
00:00:13,710 --> 00:00:16,020
提取问题的答案,
from a given context or passage of text
6
00:00:16,020 --> 00:00:18,370
就像 Transformers 仓库 README 的这一部分。
like this part of the Transformers repo README.
7
00:00:19,290 --> 00:00:21,180
它也适用于很长的上下文,
It also works for very long contexts,
8
00:00:21,180 --> 00:00:24,720
即使答案位于最末尾,就像这个例子一样。
even if the answer is at the very end, like in this example.
9
00:00:24,720 --> 00:00:26,223
在本视频中,我们将了解原因。
In this video, we'll see why.
10
00:00:27,840 --> 00:00:29,310
问答管线
The question answering pipeline
11
00:00:29,310 --> 00:00:32,130
遵循与其他管线相同的步骤。
follows the same steps as the other pipelines.
12
00:00:32,130 --> 00:00:35,550
问题和上下文会作为一个句子对被分词,
The question and context are tokenized as a sentence pair,
13
00:00:35,550 --> 00:00:38,463
然后输入给模型,再应用一些后处理。
fed to the model, then some post-processing is applied.
14
00:00:39,540 --> 00:00:42,840
所以分词和模型这两个步骤应该很熟悉了。
So tokenization and model steps should be familiar.
15
00:00:42,840 --> 00:00:45,000
我们使用适合问答的 auto 类
We use the auto class suitable for question answering
16
00:00:45,000 --> 00:00:47,460
而不是序列分类,
instead of sequence classification,
17
00:00:47,460 --> 00:00:50,190
但与文本分类的一个关键区别
but one key difference with text classification
18
00:00:50,190 --> 00:00:52,380
是我们的模型输出两个张量
is that our model outputs two tensors
19
00:00:52,380 --> 00:00:55,230
名为 start logits 和 end logits。
named start logits and end logits.
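
A minimal sketch of those first two steps in code (the checkpoint name is an assumption, matching the pipeline's usual default for question answering):

```python
from transformers import AutoTokenizer, TFAutoModelForQuestionAnswering

# Assumed checkpoint: the QA pipeline's usual default model.
checkpoint = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForQuestionAnswering.from_pretrained(checkpoint)

question = "Which deep learning libraries back Transformers?"
context = "Transformers is backed by the three most popular deep learning libraries."

# The question and context are tokenized together as one sentence pair.
inputs = tokenizer(question, context, return_tensors="tf")
outputs = model(**inputs)

# Two tensors with one logit per input token.
start_logits = outputs.start_logits
end_logits = outputs.end_logits
```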
20
00:00:55,230 --> 00:00:56,160
这是为什么?
Why is that?
21
00:00:56,160 --> 00:00:58,170
嗯,这就是模型为这个问题
Well, this is the way the model finds the answer
22
00:00:58,170 --> 00:00:59,043
找到答案的方式。
to the question.
23
00:01:00,090 --> 00:01:02,610
首先,让我们看一下模型输入。
First, let's have a look at the model inputs.
24
00:01:02,610 --> 00:01:05,800
这些是与问题分词相关的数字
It's numbers associated with the tokenization of the question
25
00:01:05,800 --> 00:01:05,850
,
,
26
00:01:05,850 --> 00:01:07,753
后面跟着上下文
followed by the context
27
00:01:07,753 --> 00:01:10,233
使用通常的 CLS 和 SEP 特殊 token 。
with the usual CLS and SEP special tokens.
28
00:01:11,130 --> 00:01:13,203
答案是那些 token 的一部分。
The answer is a part of those tokens.
29
00:01:14,040 --> 00:01:15,330
所以我们要求模型预测
So we ask the model to predict
30
00:01:15,330 --> 00:01:17,040
答案从哪个 token 开始,
which token starts the answer
31
00:01:17,040 --> 00:01:19,320
又到哪个 token 结束。
and which ends the answer.
32
00:01:19,320 --> 00:01:20,910
对于我们的两个 logit 输出,
For our two logit outputs,
33
00:01:20,910 --> 00:01:23,823
理论标签是粉色和紫色的向量。
the theoretical labels are the pink and purple vectors.
34
00:01:24,870 --> 00:01:26,700
要将这些 logits 转换为概率,
To convert those logits into probabilities,
35
00:01:26,700 --> 00:01:28,596
我们需要应用 SoftMax,
we will need to apply a SoftMax,
36
00:01:28,596 --> 00:01:31,020
就像在文本分类管线中一样。
like in the text classification pipeline.
37
00:01:31,020 --> 00:01:32,310
只不过在此之前,我们会先掩蔽
We just mask the tokens
38
00:01:32,310 --> 00:01:35,940
不属于上下文的那些 token,
that are not part of the context before doing that,
39
00:01:35,940 --> 00:01:38,310
但保留最初的 CLS token 不掩蔽,
leaving the initial CLS token unmasked
40
00:01:38,310 --> 00:01:40,773
因为我们要用它来预测不可能的答案。
as we use it to predict an impossible answer.
41
00:01:41,940 --> 00:01:44,730
这就是它在代码方面的样子。
This is what it looks like in terms of code.
42
00:01:44,730 --> 00:01:47,340
我们使用一个大的负数作为掩码
We use a large negative number for the masking
43
00:01:47,340 --> 00:01:49,533
因为它取指数后就会是零。
since its exponential will then be zero.
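
A sketch of that masking step, continuing from the tensors above (the -10000 value is just a conventional large negative number):

```python
import tensorflow as tf

# sequence_ids() marks question tokens with 0, context tokens with 1,
# and special tokens with None.
sequence_ids = inputs.sequence_ids()

# Mask everything that is not part of the context...
mask = [i != 1 for i in sequence_ids]
# ...but keep the initial CLS token, used to predict an impossible answer.
mask[0] = False

# A large negative logit has an exponential of (almost) zero in the softmax.
start = tf.where(mask, -10000.0, outputs.start_logits[0])
end = tf.where(mask, -10000.0, outputs.end_logits[0])

start_probabilities = tf.math.softmax(start, axis=-1).numpy()
end_probabilities = tf.math.softmax(end, axis=-1).numpy()
```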
44
00:01:50,850 --> 00:01:53,160
现在,对于每一组对应可能答案的
Now the probability for each start and end position
45
00:01:53,160 --> 00:01:55,740
开始和结束位置,
corresponding to a possible answer
46
00:01:55,740 --> 00:01:57,540
都会得到一个分数,它等于
will give a score that is a product
47
00:01:57,540 --> 00:01:58,680
这些位置上的开始概率与结束概率
of the start probabilities and end probabilities
48
00:01:59,680 --> 00:02:00,873
的乘积。
at those positions.
49
00:02:01,920 --> 00:02:04,530
当然,如果开始索引大于结束索引,
Of course, a start index greater than an end index
50
00:02:04,530 --> 00:02:06,330
就对应一个不可能的答案。
corresponds to an impossible answer.
51
00:02:07,744 --> 00:02:09,510
这段代码用于为可能的答案
Here is the code to find the best score
52
00:02:09,510 --> 00:02:11,280
找出最佳分数。
for a possible answer.
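
A sketch of that scoring step, using the probabilities computed above:

```python
import numpy as np

# Score of a (start, end) pair = start probability * end probability.
scores = start_probabilities[:, None] * end_probabilities[None, :]

# Zero out candidates where the start index is greater than the end index,
# since those correspond to impossible answers.
scores = np.triu(scores)

# Pick the (start, end) pair with the best score.
max_index = scores.argmax().item()
start_index = max_index // scores.shape[1]
end_index = max_index % scores.shape[1]
print(scores[start_index, end_index])
```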
53
00:02:11,280 --> 00:02:13,830
一旦我们有了 token 的开始和结束位置,
Once we have the start and end position for the tokens,
54
00:02:13,830 --> 00:02:16,650
我们使用分词器提供的偏移量映射
we use the offset mappings provided by our tokenizer
55
00:02:16,650 --> 00:02:19,710
找到初始上下文中的字符范围,
to find the span of characters in the initial context,
56
00:02:19,710 --> 00:02:20,810
我们得到了答案。
and we get our answer.
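
A sketch of that last post-processing step (it requires a fast tokenizer, which is what provides the offset mappings):

```python
# Tokenize again, asking for the character span behind each token.
inputs_with_offsets = tokenizer(question, context, return_offsets_mapping=True)
offsets = inputs_with_offsets["offset_mapping"]

# Map the predicted token indices back to characters in the original context.
start_char, _ = offsets[start_index]
_, end_char = offsets[end_index]
answer = context[start_char:end_char]
print(answer)
```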
57
00:02:22,080 --> 00:02:23,700
现在,当上下文很长时,
Now, when the context is long,
58
00:02:23,700 --> 00:02:25,977
它可能会被分词器截断。
it might get truncated by the tokenizer.
59
00:02:26,834 --> 00:02:29,790
这可能导致答案的一部分,或者更糟,
This might result in part of the answer, or worse,
60
00:02:29,790 --> 00:02:32,190
整个答案被截断。
the whole answer, being truncated.
61
00:02:32,190 --> 00:02:34,020
所以我们不会丢弃被截断的 token,
So we don't discard the truncated tokens
62
00:02:34,020 --> 00:02:36,420
而是用它们来构建新的特征。
but build new features with them.
63
00:02:36,420 --> 00:02:39,330
这些特征中的每一个都包含问题,
Each of those features contains the question,
64
00:02:39,330 --> 00:02:42,150
后面跟着上下文中的一块文本。
then a chunk of text in the context.
65
00:02:42,150 --> 00:02:44,520
如果我们取互不相交的文本块,
If we take disjoint chunks of texts,
66
00:02:44,520 --> 00:02:45,840
答案可能会
we might end up with the answer
67
00:02:45,840 --> 00:02:47,733
被拆分到两个特征之间。
being split between two features.
68
00:02:48,720 --> 00:02:52,050
因此,我们改为取相互重叠的文本块,
So instead, we take overlapping chunks of text
69
00:02:52,050 --> 00:02:53,910
以确保至少有一个块
to make sure at least one of the chunks
70
00:02:53,910 --> 00:02:56,940
完整包含问题的答案。
will fully contain the answer to the question.
71
00:02:56,940 --> 00:02:59,220
所以,分词器会自动为我们完成所有这些,
So, the tokenizer does all of this for us automatically
72
00:02:59,220 --> 00:03:01,920
只需使用 return overflowing tokens 选项。
with the return overflowing tokens option.
73
00:03:01,920 --> 00:03:02,753
stride 参数
The stride argument
74
00:03:02,753 --> 00:03:04,830
控制重叠 token 的数量。
controls the number of overlapping tokens.
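
A sketch of that tokenizer call (`long_context` is a hypothetical long passage; the max_length and stride values are illustrative):

```python
inputs = tokenizer(
    question,
    long_context,                  # hypothetical very long passage of text
    truncation="only_second",      # only ever truncate the context, never the question
    max_length=384,                # illustrative maximum length per feature
    stride=128,                    # number of overlapping tokens between chunks
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    padding=True,
    return_tensors="tf",
)

# One row per feature: the question followed by one overlapping chunk of context.
print(inputs["input_ids"].shape)
```

Before these features go through the model, the extra "offset_mapping" and "overflow_to_sample_mapping" entries would be popped off the encoding, since the model does not expect them.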
75
00:03:05,940 --> 00:03:07,740
这里展示了我们很长的上下文
Here is how our very long context
76
00:03:07,740 --> 00:03:10,323
如何被截断成两个有部分重叠的特征。
gets truncated in two features with some overlap.
77
00:03:11,160 --> 00:03:12,720
通过对每个特征应用
By applying the same post-processing
78
00:03:12,720 --> 00:03:14,850
我们之前看到的相同后处理,
we saw before for each feature,
79
00:03:14,850 --> 00:03:17,970
我们为每个特征都得到一个带分数的答案,
we get the answer with a score for each of them,
80
00:03:17,970 --> 00:03:19,920
然后选择得分最高的答案
and we take the answer with the best score
81
00:03:19,920 --> 00:03:21,303
作为最终结果。
as a final solution.
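
A minimal sketch of that final selection, assuming per-feature probabilities were computed exactly as in the single-feature case:

```python
# start_probabilities / end_probabilities are now assumed to hold one row
# of probabilities per feature, computed as before.
candidates = []
for feature_index in range(len(start_probabilities)):
    scores = np.triu(
        start_probabilities[feature_index][:, None]
        * end_probabilities[feature_index][None, :]
    )
    idx = scores.argmax().item()
    start_index, end_index = idx // scores.shape[1], idx % scores.shape[1]
    candidates.append((feature_index, start_index, end_index, scores[start_index, end_index]))

# The answer with the best score across all features is the final one.
best_feature, best_start, best_end, best_score = max(candidates, key=lambda c: c[-1])
```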
82
00:03:23,089 --> 00:03:26,506
(轻过渡音乐)
(light transition music)