subtitles/zh-CN/22_preprocessing-sentence-pairs-(tensorflow).srt
1
00:00:00,225 --> 00:00:02,892
(空气呼啸)
(air whooshing)
2
00:00:05,578 --> 00:00:09,180
- 如何预处理成对的句子?
- How to preprocess pairs of sentences?
3
00:00:09,180 --> 00:00:11,490
我们已经了解了如何对单个句子进行分词
We have seen how to tokenize single sentences
4
00:00:11,490 --> 00:00:13,020
并将它们一起批处理
and batch them together
5
00:00:13,020 --> 00:00:15,660
在 “Batching inputs together” 视频中。
in the "Batching inputs together" video.
6
00:00:15,660 --> 00:00:18,060
如果你觉得这段代码不熟悉,
If this code looks unfamiliar to you,
7
00:00:18,060 --> 00:00:19,760
请务必再次检查该视频!
be sure to check that video again!
8
00:00:21,101 --> 00:00:22,110
在这里,我们将专注于
Here, we will focus on tasks
9
00:00:22,110 --> 00:00:24,033
对句子对进行分类的任务。
that classify pairs of sentences.
10
00:00:24,900 --> 00:00:27,030
例如,我们可能想要分类
For instance, we may want to classify
11
00:00:27,030 --> 00:00:29,820
两个文本是否互为释义。
whether two texts are paraphrases or not.
12
00:00:29,820 --> 00:00:30,900
这是一个例子
Here is an example taken
13
00:00:30,900 --> 00:00:33,180
来自 Quora Question Pairs 数据集,
from the Quora Question Pairs dataset,
14
00:00:33,180 --> 00:00:36,033
它侧重于识别重复的问题。
which focuses on identifying duplicate questions.
15
00:00:37,110 --> 00:00:40,650
在第一对中,两个问题是重复的;
In the first pair, the two questions are duplicates;
16
00:00:40,650 --> 00:00:43,620
在第二对中,它们不是。
in the second, they are not.
17
00:00:43,620 --> 00:00:44,730
另一个分类问题
Another classification problem
18
00:00:44,730 --> 00:00:46,980
是当我们想知道两个句子
is when we want to know if two sentences
19
00:00:46,980 --> 00:00:49,290
在逻辑上是否相关,
are logically related or not,
20
00:00:49,290 --> 00:00:52,173
一个称为自然语言推理或 NLI 的问题。
a problem called Natural Language Inference or NLI.
21
00:00:53,100 --> 00:00:55,830
在这个取自 MultiNLI 数据集的例子中,
In this example taken from the MultiNLI dataset,
22
00:00:55,830 --> 00:00:59,460
对于每个可能的标签,我们都有一对句子:
we have a pair of sentences for each possible label:
23
00:00:59,460 --> 00:01:02,400
矛盾,中性或蕴涵,
contradiction, neutral or entailment,
24
00:01:02,400 --> 00:01:04,680
这是一种花哨的说法,意思是第一句话
which is a fancy way of saying the first sentence
25
00:01:04,680 --> 00:01:05,853
蕴涵了第二句话。
implies the second.
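
Both datasets mentioned here are distributed through the Datasets library; as a quick sketch of how one might load them (using the GLUE versions of QQP and MultiNLI):

from datasets import load_dataset

# Quora Question Pairs and MultiNLI, as packaged in the GLUE benchmark
qqp = load_dataset("glue", "qqp")
mnli = load_dataset("glue", "mnli")

print(qqp["train"][0])   # fields include question1, question2 and a label
print(mnli["train"][0])  # fields include premise, hypothesis and a label
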
26
00:01:07,140 --> 00:01:09,000
所以对句子对进行分类
So classifying pairs of sentences
27
00:01:09,000 --> 00:01:10,533
是一个值得研究的问题。
is a problem worth studying.
28
00:01:11,370 --> 00:01:13,770
事实上,在 GLUE 基准测试中,
In fact, in the GLUE benchmark,
29
00:01:13,770 --> 00:01:16,830
这是文本分类的学术基准,
which is an academic benchmark for text classification,
30
00:01:16,830 --> 00:01:19,680
10 个数据集中有 8 个专注于
eight of the 10 datasets are focused on tasks
31
00:01:19,680 --> 00:01:20,973
使用句子对的任务。
using pairs of sentences.
32
00:01:22,110 --> 00:01:24,720
这就是为什么像 BERT 这样的模型
That's why models like BERT are often pretrained
33
00:01:24,720 --> 00:01:26,520
经常使用双重目标进行预训练:
with a dual objective:
34
00:01:26,520 --> 00:01:28,890
在语言建模目标之上,
on top of the language modeling objective,
35
00:01:28,890 --> 00:01:32,010
它们通常还有一个与句子对相关的目标。
they often have an objective related to sentence pairs.
36
00:01:32,010 --> 00:01:34,560
例如,在预训练期间,
For instance, during pretraining,
37
00:01:34,560 --> 00:01:36,690
BERT 会看到成对的句子
BERT is shown pairs of sentences
38
00:01:36,690 --> 00:01:39,900
并且必须同时预测随机屏蔽 token 的值
and must predict both the value of randomly masked tokens
39
00:01:39,900 --> 00:01:41,250
以及第二句是否
and whether the second sentence
40
00:01:41,250 --> 00:01:42,903
紧跟在第一句之后。
follows from the first or not.
41
00:01:44,070 --> 00:01:47,100
幸运的是,来自 Transformers 库的 tokenizer
Fortunately, the tokenizer from the Transformers library
42
00:01:47,100 --> 00:01:50,550
有一个很好的 API 来处理成对的句子:
has a nice API to deal with pairs of sentences:
43
00:01:50,550 --> 00:01:52,650
你只需要将它们作为两个参数传递
you just have to pass them as two arguments
44
00:01:52,650 --> 00:01:53,613
给 tokenizer。
to the tokenizer.
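
As a minimal sketch of that call, assuming a BERT-style checkpoint (bert-base-uncased here) and two example questions made up for illustration:

from transformers import AutoTokenizer

# Assumption: a checkpoint whose tokenizer produces token type IDs
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The two sentences are passed as two arguments and encoded as a single pair
inputs = tokenizer(
    "What is the capital of France?",        # hypothetical first sentence
    "Which city is the capital of France?",  # hypothetical second sentence
)
print(inputs.keys())  # input_ids, token_type_ids, attention_mask
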
45
00:01:54,900 --> 00:01:56,040
除了我们已经研究过的输入 ID
On top of the input IDs
46
00:01:56,040 --> 00:01:58,440
和注意力掩码之外,
and the attention mask we studied already,
47
00:01:58,440 --> 00:02:01,530
它返回一个名为 token 类型 ID 的新字段,
it returns a new field called token type IDs,
48
00:02:01,530 --> 00:02:03,210
它告诉模型哪些 token
which tells the model which tokens
49
00:02:03,210 --> 00:02:05,100
属于第一句
belong to the first sentence
50
00:02:05,100 --> 00:02:07,350
哪些属于第二句。
and which ones belong to the second sentence.
51
00:02:08,670 --> 00:02:11,430
放大一点,这里是输入 ID,
Zooming in a little bit, here are the input IDs,
52
00:02:11,430 --> 00:02:13,710
与它们对应的 token 对齐,
aligned with the tokens they correspond to,
53
00:02:13,710 --> 00:02:17,193
它们各自的 token 类型 ID 和注意力掩码。
their respective token type ID and attention mask.
54
00:02:18,540 --> 00:02:21,300
我们可以看到 tokenizer 还添加了特殊 token
We can see the tokenizer also added special tokens
55
00:02:21,300 --> 00:02:25,230
所以我们有一个 CLS token、第一句话的 token、
so we have a CLS token, the tokens from the first sentence,
56
00:02:25,230 --> 00:02:28,590
一个 SEP token、第二句话的 token、
a SEP token, the tokens from the second sentence,
57
00:02:28,590 --> 00:02:30,153
和最后一个 SEP token。
and a final SEP token.
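
To check that structure yourself, you can convert the IDs back to tokens and line them up with the token type IDs (a sketch continuing the example above):

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"])
for token, type_id in zip(tokens, inputs["token_type_ids"]):
    print(f"{token:15} {type_id}")
# [CLS] and the first sentence carry type ID 0;
# the second sentence and the final [SEP] carry type ID 1
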
58
00:02:31,680 --> 00:02:33,720
如果我们有几对句子,
If we have several pairs of sentences,
59
00:02:33,720 --> 00:02:35,640
我们可以将它们一起分词
we can tokenize them together
60
00:02:35,640 --> 00:02:38,280
方法是传入第一个句子的列表,
by passing the list of first sentences,
61
00:02:38,280 --> 00:02:40,710
然后是第二个句子的列表
then the list of second sentences
62
00:02:40,710 --> 00:02:43,050
以及我们已经研究过的所有关键字参数,
and all the keyword arguments we studied already,
63
00:02:43,050 --> 00:02:44,133
比如 padding=True。
like padding=True.
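
A sketch of that batched call, with two hypothetical pairs and the TensorFlow return type used in this version of the video:

first_sentences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing.",
]
second_sentences = [
    "I can't wait to take it.",
    "It really is!",
]

batch = tokenizer(
    first_sentences,
    second_sentences,
    padding=True,         # pad the shorter pair to the length of the longest
    return_tensors="tf",  # requires TensorFlow to be installed
)
print(batch["input_ids"].shape)
print(batch["token_type_ids"])
print(batch["attention_mask"])
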
64
00:02:45,510 --> 00:02:46,770
放大看一下结果,
Zooming in at the result,
65
00:02:46,770 --> 00:02:49,050
我们可以看到 tokenizer 是如何
we can see how the tokenizer added padding
66
00:02:49,050 --> 00:02:50,940
为第二对句子添加填充的,
to the second pair of sentences,
67
00:02:50,940 --> 00:02:53,490
使两个输出的长度相同。
to make the two outputs the same length.
68
00:02:53,490 --> 00:02:55,620
它还正确地处理了 token 类型 ID
It also properly dealt with token type IDs
69
00:02:55,620 --> 00:02:57,720
和两个句子的注意力掩码。
and attention masks for the two sentences.
70
00:02:59,010 --> 00:03:01,460
这样就可以将这些输入传给我们的模型了!
This is then all ready to pass through our model!
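
For instance, with a sequence classification head on the same checkpoint (a sketch; the two-label head for a duplicate-question task is an assumption):

from transformers import TFAutoModelForSequenceClassification

# Assumption: a freshly initialized 2-label head, as for paraphrase detection
model = TFAutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
outputs = model(batch)       # the tokenizer output can be passed as-is
print(outputs.logits.shape)  # (2, 2): one row of logits per sentence pair
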
71
00:03:03,799 --> 00:03:06,466
(空气呼啸)
(air whooshing)