subtitles/zh-CN/21_preprocessing-sentence-pairs-(pytorch).srt (260 lines of code) (raw):
1
00:00:00,000 --> 00:00:03,083
(图形嘶嘶作响)
(graphics whooshing)
2
00:00:05,370 --> 00:00:07,413
- 如何预处理句子对。
- How to pre-process pairs of sentences.
3
00:00:09,150 --> 00:00:11,340
我们已经看到了如何分词化单个句子
We have seen how to tokenize single sentences
4
00:00:11,340 --> 00:00:12,877
并将它们放在一起,
and batch them together in the
5
00:00:12,877 --> 00:00:15,810
在 “Batching inputs together” 视频中
"Batching inputs together" video.
6
00:00:15,810 --> 00:00:18,330
如果这段代码对你而言不熟悉,
If this code look unfamiliar to you,
7
00:00:18,330 --> 00:00:20,030
请务必再次查看该视频。
be sure to check that video again.
8
00:00:21,330 --> 00:00:24,543
这里将重点关注对句子进行分类的任务。
Here will focus on tasks that classify pair of sentences.
9
00:00:25,620 --> 00:00:28,470
例如,我们可能想要对两个文本是否被释义
For instance, we may want to classify whether two texts
10
00:00:28,470 --> 00:00:30,360
进行分类。
are paraphrased or not.
11
00:00:30,360 --> 00:00:32,880
这是从 Quora Question Pairs 数据集中获取的示例
Here is an example taken from the Quora Question Pairs dataset
12
00:00:32,880 --> 00:00:37,530
它专注于识别重复的问题。
which focuses on identifying duplicate questions.
13
00:00:37,530 --> 00:00:40,650
在第一对中,两个问题是重复的,
In the first pair, the two questions are duplicates,
14
00:00:40,650 --> 00:00:42,000
在第二个它们不是。
in the second they are not.
15
00:00:43,283 --> 00:00:45,540
另一个对分类问题是
Another pair classification problem is
16
00:00:45,540 --> 00:00:47,400
当我们想知道两个句子是
when we want to know if two sentences are
17
00:00:47,400 --> 00:00:49,590
逻辑上相关与否,
logically related or not,
18
00:00:49,590 --> 00:00:53,970
一个称为自然语言推理或 NLI 的问题。
a problem called natural language inference or NLI.
19
00:00:53,970 --> 00:00:57,000
在这个取自 MultiNLI 数据集的例子中,
In this example, taken from the MultiNLI data set,
20
00:00:57,000 --> 00:00:59,880
对于每个可能的标签,我们都有一对句子。
we have a pair of sentences for each possible label.
21
00:00:59,880 --> 00:01:02,490
矛盾,自然的或蕴涵,
Contradiction, natural or entailment,
22
00:01:02,490 --> 00:01:04,680
这是第一句话的奇特表达方式
which is a fancy way of saying the first sentence
23
00:01:04,680 --> 00:01:05,793
暗示第二种。
implies the second.
24
00:01:06,930 --> 00:01:08,820
所以分类成对的句子是一个
So classifying pairs of sentences is a problem
25
00:01:08,820 --> 00:01:10,260
值得研究的问题。
worth studying.
26
00:01:10,260 --> 00:01:12,630
事实上,在 GLUE 基准中,
In fact, in the GLUE benchmark,
27
00:01:12,630 --> 00:01:15,750
这是文本分类问题的学术界基准
which is an academic benchmark for text classification
28
00:01:15,750 --> 00:01:17,910
10 个数据集中有 8 个是关注
8 of the 10 datasets are focused
29
00:01:17,910 --> 00:01:19,953
使用句子对的任务。
on tasks using pairs of sentences.
30
00:01:20,910 --> 00:01:22,560
这就是为什么像 BERT 这样的模型
That's why models like BERT
31
00:01:22,560 --> 00:01:25,320
通常预先训练有双重目标。
are often pre-trained with a dual objective.
32
00:01:25,320 --> 00:01:27,660
在语言建模目标之上,
On top of the language modeling objective,
33
00:01:27,660 --> 00:01:31,230
他们通常有一个与句子对相关的目标。
they often have an objective related to sentence pairs.
34
00:01:31,230 --> 00:01:34,320
例如,在预训练期间 BERT 见到
For instance, during pretraining BERT is shown
35
00:01:34,320 --> 00:01:36,810
成对的句子,必须同时预测
pairs of sentences and must predict both
36
00:01:36,810 --> 00:01:39,930
随机掩蔽的标记值,以及第二个是否
the value of randomly masked tokens, and whether the second
37
00:01:39,930 --> 00:01:41,830
句子是否接着第一个句子。
sentence follow from the first or not.
38
00:01:43,084 --> 00:01:45,930
幸运的是,来自 Transformers 库的分词器
Fortunately, the tokenizer from the Transformers library
39
00:01:45,930 --> 00:01:49,170
有一个很好的 API 来处理成对的句子。
has a nice API to deal with pairs of sentences.
40
00:01:49,170 --> 00:01:51,270
你只需要将它们作为两个参数传递
You just have to pass them as two arguments
41
00:01:51,270 --> 00:01:52,120
到 tokenizer 。
to the tokenizer.
42
00:01:53,430 --> 00:01:55,470
在我们已经研究过的输入 ID
On top of the input IDs and the attention mask
43
00:01:55,470 --> 00:01:56,970
和注意掩码之上,
we studied already,
44
00:01:56,970 --> 00:01:59,910
它返回一个名为标记类型 ID 的新字段,
it returns a new field called token type IDs,
45
00:01:59,910 --> 00:02:01,790
它告诉模型哪些标记属于
which tells the model which tokens belong
46
00:02:01,790 --> 00:02:03,630
第一句话,
to the first sentence,
47
00:02:03,630 --> 00:02:05,943
哪些属于第二句。
and which ones belong to the second sentence.
48
00:02:07,290 --> 00:02:09,840
放大一点,这里有一个输入 ID
Zooming in a little bit, here has an input IDs
49
00:02:09,840 --> 00:02:12,180
与它们对应的 token 对齐,
aligned with the tokens they correspond to,
50
00:02:12,180 --> 00:02:15,213
它们各自的标记类型 ID 和注意掩码。
their respective token type ID and attention mask.
51
00:02:16,080 --> 00:02:19,260
我们可以看到分词器还添加了特殊标记。
We can see the tokenizer also added special tokens.
52
00:02:19,260 --> 00:02:22,620
所以我们有一个 CLS token ,来自第一句话的 token ,
So we have a CLS token, the tokens from the first sentence,
53
00:02:22,620 --> 00:02:25,770
一个 SEP 标记,第二句话中的标记,
a SEP token, the tokens from the second sentence,
54
00:02:25,770 --> 00:02:27,003
和最终的 SEP 标记。
and a final SEP token.
55
00:02:28,500 --> 00:02:30,570
如果我们有几对句子,
If we have several pairs of sentences,
56
00:02:30,570 --> 00:02:32,840
我们可以通过第一句话的传递列表
we can tokenize them together by passing the list
57
00:02:32,840 --> 00:02:36,630
将它们标记在一起,然后是第二句话的列表
of first sentences, then the list of second sentences
58
00:02:36,630 --> 00:02:39,300
以及我们已经研究过的所有关键字参数
and all the keyword arguments we studied already
59
00:02:39,300 --> 00:02:40,353
例如 padding=True。
like padding=True.
60
00:02:41,940 --> 00:02:43,140
放大结果,
Zooming in at the result,
61
00:02:43,140 --> 00:02:45,030
我们可以看到分词器如何添加填充
we can see how the tokenizer added padding
62
00:02:45,030 --> 00:02:48,090
到第二对句子使得两个输出的
to the second pair sentences to make the two outputs
63
00:02:48,090 --> 00:02:51,360
长度相同,并正确处理标记类型 ID
the same length, and properly dealt with token type IDs
64
00:02:51,360 --> 00:02:53,643
和两个句子的注意力掩码。
and attention masks for the two sentences.
65
00:02:54,900 --> 00:02:57,573
然后就可以通过我们的模型了。
This is then all ready to pass through our model.