1
00:00:00,000 --> 00:00:03,083
(graphics whooshing)

2
00:00:05,370 --> 00:00:07,413
- How to pre-process pairs of sentences.

3
00:00:09,150 --> 00:00:11,340
We have seen how to tokenize single sentences

4
00:00:11,340 --> 00:00:12,877
and batch them together in the

5
00:00:12,877 --> 00:00:15,810
"Batching inputs together" video.

6
00:00:15,810 --> 00:00:18,330
If this code looks unfamiliar to you,

7
00:00:18,330 --> 00:00:20,030
be sure to check that video again.

8
00:00:21,330 --> 00:00:24,543
Here we will focus on tasks that classify pairs of sentences.

9
00:00:25,620 --> 00:00:28,470
For instance, we may want to classify whether two texts

10
00:00:28,470 --> 00:00:30,360
are paraphrases or not.

11
00:00:30,360 --> 00:00:32,880
Here is an example taken from the Quora Question Pairs dataset,

12
00:00:32,880 --> 00:00:37,530
which focuses on identifying duplicate questions.

13
00:00:37,530 --> 00:00:40,650
In the first pair, the two questions are duplicates;

14
00:00:40,650 --> 00:00:42,000
in the second, they are not.

15
00:00:43,283 --> 00:00:45,540
Another pair classification problem is

16
00:00:45,540 --> 00:00:47,400
when we want to know if two sentences are

17
00:00:47,400 --> 00:00:49,590
logically related or not,

18
00:00:49,590 --> 00:00:53,970
a problem called natural language inference, or NLI.

19
00:00:53,970 --> 00:00:57,000
In this example, taken from the MultiNLI dataset,

20
00:00:57,000 --> 00:00:59,880
we have a pair of sentences for each possible label:

21
00:00:59,880 --> 00:01:02,490
contradiction, neutral, or entailment,

22
00:01:02,490 --> 00:01:04,680
which is a fancy way of saying the first sentence

23
00:01:04,680 --> 00:01:05,793
implies the second.

24
00:01:06,930 --> 00:01:08,820
So classifying pairs of sentences is a problem

25
00:01:08,820 --> 00:01:10,260
worth studying.

26
00:01:10,260 --> 00:01:12,630
In fact, in the GLUE benchmark,

27
00:01:12,630 --> 00:01:15,750
which is an academic benchmark for text classification,

28
00:01:15,750 --> 00:01:17,910
8 of the 10 datasets are focused

29
00:01:17,910 --> 00:01:19,953
on tasks using pairs of sentences.

30
00:01:20,910 --> 00:01:22,560
That's why models like BERT

31
00:01:22,560 --> 00:01:25,320
are often pre-trained with a dual objective.

32
00:01:25,320 --> 00:01:27,660
On top of the language modeling objective,

33
00:01:27,660 --> 00:01:31,230
they often have an objective related to sentence pairs.

34
00:01:31,230 --> 00:01:34,320
For instance, during pretraining, BERT is shown

35
00:01:34,320 --> 00:01:36,810
pairs of sentences and must predict both

36
00:01:36,810 --> 00:01:39,930
the value of randomly masked tokens and whether the second

37
00:01:39,930 --> 00:01:41,830
sentence follows from the first or not.

38
00:01:43,084 --> 00:01:45,930
Fortunately, the tokenizer from the Transformers library

39
00:01:45,930 --> 00:01:49,170
has a nice API to deal with pairs of sentences.
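[The on-screen code is not captured in the subtitles. A minimal sketch of the tokenizer call for one pair of sentences, assuming the bert-base-uncased checkpoint; the checkpoint name and example sentences are stand-ins, not necessarily those shown in the video.]

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The two sentences of the pair are passed as two separate arguments.
inputs = tokenizer("My name is Sylvain.", "I work at Hugging Face.")

# Besides input_ids and attention_mask, the output contains
# token_type_ids: 0 for the first sentence, 1 for the second.
print(inputs["token_type_ids"])

# Decoding the IDs back to tokens shows the special tokens the
# tokenizer added: [CLS] sentence1 [SEP] sentence2 [SEP]
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"]))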
40
00:01:49,170 --> 00:01:51,270
You just have to pass them as two arguments

41
00:01:51,270 --> 00:01:52,120
to the tokenizer.

42
00:01:53,430 --> 00:01:55,470
On top of the input IDs and the attention mask

43
00:01:55,470 --> 00:01:56,970
we studied already,

44
00:01:56,970 --> 00:01:59,910
it returns a new field called token type IDs,

45
00:01:59,910 --> 00:02:01,790
which tells the model which tokens belong

46
00:02:01,790 --> 00:02:03,630
to the first sentence

47
00:02:03,630 --> 00:02:05,943
and which ones belong to the second sentence.

48
00:02:07,290 --> 00:02:09,840
Zooming in a little bit, here are the input IDs

49
00:02:09,840 --> 00:02:12,180
aligned with the tokens they correspond to,

50
00:02:12,180 --> 00:02:15,213
their respective token type IDs, and the attention mask.

51
00:02:16,080 --> 00:02:19,260
We can see the tokenizer also added special tokens,

52
00:02:19,260 --> 00:02:22,620
so we have a CLS token, the tokens from the first sentence,

53
00:02:22,620 --> 00:02:25,770
a SEP token, the tokens from the second sentence,

54
00:02:25,770 --> 00:02:27,003
and a final SEP token.

55
00:02:28,500 --> 00:02:30,570
If we have several pairs of sentences,

56
00:02:30,570 --> 00:02:32,840
we can tokenize them together by passing the list

57
00:02:32,840 --> 00:02:36,630
of first sentences, then the list of second sentences,

58
00:02:36,630 --> 00:02:39,300
and all the keyword arguments we studied already,

59
00:02:39,300 --> 00:02:40,353
like padding=True.

60
00:02:41,940 --> 00:02:43,140
Zooming in on the result,

61
00:02:43,140 --> 00:02:45,030
we can see how the tokenizer added padding

62
00:02:45,030 --> 00:02:48,090
to the second pair of sentences to make the two outputs

63
00:02:48,090 --> 00:02:51,360
the same length, and properly dealt with the token type IDs

64
00:02:51,360 --> 00:02:53,643
and attention masks for the two sentences.

65
00:02:54,900 --> 00:02:57,573
This is then all ready to pass through our model.
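[A minimal sketch of the batched version described above, again assuming the bert-base-uncased checkpoint and stand-in sentences; the lists of first and second sentences and the padding=True argument follow what the narration describes.]

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Pass the list of first sentences, then the list of second
# sentences, plus the usual keyword arguments like padding=True.
batch = tokenizer(
    ["My name is Sylvain.", "Going to the cinema."],
    ["I work at Hugging Face.", "This movie is great."],
    padding=True,
)

# The shorter pair is padded so both outputs have the same length;
# token_type_ids and attention_mask are handled per pair.
for ids in batch["input_ids"]:
    print(tokenizer.decode(ids))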