1 00:00:00,000 --> 00:00:03,083 (graphics whooshing)
2 00:00:05,370 --> 00:00:07,413 - How to pre-process pairs of sentences.
3 00:00:09,150 --> 00:00:11,340 We have seen how to tokenize single sentences
4 00:00:11,340 --> 00:00:12,877 and batch them together in the
5 00:00:12,877 --> 00:00:15,810 "Batching inputs together" video.
6 00:00:15,810 --> 00:00:18,330 If this code looks unfamiliar to you,
7 00:00:18,330 --> 00:00:20,030 be sure to check that video again.
8 00:00:21,330 --> 00:00:24,543 Here we will focus on tasks that classify pairs of sentences.
9 00:00:25,620 --> 00:00:28,470 For instance, we may want to classify whether two texts
10 00:00:28,470 --> 00:00:30,360 are paraphrases of each other or not.
11 00:00:30,360 --> 00:00:32,880 Here is an example taken from the Quora Question Pairs
12 00:00:32,880 --> 00:00:37,530 dataset, which focuses on identifying duplicate questions.
13 00:00:37,530 --> 00:00:40,650 In the first pair, the two questions are duplicates;
14 00:00:40,650 --> 00:00:42,000 in the second, they are not.
15 00:00:43,283 --> 00:00:45,540 Another pair classification problem is
16 00:00:45,540 --> 00:00:47,400 when we want to know if two sentences are
17 00:00:47,400 --> 00:00:49,590 logically related or not,
18 00:00:49,590 --> 00:00:53,970 a problem called natural language inference, or NLI.
19 00:00:53,970 --> 00:00:57,000 In this example, taken from the MultiNLI dataset,
20 00:00:57,000 --> 00:00:59,880 we have a pair of sentences for each possible label:
21 00:00:59,880 --> 00:01:02,490 contradiction, neutral, or entailment,
22 00:01:02,490 --> 00:01:04,680 which is a fancy way of saying the first sentence
23 00:01:04,680 --> 00:01:05,793 implies the second.
24 00:01:06,930 --> 00:01:08,820 So classifying pairs of sentences is a problem
25 00:01:08,820 --> 00:01:10,260 worth studying.
26 00:01:10,260 --> 00:01:12,630 In fact, in the GLUE benchmark,
27 00:01:12,630 --> 00:01:15,750 which is an academic benchmark for text classification,
28 00:01:15,750 --> 00:01:17,910 eight of the 10 datasets are focused
29 00:01:17,910 --> 00:01:19,953 on tasks using pairs of sentences.
30 00:01:20,910 --> 00:01:22,560 That's why models like BERT
31 00:01:22,560 --> 00:01:25,320 are often pre-trained with a dual objective.
32 00:01:25,320 --> 00:01:27,660 On top of the language modeling objective,
33 00:01:27,660 --> 00:01:31,230 they often have an objective related to sentence pairs.
34 00:01:31,230 --> 00:01:34,320 For instance, during pretraining, BERT is shown
35 00:01:34,320 --> 00:01:36,810 pairs of sentences and must predict both
36 00:01:36,810 --> 00:01:39,930 the value of randomly masked tokens and whether the second
37 00:01:39,930 --> 00:01:41,830 sentence follows from the first or not.
38 00:01:43,084 --> 00:01:45,930 Fortunately, the tokenizer from the Transformers library
39 00:01:45,930 --> 00:01:49,170 has a nice API to deal with pairs of sentences.
40 00:01:49,170 --> 00:01:51,270 You just have to pass them as two arguments
41 00:01:51,270 --> 00:01:52,120 to the tokenizer.
42 00:01:53,430 --> 00:01:55,470 On top of the input IDs and the attention mask
43 00:01:55,470 --> 00:01:56,970 we studied already,
44 00:01:56,970 --> 00:01:59,910 it returns a new field called token type IDs,
45 00:01:59,910 --> 00:02:01,790 which tells the model which tokens belong
46 00:02:01,790 --> 00:02:03,630 to the first sentence
47 00:02:03,630 --> 00:02:05,943 and which ones belong to the second sentence.
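The snippet below is a minimal sketch of that API. The checkpoint name and the two example sentences are placeholders chosen for illustration; any BERT-style tokenizer that uses token type IDs behaves the same way.

```python
from transformers import AutoTokenizer

# Illustrative checkpoint (assumption): any BERT-style tokenizer returns token type IDs.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Passing the two sentences as two separate arguments tokenizes them as a pair.
inputs = tokenizer("My name is Sylvain.", "I work at Hugging Face.")

print(inputs.keys())
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

# 0 marks tokens of the first sentence (including [CLS] and the first [SEP]),
# 1 marks tokens of the second sentence (including the final [SEP]).
print(inputs["token_type_ids"])
```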
48 00:02:07,290 --> 00:02:09,840 Zooming in a little bit, here are the input IDs
49 00:02:09,840 --> 00:02:12,180 aligned with the tokens they correspond to,
50 00:02:12,180 --> 00:02:15,213 their respective token type IDs and attention mask.
51 00:02:16,080 --> 00:02:19,260 We can see the tokenizer also added special tokens.
52 00:02:19,260 --> 00:02:22,620 So we have a CLS token, the tokens from the first sentence,
53 00:02:22,620 --> 00:02:25,770 a SEP token, the tokens from the second sentence,
54 00:02:25,770 --> 00:02:27,003 and a final SEP token.
55 00:02:28,500 --> 00:02:30,570 If we have several pairs of sentences,
56 00:02:30,570 --> 00:02:32,840 we can tokenize them together by passing the list
57 00:02:32,840 --> 00:02:36,630 of first sentences, then the list of second sentences,
58 00:02:36,630 --> 00:02:39,300 and all the keyword arguments we studied already,
59 00:02:39,300 --> 00:02:40,353 like padding=True.
60 00:02:41,940 --> 00:02:43,140 Zooming in on the result,
61 00:02:43,140 --> 00:02:45,030 we can see the tokenizer also added padding
62 00:02:45,030 --> 00:02:48,090 to the second pair of sentences to make the two outputs
63 00:02:48,090 --> 00:02:51,360 the same length, and properly dealt with token type IDs
64 00:02:51,360 --> 00:02:53,643 and attention masks for the two sentences.
65 00:02:54,900 --> 00:02:57,573 This is then all ready to pass through our model.
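Here is a sketch of that batched case, reusing the tokenizer from the previous snippet; the sentence pairs are again made up for illustration, and `return_tensors="pt"` assumes PyTorch is installed.

```python
# First sentences and second sentences are passed as two parallel lists;
# padding=True pads every sequence to the length of the longest one in the batch.
batch = tokenizer(
    ["My name is Sylvain.", "Do you like pizza?"],
    ["I work at Hugging Face.", "The weather is nice today."],
    padding=True,
    return_tensors="pt",
)

print(batch["input_ids"].shape)   # (2, length of the longest sequence)
print(batch["token_type_ids"])    # 0 for the first sentence (and padding), 1 for the second
print(batch["attention_mask"])    # 0 wherever padding was added

# The batch can then be fed to a sequence classification model, for example:
# from transformers import AutoModelForSequenceClassification
# model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
# outputs = model(**batch)
```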