subtitles/en/21_preprocessing-sentence-pairs-(pytorch).srt
1
00:00:00,000 --> 00:00:03,083
(graphics whooshing)
2
00:00:05,370 --> 00:00:07,413
- How to pre-process pairs of sentences.
3
00:00:09,150 --> 00:00:11,340
We have seen how to
tokenize single sentences
4
00:00:11,340 --> 00:00:12,877
and batch them together in the
5
00:00:12,877 --> 00:00:15,810
"Batching inputs together video."
6
00:00:15,810 --> 00:00:18,330
If this code looks unfamiliar to you,
7
00:00:18,330 --> 00:00:20,030
be sure to check that video again.
8
00:00:21,330 --> 00:00:24,543
Here we will focus on tasks that
classify pairs of sentences.
9
00:00:25,620 --> 00:00:28,470
For instance, we may want to
classify whether two texts
10
00:00:28,470 --> 00:00:30,360
are paraphrased or not.
11
00:00:30,360 --> 00:00:32,880
Here is an example taken
from the Quora Question Pairs
12
00:00:32,880 --> 00:00:37,530
dataset, which focuses on
identifying duplicate questions.
13
00:00:37,530 --> 00:00:40,650
In the first pair, the two
questions are duplicates,
14
00:00:40,650 --> 00:00:42,000
in the second they are not.
15
00:00:43,283 --> 00:00:45,540
Another pair classification problem is
16
00:00:45,540 --> 00:00:47,400
when we want to know if two sentences are
17
00:00:47,400 --> 00:00:49,590
logically related or not,
18
00:00:49,590 --> 00:00:53,970
a problem called natural
language inference or NLI.
19
00:00:53,970 --> 00:00:57,000
In this example, taken
from the MultiNLI dataset,
20
00:00:57,000 --> 00:00:59,880
we have a pair of sentences
for each possible label.
21
00:00:59,880 --> 00:01:02,490
Contradiction, neutral or entailment,
22
00:01:02,490 --> 00:01:04,680
which is a fancy way of
saying the first sentence
23
00:01:04,680 --> 00:01:05,793
implies the second.
24
00:01:06,930 --> 00:01:08,820
So classifying pairs of
sentences is a problem
25
00:01:08,820 --> 00:01:10,260
worth studying.
26
00:01:10,260 --> 00:01:12,630
In fact, in the GLUE benchmark,
27
00:01:12,630 --> 00:01:15,750
which is an academic benchmark
for text classification,
28
00:01:15,750 --> 00:01:17,910
eight of the 10 datasets are focused
29
00:01:17,910 --> 00:01:19,953
on tasks using pairs of sentences.
30
00:01:20,910 --> 00:01:22,560
That's why models like BERT
31
00:01:22,560 --> 00:01:25,320
are often pre-trained
with a dual objective.
32
00:01:25,320 --> 00:01:27,660
On top of the language modeling objective,
33
00:01:27,660 --> 00:01:31,230
they often have an objective
related to sentence pairs.
34
00:01:31,230 --> 00:01:34,320
For instance, during
pretraining, BERT is shown
35
00:01:34,320 --> 00:01:36,810
pairs of sentences and must predict both
36
00:01:36,810 --> 00:01:39,930
the value of randomly masked
tokens, and whether the second
37
00:01:39,930 --> 00:01:41,830
sentence follows from the first or not.
38
00:01:43,084 --> 00:01:45,930
Fortunately, the tokenizer
from the Transformers library
39
00:01:45,930 --> 00:01:49,170
has a nice API to deal
with pairs of sentences.
40
00:01:49,170 --> 00:01:51,270
You just have to pass
them as two arguments
41
00:01:51,270 --> 00:01:52,120
to the tokenizer.
42
00:01:53,430 --> 00:01:55,470
On top of the input IDs
and the attention mask
43
00:01:55,470 --> 00:01:56,970
we studied already,
44
00:01:56,970 --> 00:01:59,910
it returns a new field
called token type IDs,
45
00:01:59,910 --> 00:02:01,790
which tells the model which tokens belong
46
00:02:01,790 --> 00:02:03,630
to the first sentence,
47
00:02:03,630 --> 00:02:05,943
and which ones belong
to the second sentence.
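As a minimal sketch of that API (the checkpoint name and the example sentences are just placeholders, not taken from the video):

from transformers import AutoTokenizer

# Any checkpoint with a token-type vocabulary works; bert-base-uncased is illustrative.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Pass the two sentences of a pair as two separate arguments.
inputs = tokenizer("How old are you?", "What is your age?")

print(inputs.keys())
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
print(inputs["token_type_ids"])
# 0 for tokens of the first sentence, 1 for tokens of the second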
48
00:02:07,290 --> 00:02:09,840
Zooming in a little bit,
here we have the input IDs
49
00:02:09,840 --> 00:02:12,180
aligned with the tokens
they correspond to,
50
00:02:12,180 --> 00:02:15,213
their respective token
type IDs and attention masks.
51
00:02:16,080 --> 00:02:19,260
We can see the tokenizer
also added special tokens.
52
00:02:19,260 --> 00:02:22,620
So we have a CLS token, the
tokens from the first sentence,
53
00:02:22,620 --> 00:02:25,770
a SEP token, the tokens
from the second sentence,
54
00:02:25,770 --> 00:02:27,003
and a final SEP token.
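To make that layout concrete, a small self-contained sketch (same illustrative checkpoint and placeholder pair as above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative checkpoint
inputs = tokenizer("How old are you?", "What is your age?")     # placeholder pair

# Prints the tokens in the pattern described above:
# [CLS] <first sentence tokens> [SEP] <second sentence tokens> [SEP]
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"]))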
55
00:02:28,500 --> 00:02:30,570
If we have several pairs of sentences,
56
00:02:30,570 --> 00:02:32,840
we can tokenize them
together by passing the list
57
00:02:32,840 --> 00:02:36,630
of first sentences, then
the list of second sentences
58
00:02:36,630 --> 00:02:39,300
and all the keyword
arguments we studied already
59
00:02:39,300 --> 00:02:40,353
like padding=True.
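A sketch of that batched call (the sentence lists and the extra keyword arguments are illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative checkpoint

first_sentences = ["How old are you?", "The movie was great."]        # placeholder data
second_sentences = ["What is your age?", "I really enjoyed the film."]

batch = tokenizer(
    first_sentences,
    second_sentences,
    padding=True,         # pad the shorter pair to the length of the longer one
    return_tensors="pt",  # PyTorch tensors, since this is the PyTorch version
)
print(batch["input_ids"].shape)
print(batch["token_type_ids"])
print(batch["attention_mask"])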
60
00:02:41,940 --> 00:02:43,140
Zooming in on the result,
61
00:02:43,140 --> 00:02:45,030
we can see the tokenizer added padding
62
00:02:45,030 --> 00:02:48,090
to the second pair of sentences
to make the two outputs
63
00:02:48,090 --> 00:02:51,360
the same length, and properly
dealt with token type IDs
64
00:02:51,360 --> 00:02:53,643
and attention masks for the two sentences.
65
00:02:54,900 --> 00:02:57,573
This is then all ready to
pass through our model.
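For completeness, a sketch of that last step, assuming a sequence-classification head (the checkpoint and num_labels are placeholders):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "bert-base-uncased"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

batch = tokenizer(
    ["How old are you?"], ["What is your age?"],  # placeholder pair
    padding=True, return_tensors="pt",
)
with torch.no_grad():
    outputs = model(**batch)
print(outputs.logits.shape)  # torch.Size([1, 2])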