1
00:00:00,225 --> 00:00:02,892
(air whooshing)
2
00:00:05,578 --> 00:00:09,180
- How to preprocess pairs of sentences?
3
00:00:09,180 --> 00:00:11,490
We have seen how to
tokenize single sentences
4
00:00:11,490 --> 00:00:13,020
and batch them together
5
00:00:13,020 --> 00:00:15,660
in the "Batching inputs together" video.
6
00:00:15,660 --> 00:00:18,060
If this code looks unfamiliar to you,
7
00:00:18,060 --> 00:00:19,760
be sure to check that video again!
8
00:00:21,101 --> 00:00:22,110
Here, we will focus on tasks
9
00:00:22,110 --> 00:00:24,033
that classify pairs of sentences.
10
00:00:24,900 --> 00:00:27,030
For instance, we may want to classify
11
00:00:27,030 --> 00:00:29,820
whether two texts are paraphrases or not.
12
00:00:29,820 --> 00:00:30,900
Here is an example taken
13
00:00:30,900 --> 00:00:33,180
from the Quora Question Pairs dataset,
14
00:00:33,180 --> 00:00:36,033
which focuses on identifying
duplicate questions.
15
00:00:37,110 --> 00:00:40,650
In the first pair, the two
questions are duplicates;
16
00:00:40,650 --> 00:00:43,620
in the second, they are not.
17
00:00:43,620 --> 00:00:44,730
Another classification problem
18
00:00:44,730 --> 00:00:46,980
is when we want to know if two sentences
19
00:00:46,980 --> 00:00:49,290
are logically related or not,
20
00:00:49,290 --> 00:00:52,173
a problem called Natural
Language Inference or NLI.
21
00:00:53,100 --> 00:00:55,830
In this example taken
from the MultiNLI dataset,
22
00:00:55,830 --> 00:00:59,460
we have a pair of sentences
for each possible label:
23
00:00:59,460 --> 00:01:02,400
contradiction, neutral or entailment,
24
00:01:02,400 --> 00:01:04,680
which is a fancy way of
saying the first sentence
25
00:01:04,680 --> 00:01:05,853
implies the second.
26
00:01:07,140 --> 00:01:09,000
So classifying pairs of sentences
27
00:01:09,000 --> 00:01:10,533
is a problem worth studying.
28
00:01:11,370 --> 00:01:13,770
In fact, in the GLUE benchmark,
29
00:01:13,770 --> 00:01:16,830
which is an academic benchmark
for text classification,
30
00:01:16,830 --> 00:01:19,680
eight of the 10 datasets
are focused on tasks
31
00:01:19,680 --> 00:01:20,973
using pairs of sentences.
32
00:01:22,110 --> 00:01:24,720
That's why models like
BERT are often pretrained
33
00:01:24,720 --> 00:01:26,520
with a dual objective:
34
00:01:26,520 --> 00:01:28,890
on top of the language modeling objective,
35
00:01:28,890 --> 00:01:32,010
they often have an objective
related to sentence pairs.
36
00:01:32,010 --> 00:01:34,560
For instance, during pretraining,
37
00:01:34,560 --> 00:01:36,690
BERT is shown pairs of sentences
38
00:01:36,690 --> 00:01:39,900
and must predict both the
value of randomly masked tokens
39
00:01:39,900 --> 00:01:41,250
and whether the second sentence
40
00:01:41,250 --> 00:01:42,903
follows the first or not.
41
00:01:44,070 --> 00:01:47,100
Fortunately, the tokenizer
from the Transformers library
42
00:01:47,100 --> 00:01:50,550
has a nice API to deal
with pairs of sentences:
43
00:01:50,550 --> 00:01:52,650
you just have to pass
them as two arguments
44
00:01:52,650 --> 00:01:53,613
to the tokenizer.
45
00:01:54,900 --> 00:01:56,040
On top of the input IDs
46
00:01:56,040 --> 00:01:58,440
and the attention mask we studied already,
47
00:01:58,440 --> 00:02:01,530
it returns a new field
called token type IDs,
48
00:02:01,530 --> 00:02:03,210
which tells the model which tokens
49
00:02:03,210 --> 00:02:05,100
belong to the first sentence
50
00:02:05,100 --> 00:02:07,350
and which ones belong
to the second sentence.
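Here is a minimal sketch of that call, assuming the Transformers library is installed. The helper name encode_pair, the example sentences and the bert-base-uncased checkpoint are assumptions made for illustration; the narration does not name a checkpoint.

```python
def encode_pair(first, second, checkpoint="bert-base-uncased"):
    """Tokenize two sentences as a single pair (hypothetical helper)."""
    # Imported lazily, so defining the helper does not require the library.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    # Passing the sentences as two positional arguments encodes them as one pair.
    return tokenizer(first, second)


# Example usage (downloads the checkpoint on first call):
# encoding = encode_pair("My name is Sylvain.", "I work at Hugging Face.")
# print(list(encoding.keys()))
```

With a BERT checkpoint, the returned encoding should contain the input_ids, token_type_ids and attention_mask fields discussed here.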
51
00:02:08,670 --> 00:02:11,430
Zooming in a little bit,
here are the input IDs,
52
00:02:11,430 --> 00:02:13,710
aligned with the tokens
they correspond to,
53
00:02:13,710 --> 00:02:17,193
their respective token
type ID and attention mask.
54
00:02:18,540 --> 00:02:21,300
We can see the tokenizer
also added special tokens
55
00:02:21,300 --> 00:02:25,230
so we have a CLS token, the
tokens from the first sentence,
56
00:02:25,230 --> 00:02:28,590
a SEP token, the tokens
from the second sentence,
57
00:02:28,590 --> 00:02:30,153
and a final SEP token.
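To make that layout concrete, here is a toy sketch in plain Python. It is not the library's implementation: build_pair is a hypothetical helper, the sentences are already split into words rather than WordPiece tokens, and the token type values follow BERT's convention (0 up to and including the first SEP, 1 afterwards).

```python
def build_pair(first_tokens, second_tokens):
    """Lay out a sentence pair as [CLS] first [SEP] second [SEP]."""
    tokens = ["[CLS]"] + first_tokens + ["[SEP]"] + second_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the first sentence and its [SEP];
    # segment 1 covers the second sentence and the final [SEP].
    token_type_ids = [0] * (len(first_tokens) + 2) + [1] * (len(second_tokens) + 1)
    # Nothing is padded yet, so every position is attended to.
    attention_mask = [1] * len(tokens)
    return tokens, token_type_ids, attention_mask


tokens, type_ids, mask = build_pair(
    ["my", "name", "is", "sylvain"], ["i", "work", "here"]
)
for token, type_id, m in zip(tokens, type_ids, mask):
    print(f"{token:10} type={type_id} mask={m}")
```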
58
00:02:31,680 --> 00:02:33,720
If we have several pairs of sentences,
59
00:02:33,720 --> 00:02:35,640
we can tokenize them together
60
00:02:35,640 --> 00:02:38,280
by passing the list of first sentences,
61
00:02:38,280 --> 00:02:40,710
then the list of second sentences
62
00:02:40,710 --> 00:02:43,050
and all the keyword
arguments we studied already,
63
00:02:43,050 --> 00:02:44,133
like padding=True.
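A minimal sketch of the batched call, again assuming the Transformers library is installed. The helper name encode_pairs and the bert-base-uncased checkpoint are assumptions for illustration.

```python
def encode_pairs(first_sentences, second_sentences, checkpoint="bert-base-uncased"):
    """Tokenize several sentence pairs at once (hypothetical helper)."""
    # Imported lazily, so defining the helper does not require the library.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    # The i-th first sentence is paired with the i-th second sentence;
    # padding=True pads every pair to the longest one in the batch.
    return tokenizer(first_sentences, second_sentences, padding=True)


# Example usage (downloads the checkpoint on first call):
# batch = encode_pairs(
#     ["My name is Sylvain.", "Going to the cinema."],
#     ["I work at Hugging Face.", "This movie is great."],
# )
```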
64
00:02:45,510 --> 00:02:46,770
Zooming in on the result,
65
00:02:46,770 --> 00:02:49,050
we can see how the tokenizer added padding
66
00:02:49,050 --> 00:02:50,940
to the second pair of sentences,
67
00:02:50,940 --> 00:02:53,490
to make the two outputs the same length.
68
00:02:53,490 --> 00:02:55,620
It also properly dealt with token type IDs
69
00:02:55,620 --> 00:02:57,720
and attention masks for the two sentences.
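The effect of padding on the three fields can be sketched in plain Python. This is a toy illustration, not the library's code: pad_batch is a hypothetical helper, most of the input IDs are made up, and padding on the right with ID 0 is an assumption that matches BERT's default settings.

```python
def pad_batch(batch, pad_id=0):
    """Pad every encoded pair to the length of the longest one in the batch."""
    max_len = max(len(item["input_ids"]) for item in batch)
    padded = []
    for item in batch:
        n_pad = max_len - len(item["input_ids"])
        padded.append({
            "input_ids": item["input_ids"] + [pad_id] * n_pad,
            # Padding positions get token type 0 here...
            "token_type_ids": item["token_type_ids"] + [0] * n_pad,
            # ...and attention mask 0, so the model ignores them.
            "attention_mask": item["attention_mask"] + [0] * n_pad,
        })
    return padded


# 101 and 102 stand for the CLS and SEP tokens; the other IDs are invented.
batch = [
    {"input_ids": [101, 7, 8, 9, 102, 5, 6, 102],
     "token_type_ids": [0, 0, 0, 0, 0, 1, 1, 1],
     "attention_mask": [1] * 8},
    {"input_ids": [101, 7, 102, 5, 102],
     "token_type_ids": [0, 0, 0, 1, 1],
     "attention_mask": [1] * 5},
]
padded = pad_batch(batch)
for item in padded:
    print(item["input_ids"], item["attention_mask"])
```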
70
00:02:59,010 --> 00:03:01,460
This is then all ready to
pass through our model!
71
00:03:03,799 --> 00:03:06,466
(air whooshing)