subtitles/en/17_batching-inputs-together-(pytorch).srt
1
00:00:00,373 --> 00:00:02,956
(subtle blast)
2
00:00:05,400 --> 00:00:07,590
- How to batch inputs together.
3
00:00:07,590 --> 00:00:09,240
In this video, we will see how
4
00:00:09,240 --> 00:00:11,073
to batch input sequences together.
5
00:00:12,137 --> 00:00:15,420
In general, the sentences we
want to pass through our model
6
00:00:15,420 --> 00:00:17,670
won't all have the same lengths.
7
00:00:17,670 --> 00:00:19,740
Here, we are using the model we saw
8
00:00:19,740 --> 00:00:22,080
in the sentiment analysis pipeline
9
00:00:22,080 --> 00:00:24,063
and want to classify two sentences.
10
00:00:24,900 --> 00:00:27,360
When tokenizing them
and mapping each token
11
00:00:27,360 --> 00:00:29,610
to its corresponding input IDs,
12
00:00:29,610 --> 00:00:31,593
we get two lists of different lengths.
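A rough sketch of this step, assuming the same checkpoint as the sentiment analysis pipeline (distilbert-base-uncased-finetuned-sst-2-english) and two placeholder sentences:

from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Placeholder sentences; any two sentences of different lengths will do.
sentence1 = "I've been waiting for a HuggingFace course my whole life."
sentence2 = "I hate this so much!"

# Tokenize each sentence and map its tokens to input IDs.
ids1 = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sentence1))
ids2 = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sentence2))
print(len(ids1), len(ids2))  # two lists of different lengths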
13
00:00:33,240 --> 00:00:35,340
Trying to create a tensor or a NumPy array
14
00:00:35,340 --> 00:00:38,220
from those two lists
will result in an error,
15
00:00:38,220 --> 00:00:41,043
because all arrays and
tensors should be rectangular.
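Continuing the sketch above, this is the error the ragged lists produce:

import torch

# The nested lists are not rectangular, so this raises a ValueError.
batched_ids = [ids1, ids2]
torch.tensor(batched_ids)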
16
00:00:42,240 --> 00:00:44,160
One way to overcome this limitation
17
00:00:44,160 --> 00:00:45,690
is to make the second sentence
18
00:00:45,690 --> 00:00:47,640
the same length as the first
19
00:00:47,640 --> 00:00:50,463
by adding a special token
as many times as necessary.
20
00:00:51,600 --> 00:00:53,970
Another way would be to
truncate the first sequence
21
00:00:53,970 --> 00:00:55,710
to the length of the second,
22
00:00:55,710 --> 00:00:58,140
but we would then lose
a lot of information
23
00:00:58,140 --> 00:01:01,083
that might be necessary to
properly classify the sentence.
24
00:01:02,190 --> 00:01:04,830
In general, we only truncate sentences
25
00:01:04,830 --> 00:01:06,840
when they are longer
than the maximum length
26
00:01:06,840 --> 00:01:08,073
the model can handle.
27
00:01:09,720 --> 00:01:11,850
The value used to pad the second sentence
28
00:01:11,850 --> 00:01:13,740
should not be picked randomly;
29
00:01:13,740 --> 00:01:16,680
the model has been pretrained
with a certain padding ID,
30
00:01:16,680 --> 00:01:19,533
which you can find in
tokenizer.pad_token_id.
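Continuing the sketch, the second list can be padded by hand with that value:

pad_id = tokenizer.pad_token_id  # padding ID the model was pretrained with

# Append the pad token ID until the second list matches the first in length.
padded_ids2 = ids2 + [pad_id] * (len(ids1) - len(ids2))
batched_ids = [ids1, padded_ids2]  # now rectangular, so it can become a tensor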
31
00:01:21,090 --> 00:01:22,800
Now that we have padded our sentences,
32
00:01:22,800 --> 00:01:24,303
we can make a batch with them.
33
00:01:25,380 --> 00:01:28,320
If we pass the two sentences
to the model separately
34
00:01:28,320 --> 00:01:30,120
and then batched together, however,
35
00:01:30,120 --> 00:01:32,100
we notice that we don't
get the same results
36
00:01:32,100 --> 00:01:34,060
for the sentence that is padded,
37
00:01:34,060 --> 00:01:35,403
here, the second one.
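A sketch of that comparison, reusing ids1, padded_ids2, and the checkpoint from the snippets above:

import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# Second sentence on its own, without any padding.
print(model(torch.tensor([ids2])).logits)

# Same sentence inside the padded batch: its logits come out different.
print(model(torch.tensor([ids1, padded_ids2])).logits)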
38
00:01:36,390 --> 00:01:39,420
Is there a bug in the
Transformers Library? No.
39
00:01:39,420 --> 00:01:40,770
If you remember that Transformer models
40
00:01:40,770 --> 00:01:42,810
make heavy use of attention layers,
41
00:01:42,810 --> 00:01:45,210
this should not come as a total surprise;
42
00:01:45,210 --> 00:01:48,277
when computing the contextual
representation of each token,
43
00:01:48,277 --> 00:01:50,910
the attention layers look
at all the other words
44
00:01:50,910 --> 00:01:52,410
in the sentence.
45
00:01:52,410 --> 00:01:53,850
If we have just the sentence
46
00:01:53,850 --> 00:01:56,970
or the sentence with several
padding tokens added,
47
00:01:56,970 --> 00:01:59,073
it's logical we don't get the same values.
48
00:02:00,270 --> 00:02:03,030
To get the same results
with or without padding,
49
00:02:03,030 --> 00:02:05,340
we need to indicate to
the attention layers
50
00:02:05,340 --> 00:02:08,070
that they should ignore
those padding tokens.
51
00:02:08,070 --> 00:02:10,620
This is done by creating
an attention mask,
52
00:02:10,620 --> 00:02:13,320
a tensor with the same
shape as the input IDs,
53
00:02:13,320 --> 00:02:14,733
with zeros and ones.
54
00:02:15,780 --> 00:02:18,120
Ones indicate the tokens
the attention layers
55
00:02:18,120 --> 00:02:20,100
should consider in the context
56
00:02:20,100 --> 00:02:22,100
and zeros the tokens they should ignore.
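For the running sketch, such a mask can be built by hand:

# 1 = token the attention layers should consider, 0 = padding token to ignore.
attention_mask = [
    [1] * len(ids1),
    [1] * len(ids2) + [0] * (len(ids1) - len(ids2)),
]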
57
00:02:23,520 --> 00:02:26,760
Now, passing this attention
mask along with the input IDs
58
00:02:26,760 --> 00:02:28,170
will give us the same results
59
00:02:28,170 --> 00:02:31,170
as when we sent the two sentences
individually to the model.
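In the running sketch, reusing the model, batch, and mask built above:

outputs = model(
    torch.tensor(batched_ids),
    attention_mask=torch.tensor(attention_mask),
)
print(outputs.logits)  # matches the results of the sentences passed individually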
60
00:02:32,400 --> 00:02:34,950
This is all done behind
the scenes by the tokenizer
61
00:02:34,950 --> 00:02:36,900
when you apply it to several sentences
62
00:02:36,900 --> 00:02:38,613
with the flag padding=True.
63
00:02:39,599 --> 00:02:41,490
It will apply the padding
with the proper value
64
00:02:41,490 --> 00:02:43,140
to the shorter sentences
65
00:02:43,140 --> 00:02:45,423
and create the appropriate attention mask.
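A sketch of that one-call version, under the same checkpoint assumption as before:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)
batch = tokenizer(
    ["I've been waiting for a HuggingFace course my whole life.",
     "I hate this so much!"],
    padding=True,
    return_tensors="pt",
)
print(batch["input_ids"])       # shorter sentence padded with pad_token_id
print(batch["attention_mask"])  # 1s for real tokens, 0s for the padding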
66
00:02:46,993 --> 00:02:49,576
(subtle blast)