(logo whooshes)

- How to batch inputs together. In this video, we'll see how to batch input sequences together.

In general, the sentences we want to pass through our model won't all have the same lengths. Here, we are using the model we saw in the sentiment analysis pipeline and want to classify two sentences. When tokenizing them and mapping each token to its corresponding input IDs, we get two lists of different lengths.

Trying to create a tensor or NumPy array from those two lists will result in an error, because all arrays and tensors should be rectangular.

One way to overcome this limitation is to make the second sentence the same length as the first by adding a special token as many times as necessary. Another way would be to truncate the first sequence to the length of the second, but we would then lose a lot of information that may be necessary to properly classify the sentence. In general, we only truncate sentences when they are longer than the maximum length the model can handle.

The value used to pad the second sentence should not be picked randomly: the model has been pretrained with a certain padding ID, which you can find in tokenizer.pad_token_id.

Now that we have padded our sentences, we can make a batch with them. However, if we compare the results of passing the two sentences to the model separately with the results of passing them batched together, we notice that we don't get the same results for the sentence that is padded, here, the second one. Is this a bug in the Transformers library? No. If you remember that Transformer models make heavy use of attention layers, it should not come as a total surprise. When computing the contextual representation of each token, the attention layers look at all the other words in the sentence. If we have just the sentence, or the sentence with several padding tokens added, it's logical that we don't get the same values.

To get the same results with or without padding, we need to indicate to the attention layers that they should ignore those padding tokens.
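As a concrete illustration of the padding step and the discrepancy described above, here is a minimal sketch in Python with the Transformers library. The checkpoint name and the two example sentences are assumptions chosen for illustration; the video simply uses the model from the sentiment analysis pipeline.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed checkpoint: the one used by the sentiment-analysis pipeline
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# Example sentences (assumed for illustration)
sentence1 = "I've been waiting for a HuggingFace course my whole life."
sentence2 = "I hate this."

# Two lists of input IDs with different lengths
ids1 = tokenizer(sentence1).input_ids
ids2 = tokenizer(sentence2).input_ids

# Pad the shorter list with the model's padding ID so both have the same length
padded_ids2 = ids2 + [tokenizer.pad_token_id] * (len(ids1) - len(ids2))

# Passing the sentences separately vs. batched (without an attention mask)
# gives different logits for the padded second sentence
print(model(torch.tensor([ids1])).logits)
print(model(torch.tensor([ids2])).logits)
print(model(torch.tensor([ids1, padded_ids2])).logits)
```

In this sketch, the batched logits for the padded second sentence do not match the logits obtained when it is passed on its own; telling the attention layers which positions are padding is what fixes this.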
Indicating this is done by creating an attention mask, a tensor with the same shape as the input IDs, filled with zeros and ones. Ones indicate the tokens the attention layers should consider in the context, and zeros the tokens they should ignore.

Now, passing this attention mask along with the input IDs will give us the same results as when we sent the two sentences individually to the model.

This is all done behind the scenes by the tokenizer when you apply it to several sentences with the flag padding=True. It will apply the padding with the proper value to the shorter sentences and create the appropriate attention mask.
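Here is a minimal, self-contained sketch of both approaches: building the attention mask by hand, and letting the tokenizer handle padding and masking with padding=True. The checkpoint and sentences are again assumptions for illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Same assumed checkpoint and example sentences as in the sketch above
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sentences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this.",
]

ids = [tokenizer(s).input_ids for s in sentences]
max_len = max(len(i) for i in ids)

# Pad each sequence and build the matching attention mask:
# 1 for real tokens, 0 for padding tokens the attention layers should ignore
input_ids = [i + [tokenizer.pad_token_id] * (max_len - len(i)) for i in ids]
attention_mask = [[1] * len(i) + [0] * (max_len - len(i)) for i in ids]

outputs = model(
    torch.tensor(input_ids),
    attention_mask=torch.tensor(attention_mask),
)
print(outputs.logits)  # should now match the results for each sentence passed individually

# The tokenizer does all of this behind the scenes with padding=True
batch = tokenizer(sentences, padding=True, return_tensors="pt")
print(model(**batch).logits)
```

With the mask in place, the batched logits match the ones obtained for each sentence individually, which is exactly what the tokenizer's padding=True flag produces for you.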