subtitles/en/23_what-is-dynamic-padding.srt
1
00:00:00,242 --> 00:00:02,909
(air whooshing)
2
00:00:05,460 --> 00:00:06,963
- What is dynamic padding?
3
00:00:08,630 --> 00:00:10,890
In the "Batching Inputs together" video,
4
00:00:10,890 --> 00:00:12,720
we saw that to be able
to group inputs
5
00:00:12,720 --> 00:00:15,300
of different lengths in the same batch,
6
00:00:15,300 --> 00:00:18,270
we need to add padding tokens
to all the short inputs
7
00:00:18,270 --> 00:00:20,970
until they are all of the same length.
8
00:00:20,970 --> 00:00:24,600
Here, for instance, the longest
sentence is the third one,
9
00:00:24,600 --> 00:00:27,270
and we need to add five,
two, or seven pad tokens
10
00:00:27,270 --> 00:00:30,090
to the other sentences
to have four sentences
11
00:00:30,090 --> 00:00:31,090
of the same length.
12
00:00:32,430 --> 00:00:33,900
When dealing with a whole dataset,
13
00:00:33,900 --> 00:00:36,633
there are various padding
strategies we can apply.
14
00:00:37,560 --> 00:00:39,540
The most obvious one is
to pad all the elements
15
00:00:39,540 --> 00:00:40,923
of the dataset to the same length:
16
00:00:40,923 --> 00:00:43,053
the length of the longest sample.
17
00:00:44,070 --> 00:00:45,330
This will then give us batches
18
00:00:45,330 --> 00:00:46,890
that all have the same shape
19
00:00:46,890 --> 00:00:49,800
determined by the maximum sequence length.
20
00:00:49,800 --> 00:00:52,893
The downside is that batches
composed of short sentences
21
00:00:52,893 --> 00:00:54,960
will have a lot of padding tokens
22
00:00:54,960 --> 00:00:57,660
which will introduce more
computations in the model
23
00:00:57,660 --> 00:00:58,910
that we ultimately don't need.
24
00:01:00,060 --> 00:01:03,300
To avoid this, another
strategy is to pad the elements
25
00:01:03,300 --> 00:01:05,280
when we batch them together,
26
00:01:05,280 --> 00:01:08,190
to the longest sentence inside the batch.
27
00:01:08,190 --> 00:01:12,000
This way, batches composed of
short inputs will be smaller
28
00:01:12,000 --> 00:01:13,920
than the batch containing
the longest sentence
29
00:01:13,920 --> 00:01:15,510
in the dataset.
30
00:01:15,510 --> 00:01:18,063
This will yield some nice
speedup on CPU and GPU.
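To make the per-batch strategy concrete, here is a minimal sketch (the checkpoint name and the example sentences are assumptions for illustration, not taken from the video): the tokenizer is asked to pad only up to the longest sequence in the batch rather than to a fixed length.

```python
from transformers import AutoTokenizer

# Hypothetical checkpoint, chosen only for illustration.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sentences = [
    "A short sentence.",
    "Another short one.",
    "This sentence is by far the longest of the four, so it decides the padded length.",
    "A medium-length sentence here.",
]

# padding=True (alias "longest") pads only up to the longest sequence
# of this particular batch, not to a global maximum length.
batch = tokenizer(sentences, padding=True, return_tensors="pt")
print(batch["input_ids"].shape)  # (4, length of the longest tokenized sentence)
```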
31
00:01:19,110 --> 00:01:20,490
The downside is that all batches
32
00:01:20,490 --> 00:01:22,140
will then have different shapes,
33
00:01:22,140 --> 00:01:24,740
which slows down training
on accelerators like TPUs.
34
00:01:26,160 --> 00:01:29,370
Let's see how to apply both
strategies in practice.
35
00:01:29,370 --> 00:01:31,280
We actually saw how
to apply fixed padding
36
00:01:31,280 --> 00:01:33,390
in the "Datasets Overview" video,
37
00:01:33,390 --> 00:01:36,030
when we preprocessed the MRPC dataset:
38
00:01:36,030 --> 00:01:38,250
after loading the dataset and tokenizer,
39
00:01:38,250 --> 00:01:40,680
we applied the tokenization
to the whole dataset
40
00:01:40,680 --> 00:01:42,480
with padding and truncation
41
00:01:42,480 --> 00:01:45,273
to make all samples of length 128.
42
00:01:46,530 --> 00:01:48,360
As a result, if we pass this dataset
43
00:01:48,360 --> 00:01:50,520
to a PyTorch DataLoader,
44
00:01:50,520 --> 00:01:55,503
we get batches of shape
batch size, here 16, by 128.
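A minimal sketch of that fixed-padding pipeline, assuming the MRPC subset of GLUE and a BERT-style checkpoint (the checkpoint name and the column clean-up are assumptions, not shown on screen):

```python
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

raw_datasets = load_dataset("glue", "mrpc")
# Hypothetical checkpoint; any BERT-like model works the same way.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(examples):
    # Fixed padding: every sample is padded (and truncated) to 128 tokens.
    return tokenizer(
        examples["sentence1"],
        examples["sentence2"],
        padding="max_length",
        truncation=True,
        max_length=128,
    )

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
# Keep only the columns the model expects, as tensors.
tokenized_datasets = tokenized_datasets.remove_columns(["idx", "sentence1", "sentence2"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

train_dataloader = DataLoader(tokenized_datasets["train"], batch_size=16, shuffle=True)
batch = next(iter(train_dataloader))
print(batch["input_ids"].shape)  # torch.Size([16, 128])
```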
45
00:01:57,060 --> 00:01:58,380
To apply dynamic padding,
46
00:01:58,380 --> 00:02:01,440
we must defer the padding
to the batch preparation,
47
00:02:01,440 --> 00:02:04,740
so we remove that part
from our tokenize function.
48
00:02:04,740 --> 00:02:06,150
We still leave the truncation part
49
00:02:06,150 --> 00:02:08,580
so that inputs that are
bigger than the maximum length
50
00:02:08,580 --> 00:02:12,060
accepted by the model, usually 512,
51
00:02:12,060 --> 00:02:13,510
get truncated to that length.
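Concretely, the tokenize function loses its padding arguments and keeps only truncation, roughly:

```python
def tokenize_function(examples):
    # No padding here any more: padding is deferred to batch time.
    # Truncation still caps inputs at the model's maximum length (e.g. 512 for BERT).
    return tokenizer(examples["sentence1"], examples["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
```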
52
00:02:14,940 --> 00:02:16,380
Then we pad our samples dynamically
53
00:02:16,380 --> 00:02:18,330
by using a data collator.
54
00:02:18,330 --> 00:02:20,280
Those classes in the Transformers library
55
00:02:20,280 --> 00:02:22,740
are responsible for applying
all the final processing
56
00:02:22,740 --> 00:02:25,290
needed before forming a batch,
57
00:02:25,290 --> 00:02:28,470
here DataCollatorWithPadding
will pad the samples
58
00:02:28,470 --> 00:02:31,083
to the maximum length inside
the batch of sentences.
59
00:02:32,160 --> 00:02:35,310
We pass it to the PyTorch
DataLoader as a collate function,
60
00:02:35,310 --> 00:02:37,620
then observe that the batches generated
61
00:02:37,620 --> 00:02:38,850
have various lengths,
62
00:02:38,850 --> 00:02:41,253
all way below the 128 from before.
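Putting those two steps together, a sketch of the dynamic-padding DataLoader might look like this (it assumes the string columns were removed and the label column renamed, as in the fixed-padding sketch above):

```python
from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

train_dataloader = DataLoader(
    tokenized_datasets["train"],
    batch_size=16,
    shuffle=True,
    collate_fn=data_collator,  # pads each batch to its own longest sample
)

for step, batch in enumerate(train_dataloader):
    # Shapes vary from batch to batch, e.g. [16, 67], [16, 72], ...
    print(batch["input_ids"].shape)
    if step >= 4:
        break
```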
63
00:02:42,660 --> 00:02:44,820
Dynamic padding will
almost always be faster
64
00:02:44,820 --> 00:02:47,913
on CPUs and GPUs, so you
should apply it if you can.
65
00:02:48,930 --> 00:02:51,330
Remember to switch back
to fixed padding, however,
66
00:02:51,330 --> 00:02:53,490
if you run your training script on TPU
67
00:02:53,490 --> 00:02:55,293
or need batches of fixed shapes.
68
00:02:56,917 --> 00:02:59,584
(air whooshing)