subtitles/en/23_what-is-dynamic-padding.srt

1
00:00:00,242 --> 00:00:02,909
(air whooshing)

2
00:00:05,460 --> 00:00:06,963
- What is dynamic padding?

3
00:00:08,630 --> 00:00:10,890
In the "Batching Inputs together" video,

4
00:00:10,890 --> 00:00:12,720
we have seen that to be able to group inputs

5
00:00:12,720 --> 00:00:15,300
of different lengths in the same batch,

6
00:00:15,300 --> 00:00:18,270
we need to add padding tokens to all the short inputs

7
00:00:18,270 --> 00:00:20,970
until they are all of the same length.

8
00:00:20,970 --> 00:00:24,600
Here, for instance, the longest sentence is the third one,

9
00:00:24,600 --> 00:00:27,270
and we need to add five, two, or seven pad tokens

10
00:00:27,270 --> 00:00:30,090
to the other sentences to have four sentences

11
00:00:30,090 --> 00:00:31,090
of the same length.

12
00:00:32,430 --> 00:00:33,900
When dealing with a whole dataset,

13
00:00:33,900 --> 00:00:36,633
there are various padding strategies we can apply.

14
00:00:37,560 --> 00:00:39,540
The most obvious one is to pad all the elements

15
00:00:39,540 --> 00:00:40,923
of the dataset to the same length:

16
00:00:40,923 --> 00:00:43,053
the length of the longest sample.

17
00:00:44,070 --> 00:00:45,330
This will then give us batches

18
00:00:45,330 --> 00:00:46,890
that all have the same shape,

19
00:00:46,890 --> 00:00:49,800
determined by the maximum sequence length.

20
00:00:49,800 --> 00:00:52,893
The downside is that batches composed of short sentences

21
00:00:52,893 --> 00:00:54,960
will have a lot of padding tokens,

22
00:00:54,960 --> 00:00:57,660
which will introduce more computations in the model

23
00:00:57,660 --> 00:00:58,910
that we ultimately don't need.

24
00:01:00,060 --> 00:01:03,300
To avoid this, another strategy is to pad the elements

25
00:01:03,300 --> 00:01:05,280
when we batch them together,

26
00:01:05,280 --> 00:01:08,190
to the longest sentence inside the batch.

27
00:01:08,190 --> 00:01:12,000
This way, batches composed of short inputs will be smaller

28
00:01:12,000 --> 00:01:13,920
than the batch containing the longest sentence

29
00:01:13,920 --> 00:01:15,510
in the dataset.

30
00:01:15,510 --> 00:01:18,063
This will yield some nice speedup on CPU and GPU.

31
00:01:19,110 --> 00:01:20,490
The downside is that all batches

32
00:01:20,490 --> 00:01:22,140
will then have different shapes,

33
00:01:22,140 --> 00:01:24,740
which slows down training on accelerators like TPUs.

34
00:01:26,160 --> 00:01:29,370
Let's see how to apply both strategies in practice.

35
00:01:29,370 --> 00:01:31,280
We have actually seen how to apply fixed padding

36
00:01:31,280 --> 00:01:33,390
in the "Datasets Overview" video,

37
00:01:33,390 --> 00:01:36,030
when we preprocessed the MRPC dataset:

38
00:01:36,030 --> 00:01:38,250
after loading the dataset and tokenizer,

39
00:01:38,250 --> 00:01:40,680
we applied the tokenization to the whole dataset

40
00:01:40,680 --> 00:01:42,480
with padding and truncation

41
00:01:42,480 --> 00:01:45,273
to make all samples of length 128.

42
00:01:46,530 --> 00:01:48,360
As a result, if we pass this dataset

43
00:01:48,360 --> 00:01:50,520
to a PyTorch DataLoader,

44
00:01:50,520 --> 00:01:55,503
we get batches of shape batch size, here 16, by 128.

45
00:01:57,060 --> 00:01:58,380
To apply dynamic padding,

46
00:01:58,380 --> 00:02:01,440
we must defer the padding to the batch preparation,

47
00:02:01,440 --> 00:02:04,740
so we remove that part from our tokenize function.

48
00:02:04,740 --> 00:02:06,150
We still leave the truncation part

49
00:02:06,150 --> 00:02:08,580
so that inputs that are bigger than the maximum length

50
00:02:08,580 --> 00:02:12,060
accepted by the model, usually 512,

51
00:02:12,060 --> 00:02:13,510
get truncated to that length.

52
00:02:14,940 --> 00:02:16,380
Then we pad our samples dynamically

53
00:02:16,380 --> 00:02:18,330
by using a data collator.

54
00:02:18,330 --> 00:02:20,280
Those classes in the Transformers library

55
00:02:20,280 --> 00:02:22,740
are responsible for applying all the final processing

56
00:02:22,740 --> 00:02:25,290
needed before forming a batch.

57
00:02:25,290 --> 00:02:28,470
Here, DataCollatorWithPadding will pad the samples

58
00:02:28,470 --> 00:02:31,083
to the maximum length inside the batch of sentences.

59
00:02:32,160 --> 00:02:35,310
We pass it to the PyTorch DataLoader as a collate function,

60
00:02:35,310 --> 00:02:37,620
then observe that the batches generated

61
00:02:37,620 --> 00:02:38,850
have various lengths,

62
00:02:38,850 --> 00:02:41,253
all way below the 128 from before.

63
00:02:42,660 --> 00:02:44,820
Dynamic padding will almost always be faster

64
00:02:44,820 --> 00:02:47,913
on CPUs and GPUs, so you should apply it if you can.

65
00:02:48,930 --> 00:02:51,330
Remember to switch back to fixed padding, however,

66
00:02:51,330 --> 00:02:53,490
if you run your training script on TPU

67
00:02:53,490 --> 00:02:55,293
or need batches of fixed shapes.

68
00:02:56,917 --> 00:02:59,584
(air whooshing)
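
---

For reference, here is a minimal sketch of the fixed-padding preprocessing described in the video, assuming the datasets and transformers libraries are installed; the bert-base-cased checkpoint and the MRPC column handling are assumptions based on the course's usual example, not something specified in this video.

from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

raw_datasets = load_dataset("glue", "mrpc")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # assumed checkpoint

def tokenize_function(examples):
    # Fixed padding: pad and truncate every sample to length 128.
    return tokenizer(
        examples["sentence1"],
        examples["sentence2"],
        padding="max_length",
        truncation=True,
        max_length=128,
    )

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

train_dataloader = DataLoader(tokenized_datasets["train"], batch_size=16, shuffle=True)
batch = next(iter(train_dataloader))
print(batch["input_ids"].shape)  # torch.Size([16, 128]) for every batch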
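
And a sketch of the dynamic-padding version using DataCollatorWithPadding, under the same assumptions; it reuses raw_datasets and tokenizer from the snippet above.

from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

def tokenize_function(examples):
    # Only truncate here; padding is deferred to batch creation time.
    return tokenizer(examples["sentence1"], examples["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")

# The collator pads each batch to the length of its longest sample.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

train_dataloader = DataLoader(
    tokenized_datasets["train"],
    batch_size=16,
    shuffle=True,
    collate_fn=data_collator,
)

for step, batch in enumerate(train_dataloader):
    print(batch["input_ids"].shape)  # varies per batch, well below 128
    if step >= 4:
        break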