- What is dynamic padding?

In the "Batching Inputs together" video, we have seen that to be able to group inputs of different lengths in the same batch, we need to add padding tokens to all the short inputs until they are all of the same length. Here, for instance, the longest sentence is the third one, and we need to add five, two, or seven pad tokens to the other sentences to have four sentences of the same length.

When dealing with a whole dataset, there are various padding strategies we can apply. The most obvious one is to pad all the elements of the dataset to the same length: the length of the longest sample. This will then give us batches that all have the same shape, determined by the maximum sequence length. The downside is that batches composed of short sentences will have a lot of padding tokens, which introduces computations in the model that we ultimately don't need.

To avoid this, another strategy is to pad the elements when we batch them together, to the longest sentence inside the batch. This way, batches composed of short inputs will be smaller than the batch containing the longest sentence in the dataset. This will yield some nice speedup on CPU and GPU. The downside is that all batches will then have different shapes, which slows down training on accelerators like TPUs.

Let's see how to apply both strategies in practice.
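As a minimal sketch of the two strategies, the snippet below contrasts fixed padding with per-batch ("longest") padding directly in the tokenizer call. The checkpoint name and the example sentences are assumptions chosen for illustration; they are not specified at this point in the video.

```python
from transformers import AutoTokenizer

# Hypothetical checkpoint, used only to illustrate the two padding strategies
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sentences = [
    "A short sentence.",
    "A slightly longer sentence than the first one.",
    "This is the longest sentence of the three, so the other two get padded to match it.",
]

# Fixed padding: every sample is padded to the same length, regardless of batch contents
fixed = tokenizer(
    sentences, padding="max_length", max_length=128, truncation=True, return_tensors="pt"
)
print(fixed["input_ids"].shape)  # torch.Size([3, 128])

# Dynamic padding: samples are only padded to the longest sentence inside this batch
dynamic = tokenizer(sentences, padding="longest", return_tensors="pt")
print(dynamic["input_ids"].shape)  # torch.Size([3, <length of the longest tokenized sentence>])
```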
We have actually seen how to apply fixed padding in the Datasets Overview video, when we preprocessed the MRPC dataset: after loading the dataset and tokenizer, we applied the tokenization to the whole dataset with padding and truncation to make all samples of length 128. As a result, if we pass this dataset to a PyTorch DataLoader, we get batches of shape batch size (here 16) by 128.

To apply dynamic padding, we must defer the padding to the batch preparation, so we remove that part from our tokenize function. We still leave the truncation part so that inputs that are bigger than the maximum length accepted by the model, usually 512, get truncated to that length.

Then we pad our samples dynamically by using a data collator. Those classes in the Transformers library are responsible for applying all the final processing needed before forming a batch; here DataCollatorWithPadding will pad the samples to the maximum length inside the batch of sentences. We pass it to the PyTorch DataLoader as a collate function, then observe that the generated batches have various lengths, all well below the 128 from before.

Dynamic batching will almost always be faster on CPUs and GPUs, so you should apply it if you can. Remember to switch back to fixed padding, however, if you run your training script on TPU or need batches of fixed shapes.
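Below is a sketch of the dynamic-padding workflow described above, assuming the MRPC split of GLUE and a bert-base-uncased checkpoint (the specific checkpoint and batch size of 16 are assumptions consistent with the video's description, not shown in this transcript).

```python
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(example):
    # No padding here: padding is deferred to batch preparation.
    # Truncation still caps inputs at the model's maximum length (512 for BERT).
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

# The data collator pads each batch to the longest sample inside that batch
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

train_dataloader = DataLoader(
    tokenized_datasets["train"], batch_size=16, shuffle=True, collate_fn=data_collator
)

for batch in train_dataloader:
    # Shapes vary from batch to batch, e.g. torch.Size([16, 52]), torch.Size([16, 67]), ...
    print(batch["input_ids"].shape)
    break
```

To reproduce the fixed-padding behavior instead (for TPUs or when fixed shapes are required), the tokenize function would pass `padding="max_length", max_length=128` and the custom collate function would no longer be needed.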