- What is dynamic padding?

In the "Batching Inputs together" video, we have seen that to be able to group inputs of different lengths in the same batch, we need to add padding tokens to all the short inputs until they are all of the same length. Here, for instance, the longest sentence is the third one, and we need to add five, two, or seven pad tokens to the other sentences to have four sentences of the same length.

When dealing with a whole dataset, there are various padding strategies we can apply. The most obvious one is to pad all the elements of the dataset to the same length: the length of the longest sample. This will then give us batches that all have the same shape, determined by the maximum sequence length. The downside is that batches composed of short sentences will have a lot of padding tokens, which introduces computations in the model that we ultimately don't need.

To avoid this, another strategy is to pad the elements when we batch them together, to the longest sentence inside the batch. This way, batches composed of short inputs will be smaller than the batch containing the longest sentence in the dataset. This will yield some nice speedup on CPU and GPU. The downside is that all batches will then have different shapes, which slows down training on accelerators like TPUs.

Let's see how to apply both strategies in practice.
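As a minimal sketch of the two strategies, the snippet below contrasts fixed padding with per-batch ("longest") padding directly in the tokenizer call. The checkpoint name and the example sentences are assumptions chosen for illustration; they are not specified at this point in the video.

```python
from transformers import AutoTokenizer

# Hypothetical checkpoint, used only to illustrate the two padding strategies
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sentences = [
    "A short sentence.",
    "A slightly longer sentence than the first one.",
    "This is the longest sentence of the three, so the other two get padded to match it.",
]

# Fixed padding: every sample is padded to the same length, regardless of batch contents
fixed = tokenizer(
    sentences, padding="max_length", max_length=128, truncation=True, return_tensors="pt"
)
print(fixed["input_ids"].shape)  # torch.Size([3, 128])

# Dynamic padding: samples are only padded to the longest sentence inside this batch
dynamic = tokenizer(sentences, padding="longest", return_tensors="pt")
print(dynamic["input_ids"].shape)  # torch.Size([3, <length of the longest tokenized sentence>])
```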
We have actually seen how to apply fixed padding in the Datasets Overview video, when we preprocessed the MRPC dataset: after loading the dataset and tokenizer, we applied the tokenization to the whole dataset with padding and truncation to make all samples of length 128. As a result, if we pass this dataset to a PyTorch DataLoader, we get batches of shape batch size (here 16) by 128.

To apply dynamic padding, we must defer the padding to the batch preparation, so we remove that part from our tokenize function. We still leave the truncation part so that inputs that are bigger than the maximum length accepted by the model, usually 512, get truncated to that length.

Then we pad our samples dynamically by using a data collator. Those classes in the Transformers library are responsible for applying all the final processing needed before forming a batch; here DataCollatorWithPadding will pad the samples to the maximum length inside the batch of sentences. We pass it to the PyTorch DataLoader as a collate function, then observe that the generated batches have various lengths, all well below the 128 from before.

Dynamic batching will almost always be faster on CPUs and GPUs, so you should apply it if you can. Remember to switch back to fixed padding, however, if you run your training script on TPU or need batches of fixed shapes.
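Below is a sketch of the dynamic-padding workflow described above, assuming the MRPC split of GLUE and a bert-base-uncased checkpoint (the specific checkpoint and batch size of 16 are assumptions consistent with the video's description, not shown in this transcript).

```python
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(example):
    # No padding here: padding is deferred to batch preparation.
    # Truncation still caps inputs at the model's maximum length (512 for BERT).
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

# The data collator pads each batch to the longest sample inside that batch
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

train_dataloader = DataLoader(
    tokenized_datasets["train"], batch_size=16, shuffle=True, collate_fn=data_collator
)

for batch in train_dataloader:
    # Shapes vary from batch to batch, e.g. torch.Size([16, 52]), torch.Size([16, 67]), ...
    print(batch["input_ids"].shape)
    break
```

To reproduce the fixed-padding behavior instead (for TPUs or when fixed shapes are required), the tokenize function would pass `padding="max_length", max_length=128` and the custom collate function would no longer be needed.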