subtitles/zh-CN/23_what-is-dynamic-padding.srt
1
00:00:00,242 --> 00:00:02,909
(空气呼啸)
(air whooshing)
2
00:00:05,460 --> 00:00:06,963
- 什么是动态填充?
- What is dynamic padding?
3
00:00:08,630 --> 00:00:10,890
在 “一起批量输入” 视频中,
In the "Batching Inputs together" video,
4
00:00:10,890 --> 00:00:12,720
我们已经看到,要想把不同长度的输入
we have seen that to be able to group inputs
5
00:00:12,720 --> 00:00:15,300
组合到同一个批次中,
of different lengths in the same batch,
6
00:00:15,300 --> 00:00:18,270
我们需要向所有短输入添加填充标记
we need to add padding tokens to all the short inputs
7
00:00:18,270 --> 00:00:20,970
直到它们的长度都相同。
until they are all of the same length.
8
00:00:20,970 --> 00:00:24,600
例如,这里最长的句子是第三句,
Here, for instance, the longest sentence is the third one,
9
00:00:24,600 --> 00:00:27,270
我们需要添加五个、两个或七个填充标记
and we need to add five, two, or seven pad tokens
10
00:00:27,270 --> 00:00:30,090
到其他句子,使这四个句子具有
to the other sentences to have four sentences
11
00:00:30,090 --> 00:00:31,090
相同的长度。
of the same lengths.
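As a quick illustration (not shown in the video), here is a minimal sketch of padding to the longest sentence with a 🤗 Transformers tokenizer; the checkpoint and sentences are assumptions:

```python
# Minimal sketch, assuming a hypothetical checkpoint and sentences:
# padding=True pads every sequence up to the longest one in the call.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
sentences = [
    "I like dynamic padding.",
    "A slightly longer second sentence.",
    "This third sentence is the longest of the four, so the others receive pad tokens.",
    "Short again.",
]
batch = tokenizer(sentences, padding=True, return_tensors="pt")
print(batch["input_ids"].shape)                     # every row has the length of the longest sentence
print(tokenizer.pad_token, tokenizer.pad_token_id)  # the token used to pad ([PAD], id 0 for BERT)
```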
12
00:00:32,430 --> 00:00:33,900
在处理整个数据集时,
When dealing with a whole dataset,
13
00:00:33,900 --> 00:00:36,633
我们可以应用各种填充策略。
there are various padding strategies we can apply.
14
00:00:37,560 --> 00:00:39,540
最明显的一种是把数据集中的所有样本
The most obvious one is to pad all the elements
15
00:00:39,540 --> 00:00:40,923
都填充到相同的长度:
of the dataset to the same length:
16
00:00:40,923 --> 00:00:43,053
最长样本的长度。
the length of the longest sample.
17
00:00:44,070 --> 00:00:45,330
我们得到具有相同形状的批次
This will then give us batches
18
00:00:45,330 --> 00:00:46,890
that all have the same shape
19
00:00:46,890 --> 00:00:49,800
其形状由最大序列长度决定。
determined by the maximum sequence length.
20
00:00:49,800 --> 00:00:52,893
缺点是由短句组成的批次
The downside is that batches composed from short sentences
21
00:00:52,893 --> 00:00:54,960
会包含大量填充标记,
will have a lot of padding tokens
22
00:00:54,960 --> 00:00:57,660
这些标记会在模型中引入我们最终并不需要的额外计算。
which will introduce more computations in the model
23
00:00:57,660 --> 00:00:58,910
we ultimately don't need.
24
00:01:00,060 --> 00:01:03,300
为了避免这种情况,另一种策略是
To avoid this, another strategy is to pad the elements
25
00:01:03,300 --> 00:01:05,280
在把样本组成一个批次时,
when we batch them together,
26
00:01:05,280 --> 00:01:08,190
将它们填充到该批次内最长句子的长度。
to the longest sentence inside the batch.
27
00:01:08,190 --> 00:01:12,000
这样,由较短输入组成的批次
This way, batches composed of short inputs will be smaller
28
00:01:12,000 --> 00:01:13,920
就会比包含数据集中最长句子的那个批次更小。
than the batch containing the longest sentence
29
00:01:13,920 --> 00:01:15,510
in the dataset.
30
00:01:15,510 --> 00:01:18,063
这将在 CPU 和 GPU 上产生一些不错的加速。
This will yield some nice speedup on CPU and GPU.
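For reference, the two strategies map onto the tokenizer's padding argument; a sketch under assumed checkpoint and sentences:

```python
# Sketch comparing fixed padding and per-batch ("dynamic") padding shapes;
# the checkpoint and sentences are illustrative assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
sentences = ["A short one.", "A noticeably longer sentence than the first one."]

fixed = tokenizer(sentences, padding="max_length", max_length=128,
                  truncation=True, return_tensors="pt")
dynamic = tokenizer(sentences, padding="longest", return_tensors="pt")

print(fixed["input_ids"].shape)    # torch.Size([2, 128]) whatever the sentences are
print(dynamic["input_ids"].shape)  # torch.Size([2, <length of the longest sentence>])
```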
31
00:01:19,110 --> 00:01:20,490
缺点是所有批次
The downside is that all batches
32
00:01:20,490 --> 00:01:22,140
会有不同的形状,
will then have different shapes,
33
00:01:22,140 --> 00:01:24,740
这会减慢 TPU 等加速器的训练速度。
which slows down training on accelerators like TPUs.
34
00:01:26,160 --> 00:01:29,370
让我们看看如何在实践中应用这两种策略。
Let's see how to apply both strategies in practice.
35
00:01:29,370 --> 00:01:31,280
其实我们已经见过如何应用固定填充,
We have actually seen how to apply fixed padding
36
00:01:31,280 --> 00:01:33,390
就是在 Datasets 概述视频中,
in the Datasets Overview video,
37
00:01:33,390 --> 00:01:36,030
当我们预处理 MRPC 数据集时:
when we preprocessed the MRPC dataset:
38
00:01:36,030 --> 00:01:38,250
加载数据集和分词器后,
after loading the dataset and tokenizer,
39
00:01:38,250 --> 00:01:40,680
我们对整个数据集应用了分词处理,
we applied the tokenization to all the dataset
40
00:01:40,680 --> 00:01:42,480
包括填充和截断
with padding and truncation
41
00:01:42,480 --> 00:01:45,273
使所有样本的长度都为 128。
to make all samples of length 128.
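Here is a sketch of that fixed-padding preprocessing; the exact checkpoint is an assumption, but load_dataset, AutoTokenizer and Dataset.map are the pieces described here:

```python
# Fixed padding: every sample is padded and truncated to length 128 during map().
from datasets import load_dataset
from transformers import AutoTokenizer

raw_datasets = load_dataset("glue", "mrpc")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # assumed checkpoint

def tokenize_function(example):
    return tokenizer(
        example["sentence1"],
        example["sentence2"],
        padding="max_length",
        truncation=True,
        max_length=128,
    )

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
```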
42
00:01:46,530 --> 00:01:48,360
因此,如果我们把这个数据集
As a result, if we pass this dataset
43
00:01:48,360 --> 00:01:50,520
到 PyTorch DataLoader,
to a PyTorch DataLoader,
44
00:01:50,520 --> 00:01:55,503
我们会得到形状为批次大小(这里是 16)乘以 128 的批次。
we get batches of shape batch size, here 16, by 128.
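Continuing the sketch above, wrapping the training split in a PyTorch DataLoader gives fixed-shape batches; the column clean-up lines are assumptions about the MRPC columns:

```python
# With fixed padding, every batch has shape (batch_size, 128).
from torch.utils.data import DataLoader

tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

train_dataloader = DataLoader(tokenized_datasets["train"], batch_size=16, shuffle=True)
batch = next(iter(train_dataloader))
print(batch["input_ids"].shape)  # torch.Size([16, 128])
```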
45
00:01:57,060 --> 00:01:58,380
要应用动态填充,
To apply dynamic padding,
46
00:01:58,380 --> 00:02:01,440
我们必须把填充推迟到构建批次的时候,
we must defer the padding to the batch preparation,
47
00:02:01,440 --> 00:02:04,740
所以我们从分词函数中去掉了那部分。
so we remove that part from our tokenize function.
48
00:02:04,740 --> 00:02:06,150
我们仍然保留截断部分
We still leave the truncation part
49
00:02:06,150 --> 00:02:08,580
以便超过模型所能接受的
so that inputs that are bigger than the maximum length
50
00:02:08,580 --> 00:02:12,060
最大长度(通常是 512)的输入
accepted by the model, usually 512,
51
00:02:12,060 --> 00:02:13,510
会被截断到这个长度。
get truncated to that length.
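In code, that means dropping the padding arguments from the tokenize function while keeping truncation; this reuses the raw_datasets and tokenizer from the earlier sketch:

```python
# No padding here anymore; truncation alone caps inputs at the model's
# maximum accepted length (512 for BERT-like models).
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
```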
52
00:02:14,940 --> 00:02:16,380
然后我们动态地填充我们的样本
Then we pad our samples dynamically
53
00:02:16,380 --> 00:02:18,330
通过使用数据整理器。
by using a data collator.
54
00:02:18,330 --> 00:02:20,280
Transformers 库中的那些类
Those classes in the Transformers library
55
00:02:20,280 --> 00:02:22,740
负责应用组成批次之前
are responsible for applying all the final processing
56
00:02:22,740 --> 00:02:25,290
所需的全部最终处理,
needed before forming a batch,
57
00:02:25,290 --> 00:02:28,470
这里 DataCollatorWithPadding 将填充样本
here DataCollatorWithPadding will pad the samples
58
00:02:28,470 --> 00:02:31,083
到这批句子中的最大长度。
to the maximum length inside the batch of sentences.
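DataCollatorWithPadding is built from the tokenizer, so it knows which pad token and padding side to use; a minimal sketch:

```python
from transformers import DataCollatorWithPadding

# Pads each incoming list of samples to the longest sample of that batch.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
```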
59
00:02:32,160 --> 00:02:35,310
我们将它作为整理函数传递给 PyTorch DataLoader,
We pass it to the PyTorch DataLoader as a collate function,
60
00:02:35,310 --> 00:02:37,620
然后观察到生成的批次
then observe that the batches generated
61
00:02:37,620 --> 00:02:38,850
有不同的长度,
have various lengths,
62
00:02:38,850 --> 00:02:41,253
全都远低于之前的 128。
all way below the 128 from before.
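Putting it together, reusing the tokenized dataset and collator from the sketches above (the printed shapes are only illustrative):

```python
from torch.utils.data import DataLoader

# The string columns must be dropped before collation, since the collator only
# pads tensor-like fields such as input_ids and attention_mask (assumed clean-up).
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")

train_dataloader = DataLoader(
    tokenized_datasets["train"],
    batch_size=16,
    shuffle=True,
    collate_fn=data_collator,
)

for step, batch in enumerate(train_dataloader):
    print(batch["input_ids"].shape)  # e.g. torch.Size([16, 67]), [16, 59], ... all below 128
    if step >= 3:
        break
```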
63
00:02:42,660 --> 00:02:44,820
动态批处理在 CPU 和 GPU 上几乎总是更快,
Dynamic batching will almost always be faster
64
00:02:44,820 --> 00:02:47,913
所以只要可以,你就应该使用它。
on CPUs and GPUs, so you should apply it if you can.
65
00:02:48,930 --> 00:02:51,330
但是请记住,要切换回固定填充,
Remember to switch back to fixed padding, however,
66
00:02:51,330 --> 00:02:53,490
如果你在 TPU 上运行你的训练脚本
if you run your training script on TPU
67
00:02:53,490 --> 00:02:55,293
或者需要固定形状的批次输入。
or need batches of fixed shapes.
68
00:02:56,917 --> 00:02:59,584
(空气呼啸)
(air whooshing)