subtitles/zh-CN/68_data-collators-a-tour.srt
1
00:00:00,670 --> 00:00:01,503
(嘶嘶声)
(whooshing sound)
2
00:00:01,503 --> 00:00:02,469
(贴纸弹出)
(sticker popping)
3
00:00:02,469 --> 00:00:05,302
(嘶嘶声)
(whooshing sound)
4
00:00:06,240 --> 00:00:08,220
在我们的很多例子中,
In a lot of our examples,
5
00:00:08,220 --> 00:00:12,150
你会看到 DataCollators 一次又一次地出现。
you're going to see DataCollators popping up over and over.
6
00:00:12,150 --> 00:00:16,020
它们用于 PyTorch 和 TensorFlow 工作流程,
They're used in both PyTorch and TensorFlow workflows,
7
00:00:16,020 --> 00:00:17,460
甚至可能在 JAX 中,
and maybe even in JAX,
8
00:00:17,460 --> 00:00:20,130
但是没有人真正知道 JAX 中发生了什么。
but no-one really knows what's happening in JAX.
9
00:00:20,130 --> 00:00:21,840
不过,我们确实有一个研究团队正在研究它,
We do have a research team working on it though,
10
00:00:21,840 --> 00:00:23,970
所以也许他们很快就会告诉我们。
so maybe they'll tell us soon.
11
00:00:23,970 --> 00:00:25,620
但是回到主题。
But coming back on topic.
12
00:00:25,620 --> 00:00:27,600
什么是数据整理器?
What are data collators?
13
00:00:27,600 --> 00:00:30,480
数据整理器整理数据。
Data collators collate data.
14
00:00:30,480 --> 00:00:31,800
好像和没说一样。
That's not that helpful.
15
00:00:31,800 --> 00:00:35,023
更具体地说,它们会把一组样本
But to be more specific, they put together a list of samples
16
00:00:35,023 --> 00:00:37,830
整理成一个单独的训练小批量。
into a single training minibatch.
17
00:00:37,830 --> 00:00:38,910
对于某些任务,
For some tasks,
18
00:00:38,910 --> 00:00:41,670
数据整理器可以非常简单明了。
the data collator can be very straightforward.
19
00:00:41,670 --> 00:00:44,820
例如,当你进行序列分类时,
For example, when you're doing sequence classification,
20
00:00:44,820 --> 00:00:47,010
你真正需要数据整理器做的是
all you really need from your data collator
21
00:00:47,010 --> 00:00:49,860
将你的样本数据填充到相同的长度
is that it pads your samples to the same length
22
00:00:49,860 --> 00:00:52,413
并将它们连接成一个单独的 Tensor。
and concatenates them into a single Tensor.
23
00:00:53,340 --> 00:00:57,750
但对于其他工作流程,数据整理器可能非常复杂
But for other workflows, data collators can be quite complex
24
00:00:57,750 --> 00:00:59,910
因为它们为特定任务
as they handle some of the preprocessing
25
00:00:59,910 --> 00:01:02,340
完成一些预处理操作。
needed for that particular task.
26
00:01:02,340 --> 00:01:04,800
所以,如果你想使用数据整理器,
So, if you want to use a data collator,
27
00:01:04,800 --> 00:01:07,860
对于 PyTorch 用户,你通常将数据整理器
for PyTorch users, you usually pass the data collator
28
00:01:07,860 --> 00:01:09,780
传递到你的 Trainer 对象。
to your Trainer object.
29
00:01:09,780 --> 00:01:11,310
在 TensorFlow 中,情况有点不同。
In TensorFlow, it's a bit different.
30
00:01:11,310 --> 00:01:12,960
使用数据整理器的最简单方法
The easiest way to use a data collator
31
00:01:12,960 --> 00:01:16,860
是将它传递给数据集的 to_tf_dataset 方法。
is to pass it to the to_tf_dataset method of your dataset.
32
00:01:16,860 --> 00:01:20,198
这会给你一个 TensorFlow tf.data.Dataset
And this will give you a TensorFlow tf.data.Dataset
33
00:01:20,198 --> 00:01:22,743
然后你可以将其传递给 model.fit。
that you can then pass to model.fit.
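And a sketch of the TensorFlow pattern described here, again with tokenized_dataset and model as assumed placeholders:

from transformers import DataCollatorWithPadding

# return_tensors="tf" makes the collator emit TensorFlow tensors (see the next point)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

tf_dataset = tokenized_dataset.to_tf_dataset(
    columns=["input_ids", "attention_mask"],
    label_cols=["labels"],
    batch_size=16,
    shuffle=True,
    collate_fn=data_collator,
)
model.fit(tf_dataset)  # model is assumed to be an already-compiled TF model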
34
00:01:23,580 --> 00:01:25,890
你将在示例中看到这些方法
You'll see these approaches used in the examples
35
00:01:25,890 --> 00:01:28,068
和整个课程的笔记本。
and notebooks throughout this course.
36
00:01:28,068 --> 00:01:30,180
另请注意,我们所有的整理器
Also note that all of our collators
37
00:01:30,180 --> 00:01:32,610
可接受 return_tensors 参数。
take a return_tensors argument.
38
00:01:32,610 --> 00:01:35,737
你可以将其设置为 “pt” 以获取 PyTorch Tensor,
You can set this to "pt" to get PyTorch Tensors,
39
00:01:35,737 --> 00:01:37,920
"tf" 获取 TensorFlow Tensor,
"tf" to get TensorFlow Tensors,
40
00:01:37,920 --> 00:01:40,404
或 “np” 获取 Numpy 数组。
or "np" to get Numpy arrays.
41
00:01:40,404 --> 00:01:42,450
出于向后兼容的原因,
For backward compatibility reasons,
42
00:01:42,450 --> 00:01:44,460
默认值为 “pt”,
the default value is "pt",
43
00:01:44,460 --> 00:01:47,160
所以 PyTorch 用户大多数情况下甚至
so PyTorch users don't even have to set this argument
44
00:01:47,160 --> 00:01:48,270
不必设置这个参数。
most of the time.
45
00:01:48,270 --> 00:01:50,820
因此,他们通常完全没有意识到
And so as a result, they're often totally unaware
46
00:01:50,820 --> 00:01:52,713
这个参数的存在。
that this argument even exists.
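As a quick illustration of that argument, using DataCollatorWithPadding (one of the collators covered below) as the example, with the tokenizer assumed to be loaded already:

from transformers import DataCollatorWithPadding

pt_collator = DataCollatorWithPadding(tokenizer=tokenizer)                       # default return_tensors="pt"
tf_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")  # TensorFlow tensors
np_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="np")  # NumPy arrays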
47
00:01:53,730 --> 00:01:55,050
我们可以从中学到一些东西
We can learn something from this
48
00:01:55,050 --> 00:01:57,120
也就是特权的受益人
which is that the beneficiaries of privilege
49
00:01:57,120 --> 00:01:59,793
往往最无视它的存在。
are often the most blind to its existence.
50
00:02:00,690 --> 00:02:01,920
好吧好吧,扯远了。
But okay, coming back.
51
00:02:01,920 --> 00:02:06,540
让我们看看一些特定的数据整理器是如何工作的。
Let's see how some specific data collators work in action.
52
00:02:06,540 --> 00:02:08,070
不过,再次提醒,
Although again, remember if none
53
00:02:08,070 --> 00:02:09,900
如果内置数据整理器无法满足你的需求,
of the built-in data collators do what you need,
54
00:02:09,900 --> 00:02:13,650
你可以自己实现数据整理器,而且它们通常很短。
you can always write your own and they're often quite short.
55
00:02:13,650 --> 00:02:16,950
所以首先,我们将看到 “基本” 数据整理器。
So first, we'll see the "basic" data collators.
56
00:02:16,950 --> 00:02:20,433
它们是 DefaultDataCollator 和 DataCollatorWithPadding。
These are DefaultDataCollator and DataCollatorWithPadding.
57
00:02:21,420 --> 00:02:22,830
如果在准备训练之前
These are the ones you should use
58
00:02:22,830 --> 00:02:24,720
你的标签很简单
if your labels are straightforward
59
00:02:24,720 --> 00:02:27,300
并且你的数据不需要任何特殊处理
and your data doesn't need any special processing
60
00:02:27,300 --> 00:02:29,673
就可以使用上述的数据整理器
before being ready for training.
61
00:02:29,673 --> 00:02:31,272
请注意,因为不同的模型
Notice that because different models
62
00:02:31,272 --> 00:02:33,690
有不同的填充词元,
have different padding tokens,
63
00:02:33,690 --> 00:02:37,170
DataCollatorWithPadding 将需要你模型的 Tokenizer
DataCollatorWithPadding will need your model's Tokenizer
64
00:02:37,170 --> 00:02:40,150
所以它知道如何正确填充序列。
so it knows how to pad sequences properly.
65
00:02:40,150 --> 00:02:44,790
默认的数据整理器不需要 Tokenizer 即可工作,
The default data collator doesn't need a Tokenizer to work,
66
00:02:44,790 --> 00:02:46,710
但它会因此抛出错误
but it will as a result throw an error
67
00:02:46,710 --> 00:02:48,900
除非你所有的序列都是相同的长度。
unless all of your sequences are the same length.
68
00:02:48,900 --> 00:02:50,500
所以,你应该意识到这一点。
So, you should be aware of that.
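A minimal sketch of that difference; the tokenizer is assumed to be loaded already and the sentences are made up:

from transformers import DefaultDataCollator, DataCollatorWithPadding

samples = [
    tokenizer("A short sentence."),
    tokenizer("A noticeably longer example sentence for the same batch."),
]

# DataCollatorWithPadding uses the tokenizer's pad token to pad every sample
# in the list to the same length before stacking them into one batch
padding_collator = DataCollatorWithPadding(tokenizer=tokenizer)
batch = padding_collator(samples)
print(batch["input_ids"].shape)  # (2, longest_length_in_batch)

# DefaultDataCollator does no padding at all, so calling it on these two samples
# of different lengths would fail; it only works when lengths already match
default_collator = DefaultDataCollator()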
69
00:02:51,480 --> 00:02:52,860
继续前进。
Moving on though.
70
00:02:52,860 --> 00:02:54,300
许多其他数据整理器
A lot of the other data collators
71
00:02:54,300 --> 00:02:56,130
除了这两个基本的整理器之外,
aside from the basic two are,
72
00:02:56,130 --> 00:02:59,490
它们通常旨在处理一项特定任务。
they're usually designed to handle one specific task.
73
00:02:59,490 --> 00:03:01,050
所以,我要在这里展示两个整理器。
And so, I'm going to show a couple here.
74
00:03:01,050 --> 00:03:04,320
它们是 DataCollatorForTokenClassification
These are DataCollatorForTokenClassification
75
00:03:04,320 --> 00:03:06,447
和 DataCollatorForSeq2Seq。
and DataCollatorForSeq2Seq.
76
00:03:06,447 --> 00:03:09,540
而这些任务需要特殊整理器的原因
And the reason these tasks need special collators
77
00:03:09,540 --> 00:03:12,600
是因为它们的标签长度可变。
is because their labels are variable in length.
78
00:03:12,600 --> 00:03:15,960
在词元分类中,每个词元都有一个标签,
In token classification there's one label for each token,
79
00:03:15,960 --> 00:03:17,400
标签的长度
and so the length of the labels
80
00:03:17,400 --> 00:03:18,993
是序列的长度。
is the length of the sequence.
81
00:03:20,280 --> 00:03:23,520
而在 SeqToSeq 中,标签是一系列词元
While in SeqToSeq the labels are a sequence of tokens
82
00:03:23,520 --> 00:03:24,780
它可以是可变长度,
that can be variable length,
83
00:03:24,780 --> 00:03:25,800
那可能会和输入序列
that can be very different
84
00:03:25,800 --> 00:03:28,200
的长度不同。
from the length of the input sequence.
85
00:03:28,200 --> 00:03:32,880
所以在这两种情况下,我们通过同时填充标签
So in both of these cases, we handle collating that batch
86
00:03:32,880 --> 00:03:35,280
来处理这批数据的整理,
by padding the labels as well,
87
00:03:35,280 --> 00:03:37,410
正如你在此示例中看到的那样。
as you can see here in this example.
88
00:03:37,410 --> 00:03:40,770
因此,如果我们想把可变长度的样本
So, inputs and the labels will need to be padded
89
00:03:40,770 --> 00:03:43,860
合并到同一个小批量中,
if we want to join samples of variable length
90
00:03:43,860 --> 00:03:45,120
就需要填充输入和标签。
into the same minibatch.
91
00:03:45,120 --> 00:03:47,520
这正是数据整理器所做的
That's exactly what the data collators
92
00:03:47,520 --> 00:03:50,460
而这正是这些数据整理器将为我们做的
and that's exactly what these data collators will do for us
93
00:03:50,460 --> 00:03:52,383
你懂的,为了这个特定的任务。
you know, for this particular task.
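A short sketch of the two collators just mentioned; the tokenizer and model names are assumed placeholders:

from transformers import DataCollatorForTokenClassification, DataCollatorForSeq2Seq

# Pads the inputs and the "labels" column; padded label positions default to -100
# so the loss function ignores them
token_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

# Here the labels are a separate token sequence, so they are padded independently
# of the inputs; passing the model also lets the collator prepare decoder_input_ids
# for models that support it
seq2seq_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)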
94
00:03:53,820 --> 00:03:56,070
那么,还有最后一个
So, there's one final data collator
95
00:03:56,070 --> 00:03:58,560
想在这里与大家分享的数据整理器。
I want to show you as well just in this lecture.
96
00:03:58,560 --> 00:04:00,473
这就是 DataCollatorForLanguageModeling。
And that's the DataCollatorForLanguageModeling.
97
00:04:01,410 --> 00:04:03,390
嗯,它非常重要,首先,
So, it's very important, and it's firstly,
98
00:04:03,390 --> 00:04:05,820
因为如今语言模型
because language models are just so foundational
99
00:04:05,820 --> 00:04:09,720
是我们所有 NLP 工作的基础。
to do for everything we do with NLP these days.
100
00:04:09,720 --> 00:04:12,060
但其次,因为它有两种模式
But secondly, because it has two modes
101
00:04:12,060 --> 00:04:14,760
可以完成两件截然不同的事情。
that do two very different things.
102
00:04:14,760 --> 00:04:19,230
因此,你可以使用 mlm 参数选择所需的模式。
So you choose which mode you want with the mlm argument.
103
00:04:19,230 --> 00:04:22,470
将其设置为 True 以进行屏蔽语言建模,
Set it to True for masked language modeling,
104
00:04:22,470 --> 00:04:26,190
并将其设置为 False 以进行因果语言建模。
and set it to False for causal language modeling.
105
00:04:26,190 --> 00:04:28,620
因此,为因果语言建模整理数据
So, collating data for causal language modeling
106
00:04:28,620 --> 00:04:30,750
其实很简单。
is actually quite straightforward.
107
00:04:30,750 --> 00:04:32,640
该模型只是在预测
The model is just making predictions
108
00:04:32,640 --> 00:04:35,460
接下来是什么词元,所以你的标签
for what token comes next, and so your labels
109
00:04:35,460 --> 00:04:37,800
或多或少只是你输入的副本,
are more or less just a copy of your inputs,
110
00:04:37,800 --> 00:04:39,090
整理器会处理你的输入
and the collator will handle that
111
00:04:39,090 --> 00:04:42,240
并确保正确填充输入和标签。
and ensure that the inputs and labels are padded correctly.
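A minimal sketch of that causal setup, assuming the tokenizer is already loaded:

from transformers import DataCollatorForLanguageModeling

# mlm=False selects causal language modeling: labels are a copy of the padded
# input_ids, with padding positions set to -100 so the loss ignores them
causal_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)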
112
00:04:42,240 --> 00:04:44,910
但是,当你将 mlm 设置为 True 时,
When you set mlm to True though,
113
00:04:44,910 --> 00:04:46,786
你会得到完全不同的效果,
you get quite different behavior,
114
00:04:46,786 --> 00:04:49,200
这与任何其他数据整理器不同,
that's different from any other data collator,
115
00:04:49,200 --> 00:04:51,660
那是因为将 mlm 设置为 True
and that's because setting mlm to True
116
00:04:51,660 --> 00:04:53,550
表示屏蔽语言建模
means masked language modeling
117
00:04:53,550 --> 00:04:55,680
这意味着标签必须是
and that means the labels need to be,
118
00:04:55,680 --> 00:04:58,080
你懂的,输入需要被屏蔽。
you know, the inputs need to be masked.
119
00:04:58,080 --> 00:05:00,093
那么,那看起来像什么?
So, what does that look like?
120
00:05:01,050 --> 00:05:03,900
所以,回想一下,在屏蔽语言建模中,
So, recall that in masked language modeling,
121
00:05:03,900 --> 00:05:06,570
该模型没有预测下一个词,
the model is not predicting the next word,
122
00:05:06,570 --> 00:05:09,240
相反,我们随机屏蔽掉一些词元
instead we randomly mask out some tokens
123
00:05:09,240 --> 00:05:11,130
模型会同时预测所有这些。
and the model predicts all of them at once.
124
00:05:11,130 --> 00:05:12,780
所以,对于那些被屏蔽的词元
So, it tries to kinda fill in the blanks
125
00:05:12,780 --> 00:05:14,790
它试图填补空白。
for those masked tokens.
126
00:05:14,790 --> 00:05:18,210
但是随机掩蔽的过程出奇地复杂。
But the process of random masking is surprisingly complex.
127
00:05:18,210 --> 00:05:21,330
如果我们遵循原始 BERT 论文中的协议,
If we follow the protocol from the original BERT paper,
128
00:05:21,330 --> 00:05:23,970
我们需要用掩码词元替换一些词元,
we need to replace some tokens with a masked token,
129
00:05:23,970 --> 00:05:26,190
用随机词元替换另一些词元,
some other tokens with a random token,
130
00:05:26,190 --> 00:05:29,820
然后保持第三组词元不变。
and then keep a third set of tokens unchanged.
131
00:05:29,820 --> 00:05:30,840
好吧,具体细节和我们这么做的原因
Yeah, this is not the lecture
132
00:05:30,840 --> 00:05:33,903
就不在这里和大家详细说明了。
to go into the specifics of that or why we do it.
133
00:05:33,903 --> 00:05:36,660
如果你好奇的话,可以随时查看原始的
You can always check out the original BERT paper
134
00:05:36,660 --> 00:05:37,493
BERT 论文。
if you're curious.
135
00:05:37,493 --> 00:05:39,620
它写得非常好。也很容易理解。
It's well written. It's easy to understand.
136
00:05:40,650 --> 00:05:44,190
这里要知道的主要事情是自己实现起来
The main thing to know here is that it can be a real pain
137
00:05:44,190 --> 00:05:46,770
是一件非常痛苦的事情,而且也非常复杂。
and quite complex to implement that yourself.
138
00:05:46,770 --> 00:05:49,740
但是当你将 mlm 设置为 True 时,DataCollatorForLanguageModeling
But DataCollatorForLanguageModeling will do it for you
139
00:05:49,740 --> 00:05:51,750
会为你完成。
when you set mlm to True.
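And a sketch of the masked-language-modeling mode; mlm_probability is shown here even though 0.15 is already the default:

from transformers import DataCollatorForLanguageModeling

# mlm=True applies BERT-style random masking for you: a fraction of tokens is
# selected, and of those, most are replaced by the mask token, some by a random
# token, and some are left unchanged; labels are -100 everywhere else
mlm_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)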
140
00:05:51,750 --> 00:05:54,690
这就是我们的一些数据整理器所完成的
And that's an example of the more intricate
141
00:05:54,690 --> 00:05:57,870
更复杂的预处理操作的一个例子。
preprocessing that some of our data collators do.
142
00:05:57,870 --> 00:05:59,430
就是这样!
And that's it!
143
00:05:59,430 --> 00:06:01,920
因此,这涵盖了最常用的数据整理器
So, this covers the most commonly used data collators
144
00:06:01,920 --> 00:06:03,480
以及它们所针对的具体任务。
and the tasks they're used for.
145
00:06:03,480 --> 00:06:06,990
希望现在你会知道何时使用数据整理器
And hopefully, now you'll know when to use data collators
146
00:06:06,990 --> 00:06:10,833
以及该为你的特定任务选择哪一个。
and which one to choose for your specific task.
147
00:06:11,765 --> 00:06:14,598
(嘶嘶声)
(whooshing sound)