subtitles/zh-CN/68_data-collators-a-tour.srt

1
00:00:00,670 --> 00:00:01,503
(嘶嘶声)
(whooshing sound)

2
00:00:01,503 --> 00:00:02,469
(贴纸弹出)
(sticker popping)

3
00:00:02,469 --> 00:00:05,302
(嘶嘶声)
(whooshing sound)

4
00:00:06,240 --> 00:00:08,220
在我们的很多例子中,
In a lot of our examples,

5
00:00:08,220 --> 00:00:12,150
你将看到 DataCollators 一遍又一遍地弹出。
you're going to see DataCollators popping up over and over.

6
00:00:12,150 --> 00:00:16,020
它们用于 PyTorch 和 TensorFlow 工作流程,
They're used in both PyTorch and TensorFlow workflows,

7
00:00:16,020 --> 00:00:17,460
甚至在 JAX 中,
and maybe even in JAX,

8
00:00:17,460 --> 00:00:20,130
但是没有人真正知道 JAX 中发生了什么。
but no-one really knows what's happening in JAX.

9
00:00:20,130 --> 00:00:21,840
不过,我们确实有一个研究团队正在研究它,
We do have a research team working on it though,

10
00:00:21,840 --> 00:00:23,970
所以也许他们很快就会告诉我们。
so maybe they'll tell us soon.

11
00:00:23,970 --> 00:00:25,620
但是回到主题。
But coming back on topic.

12
00:00:25,620 --> 00:00:27,600
什么是数据整理器?
What are data collators?

13
00:00:27,600 --> 00:00:30,480
数据整理器整理数据。
Data collators collate data.

14
00:00:30,480 --> 00:00:31,800
好像和没说一样。
That's not that helpful.

15
00:00:31,800 --> 00:00:35,023
更具体地说,它们整理了一份样本清单,
But to be more specific, they put together a list of samples

16
00:00:35,023 --> 00:00:37,830
组成一个单独的小批量训练数据。
into a single training minibatch.

17
00:00:37,830 --> 00:00:38,910
对于某些任务,
For some tasks,

18
00:00:38,910 --> 00:00:41,670
数据整理器可以非常简单明了。
the data collator can be very straightforward.

19
00:00:41,670 --> 00:00:44,820
例如,当你进行序列分类时,
For example, when you're doing sequence classification,

20
00:00:44,820 --> 00:00:47,010
你真正需要数据整理器做的是
all you really need from your data collator

21
00:00:47,010 --> 00:00:49,860
将你的样本数据填充到相同的长度,
is that it pads your samples to the same length

22
00:00:49,860 --> 00:00:52,413
并将它们连接成一个单独的 Tensor。
and concatenates them into a single Tensor.

23
00:00:53,340 --> 00:00:57,750
但对于其他工作流程,数据整理器可能非常复杂,
But for other workflows, data collators can be quite complex

24
00:00:57,750 --> 00:00:59,910
因为它们为特定任务
as they handle some of the preprocessing

25
00:00:59,910 --> 00:01:02,340
完成一些预处理操作。
needed for that particular task.

26
00:01:02,340 --> 00:01:04,800
所以,如果你想使用数据整理器,
So, if you want to use a data collator,

27
00:01:04,800 --> 00:01:07,860
对于 PyTorch 用户,你通常将数据整理器
for PyTorch users, you usually pass the data collator

28
00:01:07,860 --> 00:01:09,780
传递到你的 Trainer 对象。
to your Trainer object.

29
00:01:09,780 --> 00:01:11,310
在 TensorFlow 中,情况有点不同。
In TensorFlow, it's a bit different.

30
00:01:11,310 --> 00:01:12,960
使用数据整理器的最简单方法
The easiest way to use a data collator

31
00:01:12,960 --> 00:01:16,860
是将它传递给数据集的 to_tf_dataset 方法。
is to pass it to the to_tf_dataset method of your dataset.

32
00:01:16,860 --> 00:01:20,198
这会给你一个 tf.data.Dataset,
And this will give you a tf.data.Dataset

33
00:01:20,198 --> 00:01:22,743
然后你可以将其传递给 model.fit。
that you can then pass to model.fit.

34
00:01:23,580 --> 00:01:25,890
你将在本课程的示例和笔记本中
You'll see these approaches used in the examples

35
00:01:25,890 --> 00:01:28,068
看到这些方法。
and notebooks throughout this course.

36
00:01:28,068 --> 00:01:30,180
另请注意,我们所有的整理器
Also note that all of our collators

37
00:01:30,180 --> 00:01:32,610
都接受 return_tensors 参数。
take a return_tensors argument.

38
00:01:32,610 --> 00:01:35,737
你可以将其设置为 “pt” 以获取 PyTorch Tensor,
You can set this to "pt" to get PyTorch Tensors,

39
00:01:35,737 --> 00:01:37,920
设置为 “tf” 获取 TensorFlow Tensor,
"tf" to get TensorFlow Tensors,

40
00:01:37,920 --> 00:01:40,404
或 “np” 获取 Numpy 数组。
or "np" to get Numpy arrays.
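For reference, those two workflows look roughly like this. This is a minimal sketch rather than the course's own example: the "glue"/"sst2" dataset, the "bert-base-cased" checkpoint, the "collator-demo" output directory, and the column lists are purely illustrative.

    from datasets import load_dataset
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        DataCollatorWithPadding,
        Trainer,
        TrainingArguments,
    )

    raw_datasets = load_dataset("glue", "sst2")
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    tokenized = raw_datasets.map(
        lambda batch: tokenizer(batch["sentence"], truncation=True), batched=True
    )

    # PyTorch workflow: the collator is handed straight to the Trainer.
    # return_tensors="pt" is the default, so it is not set explicitly here.
    collator = DataCollatorWithPadding(tokenizer=tokenizer)
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
    trainer = Trainer(
        model=model,
        args=TrainingArguments("collator-demo"),
        train_dataset=tokenized["train"],
        data_collator=collator,
        tokenizer=tokenizer,
    )

    # TensorFlow workflow: the collator goes to to_tf_dataset with return_tensors="tf",
    # and the resulting tf.data.Dataset can then be passed to model.fit.
    tf_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
    tf_train = tokenized["train"].to_tf_dataset(
        columns=["input_ids", "attention_mask", "token_type_ids"],
        label_cols=["label"],
        shuffle=True,
        batch_size=16,
        collate_fn=tf_collator,
    )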
41
00:01:40,404 --> 00:01:42,450
出于向后兼容的原因,
For backward compatibility reasons,

42
00:01:42,450 --> 00:01:44,460
默认值为 “pt”,
the default value is "pt",

43
00:01:44,460 --> 00:01:47,160
所以 PyTorch 用户大多数情况下
so PyTorch users don't even have to set this argument

44
00:01:47,160 --> 00:01:48,270
甚至不必设置这个参数。
most of the time.

45
00:01:48,270 --> 00:01:50,820
因此,他们通常完全没有意识到
And so as a result, they're often totally unaware

46
00:01:50,820 --> 00:01:52,713
这个参数的存在。
that this argument even exists.

47
00:01:53,730 --> 00:01:55,050
我们可以从中学到一些东西,
We can learn something from this

48
00:01:55,050 --> 00:01:57,120
也就是特权的受益者
which is that the beneficiaries of privilege

49
00:01:57,120 --> 00:01:59,793
往往最无视它的存在。
are often the most blind to its existence.

50
00:02:00,690 --> 00:02:01,920
好吧好吧,扯远了。
But okay, coming back.

51
00:02:01,920 --> 00:02:06,540
让我们看看一些特定的数据整理器是如何工作的。
Let's see how some specific data collators work in action.

52
00:02:06,540 --> 00:02:08,070
不过再次记住,
Although again, remember if none

53
00:02:08,070 --> 00:02:09,900
如果内置数据整理器无法满足你的需求,
of the built-in data collators do what you need,

54
00:02:09,900 --> 00:02:13,650
你可以自己实现数据整理器,而且它们通常很短。
you can always write your own and they're often quite short.

55
00:02:13,650 --> 00:02:16,950
所以首先,我们将看到 “基本” 数据整理器。
So first, we'll see the "basic" data collators.

56
00:02:16,950 --> 00:02:20,433
它们是 DefaultDataCollator 和 DataCollatorWithPadding。
These are DefaultDataCollator and DataCollatorWithPadding.

57
00:02:21,420 --> 00:02:22,830
如果在准备训练之前
These are the ones you should use

58
00:02:22,830 --> 00:02:24,720
你的标签很简单,
if your labels are straightforward

59
00:02:24,720 --> 00:02:27,300
并且你的数据不需要任何特殊处理,
and your data doesn't need any special processing

60
00:02:27,300 --> 00:02:29,673
就可以使用上述的数据整理器。
before being ready for training.

61
00:02:29,673 --> 00:02:31,272
请注意,因为不同的模型
Notice that because different models

62
00:02:31,272 --> 00:02:33,690
有不同的填充词元,
have different padding tokens,

63
00:02:33,690 --> 00:02:37,170
DataCollatorWithPadding 将需要你模型的 tokenizer,
DataCollatorWithPadding will need your model's tokenizer

64
00:02:37,170 --> 00:02:40,150
这样它才知道如何正确填充序列。
so it knows how to pad sequences properly.

65
00:02:40,150 --> 00:02:44,790
默认的数据整理器不需要 tokenizer 就能工作,
The default data collator doesn't need a tokenizer to work,

66
00:02:44,790 --> 00:02:46,710
但它也会因此抛出错误,
but it will as a result throw an error

67
00:02:46,710 --> 00:02:48,900
除非你所有的序列都是相同的长度。
unless all of your sequences are the same length.

68
00:02:48,900 --> 00:02:50,500
所以,你应该意识到这一点。
So, you should be aware of that.

69
00:02:51,480 --> 00:02:52,860
继续前进。
Moving on though.

70
00:02:52,860 --> 00:02:54,300
许多其他数据整理器,
A lot of the other data collators

71
00:02:54,300 --> 00:02:56,130
除了基本的两个之外,
aside from the basic two,

72
00:02:56,130 --> 00:02:59,490
通常旨在处理一项特定任务。
are usually designed to handle one specific task.

73
00:02:59,490 --> 00:03:01,050
所以,我要在这里展示两个整理器。
And so, I'm going to show a couple here.

74
00:03:01,050 --> 00:03:04,320
它们是 DataCollatorForTokenClassification
These are DataCollatorForTokenClassification

75
00:03:04,320 --> 00:03:06,447
和 DataCollatorForSeq2Seq。
and DataCollatorForSeq2Seq.

76
00:03:06,447 --> 00:03:09,540
而这些任务需要特殊整理器的原因
And the reason these tasks need special collators

77
00:03:09,540 --> 00:03:12,600
是因为它们的标签长度可变。
is because their labels are variable in length.
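Before looking at how those task-specific collators handle labels, here is a quick sketch of the difference between the two basic collators described above; the "bert-base-cased" checkpoint and the toy sentences are only placeholders.

    from transformers import AutoTokenizer, DataCollatorWithPadding, DefaultDataCollator

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    samples = [
        tokenizer("A short sentence."),
        tokenizer("A noticeably longer sentence that turns into more tokens."),
    ]

    # DataCollatorWithPadding knows the tokenizer's padding token, so samples of
    # different lengths are padded up to the longest one in the batch.
    batch = DataCollatorWithPadding(tokenizer=tokenizer)(samples)
    print(batch["input_ids"].shape)  # (2, length of the longest sample)

    # DefaultDataCollator does no padding at all; with sequences of different
    # lengths it raises an error, so everything must already share one length.
    try:
        DefaultDataCollator()(samples)
    except ValueError as err:
        print("DefaultDataCollator needs equal-length samples:", err)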
78
00:03:12,600 --> 00:03:15,960
在词元分类中,每个词元都有一个标签,
In token classification there's one label for each token,

79
00:03:15,960 --> 00:03:17,400
所以标签的长度
and so the length of the labels

80
00:03:17,400 --> 00:03:18,993
就是序列的长度。
is the length of the sequence.

81
00:03:20,280 --> 00:03:23,520
而在 seq2seq 中,标签是一系列词元,
While in seq2seq the labels are a sequence of tokens

82
00:03:23,520 --> 00:03:24,780
它可以是可变长度,
that can be variable length,

83
00:03:24,780 --> 00:03:25,800
那可能会和输入序列
that can be very different

84
00:03:25,800 --> 00:03:28,200
的长度不同。
from the length of the input sequence.

85
00:03:28,200 --> 00:03:32,880
所以在这两种情况下,我们通过同样填充标签
So in both of these cases, we handle collating that batch

86
00:03:32,880 --> 00:03:35,280
来处理整理那批数据,
by padding the labels as well,

87
00:03:35,280 --> 00:03:37,410
正如你在此示例中看到的那样。
as you can see here in this example.

88
00:03:37,410 --> 00:03:40,770
因此,如果我们想把可变长度的样本
So, inputs and the labels will need to be padded

89
00:03:40,770 --> 00:03:43,860
组合到同一个小批量数据中,
if we want to join samples of variable length

90
00:03:43,860 --> 00:03:45,120
就需要填充输入和标签。
into the same minibatch.

91
00:03:45,120 --> 00:03:47,520
这正是数据整理器所做的,
That's exactly what data collators do,

92
00:03:47,520 --> 00:03:50,460
也正是这些数据整理器将为我们做的,
and that's exactly what these data collators will do for us,

93
00:03:50,460 --> 00:03:52,383
你懂的,为了这个特定的任务。
you know, for this particular task.

94
00:03:53,820 --> 00:03:56,070
那么,还有最后一个
So, there's one final data collator

95
00:03:56,070 --> 00:03:58,560
想在这里与大家分享的数据整理器。
I want to show you as well just in this lecture.

96
00:03:58,560 --> 00:04:00,473
这就是 DataCollatorForLanguageModeling。
And that's the DataCollatorForLanguageModeling.

97
00:04:01,410 --> 00:04:03,390
嗯,它非常重要,首先是
So, it's very important, firstly

98
00:04:03,390 --> 00:04:05,820
因为如今语言模型
because language models are just so foundational

99
00:04:05,820 --> 00:04:09,720
对我们所做的一切 NLP 工作都至关重要。
to everything we do with NLP these days.

100
00:04:09,720 --> 00:04:12,060
但其次,因为它有两种模式,
But secondly, because it has two modes

101
00:04:12,060 --> 00:04:14,760
可以完成两件截然不同的事情。
that do two very different things.

102
00:04:14,760 --> 00:04:19,230
因此,你可以使用 mlm 参数选择所需的模式。
So you choose which mode you want with the mlm argument.

103
00:04:19,230 --> 00:04:22,470
将其设置为 True 以进行屏蔽语言建模,
Set it to True for masked language modeling,

104
00:04:22,470 --> 00:04:26,190
并将其设置为 False 以进行因果语言建模。
and set it to False for causal language modeling.

105
00:04:26,190 --> 00:04:28,620
因此,为因果语言建模整理数据
So, collating data for causal language modeling

106
00:04:28,620 --> 00:04:30,750
其实很简单。
is actually quite straightforward.

107
00:04:30,750 --> 00:04:32,640
模型只是在预测
The model is just making predictions

108
00:04:32,640 --> 00:04:35,460
下一个词元是什么,所以你的标签
for what token comes next, and so your labels

109
00:04:35,460 --> 00:04:37,800
或多或少只是你输入的副本,
are more or less just a copy of your inputs,

110
00:04:37,800 --> 00:04:39,090
整理器会处理好这一点,
and the collator will handle that

111
00:04:39,090 --> 00:04:42,240
并确保正确填充输入和标签。
and ensure that the inputs and labels are padded correctly.
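In code, the causal (mlm=False) mode looks something like this minimal sketch; the "gpt2" checkpoint and the toy sentences are only illustrative, and setting the pad token to the end-of-sequence token is just one common workaround for GPT-2 having no pad token of its own.

    from transformers import AutoTokenizer, DataCollatorForLanguageModeling

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no padding token by default

    clm_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
    samples = [
        tokenizer("Data collators collate data."),
        tokenizer("A somewhat longer sentence so the batch needs padding."),
    ]
    batch = clm_collator(samples)

    # The labels are essentially a copy of input_ids, with padded positions set
    # to -100 so that they are ignored by the loss.
    print(batch["input_ids"].shape, batch["labels"].shape)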
112
00:04:42,240 --> 00:04:44,910
但是,当你将 mlm 设置为 True 时,
When you set mlm to True though,

113
00:04:44,910 --> 00:04:46,786
你会得到完全不同的行为,
you get quite different behavior,

114
00:04:46,786 --> 00:04:49,200
这与任何其他数据整理器都不同,
that's different from any other data collator,

115
00:04:49,200 --> 00:04:51,660
那是因为将 mlm 设置为 True
and that's because setting mlm to True

116
00:04:51,660 --> 00:04:53,550
表示屏蔽语言建模,
means masked language modeling,

117
00:04:53,550 --> 00:04:55,680
这意味着标签,
and that means the labels need to be,

118
00:04:55,680 --> 00:04:58,080
你懂的,输入需要被屏蔽。
you know, the inputs need to be masked.

119
00:04:58,080 --> 00:05:00,093
那么,那看起来是什么样子?
So, what does that look like?

120
00:05:01,050 --> 00:05:03,900
所以,回想一下,在屏蔽语言建模中,
So, recall that in masked language modeling,

121
00:05:03,900 --> 00:05:06,570
模型并不是在预测下一个词,
the model is not predicting the next word,

122
00:05:06,570 --> 00:05:09,240
相反,我们随机屏蔽掉一些词元,
instead we randomly mask out some tokens

123
00:05:09,240 --> 00:05:11,130
模型会同时预测所有这些词元。
and the model predicts all of them at once.

124
00:05:11,130 --> 00:05:12,780
所以,对于那些被屏蔽的词元,
So, it tries to kinda fill in the blanks

125
00:05:12,780 --> 00:05:14,790
它试图填补空白。
for those masked tokens.

126
00:05:14,790 --> 00:05:18,210
但是随机掩蔽的过程出奇地复杂。
But the process of random masking is surprisingly complex.

127
00:05:18,210 --> 00:05:21,330
如果我们遵循原始 BERT 论文中的方案,
If we follow the protocol from the original BERT paper,

128
00:05:21,330 --> 00:05:23,970
我们需要用掩码词元替换一些词元,
we need to replace some tokens with a mask token,

129
00:05:23,970 --> 00:05:26,190
用随机词元替换另一些词元,
some other tokens with a random token,

130
00:05:26,190 --> 00:05:29,820
然后保持第三组词元不变。
and then keep a third set of tokens unchanged.

131
00:05:29,820 --> 00:05:30,840
好吧,具体细节和我们这么做的原因
Yeah, this is not the lecture

132
00:05:30,840 --> 00:05:33,903
就不在这里和大家详细说明了。
to go into the specifics of that or why we do it.

133
00:05:33,903 --> 00:05:36,660
如果你好奇的话,可以随时查看原始的
You can always check out the original BERT paper

134
00:05:36,660 --> 00:05:37,493
BERT 论文。
if you're curious.

135
00:05:37,493 --> 00:05:39,620
它写得非常好,也很容易理解。
It's well written. It's easy to understand.

136
00:05:40,650 --> 00:05:44,190
这里要知道的主要事情是,自己实现起来
The main thing to know here is that it can be a real pain

137
00:05:44,190 --> 00:05:46,770
会非常痛苦,而且也相当复杂。
and quite complex to implement that yourself.

138
00:05:46,770 --> 00:05:49,740
但是当你将 mlm 设置为 True 时,DataCollatorForLanguageModeling
But DataCollatorForLanguageModeling will do it for you

139
00:05:49,740 --> 00:05:51,750
会为你完成。
when you set mlm to True.

140
00:05:51,750 --> 00:05:54,690
这是我们一些数据整理器所做的
And that's an example of the more intricate

141
00:05:54,690 --> 00:05:57,870
更加复杂的预处理的一个例子。
preprocessing that some of our data collators do.

142
00:05:57,870 --> 00:05:59,430
就是这样!
And that's it!

143
00:05:59,430 --> 00:06:01,920
因此,这涵盖了最常用的数据整理器
So, this covers the most commonly used data collators

144
00:06:01,920 --> 00:06:03,480
以及它们所针对的具体任务。
and the tasks they're used for.

145
00:06:03,480 --> 00:06:06,990
希望现在你会知道何时使用数据整理器,
And hopefully, now you'll know when to use data collators

146
00:06:06,990 --> 00:06:10,833
以及该为你的特定任务选择哪一个。
and which one to choose for your specific task.

147
00:06:11,765 --> 00:06:14,598
(嘶嘶声)
(whooshing sound)
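To round out the masked-language-modeling mode described above, here is a matching sketch with mlm set to True; the "bert-base-cased" checkpoint and the toy sentences are illustrative, and 0.15 is simply the collator's default masking probability.

    from transformers import AutoTokenizer, DataCollatorForLanguageModeling

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    mlm_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=True, mlm_probability=0.15
    )

    samples = [
        tokenizer("Data collators collate data."),
        tokenizer("Masked language modeling fills in the blanks."),
    ]
    batch = mlm_collator(samples)

    # Some input_ids are now replaced following the BERT recipe (mostly with the
    # mask token, sometimes with a random token, sometimes left unchanged).
    # The labels hold the original ids at the masked positions and -100 elsewhere,
    # so the loss is only computed on the tokens the model has to fill in.
    print(tokenizer.decode(batch["input_ids"][0]))
    print(batch["labels"][0])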