subtitles/zh-CN/17_batching-inputs-together-(pytorch).srt

1
00:00:00,373 --> 00:00:02,956
(微妙的爆炸)
(subtle blast)

2
00:00:05,400 --> 00:00:07,590
- 如何将输入批量合并在一起。
- How to batch inputs together.

3
00:00:07,590 --> 00:00:09,240
在本视频中,我们将看到如何
In this video, we will see how

4
00:00:09,240 --> 00:00:11,073
将输入序列一起批处理。
to batch input sequences together.

5
00:00:12,137 --> 00:00:15,420
一般来说,我们想要传入模型的句子
In general, the sentences we want to pass through our model

6
00:00:15,420 --> 00:00:17,670
不会都有相同的长度。
won't all have the same lengths.

7
00:00:17,670 --> 00:00:19,740
在这里,我们使用的是之前
Here, we are using the model we saw

8
00:00:19,740 --> 00:00:22,080
在情感分析 pipeline 中见过的模型,
in the sentiment analysis pipeline

9
00:00:22,080 --> 00:00:24,063
并想对两个句子进行分类。
and want to classify two sentences.

10
00:00:24,900 --> 00:00:27,360
将它们分词化并把每个分词映射 [译者注: token, tokenization, tokenizer 等词均译成了 "分词", 实则以不翻译为佳]
When tokenizing them and mapping each token

11
00:00:27,360 --> 00:00:29,610
到其相应的输入 ID 时,
to its corresponding input IDs,

12
00:00:29,610 --> 00:00:31,593
我们得到两个不同长度的列表。
we get two lists of different lengths.

13
00:00:33,240 --> 00:00:35,340
尝试创建 tensor 或 NumPy 数组
Trying to create a tensor or a NumPy array

14
00:00:35,340 --> 00:00:38,220
从这两个列表中,将会导致错误,
from those two lists will result in an error,

15
00:00:38,220 --> 00:00:41,043
因为所有数组和张量都应该是矩形的。
because all arrays and tensors should be rectangular.

16
00:00:42,240 --> 00:00:44,160
克服此限制的一种方法
One way to overcome this limit

17
00:00:44,160 --> 00:00:45,690
是让第二句
is to make the second sentence

18
00:00:45,690 --> 00:00:47,640
与第一句的长度相同,
the same length as the first

19
00:00:47,640 --> 00:00:50,463
做法是根据需要多次添加特殊分词。
by adding a special token as many times as necessary.

20
00:00:51,600 --> 00:00:53,970
另一种方法是将第一个序列截断
Another way would be to truncate the first sequence

21
00:00:53,970 --> 00:00:55,710
到第二个序列的长度,
to the length of the second,

22
00:00:55,710 --> 00:00:58,140
但这样我们会丢失很多信息,
but we would then lose a lot of information

23
00:00:58,140 --> 00:01:01,083
而这些信息可能是正确分类句子所必需的。
that might be necessary to properly classify the sentence.

24
00:01:02,190 --> 00:01:04,830
一般来说,我们只在句子
In general, we only truncate sentences

25
00:01:04,830 --> 00:01:06,840
超过模型所能处理的最大长度时
when they are longer than the maximum length

26
00:01:06,840 --> 00:01:08,073
才会截断它们。
the model can handle.

27
00:01:09,720 --> 00:01:11,850
用于填充第二句的值
The value used to pad the second sentence

28
00:01:11,850 --> 00:01:13,740
不应随意挑选;
should not be picked randomly;

29
00:01:13,740 --> 00:01:16,680
该模型已经用特定的填充 ID 进行了预训练,
the model has been pretrained with a certain padding ID,

30
00:01:16,680 --> 00:01:19,533
你可以在 tokenizer.pad_token_id 中找到它。
which you can find in tokenizer.pad_token_id.

31
00:01:21,090 --> 00:01:22,800
现在我们已经填充了句子,
Now that we have padded our sentences,

32
00:01:22,800 --> 00:01:24,303
就可以把它们组成一个批次。
we can make a batch with them.

33
00:01:25,380 --> 00:01:28,320
然而,如果我们将两个句子分别传给模型,
If we pass the two sentences to the model separately

34
00:01:28,320 --> 00:01:30,120
再把它们组成批次一起传入,
and batched together however,

35
00:01:30,120 --> 00:01:32,100
我们会注意到,被填充的那个句子
we notice that we don't get the same results

36
00:01:32,100 --> 00:01:34,060
得到的结果并不相同,
for the sentence that is padded,

37
00:01:34,060 --> 00:01:35,403
也就是这里的第二句。
here, the second one.

38
00:01:36,390 --> 00:01:39,420
是 Transformers 库有问题吗?不是。
Is there a bug in the Transformers library? No.
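
The situation described in the cues above can be sketched in a few lines of code. This is a minimal illustration, not the snippet shown in the video: the checkpoint (the default sentiment-analysis model, distilbert-base-uncased-finetuned-sst-2-english) and the two example sentences are assumptions made here for the demo.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed checkpoint: the default model of the sentiment-analysis pipeline.
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# Two placeholder sentences of different lengths.
sentences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

# Tokenizing them separately gives two lists of input IDs with different
# lengths, which cannot be stacked into a rectangular tensor as is.
ids = [tokenizer(s)["input_ids"] for s in sentences]
print([len(seq) for seq in ids])

# Pad the shorter list with the padding ID the model was pretrained with.
max_len = max(len(seq) for seq in ids)
padded = [seq + [tokenizer.pad_token_id] * (max_len - len(seq)) for seq in ids]
batch = torch.tensor(padded)

with torch.no_grad():
    # Without an attention mask, the logits of the padded (second) sentence
    # differ from the logits obtained when it is passed to the model alone.
    print(model(torch.tensor([ids[1]])).logits)
    print(model(batch).logits[1])
```

The logits of the unpadded first sentence, by contrast, are unchanged by batching; only the padded sentence is affected, which is the mismatch the narration points out.
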
39
00:01:39,420 --> 00:01:40,770
如果你还记得 Transformer 模型
If you remember that Transformer models

40
00:01:40,770 --> 00:01:42,810
会大量使用注意力层,
make heavy use of attention layers,

41
00:01:42,810 --> 00:01:45,210
这应该就不足为奇了;
this should not come as a total surprise;

42
00:01:45,210 --> 00:01:48,277
在计算每个分词的上下文表示时,
when computing the contextual representation of each token,

43
00:01:48,277 --> 00:01:50,910
注意力层会查看句子中
the attention layers look at all the other words

44
00:01:50,910 --> 00:01:52,410
所有其他的词。
in the sentence.

45
00:01:52,410 --> 00:01:53,850
如果输入的只是这个句子本身,
If we have just the sentence

46
00:01:53,850 --> 00:01:56,970
或者是添加了若干填充分词的句子,
or the sentence with several padding tokens added,

47
00:01:56,970 --> 00:01:59,073
得到不同的值也是合乎逻辑的。
it's logical we don't get the same values.

48
00:02:00,270 --> 00:02:03,030
要在有或没有填充的情况下获得相同的结果,
To get the same results with or without padding,

49
00:02:03,030 --> 00:02:05,340
我们需要向注意力层表明
we need to indicate to the attention layers

50
00:02:05,340 --> 00:02:08,070
它们应该忽略那些填充分词。
that they should ignore those padding tokens.

51
00:02:08,070 --> 00:02:10,620
这是通过创建一个注意力掩码来完成的,
This is done by creating an attention mask,

52
00:02:10,620 --> 00:02:13,320
即一个与输入 ID 形状相同的张量,
a tensor with the same shape as the input IDs,

53
00:02:13,320 --> 00:02:14,733
由 0 和 1 组成。
with zeros and ones.

54
00:02:15,780 --> 00:02:18,120
1 表示注意力层
Ones indicate the tokens the attention layers

55
00:02:18,120 --> 00:02:20,100
应该在上下文中考虑的分词,
should consider in the context

56
00:02:20,100 --> 00:02:22,100
而 0 则表示它们应该忽略的分词。
and zeros the tokens they should ignore.

57
00:02:23,520 --> 00:02:26,760
现在,将这个注意力掩码与输入 ID 一起传入,
Now, passing this attention mask along with the input ID

58
00:02:26,760 --> 00:02:28,170
我们会得到相同的结果,
will give us the same results

59
00:02:28,170 --> 00:02:31,170
就像我们将两个句子单独传给模型一样。
as when we sent the two sentences individually to the model.

60
00:02:32,400 --> 00:02:34,950
这一切都是由分词器在幕后完成的:
This is all done behind the scenes by the tokenizer

61
00:02:34,950 --> 00:02:36,900
当你把它应用于多个句子
when you apply it to several sentences

62
00:02:36,900 --> 00:02:38,613
并设置参数 padding=True 时。
with the flag padding=True.

63
00:02:39,599 --> 00:02:41,490
它会用合适的填充值
It will apply the padding with the proper value

64
00:02:41,490 --> 00:02:43,140
填充较短的句子,
to the smaller sentences

65
00:02:43,140 --> 00:02:45,423
并创建相应的注意力掩码。
and create the appropriate attention mask.

66
00:02:46,993 --> 00:02:49,576
(微妙的爆炸)
(subtle blast)
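
As a follow-up to the narration above, here is a self-contained sketch of the fix; again the checkpoint and example sentences are placeholders rather than the exact code from the video, and the padding=True call is the shortcut the last few cues describe.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Same assumptions as the previous sketch: checkpoint and sentences are placeholders.
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sentences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

# padding=True pads the shorter sentence with tokenizer.pad_token_id and
# builds the matching attention mask (1 = real token, 0 = padding token).
inputs = tokenizer(sentences, padding=True, return_tensors="pt")
print(inputs["attention_mask"])

with torch.no_grad():
    batched_logits = model(**inputs).logits
    single_logits = model(**tokenizer(sentences[1], return_tensors="pt")).logits

# With the attention mask in place, the padded second sentence gets
# (numerically almost) the same logits as when it is passed on its own.
print(batched_logits[1])
print(single_logits[0])
```

Building the mask by hand (ones for the real tokens, zeros for the padding positions) and passing it to the model as attention_mask gives the same result; padding=True simply automates both the padding and the mask creation.
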