subtitles/zh-CN/18_batching-inputs-together-(tensorflow).srt

1
00:00:00,458 --> 00:00:02,791
(徽标嗖嗖声)
(logo whooshes)

2
00:00:05,310 --> 00:00:07,590
- 如何将输入批量组合在一起。
- How to batch inputs together.

3
00:00:07,590 --> 00:00:09,150
在本视频中，我们将看到
In this video, we'll see

4
00:00:09,150 --> 00:00:11,050
如何将输入序列一起批量处理。
how to batch input sequences together.

5
00:00:12,630 --> 00:00:14,910
一般来说，我们要传给模型的句子
In general, the sentences we want to pass

6
00:00:14,910 --> 00:00:18,000
并不会都具有相同的长度。
through our model won't all have the same lengths.

7
00:00:18,000 --> 00:00:20,310
在这里，我们使用的是之前
Here, we are using the model we saw

8
00:00:20,310 --> 00:00:22,650
在情感分析 pipeline 中看到的模型，
in the sentiment analysis pipeline

9
00:00:22,650 --> 00:00:24,753
并且想对两个句子进行分类。
and want to classify two sentences.

10
00:00:25,860 --> 00:00:27,870
将它们分词并把每个标记映射
[译者注：token、tokenization、tokenizer 等词均译成了“分词”，实则不翻译最佳]
When tokenizing them and mapping each token

11
00:00:27,870 --> 00:00:30,000
到其对应的输入 ID 时，
to its corresponding input IDs,

12
00:00:30,000 --> 00:00:31,900
我们会得到两个长度不同的列表。
we get two lists of different lengths.

13
00:00:33,360 --> 00:00:35,070
尝试从这两个列表创建 tensor 和 NumPy 数组
Trying to create a tensor and NumPy array

14
00:00:35,070 --> 00:00:38,100
将会导致错误，
from those two lists will result in an error

15
00:00:38,100 --> 00:00:40,953
因为所有数组和张量都应该是矩形的。
because all arrays and tensors should be rectangular.

16
00:00:42,510 --> 00:00:43,920
克服此限制的一种方法
One way to overcome this limit

17
00:00:43,920 --> 00:00:47,340
就是让第二句和第一句一样长，
is to make the second sentence the same length as the first

18
00:00:47,340 --> 00:00:50,373
即根据需要多次添加一个特殊分词。
by adding a special token as many times as necessary.

19
00:00:51,300 --> 00:00:53,340
另一种方法是将第一个序列截断
Another way would be to truncate the first sequence

20
00:00:53,340 --> 00:00:56,550
到第二个序列的长度，但这样我们会丢失
to the length of the second, but we would then lose a lot

21
00:00:56,550 --> 00:00:58,590
许多可能需要的信息，
of information that may be necessary

22
00:00:58,590 --> 00:01:01,230
而无法正确地对句子进行分类。
to properly classify the sentence.

23
00:01:01,230 --> 00:01:04,710
一般来说，只有当句子超过
In general, we only truncate sentences when they are longer

24
00:01:04,710 --> 00:01:07,083
模型所能处理的最大长度时，我们才会截断它。
than the maximum length the model can handle.

25
00:01:08,310 --> 00:01:10,320
用于填充第二句的值
The value used to pad the second sentence

26
00:01:10,320 --> 00:01:12,390
不应随意挑选。
should not be picked randomly.

27
00:01:12,390 --> 00:01:15,330
模型是用某个特定的填充 ID 预训练的，
The model has been pretrained with a certain padding ID,

28
00:01:15,330 --> 00:01:18,093
你可以在 tokenizer.pad_token_id 中找到它。
which you can find in tokenizer.pad_token_id.

29
00:01:19,950 --> 00:01:21,630
现在我们已经填充了句子，
Now that we have padded our sentences,

30
00:01:21,630 --> 00:01:23,130
就可以用它们组成一个批次了。
we can make a batch with them.

31
00:01:24,210 --> 00:01:26,730
然而，如果我们将这两个句子分别传给模型，
If we pass the two sentences to the model separately

32
00:01:26,730 --> 00:01:29,130
或者组成一个批次一起传入，我们会注意到
or batched together, however, we notice

33
00:01:29,130 --> 00:01:30,630
对于被填充的那个句子，
that we don't get the same results

34
00:01:30,630 --> 00:01:32,070
我们得到的结果并不相同。
for the sentence that is padded.

35
00:01:32,070 --> 00:01:34,440
在这里，是第二个句子。
Here, the second one.

36
00:01:34,440 --> 00:01:36,690
希望是 transformer 库中的单词？
Expect the word in the transformer library?

37
00:01:36,690 --> 00:01:37,620
不。
No.

38
00:01:37,620 --> 00:01:39,720
如果你还记得 Transformer 模型大量使用
If you remember that Transformer models make heavy use

39
00:01:39,720 --> 00:01:43,800
注意力层，那么这就不足为奇了。
of attention layers, it should not come as a total surprise.
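
A minimal TensorFlow sketch of the steps described so far (tokenize two sentences, pad the shorter list of input IDs with tokenizer.pad_token_id, batch them, and observe that the padded sentence no longer gets the same logits). The checkpoint name and the two example sentences are assumptions for illustration, not taken from this transcript.

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

# Assumed checkpoint: the one behind the sentiment-analysis pipeline.
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)

# Two placeholder sentences of different lengths.
sentences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

# Tokenize each sentence and map the tokens to input IDs: the two lists have
# different lengths, so tf.constant(ids) on the nested list would raise an
# error (tensors must be rectangular).
ids = [tokenizer.convert_tokens_to_ids(tokenizer.tokenize(s)) for s in sentences]
print([len(x) for x in ids])

# Pad the shorter list with the ID the model was pretrained with.
max_len = max(len(x) for x in ids)
padded = [x + [tokenizer.pad_token_id] * (max_len - len(x)) for x in ids]

# Without an attention mask, the logits for the padded (second) sentence
# differ from the logits obtained when it is passed on its own.
print(model(tf.constant([ids[1]])).logits)
print(model(tf.constant(padded)).logits)
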
40
00:01:43,800 --> 00:01:47,100
在计算每个分词的上下文表示时，
When computing the contextual representation of each token,

41
00:01:47,100 --> 00:01:49,440
注意力层会查看句子中
the attention layers look at all the other words

42
00:01:49,440 --> 00:01:51,240
所有其他的词。
in the sentence.

43
00:01:51,240 --> 00:01:52,252
如果我们只有一个句子，
If we have just a sentence

44
00:01:52,252 --> 00:01:55,650
或者是添加了若干填充 token 的同一个句子，
or the sentence with several padding tokens added,

45
00:01:55,650 --> 00:01:57,750
得到的值不一样是合乎逻辑的。
it's logical we don't get the same values.

46
00:01:58,830 --> 00:02:01,410
要在有或没有填充的情况下获得相同的结果，
To get the same results with or without padding,

47
00:02:01,410 --> 00:02:03,750
我们需要向注意力层表明
we need to indicate to the attention layers

48
00:02:03,750 --> 00:02:06,660
它们应该忽略那些填充 token。
that they should ignore those padding tokens.

49
00:02:06,660 --> 00:02:08,970
这是通过创建一个注意力掩码来完成的，
This is done by creating an attention mask,

50
00:02:08,970 --> 00:02:11,700
即一个与输入 ID 形状相同的 tensor，
a tensor with the same shape as the input IDs

51
00:02:11,700 --> 00:02:13,173
其中的值为 0 和 1。
with zeros and ones.

52
00:02:14,640 --> 00:02:16,830
1 表示注意力层
Ones indicate the tokens the attention layers

53
00:02:16,830 --> 00:02:18,660
应该在上下文中考虑的 token，
should consider in the context,

54
00:02:18,660 --> 00:02:20,823
0 则表示它们应该忽略的 token。
and zeros, the tokens they should ignore.

55
00:02:21,810 --> 00:02:23,290
现在，将这个注意力掩码
Now, passing this attention mask

56
00:02:23,290 --> 00:02:26,460
和输入 ID 一起传入，就会得到相同的结果，
along with the input IDs will give us the same results

57
00:02:26,460 --> 00:02:29,460
就像我们将两个句子分别发送给模型时一样。
as when we sent the two sentences individually to the model.

58
00:02:30,870 --> 00:02:33,870
这一切都是由分词器在幕后完成的：
This is all done behind the scenes by the tokenizer

59
00:02:33,870 --> 00:02:35,583
当你将它应用于多个句子
when you apply it to several sentences

60
00:02:35,583 --> 00:02:37,713
并设置参数 padding=True 时。
with the flag padding=True.

61
00:02:38,640 --> 00:02:39,690
它会用合适的值
It will apply the padding

62
00:02:39,690 --> 00:02:42,180
对较短的句子进行填充，
with the proper value to the smaller sentences

63
00:02:42,180 --> 00:02:44,373
并创建相应的注意力掩码。
and create the appropriate attention mask.
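
Continuing the sketch above (reusing the assumed tokenizer, model, sentences, ids, padded and max_len), a hand-built attention mask restores the per-sentence results, and calling the tokenizer with padding=True builds the padded batch and the mask in one step:

# Attention mask: 1 for real tokens, 0 for padding tokens.
attention_mask = [[1] * len(x) + [0] * (max_len - len(x)) for x in ids]

# With the mask, the batched logits match the ones from the individual passes.
outputs = model(tf.constant(padded), attention_mask=tf.constant(attention_mask))
print(outputs.logits)

# The tokenizer does all of this behind the scenes when padding=True: it pads
# the shorter sentences with the proper value and builds the attention mask
# (it also adds the special tokens the model expects).
batch = tokenizer(sentences, padding=True, return_tensors="tf")
print(model(**batch).logits)
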