subtitles/zh-CN/18_batching-inputs-together-(tensorflow).srt
1
00:00:00,458 --> 00:00:02,791
(徽标嗖嗖声)
(logo whooshes)
2
00:00:05,310 --> 00:00:07,590
- 如何将输入批量组合在一起。
- How to batch inputs together.
3
00:00:07,590 --> 00:00:09,150
在本视频中,我们将看到
In this video, we'll see
4
00:00:09,150 --> 00:00:11,050
如何将输入序列一起批量处理。
how to batch input sequences together.
5
00:00:12,630 --> 00:00:14,910
一般来说,我们想要传给模型的句子
In general, the sentences we want to pass
6
00:00:14,910 --> 00:00:18,000
不会都具有相同的长度。
through our model won't all have the same lengths.
7
00:00:18,000 --> 00:00:20,310
在这里,我们使用的是
Here, we are using the model we saw
8
00:00:20,310 --> 00:00:22,650
情感分析 pipeline 中的那个模型,
in the sentiment analysis pipeline
9
00:00:22,650 --> 00:00:24,753
并想对两个句子进行分类。
and want to classify two sentences.
10
00:00:25,860 --> 00:00:27,870
在对它们分词并将每个 token
When tokenizing them and mapping each token
11
00:00:27,870 --> 00:00:30,000
映射到其对应的输入 ID 时,
to its corresponding input IDs,
12
00:00:30,000 --> 00:00:31,900
我们得到两个不同长度的列表。
we get two lists of different lengths.
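A minimal sketch of this step in code, assuming the checkpoint behind the sentiment-analysis pipeline (distilbert-base-uncased-finetuned-sst-2-english) and two example sentences chosen for illustration:

from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sentences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

# Tokenize each sentence and map its tokens to input IDs (no special tokens added here).
ids = [tokenizer.convert_tokens_to_ids(tokenizer.tokenize(s)) for s in sentences]
print([len(seq) for seq in ids])  # two lists of different lengths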
13
00:00:33,360 --> 00:00:35,070
尝试从这两个列表创建
Trying to create a tensor and NumPy array
14
00:00:35,070 --> 00:00:38,100
tensor 或 NumPy 数组将导致错误,
from those two lists will result in an error
15
00:00:38,100 --> 00:00:40,953
因为所有数组和张量都应该是矩形的。
because all arrays and tensors should be rectangular.
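Continuing the sketch above, building a tensor directly from those ragged lists fails:

import tensorflow as tf

# All rows of a tensor must have the same length, so this raises a ValueError.
try:
    tf.constant(ids)
except ValueError as err:
    print(err)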
16
00:00:42,510 --> 00:00:43,920
克服这一限制的一种方法
One way to overcome this limit
17
00:00:43,920 --> 00:00:47,340
就是让第二句和第一句一样长
is to make the second sentence the same length as the first
18
00:00:47,340 --> 00:00:50,373
也就是按需多次添加一个特殊的 token。
by adding a special token as many times as necessary.
19
00:00:51,300 --> 00:00:53,340
另一种方法是把第一个序列截断
Another way would be to truncate the first sequence
20
00:00:53,340 --> 00:00:56,550
到第二个序列的长度,但这样会丢失很多
to the length of the second, but we would then lose a lot
21
00:00:56,550 --> 00:00:58,590
可能是正确分类该句子
of information that may be necessary
22
00:00:58,590 --> 00:01:01,230
所必需的信息。
to properly classify the sentence.
23
00:01:01,230 --> 00:01:04,710
一般来说,只有当句子的长度
In general, we only truncate sentences when they are longer
24
00:01:04,710 --> 00:01:07,083
超过模型能处理的最大长度时,我们才会截断。
than the maximum length the model can handle.
25
00:01:08,310 --> 00:01:10,320
用于填充第二句的值
The value used to pad the second sentence
26
00:01:10,320 --> 00:01:12,390
不应被随意挑选。
should not be picked randomly.
27
00:01:12,390 --> 00:01:15,330
该模型已经使用特定的填充 ID 进行了预训练,
The model has been pretrained with a certain padding ID,
28
00:01:15,330 --> 00:01:18,093
你可以在 tokenizer.pad_token_id 中找到它。
which you can find in tokenizer.pad_token_id.
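As a sketch, the shorter list of IDs from the example above can be padded by hand with that value:

pad_id = tokenizer.pad_token_id  # 0 for this checkpoint
max_len = max(len(seq) for seq in ids)

# Extend every sequence to the longest length using the padding ID.
padded_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in ids]
batch = tf.constant(padded_ids)  # rectangular now, so this works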
29
00:01:19,950 --> 00:01:21,630
现在我们已经填充了句子,
Now that we have padded our sentences,
30
00:01:21,630 --> 00:01:23,130
就可以用它们组成一个批次了。
we can make a batch with them.
31
00:01:24,210 --> 00:01:26,730
然而,如果我们将这两个句子分别传给模型,
If we pass the two sentences to the model separately
32
00:01:26,730 --> 00:01:29,130
或是组成批次一起传入,我们会注意到
or batched together, however, we notice
33
00:01:29,130 --> 00:01:30,630
对于那个被填充的句子,
that we don't get the same results
34
00:01:30,630 --> 00:01:32,070
我们得到的结果并不相同。
for the sentence that is padded.
35
00:01:32,070 --> 00:01:34,440
在这里,也就是第二个句子。
Here, the second one.
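A sketch of that comparison, assuming a TFAutoModelForSequenceClassification head loaded from the same checkpoint:

from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)

# Each sentence on its own, then the padded batch.
print(model(tf.constant([ids[0]])).logits)
print(model(tf.constant([ids[1]])).logits)
print(model(batch).logits)  # the row for the padded (second) sentence differs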
36
00:01:34,440 --> 00:01:36,690
是 Transformers 库出了问题吗?
Is something wrong in the Transformers library?
37
00:01:36,690 --> 00:01:37,620
不。
No.
38
00:01:37,620 --> 00:01:39,720
如果你还记得 Transformer 模型大量使用
If you remember that Transformer models make heavy use
39
00:01:39,720 --> 00:01:43,800
注意力层,那么这就不足为奇了。
of attention layers, it should not come as a total surprise.
40
00:01:43,800 --> 00:01:47,100
在计算每个 token 的上下文表示时,
When computing the contextual representation of each token,
41
00:01:47,100 --> 00:01:49,440
注意力层会查看句子中
the attention layers look at all the other words
42
00:01:49,440 --> 00:01:51,240
所有其他的词。
in the sentence.
43
00:01:51,240 --> 00:01:52,252
如果我们只有这个句子本身,
If we have just a sentence
44
00:01:52,252 --> 00:01:55,650
或者添加了几个填充 token 的句子,
or the sentence with several padding tokens added,
45
00:01:55,650 --> 00:01:57,750
那么得到不同的值也就合乎逻辑了。
it's logical we don't get the same values.
46
00:01:58,830 --> 00:02:01,410
要在有或没有填充的情况下获得相同的结果,
To get the same results with or without padding,
47
00:02:01,410 --> 00:02:03,750
我们需要向注意力层表明
we need to indicate to the attention layers
48
00:02:03,750 --> 00:02:06,660
它们应该忽略那些填充 token。
that they should ignore those padding tokens.
49
00:02:06,660 --> 00:02:08,970
这是通过创建一个注意力掩码来完成的,
This is done by creating an attention mask,
50
00:02:08,970 --> 00:02:11,700
即一个与输入 ID 形状相同的 tensor,
a tensor with the same shape as the input IDs
51
00:02:11,700 --> 00:02:13,173
其中的值由 0 和 1 组成。
with zeros and ones.
52
00:02:14,640 --> 00:02:16,830
值为 1 的位置表示注意力层
Ones indicate the tokens the attention layers
53
00:02:16,830 --> 00:02:18,660
在上下文中应该考虑的 token,
should consider in the context,
54
00:02:18,660 --> 00:02:20,823
值为 0 的位置则表示它们应该忽略的 token。
and zeros, the tokens they should ignore.
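For the hand-padded batch above, the mask would be built like this (a sketch, with 1 for real tokens and 0 for padding):

# Same shape as the input IDs: ones for original tokens, zeros for padding.
attention_mask = tf.constant(
    [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in ids]
)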
55
00:02:21,810 --> 00:02:23,290
现在,将这个注意力掩码
Now, passing this attention mask
56
00:02:23,290 --> 00:02:26,460
连同输入 ID 一起传入,我们就会得到
along with the input IDs will give us the same results
57
00:02:26,460 --> 00:02:29,460
与将两个句子分别传给模型时相同的结果。
as when we sent the two sentences individually to the model.
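Continuing the same sketch:

# With the mask, the padded row matches the result of that sentence sent on its own.
outputs = model(batch, attention_mask=attention_mask)
print(outputs.logits)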
58
00:02:30,870 --> 00:02:33,870
这一切都是由分词器在幕后完成的
This is all done behind the scenes by the tokenizer
59
00:02:33,870 --> 00:02:35,583
只要你在将它应用于多个句子时
when you apply it to several sentences
60
00:02:35,583 --> 00:02:37,713
设置参数 padding=True。
with the flag padding=True.
61
00:02:38,640 --> 00:02:39,690
它会用合适的值
It will apply the padding
62
00:02:39,690 --> 00:02:42,180
对较短的句子进行填充,
with the proper value to the smaller sentences
63
00:02:42,180 --> 00:02:44,373
并创建相应的注意力掩码。
and create the appropriate attention mask.
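Putting it together, a minimal sketch of letting the tokenizer handle all of this (unlike the hand-built example above, it also adds the special tokens the model expects):

# padding=True pads the shorter sentences and builds the attention mask in one call.
inputs = tokenizer(sentences, padding=True, return_tensors="tf")
print(inputs["input_ids"])
print(inputs["attention_mask"])
print(model(**inputs).logits)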