subtitles/zh-CN/09_what-happens-inside-the-pipeline-function-(tensorflow).srt
1
00:00:00,397 --> 00:00:02,980
(微妙的爆炸)
(subtle blast)
2
00:00:05,490 --> 00:00:07,953
- 管道函数内部发生了什么?
- What happens inside the pipeline function?
3
00:00:09,930 --> 00:00:13,050
在这段视频中,我们将看看实际发生了什么
In this video, we will look at what actually happens
4
00:00:13,050 --> 00:00:14,820
当我们使用 Transformers 库的
when we use the pipeline function
5
00:00:14,820 --> 00:00:16,920
pipeline 函数时
of the Transformers library.
6
00:00:16,920 --> 00:00:18,930
更具体地说,我们将看看
More specifically, we will look at
7
00:00:18,930 --> 00:00:21,030
情绪分析管道,
the sentiment analysis pipeline,
8
00:00:21,030 --> 00:00:23,760
以及它是如何从以下两个句子中得出的
and how it went from the two following sentences
9
00:00:23,760 --> 00:00:25,800
正负标签
to the positive and negative labels
10
00:00:25,800 --> 00:00:27,250
加上各自的分数。
with their respective scores.
11
00:00:28,740 --> 00:00:31,110
正如我们在管道视频中看到的那样,
As we have seen in the pipeline video,
12
00:00:31,110 --> 00:00:33,900
管道分为三个阶段。
there are three stages in the pipeline.
13
00:00:33,900 --> 00:00:36,810
首先,我们使用分词器将原始文本转换为
First, we convert the raw texts to numbers
14
00:00:36,810 --> 00:00:39,160
模型能够理解的数字。
the model can make sense of, using a tokenizer.
15
00:00:40,140 --> 00:00:42,600
然后,这些数字通过模型,
Then, those numbers go through the model,
16
00:00:42,600 --> 00:00:44,550
输出 logits 。
*[译者注: logits 指模型输出的未归一化分数, 后不翻译]
which outputs logits.
17
00:00:44,550 --> 00:00:47,190
最后是后处理步骤
Finally, the post-processing steps
18
00:00:47,190 --> 00:00:49,490
将这些 logits 转换为标签和分数。
transform those logits into labels and scores.
19
00:00:51,000 --> 00:00:52,590
让我们详细看看这三个步骤,
Let's look in detail at those three steps,
20
00:00:52,590 --> 00:00:55,200
以及如何使用 Transformers 库复现它们,
and how to replicate them using the Transformers library,
21
00:00:55,200 --> 00:00:57,903
从第一阶段开始,分词化。
beginning with the first stage, tokenization.
22
00:00:59,905 --> 00:01:02,520
分词化过程有几个步骤。
The tokenization process has several steps.
23
00:01:02,520 --> 00:01:06,900
首先,文本被分成称为 token 的小块。
First, the text is split into small chunks called tokens.
24
00:01:06,900 --> 00:01:09,933
它们可以是单词、单词的一部分或标点符号。
They can be words, parts of words or punctuation symbols.
25
00:01:10,800 --> 00:01:14,310
然后,分词器会添加一些特殊的 token
Then the tokenizer will add some special tokens
26
00:01:14,310 --> 00:01:15,573
如果模型需要它们的话。
if the model expects them.
27
00:01:16,440 --> 00:01:20,430
在这里,所使用的模型在开头需要一个 CLS token
Here, the model used expects a CLS token at the beginning
28
00:01:20,430 --> 00:01:23,910
以及用于分类的句子末尾的 SEP token。
and a SEP token at the end of the sentence to classify.
29
00:01:23,910 --> 00:01:27,630
最后,分词器将每个 token 与其唯一的 ID 匹配
Lastly, the tokenizer matches each token to its unique ID
30
00:01:27,630 --> 00:01:29,730
在预训练模型的词汇表中。
in the vocabulary of the pretrained model.
31
00:01:30,660 --> 00:01:32,040
要加载这样的分词器,
To load such a tokenizer,
32
00:01:32,040 --> 00:01:34,983
Transformers 库提供了 AutoTokenizer API。
the Transformers library provides the AutoTokenizer API.
33
00:01:35,880 --> 00:01:39,510
这个类最重要的方法是 from_pretrained,
The most important method of this class is from_pretrained,
34
00:01:39,510 --> 00:01:41,940
这将下载并缓存配置
which will download and cache the configuration
35
00:01:41,940 --> 00:01:44,913
以及与给定 checkpoint 相关联的词汇表。
*[译者注: 在深度学习中, checkpoint 作为检查点是用来备份模型的, 后不翻译]
and the vocabulary associated to a given checkpoint.
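*[Editor's note: the sketch below is not part of the video; it is a minimal, hedged illustration of loading a tokenizer with from_pretrained and of the splitting / special-token / ID-mapping steps described above. The checkpoint is the one named next in the narration, and the sample sentence is a placeholder.]

    from transformers import AutoTokenizer

    # Download and cache the tokenizer configuration and vocabulary of a checkpoint
    tokenizer = AutoTokenizer.from_pretrained(
        "distilbert-base-uncased-finetuned-sst-2-english"
    )

    # Step 1: split the text into tokens (words, parts of words, punctuation)
    tokens = tokenizer.tokenize("I hate this so much!")

    # Step 3: map each token to its unique ID in the vocabulary
    ids = tokenizer.convert_tokens_to_ids(tokens)

    # Calling the tokenizer directly also performs step 2: the special
    # CLS and SEP tokens are added around the sentence.
    print(tokenizer("I hate this so much!")["input_ids"])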
36
00:01:46,410 --> 00:01:48,180
这里,情绪分析 pipeline
Here, the checkpoint used by default
37
00:01:48,180 --> 00:01:50,310
默认使用的 checkpoint
for the sentiment analysis pipeline
38
00:01:50,310 --> 00:01:54,510
是 distilbert-base-uncased-finetuned-sst-2-english,
is distilbert-base-uncased-finetuned-sst-2-english,
39
00:01:54,510 --> 00:01:55,960
这个名字有点拗口。
which is a bit of a mouthful.
40
00:01:56,820 --> 00:01:59,760
我们实例化一个与该 checkpoint 关联的分词器,
We instantiate a tokenizer associated with that checkpoint,
41
00:01:59,760 --> 00:02:01,833
然后给它输入两个句子。
then feed it the two sentences.
42
00:02:02,790 --> 00:02:05,490
由于这两个句子的大小不同,
Since those two sentences are not of the same size,
43
00:02:05,490 --> 00:02:07,440
我们需要填充最短的一个
we will need to pad the shortest one
44
00:02:07,440 --> 00:02:09,570
以便能够构建一个数组。
to be able to build an array.
45
00:02:09,570 --> 00:02:10,403
这是由分词器完成的
This is done by the tokenizer
46
00:02:10,403 --> 00:02:12,603
使用选项 padding=True。
with the option padding=True.
47
00:02:14,130 --> 00:02:17,340
使用 truncation=True,我们确保任何长度超过
With truncation=True, we ensure that any sentence longer
48
00:02:17,340 --> 00:02:19,953
模型所能处理上限的句子都会被截断。
than the maximum the model can handle is truncated.
49
00:02:20,820 --> 00:02:24,200
最后,return_tensors 选项告诉分词器
Lastly, the return_tensors option tells the tokenizer
50
00:02:24,200 --> 00:02:25,773
返回 PyTorch 张量。
to return a PyTorch tensor.
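*[Editor's note: a hedged sketch, not part of the video, of the tokenizer call described above. The two sentences are placeholders; padding, truncation and return_tensors are used exactly as named in the narration (return_tensors="pt" matches the "PyTorch tensor" wording of this transcript).]

    from transformers import AutoTokenizer

    checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)

    raw_inputs = [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
    inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")

    # A dictionary with two keys: "input_ids" (padded with zeros)
    # and "attention_mask" (marking where padding was applied).
    print(inputs)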
51
00:02:26,910 --> 00:02:28,050
看看结果,
Looking at the result,
52
00:02:28,050 --> 00:02:30,450
可以看到,我们得到了一个包含两个键的字典。
we see we have a dictionary with two keys.
53
00:02:30,450 --> 00:02:33,840
input_ids 包含两个句子的 ID,
Input IDs contains the IDs of both sentences,
54
00:02:33,840 --> 00:02:35,840
在应用填充的地方使用零。
with zeros where the padding is applied.
55
00:02:36,750 --> 00:02:38,550
第二个键,注意力掩码 (attention mask),
The second key, attention mask,
56
00:02:38,550 --> 00:02:40,650
指示已应用填充的位置,
indicates where padding has been applied,
57
00:02:40,650 --> 00:02:42,750
所以模型不会关注它。
so the model does not pay attention to it.
58
00:02:43,590 --> 00:02:46,380
这就是分词步骤的全部内容。
This is all that is inside the tokenization step.
59
00:02:46,380 --> 00:02:49,653
现在让我们来看看第二步,模型。
Now let's have a look at the second step, the model.
60
00:02:51,090 --> 00:02:53,850
至于分词器,有一个 AutoModel API,
As for the tokenizer, there is an AutoModel API,
61
00:02:53,850 --> 00:02:55,890
使用 from_pretrained 方法。
with a from_pretrained method.
62
00:02:55,890 --> 00:02:59,100
它将下载并缓存模型的配置
It will download and cache the configuration of the model
63
00:02:59,100 --> 00:03:01,560
以及预训练的权重。
as well as the pretrained weights.
64
00:03:01,560 --> 00:03:04,830
但是,AutoModel API 只会实例化
However, the AutoModel API will only instantiate
65
00:03:04,830 --> 00:03:06,540
模型的主体,
the body of the model,
66
00:03:06,540 --> 00:03:09,120
也就是说,移除预训练头之后
that is, the part of the model that is left
67
00:03:09,120 --> 00:03:11,103
模型中剩下的部分。
once the pretraining head is removed.
68
00:03:12,210 --> 00:03:14,460
它会输出一个高维张量
It will output a high-dimensional tensor
69
00:03:14,460 --> 00:03:17,190
它是输入句子的一种表示,
that is a representation of the sentences passed,
70
00:03:17,190 --> 00:03:18,930
但它对于我们的分类问题
but which is not directly useful
71
00:03:18,930 --> 00:03:20,480
并不直接有用。
for our classification problem.
72
00:03:21,930 --> 00:03:24,210
这里张量有两个句子,
Here the tensor has two sentences,
73
00:03:24,210 --> 00:03:26,070
每个句子有十六个 token,
each of sixteen tokens,
74
00:03:26,070 --> 00:03:30,393
最后一个维度是我们模型的隐藏大小,768。
and the last dimension is the hidden size of our model, 768.
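*[Editor's note: a hedged sketch, not part of the video, of the model step described above; it reuses `checkpoint` and `inputs` from the tokenizer sketch.]

    from transformers import AutoModel

    model = AutoModel.from_pretrained(checkpoint)
    outputs = model(**inputs)

    # High-dimensional representation: (batch size, sequence length, hidden size),
    # e.g. [2, 16, 768] for the two sentences in the video.
    print(outputs.last_hidden_state.shape)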
75
00:03:31,620 --> 00:03:34,020
要获得与我们的分类问题相关的输出,
To get an output linked to our classification problem,
76
00:03:34,020 --> 00:03:37,800
我们需要使用 AutoModelForSequenceClassification 类。
we need to use the AutoModelForSequenceClassification class.
77
00:03:37,800 --> 00:03:40,170
它与 AutoModel 类完全一样工作,
It works exactly as the AutoModel class,
78
00:03:40,170 --> 00:03:41,970
不同之处在于它构建的模型
except that it will build a model
79
00:03:41,970 --> 00:03:43,353
带有一个分类头。
with a classification head.
80
00:03:44,520 --> 00:03:46,770
每个常见的 NLP 任务在 Transformers 库
There is one auto class for each common NLP task
81
00:03:46,770 --> 00:03:48,170
都有一个自动类
in the Transformers library.
82
00:03:49,050 --> 00:03:52,380
在这里,把这两个句子交给模型之后,
Here, after giving our model the two sentences,
83
00:03:52,380 --> 00:03:54,600
我们得到一个大小为二乘二的张量;
we get a tensor of size two by two;
84
00:03:54,600 --> 00:03:57,783
每个句子和每个可能的标签都有一个结果。
one result for each sentence and for each possible label.
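*[Editor's note: a hedged sketch, not part of the video, of the sequence classification model described above; it reuses `checkpoint` and `inputs` from the tokenizer sketch.]

    from transformers import AutoModelForSequenceClassification

    model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
    outputs = model(**inputs)

    # One logit per sentence and per label: a two-by-two tensor, not probabilities yet.
    print(outputs.logits)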
85
00:03:59,100 --> 00:04:01,470
这些输出还不是概率。
Those outputs are not probabilities yet.
86
00:04:01,470 --> 00:04:03,660
我们可以看到它们的总和不为 1。
We can see they don't sum to 1.
87
00:04:03,660 --> 00:04:06,090
这是因为 Transformers 库的每个模型
This is because each model of the Transformers library
88
00:04:06,090 --> 00:04:07,830
返回 logits 。
returns logits.
89
00:04:07,830 --> 00:04:09,480
为了理解这些 logits ,
To make sense of those logits,
90
00:04:09,480 --> 00:04:10,980
我们需要深入研究管道的第三步,
we need to dig into the third
91
00:04:10,980 --> 00:04:13,653
也是最后一步:后处理。
and last step of the pipeline, post-processing.
92
00:04:15,300 --> 00:04:17,310
要将 logits 转换为概率,
To convert logits into probabilities,
93
00:04:17,310 --> 00:04:19,950
我们需要对它们应用一个 SoftMax 层。
we need to apply a SoftMax layer to them.
94
00:04:19,950 --> 00:04:22,800
正如我们所见,这会将它们转换为正数
As we can see, this transforms them into positive numbers
95
00:04:22,800 --> 00:04:23,793
并且总和为 1。
that sum up to 1.
96
00:04:24,990 --> 00:04:27,030
最后一步是知道哪些对应
The last step is to know which of those corresponds
97
00:04:27,030 --> 00:04:29,400
正面或负面的标签。
to the positive or the negative label.
98
00:04:29,400 --> 00:04:33,480
这是由模型配置的 id2label 字段给出的。
This is given by the id2label field of the model config.
99
00:04:33,480 --> 00:04:36,000
第一个概率,索引 0,
The first probabilities, index 0,
100
00:04:36,000 --> 00:04:37,740
对应负面标签,
correspond to the negative label,
101
00:04:37,740 --> 00:04:42,060
第二个概率,索引 1,对应正面标签。
and the second ones, index 1, correspond to the positive label.
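*[Editor's note: a hedged sketch, not part of the video, of the post-processing described above; it reuses `model` and `outputs` from the classification sketch and uses PyTorch's softmax, matching this transcript's PyTorch wording.]

    import torch

    # SoftMax turns the logits into positive numbers that sum to 1 for each sentence
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    print(predictions)

    # id2label maps index 0 to the negative label and index 1 to the positive label
    print(model.config.id2label)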
102
00:04:42,060 --> 00:04:43,830
这就是我们用 pipeline 函数构建的分类器
This is how our classifier, built
103
00:04:43,830 --> 00:04:46,260
如何选出这些标签
with the pipeline function, picked those labels
104
00:04:46,260 --> 00:04:47,560
并计算出这些分数的。
and computed those scores.
105
00:04:48,420 --> 00:04:50,400
既然你知道每个步骤是如何工作的,
Now that you know how each step works,
106
00:04:50,400 --> 00:04:52,533
你可以轻松地根据需要调整它们。
you can easily tweak them to your needs.
107
00:04:55,314 --> 00:04:57,897
(微妙的爆炸)
(subtle blast)