subtitles/zh-CN/08_what-happens-inside-the-pipeline-function-(pytorch).srt
1
00:00:00,554 --> 00:00:03,304
(徽标呼啸而过)
(logo whooshing)
2
00:00:05,340 --> 00:00:07,563
- pipeline 函数内部发生了什么?
*[译者注: pipeline 作为 流水线 的意思]
- What happens inside the pipeline function?
3
00:00:08,760 --> 00:00:11,580
在这段视频中,我们将看看当我们使用
In this video, we will look at what actually happens
4
00:00:11,580 --> 00:00:13,080
Transformers 库的 pipeline 函数时,
when we use the pipeline function
5
00:00:13,080 --> 00:00:15,090
实际发生了什么。
of the Transformers library.
6
00:00:15,090 --> 00:00:16,860
更具体地说,我们将看看
More specifically, we will look
7
00:00:16,860 --> 00:00:19,200
情绪分析 pipeline
at the sentiment analysis pipeline,
8
00:00:19,200 --> 00:00:22,020
是如何从下面这两个句子
and how it went from the two following sentences,
9
00:00:22,020 --> 00:00:23,970
得到正面和负面标签
to the positive and negative labels
10
00:00:23,970 --> 00:00:25,420
以及各自分数的。
with their respective scores.
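[补充示例] 下面是一个最小的示意代码,对应此处描述的起点;两个示例句子只是假设的占位内容,不一定是视频中的原句。
[Supplementary sketch] A minimal illustration of the starting point described here; the two sentences are assumed placeholders, not necessarily the exact ones from the video.

```python
from transformers import pipeline

# Sentiment-analysis pipeline; its default checkpoint is the one named later
# in the video (distilbert-base-uncased-finetuned-sst-2-english).
classifier = pipeline("sentiment-analysis")

# Two illustrative sentences (assumed placeholders).
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

# Each sentence comes back with a label (POSITIVE / NEGATIVE) and a score.
print(classifier(raw_inputs))
```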
11
00:00:26,760 --> 00:00:29,190
正如我们在 pipeline 展示中看到的那样,
As we have seen in the pipeline presentation,
12
00:00:29,190 --> 00:00:31,860
pipeline 分为三个阶段。
there are three stages in the pipeline.
13
00:00:31,860 --> 00:00:34,620
首先,我们使用分词器将原始文本
First, we convert the raw texts to numbers
14
00:00:34,620 --> 00:00:37,173
转换为模型可以理解的数字。
the model can make sense of using a tokenizer.
15
00:00:38,010 --> 00:00:40,530
然后这些数字通过模型,
Then those numbers go through the model,
16
00:00:40,530 --> 00:00:41,943
输出 logits 。
which outputs logits.
17
00:00:42,780 --> 00:00:45,600
最后,后处理步骤将这些 logits
Finally, the post-processing step transforms
18
00:00:45,600 --> 00:00:48,150
转换为标签和分数。
those logits into labels and scores.
19
00:00:48,150 --> 00:00:50,700
让我们详细看看这三个步骤
Let's look in detail at those three steps
20
00:00:50,700 --> 00:00:53,640
以及如何使用 Transformers 库复制它们,
and how to replicate them using the Transformers library,
21
00:00:53,640 --> 00:00:56,043
从第一个阶段,即分词化开始。
beginning with the first stage, tokenization.
22
00:00:57,915 --> 00:01:00,360
分词化过程有几个步骤。
The tokenization process has several steps.
23
00:01:00,360 --> 00:01:04,950
首先,文本被分成小块,称为 token。
*[译者注: 后面 token-* 均翻译成 分词-*]
First, the text is split into small chunks called tokens.
24
00:01:04,950 --> 00:01:08,550
它们可以是单词、单词的一部分或标点符号。
They can be words, parts of words or punctuation symbols.
25
00:01:08,550 --> 00:01:11,580
然后分词器会添加一些特殊的 token,
Then the tokenizer will add some special tokens,
26
00:01:11,580 --> 00:01:13,500
如果模型需要它们的话。
if the model expects them.
27
00:01:13,500 --> 00:01:16,860
这里的模型期望在句子开头有一个 CLS token
Here the model expects a CLS token at the beginning
28
00:01:16,860 --> 00:01:19,743
以及在待分类句子的末尾有一个 SEP token。
and a SEP token at the end of the sentence to classify.
29
00:01:20,580 --> 00:01:24,180
最后,分词器将每个 token 匹配到
Lastly, the tokenizer matches each token to its unique ID
30
00:01:24,180 --> 00:01:27,000
预训练模型词汇表中对应的唯一 ID。
in the vocabulary of the pretrained model.
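[补充示例] 下面的草图按上述三个子步骤演示分词过程,使用的是紧接着介绍的 AutoTokenizer API;示例句子为假设的占位内容,checkpoint 采用稍后提到的默认值。
[Supplementary sketch] A sketch of the three tokenization sub-steps above, using the AutoTokenizer API introduced just below; the example sentence is an assumed placeholder and the checkpoint is the default one named later.

```python
from transformers import AutoTokenizer

# Default checkpoint mentioned a few lines below.
tokenizer = AutoTokenizer.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)

sequence = "I hate this so much!"  # illustrative sentence

# 1. Split the text into tokens (words, sub-words, punctuation).
tokens = tokenizer.tokenize(sequence)

# 3. Map each token to its unique ID in the pretrained vocabulary.
ids = tokenizer.convert_tokens_to_ids(tokens)

# 2. Special tokens ([CLS] at the start, [SEP] at the end for this model)
#    are added automatically when the tokenizer is called directly.
encoded = tokenizer(sequence)

print(tokens)
print(ids)
print(encoded["input_ids"])
```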
31
00:01:27,000 --> 00:01:28,680
要加载这样的分词器,
To load such a tokenizer,
32
00:01:28,680 --> 00:01:31,743
Transformers 库提供了 AutoTokenizer API。
the Transformers library provides the AutoTokenizer API.
33
00:01:32,730 --> 00:01:36,120
这个类最重要的方法是 from_pretrained,
The most important method of this class is from_pretrained,
34
00:01:36,120 --> 00:01:38,910
这将下载并缓存配置
which will download and cache the configuration
35
00:01:38,910 --> 00:01:41,853
以及与给定检查点相关联的词汇表。
and the vocabulary associated to a given checkpoint.
36
00:01:43,200 --> 00:01:45,360
这里情绪分析 pipeline
Here the checkpoint used by default
37
00:01:45,360 --> 00:01:47,280
默认使用的 checkpoint
for the sentiment analysis pipeline
38
00:01:47,280 --> 00:01:51,986
是 distilbert-base-uncased-finetuned-sst-2-english。
is distilbert-base-uncased-finetuned-sst-2-english.
39
00:01:51,986 --> 00:01:53,700
(模糊)
(indistinct)
40
00:01:53,700 --> 00:01:56,490
我们实例化一个与该检查点关联的分词器,
We instantiate a tokenizer associated with that checkpoint,
41
00:01:56,490 --> 00:01:59,490
然后给它输入两个句子。
then feed it the two sentences.
42
00:01:59,490 --> 00:02:02,100
由于这两个句子的大小不同,
Since those two sentences are not of the same size,
43
00:02:02,100 --> 00:02:03,930
我们需要对较短的那个句子进行填充
we will need to pad the shortest one
44
00:02:03,930 --> 00:02:06,030
才能构建出一个数组。
to be able to build an array.
45
00:02:06,030 --> 00:02:09,840
这是由分词器使用 padding=True 选项完成的。
This is done by the tokenizer with the option, padding=True.
46
00:02:09,840 --> 00:02:12,810
使用 truncation=True,我们可以确保
With truncation=True, we ensure that any sentence
47
00:02:12,810 --> 00:02:15,873
任何超过模型可处理最大长度的句子都会被截断。
longer than the maximum the model can handle is truncated.
48
00:02:17,010 --> 00:02:19,620
最后, return_tensors 选项
Lastly, the return_tensors option
49
00:02:19,620 --> 00:02:22,323
告诉分词器返回一个 PyTorch 张量。
tells the tokenizer to return a PyTorch tensor.
50
00:02:23,190 --> 00:02:25,590
查看结果,我们看到我们得到了一个字典,
Looking at the result, we see we have a dictionary
51
00:02:25,590 --> 00:02:26,670
它有两个键。
with two keys.
52
00:02:26,670 --> 00:02:29,970
input_ids 键包含两个句子的 ID,
Input IDs contains the IDs of both sentences,
53
00:02:29,970 --> 00:02:32,550
应用填充的位置为零。
with zero where the padding is applied.
54
00:02:32,550 --> 00:02:34,260
第二个键,注意力掩码 (attention mask),
The second key, attention mask,
55
00:02:34,260 --> 00:02:36,150
指示已应用填充的位置,
indicates where padding has been applied,
56
00:02:36,150 --> 00:02:38,940
所以模型不会关注它。
so the model does not pay attention to it.
57
00:02:38,940 --> 00:02:42,090
这就是分词化步骤中的全部内容。
This is all there is inside the tokenization step.
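[补充示例] 下面是整个分词步骤的一个示意实现,假设输入为前面提到的两个占位句子。
[Supplementary sketch] A sketch of the whole tokenization step, assuming the two placeholder sentences used earlier.

```python
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Assumed example sentences (placeholders).
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

# padding=True pads the shorter sentence, truncation=True cuts any sentence
# longer than the model's maximum length, return_tensors="pt" gives PyTorch tensors.
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")

# A dict with "input_ids" (0 where padding was applied) and "attention_mask"
# (0 at padded positions, so the model ignores them).
print(inputs)
```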
58
00:02:42,090 --> 00:02:46,289
现在,让我们来看看第二步,模型。
Now, let's have a look at the second step, the model.
59
00:02:46,289 --> 00:02:47,952
和分词器一样,
As for the tokenizer,
60
00:02:47,952 --> 00:02:51,133
有一个带有 from_pretrained 方法的 AutoModel API。
there is an AutoModel API with a from_pretrained method.
61
00:02:51,133 --> 00:02:53,954
它将下载并缓存模型的配置
It will download and cache the configuration of the model
62
00:02:53,954 --> 00:02:56,280
以及预训练的权重。
as well as the pretrained weights.
63
00:02:56,280 --> 00:02:58,200
然而,AutoModel API
However, the AutoModel API
64
00:02:58,200 --> 00:03:00,630
只会实例化模型的主体,
will only instantiate the body of the model,
65
00:03:00,630 --> 00:03:03,420
也就是移除预训练头之后
that is the part of the model that is left
66
00:03:03,420 --> 00:03:06,090
模型所剩下的那部分。
once the pretraining head is removed.
67
00:03:06,090 --> 00:03:08,610
它会输出一个高维张量
It will output a high-dimensional tensor
68
00:03:08,610 --> 00:03:11,220
它是所传入句子的一种表示,
that is a representation of the sentences passed,
69
00:03:11,220 --> 00:03:12,690
但它对于我们的分类问题
but which is not directly useful
70
00:03:12,690 --> 00:03:15,030
并不能直接使用。
for our classification problem.
71
00:03:15,030 --> 00:03:19,230
这里的张量有两个句子,每个句子有 16 个 token ,
Here the tensor has two sentences, each of 16 tokens,
72
00:03:19,230 --> 00:03:23,433
最后一个维度是我们模型的隐藏大小,768。
and the last dimension is the hidden size of our model, 768.
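[补充示例] 下面的草图演示 AutoModel 的用法以及输出隐藏状态的形状;句子仍为假设的占位内容。
[Supplementary sketch] A sketch of using AutoModel and inspecting the shape of the hidden states; the sentences are still assumed placeholders.

```python
import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# The model body returns hidden states of shape
# (batch_size, sequence_length, hidden_size), e.g. [2, 16, 768] here.
print(outputs.last_hidden_state.shape)
```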
73
00:03:24,900 --> 00:03:27,510
要获得与我们的分类问题相关的输出,
To get an output linked to our classification problem,
74
00:03:27,510 --> 00:03:31,170
我们需要使用 AutoModelForSequenceClassification 类。
we need to use the AutoModelForSequenceClassification class.
75
00:03:31,170 --> 00:03:33,330
它的工作方式与 AutoModel 类完全相同,
It works exactly as the AutoModel class,
76
00:03:33,330 --> 00:03:35,130
不同之处在于它会构建一个
except that it will build a model
77
00:03:35,130 --> 00:03:36,543
带分类头的模型。
with a classification head.
78
00:03:37,483 --> 00:03:39,560
每个常见的 NLP 任务在 Transformers 库中
There is one auto class for each common NLP task
79
00:03:39,560 --> 00:03:40,960
都有一个自动类
in the Transformers library.
80
00:03:42,150 --> 00:03:45,570
这里,在将两个句子输入给模型之后,
Here after giving our model the two sentences,
81
00:03:45,570 --> 00:03:47,820
我们得到一个大小为二乘二的张量,
we get a tensor of size two by two,
82
00:03:47,820 --> 00:03:50,943
每个句子和每个可能的标签都有一个结果。
one result for each sentence and for each possible label.
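[补充示例] 下面演示如何用 AutoModelForSequenceClassification 得到二乘二的 logits 张量;句子为假设的占位内容。
[Supplementary sketch] How AutoModelForSequenceClassification yields the two-by-two logits tensor; the sentences are assumed placeholders.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One raw logit per label and per sentence: a tensor of shape (2, 2).
# These are logits, not probabilities; they do not sum to 1.
print(outputs.logits)
```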
83
00:03:51,840 --> 00:03:53,970
这些输出还不是概率,
Those outputs are not probabilities yet,
84
00:03:53,970 --> 00:03:56,100
我们可以看到它们的总和不为 1。
we can see they don't sum to 1.
85
00:03:56,100 --> 00:03:57,270
这是因为 Transformers 库中
This is because each model
86
00:03:57,270 --> 00:04:00,810
每个模型都会返回 logits 。
of the Transformers library returns logits.
87
00:04:00,810 --> 00:04:02,250
为了理解这些 logits ,
To make sense of those logits,
88
00:04:02,250 --> 00:04:05,910
我们需要深入研究 pipeline 的第三步,也就是最后一步。
we need to dig into the third and last step of the pipeline.
89
00:04:05,910 --> 00:04:10,620
即后处理。要将 logits 转换为概率,
Post-processing, to convert logits into probabilities,
90
00:04:10,620 --> 00:04:13,470
我们需要对它们应用 SoftMax 层。
we need to apply a SoftMax layer to them.
91
00:04:13,470 --> 00:04:14,610
正如我们所见,
As we can see,
92
00:04:14,610 --> 00:04:17,267
这会将它们转换为正数
this transforms them into positive numbers
93
00:04:17,267 --> 00:04:18,663
且总和为 1。
that sum up to one.
94
00:04:18,663 --> 00:04:21,360
最后一步是要知道其中哪些对应
The last step is to know which of those corresponds
95
00:04:21,360 --> 00:04:23,580
正面或负面的标签。
to the positive or the negative label.
96
00:04:23,580 --> 00:04:28,020
这是由模型配置的 id2label 字段给出的。
This is given by the id2label field of the model config.
97
00:04:28,020 --> 00:04:30,390
第一个概率,即索引 0,
The first probabilities, index zero,
98
00:04:30,390 --> 00:04:32,250
对应负面标签,
correspond to the negative label,
99
00:04:32,250 --> 00:04:34,140
第二个,即索引 1,
and the second, index one,
100
00:04:34,140 --> 00:04:36,480
对应正面标签。
correspond to the positive label.
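[补充示例] 下面是后处理步骤的一个示意:先做 SoftMax,再用 id2label 把索引对应到标签;句子为假设的占位内容。
[Supplementary sketch] A sketch of the post-processing step: apply SoftMax, then map indices to labels with id2label; the sentences are assumed placeholders.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

inputs = tokenizer(
    ["I've been waiting for a HuggingFace course my whole life.",
     "I hate this so much!"],
    padding=True, truncation=True, return_tensors="pt",
)
with torch.no_grad():
    logits = model(**inputs).logits

# SoftMax turns the logits into positive numbers that sum to 1 per sentence.
predictions = torch.nn.functional.softmax(logits, dim=-1)
print(predictions)

# id2label maps each index to its label, e.g. {0: 'NEGATIVE', 1: 'POSITIVE'}.
print(model.config.id2label)
```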
101
00:04:36,480 --> 00:04:37,950
这就是我们用 pipeline 函数构建的分类器
This is how our classifier, built
102
00:04:37,950 --> 00:04:40,230
选出了这些标签
with the pipeline function, picked those labels
103
00:04:40,230 --> 00:04:42,240
并计算出这些分数的方式。
and computed those scores.
104
00:04:42,240 --> 00:04:44,220
既然你知道每个步骤是如何工作的,
Now that you know how each step works,
105
00:04:44,220 --> 00:04:46,220
你可以轻松地根据需要调整它们。
you can easily tweak them to your needs.
106
00:04:47,524 --> 00:04:50,274
(徽标呼啸而过)
(logo whooshing)