1
00:00:00,397 --> 00:00:02,980
(subtle blast)

2
00:00:05,490 --> 00:00:07,953
- What happens inside the pipeline function?

3
00:00:09,930 --> 00:00:13,050
In this video, we will look at what actually happens

4
00:00:13,050 --> 00:00:14,820
when we use the pipeline function

5
00:00:14,820 --> 00:00:16,920
of the Transformers library.

6
00:00:16,920 --> 00:00:18,930
More specifically, we will look at

7
00:00:18,930 --> 00:00:21,030
the sentiment analysis pipeline,

8
00:00:21,030 --> 00:00:23,760
and how it went from the two following sentences

9
00:00:23,760 --> 00:00:25,800
to the positive and negative labels

10
00:00:25,800 --> 00:00:27,250
with their respective scores.

11
00:00:28,740 --> 00:00:31,110
As we have seen in the pipeline video,

12
00:00:31,110 --> 00:00:33,900
there are three stages in the pipeline.

13
00:00:33,900 --> 00:00:36,810
First, we convert the raw texts to numbers

14
00:00:36,810 --> 00:00:39,160
the model can make sense of, using a tokenizer.

15
00:00:40,140 --> 00:00:42,600
Then, those numbers go through the model,

16
00:00:42,600 --> 00:00:44,550
which outputs logits.

17
00:00:44,550 --> 00:00:47,190
Finally, the post-processing step

18
00:00:47,190 --> 00:00:49,490
transforms those logits into labels and scores.

19
00:00:51,000 --> 00:00:52,590
Let's look in detail at those three steps,

20
00:00:52,590 --> 00:00:55,200
and how to replicate them using the Transformers library,

21
00:00:55,200 --> 00:00:57,903
beginning with the first stage, tokenization.

22
00:00:59,905 --> 00:01:02,520
The tokenization process has several steps.

23
00:01:02,520 --> 00:01:06,900
First, the text is split into small chunks called tokens.

24
00:01:06,900 --> 00:01:09,933
They can be words, parts of words, or punctuation symbols.

25
00:01:10,800 --> 00:01:14,310
Then the tokenizer will add some special tokens,

26
00:01:14,310 --> 00:01:15,573
if the model expects them.

27
00:01:16,440 --> 00:01:20,430
Here, the model used expects a CLS token at the beginning

28
00:01:20,430 --> 00:01:23,910
and a SEP token at the end of the sentence to classify.

29
00:01:23,910 --> 00:01:27,630
Lastly, the tokenizer matches each token to its unique ID

30
00:01:27,630 --> 00:01:29,730
in the vocabulary of the pretrained model.

31
00:01:30,660 --> 00:01:32,040
To load such a tokenizer,

32
00:01:32,040 --> 00:01:34,983
the Transformers library provides the AutoTokenizer API.

33
00:01:35,880 --> 00:01:39,510
The most important method of this class is from_pretrained,

34
00:01:39,510 --> 00:01:41,940
which will download and cache the configuration

35
00:01:41,940 --> 00:01:44,913
and the vocabulary associated with a given checkpoint.
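[Note: a minimal Python sketch of the tokenizer loading described above, assuming the transformers package is installed; the checkpoint is the sentiment-analysis default named in the next cues.]

from transformers import AutoTokenizer

# from_pretrained downloads and caches the configuration and vocabulary
# associated with the given checkpoint.
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)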
36
00:01:46,410 --> 00:01:48,180
Here, the checkpoint used by default

37
00:01:48,180 --> 00:01:50,310
for the sentiment analysis pipeline

38
00:01:50,310 --> 00:01:54,510
is distilbert-base-uncased-finetuned-sst-2-english,

39
00:01:54,510 --> 00:01:55,960
which is a bit of a mouthful.

40
00:01:56,820 --> 00:01:59,760
We instantiate a tokenizer associated with that checkpoint,

41
00:01:59,760 --> 00:02:01,833
then feed it the two sentences.

42
00:02:02,790 --> 00:02:05,490
Since those two sentences are not of the same size,

43
00:02:05,490 --> 00:02:07,440
we will need to pad the shortest one

44
00:02:07,440 --> 00:02:09,570
to be able to build an array.

45
00:02:09,570 --> 00:02:10,403
This is done by the tokenizer

46
00:02:10,403 --> 00:02:12,603
with the option padding=True.

47
00:02:14,130 --> 00:02:17,340
With truncation=True, we ensure that any sentence longer

48
00:02:17,340 --> 00:02:19,953
than the maximum the model can handle is truncated.

49
00:02:20,820 --> 00:02:24,200
Lastly, the return_tensors option tells the tokenizer

50
00:02:24,200 --> 00:02:25,773
to return PyTorch tensors.

51
00:02:26,910 --> 00:02:28,050
Looking at the result,

52
00:02:28,050 --> 00:02:30,450
we see we have a dictionary with two keys.

53
00:02:30,450 --> 00:02:33,840
The input IDs contain the IDs of both sentences,

54
00:02:33,840 --> 00:02:35,840
with zeros where the padding is applied.

55
00:02:36,750 --> 00:02:38,550
The second key, the attention mask,

56
00:02:38,550 --> 00:02:40,650
indicates where padding has been applied,

57
00:02:40,650 --> 00:02:42,750
so the model does not pay attention to it.

58
00:02:43,590 --> 00:02:46,380
This is all that happens inside the tokenization step.

59
00:02:46,380 --> 00:02:49,653
Now let's have a look at the second step, the model.

60
00:02:51,090 --> 00:02:53,850
As for the tokenizer, there is an AutoModel API,

61
00:02:53,850 --> 00:02:55,890
with a from_pretrained method.

62
00:02:55,890 --> 00:02:59,100
It will download and cache the configuration of the model

63
00:02:59,100 --> 00:03:01,560
as well as the pretrained weights.

64
00:03:01,560 --> 00:03:04,830
However, the AutoModel API will only instantiate

65
00:03:04,830 --> 00:03:06,540
the body of the model,

66
00:03:06,540 --> 00:03:09,120
that is, the part of the model that is left

67
00:03:09,120 --> 00:03:11,103
once the pretraining head is removed.

68
00:03:12,210 --> 00:03:14,460
It will output a high-dimensional tensor

69
00:03:14,460 --> 00:03:17,190
that is a representation of the sentences passed,

70
00:03:17,190 --> 00:03:18,930
but which is not directly useful

71
00:03:18,930 --> 00:03:20,480
for our classification problem.

72
00:03:21,930 --> 00:03:24,210
Here the tensor has two sentences,

73
00:03:24,210 --> 00:03:26,070
each of sixteen tokens,

74
00:03:26,070 --> 00:03:30,393
and the last dimension is the hidden size of our model, 768.
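[Note: a minimal sketch of the tokenization call and the bare AutoModel described above, assuming the transformers and torch packages are installed; the two sentences are placeholders, since the transcript does not quote the ones shown on screen.]

import torch
from transformers import AutoTokenizer, AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

raw_inputs = [
    "I really enjoyed this course.",
    "I hate this so much!",
]

# Pad the shorter sentence, truncate anything longer than the model's
# maximum length, and return PyTorch tensors.
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs["input_ids"])       # IDs of both sentences, with zeros where padded
print(inputs["attention_mask"])  # 0 wherever padding was applied

# The bare AutoModel only instantiates the body of the model and returns
# hidden states rather than classification scores.
model = AutoModel.from_pretrained(checkpoint)
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)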
75
00:03:31,620 --> 00:03:34,020
To get an output linked to our classification problem,

76
00:03:34,020 --> 00:03:37,800
we need to use the AutoModelForSequenceClassification class.

77
00:03:37,800 --> 00:03:40,170
It works exactly like the AutoModel class,

78
00:03:40,170 --> 00:03:41,970
except that it will build a model

79
00:03:41,970 --> 00:03:43,353
with a classification head.

80
00:03:44,520 --> 00:03:46,770
There is one auto class for each common NLP task

81
00:03:46,770 --> 00:03:48,170
in the Transformers library.

82
00:03:49,050 --> 00:03:52,380
Here, after giving our model the two sentences,

83
00:03:52,380 --> 00:03:54,600
we get a tensor of size two by two:

84
00:03:54,600 --> 00:03:57,783
one result for each sentence and for each possible label.

85
00:03:59,100 --> 00:04:01,470
Those outputs are not probabilities yet.

86
00:04:01,470 --> 00:04:03,660
We can see they don't sum to 1.

87
00:04:03,660 --> 00:04:06,090
This is because each model of the Transformers library

88
00:04:06,090 --> 00:04:07,830
returns logits.

89
00:04:07,830 --> 00:04:09,480
To make sense of those logits,

90
00:04:09,480 --> 00:04:10,980
we need to dig into the third

91
00:04:10,980 --> 00:04:13,653
and last step of the pipeline, post-processing.

92
00:04:15,300 --> 00:04:17,310
To convert logits into probabilities,

93
00:04:17,310 --> 00:04:19,950
we need to apply a SoftMax layer to them.

94
00:04:19,950 --> 00:04:22,800
As we can see, this transforms them into positive numbers

95
00:04:22,800 --> 00:04:23,793
that sum up to 1.

96
00:04:24,990 --> 00:04:27,030
The last step is to know which of those corresponds

97
00:04:27,030 --> 00:04:29,400
to the positive or the negative label.

98
00:04:29,400 --> 00:04:33,480
This is given by the id2label field of the model config.

99
00:04:33,480 --> 00:04:36,000
The first probabilities, at index 0,

100
00:04:36,000 --> 00:04:37,740
correspond to the negative label,

101
00:04:37,740 --> 00:04:42,060
and the second ones, at index 1, correspond to the positive label.

102
00:04:42,060 --> 00:04:43,830
This is how our classifier, built

103
00:04:43,830 --> 00:04:46,260
with the pipeline function, picked those labels

104
00:04:46,260 --> 00:04:47,560
and computed those scores.

105
00:04:48,420 --> 00:04:50,400
Now that you know how each step works,

106
00:04:50,400 --> 00:04:52,533
you can easily tweak them to your needs.

107
00:04:55,314 --> 00:04:57,897
(subtle blast)
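[Note: a minimal sketch of the classification-head model and the post-processing described above, under the same assumptions as the previous snippet; the sentences are again placeholders.]

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

inputs = tokenizer(
    ["I really enjoyed this course.", "I hate this so much!"],
    padding=True, truncation=True, return_tensors="pt",
)
with torch.no_grad():
    # One row per sentence, one column per label: shape (2, 2).
    logits = model(**inputs).logits

# The logits are not probabilities; a softmax turns each row into
# positive numbers that sum to 1.
probabilities = torch.nn.functional.softmax(logits, dim=-1)

# The model config maps each column index to its label,
# e.g. {0: "NEGATIVE", 1: "POSITIVE"} for this checkpoint.
print(model.config.id2label)
print(probabilities)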