subtitles/zh-CN/08_what-happens-inside-the-pipeline-function-(pytorch).srt

1
00:00:00,554 --> 00:00:03,304
(徽标呼啸而过)
(logo whooshing)

2
00:00:05,340 --> 00:00:07,563
- pipeline 函数内部发生了什么? *[译者注: pipeline 作为 流水线 的意思]
- What happens inside the pipeline function?

3
00:00:08,760 --> 00:00:11,580
在这段视频中,我们将看看实际发生了什么
In this video, we will look at what actually happens

4
00:00:11,580 --> 00:00:13,080
当我们使用 Transformers 库的
when we use the pipeline function

5
00:00:13,080 --> 00:00:15,090
pipeline 函数时。
of the Transformers library.

6
00:00:15,090 --> 00:00:16,860
更具体地说,我们将看看
More specifically, we will look

7
00:00:16,860 --> 00:00:19,200
情绪分析的 pipeline,
at the sentiment analysis pipeline,

8
00:00:19,200 --> 00:00:22,020
以及它是如何从下面两个句子,
and how it went from the two following sentences,

9
00:00:22,020 --> 00:00:23,970
得到正面和负面标签
to the positive and negative labels

10
00:00:23,970 --> 00:00:25,420
以及各自的分数的。
with their respective scores.

11
00:00:26,760 --> 00:00:29,190
正如我们在 pipeline 介绍中看到的那样,
As we have seen in the pipeline presentation,

12
00:00:29,190 --> 00:00:31,860
pipeline 分为三个阶段。
there are three stages in the pipeline.

13
00:00:31,860 --> 00:00:34,620
首先,我们使用分词器把原始文本转换成
First, we convert the raw texts to numbers

14
00:00:34,620 --> 00:00:37,173
模型能够理解的数字。
the model can make sense of, using a tokenizer.

15
00:00:38,010 --> 00:00:40,530
然后这些数字通过模型,
Then those numbers go through the model,

16
00:00:40,530 --> 00:00:41,943
模型会输出 logits。
which outputs logits.

17
00:00:42,780 --> 00:00:45,600
最后,后处理步骤
Finally, the post-processing step transforms

18
00:00:45,600 --> 00:00:48,150
将这些 logits 转换为标签和分数。
those logits into labels and scores.

19
00:00:48,150 --> 00:00:50,700
让我们详细看看这三个步骤,
Let's look in detail at those three steps

20
00:00:50,700 --> 00:00:53,640
以及如何使用 Transformers 库复制它们,
and how to replicate them using the Transformers library,

21
00:00:53,640 --> 00:00:56,043
从第一阶段开始:分词化。
beginning with the first stage, tokenization.

22
00:00:57,915 --> 00:01:00,360
分词化过程有几个步骤。
The tokenization process has several steps.

23
00:01:00,360 --> 00:01:04,950
首先,文本被分成称为 token 的小块。 *[译者注: 后面 token-* 均翻译成 分词-*]
First, the text is split into small chunks called tokens.

24
00:01:04,950 --> 00:01:08,550
它们可以是单词、单词的一部分或标点符号。
They can be words, parts of words or punctuation symbols.

25
00:01:08,550 --> 00:01:11,580
然后分词器会添加一些特殊的 token,
Then the tokenizer will add some special tokens,

26
00:01:11,580 --> 00:01:13,500
如果模型需要它们的话。
if the model expects them.

27
00:01:13,500 --> 00:01:16,860
这里的模型期望在开头有一个 CLS token,
Here the model expects a CLS token at the beginning

28
00:01:16,860 --> 00:01:19,743
并在要分类的句子末尾有一个 SEP token。
and a SEP token at the end of the sentence to classify.

29
00:01:20,580 --> 00:01:24,180
最后,分词器将每个 token 匹配到它在
Lastly, the tokenizer matches each token to its unique ID

30
00:01:24,180 --> 00:01:27,000
预训练模型词汇表中的唯一 ID。
in the vocabulary of the pretrained model.

31
00:01:27,000 --> 00:01:28,680
要加载这样的分词器,
To load such a tokenizer,

32
00:01:28,680 --> 00:01:31,743
Transformers 库提供了 AutoTokenizer API。
the Transformers library provides the AutoTokenizer API.

33
00:01:32,730 --> 00:01:36,120
这个类最重要的方法是 from_pretrained,
The most important method of this class is from_pretrained,

34
00:01:36,120 --> 00:01:38,910
它会下载并缓存配置,
which will download and cache the configuration

35
00:01:38,910 --> 00:01:41,853
以及与给定检查点相关联的词汇表。
and the vocabulary associated to a given checkpoint.
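下面是上述分词步骤的一个最小草图(非视频原始代码;假设已安装 transformers 和 torch,两个示例句子仅作占位):
A minimal sketch of the tokenization step described above (not the video's exact code; it assumes transformers and torch are installed, and the two sentences are illustrative placeholders):

from transformers import AutoTokenizer

# 本视频提到的默认 checkpoint / the default checkpoint named in this video
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
# padding=True 填充较短的句子 / pads the shorter sentence
# truncation=True 截断超过模型最大长度的句子 / truncates sentences longer than the model's maximum
# return_tensors="pt" 返回 PyTorch 张量 / returns PyTorch tensors
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)  # 字典,包含 input_ids 和 attention_mask / a dict with input_ids and attention_mask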
36
00:01:43,200 --> 00:01:45,360
这里默认使用的 checkpoint
Here the checkpoint used by default

37
00:01:45,360 --> 00:01:47,280
用于情绪分析的 pipeline
for the sentiment analysis pipeline

38
00:01:47,280 --> 00:01:51,986
是 distilbert-base-uncased-finetuned-sst-2-english。
is distilbert-base-uncased-finetuned-sst-2-english.

39
00:01:51,986 --> 00:01:53,700
(模糊)
(indistinct)

40
00:01:53,700 --> 00:01:56,490
我们实例化一个与该检查点关联的分词器,
We instantiate a tokenizer associated with that checkpoint,

41
00:01:56,490 --> 00:01:59,490
然后将这两个句子输入给它。
then feed it the two sentences.

42
00:01:59,490 --> 00:02:02,100
由于这两个句子的长度不同,
Since those two sentences are not of the same size,

43
00:02:02,100 --> 00:02:03,930
我们需要填充较短的那个,
we will need to pad the shortest one

44
00:02:03,930 --> 00:02:06,030
才能构建出一个数组。
to be able to build an array.

45
00:02:06,030 --> 00:02:09,840
这是由分词器通过 padding=True 选项完成的。
This is done by the tokenizer with the option padding=True.

46
00:02:09,840 --> 00:02:12,810
使用 truncation=True,我们确保任何句子
With truncation=True, we ensure that any sentence

47
00:02:12,810 --> 00:02:15,873
超过模型所能处理的最大长度时都会被截断。
longer than the maximum the model can handle is truncated.

48
00:02:17,010 --> 00:02:19,620
最后,return_tensors 选项
Lastly, the return_tensors option

49
00:02:19,620 --> 00:02:22,323
告诉分词器返回一个 PyTorch 张量。
tells the tokenizer to return a PyTorch tensor.

50
00:02:23,190 --> 00:02:25,590
查看结果,我们看到我们得到了一个字典,
Looking at the result, we see we have a dictionary

51
00:02:25,590 --> 00:02:26,670
其中包含两个键。
with two keys.

52
00:02:26,670 --> 00:02:29,970
输入 ID 包含两个句子的 ID,
Input IDs contains the IDs of both sentences,

53
00:02:29,970 --> 00:02:32,550
在应用填充的位置值为零。
with zero where the padding is applied.

54
00:02:32,550 --> 00:02:34,260
第二个键,注意力掩码,
The second key, attention mask,

55
00:02:34,260 --> 00:02:36,150
指示填充被应用在哪些位置,
indicates where padding has been applied,

56
00:02:36,150 --> 00:02:38,940
这样模型就不会去关注它。
so the model does not pay attention to it.

57
00:02:38,940 --> 00:02:42,090
这就是分词化步骤中的全部内容。
This is all that is inside the tokenization step.

58
00:02:42,090 --> 00:02:46,289
现在,让我们来看看第二步,模型。
Now, let's have a look at the second step, the model.

59
00:02:46,289 --> 00:02:47,952
和分词器一样,
As for the tokenizer,

60
00:02:47,952 --> 00:02:51,133
有一个带有 from_pretrained 方法的 AutoModel API。
there is an AutoModel API with a from_pretrained method.

61
00:02:51,133 --> 00:02:53,954
它会下载并缓存模型的配置,
It will download and cache the configuration of the model

62
00:02:53,954 --> 00:02:56,280
以及预训练的权重。
as well as the pretrained weights.

63
00:02:56,280 --> 00:02:58,200
然而,AutoModel API
However, the AutoModel API

64
00:02:58,200 --> 00:03:00,630
只会实例化模型的主体,
will only instantiate the body of the model,

65
00:03:00,630 --> 00:03:03,420
也就是模型中剩下的那部分,
that is the part of the model that is left

66
00:03:03,420 --> 00:03:06,090
即移除预训练头之后的部分。
once the pretraining head is removed.

67
00:03:06,090 --> 00:03:08,610
它会输出一个高维张量,
It will output a high-dimensional tensor

68
00:03:08,610 --> 00:03:11,220
它是所传入句子的一种表示,
that is a representation of the sentences passed,

69
00:03:11,220 --> 00:03:12,690
但它并不能直接用于
but which is not directly useful

70
00:03:12,690 --> 00:03:15,030
我们的分类问题。
for our classification problem.

71
00:03:15,030 --> 00:03:19,230
这里的张量包含两个句子,每个句子有 16 个 token,
Here the tensor has two sentences, each of 16 tokens,

72
00:03:19,230 --> 00:03:23,433
最后一个维度是我们模型的隐藏大小,768。
and the last dimension is the hidden size of our model, 768.
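下面是模型这一步的一个最小草图,对比 AutoModel 与 AutoModelForSequenceClassification(沿用上面分词草图中的 inputs 变量;非视频原始代码):
A minimal sketch of the model step, contrasting AutoModel and AutoModelForSequenceClassification (reusing the inputs dict from the tokenization sketch above; not the video's exact code):

import torch
from transformers import AutoModel, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

# 只有模型主体,输出隐藏状态 / body only: outputs hidden states
# 形状为 (batch_size, sequence_length, hidden_size),例如 (2, 16, 768)
model = AutoModel.from_pretrained(checkpoint)
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

# 带分类头:每个句子、每个可能的标签各一个 logit
# adds the classification head: one logit per sentence and per possible label
clf_model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
with torch.no_grad():
    clf_outputs = clf_model(**inputs)
print(clf_outputs.logits.shape)  # torch.Size([2, 2])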
73
00:03:24,900 --> 00:03:27,510
要获得与我们的分类问题相关的输出,
To get an output linked to our classification problem,

74
00:03:27,510 --> 00:03:31,170
我们需要使用 AutoModelForSequenceClassification 类。
we need to use the AutoModelForSequenceClassification class.

75
00:03:31,170 --> 00:03:33,330
它的工作方式与 AutoModel 类完全相同,
It works exactly as the AutoModel class,

76
00:03:33,330 --> 00:03:35,130
不同之处在于它构建的模型
except that it will build a model

77
00:03:35,130 --> 00:03:36,543
带有一个分类头。
with a classification head.

78
00:03:37,483 --> 00:03:39,560
每个常见的 NLP 任务在 Transformers 库中
There is one auto class for each common NLP task

79
00:03:39,560 --> 00:03:40,960
都有一个对应的自动类。
in the Transformers library.

80
00:03:42,150 --> 00:03:45,570
这里,在把两个句子交给我们的模型之后,
Here after giving our model the two sentences,

81
00:03:45,570 --> 00:03:47,820
我们得到一个大小为二乘二的张量,
we get a tensor of size two by two,

82
00:03:47,820 --> 00:03:50,943
每个句子和每个可能的标签各对应一个结果。
one result for each sentence and for each possible label.

83
00:03:51,840 --> 00:03:53,970
这些输出还不是概率,
Those outputs are not probabilities yet,

84
00:03:53,970 --> 00:03:56,100
我们可以看到它们的总和不为 1。
we can see they don't sum to 1.

85
00:03:56,100 --> 00:03:57,270
这是因为 Transformers 库中的
This is because each model

86
00:03:57,270 --> 00:04:00,810
每个模型都会返回 logits。
of the Transformers library returns logits.

87
00:04:00,810 --> 00:04:02,250
为了理解这些 logits,
To make sense of those logits,

88
00:04:02,250 --> 00:04:05,910
我们需要深入研究 pipeline 的第三步,也是最后一步:
we need to dig into the third and last step of the pipeline,

89
00:04:05,910 --> 00:04:10,620
后处理。要将 logits 转换为概率,
post-processing. To convert logits into probabilities,

90
00:04:10,620 --> 00:04:13,470
我们需要对它们应用一个 SoftMax 层。
we need to apply a SoftMax layer to them.

91
00:04:13,470 --> 00:04:14,610
正如我们所见,
As we can see,

92
00:04:14,610 --> 00:04:17,267
这会将它们转换为正数,
this transforms them into positive numbers

93
00:04:17,267 --> 00:04:18,663
且总和为 1。
that sum up to one.

94
00:04:18,663 --> 00:04:21,360
最后一步是知道其中哪个对应
The last step is to know which of those corresponds

95
00:04:21,360 --> 00:04:23,580
正面或负面的标签。
to the positive or the negative label.

96
00:04:23,580 --> 00:04:28,020
这是由模型配置的 id2label 字段给出的。
This is given by the id2label field of the model config.

97
00:04:28,020 --> 00:04:30,390
第一个概率,索引 0,
The first probabilities, index zero,

98
00:04:30,390 --> 00:04:32,250
对应负面标签,
correspond to the negative label,

99
00:04:32,250 --> 00:04:34,140
而第二个,索引 1,
and the second ones, index one,

100
00:04:34,140 --> 00:04:36,480
对应正面标签。
correspond to the positive label.

101
00:04:36,480 --> 00:04:37,950
这就是我们用 pipeline 函数构建的分类器
This is how our classifier built

102
00:04:37,950 --> 00:04:40,230
如何选出这些标签
with the pipeline function picked those labels

103
00:04:40,230 --> 00:04:42,240
并计算出这些分数的。
and computed those scores.

104
00:04:42,240 --> 00:04:44,220
既然你已经知道每个步骤是如何工作的,
Now that you know how each step works,

105
00:04:44,220 --> 00:04:46,220
你可以轻松地根据需要调整它们。
you can easily tweak them to your needs.

106
00:04:47,524 --> 00:04:50,274
(徽标呼啸而过)
(logo whooshing)
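下面是后处理这一步的一个最小草图(沿用上面草图中的 clf_model 和 clf_outputs;id2label 的具体内容取决于所用 checkpoint 的配置,注释中的映射仅为假设):
A minimal sketch of the post-processing step (reusing clf_model and clf_outputs from the sketch above; the exact id2label mapping depends on the checkpoint's config, and the mapping shown in the comment is an assumption):

import torch

# SoftMax 将 logits 转换为总和为 1 的概率 / converts the logits into probabilities that sum to 1
predictions = torch.nn.functional.softmax(clf_outputs.logits, dim=-1)
print(predictions)

# id2label 把每一列的索引映射为可读标签 / maps each column index to a human-readable label,
# 对这个 checkpoint 预期类似 {0: "NEGATIVE", 1: "POSITIVE"}
print(clf_model.config.id2label)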