subtitles/zh-CN/45_inside-the-token-classification-pipeline-(pytorch).srt

1
00:00:00,076 --> 00:00:01,462
(title whooshes)

2
00:00:01,462 --> 00:00:02,382
(logo pops)

3
00:00:02,382 --> 00:00:05,340
(title whooshes)

4
00:00:05,340 --> 00:00:06,210
- Let's have a look

5
00:00:06,210 --> 00:00:08,283
inside the token classification pipeline.

6
00:00:10,080 --> 00:00:11,580
In the pipeline video,

7
00:00:11,580 --> 00:00:13,320
we looked at the different applications

8
00:00:13,320 --> 00:00:15,960
the Transformers library supports out of the box,

9
00:00:15,960 --> 00:00:18,780
one of them being token classification,

10
00:00:18,780 --> 00:00:21,810
for instance predicting for each word in a sentence

11
00:00:21,810 --> 00:00:24,510
whether it corresponds to a person, an organization

12
00:00:24,510 --> 00:00:25,353
or a location.

13
00:00:26,670 --> 00:00:28,920
We can even group together the tokens corresponding

14
00:00:28,920 --> 00:00:32,040
to the same entity, for instance all the tokens

15
00:00:32,040 --> 00:00:35,373
that formed the word Sylvain here, or Hugging and Face.

16
00:00:37,290 --> 00:00:40,230
The token classification pipeline works the same way

17
00:00:40,230 --> 00:00:42,630
as the text classification pipeline we studied

18
00:00:42,630 --> 00:00:44,430
in the previous video.

19
00:00:44,430 --> 00:00:45,930
There are three steps:

20
00:00:45,930 --> 00:00:49,623
the tokenization, the model, and the postprocessing.

21
00:00:50,940 --> 00:00:52,530
The first two steps are identical

22
00:00:52,530 --> 00:00:54,630
to the text classification pipeline,

23
00:00:54,630 --> 00:00:57,300
except we use an auto token classification model

24
00:00:57,300 --> 00:01:00,150
instead of a sequence classification one.

25
00:01:00,150 --> 00:01:03,720
We tokenize our text, then feed it to the model.

26
00:01:03,720 --> 00:01:05,877
Instead of getting one number for each possible label

27
00:01:05,877 --> 00:01:08,700
for the whole sentence, we get one number

28
00:01:08,700 --> 00:01:10,770
for each of the possible nine labels

29
00:01:10,770 --> 00:01:13,983
for every token in the sentence, here 19.

30
00:01:15,300 --> 00:01:18,090
Like all the other models of the Transformers library,

31
00:01:18,090 --> 00:01:19,830
our model outputs logits,

32
00:01:19,830 --> 00:01:23,073
which we turn into predictions by using a SoftMax.

33
00:01:23,940 --> 00:01:26,190
We also get the predicted label for each token

34
00:01:26,190 --> 00:01:27,990
by taking the maximum prediction;

35
00:01:27,990 --> 00:01:29,880
since the SoftMax function preserves the order,

36
00:01:29,880 --> 00:01:31,200
we could have done it on the logits

37
00:01:31,200 --> 00:01:33,050
if we had no need of the predictions.

38
00:01:33,930 --> 00:01:35,880
The model config contains the label mapping

39
00:01:35,880 --> 00:01:37,740
in its id2label field.

40
00:01:37,740 --> 00:01:41,430
Using it, we can map every token to its corresponding label.
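[Editor's note: a minimal sketch of the steps described so far. It assumes the dbmdz/bert-large-cased-finetuned-conll03-english checkpoint (a common NER model) and an example sentence chosen to match the video; both are illustrative, not the only possible choices.]

```python
# Sketch of the tokenization and model steps, plus the SoftMax/argmax
# postprocessing; checkpoint and example sentence are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint)

example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
inputs = tokenizer(example, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One set of nine logits per token: here a [1, 19, 9] tensor.
print(outputs.logits.shape)

# SoftMax turns the logits into probabilities; argmax picks the most
# likely label id, and the config's id2label maps it back to a name.
probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)[0]
predictions = outputs.logits.argmax(dim=-1)[0].tolist()
print([model.config.id2label[pred] for pred in predictions])
```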
41
00:01:41,430 --> 00:01:43,950
The label O corresponds to no entity,

42
00:01:43,950 --> 00:01:45,985
which is why we didn't see it in our results

43
00:01:45,985 --> 00:01:47,547
in the first slide.

44
00:01:47,547 --> 00:01:49,440
On top of the label and the probability,

45
00:01:49,440 --> 00:01:51,000
those results included the start

46
00:01:51,000 --> 00:01:53,103
and end characters in the sentence.

47
00:01:54,120 --> 00:01:55,380
We'll need to use the offset mapping

48
00:01:55,380 --> 00:01:56,640
of the tokenizer to get those.

49
00:01:56,640 --> 00:01:58,050
Look at the video linked below

50
00:01:58,050 --> 00:02:00,300
if you don't know about them already.

51
00:02:00,300 --> 00:02:02,280
Then, looping through each token

52
00:02:02,280 --> 00:02:04,080
that has a label distinct from O,

53
00:02:04,080 --> 00:02:06,120
we can build the list of results we got

54
00:02:06,120 --> 00:02:07,320
with our first pipeline.

55
00:02:08,460 --> 00:02:10,560
The last step is to group together tokens

56
00:02:10,560 --> 00:02:12,310
that correspond to the same entity.

57
00:02:13,264 --> 00:02:16,140
This is why we had two labels for each type of entity,

58
00:02:16,140 --> 00:02:18,450
I-PER and B-PER, for instance.

59
00:02:18,450 --> 00:02:20,100
It allows us to know if a token is

60
00:02:20,100 --> 00:02:22,323
in the same entity as the previous one.

61
00:02:23,310 --> 00:02:25,350
Note that there are two ways of labeling used

62
00:02:25,350 --> 00:02:26,850
for token classification.

63
00:02:26,850 --> 00:02:29,420
One, in pink here, uses the B-PER label

64
00:02:29,420 --> 00:02:30,810
at the beginning of each new entity,

65
00:02:30,810 --> 00:02:32,760
but the other, in blue,

66
00:02:32,760 --> 00:02:35,340
only uses it to separate two adjacent entities

67
00:02:35,340 --> 00:02:37,140
of the same type.

68
00:02:37,140 --> 00:02:39,690
In both cases, we can flag a new entity

69
00:02:39,690 --> 00:02:41,940
each time we see a new label appearing,

70
00:02:41,940 --> 00:02:44,730
either with the I or B prefix,

71
00:02:44,730 --> 00:02:47,130
then take all the following tokens labeled the same,

72
00:02:47,130 --> 00:02:48,870
with an I-flag.

73
00:02:48,870 --> 00:02:51,330
This, coupled with the offset mapping to get the start

74
00:02:51,330 --> 00:02:54,210
and end characters, allows us to get the span of text

75
00:02:54,210 --> 00:02:55,233
for each entity.

76
00:02:56,569 --> 00:02:59,532
(title whooshes)

77
00:02:59,532 --> 00:03:01,134
(title fizzles)
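[Editor's note: a rough sketch of the grouping step described in the last cues. It reuses example, tokenizer, model, probabilities and predictions from the earlier snippet, relies on a fast tokenizer's offset mapping, and assumes B-/I- prefixed labels (B-PER, I-PER, ...); the real pipeline's aggregation options differ in some details.]

```python
# Sketch of the postprocessing: walk over the tokens, skip the O label,
# and merge each run of I-<entity> tokens into a single grouped result.
# Reuses example, tokenizer, model, probabilities, predictions from above.
import numpy as np

offsets = tokenizer(example, return_offsets_mapping=True)["offset_mapping"]

results = []
idx = 0
while idx < len(predictions):
    label = model.config.id2label[predictions[idx]]
    if label == "O":
        idx += 1
        continue

    entity = label[2:]  # strip the B- or I- prefix
    start, end = offsets[idx]
    scores = [probabilities[idx][predictions[idx]].item()]
    idx += 1

    # Following tokens labeled I-<entity> belong to the same entity;
    # a B-<entity> label would start a new one instead.
    while (
        idx < len(predictions)
        and model.config.id2label[predictions[idx]] == f"I-{entity}"
    ):
        scores.append(probabilities[idx][predictions[idx]].item())
        _, end = offsets[idx]
        idx += 1

    results.append(
        {
            "entity_group": entity,
            "score": float(np.mean(scores)),
            "word": example[start:end],  # text span recovered via offsets
            "start": start,
            "end": end,
        }
    )

print(results)
```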