subtitles/zh-CN/46_inside-the-token-classification-pipeline-(tensorflow).srt

1
00:00:00,180 --> 00:00:03,013
(嘶嘶声)
(whooshing sound)

2
00:00:05,310 --> 00:00:06,143
- 我们来看一下
- Let's have a look

3
00:00:06,143 --> 00:00:08,133
token 分类管线(pipeline)的内部。
inside the token classification pipeline.

4
00:00:09,780 --> 00:00:11,430
在有关管线的视频中,
In the pipeline video,

5
00:00:11,430 --> 00:00:13,230
我们了解了各种不同的应用,
we looked at the different applications

6
00:00:13,230 --> 00:00:16,050
它们都是 transformers 库开箱即用支持的。
the transformers library supports out of the box.

7
00:00:16,050 --> 00:00:18,660
其中之一就是 token 分类。
One of them being token classification.

8
00:00:18,660 --> 00:00:22,050
例如,对句子中的每个单词,
For instance, predicting for each word in a sentence,

9
00:00:22,050 --> 00:00:23,790
预测它们是否对应于人名、
whether they correspond to a person,

10
00:00:23,790 --> 00:00:26,043
组织或地点。
an organization, or a location.

11
00:00:27,690 --> 00:00:29,250
我们甚至可以把对应
We can even group together the tokens

12
00:00:29,250 --> 00:00:31,320
同一实体的 token 组合在一起。
corresponding to the same entity.

13
00:00:31,320 --> 00:00:34,890
例如,这里构成单词 Sylvain 的所有 token,
For instance, all the tokens that form the word Sylvain here,

14
00:00:34,890 --> 00:00:36,423
或者 Hugging 和 Face。
or Hugging and Face.

15
00:00:37,320 --> 00:00:39,720
因此,token 分类管线
So, the token classification pipeline

16
00:00:39,720 --> 00:00:42,480
的工作方式与我们在之前视频中
works the same way as the text classification pipeline

17
00:00:42,480 --> 00:00:44,910
学习过的文本分类管线相同。
we studied in a previous video.

18
00:00:44,910 --> 00:00:46,500
分为三个步骤:
There are three steps:

19
00:00:46,500 --> 00:00:50,043
分词(tokenization)、模型和后处理。
tokenization, the model, and the post-processing.

20
00:00:51,690 --> 00:00:53,190
前两个步骤与文本分类管线
The first two steps are identical

21
00:00:53,190 --> 00:00:55,230
完全相同,
to the text classification pipeline,

22
00:00:55,230 --> 00:00:58,230
只是我们使用的是自动 token 分类模型,
except we use an auto token classification model

23
00:00:58,230 --> 00:01:00,303
而不是序列分类模型。
instead of a sequence classification one.

24
00:01:01,560 --> 00:01:04,593
我们对文本进行分词,然后将其输入模型。
We tokenize our text, then feed it to the model.

25
00:01:05,580 --> 00:01:08,160
不同于为每个可能的标签只得到一个数字,
Instead of getting one number for each possible label

26
00:01:08,160 --> 00:01:09,600
即针对整个句子而言,
for the whole sentence,

27
00:01:09,600 --> 00:01:12,270
我们为九个可能标签中的每一个,
we get one number for each of the possible nine labels

28
00:01:12,270 --> 00:01:14,250
对句子中的每个 token 都得到一个数字。
for every token in the sentence.

29
00:01:14,250 --> 00:01:15,573
这里是 19 个 token。
Here, 19.

30
00:01:17,070 --> 00:01:19,710
与 transformers 库的所有其他模型一样,
Like all the other models of the transformers library,

31
00:01:19,710 --> 00:01:22,560
我们的模型输出 logits,我们需要
our model outputs logits, which we need to turn

32
00:01:22,560 --> 00:01:24,663
用 softmax 将其转换为预测概率。
into predictions by using a softmax.

33
00:01:25,830 --> 00:01:28,170
我们还为每个 token 得到预测标签,
We also get the predicted label for each token

34
00:01:28,170 --> 00:01:30,063
方法是取预测值最大的那个。
by taking the maximum prediction.

35
00:01:31,080 --> 00:01:33,540
由于 softmax 函数保留大小顺序,
Since the softmax function preserves the order,

36
00:01:33,540 --> 00:01:34,980
如果我们不需要这些概率,
we could have done it on the logits

37
00:01:34,980 --> 00:01:36,830
我们本可以直接在 logits 上取最大值。
if we had no need of the predictions.

38
00:01:37,680 --> 00:01:40,050
模型的 config 中包含标签映射,
The model config contains the label mapping

39
00:01:40,050 --> 00:01:42,090
位于其 id2label 字段中。
in its id2label field.

40
00:01:42,090 --> 00:01:45,600
使用它,我们可以把每个 token 映射到对应的标签。
Using it, we can map every token to its corresponding label.
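The steps described so far (tokenization, the model, softmax, argmax, and the id2label mapping) can be sketched in a few lines of TensorFlow code. A minimal sketch, assuming the dbmdz/bert-large-cased-finetuned-conll03-english checkpoint that the NER pipeline loads by default:

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForTokenClassification

checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForTokenClassification.from_pretrained(checkpoint)

example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
inputs = tokenizer(example, return_tensors="tf")
outputs = model(**inputs)

# One row of nine logits per token; a softmax turns them into probabilities.
probabilities = tf.math.softmax(outputs.logits, axis=-1)[0].numpy().tolist()
# Taking the maximum gives the predicted label index for each token.
predictions = tf.math.argmax(outputs.logits, axis=-1)[0].numpy().tolist()

# The config's id2label field maps each index to a label such as I-PER or O.
print([model.config.id2label[pred] for pred in predictions])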
41
00:01:45,600 --> 00:01:48,630
标签 O 对应 "no entity"(没有实体),
The label O corresponds to "no entity,"

42
00:01:48,630 --> 00:01:50,460
这就是为什么在第一张幻灯片的结果中
which is why we didn't see it in our results

43
00:01:50,460 --> 00:01:52,110
我们没有看到它。
in the first slide.

44
00:01:52,110 --> 00:01:54,150
除了标签和概率之外,
On top of the label and the probability,

45
00:01:54,150 --> 00:01:55,620
这些结果还包括了在句中的
those results included the start

46
00:01:55,620 --> 00:01:57,423
起始和结束字符位置。
and end characters in the sentence.

47
00:01:58,294 --> 00:01:59,880
我们需要使用 tokenizer 的偏移映射
We'll need to use the offset mapping

48
00:01:59,880 --> 00:02:01,110
(offset mapping)来获取这些位置。
of the tokenizer to get those.

49
00:02:01,110 --> 00:02:03,090
如果你还不了解偏移映射,
Look at the video linked below

50
00:02:03,090 --> 00:02:05,340
请查看下方链接的视频。
if you don't know about them already.

51
00:02:05,340 --> 00:02:06,990
然后,遍历每个
Then, looping through each token

52
00:02:06,990 --> 00:02:09,090
标签不为 O 的 token,
that has a label distinct from O,

53
00:02:09,090 --> 00:02:10,590
我们就可以构建出
we can build the list of results

54
00:02:10,590 --> 00:02:12,140
与第一条管线相同的结果列表。
we got with our first pipeline.

55
00:02:13,650 --> 00:02:15,840
最后一步是把对应
The last step is to group together tokens

56
00:02:15,840 --> 00:02:17,640
同一实体的 token 组合在一起。
that correspond to the same entity.

57
00:02:18,930 --> 00:02:21,540
这就是为什么每种类型的实体都有两个标签,
This is why we had two labels for each type of entity,

58
00:02:21,540 --> 00:02:23,940
例如 I-PER 和 B-PER。
I-PER and B-PER for instance.

59
00:02:23,940 --> 00:02:25,530
它让我们知道一个 token 是否
It allows us to know if a token

60
00:02:25,530 --> 00:02:27,603
与前一个 token 属于同一个实体。
is in the same entity as the previous one.

61
00:02:28,620 --> 00:02:29,850
注意,token 分类
Note that there are two ways

62
00:02:29,850 --> 00:02:32,490
有两种标注方式。
of labeling used for token classification.

63
00:02:32,490 --> 00:02:35,360
第一种,这里用粉色表示,在每个新实体的开头
One, in pink here, uses the B-PER label

64
00:02:35,360 --> 00:02:37,530
使用 B-PER 标签。
at the beginning of each new entity.

65
00:02:37,530 --> 00:02:39,990
而另一种,这里用蓝色表示,只用它来
But the other, in blue, only uses it

66
00:02:39,990 --> 00:02:42,933
分隔两个相邻的同类型实体。
to separate two adjacent entities of the same type.

67
00:02:44,340 --> 00:02:46,560
在这两种情况下,我们都可以标记出一个新实体:
In both cases, we can flag a new entity

68
00:02:46,560 --> 00:02:49,110
每次看到新标签出现时,
each time we see a new label appearing,

69
00:02:49,110 --> 00:02:51,330
无论其前缀是 I 还是 B;
either with the I or B prefix.

70
00:02:51,330 --> 00:02:53,850
然后,把后面所有标注为相同实体
Then, take all the following tokens labeled the same,

71
00:02:53,850 --> 00:02:55,470
且带有 I 前缀的 token 归入该实体。
with an I prefix.

72
00:02:55,470 --> 00:02:57,000
这与偏移映射相结合
This, coupled with the offset mapping

73
00:02:57,000 --> 00:02:59,010
来获取起始和结束字符,
to get the start and end characters,

74
00:02:59,010 --> 00:03:01,560
就能让我们得到每个实体的文本跨度。
allows us to get the span of text for each entity.

75
00:03:02,869 --> 00:03:05,702
(嘶嘶声)
(whooshing sound)
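The grouping step can be sketched as follows, reusing example, tokenizer, model, probabilities, and predictions from the snippet above. Averaging the token scores into one entity score is an assumption here, one reasonable aggregation choice among others:

# return_offsets_mapping asks the fast tokenizer for each token's
# (start, end) character span in the original sentence.
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
offsets = inputs_with_offsets["offset_mapping"]

results = []
idx = 0
while idx < len(predictions):
    pred = predictions[idx]
    label = model.config.id2label[pred]
    if label != "O":
        # A non-O label, whether its prefix is I or B, flags a new entity.
        entity = label[2:]  # strip the B- or I- prefix to get PER, ORG, ...
        start, end = offsets[idx]
        scores = [probabilities[idx][pred]]
        # Take all the following tokens labeled I-<entity>:
        # they belong to the same entity.
        while (
            idx + 1 < len(predictions)
            and model.config.id2label[predictions[idx + 1]] == f"I-{entity}"
        ):
            idx += 1
            scores.append(probabilities[idx][predictions[idx]])
            _, end = offsets[idx]
        results.append(
            {
                "entity_group": entity,
                "score": sum(scores) / len(scores),
                "word": example[start:end],  # the span of text for this entity
                "start": start,
                "end": end,
            }
        )
    idx += 1

print(results)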