subtitles/zh-CN/46_inside-the-token-classification-pipeline-(tensorflow).srt
1
00:00:00,180 --> 00:00:03,013
(呼啸声)
(whooshing sound)
2
00:00:05,310 --> 00:00:06,143
- 我们来看一下
- Let's have a look
3
00:00:06,143 --> 00:00:08,133
token 分类管线(pipeline)的内部。
inside the token classification pipeline.
4
00:00:09,780 --> 00:00:11,430
在有关管线的视频中,
In the pipeline video,
5
00:00:11,430 --> 00:00:13,230
我们了解了 transformers 库
we looked at the different applications
6
00:00:13,230 --> 00:00:16,050
开箱即用支持的各种应用。
the transformers library supports out of the box.
7
00:00:16,050 --> 00:00:18,660
其中之一是 token 分类。
One of them being token classification.
8
00:00:18,660 --> 00:00:22,050
例如,预测句子中的每个单词,
For instance, predicting for each word in a sentence,
9
00:00:22,050 --> 00:00:23,790
判断它们对应的是一个人、
whether they correspond to a person,
10
00:00:23,790 --> 00:00:26,043
一个组织,还是一个地点。
an organization, or location.
11
00:00:27,690 --> 00:00:29,250
我们甚至可以将对应于同一实体的
We can even group together the tokens
12
00:00:29,250 --> 00:00:31,320
token 组合在一起。
corresponding to the same entity.
13
00:00:31,320 --> 00:00:34,890
例如,这里构成单词 Sylvain 的所有 token
For instance, all the tokens that form the word Sylvain here
14
00:00:34,890 --> 00:00:36,423
或 Hugging 和 Face。
or Hugging and Face.
15
00:00:37,320 --> 00:00:39,720
因此,token 分类管线
So, the token classification pipeline
16
00:00:39,720 --> 00:00:42,480
与文本分类管线的工作方式相同
works the same way as a text classification pipeline
17
00:00:42,480 --> 00:00:44,910
我们在之前的视频中学习过。
we studied in a previous video.
18
00:00:44,910 --> 00:00:46,500
分为三个步骤。
There are three steps.
19
00:00:46,500 --> 00:00:50,043
分词(tokenization)、模型和后处理。
Tokenization, the model, and the post-processing.
20
00:00:51,690 --> 00:00:53,190
前两个步骤与文本分类管线
The first two steps are identical
21
00:00:53,190 --> 00:00:55,230
完全相同,
to the text classification pipeline,
22
00:00:55,230 --> 00:00:58,230
区别在于我们使用的是 auto token 分类模型
except we use an auto token classification model
23
00:00:58,230 --> 00:01:00,303
而不是序列分类模型。
instead of a sequence classification one.
24
00:01:01,560 --> 00:01:04,593
我们对文本进行分词,然后将其输入模型。
We tokenize our text, then feed it to the model.
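A minimal sketch of these first two steps in TensorFlow (the exact checkpoint name is an assumption; it is the usual default NER model):

from transformers import AutoTokenizer, TFAutoModelForTokenClassification

checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForTokenClassification.from_pretrained(checkpoint)

example = "My name is Sylvain and I work at Hugging Face in Brooklyn."  # assumed example
inputs = tokenizer(example, return_tensors="tf")
outputs = model(**inputs)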
25
00:01:05,580 --> 00:01:08,160
我们不是为整个句子的每个可能标签
Instead of getting one number for each possible label
26
00:01:08,160 --> 00:01:09,600
得到一个数字,
for the whole sentence,
27
00:01:09,600 --> 00:01:12,270
而是为九个可能标签中的每一个,
we get one number for each of the nine possible labels
28
00:01:12,270 --> 00:01:14,250
在句子的每个 token 上各得到一个数字。
for every token in the sentence.
29
00:01:14,250 --> 00:01:15,573
这里是 19 个。
Here, 19.
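Continuing the sketch above, the per-token output shape can be checked like this (the 19 comes from this particular example sentence):

# One row per token, one score per label: (batch, tokens, labels)
print(outputs.logits.shape)  # e.g. (1, 19, 9)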
30
00:01:17,070 --> 00:01:19,710
与 transformers 库的所有其他模型一样,
Like all the other models of the transformers library,
31
00:01:19,710 --> 00:01:22,560
我们的模型输出的是 logits,我们需要
our model outputs logits which we need to turn
32
00:01:22,560 --> 00:01:24,663
用 SoftMax 将其转换为预测结果。
into predictions by using a SoftMax.
33
00:01:25,830 --> 00:01:28,170
我们还可以得到每个 token 的预测标签,
We also get the predicted label for each token
34
00:01:28,170 --> 00:01:30,063
方法是取预测值中的最大者。
by taking the maximum prediction.
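Continuing the sketch, in TensorFlow these two operations might look like this:

import tensorflow as tf

# Turn logits into probabilities, then pick the most likely label per token.
probabilities = tf.math.softmax(outputs.logits, axis=-1)[0]
predictions = tf.math.argmax(outputs.logits, axis=-1)[0].numpy().tolist()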
35
00:01:31,080 --> 00:01:33,540
由于 softmax 函数保留顺序,
Since the softmax function preserves the order,
36
00:01:33,540 --> 00:01:34,980
我们本可以直接在 logits 上进行,
we could have done it on the logits
37
00:01:34,980 --> 00:01:36,830
如果我们不需要预测概率的话。
if we had no need of the predictions.
38
00:01:37,680 --> 00:01:40,050
模型配置在其 id2label 字段中
The model config contains the label mapping
39
00:01:40,050 --> 00:01:42,090
包含了标签映射。
in its id2label field.
40
00:01:42,090 --> 00:01:45,600
使用它,我们可以将每个 token 映射到其相应的标签。
Using it, we can map every token to its corresponding label.
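Continuing the sketch (the exact mapping depends on the checkpoint; the one shown is for the checkpoint assumed above):

print(model.config.id2label)
# e.g. {0: 'O', 1: 'B-MISC', 2: 'I-MISC', 3: 'B-PER', 4: 'I-PER',
#       5: 'B-ORG', 6: 'I-ORG', 7: 'B-LOC', 8: 'I-LOC'}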
41
00:01:45,600 --> 00:01:48,630
标签 O 对应 “no entity” (没有实体)
The label O corresponds to "no entity"
42
00:01:48,630 --> 00:01:50,460
这就是为什么在第一张幻灯片的结果中
which is why we didn't see it in our results
43
00:01:50,460 --> 00:01:52,110
我们没有看到它。
in the first slide.
44
00:01:52,110 --> 00:01:54,150
除了标签和概率之外,
On top of the label and the probability,
45
00:01:54,150 --> 00:01:55,620
这些结果还包括实体在句子中的
those results included the start
46
00:01:55,620 --> 00:01:57,423
起始和结束字符位置。
and end character in the sentence.
47
00:01:58,294 --> 00:01:59,880
我们需要使用 tokenizer 的偏移映射
We'll need to use the offset mapping
48
00:01:59,880 --> 00:02:01,110
(offset mapping)来获得这些位置。
of the tokenizer to get those.
49
00:02:01,110 --> 00:02:03,090
可以看看下面链接的视频,
Look at the video link below
50
00:02:03,090 --> 00:02:05,340
如果你还不了解它们的话。
if you don't know about them already.
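Continuing the sketch, a fast tokenizer can return those offsets directly:

# Re-tokenize with offsets so each token maps back to character positions.
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
tokens = inputs_with_offsets.tokens()
offsets = inputs_with_offsets["offset_mapping"]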
51
00:02:05,340 --> 00:02:06,990
然后,遍历每一个
Then, looping through each token
52
00:02:06,990 --> 00:02:09,090
标签不为 O 的 token,
that has a label distinct from O,
53
00:02:09,090 --> 00:02:10,590
我们就可以构建出
we can build the list of results
54
00:02:10,590 --> 00:02:12,140
用第一个管线得到的那个结果列表。
we got with our first pipeline.
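A sketch of that loop, reusing the variables defined above:

results = []
for idx, pred in enumerate(predictions):
    label = model.config.id2label[pred]
    if label != "O":  # keep only tokens that belong to an entity
        start, end = offsets[idx]
        results.append({
            "entity": label,
            "score": float(probabilities[idx][pred]),
            "word": tokens[idx],
            "start": start,
            "end": end,
        })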
55
00:02:13,650 --> 00:02:15,840
最后一步是将对应于同一实体的
The last step is to group together tokens
56
00:02:15,840 --> 00:02:17,640
token 组合在一起。
that correspond to the same entity.
57
00:02:18,930 --> 00:02:21,540
这就是为什么我们为每种类型的实体设置了两个标签,
This is why we had two labels for each type of entity,
58
00:02:21,540 --> 00:02:23,940
例如 I-PER 和 B-PER。
I-PER and B-PER for instance.
59
00:02:23,940 --> 00:02:25,530
它让我们知道一个 token 是否
It allows us to know if a token
60
00:02:25,530 --> 00:02:27,603
与前一个 token 属于同一个实体。
is in the same entity as a previous one.
61
00:02:28,620 --> 00:02:29,850
注意,token 分类所使用的
Note that there are two ways
62
00:02:29,850 --> 00:02:32,490
标注方式有两种。
of labeling used for token classification.
63
00:02:32,490 --> 00:02:35,360
一种(这里显示为粉色)使用 B-PER 标签
One, in pink here, uses the B-PER label
64
00:02:35,360 --> 00:02:37,530
标注每个新实体的开头。
at the beginning of each new entity.
65
00:02:37,530 --> 00:02:39,990
而另一种(蓝色)只用它来
But the other in blue only uses it
66
00:02:39,990 --> 00:02:42,933
分隔两个相邻的同类型实体。
to separate two adjacent entities of the same type.
67
00:02:44,340 --> 00:02:46,560
在这两种情况下,每当我们看到
In both cases we can flag a new entity
68
00:02:46,560 --> 00:02:49,110
带有 I 或 B 前缀的新标签出现时,
each time we see a new label appearing,
69
00:02:49,110 --> 00:02:51,330
我们都可以标记出一个新实体。
either with the I or B prefix.
70
00:02:51,330 --> 00:02:53,850
然后,将后面所有标签相同、
Then, take all the following tokens labeled the same
71
00:02:53,850 --> 00:02:55,470
带有 I 前缀的 token 归入其中。
with an I-flag.
72
00:02:55,470 --> 00:02:57,000
这一点,再结合用来获取
This, coupled with the offset mapping
73
00:02:57,000 --> 00:02:59,010
起始和结束字符的偏移映射,
to get the start and end characters
74
00:02:59,010 --> 00:03:01,560
就能让我们得到每个实体的文本跨度。
allows us to get the span of text for each entity.
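A sketch of that grouping step, under the same assumptions as the snippets above (it handles an entity starting with either a B- or an I- label):

results = []
idx = 0
while idx < len(predictions):
    label = model.config.id2label[predictions[idx]]
    if label == "O":
        idx += 1
        continue
    label = label[2:]  # strip the B- or I- prefix
    start, end = offsets[idx]
    scores = [float(probabilities[idx][predictions[idx]])]
    idx += 1
    # Extend the entity with the following I-<label> tokens.
    while (idx < len(predictions)
           and model.config.id2label[predictions[idx]] == f"I-{label}"):
        scores.append(float(probabilities[idx][predictions[idx]]))
        _, end = offsets[idx]
        idx += 1
    results.append({
        "entity_group": label,
        "score": sum(scores) / len(scores),  # mean score over the entity
        "word": example[start:end],          # text span via the offsets
        "start": start,
        "end": end,
    })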
75
00:03:02,869 --> 00:03:05,702
(呼啸声)
(whooshing sound)