subtitles/zh-CN/45_inside-the-token-classification-pipeline-(pytorch).srt
1
00:00:00,076 --> 00:00:01,462
(标题嘶嘶作响)
(title whooshes)
2
00:00:01,462 --> 00:00:02,382
(标志弹出)
(logo pops)
3
00:00:02,382 --> 00:00:05,340
(标题嘶嘶作响)
(title whooshes)
4
00:00:05,340 --> 00:00:06,210
- 我们来看一下
- Let's have a look
5
00:00:06,210 --> 00:00:08,283
token 分类管线（pipeline）的内部。
inside the token classification pipeline.
6
00:00:10,080 --> 00:00:11,580
在有关管线的视频中，
In the pipeline video,
7
00:00:11,580 --> 00:00:13,320
我们了解了 Transformers 库
we looked at the different applications
8
00:00:13,320 --> 00:00:15,960
开箱即用支持的各种应用，
the Transformers library supports out of the box,
9
00:00:15,960 --> 00:00:18,780
其中之一是 token 分类,
one of them being token classification,
10
00:00:18,780 --> 00:00:21,810
例如预测句子中的每个单词
for instance predicting for each word in a sentence
11
00:00:21,810 --> 00:00:24,510
是否对应于一个人,一个组织
whether they correspond to a person, an organization
12
00:00:24,510 --> 00:00:25,353
或一个地点。
or a location.
13
00:00:26,670 --> 00:00:28,920
我们甚至可以将对应于同一实体的
We can even group together the tokens corresponding
14
00:00:28,920 --> 00:00:32,040
token 组合在一起，例如这里
to the same entity, for instance all the tokens
15
00:00:32,040 --> 00:00:35,373
组成 Sylvain 这个词的所有 token，或 Hugging 和 Face。
that formed the word Sylvain here, or Hugging and Face.
16
00:00:37,290 --> 00:00:40,230
token 分类管线的工作方式
The token classification pipeline works the same way
17
00:00:40,230 --> 00:00:42,630
与我们在上一个视频中研究的
as the text classification pipeline we studied
18
00:00:42,630 --> 00:00:44,430
文本分类管线相同。
in the previous video.
19
00:00:44,430 --> 00:00:45,930
分为三个步骤。
There are three steps.
20
00:00:45,930 --> 00:00:49,623
分词、模型和后处理。
The tokenization, the model, and the postprocessing.
21
00:00:50,940 --> 00:00:52,530
前两个步骤与文本分类管线
The first two steps are identical
22
00:00:52,530 --> 00:00:54,630
完全相同，
to the text classification pipeline,
23
00:00:54,630 --> 00:00:57,300
只是我们使用的是 auto token 分类模型
except we use an auto token classification model
24
00:00:57,300 --> 00:01:00,150
而不是序列分类模型。
instead of a sequence classification one.
25
00:01:00,150 --> 00:01:03,720
我们对文本进行分词，然后将其输入模型。
We tokenize our text then feed it to the model.
26
00:01:03,720 --> 00:01:05,877
我们得到的不是整个句子
Instead of getting one number for each possible label
27
00:01:05,877 --> 00:01:08,700
在每个可能标签上的一个数字，
for the whole sentence, we get one number
28
00:01:08,700 --> 00:01:10,770
而是句子中每个 token（此处为 19 个）
for each of the possible nine labels
29
00:01:10,770 --> 00:01:13,983
在九个可能标签中每一个上的一个数字。
for every token in the sentence, here 19.
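A minimal sketch of these first two steps, assuming the pipeline's default NER checkpoint (dbmdz/bert-large-cased-finetuned-conll03-english) and an example sentence like the one used in the video:

```python
# Sketch of the tokenization and model steps of the token classification pipeline.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"  # assumed default checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint)

example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
inputs = tokenizer(example, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(inputs["input_ids"].shape)  # torch.Size([1, 19]) -> 19 tokens
print(outputs.logits.shape)       # torch.Size([1, 19, 9]) -> one score per label per token
```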
30
00:01:15,300 --> 00:01:18,090
与 transformers 库的所有其他模型一样,
Like all the other models of the transformers library,
31
00:01:18,090 --> 00:01:19,830
我们的模型输出 logits,
our model outputs logits,
32
00:01:19,830 --> 00:01:23,073
我们使用 SoftMax 将其转化为预测值。
which we turn into predictions by using a SoftMax.
33
00:01:23,940 --> 00:01:26,190
我们还通过取最大预测值
We also get the predicted label for each token
34
00:01:26,190 --> 00:01:27,990
得到每个 token 的预测标签，
by taking the maximum prediction,
35
00:01:27,990 --> 00:01:29,880
由于 SoftMax 函数保持顺序不变，
since the SoftMax function preserves the order,
36
00:01:29,880 --> 00:01:31,200
如果我们不需要这些预测概率，
we could have done it on the logits
37
00:01:31,200 --> 00:01:33,050
本可以直接在 logits 上取最大值。
if we had no need of the predictions.
38
00:01:33,930 --> 00:01:35,880
模型配置的 id2label 字段中
The model config contains the label mapping
39
00:01:35,880 --> 00:01:37,740
包含标签映射。
in its id2label field.
40
00:01:37,740 --> 00:01:41,430
使用它,我们可以将每个 token 映射到其相应的标签。
Using it, we can map every token to its corresponding label.
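Continuing the sketch above (reusing model and outputs), the softmax, argmax and id2label lookup could look roughly like this:

```python
import torch

# Softmax over the label dimension turns logits into per-token probabilities.
probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)[0]
# Argmax picks the most likely label index for each token
# (taking it on the logits would give the same indices).
predictions = probabilities.argmax(dim=-1).tolist()

# id2label maps each class index to a label name such as O, B-PER, I-PER, B-ORG, ...
print(model.config.id2label)
print([model.config.id2label[pred] for pred in predictions])
```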
41
00:01:41,430 --> 00:01:43,950
标签 O 不对应任何实体,
The label, O, corresponds to no entity,
42
00:01:43,950 --> 00:01:45,985
这就是为什么在第一张幻灯片的结果中
which is why we didn't see it in our results
43
00:01:45,985 --> 00:01:47,547
我们没有看到它。
in the first slide.
44
00:01:47,547 --> 00:01:49,440
除了标签和概率之外，
On top of the label and the probability,
45
00:01:49,440 --> 00:01:51,000
这些结果还包括在句子中的
those results included the start
46
00:01:51,000 --> 00:01:53,103
起始和结束字符位置。
and end character in the sentence.
47
00:01:54,120 --> 00:01:55,380
要得到这些位置，我们需要使用
We'll need to use the offset mapping
48
00:01:55,380 --> 00:01:56,640
分词器的偏移映射（offset mapping）。
of the tokenizer to get those.
49
00:01:56,640 --> 00:01:58,050
如果你还不了解偏移映射，
Look at the video linked below
50
00:01:58,050 --> 00:02:00,300
可以看看下方链接的视频。
if you don't know about them already.
51
00:02:00,300 --> 00:02:02,280
然后，遍历每个
Then, looping through each token
52
00:02:02,280 --> 00:02:04,080
标签不为 O 的 token，
that has a label distinct from O,
53
00:02:04,080 --> 00:02:06,120
我们就可以构建出与第一条管线
we can build the list of results we got
54
00:02:06,120 --> 00:02:07,320
得到的结果相同的列表。
with our first pipeline.
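A rough sketch of this step, still reusing the objects defined in the snippets above; tokenizing again with return_offsets_mapping=True exposes each token's start and end characters:

```python
# Re-tokenize to get the character offsets of each token.
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
offsets = inputs_with_offsets["offset_mapping"]

results = []
for idx, pred in enumerate(predictions):
    label = model.config.id2label[pred]
    if label != "O":
        start, end = offsets[idx]
        results.append({
            "entity": label,
            "score": probabilities[idx][pred].item(),
            "word": example[start:end],
            "start": start,
            "end": end,
        })
print(results)
```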
55
00:02:08,460 --> 00:02:10,560
最后一步是将对应于同一实体的
The last step is to group together tokens
56
00:02:10,560 --> 00:02:12,310
token 组合在一起。
that correspond to the same entity.
57
00:02:13,264 --> 00:02:16,140
这就是为什么我们为每种类型的实体设置了两个标签,
This is why we had two labels for each type of entity,
58
00:02:16,140 --> 00:02:18,450
例如 I-PER 和 B-PER。
I-PER and B-PER, for instance.
59
00:02:18,450 --> 00:02:20,100
它让我们知道一个 token 是否
It allows us to know if a token is
60
00:02:20,100 --> 00:02:22,323
与前一个 token 属于同一实体。
in the same entity as the previous one.
61
00:02:23,310 --> 00:02:25,350
请注意，用于 token 分类的
Note that there are two ways of labeling used
62
00:02:25,350 --> 00:02:26,850
标注方式有两种。
for token classification.
63
00:02:26,850 --> 00:02:29,420
一种（这里用粉色表示）在每个新实体的开头
One, in pink here, uses the B-PER label
64
00:02:29,420 --> 00:02:30,810
使用 B-PER 标签，
at the beginning of each new entity,
65
00:02:30,810 --> 00:02:32,760
而另一种（蓝色的）
but the other, in blue,
66
00:02:32,760 --> 00:02:35,340
只用它来分隔两个相邻的
only uses it to separate two adjacent entities
67
00:02:35,340 --> 00:02:37,140
同类型实体。
of the same type.
68
00:02:37,140 --> 00:02:39,690
在这两种情况下，每当我们看到
In both cases, we can flag a new entity
69
00:02:39,690 --> 00:02:41,940
带有 I 或 B 前缀的新标签出现时，
each time we see a new label appearing,
70
00:02:41,940 --> 00:02:44,730
都可以标记出一个新实体，
either with the I or B prefix,
71
00:02:44,730 --> 00:02:47,130
然后把后续所有具有相同标签
then take all the following tokens labeled the same,
72
00:02:47,130 --> 00:02:48,870
（带 I 前缀）的 token 归入其中。
with an I-flag.
73
00:02:48,870 --> 00:02:51,330
这一点再加上用偏移映射获得的
This, coupled with the offset mapping to get the start
74
00:02:51,330 --> 00:02:54,210
起始和结束字符位置，让我们能够得到
and end characters allows us to get the span of text
75
00:02:54,210 --> 00:02:55,233
每个实体对应的文本跨度。
for each entity.
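One possible sketch of that grouping logic, again reusing the variables from the snippets above: start a new entity on any non-O label, then absorb the following I-tokens of the same type:

```python
# Group consecutive tokens belonging to the same entity into one character span.
grouped_entities = []
idx = 0
while idx < len(predictions):
    label = model.config.id2label[predictions[idx]]
    if label != "O":
        entity_type = label[2:]        # strip the B- / I- prefix
        start = offsets[idx][0]
        end = offsets[idx][1]
        idx += 1
        # Absorb all following tokens labeled I-<same type>.
        while (idx < len(predictions)
               and model.config.id2label[predictions[idx]] == f"I-{entity_type}"):
            end = offsets[idx][1]
            idx += 1
        grouped_entities.append({
            "entity_group": entity_type,
            "word": example[start:end],
            "start": start,
            "end": end,
        })
    else:
        idx += 1
print(grouped_entities)  # expected spans: Sylvain (PER), Hugging Face (ORG), Brooklyn (LOC)
```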
76
00:02:56,569 --> 00:02:59,532
(标题嘶嘶作响)
(title whooshes)
77
00:02:59,532 --> 00:03:01,134
(标题音效渐弱)
(title fizzles)