1
00:00:00,076 --> 00:00:01,462
(title whooshes)
2
00:00:01,462 --> 00:00:02,382
(logo pops)
3
00:00:02,382 --> 00:00:05,340
(title whooshes)
4
00:00:05,340 --> 00:00:06,210
- Let's have a look
5
00:00:06,210 --> 00:00:08,283
inside the token classification pipeline.
6
00:00:10,080 --> 00:00:11,580
In the pipeline video,
7
00:00:11,580 --> 00:00:13,320
we looked at the different applications
8
00:00:13,320 --> 00:00:15,960
the Transformers library
supports out of the box,
9
00:00:15,960 --> 00:00:18,780
one of them being token classification,
10
00:00:18,780 --> 00:00:21,810
for instance predicting
for each word in a sentence
11
00:00:21,810 --> 00:00:24,510
whether it corresponds to
a person, an organization
12
00:00:24,510 --> 00:00:25,353
or a location.
13
00:00:26,670 --> 00:00:28,920
We can even group together
the tokens corresponding
14
00:00:28,920 --> 00:00:32,040
to the same entity, for
instance all the tokens
15
00:00:32,040 --> 00:00:35,373
that form the word Sylvain
here, or Hugging and Face.
16
00:00:37,290 --> 00:00:40,230
The token classification
pipeline works the same way
17
00:00:40,230 --> 00:00:42,630
as the text classification
pipeline we studied
18
00:00:42,630 --> 00:00:44,430
in the previous video.
19
00:00:44,430 --> 00:00:45,930
There are three steps:
20
00:00:45,930 --> 00:00:49,623
tokenization, the model,
and the postprocessing.
21
00:00:50,940 --> 00:00:52,530
The first two steps are identical
22
00:00:52,530 --> 00:00:54,630
to the text classification pipeline,
23
00:00:54,630 --> 00:00:57,300
except we use an
AutoModelForTokenClassification model
24
00:00:57,300 --> 00:01:00,150
instead of a sequence classification one.
25
00:01:00,150 --> 00:01:03,720
We tokenize our text then
feed it to the model.
26
00:01:03,720 --> 00:01:05,877
Instead of getting one number
for each possible label
27
00:01:05,877 --> 00:01:08,700
for the whole sentence, we get one number
28
00:01:08,700 --> 00:01:10,770
for each of the possible nine labels
29
00:01:10,770 --> 00:01:13,983
for every token in the sentence, here 19.
30
00:01:15,300 --> 00:01:18,090
Like all the other models
of the Transformers library,
31
00:01:18,090 --> 00:01:19,830
our model outputs logits,
32
00:01:19,830 --> 00:01:23,073
which we turn into predictions
by applying a softmax.
33
00:01:23,940 --> 00:01:26,190
We also get the predicted
label for each token
34
00:01:26,190 --> 00:01:27,990
by taking the maximum prediction.
35
00:01:27,990 --> 00:01:29,880
Since the softmax function
preserves ordering,
36
00:01:29,880 --> 00:01:31,200
we could have done it on the logits
37
00:01:31,200 --> 00:01:33,050
if we had no need of the predictions.
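As a rough illustration of the point above, here is a minimal pure-Python sketch showing that taking the argmax of the logits gives the same labels as taking the argmax of the softmax output. The logits are made up, and the helper names `softmax` and `argmax` are hypothetical; with PyTorch tensors one would more likely call `torch.nn.functional.softmax(logits, dim=-1)` and `logits.argmax(dim=-1)`.

```python
import math

def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Return the index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

# Made-up logits: 3 tokens with 4 labels each
# (the example in the video has 19 tokens and 9 labels).
logits = [
    [2.0, -1.0, 0.5, 0.1],
    [-0.3, 3.2, 0.0, 1.1],
    [0.2, 0.1, -2.0, 4.0],
]

probabilities = [softmax(row) for row in logits]

# Softmax is monotonic, so both routes pick the same label per token.
labels_from_probs = [argmax(row) for row in probabilities]
labels_from_logits = [argmax(row) for row in logits]
```

The softmax is only needed when the probability itself is part of the result, as it is in the pipeline output.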
38
00:01:33,930 --> 00:01:35,880
The model config contains
the label mapping
39
00:01:35,880 --> 00:01:37,740
in its id2label field.
40
00:01:37,740 --> 00:01:41,430
Using it, we can map every token
to its corresponding label.
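The mapping step can be sketched as follows. The `id2label` dictionary here is a hypothetical nine-label mapping written out by hand in the style of a CoNLL-2003 NER model, and the predicted ids are made up; in practice the mapping would be read from `model.config.id2label`.

```python
# Hypothetical label mapping (assumption for illustration);
# a real one would come from model.config.id2label.
id2label = {
    0: "O",
    1: "B-MISC", 2: "I-MISC",
    3: "B-PER", 4: "I-PER",
    5: "B-ORG", 6: "I-ORG",
    7: "B-LOC", 8: "I-LOC",
}

# Made-up predicted label ids, one per token.
predicted_ids = [0, 4, 4, 0, 6, 6, 0, 8]

# Map every token's predicted id to its label name.
predicted_labels = [id2label[i] for i in predicted_ids]

# Tokens labeled "O" belong to no entity and are dropped from the results.
kept = [(idx, label) for idx, label in enumerate(predicted_labels) if label != "O"]
```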
41
00:01:41,430 --> 00:01:43,950
The label O corresponds to no entity,
42
00:01:43,950 --> 00:01:45,985
which is why we didn't
see it in our results
43
00:01:45,985 --> 00:01:47,547
in the first slide.
44
00:01:47,547 --> 00:01:49,440
On top of the label and the probability,
45
00:01:49,440 --> 00:01:51,000
those results included the start
46
00:01:51,000 --> 00:01:53,103
and end character in the sentence.
47
00:01:54,120 --> 00:01:55,380
We'll need to use the offset mapping
48
00:01:55,380 --> 00:01:56,640
of the tokenizer to get those.
49
00:01:56,640 --> 00:01:58,050
Look at the video linked below
50
00:01:58,050 --> 00:02:00,300
if you don't know about it already.
51
00:02:00,300 --> 00:02:02,280
Then, looping through each token
52
00:02:02,280 --> 00:02:04,080
that has a label distinct from O,
53
00:02:04,080 --> 00:02:06,120
we can build the list of results we got
54
00:02:06,120 --> 00:02:07,320
with our first pipeline.
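A minimal sketch of that loop, under some loud assumptions: the sentence, labels, and scores are made up, and a regular expression stands in for the tokenizer, producing one `(start, end)` character pair per token. With a fast tokenizer from the Transformers library, the offsets would instead come from calling the tokenizer with `return_offsets_mapping=True`.

```python
import re

sentence = "Sylvain works at Hugging Face in Brooklyn."

# Stand-in for the tokenizer's offset mapping (assumption):
# one (start, end) character span per word or punctuation mark.
offsets = [(m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", sentence)]

# Made-up per-token predictions and probabilities.
labels = ["I-PER", "O", "O", "I-ORG", "I-ORG", "O", "I-LOC", "O"]
scores = [0.99, 0.99, 0.99, 0.97, 0.98, 0.99, 0.99, 0.99]

results = []
for label, score, (start, end) in zip(labels, scores, offsets):
    if label == "O":
        continue  # skip tokens that belong to no entity
    results.append(
        {
            "entity": label,
            "score": score,
            "word": sentence[start:end],
            "start": start,
            "end": end,
        }
    )
```

Each entry carries the label, the probability, and the character span, which is enough to recover the token's text from the original sentence.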
55
00:02:08,460 --> 00:02:10,560
The last step is to group together tokens
56
00:02:10,560 --> 00:02:12,310
that correspond to the same entity.
57
00:02:13,264 --> 00:02:16,140
This is why we had two labels
for each type of entity,
58
00:02:16,140 --> 00:02:18,450
I-PER and B-PER, for instance.
59
00:02:18,450 --> 00:02:20,100
It allows us to know if a token is
60
00:02:20,100 --> 00:02:22,323
in the same entity as the previous one.
61
00:02:23,310 --> 00:02:25,350
Note that there are two
ways of labeling used
62
00:02:25,350 --> 00:02:26,850
for token classification.
63
00:02:26,850 --> 00:02:29,420
One, in pink here, uses the B-PER label
64
00:02:29,420 --> 00:02:30,810
at the beginning of each new entity,
65
00:02:30,810 --> 00:02:32,760
but the other, in blue,
66
00:02:32,760 --> 00:02:35,340
only uses it to separate
two adjacent entities
67
00:02:35,340 --> 00:02:37,140
of the same type.
68
00:02:37,140 --> 00:02:39,690
In both cases, we can flag a new entity
69
00:02:39,690 --> 00:02:41,940
each time we see a new label appearing,
70
00:02:41,940 --> 00:02:44,730
either with the I or B prefix,
71
00:02:44,730 --> 00:02:47,130
then take all the following
tokens labeled the same,
72
00:02:47,130 --> 00:02:48,870
with an I- prefix.
73
00:02:48,870 --> 00:02:51,330
This, coupled with the offset
mapping to get the start
74
00:02:51,330 --> 00:02:54,210
and end characters, allows
us to get the span of text
75
00:02:54,210 --> 00:02:55,233
for each entity.
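The grouping rule described above can be sketched as follows. This is an illustrative version, not the pipeline's actual implementation: `group_entities` is a hypothetical helper, the labels are made up, and a regular expression stands in for the tokenizer's offset mapping.

```python
import re

def group_entities(labels, offsets, text):
    """Merge consecutive tokens of one entity into a single span."""
    entities = []
    idx = 0
    while idx < len(labels):
        label = labels[idx]
        if label == "O":
            idx += 1
            continue
        # Any non-O label (B- or I- prefix) opens a new entity.
        entity_type = label[2:]
        start, end = offsets[idx]
        idx += 1
        # Absorb the following tokens labeled I-<same type>.
        while idx < len(labels) and labels[idx] == f"I-{entity_type}":
            end = offsets[idx][1]
            idx += 1
        entities.append(
            {"entity_group": entity_type, "word": text[start:end],
             "start": start, "end": end}
        )
    return entities

def word_offsets(text):
    """Stand-in for a tokenizer offset mapping (assumption)."""
    return [(m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]

text = "Sylvain works at Hugging Face in Brooklyn."
labels = ["I-PER", "O", "O", "I-ORG", "I-ORG", "O", "I-LOC", "O"]
entities = group_entities(labels, word_offsets(text), text)

# Two adjacent people: a B- label separates them in both labeling schemes.
pair = "Alice Bob"
strict = group_entities(["B-PER", "B-PER"], word_offsets(pair), pair)
loose = group_entities(["I-PER", "B-PER"], word_offsets(pair), pair)
```

Because a new entity starts at any non-O label and only I- labels of the same type extend it, the same loop handles both labeling schemes.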
76
00:02:56,569 --> 00:02:59,532
(title whooshes)
77
00:02:59,532 --> 00:03:01,134
(title fizzles)