1
00:00:00,180 --> 00:00:03,013
(whooshing sound)
2
00:00:05,310 --> 00:00:06,143
- Let's have a look
3
00:00:06,143 --> 00:00:08,133
inside the token classification pipeline.
4
00:00:09,780 --> 00:00:11,430
In the pipeline video,
5
00:00:11,430 --> 00:00:13,230
we looked at the different applications
6
00:00:13,230 --> 00:00:16,050
the Transformers library
supports out of the box.
7
00:00:16,050 --> 00:00:18,660
One of them being token classification.
8
00:00:18,660 --> 00:00:22,050
For instance, predicting
for each word in a sentence,
9
00:00:22,050 --> 00:00:23,790
whether it corresponds to a person,
10
00:00:23,790 --> 00:00:26,043
an organization, or a location.
11
00:00:27,690 --> 00:00:29,250
We can even group together the tokens
12
00:00:29,250 --> 00:00:31,320
corresponding to the same entity.
13
00:00:31,320 --> 00:00:34,890
For instance, all the tokens
that form the word Sylvain here
14
00:00:34,890 --> 00:00:36,423
or Hugging and Face.
15
00:00:37,320 --> 00:00:39,720
So, the token classification pipeline
16
00:00:39,720 --> 00:00:42,480
works the same way as the
text classification pipeline
17
00:00:42,480 --> 00:00:44,910
we studied in a previous video.
18
00:00:44,910 --> 00:00:46,500
There are three steps.
19
00:00:46,500 --> 00:00:50,043
Tokenization, the model,
and the post-processing.
20
00:00:51,690 --> 00:00:53,190
The first two steps are identical
21
00:00:53,190 --> 00:00:55,230
to the text classification pipeline,
22
00:00:55,230 --> 00:00:58,230
except we use an auto
token classification model
23
00:00:58,230 --> 00:01:00,303
instead of a sequence classification one.
24
00:01:01,560 --> 00:01:04,593
We tokenize our text,
then feed it to the model.
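As a rough sketch of these first two steps in TensorFlow, assuming the dbmdz/bert-large-cased-finetuned-conll03-english checkpoint (the usual NER default) and the example sentence used in the course, it could look like this:

from transformers import AutoTokenizer, TFAutoModelForTokenClassification

# Checkpoint and sentence assumed for illustration; any token classification
# checkpoint fine-tuned for NER would work the same way.
checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForTokenClassification.from_pretrained(checkpoint)

example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
inputs = tokenizer(example, return_tensors="tf")
outputs = model(**inputs)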
25
00:01:05,580 --> 00:01:08,160
Instead of getting one number
for each possible label
26
00:01:08,160 --> 00:01:09,600
for the whole sentence,
27
00:01:09,600 --> 00:01:12,270
we get one number for each
of the nine possible labels
28
00:01:12,270 --> 00:01:14,250
for every token in the sentence.
29
00:01:14,250 --> 00:01:15,573
Here, 19.
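With the checkpoint and sentence assumed above, the shape of the logits shows this directly:

print(outputs.logits.shape)
# (1, 19, 9): one score for each of the 9 labels, for each of the 19 tokens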
30
00:01:17,070 --> 00:01:19,710
Like all the other models
of the Transformers library,
31
00:01:19,710 --> 00:01:22,560
our model outputs logits
which we need to turn
32
00:01:22,560 --> 00:01:24,663
into probabilities by using a softmax.
33
00:01:25,830 --> 00:01:28,170
We also get the predicted
label for each token
34
00:01:28,170 --> 00:01:30,063
by taking the maximum prediction.
35
00:01:31,080 --> 00:01:33,540
Since the softmax function
preserves the order,
36
00:01:33,540 --> 00:01:34,980
we could have done it on the logits
37
00:01:34,980 --> 00:01:36,830
if we had no need of the probabilities.
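A minimal way to do both with TensorFlow, continuing from the sketch above:

import tensorflow as tf

# Softmax over the last axis turns logits into one probability per label
# per token; argmax picks the index of the most likely label for each token.
probabilities = tf.math.softmax(outputs.logits, axis=-1)[0]
predictions = tf.math.argmax(outputs.logits, axis=-1)[0]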
38
00:01:37,680 --> 00:01:40,050
The model config contains
the label mapping
39
00:01:40,050 --> 00:01:42,090
in its id2label field.
40
00:01:42,090 --> 00:01:45,600
Using it, we can map every token
to its corresponding label.
41
00:01:45,600 --> 00:01:48,630
The label O corresponds to "no entity"
42
00:01:48,630 --> 00:01:50,460
which is why we didn't
see it in our results
43
00:01:50,460 --> 00:01:52,110
in the first slide.
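Continuing the sketch, the mapping could be applied like this (the exact label set depends on the checkpoint):

# For a CoNLL-2003 NER checkpoint, id2label contains O plus the
# B-/I- variants of PER, ORG, LOC and MISC.
id2label = model.config.id2label
labels = [id2label[int(pred)] for pred in predictions]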
44
00:01:52,110 --> 00:01:54,150
On top of the label and the probability,
45
00:01:54,150 --> 00:01:55,620
those results included the start
46
00:01:55,620 --> 00:01:57,423
and end characters in the sentence.
47
00:01:58,294 --> 00:01:59,880
We'll need to use the offset mapping
48
00:01:59,880 --> 00:02:01,110
of the tokenizer to get those.
49
00:02:01,110 --> 00:02:03,090
Look at the video linked below
50
00:02:03,090 --> 00:02:05,340
if you don't know about them already.
51
00:02:05,340 --> 00:02:06,990
Then, looping through each token
52
00:02:06,990 --> 00:02:09,090
that has a label distinct from O,
53
00:02:09,090 --> 00:02:10,590
we can build the list of results
54
00:02:10,590 --> 00:02:12,140
we got with our first pipeline.
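One possible way to rebuild that list, continuing from the sketches above; the field names mirror the pipeline output but are only illustrative:

# A fast tokenizer can return, for each token, its (start, end) character span.
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
tokens = inputs_with_offsets.tokens()
offsets = inputs_with_offsets["offset_mapping"]

results = []
for idx, pred in enumerate(predictions.numpy().tolist()):
    label = model.config.id2label[pred]
    if label != "O":
        start, end = offsets[idx]
        results.append(
            {"entity": label, "score": float(probabilities[idx][pred]),
             "word": tokens[idx], "start": start, "end": end}
        )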
55
00:02:13,650 --> 00:02:15,840
The last step is to group together tokens
56
00:02:15,840 --> 00:02:17,640
that correspond to the same entity.
57
00:02:18,930 --> 00:02:21,540
This is why we had two labels
for each type of entity,
58
00:02:21,540 --> 00:02:23,940
I-PER and B-PER for instance.
59
00:02:23,940 --> 00:02:25,530
It allows us to know if a token
60
00:02:25,530 --> 00:02:27,603
is in the same entity as the previous one.
61
00:02:28,620 --> 00:02:29,850
Note that there are two ways
62
00:02:29,850 --> 00:02:32,490
of labeling used for token classification.
63
00:02:32,490 --> 00:02:35,360
One, in pink here, uses the B-PER label
64
00:02:35,360 --> 00:02:37,530
at the beginning of each new entity.
65
00:02:37,530 --> 00:02:39,990
But the other, in blue, only uses it
66
00:02:39,990 --> 00:02:42,933
to separate two adjacent
entities of the same type.
67
00:02:44,340 --> 00:02:46,560
In both cases we can flag a new entity
68
00:02:46,560 --> 00:02:49,110
each time we see a new label appearing,
69
00:02:49,110 --> 00:02:51,330
either with the I or B prefix.
70
00:02:51,330 --> 00:02:53,850
Then, take all the following
tokens labeled the same
71
00:02:53,850 --> 00:02:55,470
with an I- prefix.
72
00:02:55,470 --> 00:02:57,000
This, coupled with the offset mapping
73
00:02:57,000 --> 00:02:59,010
to get the start and end characters,
74
00:02:59,010 --> 00:03:01,560
allows us to get the span
of text for each entity.
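Here is a sketch of that grouping, continuing from the code above. It starts a new entity on any non-O label and then absorbs the following I- tokens of the same type, so it works with both labeling schemes:

import numpy as np

preds = predictions.numpy().tolist()
probs = probabilities.numpy()

entities = []
idx = 0
while idx < len(preds):
    label = model.config.id2label[preds[idx]]
    if label == "O":
        idx += 1
        continue
    entity_type = label[2:]  # strip the B- or I- prefix
    start, end = offsets[idx]
    scores = [probs[idx][preds[idx]]]
    idx += 1
    # Absorb the following tokens labeled I-<same type>.
    while idx < len(preds) and model.config.id2label[preds[idx]] == f"I-{entity_type}":
        scores.append(probs[idx][preds[idx]])
        _, end = offsets[idx]
        idx += 1
    entities.append(
        {"entity_group": entity_type, "score": float(np.mean(scores)),
         "word": example[start:end], "start": start, "end": end}
    )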
75
00:03:02,869 --> 00:03:05,702
(whooshing sound)