1
00:00:00,076 --> 00:00:01,462
(title whooshes)

2
00:00:01,462 --> 00:00:02,382
(logo pops)

3
00:00:02,382 --> 00:00:05,340
(title whooshes)

4
00:00:05,340 --> 00:00:06,210
- Let's have a look

5
00:00:06,210 --> 00:00:08,283
inside the token classification pipeline.

6
00:00:10,080 --> 00:00:11,580
In the pipeline video,

7
00:00:11,580 --> 00:00:13,320
we looked at the different applications

8
00:00:13,320 --> 00:00:15,960
the Transformers library
supports out of the box,

9
00:00:15,960 --> 00:00:18,780
one of them being token classification,

10
00:00:18,780 --> 00:00:21,810
for instance predicting
for each word in a sentence

11
00:00:21,810 --> 00:00:24,510
whether they correspond to
a person, an organization

12
00:00:24,510 --> 00:00:25,353
or a location.

13
00:00:26,670 --> 00:00:28,920
We can even group together
the tokens corresponding

14
00:00:28,920 --> 00:00:32,040
to the same entity, for
instance all the tokens

15
00:00:32,040 --> 00:00:35,373
that formed the word Sylvain
here, or Hugging and Face.

16
00:00:37,290 --> 00:00:40,230
The token classification
pipeline works the same way

17
00:00:40,230 --> 00:00:42,630
as the text classification
pipeline we studied

18
00:00:42,630 --> 00:00:44,430
in the previous video.

19
00:00:44,430 --> 00:00:45,930
There are three steps.

20
00:00:45,930 --> 00:00:49,623
The tokenization, the model,
and the postprocessing.

21
00:00:50,940 --> 00:00:52,530
The first two steps are identical

22
00:00:52,530 --> 00:00:54,630
to the text classification pipeline,

23
00:00:54,630 --> 00:00:57,300
except we use an auto
token classification model

24
00:00:57,300 --> 00:01:00,150
instead of a sequence classification one.

25
00:01:00,150 --> 00:01:03,720
We tokenize our text then
feed it to the model.

26
00:01:03,720 --> 00:01:05,877
Instead of getting one number
for each possible label

27
00:01:05,877 --> 00:01:08,700
for the whole sentence, we get one number

28
00:01:08,700 --> 00:01:10,770
for each of the possible nine labels

29
00:01:10,770 --> 00:01:13,983
for every token in the sentence, here 19.

30
00:01:15,300 --> 00:01:18,090
Like all the other models
of the Transformers library,

31
00:01:18,090 --> 00:01:19,830
our model outputs logits,

32
00:01:19,830 --> 00:01:23,073
which we turn into predictions
by using a SoftMax.

33
00:01:23,940 --> 00:01:26,190
We also get the predicted
label for each token

34
00:01:26,190 --> 00:01:27,990
by taking the maximum prediction,

35
00:01:27,990 --> 00:01:29,880
since the SoftMax function
preserves the orders,

36
00:01:29,880 --> 00:01:31,200
we could have done it on the logits

37
00:01:31,200 --> 00:01:33,050
if we had no need of the predictions.

38
00:01:33,930 --> 00:01:35,880
The model config contains
the label mapping

39
00:01:35,880 --> 00:01:37,740
in its id2label field.

40
00:01:37,740 --> 00:01:41,430
Using it, we can map every token
to its corresponding label.

41
00:01:41,430 --> 00:01:43,950
The label, O, correspond to no entity,

42
00:01:43,950 --> 00:01:45,985
which is why we didn't
see it in our results

43
00:01:45,985 --> 00:01:47,547
in the first slide.

44
00:01:47,547 --> 00:01:49,440
On top of the label and the probability,

45
00:01:49,440 --> 00:01:51,000
those results included the start

46
00:01:51,000 --> 00:01:53,103
and end character in the sentence.

47
00:01:54,120 --> 00:01:55,380
We'll need to use the offset mapping

48
00:01:55,380 --> 00:01:56,640
of the tokenizer to get those.

49
00:01:56,640 --> 00:01:58,050
Look at the video linked below

50
00:01:58,050 --> 00:02:00,300
if you don't know about them already.

51
00:02:00,300 --> 00:02:02,280
Then, looping through each token

52
00:02:02,280 --> 00:02:04,080
that has a label distinct from O,

53
00:02:04,080 --> 00:02:06,120
we can build the list of results we got

54
00:02:06,120 --> 00:02:07,320
with our first pipeline.

55
00:02:08,460 --> 00:02:10,560
The last step is to group together tokens

56
00:02:10,560 --> 00:02:12,310
that correspond to the same entity.

57
00:02:13,264 --> 00:02:16,140
This is why we had two labels
for each type of entity,

58
00:02:16,140 --> 00:02:18,450
I-PER and B-PER, for instance.

59
00:02:18,450 --> 00:02:20,100
It allows us to know if a token is

60
00:02:20,100 --> 00:02:22,323
in the same entity as the previous one.

61
00:02:23,310 --> 00:02:25,350
Note, that there are two
ways of labeling used

62
00:02:25,350 --> 00:02:26,850
for token classification.

63
00:02:26,850 --> 00:02:29,420
One, in pink here, uses the B-PER label

64
00:02:29,420 --> 00:02:30,810
at the beginning of each new entity,

65
00:02:30,810 --> 00:02:32,760
but the other, in blue,

66
00:02:32,760 --> 00:02:35,340
only uses it to separate
two adjacent entities

67
00:02:35,340 --> 00:02:37,140
of the same type.

68
00:02:37,140 --> 00:02:39,690
In both cases, we can flag a new entity

69
00:02:39,690 --> 00:02:41,940
each time we see a new label appearing,

70
00:02:41,940 --> 00:02:44,730
either with the I or B prefix,

71
00:02:44,730 --> 00:02:47,130
then take all the following
tokens labeled the same,

72
00:02:47,130 --> 00:02:48,870
with an I-flag.

73
00:02:48,870 --> 00:02:51,330
This, coupled with the offset
mapping to get the start

74
00:02:51,330 --> 00:02:54,210
and end characters allows
us to get the span of texts

75
00:02:54,210 --> 00:02:55,233
for each entity.

76
00:02:56,569 --> 00:02:59,532
(title whooshes)

77
00:02:59,532 --> 00:03:01,134
(title fizzles)