subtitles/en/45_inside-the-token-classification-pipeline-(pytorch).srt

1 00:00:00,076 --> 00:00:01,462 (title whooshes) 2 00:00:01,462 --> 00:00:02,382 (logo pops) 3 00:00:02,382 --> 00:00:05,340 (title whooshes) 4 00:00:05,340 --> 00:00:06,210 - Let's have a look 5 00:00:06,210 --> 00:00:08,283 inside the token classification pipeline. 6 00:00:10,080 --> 00:00:11,580 In the pipeline video, 7 00:00:11,580 --> 00:00:13,320 we looked at the different applications 8 00:00:13,320 --> 00:00:15,960 the Transformers library supports out of the box, 9 00:00:15,960 --> 00:00:18,780 one of them being token classification, 10 00:00:18,780 --> 00:00:21,810 for instance predicting for each word in a sentence 11 00:00:21,810 --> 00:00:24,510 whether they correspond to a person, an organization 12 00:00:24,510 --> 00:00:25,353 or a location. 13 00:00:26,670 --> 00:00:28,920 We can even group together the tokens corresponding 14 00:00:28,920 --> 00:00:32,040 to the same entity, for instance all the tokens 15 00:00:32,040 --> 00:00:35,373 that formed the word Sylvain here, or Hugging and Face. 16 00:00:37,290 --> 00:00:40,230 The token classification pipeline works the same way 17 00:00:40,230 --> 00:00:42,630 as the text classification pipeline we studied 18 00:00:42,630 --> 00:00:44,430 in the previous video. 19 00:00:44,430 --> 00:00:45,930 There are three steps. 20 00:00:45,930 --> 00:00:49,623 The tokenization, the model, and the postprocessing. 21 00:00:50,940 --> 00:00:52,530 The first two steps are identical 22 00:00:52,530 --> 00:00:54,630 to the text classification pipeline, 23 00:00:54,630 --> 00:00:57,300 except we use an auto token classification model 24 00:00:57,300 --> 00:01:00,150 instead of a sequence classification one. 25 00:01:00,150 --> 00:01:03,720 We tokenize our text then feed it to the model. 26 00:01:03,720 --> 00:01:05,877 Instead of getting one number for each possible label 27 00:01:05,877 --> 00:01:08,700 for the whole sentence, we get one number 28 00:01:08,700 --> 00:01:10,770 for each of the possible nine labels 29 00:01:10,770 --> 00:01:13,983 for every token in the sentence, here 19. 30 00:01:15,300 --> 00:01:18,090 Like all the other models of the Transformers library, 31 00:01:18,090 --> 00:01:19,830 our model outputs logits, 32 00:01:19,830 --> 00:01:23,073 which we turn into predictions by using a SoftMax. 33 00:01:23,940 --> 00:01:26,190 We also get the predicted label for each token 34 00:01:26,190 --> 00:01:27,990 by taking the maximum prediction, 35 00:01:27,990 --> 00:01:29,880 since the SoftMax function preserves the orders, 36 00:01:29,880 --> 00:01:31,200 we could have done it on the logits 37 00:01:31,200 --> 00:01:33,050 if we had no need of the predictions. 38 00:01:33,930 --> 00:01:35,880 The model config contains the label mapping 39 00:01:35,880 --> 00:01:37,740 in its id2label field. 40 00:01:37,740 --> 00:01:41,430 Using it, we can map every token to its corresponding label. 41 00:01:41,430 --> 00:01:43,950 The label, O, correspond to no entity, 42 00:01:43,950 --> 00:01:45,985 which is why we didn't see it in our results 43 00:01:45,985 --> 00:01:47,547 in the first slide. 44 00:01:47,547 --> 00:01:49,440 On top of the label and the probability, 45 00:01:49,440 --> 00:01:51,000 those results included the start 46 00:01:51,000 --> 00:01:53,103 and end character in the sentence. 47 00:01:54,120 --> 00:01:55,380 We'll need to use the offset mapping 48 00:01:55,380 --> 00:01:56,640 of the tokenizer to get those. 49 00:01:56,640 --> 00:01:58,050 Look at the video linked below 50 00:01:58,050 --> 00:02:00,300 if you don't know about them already. 51 00:02:00,300 --> 00:02:02,280 Then, looping through each token 52 00:02:02,280 --> 00:02:04,080 that has a label distinct from O, 53 00:02:04,080 --> 00:02:06,120 we can build the list of results we got 54 00:02:06,120 --> 00:02:07,320 with our first pipeline. 55 00:02:08,460 --> 00:02:10,560 The last step is to group together tokens 56 00:02:10,560 --> 00:02:12,310 that correspond to the same entity. 57 00:02:13,264 --> 00:02:16,140 This is why we had two labels for each type of entity, 58 00:02:16,140 --> 00:02:18,450 I-PER and B-PER, for instance. 59 00:02:18,450 --> 00:02:20,100 It allows us to know if a token is 60 00:02:20,100 --> 00:02:22,323 in the same entity as the previous one. 61 00:02:23,310 --> 00:02:25,350 Note, that there are two ways of labeling used 62 00:02:25,350 --> 00:02:26,850 for token classification. 63 00:02:26,850 --> 00:02:29,420 One, in pink here, uses the B-PER label 64 00:02:29,420 --> 00:02:30,810 at the beginning of each new entity, 65 00:02:30,810 --> 00:02:32,760 but the other, in blue, 66 00:02:32,760 --> 00:02:35,340 only uses it to separate two adjacent entities 67 00:02:35,340 --> 00:02:37,140 of the same type. 68 00:02:37,140 --> 00:02:39,690 In both cases, we can flag a new entity 69 00:02:39,690 --> 00:02:41,940 each time we see a new label appearing, 70 00:02:41,940 --> 00:02:44,730 either with the I or B prefix, 71 00:02:44,730 --> 00:02:47,130 then take all the following tokens labeled the same, 72 00:02:47,130 --> 00:02:48,870 with an I-flag. 73 00:02:48,870 --> 00:02:51,330 This, coupled with the offset mapping to get the start 74 00:02:51,330 --> 00:02:54,210 and end characters allows us to get the span of texts 75 00:02:54,210 --> 00:02:55,233 for each entity. 76 00:02:56,569 --> 00:02:59,532 (title whooshes) 77 00:02:59,532 --> 00:03:01,134 (title fizzles)

subtitles/en/45_inside-the-token-classification-pipeline-(pytorch).srt (256 lines of code) (raw):