subtitles/en/46_inside-the-token-classification-pipeline-(tensorflow).srt

1
00:00:00,180 --> 00:00:03,013
(whooshing sound)

2
00:00:05,310 --> 00:00:06,143
- Let's have a look

3
00:00:06,143 --> 00:00:08,133
inside the token classification pipeline.

4
00:00:09,780 --> 00:00:11,430
In the pipeline video,

5
00:00:11,430 --> 00:00:13,230
we looked at the different applications

6
00:00:13,230 --> 00:00:16,050
the Transformers library supports out of the box,

7
00:00:16,050 --> 00:00:18,660
one of them being token classification.

8
00:00:18,660 --> 00:00:22,050
For instance, predicting for each word in a sentence

9
00:00:22,050 --> 00:00:23,790
whether it corresponds to a person,

10
00:00:23,790 --> 00:00:26,043
an organization, or a location.

11
00:00:27,690 --> 00:00:29,250
We can even group together the tokens

12
00:00:29,250 --> 00:00:31,320
corresponding to the same entity.

13
00:00:31,320 --> 00:00:34,890
For instance, all the tokens that form the word Sylvain here,

14
00:00:34,890 --> 00:00:36,423
or Hugging and Face.

15
00:00:37,320 --> 00:00:39,720
So, the token classification pipeline

16
00:00:39,720 --> 00:00:42,480
works the same way as the text classification pipeline

17
00:00:42,480 --> 00:00:44,910
we studied in a previous video.

18
00:00:44,910 --> 00:00:46,500
There are three steps:

19
00:00:46,500 --> 00:00:50,043
tokenization, the model, and the post-processing.

20
00:00:51,690 --> 00:00:53,190
The first two steps are identical

21
00:00:53,190 --> 00:00:55,230
to the text classification pipeline,

22
00:00:55,230 --> 00:00:58,230
except we use a token classification model

23
00:00:58,230 --> 00:01:00,303
instead of a sequence classification one.

24
00:01:01,560 --> 00:01:04,593
We tokenize our text, then feed it to the model.

25
00:01:05,580 --> 00:01:08,160
Instead of getting one number for each possible label

26
00:01:08,160 --> 00:01:09,600
for the whole sentence,

27
00:01:09,600 --> 00:01:12,270
we get one number for each of the nine possible labels

28
00:01:12,270 --> 00:01:14,250
for every token in the sentence.

29
00:01:14,250 --> 00:01:15,573
Here, 19 tokens.

30
00:01:17,070 --> 00:01:19,710
Like all the other models of the Transformers library,

31
00:01:19,710 --> 00:01:22,560
our model outputs logits, which we need to turn

32
00:01:22,560 --> 00:01:24,663
into predictions by using a softmax.

33
00:01:25,830 --> 00:01:28,170
We also get the predicted label for each token

34
00:01:28,170 --> 00:01:30,063
by taking the maximum prediction.

35
00:01:31,080 --> 00:01:33,540
Since the softmax function preserves the order,

36
00:01:33,540 --> 00:01:34,980
we could have done it on the logits

37
00:01:34,980 --> 00:01:36,830
if we had no need of the predictions.

38
00:01:37,680 --> 00:01:40,050
The model config contains the label mapping

39
00:01:40,050 --> 00:01:42,090
in its id2label field.

40
00:01:42,090 --> 00:01:45,600
Using it, we can map every token to its corresponding label.

41
00:01:45,600 --> 00:01:48,630
The label O corresponds to "no entity,"

42
00:01:48,630 --> 00:01:50,460
which is why we didn't see it in our results

43
00:01:50,460 --> 00:01:52,110
in the first slide.

44
00:01:52,110 --> 00:01:54,150
On top of the label and the probability,

45
00:01:54,150 --> 00:01:55,620
those results included the start

46
00:01:55,620 --> 00:01:57,423
and end characters in the sentence.

47
00:01:58,294 --> 00:01:59,880
We'll need to use the offset mapping

48
00:01:59,880 --> 00:02:01,110
of the tokenizer to get those.

49
00:02:01,110 --> 00:02:03,090
Look at the video linked below

50
00:02:03,090 --> 00:02:05,340
if you don't know about them already.
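A minimal sketch of the steps described so far (tokenization, the TensorFlow model, the softmax, and the id2label mapping). This code is not part of the video; the checkpoint name and the example sentence are assumptions borrowed from the course's NER example, and the exact pipeline implementation may differ.

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForTokenClassification

# Assumed NER checkpoint (the one used elsewhere in the course).
model_checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
# Add from_pt=True if the checkpoint only ships PyTorch weights.
model = TFAutoModelForTokenClassification.from_pretrained(model_checkpoint)

example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
inputs = tokenizer(example, return_tensors="tf")
outputs = model(**inputs)

# One row of logits per token, one column per label: shape (1, num_tokens, 9).
print(outputs.logits.shape)

# Softmax turns logits into probabilities; argmax gives the predicted label id.
probabilities = tf.math.softmax(outputs.logits, axis=-1)[0]
predictions = tf.math.argmax(outputs.logits, axis=-1)[0].numpy().tolist()

# The model config maps label ids to label names.
print([model.config.id2label[p] for p in predictions])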
51
00:02:05,340 --> 00:02:06,990
Then, looping through each token

52
00:02:06,990 --> 00:02:09,090
that has a label distinct from O,

53
00:02:09,090 --> 00:02:10,590
we can build the list of results

54
00:02:10,590 --> 00:02:12,140
we got with our first pipeline.

55
00:02:13,650 --> 00:02:15,840
The last step is to group together tokens

56
00:02:15,840 --> 00:02:17,640
that correspond to the same entity.

57
00:02:18,930 --> 00:02:21,540
This is why we had two labels for each type of entity,

58
00:02:21,540 --> 00:02:23,940
I-PER and B-PER for instance.

59
00:02:23,940 --> 00:02:25,530
It allows us to know if a token

60
00:02:25,530 --> 00:02:27,603
is in the same entity as the previous one.

61
00:02:28,620 --> 00:02:29,850
Note that there are two ways

62
00:02:29,850 --> 00:02:32,490
of labeling used for token classification.

63
00:02:32,490 --> 00:02:35,360
One, in pink here, uses the B-PER label

64
00:02:35,360 --> 00:02:37,530
at the beginning of each new entity,

65
00:02:37,530 --> 00:02:39,990
but the other, in blue, only uses it

66
00:02:39,990 --> 00:02:42,933
to separate two adjacent entities of the same type.

67
00:02:44,340 --> 00:02:46,560
In both cases, we can flag a new entity

68
00:02:46,560 --> 00:02:49,110
each time we see a new label appearing,

69
00:02:49,110 --> 00:02:51,330
either with the I- or B- prefix.

70
00:02:51,330 --> 00:02:53,850
Then, take all the following tokens labeled the same,

71
00:02:53,850 --> 00:02:55,470
with an I- prefix.

72
00:02:55,470 --> 00:02:57,000
This, coupled with the offset mapping

73
00:02:57,000 --> 00:02:59,010
to get the start and end characters,

74
00:02:59,010 --> 00:03:01,560
allows us to get the span of text for each entity.

75
00:03:02,869 --> 00:03:05,702
(whooshing sound)
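Continuing the sketch above, a hedged illustration of this post-processing: the fast tokenizer's offset mapping recovers character spans, and B-/I- tokens of the same type are grouped into one entity. The variable names (tokenizer, example, predictions, model) come from the previous snippet, and the grouping loop is a simplification rather than the exact pipeline source.

# Offsets give (start, end) character positions for each token in the example.
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
offsets = inputs_with_offsets["offset_mapping"]

results = []
idx = 0
while idx < len(predictions):
    label = model.config.id2label[predictions[idx]]
    if label != "O":
        entity_type = label[2:]  # strip the B- or I- prefix
        start, _ = offsets[idx]
        # Take every following token labeled I-<entity_type> as the same entity.
        while (
            idx + 1 < len(predictions)
            and model.config.id2label[predictions[idx + 1]] == f"I-{entity_type}"
        ):
            idx += 1
        _, end = offsets[idx]
        results.append(
            {"entity_group": entity_type, "word": example[start:end],
             "start": start, "end": end}
        )
    idx += 1

print(results)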