subtitles/en/55_data-processing-for-token-classification.srt

1
00:00:05,730 --> 00:00:07,590
- Let's study how to preprocess a dataset

2
00:00:07,590 --> 00:00:09,063
for token classification!

3
00:00:10,560 --> 00:00:12,660
Token classification regroups any task

4
00:00:12,660 --> 00:00:14,940
that can be framed as labeling each word

5
00:00:14,940 --> 00:00:17,190
or token in a sentence,

6
00:00:17,190 --> 00:00:19,530
like identifying the persons, organizations

7
00:00:19,530 --> 00:00:21,093
and locations, for instance.

8
00:00:22,170 --> 00:00:25,470
For our example, we will use the CoNLL dataset,

9
00:00:25,470 --> 00:00:27,900
in which we remove columns we won't use

10
00:00:27,900 --> 00:00:29,940
and rename the other ones to get to a dataset

11
00:00:29,940 --> 00:00:32,943
with just two columns, words and labels.

12
00:00:34,200 --> 00:00:36,750
If you have your own dataset for token classification,

13
00:00:36,750 --> 00:00:39,870
just make sure you clean your data to get to the same point,

14
00:00:39,870 --> 00:00:43,290
with one column containing words as lists of strings

15
00:00:43,290 --> 00:00:45,540
and another containing labels as integers

16
00:00:45,540 --> 00:00:48,513
spanning from zero to your number of labels minus one.

17
00:00:49,740 --> 00:00:52,290
Make sure you have your label names stored somewhere.

18
00:00:52,290 --> 00:00:54,810
Here we get them from the dataset features,

19
00:00:54,810 --> 00:00:57,660
so you are able to map the integers to some real labels

20
00:00:57,660 --> 00:00:58,960
when inspecting your data.

21
00:01:00,690 --> 00:01:03,510
Here we are doing named entity recognition,

22
00:01:03,510 --> 00:01:05,430
so our labels are either O

23
00:01:05,430 --> 00:01:08,310
for words that do not belong to any entity,

24
00:01:08,310 --> 00:01:13,310
LOC for location, PER for person, ORG for organization

25
00:01:13,860 --> 00:01:15,603
and MISC for miscellaneous.

26
00:01:16,650 --> 00:01:18,540
Each label has two versions.

27
00:01:18,540 --> 00:01:21,960
The B labels indicate a word that begins an entity

28
00:01:21,960 --> 00:01:25,503
while the I labels indicate a word that is inside an entity.

29
00:01:27,180 --> 00:01:28,830
The first step to preprocess our data

30
00:01:28,830 --> 00:01:30,660
is to tokenize the words.

31
00:01:30,660 --> 00:01:33,120
This is very easily done with the tokenizer.

32
00:01:33,120 --> 00:01:35,370
We just have to tell it we have pre-tokenized the data

33
00:01:35,370 --> 00:01:37,503
with the flag is_split_into_words=True.

34
00:01:38,520 --> 00:01:40,380
Then comes the hard part.

35
00:01:40,380 --> 00:01:42,360
Since we have added special tokens

36
00:01:42,360 --> 00:01:45,270
and each word may have been split into several tokens,

37
00:01:45,270 --> 00:01:48,090
our labels won't match the tokens anymore.

38
00:01:48,090 --> 00:01:50,670
This is where the word IDs our fast tokenizer provides

39
00:01:50,670 --> 00:01:51,723
come to the rescue.

40
00:01:53,040 --> 00:01:55,500
They match each token to the word it belongs to,

41
00:01:55,500 --> 00:01:58,470
which allows us to map each token to its label.

42
00:01:58,470 --> 00:01:59,303
We just have to make sure

43
00:01:59,303 --> 00:02:01,710
we change the B labels to their I counterparts

44
00:02:01,710 --> 00:02:03,450
for tokens that are inside

45
00:02:03,450 --> 00:02:05,433
but not at the beginning of a word.
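A minimal sketch of this tokenization and label-alignment step, for reference alongside the video. It assumes the cleaned columns are named "words" and "labels" as described above, that a fast tokenizer checkpoint such as "bert-base-cased" is used, and that the label list alternates B-/I- pairs so a B label's I counterpart is simply the next integer; the function names are illustrative, not part of the transcript.

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    def align_labels_with_tokens(labels, word_ids):
        new_labels = []
        current_word = None
        for word_id in word_ids:
            if word_id is None:
                # Special tokens get -100 so the loss function ignores them.
                new_labels.append(-100)
            elif word_id != current_word:
                # First token of a new word keeps that word's label.
                current_word = word_id
                new_labels.append(labels[word_id])
            else:
                # Token inside a word: switch a B label to its I counterpart
                # (assumes B labels are odd and the matching I label follows).
                label = labels[word_id]
                if label % 2 == 1:
                    label += 1
                new_labels.append(label)
        return new_labels

    def tokenize_and_align_labels(examples):
        # Tell the tokenizer the data is already split into words.
        tokenized_inputs = tokenizer(
            examples["words"], truncation=True, is_split_into_words=True
        )
        all_labels = []
        for i, labels in enumerate(examples["labels"]):
            word_ids = tokenized_inputs.word_ids(i)
            all_labels.append(align_labels_with_tokens(labels, word_ids))
        tokenized_inputs["labels"] = all_labels
        return tokenized_inputs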
46
00:02:06,330 --> 00:02:09,120
The special tokens get a label of -100,

47
00:02:09,120 --> 00:02:11,070
which is how we tell the Transformer loss functions

48
00:02:11,070 --> 00:02:14,607
to ignore them when computing the loss.

49
00:02:14,607 --> 00:02:16,890
The code is then pretty straightforward.

50
00:02:16,890 --> 00:02:18,660
We write a function that shifts the labels

51
00:02:18,660 --> 00:02:21,840
for tokens that are inside a word, which you can customize,

52
00:02:21,840 --> 00:02:24,490
and use it when generating the labels for each token.

53
00:02:25,830 --> 00:02:28,260
Once that function to create our labels is written,

54
00:02:28,260 --> 00:02:31,920
we can preprocess the whole dataset using the map function.

55
00:02:31,920 --> 00:02:33,360
With the option batched=True,

56
00:02:33,360 --> 00:02:35,793
we unleash the speed of our fast tokenizers.

57
00:02:37,110 --> 00:02:40,350
The last problem comes when we need to create a batch.

58
00:02:40,350 --> 00:02:42,150
Unless you changed the preprocessing function

59
00:02:42,150 --> 00:02:43,890
to apply some fixed padding,

60
00:02:43,890 --> 00:02:45,900
we will get sentences of various lengths,

61
00:02:45,900 --> 00:02:47,900
which we need to pad to the same length.

62
00:02:48,930 --> 00:02:50,730
The padding needs to be applied to the inputs

63
00:02:50,730 --> 00:02:51,900
as well as the labels,

64
00:02:51,900 --> 00:02:53,950
since we should have one label per token.

65
00:02:54,870 --> 00:02:58,260
Again, -100 indicates the labels that should be ignored

66
00:02:58,260 --> 00:02:59,510
for the loss computation.

67
00:03:00,420 --> 00:03:01,560
This is all done for us

68
00:03:01,560 --> 00:03:04,050
by the DataCollatorForTokenClassification,

69
00:03:04,050 --> 00:03:06,740
which you can use in PyTorch or TensorFlow.

70
00:03:06,740 --> 00:03:08,880
With all of this, you are either ready to send your data

71
00:03:08,880 --> 00:03:11,190
and this data collator to the Trainer,

72
00:03:11,190 --> 00:03:13,320
or use the to_tf_dataset method

73
00:03:13,320 --> 00:03:15,333
and the fit method of your model.
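A rough sketch of these last two steps, under the same assumptions as the snippet above. Here raw_datasets and its "train" split are hypothetical names for the cleaned dataset, and tokenize_and_align_labels is the function sketched earlier; the collator shown uses the default (PyTorch) tensors, as mentioned in the video it also supports TensorFlow.

    from transformers import DataCollatorForTokenClassification

    # Preprocess the whole dataset in batches to benefit from the fast tokenizer.
    tokenized_datasets = raw_datasets.map(
        tokenize_and_align_labels,
        batched=True,
        remove_columns=raw_datasets["train"].column_names,
    )

    # Pads inputs and labels to the same length inside each batch, filling the
    # padded label positions with -100 so they are ignored by the loss.
    data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)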