The fast tokenizers of the Transformers library are fast, but they also implement features that will be super useful for data pre-processing and post-processing. Let's have a look at them!

First, let's have a look at the usual output of a tokenizer. We get input IDs that correspond to tokens, but we lose a lot of information in the process. For instance, here the tokenization is the same for the two sentences, even if one has several more spaces than the other. Just having the input IDs is thus not enough if we want to match some tokens with a span of text, something we'll need to do when tackling question answering, for instance. It's also difficult to know when two tokens belong to the same word or not. It looks easy when you just look at the output of a BERT tokenizer, where we just need to look for the ## prefix. But other tokenizers have different ways to tokenize parts of words: for instance, RoBERTa adds the special Ġ symbol to mark the tokens at the start of a word, and T5 uses the special underscore symbol ▁ for the same purpose.

Thankfully, the fast tokenizers keep track of the word each token comes from, with a word_ids method you can use on their outputs. The output is not necessarily clear, but assembled together in a nice table like this, we can look at the word position for each token. Even better, the fast tokenizers keep track of the span of characters each token comes from, and we can get those spans when calling the tokenizer on one or several texts by adding the return_offsets_mapping=True argument. In this instance, we can see how we jump positions between the "##" token and the "super" token, because of the multiple spaces in the initial sentence.

To enable this, the fast tokenizers store additional information at each step of their internal pipeline.
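To make those two features concrete, here is a minimal sketch; the checkpoint name and the example sentence are just placeholders, and the exact tokens you get back will depend on the tokenizer you pick.

```python
from transformers import AutoTokenizer

# Any fast tokenizer works here; "bert-base-cased" is just an example checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sentence = "Let's talk about tokenizers    superpowers."
encoding = tokenizer(sentence, return_offsets_mapping=True)

# tokens() shows the tokens themselves; word_ids() maps each token back to the
# index of the word it comes from (None for special tokens like [CLS] and [SEP]).
print(encoding.tokens())
print(encoding.word_ids())

# Each offset is a (start, end) character span in the original sentence; the
# extra spaces show up as a jump between the spans of consecutive tokens.
for token, (start, end) in zip(encoding.tokens(), encoding["offset_mapping"]):
    print(token, (start, end), repr(sentence[start:end]))
```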
That internal pipeline consists of normalization, where we apply some cleaning to the text, like lowercasing or removing accents; pre-tokenization, which is where we split the texts into words; then the model of the tokenizer, which is where the words are split into tokens; before finally doing the post-processing, where special tokens are added.

From the beginning to the end of the pipeline, the tokenizer keeps track of each span of text that corresponds to each word, then to each token.

We'll see how useful this is when we tackle the following tasks. When doing masked language modeling, one variation that gets state-of-the-art results is to mask all the tokens of a given word instead of randomly chosen tokens; this will require us to use the word IDs we just saw. When doing token classification, we'll need to convert the labels we have on words to labels on each token. As for the offset mappings, they will be super useful when we need to convert token positions in a sentence into a span of text, which we'll need when looking at question answering or when grouping the tokens corresponding to the same entity in token classification.

To have a look at these tasks, check the videos linked below!
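As a rough sketch of the token-classification case mentioned above, word_ids lets us spread each word-level label onto all of the tokens of that word; the words, labels, and checkpoint below are made up for illustration.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Hypothetical pre-split words with one label per word (e.g. for NER).
words = ["Sylvain", "works", "at", "Hugging", "Face"]
word_labels = [1, 0, 0, 2, 2]

encoding = tokenizer(words, is_split_into_words=True)

# Special tokens get -100 (a label commonly ignored by the loss); every other
# token simply inherits the label of the word it comes from.
token_labels = [
    -100 if word_id is None else word_labels[word_id]
    for word_id in encoding.word_ids()
]
print(encoding.tokens())
print(token_labels)
```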
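And for the offset mappings, here is a small sketch of turning a pair of token positions (as a question-answering model would predict them) back into a span of the original text; the context and the token indices are invented for the example.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

context = "Fast tokenizers keep track of the characters each token comes from."
encoding = tokenizer(context, return_offsets_mapping=True)

# Pretend a QA model predicted the answer starts at token 1 and ends at token 3.
start_token, end_token = 1, 3
start_char, _ = encoding["offset_mapping"][start_token]
_, end_char = encoding["offset_mapping"][end_token]
print(context[start_char:end_char])
```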