subtitles/en/54_building-a-new-tokenizer.srt

1
00:00:00,188 --> 00:00:02,855
(air whooshing)

2
00:00:05,400 --> 00:00:07,500
In this video, we will see how

3
00:00:07,500 --> 00:00:11,310
you can create your own tokenizer from scratch.

4
00:00:11,310 --> 00:00:15,000
To create your own tokenizer, you will have to think about

5
00:00:15,000 --> 00:00:18,180
each of the operations involved in tokenization.

6
00:00:18,180 --> 00:00:22,440
Namely, the normalization, the pre-tokenization,

7
00:00:22,440 --> 00:00:25,233
the model, the post-processing, and the decoding.

8
00:00:26,100 --> 00:00:28,350
If you don't know what normalization,

9
00:00:28,350 --> 00:00:30,900
pre-tokenization, and the model are,

10
00:00:30,900 --> 00:00:34,531
I advise you to go and see the videos linked below.

11
00:00:34,531 --> 00:00:37,110
The post-processing gathers all the modifications

12
00:00:37,110 --> 00:00:40,860
that we will carry out on the tokenized text.

13
00:00:40,860 --> 00:00:43,890
It can include the addition of special tokens,

14
00:00:43,890 --> 00:00:46,290
the creation of an attention mask,

15
00:00:46,290 --> 00:00:48,903
but also the generation of a list of token type IDs.

16
00:00:50,220 --> 00:00:53,487
The decoding operation occurs at the very end,

17
00:00:53,487 --> 00:00:54,660
and will allow going

18
00:00:54,660 --> 00:00:57,753
from the sequence of IDs back to a sentence.

19
00:00:58,890 --> 00:01:01,800
For example, you can see that the hashtags

20
00:01:01,800 --> 00:01:04,260
have been removed, and the tokens

21
00:01:04,260 --> 00:01:07,323
composing the word "today" have been grouped together.

22
00:01:10,860 --> 00:01:13,440
In a fast tokenizer, all these components

23
00:01:13,440 --> 00:01:16,413
are gathered in the backend_tokenizer attribute.

24
00:01:17,370 --> 00:01:20,070
As you can see with this small code snippet,

25
00:01:20,070 --> 00:01:22,020
it is an instance of a Tokenizer

26
00:01:22,020 --> 00:01:23,763
from the tokenizers library.

27
00:01:25,740 --> 00:01:28,263
So, to create your own tokenizer,

28
00:01:29,970 --> 00:01:31,770
you will have to follow these steps.

29
00:01:33,270 --> 00:01:35,433
First, create a training dataset.

30
00:01:36,690 --> 00:01:39,000
Second, create and train a tokenizer

31
00:01:39,000 --> 00:01:41,700
with the tokenizers library.

32
00:01:41,700 --> 00:01:46,700
And third, load this tokenizer into a transformers tokenizer.

33
00:01:49,350 --> 00:01:50,850
To understand these steps,

34
00:01:50,850 --> 00:01:54,573
I propose that we recreate a BERT tokenizer together.

35
00:01:56,460 --> 00:01:58,893
The first thing to do is to create a dataset.

36
00:01:59,970 --> 00:02:02,460
With this code snippet, you can create an iterator

37
00:02:02,460 --> 00:02:05,430
over the dataset wikitext-2-raw-v1,

38
00:02:05,430 --> 00:02:08,160
which is a rather small dataset in English,

39
00:02:08,160 --> 00:02:09,730
perfect for the example.

40
00:02:12,210 --> 00:02:13,920
Now we get to the main part:

41
00:02:13,920 --> 00:02:17,373
the design of our tokenizer with the tokenizers library.

42
00:02:18,750 --> 00:02:22,020
We start by initializing a Tokenizer instance

43
00:02:22,020 --> 00:02:26,133
with a WordPiece model, because it is the model used by BERT.

44
00:02:29,100 --> 00:02:32,190
Then we can define our normalizer.

45
00:02:32,190 --> 00:02:35,891
We will define it as a succession of two normalizations

46
00:02:35,891 --> 00:02:39,453
used to clean up characters not visible in the text,

47
00:02:40,590 --> 00:02:43,440
one lowercasing normalization,

48
00:02:43,440 --> 00:02:47,253
and two last normalizations used to remove accents.

49
00:02:49,500 --> 00:02:53,553
For the pre-tokenization, we will chain two pre-tokenizers.

50
00:02:54,390 --> 00:02:58,200
The first one separating the text at the level of spaces,

51
00:02:58,200 --> 00:03:01,533
and the second one isolating the punctuation marks.

52
00:03:03,360 --> 00:03:06,360
Now, we can define the trainer that will allow us

53
00:03:06,360 --> 00:03:09,753
to train the WordPiece model chosen at the beginning.

54
00:03:11,160 --> 00:03:12,600
To carry out the training,

55
00:03:12,600 --> 00:03:14,853
we will have to choose a vocabulary size.

56
00:03:16,050 --> 00:03:17,910
Here we choose 25,000.

57
00:03:17,910 --> 00:03:21,270
And we also need to specify the special tokens

58
00:03:21,270 --> 00:03:24,663
that we absolutely want to add to our vocabulary.

59
00:03:29,160 --> 00:03:33,000
In one line of code, we can train our WordPiece model

60
00:03:33,000 --> 00:03:35,553
using the iterator we defined earlier.

61
00:03:39,060 --> 00:03:42,570
Once the model has been trained, we can retrieve

62
00:03:42,570 --> 00:03:46,560
the IDs of the special class and separation tokens,

63
00:03:46,560 --> 00:03:49,413
because we will need them to post-process our sequence.

64
00:03:50,820 --> 00:03:52,860
Thanks to the TemplateProcessing class,

65
00:03:52,860 --> 00:03:57,210
we can add the CLS token at the beginning of each sequence,

66
00:03:57,210 --> 00:04:00,120
and the SEP token at the end of the sequence,

67
00:04:00,120 --> 00:04:03,873
and between two sentences if we tokenize a pair of texts.

68
00:04:07,260 --> 00:04:10,500
Finally, we just have to define our decoder,

69
00:04:10,500 --> 00:04:12,690
which will allow us to remove the hashtags

70
00:04:12,690 --> 00:04:14,610
at the beginning of the tokens

71
00:04:14,610 --> 00:04:17,193
that must be reattached to the previous token.

72
00:04:21,300 --> 00:04:22,260
And there it is.

73
00:04:22,260 --> 00:04:25,110
You have all the necessary lines of code

74
00:04:25,110 --> 00:04:29,403
to define your own tokenizer with the tokenizers library.

75
00:04:30,960 --> 00:04:32,280
Now that we have a brand new tokenizer

76
00:04:32,280 --> 00:04:35,400
with the tokenizers library, we just have to load it

77
00:04:35,400 --> 00:04:38,463
into a fast tokenizer from the transformers library.

78
00:04:39,960 --> 00:04:42,630
Here again, we have several possibilities.

79
00:04:42,630 --> 00:04:44,430
We can load it into the generic class,

80
00:04:44,430 --> 00:04:48,330
PreTrainedTokenizerFast, or into the BertTokenizerFast class,

81
00:04:48,330 --> 00:04:52,353
since we have built a BERT-like tokenizer here.

82
00:04:57,000 --> 00:04:59,670
I really hope this video has helped you understand

83
00:04:59,670 --> 00:05:02,133
how you can create your own tokenizer,

84
00:05:03,178 --> 00:05:06,240
and that you are now ready to navigate

85
00:05:06,240 --> 00:05:08,070
the tokenizers library documentation

86
00:05:08,070 --> 00:05:11,367
to choose the components for your brand new tokenizer.

87
00:05:12,674 --> 00:05:15,341
(air whooshing)
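The walkthrough narrated above can be written out end to end. Below is a minimal sketch, assuming the datasets, tokenizers, and transformers libraries are installed; the iterator batch size, the exact cleanup regular expressions in the normalizer, and the keyword arguments used when wrapping the final tokenizer are assumptions, since the subtitles do not reproduce the video's code snippets.

```python
# Minimal end-to-end sketch of the three steps described in the video:
# 1) build a training corpus, 2) create and train a tokenizer with the
# tokenizers library, 3) load it into a transformers fast tokenizer.
from datasets import load_dataset
from tokenizers import (
    Regex,
    Tokenizer,
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
)
from transformers import BertTokenizerFast, PreTrainedTokenizerFast

# Step 1: a small English dataset, iterated in batches of raw text.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

def get_training_corpus():
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]["text"]

# Step 2: a WordPiece model, since that is the model used by BERT.
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))

# Normalizer: two replacements cleaning up characters not visible in the
# text (these exact regexes are assumptions), one lowercasing step, and
# two last steps (NFD + StripAccents) removing accents.
tokenizer.normalizer = normalizers.Sequence(
    [
        normalizers.Replace(Regex(r"[\p{Other}&&[^\n\t\r]]"), ""),
        normalizers.Replace(Regex(r"[\s]"), " "),
        normalizers.Lowercase(),
        normalizers.NFD(),
        normalizers.StripAccents(),
    ]
)

# Pre-tokenizer: split on whitespace, then isolate punctuation marks.
tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
    [pre_tokenizers.WhitespaceSplit(), pre_tokenizers.Punctuation()]
)

# Trainer: vocabulary size of 25,000 and the special tokens we absolutely
# want to add to the vocabulary.
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)

# One line of code to train the WordPiece model on the iterator.
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)

# Retrieve the IDs of the class and separation tokens for post-processing.
cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")

# Post-processor: [CLS] at the beginning of each sequence, [SEP] at the
# end and between the two sentences of a pair.
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS]:0 $A:0 [SEP]:0",
    pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[("[CLS]", cls_token_id), ("[SEP]", sep_token_id)],
)

# Decoder: remove the "##" prefixes and reattach sub-word tokens.
tokenizer.decoder = decoders.WordPiece(prefix="##")

# Step 3: load the tokenizer into a fast tokenizer from transformers,
# either the generic class or the BERT-specific one.
wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)
bert_tokenizer = BertTokenizerFast(tokenizer_object=tokenizer)
```

An encode/decode round trip on the wrapped tokenizer should then show the behaviour described in the video: the post-processor adds [CLS] and [SEP] around the sequence, and the decoder removes the "##" prefixes and regroups the sub-word tokens into words.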