subtitles/en/56_data-processing-for-masked-language-modeling.srt
1
00:00:00,000 --> 00:00:02,333
(whooshing)
2
00:00:05,250 --> 00:00:07,230
- Let's see how we can preprocess our data
3
00:00:07,230 --> 00:00:08,703
for masked language modeling.
4
00:00:10,230 --> 00:00:12,570
As a reminder, masked language modeling
5
00:00:12,570 --> 00:00:15,333
is when a model needs to fill
in the blanks in a sentence.
6
00:00:16,530 --> 00:00:19,650
To do this, you just
need texts, no labels,
7
00:00:19,650 --> 00:00:22,200
as this is a self-supervised problem.
8
00:00:22,200 --> 00:00:23,670
To apply this on your own data,
9
00:00:23,670 --> 00:00:25,740
just make sure you have
all your texts gathered
10
00:00:25,740 --> 00:00:27,603
in one column of your dataset.
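As an illustration, here is a minimal sketch of that starting point, using the imdb dataset purely as an example; any dataset works as long as the raw text sits in a single column:

```python
from datasets import load_dataset

# imdb is just an illustrative choice; only the "text" column matters here,
# the labels are not needed for masked language modeling.
raw_datasets = load_dataset("imdb")
print(raw_datasets["train"].column_names)  # ['text', 'label']
```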
11
00:00:28,440 --> 00:00:30,480
Before we start randomly masking things,
12
00:00:30,480 --> 00:00:33,090
we will need to somehow make
all those texts the same length
13
00:00:33,090 --> 00:00:34,263
to batch them together.
14
00:00:35,640 --> 00:00:38,490
The first way to make all
the texts the same length
15
00:00:38,490 --> 00:00:40,590
is the one we used in text classification.
16
00:00:41,430 --> 00:00:44,163
Let's pad the short texts
and truncate the long ones.
17
00:00:45,030 --> 00:00:45,900
As we have seen
18
00:00:45,900 --> 00:00:48,690
when we processed data
for text classification,
19
00:00:48,690 --> 00:00:49,923
this is all done by our tokenizer
20
00:00:49,923 --> 00:00:53,130
with the right options for
padding and truncation.
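In code, this first approach could look like the sketch below, assuming a bert-base-cased tokenizer, a context length of 128 and a "text" column; these are illustrative choices, not the only possible ones:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

raw_datasets = load_dataset("imdb")  # illustrative dataset with a "text" column
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
context_length = 128  # illustrative context length

def tokenize_pad_and_truncate(examples):
    # Short texts are padded up to context_length and long ones are truncated,
    # so every sample ends up with the same length and can be batched.
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=context_length,
    )

tokenized_datasets = raw_datasets.map(tokenize_pad_and_truncate, batched=True)
```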
21
00:00:53,130 --> 00:00:56,100
This will, however, make
us lose a lot of text
22
00:00:56,100 --> 00:00:58,620
if the examples in our
dataset are very long,
23
00:00:58,620 --> 00:01:00,960
compared to the context length we picked.
24
00:01:00,960 --> 00:01:03,393
Here, all the portion in gray is lost.
25
00:01:04,410 --> 00:01:06,660
This is why a second way
to generate samples of text
26
00:01:06,660 --> 00:01:08,820
with the same length is to chunk our text
27
00:01:08,820 --> 00:01:10,560
into pieces of context length,
28
00:01:10,560 --> 00:01:14,010
instead of discarding everything
after the first chunk.
29
00:01:14,010 --> 00:01:15,420
There will probably be a remainder
30
00:01:15,420 --> 00:01:17,700
of length smaller than the context size,
31
00:01:17,700 --> 00:01:20,493
which we can choose to
keep and pad or ignore.
32
00:01:21,570 --> 00:01:23,790
Here is how we can apply this in practice,
33
00:01:23,790 --> 00:01:26,460
by just adding the
return_overflowing_tokens option
34
00:01:26,460 --> 00:01:28,200
in our tokenizer call.
35
00:01:28,200 --> 00:01:30,243
Note how this gives us a bigger dataset!
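A sketch of this second approach, with the same illustrative checkpoint and context length as before; the old columns have to be dropped because one text can now yield several samples:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

raw_datasets = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
context_length = 128

def tokenize_and_chunk(examples):
    # return_overflowing_tokens=True keeps every chunk of up to context_length
    # tokens instead of discarding everything after the first one.
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
    )

# The number of rows grows, so the original columns must be removed.
tokenized_datasets = raw_datasets.map(
    tokenize_and_chunk,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)
print(len(raw_datasets["train"]), "->", len(tokenized_datasets["train"]))
```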
36
00:01:31,560 --> 00:01:34,260
This second way of chunking
is ideal if all your texts
37
00:01:34,260 --> 00:01:36,270
are very long, but it won't work
38
00:01:36,270 --> 00:01:39,900
as nicely if you have a variety
of lengths in the texts.
39
00:01:39,900 --> 00:01:41,040
In this case,
40
00:01:41,040 --> 00:01:44,280
the best option is to concatenate
all your tokenized texts
41
00:01:44,280 --> 00:01:46,560
in one big stream, with special tokens
42
00:01:46,560 --> 00:01:49,800
to indicate when you pass from
one document to the other,
43
00:01:49,800 --> 00:01:52,503
and only then split the
big stream into chunks.
44
00:01:53,760 --> 00:01:55,620
Here is how it can be done with code,
45
00:01:55,620 --> 00:01:58,230
with one loop to concatenate all the texts
46
00:01:58,230 --> 00:01:59,673
and another one to chunk it.
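A sketch of that code, again with illustrative choices (bert-base-cased tokenizer, imdb dataset, chunks of 128 tokens); the special tokens the tokenizer adds around each text, such as [SEP], are what marks the passage from one document to the next in the stream:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

raw_datasets = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
chunk_size = 128  # illustrative chunk size

def tokenize_function(examples):
    # No truncation here: full texts are kept and split later.
    return tokenizer(examples["text"])

tokenized_datasets = raw_datasets.map(
    tokenize_function, batched=True, remove_columns=raw_datasets["train"].column_names
)

def group_texts(examples):
    # First loop: concatenate each field of the batch into one big stream.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated["input_ids"])
    # Drop the remainder shorter than chunk_size (it could also be kept and padded).
    total_length = (total_length // chunk_size) * chunk_size
    # Second loop: split the stream into chunks of chunk_size tokens.
    return {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated.items()
    }

lm_datasets = tokenized_datasets.map(group_texts, batched=True)
```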
47
00:02:00,780 --> 00:02:02,850
Notice how it reduces
the number of samples
48
00:02:02,850 --> 00:02:04,230
in our dataset here;
49
00:02:04,230 --> 00:02:06,580
there must have been
quite a few short entries!
50
00:02:07,710 --> 00:02:11,130
Once this is done, the
masking is the easy part.
51
00:02:11,130 --> 00:02:13,400
There is a data collator
designed specifically for this
52
00:02:13,400 --> 00:02:15,540
in the Transformers library.
53
00:02:15,540 --> 00:02:17,700
You can use it directly in the Trainer,
54
00:02:17,700 --> 00:02:20,400
or when converting your
datasets to TensorFlow datasets
55
00:02:20,400 --> 00:02:23,703
before calling Keras fit, with
the to_tf_dataset method.
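For instance, the masking step could look like the sketch below, with an illustrative 15% masking probability; the Trainer and to_tf_dataset usages are shown as commented hints rather than a full training script:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Randomly masks tokens in each batch and builds the matching labels;
# 0.15 is the usual masking probability, adjust as needed.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# With the Trainer (model, training_args and lm_datasets defined elsewhere):
# trainer = Trainer(
#     model=model,
#     args=training_args,
#     train_dataset=lm_datasets["train"],
#     data_collator=data_collator,
# )

# With Keras, converting the dataset before calling fit:
# tf_train = lm_datasets["train"].to_tf_dataset(
#     columns=["input_ids", "attention_mask"],
#     collate_fn=data_collator,
#     shuffle=True,
#     batch_size=32,
# )
```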
56
00:02:24,992 --> 00:02:27,325
(whooshing)