(transition music)

- In this video, we take a look at the data processing necessary to train causal language models.

Causal language modeling is the task of predicting the next token based on the previous ones. Another term for causal language modeling is autoregressive modeling. In the example that you can see here, the next token could, for example, be NLP, or it could be machine learning. A popular example of causal language models is the GPT family of models.

To train models such as GPT, we usually start with a large corpus of text files. These files can be webpages scraped from the internet, such as the Common Crawl dataset, or they can be Python files from GitHub, like the ones you can see here.

As a first step, we need to tokenize these files so that we can feed them through the model. Here, we show the tokenized texts as bars of various lengths, illustrating that there are shorter and longer ones. This is very common when working with text. However, Transformer models have a limited context window, and depending on the data source, it is possible that the tokenized texts are much longer than this window.

In this case, we could just truncate the sequences to the context length, but this would mean that we lose everything after the first context window. Using the return_overflowing_tokens flag, we can instead use the tokenizer to create chunks, with each one being the size of the context length. Sometimes, it can still happen that the last chunk is too short if there aren't enough tokens to fill it. In this case, we can just remove it. With the return_length keyword, we also get the length of each chunk from the tokenizer.

This function shows all the steps necessary to prepare the dataset. First, we tokenize the dataset with the flags I just mentioned. Then, we go through each chunk, and if its length matches the context length, we add it to the inputs we return. We can apply this function to the whole dataset.
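To make this concrete, here is a minimal sketch of what such a function could look like, assuming a GPT-2 tokenizer, a context length of 128 tokens, and a dataset whose text lives in a "content" column (these names and values are illustrative, not taken from the video):

```python
from transformers import AutoTokenizer

# Assumed setup: a GPT-2 tokenizer and a context window of 128 tokens.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
context_length = 128

def tokenize(element):
    # Tokenize with truncation, but keep the overflowing tokens so that each
    # long text is split into chunks of at most `context_length` tokens.
    outputs = tokenizer(
        element["content"],  # assumed name of the text column
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )
    # Keep only the chunks that exactly fill the context window;
    # the last, shorter chunk of each text is simply dropped.
    input_batch = []
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if length == context_length:
            input_batch.append(input_ids)
    return {"input_ids": input_batch}
```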
In addition, we make sure to use batches and to remove the existing columns. We need to remove the existing columns because we can create multiple samples per text, and the shapes in the dataset would not match anymore in that case.

If the context length is similar to the length of the files, this approach doesn't work so well anymore. In this example, both sample 1 and sample 2 are shorter than the context size and would be discarded with the previous approach. In this case, it is better to first tokenize each sample without truncation and then concatenate the tokenized samples with an end-of-string, or EOS, token in between. Finally, we can chunk this long sequence with the context length, and we no longer lose many sequences because they are too short.

So far, we have only talked about the inputs for causal language modeling, but not the labels needed for supervised training. When we do causal language modeling, we don't require any extra labels for the input sequences, as the input sequences themselves are the labels. In this example, when we feed the token "Trans" to the model, the next token we want it to predict is "formers". In the next step, we feed "Trans" and "formers" to the model, and the label we want it to predict is "are".

This pattern continues, and as you can see, the label sequence is just the input sequence shifted by one. Since the model only makes predictions after the first token, the first element of the input sequence, in this case "Trans", is not used as a label. Similarly, we don't have a label for the last token in the sequence, since there is no token after the sequence ends.

Let's have a look at what we need to do to create the labels for causal language modeling in code. If we want to calculate a loss on a batch, we can just pass the input_ids as labels, and all the shifting is handled in the model internally.

So, you see, there's no magic involved in processing data for causal language modeling, and it only requires a few simple steps.

(transition music)
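Continuing the sketch above (it reuses the hypothetical tokenize, tokenizer, and context_length from there; raw_datasets stands for an assumed datasets.DatasetDict with a "train" split, and "gpt2" is just a placeholder checkpoint), the remaining steps could look roughly like this:

```python
import torch
from transformers import AutoModelForCausalLM

# Apply the tokenize function to the whole dataset in batches and drop the
# original columns, since each text can produce several fixed-size chunks
# and the old columns would no longer line up with the new rows.
tokenized_datasets = raw_datasets.map(
    tokenize,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)

# Alternative for short files: tokenize without truncation, join everything
# with the EOS token, then cut the long sequence into context-sized chunks.
def tokenize_and_concat(element):
    all_ids = []
    for ids in tokenizer(element["content"])["input_ids"]:
        all_ids.extend(ids + [tokenizer.eos_token_id])
    chunks = [
        all_ids[i : i + context_length]
        for i in range(0, len(all_ids), context_length)
    ]
    # Drop a trailing chunk that doesn't fill the context window.
    if chunks and len(chunks[-1]) < context_length:
        chunks = chunks[:-1]
    return {"input_ids": chunks}

# For the loss, we simply pass the input_ids as the labels as well;
# the model shifts them by one position internally.
model = AutoModelForCausalLM.from_pretrained("gpt2")
batch = torch.tensor(tokenized_datasets["train"][:4]["input_ids"])
outputs = model(input_ids=batch, labels=batch)
print(outputs.loss)
```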