- Let's see how to preprocess a dataset for summarization. This is the task of, well, summarizing a long document.

This video will focus on how to preprocess your dataset once you have managed to put it in the following format: one column for the long documents, and one for the summaries. Here is how we can achieve this with the Datasets library on the XSUM dataset. As long as you manage to have your data look like this, you should be able to follow the same steps.

For once, our labels are not integers corresponding to some classes, but plain text. We will thus need to tokenize them, like our inputs. There is a small trap there though, as we need to tokenize our targets inside the as_target_tokenizer context manager. This is because the special tokens we add might be slightly different for the inputs and the targets, so the tokenizer has to know which one it is processing. Processing the whole dataset is then super easy with the map function.

Since the summaries are usually much shorter than the documents, you should definitely pick different maximum lengths for the inputs and targets. You can choose to pad at this stage to that maximum length by setting padding=max_length. Here we'll show you how to pad dynamically, as it requires one more step.

Your inputs and targets are all sentences of various lengths. We'll pad the inputs and targets separately, as their maximum lengths are completely different. We then pad the inputs to the longest length among the inputs, and do the same for the targets. We pad the inputs with the pad token, and the targets with the -100 index, to make sure they are not taken into account in the loss computation.

The Transformers library provides us with a data collator to do all of this automatically. You can then pass it to the Trainer with your datasets, or use it in the to_tf_dataset method before calling model.fit on your Keras model.
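Here is a minimal sketch of the preprocessing described above, using the Datasets and Transformers libraries on XSUM. The "t5-small" checkpoint and the two maximum lengths are illustrative assumptions, not values prescribed in the video:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

raw_datasets = load_dataset("xsum")  # columns: "document" (long text) and "summary"
tokenizer = AutoTokenizer.from_pretrained("t5-small")  # illustrative checkpoint

max_input_length = 512   # the documents are long
max_target_length = 64   # the summaries are much shorter


def preprocess_function(examples):
    # Tokenize the long documents (the inputs).
    model_inputs = tokenizer(
        examples["document"], max_length=max_input_length, truncation=True
    )
    # Tokenize the summaries (the targets) inside the context manager,
    # so the tokenizer knows it is processing targets and adds the right
    # special tokens.
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples["summary"], max_length=max_target_length, truncation=True
        )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


# Process the whole dataset with the map function.
tokenized_datasets = raw_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)
```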
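And here is a sketch of the dynamic-padding step, reusing the tokenizer, the tokenized datasets, and the "t5-small" checkpoint assumed in the previous snippet. DataCollatorForSeq2Seq pads the inputs with the pad token and the labels with -100; the output directory name and batch size are placeholders:

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Pads the inputs with the pad token and the labels with -100, so the padded
# label positions are ignored in the loss computation.
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# Option 1: pass the collator to the Trainer along with the datasets.
trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments("summarization-test"),  # placeholder output dir
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

# Option 2 (Keras): use the collator in to_tf_dataset before calling model.fit.
# With a TF model you would build the collator with return_tensors="tf".
# tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
#     columns=["input_ids", "attention_mask", "labels"],
#     collate_fn=data_collator,
#     shuffle=True,
#     batch_size=8,
# )
```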