- Let's see how to preprocess a dataset for translation. This is the task of translating a sentence into another language. This video will focus on how to preprocess your dataset once you've managed to put it in the following format: one column for the input texts and one for the target texts. Here is how we can achieve this with the Datasets library and the KDE4 dataset for English-to-French translation. As long as you manage to have your data look like this, you should be able to follow the same steps.

For once, our labels are not integers corresponding to some classes, but plain text. We will thus need to tokenize them, like our inputs. There is a trap there, though: if you tokenize your targets like your inputs, you will hit a problem. Even if you don't speak French, you might notice some weird things in the tokenization of the targets: most of the words are tokenized into several subtokens, while "fish", one of the only English words, is tokenized as a single word. That's because our targets have been tokenized as English. Since our model knows two languages, you have to warn it when tokenizing the targets so it switches to French mode. This is done with the as_target_tokenizer context manager. You can see how it results in a more compact tokenization.

Processing the whole dataset is then super easy with the map function. You can pick different maximum lengths for the inputs and targets, and choose to pad at this stage to those maximum lengths by setting padding=max_length. Here we'll show you how to pad dynamically, as it requires one more step.

Your inputs and targets are all sentences of various lengths. We will pad the inputs and targets separately, as the maximum lengths of the inputs and targets might be different. Then we pad the inputs with the pad token and the targets with the -100 index, to make sure they are not taken into account in the loss computation.
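Here is a minimal sketch of the preprocessing described so far. The KDE4 dataset and the as_target_tokenizer context manager are named in the video; the checkpoint (Helsinki-NLP/opus-mt-en-fr) and the maximum lengths are illustrative assumptions, and any sequence-to-sequence translation tokenizer would work the same way.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# KDE4 English-French pairs, as in the video.
raw_datasets = load_dataset("kde4", lang1="en", lang2="fr")

# Assumed checkpoint for illustration; swap in your own translation model.
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-fr")

max_input_length = 128   # illustrative values; inputs and targets
max_target_length = 128  # can use different maximum lengths

def preprocess_function(examples):
    inputs = [pair["en"] for pair in examples["translation"]]
    targets = [pair["fr"] for pair in examples["translation"]]
    # Tokenize the inputs in English mode.
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)
    # Switch the tokenizer to French mode for the targets.
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Process the whole dataset with map; padding is left to a later, dynamic step.
tokenized_datasets = raw_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)
```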
Once this is done, batching inputs and targets becomes super easy. The Transformers library provides us with a data collator to do all of this automatically. You can then pass it to the Trainer with your datasets, or use it in the to_tf_dataset method before calling model.fit() on your Keras model.
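A sketch of that last step, assuming the tokenized_datasets and tokenizer from above: DataCollatorForSeq2Seq is the collator that pads inputs with the pad token and labels with -100. The checkpoint, output directory, batch size, and the absence of an evaluation split are assumptions made for illustration.

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Same illustrative checkpoint as in the preprocessing sketch.
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-fr")

# Pads input_ids with the tokenizer's pad token and labels with -100,
# so padded label positions are ignored in the loss computation.
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# PyTorch path: pass the collator to the Trainer with your datasets.
training_args = Seq2SeqTrainingArguments(output_dir="marian-finetuned-kde4-en-to-fr")
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)
trainer.train()

# Keras path (sketch): build a tf.data.Dataset with the same collator
# (created with return_tensors="tf" and a TF model), then call model.fit().
# tf_train = tokenized_datasets["train"].to_tf_dataset(
#     columns=["input_ids", "attention_mask", "labels"],
#     shuffle=True,
#     batch_size=32,
#     collate_fn=DataCollatorForSeq2Seq(tokenizer, model=tf_model, return_tensors="tf"),
# )
# tf_model.fit(tf_train, epochs=3)
```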