- Let's study how to preprocess a dataset for question answering.

Question answering is the task of finding the answer to a question in some context. For our example, we'll use the SQuAD dataset, in which we remove the columns we won't use and just extract the information we will need for the labels: the start and the end of the answer in the context. If you have your own dataset for question answering, just make sure you clean your data to get to the same point, with one column containing the questions, one column containing the contexts, and one column for the indices of the start and end characters of the answer in the context.

Note that the answer must be part of the context. If you want to perform generative question answering, look at one of the sequence-to-sequence videos linked below.

Now, if we have a look at the tokens we will feed our model, we'll see the answer lies somewhere inside the context. For very long contexts, that answer may get truncated by the tokenizer. In this case, we won't have any proper labels for our model, so we should keep the truncated part as a separate feature instead of discarding it.
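As a sketch, here is the shape of the label information we start from: the start and end characters of the answer in the context. The example below is hypothetical data, but it follows the same format as SQuAD.

```python
# A minimal sketch of the label information we need, using a
# hypothetical example in the same format as SQuAD: the answer is
# stored as its text plus its start character in the context.
example = {
    "question": "Where is the Eiffel Tower located?",
    "context": "The Eiffel Tower is located in Paris, France.",
    "answers": {"text": ["Paris"], "answer_start": [31]},
}

# The start and end characters are what we will later turn into
# token-level labels.
start_char = example["answers"]["answer_start"][0]
end_char = start_char + len(example["answers"]["text"][0])

# The answer must be a span of the context.
assert example["context"][start_char:end_char] == "Paris"
```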
The only thing we need to be careful with is to allow some overlap between separate chunks, so that the answer is not truncated and so that the feature containing the answer gets sufficient context to be able to predict it.

Here is how this can be done by the tokenizer. We pass it the question and the context, set truncation for the context only, and set padding to the maximum length. The stride argument is where we set the number of overlapping tokens, and return_overflowing_tokens=True means we don't want to discard the truncated part. Lastly, we also return the offset mappings to be able to find the tokens corresponding to the answer's start and end.

We want those tokens because they will be the labels we pass to our model. In a one-hot encoded version, here is what they look like. If the context we have does not contain the answer, we set the two labels to the index of the CLS token. We also do this if the context only partially contains the answer.

In terms of code, here is how we can do it. Using the sequence IDs of an input, we can determine the beginning and the end of the context.
Then we know whether we have to use the CLS position for the two labels, or we determine the positions of the first and last tokens of the answer. We can check that it works properly on our previous example.

Putting it all together looks like this big function, which we can apply to our datasets with the map method. Since we applied padding during the tokenization, we can then use this directly in the Trainer, or apply the to_tf_dataset method to use Keras.fit.
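To illustrate why that big function is applied with a batched map, here is a plain-Python sketch (a hypothetical stand-in for the tokenizer, not the real one) of the overlapping chunking: one long example can produce several features, so the preprocessing function may return more rows than it receives.

```python
def chunk_with_stride(tokens, max_length=4, stride=1):
    """Split a token list into overlapping chunks, mimicking the
    tokenizer's truncation with a stride (hypothetical helper)."""
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + max_length])
        if start + max_length >= len(tokens):
            break
        # Step forward while keeping `stride` tokens of overlap.
        start += max_length - stride
    return chunks


# One "example" of 7 tokens becomes two overlapping features; this is
# why the map call must be batched, so the number of rows can change.
print(chunk_with_stride(list(range(7))))  # -> [[0, 1, 2, 3], [3, 4, 5, 6]]
```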