- Let's study how to preprocess a dataset for question answering.

Question answering is the task of finding the answer to a question in some context. For our example, we'll use the SQuAD dataset, in which we remove the columns we won't use and just extract the information we will need for the labels: the start and the end of the answer in the context. If you have your own dataset for question answering, just make sure you clean your data to get to the same point, with one column containing the questions, one column containing the contexts, and one column for the indices of the start and end characters of the answer in the context.

Note that the answer must be part of the context. If you want to perform generative question answering, look at one of the sequence-to-sequence videos linked below.

Now, if we have a look at the tokens we will feed our model, we'll see the answer lies somewhere inside the context. For very long contexts, that answer may get truncated by the tokenizer. In this case, we won't have any proper labels for our model, so we should keep the truncated part as a separate feature instead of discarding it.
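As a sketch, here is the shape of the label information we start from: the start and end characters of the answer in the context. The example below is hypothetical data, but it follows the same format as SQuAD.

```python
# A minimal sketch of the label information we need, using a
# hypothetical example in the same format as SQuAD: the answer is
# stored as its text plus its start character in the context.
example = {
    "question": "Where is the Eiffel Tower located?",
    "context": "The Eiffel Tower is located in Paris, France.",
    "answers": {"text": ["Paris"], "answer_start": [31]},
}

# The start and end characters are what we will later turn into
# token-level labels.
start_char = example["answers"]["answer_start"][0]
end_char = start_char + len(example["answers"]["text"][0])

# The answer must be a span of the context.
assert example["context"][start_char:end_char] == "Paris"
```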
The only thing we need to be careful with is to allow some overlap between separate chunks, so that the answer is not truncated and so that the feature containing the answer gets sufficient context to be able to predict it.

Here is how this can be done by the tokenizer. We pass it the question and the context, set truncation for the context only, and set padding to the maximum length. The stride argument is where we set the number of overlapping tokens, and return_overflowing_tokens=True means we don't want to discard the truncated part. Lastly, we also return the offset mappings to be able to find the tokens corresponding to the answer's start and end.

We want those tokens because they will be the labels we pass to our model. In a one-hot encoded version, here is what they look like. If the context we have does not contain the answer, we set the two labels to the index of the CLS token. We also do this if the context only partially contains the answer.

In terms of code, here is how we can do it. Using the sequence IDs of an input, we can determine the beginning and the end of the context.
Then we know whether we have to use the CLS position for the two labels, or we determine the positions of the first and last tokens of the answer. We can check that it works properly on our previous example.

Putting it all together looks like this big function, which we can apply to our datasets with the map method. Since we applied padding during the tokenization, we can then use this directly in the Trainer, or apply the to_tf_dataset method to use Keras.fit.
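To illustrate why that big function is applied with a batched map, here is a plain-Python sketch (a hypothetical stand-in for the tokenizer, not the real one) of the overlapping chunking: one long example can produce several features, so the preprocessing function may return more rows than it receives.

```python
def chunk_with_stride(tokens, max_length=4, stride=1):
    """Split a token list into overlapping chunks, mimicking the
    tokenizer's truncation with a stride (hypothetical helper)."""
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + max_length])
        if start + max_length >= len(tokens):
            break
        # Step forward while keeping `stride` tokens of overlap.
        start += max_length - stride
    return chunks


# One "example" of 7 tokens becomes two overlapping features; this is
# why the map call must be batched, so the number of rows can change.
print(chunk_with_stride(list(range(7))))  # -> [[0, 1, 2, 3], [3, 4, 5, 6]]
```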