subtitles/en/65_data-processing-for-question-answering.srt
1
00:00:05,580 --> 00:00:07,177
- Let's study how to preprocess a dataset
2
00:00:07,177 --> 00:00:08,643
for question answering.
3
00:00:10,200 --> 00:00:11,640
Question answering is the task
4
00:00:11,640 --> 00:00:14,343
of finding answers to a
question in some context.
5
00:00:15,270 --> 00:00:17,550
For example, we'll use the SQuAD dataset
6
00:00:17,550 --> 00:00:19,860
in which we remove columns we won't use
7
00:00:19,860 --> 00:00:21,660
and just extract the
information we will need
8
00:00:21,660 --> 00:00:22,950
for the labels,
9
00:00:22,950 --> 00:00:26,370
the start and the end of
the answer in the context.
10
00:00:26,370 --> 00:00:28,690
If you have your own dataset
for question answering,
11
00:00:28,690 --> 00:00:31,680
just make sure you clean your
data to get to the same point,
12
00:00:31,680 --> 00:00:33,900
with one column containing the questions,
13
00:00:33,900 --> 00:00:35,940
one column containing the context,
14
00:00:35,940 --> 00:00:38,610
one column for the index of
the start and end character
15
00:00:38,610 --> 00:00:40,473
of the answer in the context.
16
00:00:41,610 --> 00:00:44,520
Note that the answer must
be part of the context.
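The cleaned format described above can be sketched in plain Python. The column names here are illustrative stand-ins, not names mandated by the video: what matters is that the answer is a character span inside the context.

```python
# A minimal sketch of the cleaned question-answering data format:
# one question, one context, and the character indices of the answer
# span inside the context. Column names are illustrative.
example = {
    "question": "Where is the Eiffel Tower?",
    "context": "The Eiffel Tower is located in Paris, France.",
    "answer_start": 31,
    "answer_end": 36,  # exclusive end index of the answer
}

# The answer must be a substring of the context at exactly that span.
answer = example["context"][example["answer_start"]:example["answer_end"]]
assert answer == "Paris"
```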
17
00:00:44,520 --> 00:00:47,160
If you want to perform
generative question answering,
18
00:00:47,160 --> 00:00:50,160
look at one of the sequence to
sequence videos linked below.
19
00:00:51,600 --> 00:00:53,430
Now, if we have a look at the tokens
20
00:00:53,430 --> 00:00:54,750
we will feed our model,
21
00:00:54,750 --> 00:00:58,320
we'll see the answer lies
somewhere inside the context.
22
00:00:58,320 --> 00:01:01,080
For very long contexts, the
answer may get truncated
23
00:01:01,080 --> 00:01:02,580
by the tokenizer.
24
00:01:02,580 --> 00:01:05,970
In this case, we won't have any
proper labels for our model,
25
00:01:05,970 --> 00:01:07,680
so we should keep the truncated part
26
00:01:07,680 --> 00:01:10,203
as a separate feature
instead of discarding it.
27
00:01:11,100 --> 00:01:12,990
The only thing we need to be careful with
28
00:01:12,990 --> 00:01:15,660
is to allow some overlap
between separate chunks
29
00:01:15,660 --> 00:01:17,670
so that the answer is not truncated
30
00:01:17,670 --> 00:01:19,920
and that the feature containing the answer
31
00:01:19,920 --> 00:01:22,623
gets sufficient context
to be able to predict it.
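The overlap between chunks can be illustrated in pure Python. This is a hand-rolled sketch of the sliding window the tokenizer applies, not the tokenizer's actual implementation:

```python
def chunk_with_overlap(tokens, max_length, stride):
    """Split a token list into windows of at most max_length tokens,
    where consecutive windows share `stride` overlapping tokens."""
    if max_length <= stride:
        raise ValueError("max_length must be larger than stride")
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + max_length])
        if start + max_length >= len(tokens):
            break
        # Advance by max_length - stride so windows overlap by `stride`.
        start += max_length - stride
    return chunks

tokens = list(range(10))  # stand-in for token ids
chunks = chunk_with_overlap(tokens, max_length=6, stride=2)
print(chunks)  # [[0, 1, 2, 3, 4, 5], [4, 5, 6, 7, 8, 9]]
```

An answer sitting at tokens 4-5 would be cut at the end of the first window alone, but is fully contained in the second, overlapping window.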
32
00:01:23,490 --> 00:01:26,040
Here is how it can be
done by the tokenizer.
33
00:01:26,040 --> 00:01:29,370
We pass it the question and
context, set truncation
34
00:01:29,370 --> 00:01:33,240
for the context only, and
padding to the maximum length.
35
00:01:33,240 --> 00:01:35,340
The stride argument is
where we set the number
36
00:01:35,340 --> 00:01:36,900
of overlapping tokens,
37
00:01:36,900 --> 00:01:39,600
and setting return_overflowing_tokens
to True
38
00:01:39,600 --> 00:01:42,630
means we don't want to
discard the truncated part.
39
00:01:42,630 --> 00:01:45,210
Lastly, we also return the offset mappings
40
00:01:45,210 --> 00:01:47,220
to be able to find the
tokens corresponding
41
00:01:47,220 --> 00:01:48,693
to the answer start and end.
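Put together, the tokenizer call might look like the following. The checkpoint and the numeric values for max_length and stride are illustrative choices, not values fixed by the video; the keyword arguments follow the transformers fast-tokenizer API:

```python
from transformers import AutoTokenizer

# Checkpoint and length values are illustrative assumptions.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

inputs = tokenizer(
    "Where is the Eiffel Tower?",                     # the question
    "The Eiffel Tower is located in Paris, France.",  # the context
    truncation="only_second",        # truncate the context only
    padding="max_length",
    max_length=384,
    stride=128,                      # number of overlapping tokens
    return_overflowing_tokens=True,  # keep truncated parts as extra features
    return_offsets_mapping=True,     # map tokens back to character positions
)
print(inputs.keys())
```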
42
00:01:49,860 --> 00:01:52,290
We want those tokens because
they will be the labels
43
00:01:52,290 --> 00:01:53,970
we pass to our model.
44
00:01:53,970 --> 00:01:56,870
In a one-hot encoded version,
here is what they look like.
45
00:01:57,930 --> 00:02:00,480
If the context we have does
not contain the answer,
46
00:02:00,480 --> 00:02:03,799
we set the two labels to
the index of the CLS token.
47
00:02:03,799 --> 00:02:05,700
We also do this if the context
48
00:02:05,700 --> 00:02:07,713
only partially contains the answer.
49
00:02:08,580 --> 00:02:11,400
In terms of code, here
is how we can do it.
50
00:02:11,400 --> 00:02:13,710
Using the sequence IDs of an input,
51
00:02:13,710 --> 00:02:17,220
we can determine the beginning
and the end of the context.
52
00:02:17,220 --> 00:02:19,800
Then, we know whether to
return the CLS position
53
00:02:19,800 --> 00:02:22,290
for the two labels, or to
determine the position
54
00:02:22,290 --> 00:02:25,050
of the first and last
tokens of the answer.
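The logic of those two cases can be sketched in plain Python. The sequence_ids and offsets below are hand-written stand-ins for what a fast tokenizer would return, and find_answer_positions is a hypothetical helper, not the exact function from the video:

```python
def find_answer_positions(sequence_ids, offsets, answer_start, answer_end):
    """Return (start_token, end_token) for one feature, or (0, 0),
    the CLS position, when the answer is not fully in the context."""
    # Tokens with sequence id 1 belong to the context (0: the question).
    ctx_start = sequence_ids.index(1)
    ctx_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)

    # Answer not fully inside this feature's context: label with CLS.
    if offsets[ctx_start][0] > answer_start or offsets[ctx_end][1] < answer_end:
        return 0, 0

    # Otherwise, walk the offsets to the first and last answer tokens.
    idx = ctx_start
    while idx <= ctx_end and offsets[idx][0] <= answer_start:
        idx += 1
    start_token = idx - 1

    idx = ctx_end
    while idx >= ctx_start and offsets[idx][1] >= answer_end:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token

# Stand-ins: CLS, two question tokens, SEP, four context tokens
# covering "Paris is nice today", and a final SEP.
sequence_ids = [None, 0, 0, None, 1, 1, 1, 1, None]
offsets = [(0, 0), (0, 0), (0, 0), (0, 0),
           (0, 5), (6, 8), (9, 13), (14, 19), (0, 0)]

print(find_answer_positions(sequence_ids, offsets, 9, 13))   # (6, 6): "nice"
print(find_answer_positions(sequence_ids, offsets, 30, 35))  # (0, 0): not in context
```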
55
00:02:25,050 --> 00:02:27,800
We can check that it works
properly on our previous example.
56
00:02:28,680 --> 00:02:31,380
Putting it all together
looks like this big function,
57
00:02:31,380 --> 00:02:34,233
which we can apply to our
datasets with the map method.
58
00:02:35,310 --> 00:02:37,920
Since we applied padding
during the tokenization,
59
00:02:37,920 --> 00:02:40,680
we can then use this
directly in the Trainer,
60
00:02:40,680 --> 00:02:44,133
or apply the to_tf_dataset
method to use Keras' fit.