- Let's have a look inside the question answering pipeline.

The question answering pipeline can extract answers to questions from a given context or passage of text, like this part of the Transformers repo README. It also works for very long contexts, even if the answer is at the very end, like in this example. In this video, we'll see why.

The question answering pipeline follows the same steps as the other pipelines: the question and context are tokenized as a sentence pair, fed to the model, then some post-processing is applied. The tokenization and model steps should be familiar. We use the auto class suitable for question answering instead of sequence classification, but one key difference with text classification is that our model outputs two tensors, named start logits and end logits. Why is that? Well, this is the way the model finds the answer to the question.

First, let's have a look at the model inputs. They are the numbers associated with the tokenization of the question followed by the context, with the usual CLS and SEP special tokens. The answer is a part of those tokens, so we ask the model to predict which token starts the answer and which token ends it. For our two logit outputs, the theoretical labels are the pink and purple vectors.

To convert those logits into probabilities, we will need to apply a softmax, like in the text classification pipeline. We just mask the tokens that are not part of the context before doing that, leaving the initial CLS token unmasked, as we use it to predict an impossible answer. This is what it looks like in terms of code: we use a large negative number for the masking, since its exponential will then be zero.

Now, each pair of start and end positions corresponding to a possible answer gets a score that is the product of the start probability and the end probability at those positions.
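As a quick reminder of the pipeline's interface, here is a minimal sketch; the question and context strings are purely illustrative:

```python
from transformers import pipeline

question_answerer = pipeline("question-answering")
result = question_answerer(
    question="Which deep learning libraries back Transformers?",
    context="Transformers is backed by the three most popular deep learning "
            "libraries: Jax, PyTorch and TensorFlow.",
)
print(result)
# {'score': ..., 'start': ..., 'end': ..., 'answer': '...'}
```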
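Under the hood, the tokenization and model steps might look like this rough sketch; the SQuAD-finetuned checkpoint name is an assumption chosen for illustration:

```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Illustrative checkpoint: any extractive QA model fine-tuned on SQuAD works.
model_checkpoint = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

question = "Which deep learning libraries back Transformers?"
context = (
    "Transformers is backed by the three most popular deep learning "
    "libraries: Jax, PyTorch and TensorFlow, with seamless integration "
    "between them."
)

# The question and context are tokenized together as a sentence pair.
inputs = tokenizer(question, context, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Two tensors of shape (batch_size, sequence_length): one logit per token
# for being the start of the answer, and one for being the end.
start_logits = outputs.start_logits
end_logits = outputs.end_logits
```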
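The masking, softmax, and scoring steps could then be sketched as follows, reusing `inputs`, `start_logits`, and `end_logits` from the previous snippet; `sequence_ids()` labels each token as question (0), context (1), or special token (None):

```python
import torch

# Mask every token that is not part of the context, keeping the initial
# CLS token unmasked so it can be used to predict an impossible answer.
sequence_ids = inputs.sequence_ids()
mask = [seq_id != 1 for seq_id in sequence_ids]
mask[0] = False  # leave the CLS token unmasked
mask = torch.tensor(mask)[None]

# A large negative logit has an exponential of (nearly) zero, so masked
# positions end up with ~0 probability after the softmax.
start_logits[mask] = -10000
end_logits[mask] = -10000

start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)[0]
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)[0]

# Score of an answer spanning tokens (i, j) is start_prob[i] * end_prob[j];
# the outer product computes all candidate scores at once.
scores = start_probabilities[:, None] * end_probabilities[None, :]
```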
Of course, a start index greater than an end index corresponds to an impossible answer. Here is the code to find the best score for a possible answer. Once we have the start and end positions of the tokens, we use the offset mappings provided by our tokenizer to find the span of characters in the initial context, and we get our answer.

Now, when the context is long, it might get truncated by the tokenizer. This might result in part of the answer, or worse, the whole answer, being truncated. So we don't discard the truncated tokens, but build new features with them. Each of those features contains the question, then a chunk of text from the context. If we take disjoint chunks of text, we might end up with the answer being split between two features. So instead, we take overlapping chunks of text, to make sure at least one of the chunks will fully contain the answer to the question. The Tokenizers library does all of this for us automatically with the return_overflowing_tokens option; the stride argument controls the number of overlapping tokens.

Here is how our very long context gets truncated into two features with some overlap. By applying the same post-processing we saw before to each feature, we get an answer with a score for each of them, and we take the answer with the best score as the final solution.
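Going back to the best-span search, here is one possible sketch that reuses the `scores` matrix from above; `torch.triu` zeroes out every pair where the start index would be greater than the end index:

```python
import torch

# A start index greater than the end index is impossible, so keep only the
# upper triangle of the score matrix, then pick the best remaining pair.
scores = torch.triu(scores)
max_index = scores.argmax().item()
start_index = max_index // scores.shape[1]
end_index = max_index % scores.shape[1]

# Offset mappings give each token's (start, end) character span in the
# original text, letting us cut the answer out of the context string.
inputs_with_offsets = tokenizer(question, context, return_offsets_mapping=True)
offsets = inputs_with_offsets["offset_mapping"]

start_char, _ = offsets[start_index]
_, end_char = offsets[end_index]
answer = {
    "answer": context[start_char:end_char],
    "start": start_char,
    "end": end_char,
    "score": scores[start_index, end_index].item(),
}
print(answer)
```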
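And for the long-context case, the tokenizer call might look like this sketch, where `long_context` stands in for a context longer than the model's maximum length, and the `max_length` and `stride` values are illustrative:

```python
# long_context is assumed to be a passage longer than the model can handle
# in one pass; max_length and stride are illustrative values.
inputs = tokenizer(
    question,
    long_context,
    stride=128,                 # number of overlapping tokens between chunks
    max_length=384,
    padding="longest",
    truncation="only_second",   # only truncate the context, never the question
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)

# Each entry of input_ids is one feature: the question plus one chunk of
# the context, with 128 tokens shared between consecutive chunks.
print(len(inputs["input_ids"]))  # e.g. 2
```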