1
00:00:04,230 --> 00:00:07,699
- Let's have a look inside the question answering pipeline.

2
00:00:07,699 --> 00:00:10,680
The question answering pipeline can extract answers

3
00:00:10,680 --> 00:00:14,190
to questions from a given context or passage of text,

4
00:00:14,190 --> 00:00:16,540
like this part of the transformers repo README.

5
00:00:18,060 --> 00:00:20,310
It also works for very long contexts,

6
00:00:20,310 --> 00:00:23,850
even if the answer is at the very end, like in this example.

7
00:00:23,850 --> 00:00:25,400
In this video, we will see why.

8
00:00:26,820 --> 00:00:29,460
The question answering pipeline follows the same steps

9
00:00:29,460 --> 00:00:31,050
as the other pipelines:

10
00:00:31,050 --> 00:00:34,200
the question and context are tokenized as a sentence pair,

11
00:00:34,200 --> 00:00:37,955
fed to the model, then some post-processing is applied.

12
00:00:37,955 --> 00:00:41,730
The tokenization and model steps should be familiar.

13
00:00:41,730 --> 00:00:44,610
We use the auto class suitable for question answering

14
00:00:44,610 --> 00:00:47,070
instead of sequence classification,

15
00:00:47,070 --> 00:00:49,392
but one key difference with text classification

16
00:00:49,392 --> 00:00:52,980
is that our model outputs two tensors named start logits

17
00:00:52,980 --> 00:00:54,570
and end logits.

18
00:00:54,570 --> 00:00:55,830
Why is that?

19
00:00:55,830 --> 00:00:57,930
Well, this is the way the model finds the answer

20
00:00:57,930 --> 00:00:58,803
to the question.

21
00:00:59,790 --> 00:01:02,130
First, let's have a look at the model inputs.

22
00:01:02,130 --> 00:01:04,350
It's the numbers associated with the tokenization

23
00:01:04,350 --> 00:01:06,843
of the question followed by the context

24
00:01:06,843 --> 00:01:09,723
with the usual CLS and SEP special tokens.

25
00:01:10,620 --> 00:01:13,320
The answer is a part of those tokens.

26
00:01:13,320 --> 00:01:15,510
So we ask the model to predict which token starts

27
00:01:15,510 --> 00:01:17,373
the answer and which token ends the answer.

28
00:01:18,548 --> 00:01:19,650
For our two logit outputs,

29
00:01:19,650 --> 00:01:22,803
the theoretical labels are the pink and purple vectors.

30
00:01:24,300 --> 00:01:26,430
To convert those logits into probabilities,

31
00:01:26,430 --> 00:01:28,436
we will need to apply a SoftMax,

32
00:01:28,436 --> 00:01:30,360
like in the text classification pipeline.

33
00:01:30,360 --> 00:01:33,390
We just mask the tokens that are not part of the context

34
00:01:33,390 --> 00:01:36,855
before doing that, leaving the initial CLS token unmasked

35
00:01:36,855 --> 00:01:39,303
as we use it to predict an impossible answer.

36
00:01:40,267 --> 00:01:43,500
This is what it looks like in terms of code.

37
00:01:43,500 --> 00:01:45,870
We use a large negative number for the masking,

38
00:01:45,870 --> 00:01:48,957
since its exponential will then be zero.

39
00:01:48,957 --> 00:01:50,580
Now, for each start

40
00:01:50,580 --> 00:01:53,550
and end position corresponding to a possible answer,

41
00:01:53,550 --> 00:01:55,050
we give a score that is the product

42
00:01:55,050 --> 00:01:57,630
of the start probability and end probability

43
00:01:57,630 --> 00:01:58,803
at those positions.

44
00:02:00,120 --> 00:02:02,670
Of course, a start index greater than an end index

45
00:02:02,670 --> 00:02:04,503
corresponds to an impossible answer.
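[A minimal sketch of the steps described so far. The checkpoint, question, and context below are illustrative assumptions, not necessarily the ones shown on screen; the masking and scoring follow the procedure the narration lays out.]

from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

# Assumed checkpoint for illustration.
model_checkpoint = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

question = "Which deep learning libraries back Transformers?"
context = "Transformers is backed by the three most popular deep learning libraries: Jax, PyTorch and TensorFlow."

# The question and context are tokenized together as a sentence pair.
inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
start_logits = outputs.start_logits
end_logits = outputs.end_logits

# Mask every token that is not part of the context, keeping the CLS token
# (used to predict an impossible answer). A large negative number means the
# softmax probability of those positions is effectively zero.
mask = torch.tensor([s != 1 for s in inputs.sequence_ids()])
mask[0] = False
start_logits[0, mask] = -10000
end_logits[0, mask] = -10000

# Turn the logits into probabilities, then score every (start, end) pair by the
# product of the start and end probabilities; torch.triu discards the pairs
# where the start index is greater than the end index.
start_probs = torch.softmax(start_logits, dim=-1)[0]
end_probs = torch.softmax(end_logits, dim=-1)[0]
scores = torch.triu(start_probs[:, None] * end_probs[None, :])

best_index = scores.argmax().item()
start_token = best_index // scores.shape[1]
end_token = best_index % scores.shape[1]
print(start_token, end_token, scores[start_token, end_token].item())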
46
00:02:05,430 --> 00:02:07,080
Here is the code to find the best score

47
00:02:07,080 --> 00:02:08,820
for a possible answer.

48
00:02:08,820 --> 00:02:11,430
Once we have the start and end positions of the tokens,

49
00:02:11,430 --> 00:02:14,130
we use the offset mappings provided by our tokenizer

50
00:02:14,130 --> 00:02:16,950
to find the span of characters in the initial context,

51
00:02:16,950 --> 00:02:17,900
and get our answer.

52
00:02:19,470 --> 00:02:21,900
Now, when the context is long, it might get truncated

53
00:02:21,900 --> 00:02:22,750
by the tokenizer.

54
00:02:23,760 --> 00:02:26,220
This might result in part of the answer, or worse,

55
00:02:26,220 --> 00:02:28,113
the whole answer, being truncated.

56
00:02:29,100 --> 00:02:31,050
So we don't discard the truncated tokens

57
00:02:31,050 --> 00:02:33,330
but build new features with them.

58
00:02:33,330 --> 00:02:35,994
Each of those features contains the question,

59
00:02:35,994 --> 00:02:39,240
then a chunk of text from the context.

60
00:02:39,240 --> 00:02:41,430
If we take disjoint chunks of text,

61
00:02:41,430 --> 00:02:43,530
we might end up with the answer being split

62
00:02:43,530 --> 00:02:45,330
between two features.

63
00:02:45,330 --> 00:02:48,060
So instead, we take overlapping chunks of text,

64
00:02:48,060 --> 00:02:50,640
to make sure at least one of the chunks will fully contain

65
00:02:50,640 --> 00:02:51,990
the answer to the question.

66
00:02:52,830 --> 00:02:55,260
The tokenizers do all of this for us automatically

67
00:02:55,260 --> 00:02:58,170
with the return_overflowing_tokens option.

68
00:02:58,170 --> 00:02:59,700
The stride argument controls

69
00:02:59,700 --> 00:03:02,070
the number of overlapping tokens.

70
00:03:02,070 --> 00:03:04,020
Here is how our very long context gets truncated

71
00:03:04,020 --> 00:03:05,850
into two features with some overlap.

72
00:03:05,850 --> 00:03:07,950
By applying the same post-processing we saw before

73
00:03:07,950 --> 00:03:10,636
to each feature, we get an answer with a score

74
00:03:10,636 --> 00:03:12,453
for each of them,

75
00:03:12,453 --> 00:03:14,910
and we take the answer with the best score

76
00:03:14,910 --> 00:03:16,203
as the final solution.
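[A sketch of the long-context case under the same assumptions as before. return_overflowing_tokens, stride, and return_offsets_mapping are real tokenizer options; the checkpoint, placeholder context, and the surrounding loop are illustrative, following the post-processing steps described above.]

from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

# Assumed checkpoint and illustrative inputs.
model_checkpoint = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

question = "Which deep learning libraries back Transformers?"
long_context = "..."  # placeholder for a context longer than the model's maximum length

# Split the long context into overlapping features instead of discarding the overflow.
inputs = tokenizer(
    question,
    long_context,
    max_length=384,
    stride=128,                      # number of overlapping tokens between chunks
    truncation="only_second",        # only the context is truncated, never the question
    padding=True,
    return_overflowing_tokens=True,  # keep the truncated tokens as extra features
    return_offsets_mapping=True,     # character spans, used to recover the answer text
    return_tensors="pt",
)
offsets = inputs.pop("offset_mapping")
_ = inputs.pop("overflow_to_sample_mapping")

with torch.no_grad():
    outputs = model(**inputs)
start_logits, end_logits = outputs.start_logits, outputs.end_logits

best_answer = {"score": 0.0, "text": ""}
for i in range(len(inputs["input_ids"])):
    # Mask everything that is not a context token, keeping the CLS token
    # (padding tokens have a sequence id of None, so they are masked too).
    mask = torch.tensor([s != 1 for s in inputs.sequence_ids(i)])
    mask[0] = False
    start_logits[i, mask] = -10000
    end_logits[i, mask] = -10000

    # Same post-processing as for a single feature.
    start_probs = torch.softmax(start_logits[i], dim=-1)
    end_probs = torch.softmax(end_logits[i], dim=-1)
    scores = torch.triu(start_probs[:, None] * end_probs[None, :])
    idx = scores.argmax().item()
    start_token, end_token = idx // scores.shape[1], idx % scores.shape[1]
    score = scores[start_token, end_token].item()

    # The offset mapping gives the character span of each token in long_context,
    # so the answer text can be read directly from the original string.
    if score > best_answer["score"]:
        start_char = int(offsets[i, start_token, 0])
        end_char = int(offsets[i, end_token, 1])
        best_answer = {"score": score, "text": long_context[start_char:end_char]}

print(best_answer)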