- What happens inside the pipeline function?

In this video, we will look at what actually happens when we use the pipeline function of the Transformers library. More specifically, we will look at the sentiment analysis pipeline, and how it goes from the two following sentences to the positive and negative labels with their respective scores.

As we have seen in the pipeline video, there are three stages in the pipeline. First, we convert the raw texts to numbers the model can make sense of, using a tokenizer. Then, those numbers go through the model, which outputs logits. Finally, the post-processing step transforms those logits into labels and scores.

Let's look at those three steps in detail, and see how to replicate them with the Transformers library, beginning with the first stage, tokenization.

The tokenization process has several steps. First, the text is split into small chunks called tokens. They can be words, parts of words, or punctuation symbols. Then the tokenizer adds some special tokens, if the model expects them. Here, the model used expects a CLS token at the beginning and a SEP token at the end of the sentence to classify. Lastly, the tokenizer matches each token to its unique ID in the vocabulary of the pretrained model.

To load such a tokenizer, the Transformers library provides the AutoTokenizer API. The most important method of this class is from_pretrained, which will download and cache the configuration and the vocabulary associated with a given checkpoint.

Here, the checkpoint used by default for the sentiment analysis pipeline is distilbert-base-uncased-finetuned-sst-2-english, which is a bit of a mouthful. We instantiate a tokenizer associated with that checkpoint, then feed it the two sentences. Since those two sentences are not the same size, we will need to pad the shorter one to be able to build an array. This is done by the tokenizer with the option padding=True. With truncation=True, we also make sure that any sentence longer than the maximum length the model can handle is truncated. Lastly, the return_tensors option tells the tokenizer to return PyTorch tensors.
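As a concrete reference, here is a minimal sketch of that tokenization step. The checkpoint name is the one given in the video; the two raw sentences are placeholders standing in for the ones shown on screen.

from transformers import AutoTokenizer

# Default checkpoint of the sentiment-analysis pipeline
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Placeholder sentences standing in for the two used in the video
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

# Pad the shorter sentence, truncate anything too long, return PyTorch tensors
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)  # dictionary with "input_ids" and "attention_mask"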
Looking at the result, we see we get a dictionary with two keys. The first, input_ids, contains the IDs of both sentences, with zeros where the padding is applied. The second, attention_mask, indicates where padding has been applied, so the model does not pay attention to it. That is everything inside the tokenization step. Now let's have a look at the second step, the model.

As for the tokenizer, there is an AutoModel API, with a from_pretrained method. It will download and cache the configuration of the model as well as the pretrained weights. However, the AutoModel API will only instantiate the body of the model, that is, the part of the model that is left once the pretraining head is removed. It will output a high-dimensional tensor that is a representation of the sentences passed, but which is not directly useful for our classification problem. Here the tensor has two sentences, each of sixteen tokens, and the last dimension is the hidden size of our model, 768.

To get an output linked to our classification problem, we need to use the AutoModelForSequenceClassification class. It works exactly like the AutoModel class, except that it builds a model with a classification head. There is one auto class for each common NLP task in the Transformers library.

Here, after giving our model the two sentences, we get a tensor of size two by two: one result for each sentence and for each possible label. Those outputs are not probabilities yet; we can see they don't sum to 1. This is because each model of the Transformers library returns logits.
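Continuing the same sketch, the model step can be reproduced as follows, reusing the checkpoint and inputs from above. Both auto classes named in the video are shown; the printed shapes follow the numbers quoted there.

import torch
from transformers import AutoModel, AutoModelForSequenceClassification

# AutoModel only loads the body: its output is a hidden-state tensor of shape
# [2, sequence_length, 768], which is not directly usable for classification.
body = AutoModel.from_pretrained(checkpoint)
with torch.no_grad():
    print(body(**inputs).last_hidden_state.shape)

# AutoModelForSequenceClassification adds the classification head and returns logits.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
with torch.no_grad():
    outputs = model(**inputs)  # ** unpacks input_ids and attention_mask as keyword arguments

print(outputs.logits)        # one row per sentence, one column per label
print(outputs.logits.shape)  # torch.Size([2, 2])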
To make sense of those logits, we need to dig into the third and last step of the pipeline, post-processing. To convert logits into probabilities, we apply a softmax layer to them. As we can see, this transforms them into positive numbers that sum to 1.

The last step is to know which of those corresponds to the positive or the negative label. This is given by the id2label field of the model config. The probabilities at index 0 correspond to the negative label, and those at index 1 correspond to the positive label. This is how our classifier, built with the pipeline function, picked those labels and computed those scores.

Now that you know how each step works, you can easily tweak them to your needs.
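Finally, here is a sketch of the post-processing step, reusing the model and outputs from the snippets above. The softmax call and the id2label field are the pieces mentioned in the video; the final loop is one way to pick the label and score the way the pipeline does.

import torch

# Softmax turns the logits into probabilities that sum to 1 for each sentence
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

# The mapping from index to label lives in the model config
# (for this checkpoint: {0: "NEGATIVE", 1: "POSITIVE"})
print(model.config.id2label)

# Reproduce the pipeline output: best label and its score for each sentence
for probs in predictions:
    label_id = int(torch.argmax(probs))
    print(model.config.id2label[label_id], round(float(probs[label_id]), 4))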