1 00:00:00,554 --> 00:00:03,304 (logo whooshing)
2 00:00:05,340 --> 00:00:07,563 - What happens inside the pipeline function?
3 00:00:08,760 --> 00:00:11,580 In this video, we will look at what actually happens
4 00:00:11,580 --> 00:00:13,080 when we use the pipeline function
5 00:00:13,080 --> 00:00:15,090 of the Transformers library.
6 00:00:15,090 --> 00:00:16,860 More specifically, we will look
7 00:00:16,860 --> 00:00:19,200 at the sentiment analysis pipeline,
8 00:00:19,200 --> 00:00:22,020 and how it went from the two following sentences
9 00:00:22,020 --> 00:00:23,970 to the positive and negative labels
10 00:00:23,970 --> 00:00:25,420 with their respective scores.
11 00:00:26,760 --> 00:00:29,190 As we have seen in the pipeline presentation,
12 00:00:29,190 --> 00:00:31,860 there are three stages in the pipeline.
13 00:00:31,860 --> 00:00:34,620 First, we convert the raw texts to numbers
14 00:00:34,620 --> 00:00:37,173 the model can make sense of, using a tokenizer.
15 00:00:38,010 --> 00:00:40,530 Then those numbers go through the model,
16 00:00:40,530 --> 00:00:41,943 which outputs logits.
17 00:00:42,780 --> 00:00:45,600 Finally, the post-processing step transforms
18 00:00:45,600 --> 00:00:48,150 those logits into labels and scores.
19 00:00:48,150 --> 00:00:50,700 Let's look in detail at those three steps
20 00:00:50,700 --> 00:00:53,640 and how to replicate them using the Transformers library,
21 00:00:53,640 --> 00:00:56,043 beginning with the first stage, tokenization.
22 00:00:57,915 --> 00:01:00,360 The tokenization process has several steps.
23 00:01:00,360 --> 00:01:04,950 First, the text is split into small chunks called tokens.
24 00:01:04,950 --> 00:01:08,550 They can be words, parts of words, or punctuation symbols.
25 00:01:08,550 --> 00:01:11,580 Then the tokenizer will add some special tokens,
26 00:01:11,580 --> 00:01:13,500 if the model expects them.
27 00:01:13,500 --> 00:01:16,860 Here the model expects a CLS token at the beginning
28 00:01:16,860 --> 00:01:19,743 and a SEP token at the end of the sentence to classify.
29 00:01:20,580 --> 00:01:24,180 Lastly, the tokenizer matches each token to its unique ID
30 00:01:24,180 --> 00:01:27,000 in the vocabulary of the pretrained model.
31 00:01:27,000 --> 00:01:28,680 To load such a tokenizer,
32 00:01:28,680 --> 00:01:31,743 the Transformers library provides the AutoTokenizer API.
33 00:01:32,730 --> 00:01:36,120 The most important method of this class is from_pretrained,
34 00:01:36,120 --> 00:01:38,910 which will download and cache the configuration
35 00:01:38,910 --> 00:01:41,853 and the vocabulary associated with a given checkpoint.
36 00:01:43,200 --> 00:01:45,360 Here the checkpoint used by default
37 00:01:45,360 --> 00:01:47,280 for the sentiment analysis pipeline
38 00:01:47,280 --> 00:01:51,986 is distilbert-base-uncased-finetuned-sst-2-english.
39 00:01:51,986 --> 00:01:53,700 (indistinct)
40 00:01:53,700 --> 00:01:56,490 We instantiate a tokenizer associated with that checkpoint,
41 00:01:56,490 --> 00:01:59,490 then feed it the two sentences.
42 00:01:59,490 --> 00:02:02,100 Since those two sentences are not of the same size,
43 00:02:02,100 --> 00:02:03,930 we will need to pad the shortest one
44 00:02:03,930 --> 00:02:06,030 to be able to build an array.
45 00:02:06,030 --> 00:02:09,840 This is done by the tokenizer with the option padding=True.
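As a rough illustration of this tokenization stage, here is a minimal Python sketch using the default checkpoint mentioned above; the two example sentences are placeholders rather than the exact sentences shown in the video, so the printed values will differ accordingly.

from transformers import AutoTokenizer

# Default checkpoint of the sentiment analysis pipeline, as mentioned above.
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Placeholder sentences standing in for the two sentences from the video.
raw_inputs = [
    "I've been waiting for a course like this my whole life.",
    "I hate this so much!",
]

# Pad the shorter sentence, truncate anything longer than the model's maximum,
# and return PyTorch tensors.
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)  # a dictionary with the keys input_ids and attention_mask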
46 00:02:09,840 --> 00:02:12,810 With truncation=True, we ensure that any sentence
47 00:02:12,810 --> 00:02:15,873 longer than the maximum the model can handle is truncated.
48 00:02:17,010 --> 00:02:19,620 Lastly, the return_tensors option
49 00:02:19,620 --> 00:02:22,323 tells the tokenizer to return a PyTorch tensor.
50 00:02:23,190 --> 00:02:25,590 Looking at the result, we see we have a dictionary
51 00:02:25,590 --> 00:02:26,670 with two keys.
52 00:02:26,670 --> 00:02:29,970 The first key, input IDs, contains the IDs of both sentences,
53 00:02:29,970 --> 00:02:32,550 with zeros where padding is applied.
54 00:02:32,550 --> 00:02:34,260 The second key, attention mask,
55 00:02:34,260 --> 00:02:36,150 indicates where padding has been applied,
56 00:02:36,150 --> 00:02:38,940 so the model does not pay attention to it.
57 00:02:38,940 --> 00:02:42,090 That is all there is inside the tokenization step.
58 00:02:42,090 --> 00:02:46,289 Now, let's have a look at the second step, the model.
59 00:02:46,289 --> 00:02:47,952 As for the tokenizer,
60 00:02:47,952 --> 00:02:51,133 there is an AutoModel API with a from_pretrained method.
61 00:02:51,133 --> 00:02:53,954 It will download and cache the configuration of the model
62 00:02:53,954 --> 00:02:56,280 as well as the pretrained weights.
63 00:02:56,280 --> 00:02:58,200 However, the AutoModel API
64 00:02:58,200 --> 00:03:00,630 will only instantiate the body of the model,
65 00:03:00,630 --> 00:03:03,420 that is, the part of the model that is left
66 00:03:03,420 --> 00:03:06,090 once the pretraining head is removed.
67 00:03:06,090 --> 00:03:08,610 It will output a high-dimensional tensor
68 00:03:08,610 --> 00:03:11,220 that is a representation of the sentences passed,
69 00:03:11,220 --> 00:03:12,690 but which is not directly useful
70 00:03:12,690 --> 00:03:15,030 for our classification problem.
71 00:03:15,030 --> 00:03:19,230 Here the tensor has two sentences, each of 16 tokens,
72 00:03:19,230 --> 00:03:23,433 and the last dimension is the hidden size of our model, 768.
73 00:03:24,900 --> 00:03:27,510 To get an output linked to our classification problem,
74 00:03:27,510 --> 00:03:31,170 we need to use the AutoModelForSequenceClassification class.
75 00:03:31,170 --> 00:03:33,330 It works exactly like the AutoModel class,
76 00:03:33,330 --> 00:03:35,130 except that it will build a model
77 00:03:35,130 --> 00:03:36,543 with a classification head.
78 00:03:37,483 --> 00:03:39,560 There is one auto class for each common NLP task
79 00:03:39,560 --> 00:03:40,960 in the Transformers library.
80 00:03:42,150 --> 00:03:45,570 Here, after giving our model the two sentences,
81 00:03:45,570 --> 00:03:47,820 we get a tensor of size two by two:
82 00:03:47,820 --> 00:03:50,943 one result for each sentence and for each possible label.
83 00:03:51,840 --> 00:03:53,970 Those outputs are not probabilities yet;
84 00:03:53,970 --> 00:03:56,100 we can see they don't sum to 1.
85 00:03:56,100 --> 00:03:57,270 This is because each model
86 00:03:57,270 --> 00:04:00,810 of the Transformers library returns logits.
87 00:04:00,810 --> 00:04:02,250 To make sense of those logits,
88 00:04:02,250 --> 00:04:05,910 we need to dig into the third and last step of the pipeline:
89 00:04:05,910 --> 00:04:10,620 post-processing. To convert logits into probabilities,
90 00:04:10,620 --> 00:04:13,470 we need to apply a SoftMax layer to them.
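Continuing the sketch above (same assumed checkpoint, placeholder sentences, and the inputs variable from the previous snippet), the model stage and the softmax could look roughly like this; the exact sequence length in the hidden-state shape depends on the sentences used.

import torch
from transformers import AutoModel, AutoModelForSequenceClassification

# Body of the model only: returns hidden states of shape
# (batch size, sequence length, hidden size), e.g. (2, 16, 768) in the video.
model = AutoModel.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

# Same checkpoint, but with the sequence classification head attached:
# the output is one logit per sentence and per label, here a 2 x 2 tensor.
classifier = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = classifier(**inputs)
print(outputs.logits.shape)  # torch.Size([2, 2])

# Post-processing: a softmax turns the logits into probabilities.
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)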
91 00:04:13,470 --> 00:04:14,610 As we can see,
92 00:04:14,610 --> 00:04:17,267 this transforms them into positive numbers
93 00:04:17,267 --> 00:04:18,663 that sum up to one.
94 00:04:18,663 --> 00:04:21,360 The last step is to know which of those corresponds
95 00:04:21,360 --> 00:04:23,580 to the positive or the negative label.
96 00:04:23,580 --> 00:04:28,020 This is given by the id2label field of the model config.
97 00:04:28,020 --> 00:04:30,390 The first probabilities, at index zero,
98 00:04:30,390 --> 00:04:32,250 correspond to the negative label,
99 00:04:32,250 --> 00:04:34,140 and the second ones, at index one,
100 00:04:34,140 --> 00:04:36,480 correspond to the positive label.
101 00:04:36,480 --> 00:04:37,950 This is how our classifier built
102 00:04:37,950 --> 00:04:40,230 with the pipeline function picked those labels
103 00:04:40,230 --> 00:04:42,240 and computed those scores.
104 00:04:42,240 --> 00:04:44,220 Now that you know how each step works,
105 00:04:44,220 --> 00:04:46,220 you can easily tweak them to your needs.
106 00:04:47,524 --> 00:04:50,274 (logo whooshing)
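To close the walkthrough in code, here is a short sketch (reusing the classifier, raw_inputs, and predictions names assumed in the earlier snippets) of how the id2label field of the model config can be used to pair each probability with its label.

# id2label maps each output index to a label,
# e.g. {0: "NEGATIVE", 1: "POSITIVE"} for this checkpoint.
id2label = classifier.config.id2label

for sentence, probs in zip(raw_inputs, predictions):
    scores = {id2label[i]: round(float(p), 4) for i, p in enumerate(probs)}
    print(sentence, scores)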