1
00:00:00,554 --> 00:00:03,304
(logo whooshing)
2
00:00:05,340 --> 00:00:07,563
- What happens inside
the pipeline function?
3
00:00:08,760 --> 00:00:11,580
In this video, we will look
at what actually happens
4
00:00:11,580 --> 00:00:13,080
when we use the pipeline function
5
00:00:13,080 --> 00:00:15,090
of the Transformers library.
6
00:00:15,090 --> 00:00:16,860
More specifically, we will look
7
00:00:16,860 --> 00:00:19,200
at the sentiment analysis pipeline,
8
00:00:19,200 --> 00:00:22,020
and how it went from the
following two sentences,
9
00:00:22,020 --> 00:00:23,970
to the positive and negative labels
10
00:00:23,970 --> 00:00:25,420
with their respective scores.
11
00:00:26,760 --> 00:00:29,190
As we have seen in the
pipeline presentation,
12
00:00:29,190 --> 00:00:31,860
there are three stages in the pipeline.
13
00:00:31,860 --> 00:00:34,620
First, we convert the raw texts to numbers
14
00:00:34,620 --> 00:00:37,173
the model can make sense
of using a tokenizer.
15
00:00:38,010 --> 00:00:40,530
Then those numbers go through the model,
16
00:00:40,530 --> 00:00:41,943
which outputs logits.
17
00:00:42,780 --> 00:00:45,600
Finally, the post-processing
step transforms
18
00:00:45,600 --> 00:00:48,150
those logits into labels and scores.
19
00:00:48,150 --> 00:00:50,700
Let's look in detail at those three steps
20
00:00:50,700 --> 00:00:53,640
and how to replicate them
using the Transformers library,
21
00:00:53,640 --> 00:00:56,043
beginning with the first
stage, tokenization.
22
00:00:57,915 --> 00:01:00,360
The tokenization process
has several steps.
23
00:01:00,360 --> 00:01:04,950
First, the text is split into
small chunks called tokens.
24
00:01:04,950 --> 00:01:08,550
They can be words, parts of
words or punctuation symbols.
25
00:01:08,550 --> 00:01:11,580
Then the tokenizer will
add some special tokens,
26
00:01:11,580 --> 00:01:13,500
if the model expects them.
27
00:01:13,500 --> 00:01:16,860
Here the model expects
a CLS token at the beginning
28
00:01:16,860 --> 00:01:19,743
and a SEP token at the end
of the sentence to classify.
29
00:01:20,580 --> 00:01:24,180
Lastly, the tokenizer matches
each token to its unique ID
30
00:01:24,180 --> 00:01:27,000
in the vocabulary of the pretrained model.
31
00:01:27,000 --> 00:01:28,680
To load such a tokenizer,
32
00:01:28,680 --> 00:01:31,743
the Transformers library
provides the AutoTokenizer API.
33
00:01:32,730 --> 00:01:36,120
The most important method of
this class is from_pretrained,
34
00:01:36,120 --> 00:01:38,910
which will download and
cache the configuration
35
00:01:38,910 --> 00:01:41,853
and the vocabulary associated
with a given checkpoint.
36
00:01:43,200 --> 00:01:45,360
Here the checkpoint used by default
37
00:01:45,360 --> 00:01:47,280
for the sentiment analysis pipeline
38
00:01:47,280 --> 00:01:51,986
is
distilbert-base-uncased-finetuned-sst-2-english.
39
00:01:51,986 --> 00:01:53,700
(indistinct)
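As a minimal sketch of this loading step (assuming the transformers library is installed), the tokenizer for that default checkpoint can be obtained like this:

from transformers import AutoTokenizer

# Default checkpoint of the sentiment-analysis pipeline
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)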
40
00:01:53,700 --> 00:01:56,490
We instantiate a tokenizer
associated with that checkpoint,
41
00:01:56,490 --> 00:01:59,490
then feed it the two sentences.
42
00:01:59,490 --> 00:02:02,100
Since those two sentences
are not of the same size,
43
00:02:02,100 --> 00:02:03,930
we will need to pad the shortest one
44
00:02:03,930 --> 00:02:06,030
to be able to build an array.
45
00:02:06,030 --> 00:02:09,840
This is done by the tokenizer
with the option, padding=True.
46
00:02:09,840 --> 00:02:12,810
With truncation=True, we
ensure that any sentence
47
00:02:12,810 --> 00:02:15,873
longer than the maximum the
model can handle is truncated.
48
00:02:17,010 --> 00:02:19,620
Lastly, the return_tensors option
49
00:02:19,620 --> 00:02:22,323
tells the tokenizer to
return a PyTorch tensor.
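Continuing that sketch, the two sentences can be fed to the tokenizer with the options described above; the example sentences here are illustrative placeholders:

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
# padding=True pads the shorter sentence, truncation=True cuts anything too long,
# and return_tensors="pt" asks for PyTorch tensors
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")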
50
00:02:23,190 --> 00:02:25,590
Looking at the result, we
see we have a dictionary
51
00:02:25,590 --> 00:02:26,670
with two keys.
52
00:02:26,670 --> 00:02:29,970
Input IDs contains the
IDs of both sentences,
53
00:02:29,970 --> 00:02:32,550
with zero where the padding is applied.
54
00:02:32,550 --> 00:02:34,260
The second key, attention mask,
55
00:02:34,260 --> 00:02:36,150
indicates where padding has been applied,
56
00:02:36,150 --> 00:02:38,940
so the model does not pay attention to it.
57
00:02:38,940 --> 00:02:42,090
This is all there is inside
the tokenization step.
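Printing the result of the sketch above shows that dictionary; the exact IDs depend on the checkpoint's vocabulary:

print(inputs)
# A dict-like object with two keys:
#   input_ids      - the token IDs of both sentences, padded with 0
#   attention_mask - 1 for real tokens, 0 where padding was added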
58
00:02:42,090 --> 00:02:46,289
Now, let's have a look at
the second step, the model.
59
00:02:46,289 --> 00:02:47,952
As for the tokenizer,
60
00:02:47,952 --> 00:02:51,133
there is an AutoModel API
with a from_pretrained method.
61
00:02:51,133 --> 00:02:53,954
It will download and cache
the configuration of the model
62
00:02:53,954 --> 00:02:56,280
as well as the pretrained weights.
63
00:02:56,280 --> 00:02:58,200
However, the AutoModel API
64
00:02:58,200 --> 00:03:00,630
will only instantiate
the body of the model,
65
00:03:00,630 --> 00:03:03,420
that is, the part of the model that is left
66
00:03:03,420 --> 00:03:06,090
once the pretraining head is removed.
67
00:03:06,090 --> 00:03:08,610
It will output a high-dimensional tensor
68
00:03:08,610 --> 00:03:11,220
that is a representation
of the sentences passed,
69
00:03:11,220 --> 00:03:12,690
but which is not directly useful
70
00:03:12,690 --> 00:03:15,030
for our classification problem.
71
00:03:15,030 --> 00:03:19,230
Here the tensor has two
sentences, each of 16 tokens,
72
00:03:19,230 --> 00:03:23,433
and the last dimension is the
hidden size of our model, 768.
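A sketch of this second step with the bare AutoModel, reusing the checkpoint and inputs from above:

from transformers import AutoModel

model = AutoModel.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
# With the example inputs above: torch.Size([2, 16, 768])
# (2 sentences, 16 tokens each, hidden size 768)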
73
00:03:24,900 --> 00:03:27,510
To get an output linked to
our classification problem,
74
00:03:27,510 --> 00:03:31,170
we need to use the
AutoModelForSequenceClassification class.
75
00:03:31,170 --> 00:03:33,330
It works exactly like the AutoModel class,
76
00:03:33,330 --> 00:03:35,130
except that it will build a model
77
00:03:35,130 --> 00:03:36,543
with a classification head.
78
00:03:37,483 --> 00:03:39,560
There is one auto class
for each common NLP task
79
00:03:39,560 --> 00:03:40,960
in the Transformers library.
80
00:03:42,150 --> 00:03:45,570
Here after giving our
model the two sentences,
81
00:03:45,570 --> 00:03:47,820
we get a tensor of size two by two,
82
00:03:47,820 --> 00:03:50,943
one result for each sentence
and for each possible label.
83
00:03:51,840 --> 00:03:53,970
Those outputs are not probabilities yet;
84
00:03:53,970 --> 00:03:56,100
we can see they don't sum to 1.
85
00:03:56,100 --> 00:03:57,270
This is because each model
86
00:03:57,270 --> 00:04:00,810
of the Transformers
library returns logits.
87
00:04:00,810 --> 00:04:02,250
To make sense of those logits,
88
00:04:02,250 --> 00:04:05,910
we need to dig into the third
and last step of the pipeline.
89
00:04:05,910 --> 00:04:10,620
Post-processing. To convert
logits into probabilities,
90
00:04:10,620 --> 00:04:13,470
we need to apply a SoftMax layer to them.
91
00:04:13,470 --> 00:04:14,610
As we can see,
92
00:04:14,610 --> 00:04:17,267
this transforms them into positive numbers
93
00:04:17,267 --> 00:04:18,663
that sum up to one.
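A sketch of this post-processing step in PyTorch, applying a softmax over the last dimension of the logits:

import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)
# Each row now contains positive values that sum to 1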
94
00:04:18,663 --> 00:04:21,360
The last step is to know
which of those corresponds
95
00:04:21,360 --> 00:04:23,580
to the positive or the negative label.
96
00:04:23,580 --> 00:04:28,020
This is given by the id2label
field of the model config.
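That field can be read directly from the model configuration of the sketch above:

print(model.config.id2label)
# {0: 'NEGATIVE', 1: 'POSITIVE'} for this checkpoint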
97
00:04:28,020 --> 00:04:30,390
The probabilities at index zero
98
00:04:30,390 --> 00:04:32,250
correspond to the negative label,
99
00:04:32,250 --> 00:04:34,140
and those at index one
100
00:04:34,140 --> 00:04:36,480
correspond to the positive label.
101
00:04:36,480 --> 00:04:37,950
This is how our classifier built
102
00:04:37,950 --> 00:04:40,230
with the pipeline function
picked those labels
103
00:04:40,230 --> 00:04:42,240
and computed those scores.
104
00:04:42,240 --> 00:04:44,220
Now that you know how each step works,
105
00:04:44,220 --> 00:04:46,220
you can easily tweak them to your needs.
106
00:04:47,524 --> 00:04:50,274
(logo whooshing)