subtitles/en/57_what-is-perplexity.srt
1
00:00:00,095 --> 00:00:01,582
(screen whooshing)
2
00:00:01,582 --> 00:00:02,659
(sticker popping)
3
00:00:02,659 --> 00:00:05,379
(screen whooshing)
4
00:00:05,379 --> 00:00:06,720
- In this video, we take a look
5
00:00:06,720 --> 00:00:09,483
at the mysterious-sounding
metric called perplexity.
6
00:00:11,070 --> 00:00:12,630
You might have encountered perplexity
7
00:00:12,630 --> 00:00:14,970
when reading about generative models.
8
00:00:14,970 --> 00:00:16,680
You can see two examples here,
9
00:00:16,680 --> 00:00:18,577
one from the original transformer paper,
10
00:00:18,577 --> 00:00:19,950
"Attention is all you need,"
11
00:00:19,950 --> 00:00:23,340
and the other one from the
more recent GPT-2 paper.
12
00:00:23,340 --> 00:00:25,740
Perplexity is a common metric
to measure the performance
13
00:00:25,740 --> 00:00:27,150
of language models.
14
00:00:27,150 --> 00:00:30,000
The smaller its value, the
better the performance.
15
00:00:30,000 --> 00:00:32,950
But what does it actually mean
and how can we calculate it?
16
00:00:34,440 --> 00:00:36,180
A very common quantity in machine learning
17
00:00:36,180 --> 00:00:37,650
is the likelihood.
18
00:00:37,650 --> 00:00:39,240
We can calculate the likelihood
19
00:00:39,240 --> 00:00:42,390
as the product of each
token's probability.
20
00:00:42,390 --> 00:00:44,730
What this means is that for each token,
21
00:00:44,730 --> 00:00:47,340
we use the language model
to predict its probability
22
00:00:47,340 --> 00:00:49,560
based on the previous tokens.
23
00:00:49,560 --> 00:00:52,050
In the end, we multiply all probabilities
24
00:00:52,050 --> 00:00:53,253
to get the likelihood.
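As a toy sketch (the per-token probabilities below are made up, not taken from a real model), the likelihood is simply the product of the probabilities:

import torch

# Made-up probabilities a language model assigns to each token of a sequence
token_probs = torch.tensor([0.4, 0.25, 0.9, 0.6])

# Likelihood: product of each token's probability
likelihood = token_probs.prod()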
25
00:00:55,892 --> 00:00:57,000
With the likelihood,
26
00:00:57,000 --> 00:00:59,340
we can calculate another
important quantity,
27
00:00:59,340 --> 00:01:01,200
the cross-entropy.
28
00:01:01,200 --> 00:01:03,450
You might have already
heard about cross-entropy
29
00:01:03,450 --> 00:01:05,670
when looking at loss functions.
30
00:01:05,670 --> 00:01:09,210
It is often used as a loss
function in classification.
31
00:01:09,210 --> 00:01:11,610
In language modeling, we
predict the next token
32
00:01:11,610 --> 00:01:12,930
based on the previous tokens,
33
00:01:12,930 --> 00:01:15,810
which is also a classification task.
34
00:01:15,810 --> 00:01:17,340
Therefore, if we want to calculate
35
00:01:17,340 --> 00:01:19,290
the cross-entropy of an example,
36
00:01:19,290 --> 00:01:21,090
we can simply pass it to the model
37
00:01:21,090 --> 00:01:23,580
with its inputs as labels.
38
00:01:23,580 --> 00:01:26,433
The loss then corresponds
to the cross-entropy.
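A minimal sketch of this step, assuming a causal language model from the transformers library ("gpt2" is used here only as an example checkpoint, with an arbitrary example sentence):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Hugging Face is based in NYC", return_tensors="pt")

# Passing the input_ids as the labels makes the model return
# the (average) cross-entropy of the example as its loss
outputs = model(input_ids=inputs["input_ids"], labels=inputs["input_ids"])
loss = outputs.loss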
39
00:01:29,130 --> 00:01:31,110
We are now only a single operation away
40
00:01:31,110 --> 00:01:33,510
from calculating the perplexity.
41
00:01:33,510 --> 00:01:37,710
By exponentiating the cross-entropy,
we get the perplexity.
42
00:01:37,710 --> 00:01:40,260
So you see that the
perplexity is closely related
43
00:01:40,260 --> 00:01:41,163
to the loss.
44
00:01:42,060 --> 00:01:43,380
Plugging in previous results
45
00:01:43,380 --> 00:01:47,010
shows that this is
equivalent to exponentiating
46
00:01:47,010 --> 00:01:51,033
the negative average log
probability of each token.
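Continuing the sketch, the perplexity is just the exponential of that loss; the numbers below are made up for illustration:

import torch

loss = torch.tensor(3.21)  # example cross-entropy value
perplexity = torch.exp(loss)

# Equivalently, from made-up per-token probabilities:
token_probs = torch.tensor([0.4, 0.25, 0.9, 0.6])
perplexity_from_probs = torch.exp(-token_probs.log().mean())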
47
00:01:52,050 --> 00:01:54,630
Keep in mind that the
loss is only a weak proxy
48
00:01:54,630 --> 00:01:57,360
for a model's ability
to generate quality text,
49
00:01:57,360 --> 00:02:00,510
and the same is true for perplexity.
50
00:02:00,510 --> 00:02:02,550
For this reason, one
usually also calculates
51
00:02:02,550 --> 00:02:03,840
more sophisticated metrics
52
00:02:03,840 --> 00:02:07,413
such as BLEU or ROUGE on generative tasks.
53
00:02:08,551 --> 00:02:11,468
(screen whooshing)