subtitles/en/57_what-is-perplexity.srt

1
00:00:00,095 --> 00:00:01,582
(screen whooshing)

2
00:00:01,582 --> 00:00:02,659
(sticker popping)

3
00:00:02,659 --> 00:00:05,379
(screen whooshing)

4
00:00:05,379 --> 00:00:06,720
- In this video, we take a look

5
00:00:06,720 --> 00:00:09,483
at the mysterious-sounding metric called perplexity.

6
00:00:11,070 --> 00:00:12,630
You might have encountered perplexity

7
00:00:12,630 --> 00:00:14,970
when reading about generative models.

8
00:00:14,970 --> 00:00:16,680
You can see two examples here,

9
00:00:16,680 --> 00:00:18,577
one from the original transformer paper,

10
00:00:18,577 --> 00:00:19,950
"Attention Is All You Need,"

11
00:00:19,950 --> 00:00:23,340
and the other one from the more recent GPT-2 paper.

12
00:00:23,340 --> 00:00:25,740
Perplexity is a common metric to measure the performance

13
00:00:25,740 --> 00:00:27,150
of language models.

14
00:00:27,150 --> 00:00:30,000
The smaller its value, the better the performance.

15
00:00:30,000 --> 00:00:32,950
But what does it actually mean and how can we calculate it?

16
00:00:34,440 --> 00:00:36,180
A very common quantity in machine learning

17
00:00:36,180 --> 00:00:37,650
is the likelihood.

18
00:00:37,650 --> 00:00:39,240
We can calculate the likelihood

19
00:00:39,240 --> 00:00:42,390
as the product of each token's probability.

20
00:00:42,390 --> 00:00:44,730
What this means is that for each token,

21
00:00:44,730 --> 00:00:47,340
we use the language model to predict its probability

22
00:00:47,340 --> 00:00:49,560
based on the previous tokens.

23
00:00:49,560 --> 00:00:52,050
In the end, we multiply all probabilities

24
00:00:52,050 --> 00:00:53,253
to get the likelihood.

25
00:00:55,892 --> 00:00:57,000
With the likelihood,

26
00:00:57,000 --> 00:00:59,340
we can calculate another important quantity,

27
00:00:59,340 --> 00:01:01,200
the cross-entropy.

28
00:01:01,200 --> 00:01:03,450
You might have already heard about cross-entropy

29
00:01:03,450 --> 00:01:05,670
when looking at loss functions.

30
00:01:05,670 --> 00:01:09,210
It is often used as a loss function in classification.

31
00:01:09,210 --> 00:01:11,610
In language modeling, we predict the next token

32
00:01:11,610 --> 00:01:12,930
based on the previous tokens,

33
00:01:12,930 --> 00:01:15,810
which is also a classification task.

34
00:01:15,810 --> 00:01:17,340
Therefore, if we want to calculate

35
00:01:17,340 --> 00:01:19,290
the cross-entropy of an example,

36
00:01:19,290 --> 00:01:21,090
we can simply pass it to the model

37
00:01:21,090 --> 00:01:23,580
with its inputs as labels.

38
00:01:23,580 --> 00:01:26,433
The loss then corresponds to the cross-entropy.

39
00:01:29,130 --> 00:01:31,110
We are now only a single operation away

40
00:01:31,110 --> 00:01:33,510
from calculating the perplexity.

41
00:01:33,510 --> 00:01:37,710
By exponentiating the cross-entropy, we get the perplexity.

42
00:01:37,710 --> 00:01:40,260
So you see that the perplexity is closely related

43
00:01:40,260 --> 00:01:41,163
to the loss.

44
00:01:42,060 --> 00:01:43,380
Plugging in previous results

45
00:01:43,380 --> 00:01:47,010
shows that this is equivalent to exponentiating

46
00:01:47,010 --> 00:01:51,033
the negative average log probability of each token.

47
00:01:52,050 --> 00:01:54,630
Keep in mind that the loss is only a weak proxy

48
00:01:54,630 --> 00:01:57,360
for a model's ability to generate quality text

49
00:01:57,360 --> 00:02:00,510
and the same is true for perplexity.
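NOTE: The steps described above (pass the inputs as labels to get the cross-entropy loss, then exponentiate it) can be sketched in Python. This is a minimal illustration only; it assumes the Hugging Face transformers library and uses the "gpt2" checkpoint and an example sentence as arbitrary choices, neither of which comes from the video.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Hypothetical checkpoint choice, used only for illustration.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    text = "Perplexity is a common metric for language models."
    inputs = tokenizer(text, return_tensors="pt")

    # Passing the input ids as labels makes the model return the
    # cross-entropy loss, averaged over the predicted tokens.
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])

    # Exponentiating the cross-entropy gives the perplexity.
    perplexity = torch.exp(outputs.loss)
    print(perplexity.item())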
50
00:02:00,510 --> 00:02:02,550
For this reason, one usually also calculates

51
00:02:02,550 --> 00:02:03,840
more sophisticated metrics

52
00:02:03,840 --> 00:02:07,413
such as BLEU or ROUGE on generative tasks.

53
00:02:08,551 --> 00:02:11,468
(screen whooshing)
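NOTE: As a rough sketch of computing the metrics mentioned in the last cues, assuming the Hugging Face evaluate library and made-up example sentences that are not taken from the video:

    import evaluate  # assumes the Hugging Face evaluate library is installed

    bleu = evaluate.load("bleu")
    rouge = evaluate.load("rouge")

    # Hypothetical prediction/reference pair, for illustration only.
    predictions = ["the cat sat on the mat"]
    references = [["the cat is sitting on the mat"]]

    print(bleu.compute(predictions=predictions, references=references))
    print(rouge.compute(predictions=predictions, references=references))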