subtitles/en/62_what-is-the-rouge-metric.srt

1
00:00:00,624 --> 00:00:03,374
(logo whooshing)

2
00:00:05,700 --> 00:00:07,740
- What is the ROUGE metric?

3
00:00:07,740 --> 00:00:08,880
For many NLP tasks

4
00:00:08,880 --> 00:00:12,270
we can use common metrics like accuracy or F1 score.

5
00:00:12,270 --> 00:00:13,650
But what do you do when you wanna measure something

6
00:00:13,650 --> 00:00:16,920
like the quality of a summary from a model like T5?

7
00:00:16,920 --> 00:00:18,180
In this video, we'll take a look

8
00:00:18,180 --> 00:00:21,180
at a widely used metric for text summarization called ROUGE.

9
00:00:22,740 --> 00:00:24,660
There are actually several variants of ROUGE,

10
00:00:24,660 --> 00:00:26,190
but the basic idea behind all of them

11
00:00:26,190 --> 00:00:27,840
is to assign a single numerical score

12
00:00:27,840 --> 00:00:30,000
to a summary that tells us how good it is

13
00:00:30,000 --> 00:00:32,774
compared to one or more reference summaries.

14
00:00:32,774 --> 00:00:34,020
In this example, we have a book review

15
00:00:34,020 --> 00:00:36,570
that has been summarized by some model.

16
00:00:36,570 --> 00:00:38,320
If we compare the generated summary

17
00:00:39,168 --> 00:00:40,260
to some reference human summaries, we can see

18
00:00:40,260 --> 00:00:42,841
that the model is actually pretty good

19
00:00:42,841 --> 00:00:44,063
and only differs by a word or two.

20
00:00:45,060 --> 00:00:46,260
So how can we measure the quality

21
00:00:46,260 --> 00:00:49,050
of a generated summary in an automatic way?

22
00:00:49,050 --> 00:00:51,510
The approach that ROUGE takes is to compare the n-grams

23
00:00:51,510 --> 00:00:55,200
of the generated summary to the n-grams of the references.

24
00:00:55,200 --> 00:00:58,590
An n-gram is just a fancy way of saying a chunk of N words.

25
00:00:58,590 --> 00:01:00,030
So let's start with unigrams,

26
00:01:00,030 --> 00:01:02,780
which correspond to the individual words in a sentence.

27
00:01:03,780 --> 00:01:05,250
In this example, you can see that six

28
00:01:05,250 --> 00:01:07,650
of the words in the generated summary are also found

29
00:01:07,650 --> 00:01:09,420
in one of the reference summaries.

30
00:01:09,420 --> 00:01:11,310
And the ROUGE metric that compares unigrams

31
00:01:11,310 --> 00:01:12,260
is called ROUGE-1.

32
00:01:14,533 --> 00:01:16,770
Now that we've found our matches, one way to assign a score

33
00:01:16,770 --> 00:01:20,280
to the summary is to compute the recall of the unigrams.

34
00:01:20,280 --> 00:01:21,540
This means we just count the number

35
00:01:21,540 --> 00:01:22,950
of matching words in the generated

36
00:01:22,950 --> 00:01:25,290
and reference summaries and normalize the count

37
00:01:25,290 --> 00:01:28,200
by dividing by the number of words in the reference.

38
00:01:28,200 --> 00:01:30,450
In this example, we found six matching words

39
00:01:30,450 --> 00:01:32,160
and our reference has six words.

40
00:01:32,160 --> 00:01:33,933
So our unigram recall is perfect.

41
00:01:34,800 --> 00:01:35,810
This means that all of the words

42
00:01:35,810 --> 00:01:37,500
in the reference summary have been produced

43
00:01:37,500 --> 00:01:38,550
in the generated one.

44
00:01:40,050 --> 00:01:42,360
Now, perfect recall sounds great, but imagine

45
00:01:42,360 --> 00:01:44,520
if our generated summary had been something like

46
00:01:44,520 --> 00:01:45,720
"I really, really, really,

47
00:01:45,720 --> 00:01:48,150
really loved reading the Hunger Games."
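To make the unigram recall calculation above concrete, here is a minimal Python sketch (this code is not from the video; it assumes simple lowercased whitespace tokenization, and uses the example summaries mentioned in the narration):

    from collections import Counter

    def rouge1_recall(generated, reference):
        # Count unigrams on both sides (simple whitespace tokenization).
        gen_counts = Counter(generated.lower().split())
        ref_counts = Counter(reference.lower().split())
        # Clipped overlap: each reference word can only be matched once.
        overlap = sum((gen_counts & ref_counts).values())
        return overlap / sum(ref_counts.values())

    print(rouge1_recall(
        "I really really really really loved reading the Hunger Games",
        "I loved reading the Hunger Games",
    ))  # 1.0, i.e. perfect recall despite all the repetition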
48
00:01:48,150 --> 00:01:49,378
This would also have perfect recall

49
00:01:49,378 --> 00:01:51,330
but is arguably a worse summary,

50
00:01:51,330 --> 00:01:52,653
since it is verbose.

51
00:01:53,550 --> 00:01:54,600
To deal with these scenarios,

52
00:01:54,600 --> 00:01:56,190
we can also compute precision,

53
00:01:56,190 --> 00:01:58,380
which in the ROUGE context measures how much

54
00:01:58,380 --> 00:02:00,810
of the generated summary was relevant.

55
00:02:00,810 --> 00:02:03,630
In practice, both precision and recall are usually computed,

56
00:02:03,630 --> 00:02:05,493
and then the F1 score is reported.

57
00:02:07,170 --> 00:02:08,542
Now we can change the granularity

58
00:02:08,542 --> 00:02:13,020
of the comparison by comparing bigrams instead of unigrams.

59
00:02:13,020 --> 00:02:15,090
With bigrams, we chunk the sentence into pairs

60
00:02:15,090 --> 00:02:17,910
of consecutive words and then count how many pairs

61
00:02:17,910 --> 00:02:21,360
in the generated summary are present in the reference one.

62
00:02:21,360 --> 00:02:23,880
This gives us ROUGE-2 precision and recall,

63
00:02:23,880 --> 00:02:24,780
which, as we can see,

64
00:02:24,780 --> 00:02:27,780
are lower than the ROUGE-1 scores from earlier.

65
00:02:27,780 --> 00:02:29,400
Now, if the summaries are long,

66
00:02:29,400 --> 00:02:31,740
the ROUGE-2 scores will generally be small,

67
00:02:31,740 --> 00:02:34,290
because there are fewer bigrams to match.

68
00:02:34,290 --> 00:02:36,870
And this is also true for abstractive summarization.

69
00:02:36,870 --> 00:02:39,993
So both ROUGE-1 and ROUGE-2 scores are usually reported.

70
00:02:42,000 --> 00:02:45,330
The last ROUGE variant we will discuss is ROUGE-L.

71
00:02:45,330 --> 00:02:47,160
ROUGE-L doesn't compare n-grams,

72
00:02:47,160 --> 00:02:49,572
but instead treats each summary as a sequence of words

73
00:02:49,572 --> 00:02:53,403
and then looks for the longest common subsequence, or LCS.

74
00:02:54,775 --> 00:02:56,130
A subsequence is a sequence that appears

75
00:02:56,130 --> 00:02:59,760
in the same relative order, but is not necessarily contiguous.

76
00:02:59,760 --> 00:03:03,210
So in this example, "I loved reading the Hunger Games"

77
00:03:03,210 --> 00:03:06,930
is the longest common subsequence between the two summaries.

78
00:03:06,930 --> 00:03:08,610
And the main advantage of ROUGE-L

79
00:03:08,610 --> 00:03:11,670
over ROUGE-1 or ROUGE-2 is that it doesn't depend

80
00:03:11,670 --> 00:03:14,100
on consecutive n-gram matches, and so it tends

81
00:03:14,100 --> 00:03:16,650
to capture sentence structure much more accurately.

82
00:03:18,150 --> 00:03:19,440
Now, computing ROUGE scores

83
00:03:19,440 --> 00:03:21,660
with the Datasets library is very simple.

84
00:03:21,660 --> 00:03:23,910
You just use the load_metric function,

85
00:03:23,910 --> 00:03:26,400
provide your model summaries along with the references,

86
00:03:26,400 --> 00:03:27,500
and you're good to go.

87
00:03:28,770 --> 00:03:30,120
The output from the calculation

88
00:03:30,120 --> 00:03:31,507
contains a lot of information.

89
00:03:31,507 --> 00:03:34,560
The first thing we can see is that the confidence intervals

90
00:03:34,560 --> 00:03:36,090
of each ROUGE score are provided

91
00:03:36,090 --> 00:03:39,030
in the low, mid and high fields.

92
00:03:39,030 --> 00:03:40,980
This is really useful if you wanna know the spread

93
00:03:40,980 --> 00:03:43,730
of your ROUGE scores when comparing two or more models.
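The call just described might look roughly like this. This is a sketch, assuming a Datasets version that still ships load_metric (newer releases moved metrics to the separate Evaluate library), and the summaries here are placeholders:

    from datasets import load_metric

    rouge = load_metric("rouge")
    generated = ["I really loved reading the Hunger Games"]
    references = ["I loved reading the Hunger Games"]
    # One generated summary per reference; the metric handles tokenization itself.
    scores = rouge.compute(predictions=generated, references=references)
    print(scores)  # rouge1, rouge2, rougeL and rougeLsum, each with low/mid/high aggregates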
94
00:03:45,090 --> 00:03:46,050
The second thing to notice

95
00:03:46,050 --> 00:03:48,330
is that we have four types of ROUGE score.

96
00:03:48,330 --> 00:03:51,480
We've already seen ROUGE-1, ROUGE-2 and ROUGE-L,

97
00:03:51,480 --> 00:03:53,760
so what is ROUGE-Lsum?

98
00:03:53,760 --> 00:03:55,410
Well, the "sum" in ROUGE-Lsum

99
00:03:55,410 --> 00:03:57,630
refers to the fact that this metric is computed

100
00:03:57,630 --> 00:04:00,240
over a whole summary, while ROUGE-L is computed

101
00:04:00,240 --> 00:04:02,493
as the average over individual sentences.

102
00:04:04,166 --> 00:04:06,916
(logo whooshing)
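If you just want one number per ROUGE variant, a common pattern is to keep the mid F1 score from the aggregated output. A self-contained sketch, assuming the low/mid/high aggregate format described above, where each score exposes precision, recall and fmeasure:

    from datasets import load_metric

    rouge = load_metric("rouge")
    scores = rouge.compute(
        predictions=["I really loved reading the Hunger Games"],
        references=["I loved reading the Hunger Games"],
    )
    # Keep only the middle of each confidence interval, reported as an F1 score.
    mid_f1 = {name: aggregate.mid.fmeasure for name, aggregate in scores.items()}
    print(mid_f1)  # {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}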