subtitles/en/62_what-is-the-rouge-metric.srt
1
00:00:00,624 --> 00:00:03,374
(logo whooshing)
2
00:00:05,700 --> 00:00:07,740
- What is the ROUGE metric?
3
00:00:07,740 --> 00:00:08,880
For many NLP tasks
4
00:00:08,880 --> 00:00:12,270
we can use common metrics
like accuracy or F1 score.
5
00:00:12,270 --> 00:00:13,650
But what do you do when
you wanna measure something
6
00:00:13,650 --> 00:00:16,920
like the quality of a
summary from a model like T5?
7
00:00:16,920 --> 00:00:18,180
In this video, we'll take a look
8
00:00:18,180 --> 00:00:21,180
at a widely used metric for
text summarization called ROUGE.
9
00:00:22,740 --> 00:00:24,660
There are actually
several variants of ROUGE
10
00:00:24,660 --> 00:00:26,190
but the basic idea behind all of them
11
00:00:26,190 --> 00:00:27,840
is to assign a single numerical score
12
00:00:27,840 --> 00:00:30,000
to a summary that tells us how good it is
13
00:00:30,000 --> 00:00:32,774
compared to one or more
reference summaries.
14
00:00:32,774 --> 00:00:34,020
In this example, we have a book review
15
00:00:34,020 --> 00:00:36,570
that has been summarized by some model.
16
00:00:36,570 --> 00:00:38,320
If we compare the generated summary
17
00:00:39,168 --> 00:00:40,260
to some reference human
summaries, we can see
18
00:00:40,260 --> 00:00:42,841
that the model is actually pretty good
19
00:00:42,841 --> 00:00:44,063
and only differs by a word or two.
20
00:00:45,060 --> 00:00:46,260
So how can we measure the quality
21
00:00:46,260 --> 00:00:49,050
of a generated summary
in an automatic way?
22
00:00:49,050 --> 00:00:51,510
The approach that ROUGE takes
is to compare the n-grams
23
00:00:51,510 --> 00:00:55,200
of the generated summary to
the n-grams of the references.
24
00:00:55,200 --> 00:00:58,590
An n-gram is just a fancy way
of saying a chunk of N words.
25
00:00:58,590 --> 00:01:00,030
So let's start with unigrams
26
00:01:00,030 --> 00:01:02,780
which correspond to the
individual words in a sentence.
27
00:01:03,780 --> 00:01:05,250
In this example, you can see that six
28
00:01:05,250 --> 00:01:07,650
of the words in the generated
summary are also found
29
00:01:07,650 --> 00:01:09,420
in one of the reference summaries.
30
00:01:09,420 --> 00:01:11,310
And the ROUGE metric
that compares unigrams
31
00:01:11,310 --> 00:01:12,260
is called ROUGE-1.
32
00:01:14,533 --> 00:01:16,770
Now that we've found our matches,
one way to assign a score
33
00:01:16,770 --> 00:01:20,280
to the summary is to compute
the recall of the unigrams.
34
00:01:20,280 --> 00:01:21,540
This means we just count the number
35
00:01:21,540 --> 00:01:22,950
of matching words in the generated
36
00:01:22,950 --> 00:01:25,290
and reference summaries
and normalize the count
37
00:01:25,290 --> 00:01:28,200
by dividing by the number
of words in the reference.
38
00:01:28,200 --> 00:01:30,450
In this example, we
found six matching words
39
00:01:30,450 --> 00:01:32,160
and our reference has six words.
40
00:01:32,160 --> 00:01:33,933
So our unigram recall is perfect.
41
00:01:34,800 --> 00:01:35,810
This means that all of the words
42
00:01:35,810 --> 00:01:37,500
in the reference summary
have been produced
43
00:01:37,500 --> 00:01:38,550
in the generated one.
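As a rough sketch of that recall calculation (a simplified Python illustration, not the official ROUGE implementation, and the example summaries below are just plausible stand-ins):

    from collections import Counter

    def rouge1_recall(generated: str, reference: str) -> float:
        # Count each unigram, clipping matches by how often it appears in either summary.
        gen_counts = Counter(generated.lower().split())
        ref_counts = Counter(reference.lower().split())
        overlap = sum(min(gen_counts[word], count) for word, count in ref_counts.items())
        return overlap / sum(ref_counts.values())

    # Every reference word shows up in the generated summary, so recall is 1.0.
    print(rouge1_recall("I absolutely loved reading the Hunger Games",
                        "I loved reading the Hunger Games"))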
44
00:01:40,050 --> 00:01:42,360
Now, perfect recall
sounds great, but imagine
45
00:01:42,360 --> 00:01:44,520
if our generated summary
had been something like
46
00:01:44,520 --> 00:01:45,720
I really, really, really,
47
00:01:45,720 --> 00:01:48,150
really loved reading the Hunger Games.
48
00:01:48,150 --> 00:01:49,378
This would also have perfect recall
49
00:01:49,378 --> 00:01:51,330
but is arguably a worse summary,
50
00:01:51,330 --> 00:01:52,653
since it is verbose.
51
00:01:53,550 --> 00:01:54,600
To deal with these scenarios,
52
00:01:54,600 --> 00:01:56,190
we can also compute precision,
53
00:01:56,190 --> 00:01:58,380
which in the ROUGE
context measures how much
54
00:01:58,380 --> 00:02:00,810
of the generated summary was relevant.
55
00:02:00,810 --> 00:02:03,630
In practice, both precision
and recall are usually computed
56
00:02:03,630 --> 00:02:05,493
and then the F1 score is reported.
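A minimal sketch of all three numbers, again an illustration with made-up summaries rather than the official scorer:

    from collections import Counter

    def rouge1_scores(generated: str, reference: str):
        gen = Counter(generated.lower().split())
        ref = Counter(reference.lower().split())
        overlap = sum(min(gen[word], ref[word]) for word in gen)
        precision = overlap / sum(gen.values())  # how much of the generated text is relevant
        recall = overlap / sum(ref.values())     # how much of the reference is covered
        f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
        return precision, recall, f1

    # The verbose summary keeps perfect recall, but its precision drops.
    print(rouge1_scores("I really really really really loved reading the Hunger Games",
                        "I loved reading the Hunger Games"))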
57
00:02:07,170 --> 00:02:08,542
Now we can change the granularity
58
00:02:08,542 --> 00:02:13,020
of the comparison by comparing
bigrams instead of unigrams.
59
00:02:13,020 --> 00:02:15,090
With bigrams, we chunk
the sentence into pairs
60
00:02:15,090 --> 00:02:17,910
of consecutive words and
then count how many pairs
61
00:02:17,910 --> 00:02:21,360
in the generated summary are
present in the reference one.
62
00:02:21,360 --> 00:02:23,880
This gives us ROUGE-2 precision and recall
63
00:02:23,880 --> 00:02:24,780
which, as we can see,
64
00:02:24,780 --> 00:02:27,780
are lower than the ROUGE-1
scores from earlier.
65
00:02:27,780 --> 00:02:29,400
Now, if the summaries are long,
66
00:02:29,400 --> 00:02:31,740
the ROUGE-2 scores will generally be small
67
00:02:31,740 --> 00:02:34,290
because there are fewer bigrams to match.
68
00:02:34,290 --> 00:02:36,870
And this is also true for
abstractive summarization.
69
00:02:36,870 --> 00:02:39,993
So both ROUGE-1 and ROUGE-2
scores are usually reported.
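A quick sketch of that bigram counting, using the same kind of illustrative summaries as before (not the official implementation):

    from collections import Counter

    def ngram_counts(text: str, n: int) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def rouge_n_recall(generated: str, reference: str, n: int = 2) -> float:
        gen, ref = ngram_counts(generated, n), ngram_counts(reference, n)
        overlap = sum(min(gen[g], ref[g]) for g in ref)
        return overlap / max(sum(ref.values()), 1)

    # Fewer consecutive pairs match than single words, so ROUGE-2 recall
    # comes out lower than the ROUGE-1 recall from before.
    print(rouge_n_recall("I absolutely loved reading the Hunger Games",
                         "I loved reading the Hunger Games"))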
70
00:02:42,000 --> 00:02:45,330
The last ROUGE variant we
will discuss is ROUGE-L.
71
00:02:45,330 --> 00:02:47,160
ROUGE-L doesn't compare n-grams
72
00:02:47,160 --> 00:02:49,572
but instead treats each
summary as a sequence of words
73
00:02:49,572 --> 00:02:53,403
and then looks for the longest
common subsequence or LCS.
74
00:02:54,775 --> 00:02:56,130
A subsequence is a sequence that appears
75
00:02:56,130 --> 00:02:59,760
in the same relative order,
but not necessarily contiguously.
76
00:02:59,760 --> 00:03:03,210
So in this example, "I loved
reading the Hunger Games"
77
00:03:03,210 --> 00:03:06,930
is the longest common subsequence
between the two summaries.
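The word-level LCS itself can be sketched with classic dynamic programming (illustrative only; the real ROUGE-L score then derives precision and recall from this length):

    def lcs_length(a, b):
        # dp[i][j] holds the LCS length of the first i words of a and the first j words of b.
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i, word_a in enumerate(a, start=1):
            for j, word_b in enumerate(b, start=1):
                if word_a == word_b:
                    dp[i][j] = dp[i - 1][j - 1] + 1
                else:
                    dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
        return dp[len(a)][len(b)]

    generated = "I absolutely loved reading the Hunger Games".split()
    reference = "I loved reading the Hunger Games".split()
    print(lcs_length(generated, reference))  # 6: these words appear in the same relative order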
78
00:03:06,930 --> 00:03:08,610
And the main advantage of ROUGE-L
79
00:03:08,610 --> 00:03:11,670
over ROUGE-1 or ROUGE-2
is that it doesn't depend
80
00:03:11,670 --> 00:03:14,100
on consecutive n-gram
matches, and so it tends
81
00:03:14,100 --> 00:03:16,650
to capture sentence structure
much more accurately.
82
00:03:18,150 --> 00:03:19,440
Now, computing ROUGE scores
83
00:03:19,440 --> 00:03:21,660
in the Datasets library is very simple.
84
00:03:21,660 --> 00:03:23,910
You just use the load_metric function,
85
00:03:23,910 --> 00:03:26,400
provide your model summaries
along with the references
86
00:03:26,400 --> 00:03:27,500
and you're good to go.
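A minimal sketch of that call, with placeholder summaries standing in for real model outputs:

    from datasets import load_metric

    # Load the ROUGE metric from the Datasets library.
    rouge_score = load_metric("rouge")

    generated_summaries = ["I absolutely loved reading the Hunger Games"]
    reference_summaries = ["I loved reading the Hunger Games"]

    scores = rouge_score.compute(
        predictions=generated_summaries, references=reference_summaries
    )
    print(scores)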
87
00:03:28,770 --> 00:03:30,120
The output from the calculation
88
00:03:30,120 --> 00:03:31,507
contains a lot of information.
89
00:03:31,507 --> 00:03:34,560
The first thing we can see is
that the confidence intervals
90
00:03:34,560 --> 00:03:36,090
of each ROUGE score are provided
91
00:03:36,090 --> 00:03:39,030
in the low, mid and high fields.
92
00:03:39,030 --> 00:03:40,980
This is really useful if
you wanna know the spread
93
00:03:40,980 --> 00:03:43,730
of your ROUGE scores when
comparing two or more models.
94
00:03:45,090 --> 00:03:46,050
The second thing to notice
95
00:03:46,050 --> 00:03:48,330
is that we have four types of ROUGE score.
96
00:03:48,330 --> 00:03:51,480
We've already seen ROUGE-1,
ROUGE-2 and ROUGE-L.
97
00:03:51,480 --> 00:03:53,760
So what is ROUGE-Lsum?
98
00:03:53,760 --> 00:03:55,410
Well, the "sum" in ROUGE-Lsum
99
00:03:55,410 --> 00:03:57,630
refers to the fact that
this metric is computed
100
00:03:57,630 --> 00:04:00,240
over a whole summary
while ROUGE-L is computed
101
00:04:00,240 --> 00:04:02,493
as the average over individual sentences.
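Continuing the earlier sketch, the individual numbers can be pulled out of those fields like this (assuming the output structure described above):

    # Each ROUGE entry has low / mid / high aggregates, and each aggregate
    # exposes precision, recall, and fmeasure.
    mid_f1 = scores["rougeLsum"].mid.fmeasure
    print(f"ROUGE-Lsum F1 (mid estimate): {mid_f1:.3f}")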
102
00:04:04,166 --> 00:04:06,916
(logo whooshing)