subtitles/zh-CN/62_what-is-the-rouge-metric.srt
1
00:00:00,624 --> 00:00:03,374
(徽标呼啸而过)
(logo whooshing)
2
00:00:05,700 --> 00:00:07,740
- 什么是 ROUGE 指标?
- What is the ROUGE metric?
3
00:00:07,740 --> 00:00:08,880
对于许多 NLP 任务
For many NLP tasks
4
00:00:08,880 --> 00:00:12,270
我们可以使用常见的指标,如准确性或 F1 分数。
we can use common metrics like accuracy or F1 score.
5
00:00:12,270 --> 00:00:13,650
但是,当你想衡量的是
But what do you do when you wanna measure something
6
00:00:13,650 --> 00:00:16,920
像 T5 这样的模型生成的摘要质量时,该怎么做呢?
like the quality of a summary from a model like T5?
7
00:00:16,920 --> 00:00:18,180
在本视频中,我们将了解
In this video, we'll take a look
8
00:00:18,180 --> 00:00:21,180
被广泛用于文本摘要的评估指标,称为 ROUGE。
at a widely used metric for text summarization called ROUGE.
9
00:00:22,740 --> 00:00:24,660
ROUGE 实际上有几种变体
There are actually several variants of ROUGE
10
00:00:24,660 --> 00:00:26,190
但所有这些变体背后的基本思想
but the basic idea behind all of them
11
00:00:26,190 --> 00:00:27,840
是为每个摘要分配一个单独的分数
is to assign a single numerical score
12
00:00:27,840 --> 00:00:30,000
来告诉我们相比一个或者多个参考的摘要
to a summary that tells us how good it is
13
00:00:30,000 --> 00:00:32,774
当前的摘要有多好。
compared to one or more reference summaries.
14
00:00:32,774 --> 00:00:34,020
在这个例子中,我们有一个书评
In this example, we have a book review
15
00:00:34,020 --> 00:00:36,570
它已经由某个模型进行了摘要。
that has been summarized by some model.
16
00:00:36,570 --> 00:00:38,320
如果我们将生成的摘要
If we compare the generated summary
17
00:00:39,168 --> 00:00:40,260
和一些人工的摘要相比较,
to some reference human summaries,
18
00:00:40,260 --> 00:00:42,841
我们可以看到该模型实际上非常好
we can see that the model is actually pretty good
19
00:00:42,841 --> 00:00:44,063
并且只相差一两个词。
and only differs by a word or two.
20
00:00:45,060 --> 00:00:46,260
那么我们如何通过自动的方式
So how can we measure the quality
21
00:00:46,260 --> 00:00:49,050
评估生成的摘要的质量呢?
of a generated summary in an automatic way?
22
00:00:49,050 --> 00:00:51,510
ROUGE 采用的方法是比较生成的摘要的 n-gram
The approach that ROUGE takes is to compare the n-grams
23
00:00:51,510 --> 00:00:55,200
和参考文献的 n-gram。
of the generated summary to the n-grams of the references.
24
00:00:55,200 --> 00:00:58,590
n-gram 只是一种表达 N 个单词的词块的流行说法。
And n-gram is just a fancy way of saying a chunk of N words.
25
00:00:58,590 --> 00:01:00,030
所以让我们从 unigram 开始
So let's start with unigrams
26
00:01:00,030 --> 00:01:02,780
它对应于句子中的各个单词。
which correspond to the individual words in a sentence.
27
00:01:03,780 --> 00:01:05,250
在这个例子中,你可以看到
In this example, you can see that six
28
00:01:05,250 --> 00:01:07,650
在生成的摘要中有 6 个单词,也出现在
of the words in the generated summary are also found
29
00:01:07,650 --> 00:01:09,420
其中一个参考的摘要中。
in one of the reference summaries.
30
00:01:09,420 --> 00:01:11,310
而比较 unigram 的 ROUGE 指标
And the ROUGE metric that compares unigrams
31
00:01:11,310 --> 00:01:12,260
被称为 ROUGE-1。
is called ROUGE-1.
32
00:01:14,533 --> 00:01:16,770
现在我们找到了我们的匹配项,一种为摘要分配分数的方法
Now that we found our matches, one way to assign a score
33
00:01:16,770 --> 00:01:20,280
是计算 unigram 的召回率。
to the summary is to compute the recall of the unigrams.
34
00:01:20,280 --> 00:01:21,540
这意味着我们只计算
This means we just count the number
35
00:01:21,540 --> 00:01:22,950
生成摘要和参考摘要的匹配词
of matching words in the generated
36
00:01:22,950 --> 00:01:25,290
并将计数通过除以参考文本中的单词数
and reference summaries and normalize the count
37
00:01:25,290 --> 00:01:28,200
进行规范化处理。
by dividing by the number of words in the reference.
38
00:01:28,200 --> 00:01:30,450
在这个例子中,我们找到了六个匹配的词
In this example, we found six matching words
39
00:01:30,450 --> 00:01:32,160
我们的参考文本有六个词。
and our reference has six words.
40
00:01:32,160 --> 00:01:33,933
所以我们的 unigram 召回是完美的。
So our unigram recall is perfect.
41
00:01:34,800 --> 00:01:35,810
这意味着在参考摘要中
This means that all of the words
42
00:01:35,810 --> 00:01:37,500
的所有的词都会出现
in the reference summary have been produced
43
00:01:37,500 --> 00:01:38,550
在生成的摘要中。
in the generated one.
44
00:01:40,050 --> 00:01:42,360
现在,完美的召回听起来不错,但想象一下
Now, perfect recall sounds great, but imagine
45
00:01:42,360 --> 00:01:44,520
如果我们生成的摘要是这样的
if our generated summary had been something like
46
00:01:44,520 --> 00:01:45,720
我真的,真的,真的,
I really, really, really,
47
00:01:45,720 --> 00:01:48,150
真的很喜欢阅读 Hunger Games。
really loved reading the Hunger Games.
48
00:01:48,150 --> 00:01:49,378
这也会有完美的召回
This would also have perfect recall
49
00:01:49,378 --> 00:01:51,330
但可以说是一个更糟糕的总结,
but is arguably a worse summary,
50
00:01:51,330 --> 00:01:52,653
因为它很冗长。
since it is verbose.
51
00:01:53,550 --> 00:01:54,600
为了应对这些场景,
To deal with these scenarios,
52
00:01:54,600 --> 00:01:56,190
我们还可以计算精度,
we can also compute precision,
53
00:01:56,190 --> 00:01:58,380
在 ROUGE 上下文中,它衡量了
which in the ROUGE context measures how much
54
00:01:58,380 --> 00:02:00,810
生成的摘要中有多少内容是相关的。
of the generated summary was relevant.
55
00:02:00,810 --> 00:02:03,630
在实际操作中,通常计算精度和召回率
In practice, both precision and recall are usually computed
56
00:02:03,630 --> 00:02:05,493
然后报告 F1 分数。
and then the F1 score is reported.
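The unigram recall, precision and F1 computation described above can be sketched in a few lines of Python. This is an illustrative toy, not the official ROUGE implementation (it assumes simple whitespace tokenization), and the summaries below are hypothetical, in the spirit of the video's verbose-summary example:

```python
from collections import Counter

# Minimal ROUGE-1 sketch: whitespace tokenization, bag-of-words overlap,
# then recall, precision and F1 as described in the video.
def rouge1_scores(generated: str, reference: str):
    gen = Counter(generated.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((gen & ref).values())       # matching unigrams, with multiplicity
    recall = overlap / sum(ref.values())      # how much of the reference is covered
    precision = overlap / sum(gen.values())   # how much of the generation is relevant
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

reference = "I loved reading the Hunger Games"
verbose = "I really really really really loved reading the Hunger Games"

scores = rouge1_scores(verbose, reference)
# recall is perfect (every reference word appears in the generation),
# but the repeated "really" drags precision, and hence F1, down
```
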
57
00:02:07,170 --> 00:02:08,542
现在我们可以改变比较的粒度
Now we can change the granularity
58
00:02:08,542 --> 00:02:13,020
将比较的对象由 unigram 改变为 bigram。
of the comparison by comparing bigrams instead of unigrams.
59
00:02:13,020 --> 00:02:15,090
使用 bigram,我们将句子
With bigrams, we chunk the sentence into pairs
60
00:02:15,090 --> 00:02:17,910
切分为成对的连续单词,
of consecutive words and then count how many pairs
61
00:02:17,910 --> 00:02:21,360
然后计算生成的摘要中有多少对出现在参考摘要中。
in the generated summary are present in the reference one.
62
00:02:21,360 --> 00:02:23,880
这给了我们 ROUGE-2 精确度和召回率
This gives us ROUGE-2 precision and recall
63
00:02:23,880 --> 00:02:24,780
正如我们所见,
which as we can see,
64
00:02:24,780 --> 00:02:27,780
低于之前的 ROUGE-1 分数。
is lower than the ROUGE-1 scores from earlier.
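The bigram variant only changes the chunking step. A sketch generalized to n-grams (again a toy with hypothetical example sentences, not the official implementation) shows why ROUGE-2 comes out lower: a single inserted word destroys the bigram matches on either side of it.

```python
from collections import Counter

# Chunk a token list into n-grams: runs of n consecutive words.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Illustrative ROUGE-N sketch: count overlapping n-grams, then normalize.
def rouge_n(generated: str, reference: str, n: int = 2):
    gen = Counter(ngrams(generated.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    overlap = sum((gen & ref).values())
    return {"precision": overlap / sum(gen.values()),
            "recall": overlap / sum(ref.values())}

# One extra word ("really") breaks the bigrams it participates in,
# so ROUGE-2 is lower than the corresponding ROUGE-1 scores.
r2 = rouge_n("I really loved reading the Hunger Games",
             "I loved reading the Hunger Games", n=2)
```
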
65
00:02:27,780 --> 00:02:29,400
现在,如果摘要很长,
Now, if the summaries are long,
66
00:02:29,400 --> 00:02:31,740
ROUGE-2 分数通常会很小
the ROUGE-2 scores will generally be small
67
00:02:31,740 --> 00:02:34,290
因为要匹配的 bigram 更少。
because there are fewer bigrams to match.
68
00:02:34,290 --> 00:02:36,870
这也适用于重写式摘要。
And this is also true for abstractive summarization.
69
00:02:36,870 --> 00:02:39,993
所以通常会报告 ROUGE-1 和 ROUGE-2 分数。
So both ROUGE-1 and ROUGE-2 scores are usually reported.
70
00:02:42,000 --> 00:02:45,330
我们将讨论的最后一个 ROUGE 变体是 ROUGE L。
The last ROUGE variant we will discuss is ROUGE L.
71
00:02:45,330 --> 00:02:47,160
ROUGE L 不比较 n-gram
ROUGE L doesn't compare n-grams
72
00:02:47,160 --> 00:02:49,572
而是将每个摘要视为一系列单词
but instead treats each summary as a sequence of words
73
00:02:49,572 --> 00:02:53,403
然后寻找最长的公共子序列或 LCS。
and then looks for the longest common subsequence or LCS.
74
00:02:54,775 --> 00:02:56,130
子序列是以相同的相对顺序
A subsequence is a sequence that appears
75
00:02:56,130 --> 00:02:59,760
出现的序列,但不一定是连续的。
in the same relative order, but not necessarily contiguous.
76
00:02:59,760 --> 00:03:03,210
所以在这个例子中,我喜欢阅读 Hunger Games,
So in this example, I loved reading the Hunger Games,
77
00:03:03,210 --> 00:03:06,930
是两个摘要之间最长的公共子序列。
is the longest common subsequence between the two summaries.
78
00:03:06,930 --> 00:03:08,610
而相比 ROUGE-1 或 ROUGE-2
And the main advantage of ROUGE L
79
00:03:08,610 --> 00:03:11,670
ROUGE L 的主要优势是它不依赖于
over ROUGE-1 or ROUGE-2 is that it doesn't depend
80
00:03:11,670 --> 00:03:14,100
在连续的 n-gram 匹配上,
on consecutive n-gram matches, and so it tends
81
00:03:14,100 --> 00:03:16,650
所以它倾向于更准确地捕捉句子结构。
to capture sentence structure much more accurately.
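The LCS at the heart of ROUGE-L can be computed with the standard dynamic-programming recurrence over word sequences. A minimal sketch, using the video's example sentences (the generated summary is a hypothetical reconstruction):

```python
# Longest common subsequence length via dynamic programming:
# dp[i][j] = LCS length of a[:i] and b[:j].
def lcs_length(a, b):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            # Extend the match on equality, otherwise keep the best so far.
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

gen = "I really loved reading the Hunger Games".split()
ref = "I loved reading the Hunger Games".split()
lcs = lcs_length(gen, ref)  # "I loved reading the Hunger Games": length 6
```

Note that the subsequence skips over "really" without penalty, which is exactly why ROUGE-L tolerates non-consecutive matches.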
82
00:03:18,150 --> 00:03:19,440
现在在 Datasets 库中
Now, computing ROUGE scores
83
00:03:19,440 --> 00:03:21,660
计算 ROUGE 分数非常简单。
in the Datasets library is very simple.
84
00:03:21,660 --> 00:03:23,910
你只需使用 load_metric 函数,
You just use the load_metric function,
85
00:03:23,910 --> 00:03:26,400
提供你的模型摘要以及参考摘要
provide your model summaries along with the references
86
00:03:26,400 --> 00:03:27,500
你可以开始了。
and you're good to go.
87
00:03:28,770 --> 00:03:30,120
计算的输出
The output from the calculation
88
00:03:30,120 --> 00:03:31,507
包含很多信息。
contains a lot of information.
89
00:03:31,507 --> 00:03:34,560
我们首先可以看到的是
The first thing we can see is that the confidence intervals
90
00:03:34,560 --> 00:03:36,090
每个 ROUGE 分数的置信区间
of each ROUGE score are provided
91
00:03:36,090 --> 00:03:39,030
在 low、mid 和 high 字段中给出。
in the low, mid and high fields.
92
00:03:39,030 --> 00:03:40,980
当比较两个或多个模型时,
This is really useful if you wanna know the spread
93
00:03:40,980 --> 00:03:43,730
如果你想知道 ROUGE 分数的范围,这就真的很有用。
of your ROUGE scores when comparing two or more models.
94
00:03:45,090 --> 00:03:46,050
第二点要注意
The second thing to notice
95
00:03:46,050 --> 00:03:48,330
是我们有四种类型的 ROUGE 分数。
is that we have four types of ROUGE score.
96
00:03:48,330 --> 00:03:51,480
我们已经看过 ROUGE-1、ROUGE-2 和 ROUGE-L
We've already seen ROUGE-1, ROUGE-2 and ROUGE-L
97
00:03:51,480 --> 00:03:53,760
那么什么是 ROUGE-L sum 呢?
So what is ROUGE-L sum?
98
00:03:53,760 --> 00:03:55,410
其实,ROUGE-L sum 中的 sum
Well, the sum in ROUGE-L sum
99
00:03:55,410 --> 00:03:57,630
指的是这个指标是
refers to the fact that this metric is computed
100
00:03:57,630 --> 00:04:00,240
基于整个摘要计算的,而 ROUGE-L 是
over a whole summary while ROUGE-L is computed
101
00:04:00,240 --> 00:04:02,493
作为单个句子的平均值计算的。
as the average of individual sentences.
102
00:04:04,166 --> 00:04:06,916
(徽标呼啸而过)
(logo whooshing)