1
00:00:00,624 --> 00:00:03,374
（徽标呼啸而过）
(logo whooshing)

2
00:00:05,700 --> 00:00:07,740
- 什么是 ROUGE 指标？
- What is the ROUGE metric?

3
00:00:07,740 --> 00:00:08,880
对于许多 NLP 任务
For many NLP tasks

4
00:00:08,880 --> 00:00:12,270
我们可以使用常见的指标，如准确性或 F1 分数。
we can use common metrics like accuracy or F1 score.

5
00:00:12,270 --> 00:00:13,650
但是当你想评估类似从像 T5 这样的模型上
But what do you do when you wanna measure something

6
00:00:13,650 --> 00:00:16,920
所获得的文本摘要的质量，该如何操作呢？
like the quality of a summary from a model like T5?

7
00:00:16,920 --> 00:00:18,180
在本视频中，我们将了解
In this video, we'll take a look

8
00:00:18,180 --> 00:00:21,180
被广泛用于文本摘要的评估指标，称为 ROUGE。
at a widely used metric for text summarization called ROUGE.

9
00:00:22,740 --> 00:00:24,660
ROUGE 实际上有几种变体
There are actually several variants of ROUGE

10
00:00:24,660 --> 00:00:26,190
但所有这些变体背后的基本思想
but the basic idea behind all of them

11
00:00:26,190 --> 00:00:27,840
是为每个摘要分配一个单独的分数
is to assign a single numerical score

12
00:00:27,840 --> 00:00:30,000
来告诉我们相比一个或者多个参考的摘要
to a summary that tells us how good it is

13
00:00:30,000 --> 00:00:32,774
当前的摘要有多好。
compared to one or more reference summaries.

14
00:00:32,774 --> 00:00:34,020
在这个例子中，我们有一个书评
In this example, we have a book review

15
00:00:34,020 --> 00:00:36,570
它是通过某些模型摘要获得。
that has been summarized by some model.

16
00:00:36,570 --> 00:00:38,320
如果我们将生成的摘要
If we compare the generated summary

17
00:00:39,168 --> 00:00:40,260
和一些人工的摘要相比较，
to some reference human summaries, 

18
00:00:40,260 --> 00:00:42,841
我们可以看到该模型实际上非常好
we can see that the model is actually pretty good

19
00:00:42,841 --> 00:00:44,063
并且只相差一两个词。
and only differs by a word or two.

20
00:00:45,060 --> 00:00:46,260
那么我们如何通过自动的方式
So how can we measure the quality

21
00:00:46,260 --> 00:00:49,050
评估生成的摘要的质量呢？
of a generated summary in an automatic way?

22
00:00:49,050 --> 00:00:51,510
ROUGE 采用的方法是比较生成的摘要的 n-gram
The approach that ROUGE takes is to compare the n-grams

23
00:00:51,510 --> 00:00:55,200
和参考摘要的 n-gram。
of the generated summary to the n-grams of the references.

24
00:00:55,200 --> 00:00:58,590
n-gram 只是一种表达 N 个单词的词块的流行说法。
An n-gram is just a fancy way of saying a chunk of N words.

25
00:00:58,590 --> 00:01:00,030
所以让我们从 unigram 开始
So let's start with unigrams

26
00:01:00,030 --> 00:01:02,780
它对应于句子中的各个单词。
which correspond to the individual words in a sentence.

27
00:01:03,780 --> 00:01:05,250
在这个例子中，你可以看到
In this example, you can see that six

28
00:01:05,250 --> 00:01:07,650
在生成的摘要中有 6 个单词，也出现在
of the words in the generated summary are also found

29
00:01:07,650 --> 00:01:09,420
其中一个参考的摘要中。
in one of the reference summaries.

30
00:01:09,420 --> 00:01:11,310
而比较 unigram 的 ROUGE 指标
And the ROUGE metric that compares unigrams

31
00:01:11,310 --> 00:01:12,260
被称为 ROUGE-1。
is called ROUGE-1.

32
00:01:14,533 --> 00:01:16,770
现在我们找到了我们的匹配项，一种为摘要分配分数的方法
Now that we found our matches, one way to assign a score

33
00:01:16,770 --> 00:01:20,280
是计算 unigram 的召回率。
to the summary is to compute the recall of the unigrams.

34
00:01:20,280 --> 00:01:21,540
这意味着我们只计算
This means we just count the number

35
00:01:21,540 --> 00:01:22,950
生成摘要和参考摘要的匹配词
of matching words in the generated

36
00:01:22,950 --> 00:01:25,290
并将计数通过除以参考文本中的单词数
and reference summaries and normalize the count

37
00:01:25,290 --> 00:01:28,200
进行规范化处理。
by dividing by the number of words in the reference.

38
00:01:28,200 --> 00:01:30,450
在这个例子中，我们找到了六个匹配的词
In this example, we found six matching words

39
00:01:30,450 --> 00:01:32,160
我们的参考文本有六个词。
and our reference has six words.

40
00:01:32,160 --> 00:01:33,933
所以我们的 unigram 召回是完美的。
So our unigram recall is perfect.

41
00:01:34,800 --> 00:01:35,810
这意味着在参考摘要中
This means that all of the words

42
00:01:35,810 --> 00:01:37,500
的所有的词都会出现
in the reference summary have been produced

43
00:01:37,500 --> 00:01:38,550
在生成的摘要中。
in the generated one.

44
00:01:40,050 --> 00:01:42,360
现在，完美的召回听起来不错，但想象一下
Now, perfect recall sounds great, but imagine

45
00:01:42,360 --> 00:01:44,520
如果我们生成的摘要是这样的
if our generated summary had been something like

46
00:01:44,520 --> 00:01:45,720
我真的，真的，真的，
I really, really, really,

47
00:01:45,720 --> 00:01:48,150
真的很喜欢阅读 Hunger Games。
really loved reading the Hunger Games.

48
00:01:48,150 --> 00:01:49,378
这也会有完美的召回
This would also have perfect recall

49
00:01:49,378 --> 00:01:51,330
但可以说是一个更糟糕的总结，
but is arguably a worse summary,

50
00:01:51,330 --> 00:01:52,653
因为它很冗长。
since it is verbose.

51
00:01:53,550 --> 00:01:54,600
为了应对这些场景，
To deal with these scenarios,

52
00:01:54,600 --> 00:01:56,190
我们还可以计算精度，
we can also compute precision,

53
00:01:56,190 --> 00:01:58,380
在 ROUGE 上下文中，它衡量了
which in the ROUGE context measures how much

54
00:01:58,380 --> 00:02:00,810
生成的摘要中有多少内容是相关的。
of the generated summary was relevant.

55
00:02:00,810 --> 00:02:03,630
在实际操作中，通常计算精度和召回率
In practice, both precision and recall are usually computed

56
00:02:03,630 --> 00:02:05,493
然后报告 F1 分数。
and then the F1 score is reported.
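
The recall, precision and F1 computation described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular library: the function name `rouge_n`, the whitespace tokenization and the lowercasing are assumptions, and the reference sentence is assumed to be "I loved reading the Hunger Games" based on the example in the video.

```python
from collections import Counter


def rouge_n(generated, reference, n=1):
    """Toy ROUGE-N: n-gram recall, precision and F1 (hypothetical helper)."""

    def ngrams(text):
        # Assumption: lowercase and split on whitespace, no stemming.
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    gen, ref = ngrams(generated), ngrams(reference)
    # Counter intersection clips each n-gram to the smaller of its two counts.
    overlap = sum((gen & ref).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(gen.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


verbose = "I really really really really loved reading the Hunger Games"
reference = "I loved reading the Hunger Games"
scores = rouge_n(verbose, reference)
print(scores)  # recall is 1.0, precision is 0.6, F1 is about 0.75
```

The verbose summary keeps perfect recall, but only 6 of its 10 words are relevant, so precision drops and pulls the F1 score down with it.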

57
00:02:07,170 --> 00:02:08,542
现在我们可以改变比较的粒度
Now we can change the granularity

58
00:02:08,542 --> 00:02:13,020
将比较的对象由 unigram 改变为 bigram。
of the comparison by comparing bigrams instead of unigrams.

59
00:02:13,020 --> 00:02:15,090
使用 bigram，我们将句子
With bigrams, we chunk the sentence into pairs

60
00:02:15,090 --> 00:02:17,910
切分为成对的连续单词，
of consecutive words and then count how many pairs

61
00:02:17,910 --> 00:02:21,360
然后计算生成的摘要中有多少对出现在参考摘要中。
in the generated summary are present in the reference one.

62
00:02:21,360 --> 00:02:23,880
这给了我们 ROUGE-2 精确度和召回率
This gives us ROUGE-2 precision and recall

63
00:02:23,880 --> 00:02:24,780
正如我们所见，
which as we can see,

64
00:02:24,780 --> 00:02:27,780
低于之前的 ROUGE-1 分数。
are lower than the ROUGE-1 scores from earlier.
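
To make the bigram comparison concrete, here is a small sketch that chunks two sentences into pairs of consecutive words and counts the shared pairs. The generated sentence "I absolutely loved reading the Hunger Games" is a hypothetical example, and the helper names `bigrams` and `rouge_2` are illustrative only.

```python
from collections import Counter


def bigrams(text):
    """Chunk a sentence into pairs of consecutive words (assumes whitespace tokens)."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge_2(generated, reference):
    """Toy ROUGE-2 recall and precision from clipped bigram overlap."""
    gen, ref = bigrams(generated), bigrams(reference)
    overlap = sum((gen & ref).values())
    return {
        "overlap": overlap,
        "recall": overlap / max(sum(ref.values()), 1),
        "precision": overlap / max(sum(gen.values()), 1),
    }


generated = "I absolutely loved reading the Hunger Games"  # hypothetical model output
reference = "I loved reading the Hunger Games"
print(rouge_2(generated, reference))
```

The single extra word breaks the bigram ("i", "loved"), so only 4 of the 5 reference bigrams are matched, even though every individual reference word appears in the generated sentence.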

65
00:02:27,780 --> 00:02:29,400
现在，如果摘要很长，
Now, if the summaries are long,

66
00:02:29,400 --> 00:02:31,740
ROUGE-2 分数通常会很小
the ROUGE-2 scores will generally be small

67
00:02:31,740 --> 00:02:34,290
因为要匹配的 bigram 更少。
because there are fewer bigrams to match.

68
00:02:34,290 --> 00:02:36,870
这也适用于重写式摘要。
And this is also true for abstractive summarization.

69
00:02:36,870 --> 00:02:39,993
所以通常会报告 ROUGE-1 和 ROUGE-2 分数。
So both ROUGE-1 and ROUGE-2 scores are usually reported.

70
00:02:42,000 --> 00:02:45,330
我们将讨论的最后一个 ROUGE 变体是 ROUGE-L。
The last ROUGE variant we will discuss is ROUGE-L.

71
00:02:45,330 --> 00:02:47,160
ROUGE-L 不比较 n-gram
ROUGE-L doesn't compare n-grams

72
00:02:47,160 --> 00:02:49,572
而是将每个摘要视为一系列单词
but instead treats each summary as a sequence of words

73
00:02:49,572 --> 00:02:53,403
然后寻找最长的公共子序列或 LCS。
and then looks for the longest common subsequence or LCS.

74
00:02:54,775 --> 00:02:56,130
子序列是以相同的相对顺序
A subsequence is a sequence that appears

75
00:02:56,130 --> 00:02:59,760
出现的序列，但不一定是连续的。
in the same relative order, but not necessarily contiguous.

76
00:02:59,760 --> 00:03:03,210
所以在这个例子中，我喜欢阅读 Hunger Games，
So in this example, I loved reading the Hunger Games,

77
00:03:03,210 --> 00:03:06,930
是两个摘要之间最长的公共子序列。
is the longest common subsequence between the two summaries.
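
The longest common subsequence can be found with the classic dynamic-programming table. The sketch below is a generic LCS routine under simple assumptions (whitespace tokens, lowercasing), not the code of any specific ROUGE package, and the generated sentence is again a hypothetical example.

```python
def longest_common_subsequence(a, b):
    """Return one longest common subsequence of two token lists."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])

    # Walk back from the bottom-right corner to recover the words themselves.
    result = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            result.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return result[::-1]


generated = "I absolutely loved reading the Hunger Games".lower().split()  # hypothetical
reference = "I loved reading the Hunger Games".lower().split()
lcs = longest_common_subsequence(generated, reference)
print(lcs, len(lcs) / len(reference))  # LCS words and LCS-based recall
```

Because the matched words only need to keep their relative order, the extra word in the generated sentence does not break the match the way it does for bigrams.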

78
00:03:06,930 --> 00:03:08,610
而相比 ROUGE-1 或 ROUGE-2 
And the main advantage of ROUGE-L

79
00:03:08,610 --> 00:03:11,670
ROUGE-L 的主要优势是它不依赖于
over ROUGE-1 or ROUGE-2 is that it doesn't depend

80
00:03:11,670 --> 00:03:14,100
连续的 n-gram 匹配，
on consecutive n-gram matches, and so it tends

81
00:03:14,100 --> 00:03:16,650
所以它倾向于更准确地捕捉句子结构。
to capture sentence structure much more accurately.

82
00:03:18,150 --> 00:03:19,440
现在在 Datasets 库中
Now, computing ROUGE scores

83
00:03:19,440 --> 00:03:21,660
计算 ROUGE 分数很简单。
in the Datasets library is very simple.

84
00:03:21,660 --> 00:03:23,910
你只需使用 load_metric 函数，
You just use the load_metric function,

85
00:03:23,910 --> 00:03:26,400
提供你的模型摘要以及参考摘要
provide your model summaries along with the references

86
00:03:26,400 --> 00:03:27,500
你可以开始了。
and you're good to go.

87
00:03:28,770 --> 00:03:30,120
计算的输出
The output from the calculation

88
00:03:30,120 --> 00:03:31,507
包含很多信息。
contains a lot of information.

89
00:03:31,507 --> 00:03:34,560
我们首先可以看到的是
The first thing we can see is that the confidence intervals

90
00:03:34,560 --> 00:03:36,090
每个 ROUGE 分数的置信区间
of each ROUGE score are provided

91
00:03:36,090 --> 00:03:39,030
在 low、mid 和 high 字段中提供。
in the low, mid and high fields.
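
The video does not explain how the low, mid and high values are obtained. One common way to get such an interval, assumed here purely for illustration, is bootstrap resampling of the per-example scores. The function name `bootstrap_interval`, its parameters and the per-example scores below are all hypothetical.

```python
import random
import statistics


def bootstrap_interval(scores, n_samples=1000, confidence=0.95, seed=0):
    """Estimate low / mid / high for the mean score by bootstrap resampling."""
    rng = random.Random(seed)
    # Resample the scores with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_samples)
    )

    def percentile(p):
        # Nearest-rank percentile over the sorted bootstrap means.
        return means[min(int(p * n_samples), n_samples - 1)]

    tail = (1 - confidence) / 2
    return {
        "low": percentile(tail),
        "mid": percentile(0.5),
        "high": percentile(1 - tail),
    }


# Made-up per-example ROUGE-1 F1 scores for one model.
per_example_f1 = [0.62, 0.75, 0.58, 0.91, 0.70, 0.66, 0.80]
print(bootstrap_interval(per_example_f1))
```

A wide gap between low and high suggests the average is sensitive to which examples were evaluated, which is worth keeping in mind when two models have similar mid scores.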

92
00:03:39,030 --> 00:03:40,980
当比较两个或多个模型时，
This is really useful if you wanna know the spread

93
00:03:40,980 --> 00:03:43,730
如果你想知道 ROUGE 分数的范围，这就真的很有用。
of your ROUGE scores when comparing two or more models.

94
00:03:45,090 --> 00:03:46,050
第二点要注意
The second thing to notice

95
00:03:46,050 --> 00:03:48,330
是我们有四种类型的 ROUGE 分数。
is that we have four types of ROUGE score.

96
00:03:48,330 --> 00:03:51,480
我们已经看过 ROUGE-1、ROUGE-2 和 ROUGE-L
We've already seen ROUGE-1, ROUGE-2 and ROUGE-L

97
00:03:51,480 --> 00:03:53,760
那么什么是 ROUGE-Lsum 呢？
So what is ROUGE-Lsum?

98
00:03:53,760 --> 00:03:55,410
其实，ROUGE-Lsum 中的 sum
Well, the sum in ROUGE-Lsum

99
00:03:55,410 --> 00:03:57,630
指的是这个指标是
refers to the fact that this metric is computed

100
00:03:57,630 --> 00:04:00,240
当 ROUGE-L 作为单个句子的平均值计算时，
over a whole summary while ROUGE-L is computed

101
00:04:00,240 --> 00:04:02,493
基于整个摘要计算出来的。
as the average of individual sentences.

102
00:04:04,166 --> 00:04:06,916
（徽标呼啸而过）
(logo whooshing)