1 00:00:00,624 --> 00:00:03,374 (徽标呼啸而过) (logo whooshing) 2 00:00:05,700 --> 00:00:07,740 - 什么是 ROUGE 指标? - What is the ROUGE metric? 3 00:00:07,740 --> 00:00:08,880 对于许多 NLP 任务 For many NLP tasks 4 00:00:08,880 --> 00:00:12,270 我们可以使用常见的指标,如准确性或 F1 分数。 we can use common metrics like accuracy or F1 score. 5 00:00:12,270 --> 00:00:13,650 但是当你想评估类似从像 T5 这样的模型上 But what do you do when you wanna measure something 6 00:00:13,650 --> 00:00:16,920 所获得的文本摘要的质量,该如何操作呢? like the quality of a summary from a model like T5? 7 00:00:16,920 --> 00:00:18,180 在本视频中,我们将了解 In this video, we'll take a look 8 00:00:18,180 --> 00:00:21,180 被广泛用于文本摘要的评估指标,称为 ROUGE。 at a widely used metric for text summarization called ROUGE. 9 00:00:22,740 --> 00:00:24,660 ROUGE 实际上有几种变体 There are actually several variants of ROUGE 10 00:00:24,660 --> 00:00:26,190 但所有这些变体背后的基本思想 but the basic idea behind all of them 11 00:00:26,190 --> 00:00:27,840 是为每个摘要分配一个单独的分数 is to assign a single numerical score 12 00:00:27,840 --> 00:00:30,000 来告诉我们相比一个或者多个参考的摘要 to a summary that tells us how good it is 13 00:00:30,000 --> 00:00:32,774 当前的摘要有多好。 compared to one or more reference summaries. 14 00:00:32,774 --> 00:00:34,020 在这个例子中,我们有一个书评 In this example, we have a book review 15 00:00:34,020 --> 00:00:36,570 它是通过某些模型摘要获得。 that has been summarized by some model. 16 00:00:36,570 --> 00:00:38,320 如果我们将生成的摘要 If we compare the generated summary 17 00:00:39,168 --> 00:00:40,260 和一些人工的摘要相比较, to some reference human summaries, 18 00:00:40,260 --> 00:00:42,841 我们可以看到该模型实际上非常好 we can see that the model is actually pretty good 19 00:00:42,841 --> 00:00:44,063 并且只相差一两个词。 and only differs by a word or two. 20 00:00:45,060 --> 00:00:46,260 那么我们如何通过自动的方式 So how can we measure the quality 21 00:00:46,260 --> 00:00:49,050 评估生成的摘要的质量呢? of a generated summary in an automatic way? 22 00:00:49,050 --> 00:00:51,510 ROUGE 采用的方法是比较生成的摘要的 n-gram The approach that ROUGE takes is to compare the n-grams 23 00:00:51,510 --> 00:00:55,200 和参考文献的 n-gram。 of the generated summary to the n-grams of the references. 24 00:00:55,200 --> 00:00:58,590 n-gram 只是一种表达 N 个单词的词块的流行说法。 And n-gram is just a fancy way of saying a chunk of N words. 25 00:00:58,590 --> 00:01:00,030 所以让我们从 unigram 开始 So let's start with unigrams 26 00:01:00,030 --> 00:01:02,780 它对应于句子中的各个单词。 which correspond to the individual words in a sentence. 27 00:01:03,780 --> 00:01:05,250 在这个例子中,你可以看到 In this example, you can see that six 28 00:01:05,250 --> 00:01:07,650 在生成的摘要中有 6 个单词,也出现在 of the words in the generated summary are also found 29 00:01:07,650 --> 00:01:09,420 其中一个参考的摘要中。 in one of the reference summaries. 30 00:01:09,420 --> 00:01:11,310 而比较 unigram 的 rouge metric And the rouge metric that compares unigrams 31 00:01:11,310 --> 00:01:12,260 被称为 ROUGE-1。 is called ROUGE-1. 32 00:01:14,533 --> 00:01:16,770 现在我们找到了我们的匹配项,一种为摘要分配分数的方法 Now that we found our matches, one way to assign a score 33 00:01:16,770 --> 00:01:20,280 是计算 unigram 的被召回次数。 to the summary is to compute the recall of the unigrams. 34 00:01:20,280 --> 00:01:21,540 这意味着我们只计算 This means we just count the number 35 00:01:21,540 --> 00:01:22,950 生成摘要和参考摘要的匹配词 of matching words in the generated 36 00:01:22,950 --> 00:01:25,290 并将计数通过除以参考文本中的单词数 and reference summaries and normalize the count 37 00:01:25,290 --> 00:01:28,200 进行规范化处理。 by dividing by the number of words in the reference. 38 00:01:28,200 --> 00:01:30,450 在这个例子中,我们找到了六个匹配的词 In this example, we found six matching words 39 00:01:30,450 --> 00:01:32,160 我们的参考文本有六个词。 and our reference has six words. 40 00:01:32,160 --> 00:01:33,933 所以我们的 unigram 召回是完美的。 So our unigram recall is perfect. 41 00:01:34,800 --> 00:01:35,810 这意味着在参考摘要中 This means that all of the words 42 00:01:35,810 --> 00:01:37,500 的所有的词都会出现 in the reference summary have been produced 43 00:01:37,500 --> 00:01:38,550 在生成的摘要中。 in the generated one. 44 00:01:40,050 --> 00:01:42,360 现在,完美的召回听起来不错,但想象一下 Now, perfect recall sounds great, but imagine 45 00:01:42,360 --> 00:01:44,520 如果我们生成的摘要是这样的 if our generated summary have been something like 46 00:01:44,520 --> 00:01:45,720 我真的,真的,真的, I really, really, really, 47 00:01:45,720 --> 00:01:48,150 真的很喜欢阅读 Hunger Games。 really loved reading the Hunger Games. 48 00:01:48,150 --> 00:01:49,378 这也会有完美的召回 This would also have perfect recall 49 00:01:49,378 --> 00:01:51,330 但可以说是一个更糟糕的总结, but is arguably a worse summary, 50 00:01:51,330 --> 00:01:52,653 因为它很冗长。 since it is verbose. 51 00:01:53,550 --> 00:01:54,600 为了应对这些场景, To deal with these scenarios, 52 00:01:54,600 --> 00:01:56,190 我们还可以计算精度, we can also compute precision, 53 00:01:56,190 --> 00:01:58,380 在 ROUGE 上下文中,它衡量了 which in the ROUGE context measures how much 54 00:01:58,380 --> 00:02:00,810 在生成器摘要中具有多少相关性。 of the generator summary was relevant. 55 00:02:00,810 --> 00:02:03,630 在实际操作中,通常计算精度和召回率 In practice, both precision and recall are usually computed 56 00:02:03,630 --> 00:02:05,493 然后报告 F1 分数。 and then the F1 score is reported. 57 00:02:07,170 --> 00:02:08,542 现在我们可以改变比较的粒度 Now we can change the granularity 58 00:02:08,542 --> 00:02:13,020 将比较的对象由 unigram 改变为 bigram。 of the comparison by comparing bigrams instead of unigrams. 59 00:02:13,020 --> 00:02:15,090 使用 bigram,我们将句子 With bigrams, we chunk the sentence into pairs 60 00:02:15,090 --> 00:02:17,910 切分为成对的连续单词, of consecutive words and then count how many pairs 61 00:02:17,910 --> 00:02:21,360 然后计算生成的摘要中有多少对出现在参考摘要中。 in the generated summary are present in the reference one. 62 00:02:21,360 --> 00:02:23,880 这给了我们 ROUGE-2 精确度和召回率 This gives us ROUGE-2 precision and recall 63 00:02:23,880 --> 00:02:24,780 正如我们所见, which as we can see, 64 00:02:24,780 --> 00:02:27,780 低于之前的 ROUGE-1 分数。 is lower than the ROUGE-1 scores from earlier. 65 00:02:27,780 --> 00:02:29,400 现在,如果摘要很长, Now, if the summaries are long, 66 00:02:29,400 --> 00:02:31,740 ROUGE-2 分数通常会很小 the ROUGE-2 scores will generally be small 67 00:02:31,740 --> 00:02:34,290 因为要匹配的 bios 更少。 because there are fewer bios to match. 68 00:02:34,290 --> 00:02:36,870 这也适用于重写式摘要。 And this is also true for abstractive summarization. 69 00:02:36,870 --> 00:02:39,993 所以通常会报告 ROUGE-1 和 ROUGE-2 分数。 So both ROUGE-1 and ROUGE-2 scores are usually reported. 70 00:02:42,000 --> 00:02:45,330 我们将讨论的最后一个 ROUGE 变体是 ROUGE L。 The last ROUGE variant we will discuss is ROUGE L. 71 00:02:45,330 --> 00:02:47,160 ROUGE L 不比较 ngram ROUGE L doesn't compare ngrams 72 00:02:47,160 --> 00:02:49,572 而是将每个摘要视为一系列单词 but instead treats each summary as a sequence of words 73 00:02:49,572 --> 00:02:53,403 然后寻找最长的公共子序列或 LCS。 and then looks for the longest common subsequence or LCS. 74 00:02:54,775 --> 00:02:56,130 子序列是以相同的相对顺序 A subsequence is a sequence that appears 75 00:02:56,130 --> 00:02:59,760 出现的序列,但不一定是连续的。 in the same relative order, but not necessarily contiguous. 76 00:02:59,760 --> 00:03:03,210 所以在这个例子中,我喜欢阅读 Hunger Games, So in this example, I loved reading the Hunger Games, 77 00:03:03,210 --> 00:03:06,930 是两个摘要之间最长的公共子序列。 is the longest common subsequence between the two summaries. 78 00:03:06,930 --> 00:03:08,610 而相比 ROUGE-1 或 ROUGE-2 And the main advantage of ROUGE L 79 00:03:08,610 --> 00:03:11,670 ROUGE L 的主要优势是它不依赖于 over ROUGE-1 or ROUGE-2 is that it doesn't depend 80 00:03:11,670 --> 00:03:14,100 在连续的 n-gram 匹配上, on consecutive n-gram matches, and so it tends 81 00:03:14,100 --> 00:03:16,650 所以它倾向于更准确地捕捉句子结构。 to capture sentence structure much more accurately. 82 00:03:18,150 --> 00:03:19,440 现在在 Dataset 库中 Now to compute ROUGE scores 83 00:03:19,440 --> 00:03:21,660 计算 ROUGE 分数很简单。 in the data sets library is very simple. 84 00:03:21,660 --> 00:03:23,910 你只需使用 load_metric 函数, You just use the load_metric function, 85 00:03:23,910 --> 00:03:26,400 提供你的模型摘要以及参考摘要 provide your model summaries along with the references 86 00:03:26,400 --> 00:03:27,500 你可以开始了。 and you're good to go. 87 00:03:28,770 --> 00:03:30,120 计算的输出 The output from the calculation 88 00:03:30,120 --> 00:03:31,507 包含很多信息。 contains a lot of information. 89 00:03:31,507 --> 00:03:34,560 我们首先可以看到的是 The first thing we can see is that the confidence intervals 90 00:03:34,560 --> 00:03:36,090 每个 ROUGE 分数的置信区间 of each ROUGE score are provided 91 00:03:36,090 --> 00:03:39,030 每个 ROUGE 分数的置信区间。 in the low, mid and high fields. 92 00:03:39,030 --> 00:03:40,980 当比较两个或多个模型时, This is really useful if you wanna know the spread 93 00:03:40,980 --> 00:03:43,730 如果你想知道 ROUGE 分数的范围,这就真的很有用。 of your ROUGE scores when comparing two or more models. 94 00:03:45,090 --> 00:03:46,050 第二点要注意 The second thing to notice 95 00:03:46,050 --> 00:03:48,330 是我们有四种类型的 ROUGE 分数。 is that we have four types of ROUGE score. 96 00:03:48,330 --> 00:03:51,480 我们已经看过 ROUGE-1、ROUGE-2 和 ROUGE-L We've already seen ROUGE-1, ROUGE-2 and ROUGE-L 97 00:03:51,480 --> 00:03:53,760 那么什么是 ROUGE-L sum 呢? So what is ROUGE-L sum? 98 00:03:53,760 --> 00:03:55,410 其实就是 ROUGEL 的总和 Well, the sum in ROUGEL's sum 99 00:03:55,410 --> 00:03:57,630 指的是这个指标是 refers to the fact that this metric is computed 100 00:03:57,630 --> 00:04:00,240 当 ROUGE-L 作为单个句子的平均值计算时, over a whole summary while ROUGE-L is computed 101 00:04:00,240 --> 00:04:02,493 基于整个摘要计算出来的。 as the average of individual sentences. 102 00:04:04,166 --> 00:04:06,916 (徽标呼啸而过) (logo whooshing)