subtitles/zh-CN/60_what-is-the-bleu-metric.srt

1
00:00:00,147 --> 00:00:01,412
(屏幕呼啸)
(screen whooshing)

2
00:00:01,412 --> 00:00:02,698
(贴纸弹出)
(sticker popping)

3
00:00:02,698 --> 00:00:05,670
(屏幕呼啸)
(screen whooshing)

4
00:00:05,670 --> 00:00:07,650
- 什么是 BLEU 指标?
- What is the BLEU metric?

5
00:00:07,650 --> 00:00:10,170
对于许多 NLP 任务,我们可以使用常见指标
For many NLP tasks we can use common metrics

6
00:00:10,170 --> 00:00:12,810
比如准确率或 F1 分数,
like accuracy or F1 score, but what do you do

7
00:00:12,810 --> 00:00:14,340
但是当你想衡量模型翻译出的文本的质量时
when you wanna measure the quality of text

8
00:00:14,340 --> 00:00:16,560
该如何评估呢?
that's been translated from a model?

9
00:00:16,560 --> 00:00:18,750
在本视频中,我们将为大家介绍一个
In this video, we'll take a look at a widely used metric

10
00:00:18,750 --> 00:00:20,613
广泛用于机器翻译的指标,叫做 BLEU。
for machine translation called BLEU.

11
00:00:22,290 --> 00:00:23,940
BLEU 背后的基本思想是
The basic idea behind BLEU is to assign

12
00:00:23,940 --> 00:00:26,250
为每个翻译分配一个单一的数字评分
a single numerical score to a translation

13
00:00:26,250 --> 00:00:27,450
用于评估
that tells us how good it is

14
00:00:27,450 --> 00:00:30,199
它与一个或多个参考翻译相比质量的优劣。
compared to one or more reference translations.

15
00:00:30,199 --> 00:00:32,130
在这个例子中,我们有一个西班牙语句子
In this example, we have a sentence in Spanish

16
00:00:32,130 --> 00:00:35,340
已通过某种模型翻译成英文。
that has been translated into English by some model.

17
00:00:35,340 --> 00:00:37,170
如果我们将生成的翻译
If we compare the generated translation

18
00:00:37,170 --> 00:00:39,150
与一些人工参考翻译进行比较,
to some reference human translations,

19
00:00:39,150 --> 00:00:41,190
我们可以看到这个模型其实还不错,
we can see that the model is actually pretty good,

20
00:00:41,190 --> 00:00:43,260
但犯了一个常见的错误。
but has made a common error.

21
00:00:43,260 --> 00:00:46,050
西班牙语单词 tengo 在英语中的意思是 have,
The Spanish word tengo means have in English,

22
00:00:46,050 --> 00:00:48,700
这种一一对应的直译不太自然。
and this one-to-one translation is not quite natural.

23
00:00:49,890 --> 00:00:51,270
那么对于使用某种自动方法生成的翻译
So how can we measure the quality

24
00:00:51,270 --> 00:00:54,270
我们如何来评估它的质量呢?
of a generated translation in some automatic way?

25
00:00:54,270 --> 00:00:56,730
BLEU 采用的方法是
The approach that BLEU takes is to compare the n-grams

26
00:00:56,730 --> 00:00:58,550
将生成翻译的 n-gram 和参考翻译的 n-gram
of the generated translation to the n-grams

27
00:00:58,550 --> 00:01:00,390
进行比较。
in the references.

28
00:01:00,390 --> 00:01:02,400
n-gram 只是一种花哨的说法,
Now, an n-gram is just a fancy way of saying

29
00:01:02,400 --> 00:01:03,960
指的是由 n 个单词组成的片段。
a chunk of n words.

30
00:01:03,960 --> 00:01:05,220
所以让我们从 unigram 开始,
So let's start with unigrams,

31
00:01:05,220 --> 00:01:08,020
它对应于句子中的单个单词。
which corresponds to the individual words in a sentence.

32
00:01:08,880 --> 00:01:11,250
在此示例中,你可以看到生成的翻译中
In this example, you can see that four of the words

33
00:01:11,250 --> 00:01:13,140
有四个单词也出现在
in the generated translation are also found

34
00:01:13,140 --> 00:01:14,990
其中一个参考翻译中。
in one of the reference translations.

35
00:01:16,350 --> 00:01:18,240
一旦我们找到了匹配项,
And once we've found our matches,

36
00:01:18,240 --> 00:01:20,130
给译文打分的一种方法
one way to assign a score to the translation

37
00:01:20,130 --> 00:01:23,070
是计算 unigram 的精度。
is to compute the precision of the unigrams.
38
00:01:23,070 --> 00:01:25,200
这意味着我们只需数一数
This means we just count the number of matching words

39
00:01:25,200 --> 00:01:27,360
生成翻译和参考翻译之间匹配的单词数,
in the generated and reference translations

40
00:01:27,360 --> 00:01:29,660
并除以生成结果中的单词数
and normalize the count by dividing by the number of words

41
00:01:29,660 --> 00:01:30,753
来对计数进行归一化。
in the generation.

42
00:01:31,800 --> 00:01:34,080
在这个例子中,我们找到了四个匹配的词
In this example, we found four matching words

43
00:01:34,080 --> 00:01:36,033
而我们的生成结果中有五个单词。
and our generation has five words.

44
00:01:37,140 --> 00:01:39,690
一般来说,精度的范围是从零到一,
Now, in general, precision ranges from zero to one,

45
00:01:39,690 --> 00:01:42,390
更高的精度分数意味着更好的翻译。
and higher precision scores mean a better translation.

46
00:01:44,160 --> 00:01:45,570
但这还不是全部,
But this isn't really the whole story

47
00:01:45,570 --> 00:01:47,310
因为 unigram 精度有一个问题:
because one problem with unigram precision

48
00:01:47,310 --> 00:01:49,140
翻译模型有时会陷入
is that translation models sometimes get stuck

49
00:01:49,140 --> 00:01:51,330
重复的模式中,把同一个单词
in repetitive patterns and just repeat the same word

50
00:01:51,330 --> 00:01:52,293
重复很多次。
several times.

51
00:01:53,160 --> 00:01:54,690
如果我们只计算匹配单词的数量,
If we just count the number of word matches,

52
00:01:54,690 --> 00:01:56,370
我们可能得到非常高的精度分数,
we can get really high precision scores

53
00:01:56,370 --> 00:01:57,840
即使从人类的角度来看
even though the translation is terrible

54
00:01:57,840 --> 00:01:59,090
这个翻译很糟糕!
from a human perspective!

55
00:02:00,000 --> 00:02:02,970
例如,如果我们的模型只生成单词 six,
For example, if our model just generates the word six,

56
00:02:02,970 --> 00:02:05,020
我们会得到一个满分的 unigram 精度。
we get a perfect unigram precision score.

57
00:02:06,960 --> 00:02:09,930
所以为了解决这个问题,BLEU 使用了修正后的精度,
So to handle this, BLEU uses a modified precision

58
00:02:09,930 --> 00:02:12,210
它会根据一个单词在参考翻译中
that clips the number of times to count a word,

59
00:02:12,210 --> 00:02:13,680
出现的最大次数,
based on the maximum number of times

60
00:02:13,680 --> 00:02:16,399
来截断该单词的计数次数。
it appears in the reference translation.

61
00:02:16,399 --> 00:02:18,630
在这个例子中,单词 six 在参考翻译中只出现了一次,
In this example, the word six only appears once

62
00:02:18,630 --> 00:02:21,360
所以我们把分子截断为 1,
in the reference, so we clip the numerator to one

63
00:02:21,360 --> 00:02:22,710
修正后的 unigram 精度
and the modified unigram precision

64
00:02:22,710 --> 00:02:25,233
现在如预期一样给出了低得多的分数。
now gives a much lower score as expected.

65
00:02:27,660 --> 00:02:29,400
unigram 精度的另一个问题
Another problem with unigram precision

66
00:02:29,400 --> 00:02:30,780
是它没有考虑到
is that it doesn't take into account

67
00:02:30,780 --> 00:02:33,900
单词在翻译中出现的顺序。
the order in which the words appear in the translations.

68
00:02:33,900 --> 00:02:35,700
例如,假设我们让 Yoda 为我们
For example, suppose we had Yoda

69
00:02:35,700 --> 00:02:37,410
翻译这个西班牙语句子,
translate our Spanish sentence,

70
00:02:37,410 --> 00:02:39,457
那么我们可能会得到语序颠倒的结果,
then we might get something backwards like,

71
00:02:39,457 --> 00:02:42,450
比如,“Years sixty thirty have I.”
"Years sixty thirty have I."

72
00:02:42,450 --> 00:02:44,670
在这种情况下,修正后的 unigram 精度
In this case, the modified unigram precision

73
00:02:44,670 --> 00:02:47,393
仍然给出了很高的分数,这并不是我们想要的。
gives a high precision which is not really what we want.
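
(补充说明)下面是一个极简的示意代码,演示上面几条字幕所描述的 unigram 精度以及截断(clipped)精度的计算方式;它只是说明原理的草稿,并非 BLEU 的官方实现,并且简化为只使用单条参考翻译。
(Supplementary note) Below is a minimal sketch of the unigram precision and the clipped precision described in the cues above; it is an illustrative draft rather than the official BLEU implementation, and it is simplified to a single reference translation.

```python
from collections import Counter

def unigram_precision(candidate, reference, clip=True):
    """Unigram precision of a candidate translation against one reference."""
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    matches = 0
    for word, count in cand_counts.items():
        if clip:
            # A word is counted at most as many times as it appears in the reference.
            matches += min(count, ref_counts[word])
        elif word in ref_counts:
            matches += count
    # Normalize by the number of words in the generated translation.
    return matches / sum(cand_counts.values())

reference = "I am thirty six years old"
print(unigram_precision("I have thirty six years", reference, clip=False))  # 0.8 (4 matches / 5 words)
print(unigram_precision("six six six six six", reference, clip=False))      # 1.0, despite a terrible translation
print(unigram_precision("six six six six six", reference, clip=True))       # 0.2 once the count is clipped
```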
74
00:02:48,480 --> 00:02:50,460
所以为了处理词序问题,
So to deal with word ordering problems,

75
00:02:50,460 --> 00:02:52,020
BLEU 实际上会计算几种不同 n-gram 的精度,
BLEU actually computes the precision

76
00:02:52,020 --> 00:02:55,410
然后对结果求平均。
for several different n-grams and then averages the result.

77
00:02:55,410 --> 00:02:57,300
例如,如果我们比较 4-gram,
For example, if we compare 4-grams,

78
00:02:57,300 --> 00:02:58,830
我们可以看到译文中
we can see that there are no matching chunks

79
00:02:58,830 --> 00:03:01,020
没有匹配的四词语块,
of four words in the translations,

80
00:03:01,020 --> 00:03:02,913
所以 4-gram 精度为 0。
and so the 4-gram precision is 0.

81
00:03:05,460 --> 00:03:07,560
用 Datasets 库计算 BLEU 分数
Now, to compute BLEU scores in Datasets library

82
00:03:07,560 --> 00:03:09,120
真的非常简单。
is really very simple.

83
00:03:09,120 --> 00:03:11,100
你只需使用 load_metric 函数,
You just use the load_metric function,

84
00:03:11,100 --> 00:03:13,290
提供模型的预测结果及对应的参考翻译,
provide your model's predictions with their references

85
00:03:13,290 --> 00:03:14,390
然后就一切就绪!
and you're good to go!

86
00:03:16,470 --> 00:03:19,200
输出将包含几个值得关注的字段。
The output will contain several fields of interest.

87
00:03:19,200 --> 00:03:20,490
precisions 字段包含
The precisions field contains

88
00:03:20,490 --> 00:03:23,133
每种 n-gram 各自的精度分数。
all the individual precision scores for each n-gram.

89
00:03:25,050 --> 00:03:26,940
BLEU 分数本身
The BLEU score itself is then calculated

90
00:03:26,940 --> 00:03:30,090
则是通过取这些精度分数的几何平均值来计算的。
by taking the geometric mean of the precision scores.

91
00:03:30,090 --> 00:03:32,790
默认情况下,输出的是全部四种 n-gram 精度的平均值,
And by default, the mean of all four n-gram precisions

92
00:03:32,790 --> 00:03:35,793
这个指标有时也称为 BLEU-4。
is reported, a metric that is sometimes also called BLEU-4.

93
00:03:36,660 --> 00:03:38,880
在此示例中,我们可以看到 BLEU 分数为零,
In this example, we can see the BLEU score is zero

94
00:03:38,880 --> 00:03:40,780
因为 4-gram 精度为零。
because the 4-gram precision was zero.

95
00:03:43,290 --> 00:03:45,390
BLEU 指标有一些不错的特性,
Now, the BLEU metric has some nice properties,

96
00:03:45,390 --> 00:03:47,520
但它离完美的指标还很远。
but it is far from a perfect metric.

97
00:03:47,520 --> 00:03:49,440
好的方面是它很容易计算,
The good properties are that it's easy to compute

98
00:03:49,440 --> 00:03:50,970
而且被广泛用于研究,
and it's widely used in research

99
00:03:50,970 --> 00:03:52,620
这样你就可以在通用基准上将你的模型
so you can compare your model against others

100
00:03:52,620 --> 00:03:54,630
与其他模型进行比较。
on common benchmarks.

101
00:03:54,630 --> 00:03:56,670
另一方面,BLEU 有几个大问题,
On the other hand, there are several big problems with BLEU,

102
00:03:56,670 --> 00:03:58,830
包括它实际上不考虑语义,
including the fact it doesn't incorporate semantics

103
00:03:58,830 --> 00:04:01,920
而且在非英语语言上表现不佳。
and it struggles a lot on non-English languages.

104
00:04:01,920 --> 00:04:02,790
BLEU 的另一个问题
Another problem with BLEU

105
00:04:02,790 --> 00:04:04,620
是它假定人工翻译
is that it assumes the human translations

106
00:04:04,620 --> 00:04:05,820
已经被词元化,
have already been tokenized

107
00:04:05,820 --> 00:04:07,320
这使得比较使用不同分词器的模型
and this makes it hard to compare models

108
00:04:07,320 --> 00:04:08,820
变得困难。
that use different tokenizers.

109
00:04:10,590 --> 00:04:12,570
所以正如我们所见,衡量文本的质量
So as we've seen, measuring the quality of texts

110
00:04:12,570 --> 00:04:15,570
仍然是 NLP 研究中一个困难且开放的问题。
is still a difficult and open problem in NLP research.
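
(补充说明)下面的代码大致对应上面字幕中提到的 load_metric 用法;注意在较新的版本中这些指标已迁移到单独的 evaluate 库(evaluate.load("bleu")),所以这里只是假设所用 datasets 版本仍提供 load_metric 的一个示意。
(Supplementary note) The snippet below roughly corresponds to the load_metric usage mentioned in the cues above; note that newer releases have moved these metrics to the separate evaluate library (evaluate.load("bleu")), so this is a sketch assuming a datasets version that still ships load_metric.

```python
from datasets import load_metric

bleu = load_metric("bleu")

# This metric expects pre-tokenized text: each prediction is a list of words,
# and each prediction is paired with a list of tokenized reference translations.
predictions = [["I", "have", "thirty", "six", "years"]]
references = [[["I", "am", "thirty", "six", "years", "old"],
               ["I", "am", "36", "years", "old"]]]

results = bleu.compute(predictions=predictions, references=references)
print(results["precisions"])  # one modified precision per n-gram order (1-grams up to 4-grams)
print(results["bleu"])        # 0.0 here, because the 4-gram precision is zero
```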
111
00:04:15,570 --> 00:04:17,580
对于机器翻译,目前的推荐
For machine translation, the current recommendation

112
00:04:17,580 --> 00:04:19,440
是使用 SacreBLEU 指标,
is to use the SacreBLEU metric,

113
00:04:19,440 --> 00:04:22,830
它解决了 BLEU 在词元化方面的限制。
which addresses the tokenization limitations of BLEU.

114
00:04:22,830 --> 00:04:24,360
正如你在此示例中所见,
As you can see in this example,

115
00:04:24,360 --> 00:04:26,580
计算 SacreBLEU 分数的方式几乎与
computing the SacreBLEU score is almost identical

116
00:04:26,580 --> 00:04:28,020
计算 BLEU 完全一致。
to the BLEU one.

117
00:04:28,020 --> 00:04:30,360
主要区别在于我们现在传入的是文本列表,
The main difference is that we now pass a list of texts

118
00:04:30,360 --> 00:04:32,640
而不是译文的单词列表,
instead of a list of words to the translations,

119
00:04:32,640 --> 00:04:35,640
SacreBLEU 会在底层负责词元化。
and SacreBLEU takes care of the tokenization under the hood.

120
00:04:36,582 --> 00:04:39,499
(屏幕呼啸)
(screen whooshing)
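
(补充说明)作为对照,下面是一个计算 SacreBLEU 分数的示意,与上面的 BLEU 调用几乎相同,只是直接传入文本字符串;它同样假设所用 datasets 版本仍提供 load_metric。
(Supplementary note) For comparison, here is a sketch of computing the SacreBLEU score; it is almost identical to the BLEU call above except that plain text strings are passed in, and it makes the same assumption about a datasets version that still provides load_metric.

```python
from datasets import load_metric

sacrebleu = load_metric("sacrebleu")

# Plain strings this time: SacreBLEU handles tokenization under the hood.
predictions = ["I have thirty six years"]
references = [["I am thirty six years old", "I am 36 years old"]]

results = sacrebleu.compute(predictions=predictions, references=references)
print(results["score"])  # SacreBLEU reports its score on a 0-100 scale
```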