subtitles/zh-CN/60_what-is-the-bleu-metric.srt
1
00:00:00,147 --> 00:00:01,412
(屏幕呼啸)
(screen whooshing)
2
00:00:01,412 --> 00:00:02,698
(贴纸弹出)
(sticker popping)
3
00:00:02,698 --> 00:00:05,670
(屏幕呼啸)
(screen whooshing)
4
00:00:05,670 --> 00:00:07,650
- 什么是 BLEU 指标?
- What is the BLEU metric?
5
00:00:07,650 --> 00:00:10,170
对于许多 NLP 任务,我们可以使用常见指标
For many NLP tasks we can use common metrics
6
00:00:10,170 --> 00:00:12,810
比如准确性或 F1 分数,
like accuracy or F1 score, but what do you do
7
00:00:12,810 --> 00:00:14,340
但是当你想衡量模型所翻译的文本的质量时
when you wanna measure the quality of text
8
00:00:14,340 --> 00:00:16,560
该如何评估呢?
that's been translated from a model?
9
00:00:16,560 --> 00:00:18,750
在本视频中,我们将为大家介绍一个
In this video, we'll take a look at a widely used metric
10
00:00:18,750 --> 00:00:20,613
广泛用于机器翻译的指标,叫做 BLEU。
for machine translation called BLEU.
11
00:00:22,290 --> 00:00:23,940
BLEU 背后的基本逻辑是
The basic idea behind BLEU is to assign
12
00:00:23,940 --> 00:00:26,250
为每个翻译分配一个单独的数字评分
a single numerical score to a translation
13
00:00:26,250 --> 00:00:27,450
用于评估
that tells us how good it is
14
00:00:27,450 --> 00:00:30,199
它与一个或多个参考翻译相比,质量的优劣。
compared to one or more reference translations.
15
00:00:30,199 --> 00:00:32,130
在这个例子中,我们有一个西班牙语句子
In this example, we have a sentence in Spanish
16
00:00:32,130 --> 00:00:35,340
已通过某种模型翻译成英文。
that has been translated into English by some model.
17
00:00:35,340 --> 00:00:37,170
如果我们将生成的翻译
If we compare the generated translation
18
00:00:37,170 --> 00:00:39,150
与一些人工参考翻译进行比较,
to some reference human translations,
19
00:00:39,150 --> 00:00:41,190
我们可以看到这个模型其实还不错,
we can see that the model is actually pretty good,
20
00:00:41,190 --> 00:00:43,260
但犯了一个常见的错误。
but has made a common error.
21
00:00:43,260 --> 00:00:46,050
西班牙语单词 tengo 在英语中的意思是 have,
The Spanish word tengo means have in English,
22
00:00:46,050 --> 00:00:48,700
这种一一对应的直译不太自然。
and this one-to-one translation is not quite natural.
23
00:00:49,890 --> 00:00:51,270
那么对于使用某种自动的方法生成的翻译
So how can we measure the quality
24
00:00:51,270 --> 00:00:54,270
我们如何来评估它的质量呢?
of a generated translation in some automatic way?
25
00:00:54,270 --> 00:00:56,730
BLEU 采用的方法是
The approach that BLEU takes is to compare the n-grams
26
00:00:56,730 --> 00:00:58,550
将所生成翻译的 n-gram 和参考翻译的 n-gram
of the generated translation to the n-grams
27
00:00:58,550 --> 00:01:00,390
进行比较。
in the references.
28
00:01:00,390 --> 00:01:02,400
现在,n-gram 只是 n 个单词组成的词块
Now, an n-gram is just a fancy way of saying
29
00:01:02,400 --> 00:01:03,960
的一种花哨说法。
a chunk of n words.
30
00:01:03,960 --> 00:01:05,220
所以让我们从 unigrams 开始,
So let's start with unigrams,
31
00:01:05,220 --> 00:01:08,020
它对应于句子中的单个单词。
which corresponds to the individual words in a sentence.
32
00:01:08,880 --> 00:01:11,250
在此示例中,你可以看到所生成的翻译中有四个单词
In this example, you can see that four of the words
33
00:01:11,250 --> 00:01:13,140
也出现在其中一个
in the generated translation are also found
34
00:01:13,140 --> 00:01:14,990
参考翻译中。
in one of the reference translations.
35
00:01:16,350 --> 00:01:18,240
一旦我们找到了匹配项,
And once we've found our matches,
36
00:01:18,240 --> 00:01:20,130
给译文打分的一种方法
one way to assign a score to the translation
37
00:01:20,130 --> 00:01:23,070
是计算 unigrams 的精度。
is to compute the precision of the unigrams.
38
00:01:23,070 --> 00:01:25,200
这意味着我们在生成的和参考的翻译中
This means we just count the number of matching words
39
00:01:25,200 --> 00:01:27,360
只计算匹配词的数量
in the generated and reference translations
40
00:01:27,360 --> 00:01:29,660
并且通过除以生成结果的单词数
and normalize the count by dividing by the number of words
41
00:01:29,660 --> 00:01:30,753
来归一化计数值。
in the generation.
42
00:01:31,800 --> 00:01:34,080
在这个例子中,我们找到了四个匹配的词
In this example, we found four matching words
43
00:01:34,080 --> 00:01:36,033
而我们的生成结果中有五个单词。
and our generation has five words.
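(补充示意,非视频原文:假设采用简单的空格分词,本例的 unigram 精度可以这样手动计算。)
(Illustrative sketch, not part of the video: assuming simple whitespace tokenization, the unigram precision for this example could be computed by hand like so.)

# Count the generated words that also appear in a reference,
# then divide by the number of words in the generation.
generation = "I have thirty six years".split()
reference = "I am thirty six years old".split()

matches = sum(1 for word in generation if word in reference)
print(matches / len(generation))  # 4 matches / 5 words = 0.8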
44
00:01:37,140 --> 00:01:39,690
现在,一般来说,精度范围从零到一,
Now, in general, precision ranges from zero to one,
45
00:01:39,690 --> 00:01:42,390
更高的精度分数意味着更好的翻译。
and higher precision scores mean a better translation.
46
00:01:44,160 --> 00:01:45,570
但是到这里还没有结束
But this isn't really the whole story
47
00:01:45,570 --> 00:01:47,310
因为 unigram 精度有一个问题
because one problem with unigram precision
48
00:01:47,310 --> 00:01:49,140
就是翻译模型有时会陷入重复的模式
is that translation models sometimes get stuck
49
00:01:49,140 --> 00:01:51,330
只是把同一个单词重复
in repetitive patterns and just repeat the same word
50
00:01:51,330 --> 00:01:52,293
好几次。
several times.
51
00:01:53,160 --> 00:01:54,690
如果我们只计算单词匹配的数量,
If we just count the number of word matches,
52
00:01:54,690 --> 00:01:56,370
我们可以获得非常高的精度分数
we can get really high precision scores
53
00:01:56,370 --> 00:01:57,840
即使从人类的角度来看
even though the translation is terrible
54
00:01:57,840 --> 00:01:59,090
这个翻译很糟糕!
from a human perspective!
55
00:02:00,000 --> 00:02:02,970
例如,如果我们的模型只生成单词 six,
For example, if our model just generates the word six,
56
00:02:02,970 --> 00:02:05,020
我们得到了完美的 unigram 精度分数。
we get a perfect unigram precision score.
57
00:02:06,960 --> 00:02:09,930
所以为了解决这个问题,BLEU 使用了修改后的精度
So to handle this, BLEU uses a modified precision
58
00:02:09,930 --> 00:02:12,210
它会截断一个单词的计数次数,
that clips the number of times to count a word,
59
00:02:12,210 --> 00:02:13,680
上限是该单词在参考翻译中
based on the maximum number of times
60
00:02:13,680 --> 00:02:16,399
出现的最大次数。
it appears in the reference translation.
61
00:02:16,399 --> 00:02:18,630
在这个例子中,单词 six 在参考翻译中只出现了一次
In this example, the word six only appears once
62
00:02:18,630 --> 00:02:21,360
所以我们把分子截断为 1
in the reference, so we clip the numerator to one
63
00:02:21,360 --> 00:02:22,710
而修改后的 unigram 精度
and the modified unigram precision
64
00:02:22,710 --> 00:02:25,233
现在如预期一样给出了低得多的分数。
now gives a much lower score as expected.
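(补充示意,非视频原文:截断的思路是,每个生成的单词的计数不超过它在参考翻译中出现的次数。)
(Illustrative sketch, not part of the video: the clipping idea, where each generated word is counted at most as many times as it appears in the reference.)

from collections import Counter

# A degenerate generation that just repeats one word.
generation = "six six six six six six".split()
reference = "I am thirty six years old".split()

gen_counts, ref_counts = Counter(generation), Counter(reference)

# Plain unigram precision: every "six" matches, giving a perfect 1.0.
plain = sum(1 for w in generation if w in ref_counts) / len(generation)

# Clipped precision: each word is counted at most as often
# as it appears in the reference.
clipped = sum(min(c, ref_counts[w]) for w, c in gen_counts.items()) / len(generation)

print(plain, clipped)  # 1.0 vs 1/6 ≈ 0.17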
65
00:02:27,660 --> 00:02:29,400
unigram 精度的另一个问题
Another problem with unigram precision
66
00:02:29,400 --> 00:02:30,780
是它没有考虑到
is that it doesn't take into account
67
00:02:30,780 --> 00:02:33,900
单词在翻译中出现的顺序。
the order in which the words appear in the translations.
68
00:02:33,900 --> 00:02:35,700
例如,假设我们让 Yoda 来为我们
For example, suppose we had Yoda
69
00:02:35,700 --> 00:02:37,410
翻译西班牙语句子,
translate our Spanish sentence,
70
00:02:37,410 --> 00:02:39,457
那么我们可能会得到语序颠倒的结果,
then we might get something backwards like,
71
00:02:39,457 --> 00:02:42,450
比如,“Years sixty thirty have I.”
"Years sixty thirty have I."
72
00:02:42,450 --> 00:02:44,670
在这种情况下,修改后的 unigram 精度值
In this case, the modified unigram precision
73
00:02:44,670 --> 00:02:47,393
给出了高精度,这并不是我们真正想要的。
gives a high precision which is not really what we want.
74
00:02:48,480 --> 00:02:50,460
所以要处理词序问题,
So to deal with word ordering problems,
75
00:02:50,460 --> 00:02:52,020
BLEU 实际上计算几个不同的 n-gram 精度值,
BLEU actually computes the precision
76
00:02:52,020 --> 00:02:55,410
然后对结果计算平均值。
for several different n-grams and then averages the result.
77
00:02:55,410 --> 00:02:57,300
例如,如果我们比较 4-grams,
For example, if we compare 4-grams,
78
00:02:57,300 --> 00:02:58,830
我们可以看到翻译中
we can see that there are no matching chunks
79
00:02:58,830 --> 00:03:01,020
没有匹配四个词的语块,
of four words in the translations,
80
00:03:01,020 --> 00:03:02,913
所以 4-gram 精度为 0。
and so the 4-gram precision is 0.
81
00:03:05,460 --> 00:03:07,560
现在,使用 Datasets 库计算 BLEU 分数
Now, to compute BLEU scores in Datasets library
82
00:03:07,560 --> 00:03:09,120
真的很简单。
is really very simple.
83
00:03:09,120 --> 00:03:11,100
你只需使用 load_metric 函数,
You just use the load_metric function,
84
00:03:11,100 --> 00:03:13,290
提供模型的预测结果及其参考翻译
provide your model's predictions with their references
85
00:03:13,290 --> 00:03:14,390
然后就一切就绪!
and you're good to go!
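(补充示意,非视频原文的代码:大致用法如下,例句仅作演示;Datasets 中的 BLEU 实现要求输入已分词。)
(Illustrative sketch of the usage described here, with made-up example inputs; the BLEU implementation in Datasets expects pre-tokenized words.)

from datasets import load_metric

bleu = load_metric("bleu")

# predictions: a list of token lists; references: a list of lists of
# token lists, since each prediction may have several references.
predictions = [["I", "have", "thirty", "six", "years"]]
references = [[["I", "am", "thirty", "six", "years", "old"]]]

results = bleu.compute(predictions=predictions, references=references)
print(results)  # a dict with "bleu", "precisions" and a few other fields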
86
00:03:16,470 --> 00:03:19,200
输出将包含几个值得关注的字段。
The output will contain several fields of interest.
87
00:03:19,200 --> 00:03:20,490
precisions 字段包含
The precisions field contains
88
00:03:20,490 --> 00:03:23,133
每个 n-gram 各自的精度分数。
all the individual precision scores for each n-gram.
89
00:03:25,050 --> 00:03:26,940
然后 BLEU 分数本身
The BLEU score itself is then calculated
90
00:03:26,940 --> 00:03:30,090
通过取精度分数的几何平均值进行计算。
by taking the geometric mean of the precision scores.
91
00:03:30,090 --> 00:03:32,790
默认情况下,输出的是全部四个 n-gram 精度的平均值,
And by default, the mean of all four n-gram precisions
92
00:03:32,790 --> 00:03:35,793
该指标有时也称为 BLEU-4。
is reported, a metric that is sometimes also called BLEU-4.
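(补充说明,非视频原文:用这里的记号,设 p_n 为修正后的 n-gram 精度,上面描述的几何平均即:)
(Illustrative note, not part of the video: with p_n denoting the modified n-gram precision, the geometric mean described above is:)

\text{BLEU-4} = \left( \prod_{n=1}^{4} p_n \right)^{1/4} = \exp\left( \frac{1}{4} \sum_{n=1}^{4} \ln p_n \right)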
93
00:03:36,660 --> 00:03:38,880
在此示例中,我们可以看到 BLEU 分数为零
In this example, we can see the BLEU score is zero
94
00:03:38,880 --> 00:03:40,780
因为 4-gram 精度为零。
because the 4-gram precision was zero.
95
00:03:43,290 --> 00:03:45,390
现在,BLEU 指标有一些不错的特性,
Now, the BLEU metric has some nice properties,
96
00:03:45,390 --> 00:03:47,520
但它远非一个完美的指标。
but it is far from a perfect metric.
97
00:03:47,520 --> 00:03:49,440
好的特性是它很容易计算
The good properties are that it's easy to compute
98
00:03:49,440 --> 00:03:50,970
它被广泛用于研究
and it's widely used in research
99
00:03:50,970 --> 00:03:52,620
这样你就可以在常用的基准上将你的模型
so you can compare your model against others
100
00:03:52,620 --> 00:03:54,630
与其他模型进行比较。
on common benchmarks.
101
00:03:54,630 --> 00:03:56,670
另一方面,BLEU 有几个大问题,
On the other hand, there are several big problems with BLEU,
102
00:03:56,670 --> 00:03:58,830
包括它实际上没有考虑语义
including the fact it doesn't incorporate semantics
103
00:03:58,830 --> 00:04:01,920
而且在非英语语言上表现很差。
and it struggles a lot on non-English languages.
104
00:04:01,920 --> 00:04:02,790
BLEU 的另一个问题
Another problem with BLEU
105
00:04:02,790 --> 00:04:04,620
是它假定人工翻译
is that it assumes the human translations
106
00:04:04,620 --> 00:04:05,820
已经被词元化
have already been tokenized
107
00:04:05,820 --> 00:04:07,320
这使得比较使用不同分词器的模型
and this makes it hard to compare models
108
00:04:07,320 --> 00:04:08,820
变得困难。
that use different tokenizers.
109
00:04:10,590 --> 00:04:12,570
所以正如我们所见,衡量文本的质量
So as we've seen, measuring the quality of texts
110
00:04:12,570 --> 00:04:15,570
仍然是 NLP 研究中的一个困难和开放的问题。
is still a difficult and open problem in NLP research.
111
00:04:15,570 --> 00:04:17,580
对于机器翻译,目前的推荐
For machine translation, the current recommendation
112
00:04:17,580 --> 00:04:19,440
是使用 SacreBLEU 指标,
is to use the SacreBLEU metric,
113
00:04:19,440 --> 00:04:22,830
它解决了 BLEU 的词元化限制。
which addresses the tokenization limitations of BLEU.
114
00:04:22,830 --> 00:04:24,360
正如你在此示例中所见,
As you can see in this example,
115
00:04:24,360 --> 00:04:26,580
计算 SacreBLEU 分数与计算 BLEU 分数
computing the SacreBLEU score is almost identical
116
00:04:26,580 --> 00:04:28,020
几乎完全相同。
to the BLEU one.
117
00:04:28,020 --> 00:04:30,360
主要区别在于我们现在传递一个文本列表
The main difference is that we now pass a list of texts
118
00:04:30,360 --> 00:04:32,640
作为翻译,而不是单词列表,
instead of a list of words to the translations,
119
00:04:32,640 --> 00:04:35,640
SacreBLEU 负责底层的词元化。
and SacreBLEU takes care of the tokenization under the hood.
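(补充示意,非视频原文的代码:SacreBLEU 的用法大致如下,注意传入的是未分词的字符串,例句仅作演示。)
(Illustrative sketch with made-up inputs: the SacreBLEU usage described here, passing raw untokenized strings.)

from datasets import load_metric

sacrebleu = load_metric("sacrebleu")

# predictions: a list of raw strings; references: a list of string
# lists. SacreBLEU handles the tokenization internally.
predictions = ["I have thirty six years"]
references = [["I am thirty-six years old"]]

results = sacrebleu.compute(predictions=predictions, references=references)
print(results["score"])  # SacreBLEU scores range from 0 to 100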
120
00:04:36,582 --> 00:04:39,499
(屏幕呼啸)
(screen whooshing)