subtitles/en/60_what-is-the-bleu-metric.srt
1
00:00:00,147 --> 00:00:01,412
(screen whooshing)
2
00:00:01,412 --> 00:00:02,698
(sticker popping)
3
00:00:02,698 --> 00:00:05,670
(screen whooshing)
4
00:00:05,670 --> 00:00:07,650
- What is the BLEU metric?
5
00:00:07,650 --> 00:00:10,170
For many NLP tasks we
can use common metrics
6
00:00:10,170 --> 00:00:12,810
like accuracy or F1
score, but what do you do
7
00:00:12,810 --> 00:00:14,340
when you want to measure the quality of text
8
00:00:14,340 --> 00:00:16,560
that's been translated by a model?
9
00:00:16,560 --> 00:00:18,750
In this video, we'll take a
look at a widely used metric
10
00:00:18,750 --> 00:00:20,613
for machine translation called BLEU.
11
00:00:22,290 --> 00:00:23,940
The basic idea behind BLEU is to assign
12
00:00:23,940 --> 00:00:26,250
a single numerical score to a translation
13
00:00:26,250 --> 00:00:27,450
that tells us how good it is
14
00:00:27,450 --> 00:00:30,199
compared to one or more
reference translations.
15
00:00:30,199 --> 00:00:32,130
In this example, we have
a sentence in Spanish
16
00:00:32,130 --> 00:00:35,340
that has been translated
into English by some model.
17
00:00:35,340 --> 00:00:37,170
If we compare the generated translation
18
00:00:37,170 --> 00:00:39,150
to some reference human translations,
19
00:00:39,150 --> 00:00:41,190
we can see that the model
is actually pretty good,
20
00:00:41,190 --> 00:00:43,260
but has made a common error.
21
00:00:43,260 --> 00:00:46,050
The Spanish word "tengo"
means "have" in English,
22
00:00:46,050 --> 00:00:48,700
and this one-to-one translation
is not quite natural.
23
00:00:49,890 --> 00:00:51,270
So how can we measure the quality
24
00:00:51,270 --> 00:00:54,270
of a generated translation
in some automatic way?
25
00:00:54,270 --> 00:00:56,730
The approach that BLEU takes
is to compare the n-grams
26
00:00:56,730 --> 00:00:58,550
of the generated
translation to the n-grams
27
00:00:58,550 --> 00:01:00,390
in the references.
28
00:01:00,390 --> 00:01:02,400
Now, an n-gram is just
a fancy way of saying
29
00:01:02,400 --> 00:01:03,960
a chunk of n words.
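For instance, here is a minimal Python sketch of extracting n-grams from a tokenized sentence (the example sentence is an assumption, chosen to match the five-word translation described in this video):

```python
# Minimal sketch: extract the n-grams of a whitespace-tokenized sentence.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I have thirty six years".split()
print(ngrams(tokens, 1))  # unigrams: individual words
print(ngrams(tokens, 2))  # bigrams: chunks of two consecutive words
```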
30
00:01:03,960 --> 00:01:05,220
So let's start with unigrams,
31
00:01:05,220 --> 00:01:08,020
which correspond to the
individual words in a sentence.
32
00:01:08,880 --> 00:01:11,250
In this example, you can
see that four of the words
33
00:01:11,250 --> 00:01:13,140
in the generated
translation are also found
34
00:01:13,140 --> 00:01:14,990
in one of the reference translations.
35
00:01:16,350 --> 00:01:18,240
And once we've found our matches,
36
00:01:18,240 --> 00:01:20,130
one way to assign a
score to the translation
37
00:01:20,130 --> 00:01:23,070
is to compute the
precision of the unigrams.
38
00:01:23,070 --> 00:01:25,200
This means we just count
the number of matching words
39
00:01:25,200 --> 00:01:27,360
in the generated and
reference translations
40
00:01:27,360 --> 00:01:29,660
and normalize the count by
dividing by the number of words
41
00:01:29,660 --> 00:01:30,753
in the generation.
42
00:01:31,800 --> 00:01:34,080
In this example, we
found four matching words
43
00:01:34,080 --> 00:01:36,033
and our generation has five words.
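A minimal sketch of that computation, assuming placeholder sentences that reproduce the four-out-of-five match described here:

```python
# Placeholder generation and reference, assumed for illustration.
generation = "I have thirty six years".split()
reference = "I am thirty six years old".split()

# Unigram precision: generated words that appear in the reference,
# divided by the number of words in the generation.
matches = sum(1 for word in generation if word in reference)
print(matches / len(generation))  # 4 / 5 = 0.8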
44
00:01:37,140 --> 00:01:39,690
Now, in general, precision
ranges from zero to one,
45
00:01:39,690 --> 00:01:42,390
and higher precision scores
mean a better translation.
46
00:01:44,160 --> 00:01:45,570
But this isn't really the whole story
47
00:01:45,570 --> 00:01:47,310
because one problem with unigram precision
48
00:01:47,310 --> 00:01:49,140
is that translation
models sometimes get stuck
49
00:01:49,140 --> 00:01:51,330
in repetitive patterns and
just repeat the same word
50
00:01:51,330 --> 00:01:52,293
several times.
51
00:01:53,160 --> 00:01:54,690
If we just count the
number of word matches,
52
00:01:54,690 --> 00:01:56,370
we can get really high precision scores
53
00:01:56,370 --> 00:01:57,840
even though the translation is terrible
54
00:01:57,840 --> 00:01:59,090
from a human perspective!
55
00:02:00,000 --> 00:02:02,970
For example, if our model
just generates the word "six",
56
00:02:02,970 --> 00:02:05,020
we get a perfect unigram precision score.
57
00:02:06,960 --> 00:02:09,930
So to handle this, BLEU
uses a modified precision
58
00:02:09,930 --> 00:02:12,210
that clips the number of
times a word is counted,
59
00:02:12,210 --> 00:02:13,680
based on the maximum number of times
60
00:02:13,680 --> 00:02:16,399
it appears in the reference translation.
61
00:02:16,399 --> 00:02:18,630
In this example, the word
"six" only appears once
62
00:02:18,630 --> 00:02:21,360
in the reference, so we
clip the numerator to one
63
00:02:21,360 --> 00:02:22,710
and the modified unigram precision
64
00:02:22,710 --> 00:02:25,233
now gives a much lower score as expected.
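Here is a rough sketch of the clipping idea, again with assumed sentences matching the repeated-word example:

```python
from collections import Counter

# Assumed degenerate generation and reference, for illustration.
generation = "six six six six six six".split()
reference = "I am thirty six years old".split()

# Naive unigram precision: every repeated "six" counts as a match.
naive = sum(1 for word in generation if word in reference) / len(generation)
print(naive)  # 6 / 6 = 1.0

# Modified precision: each word is counted at most as many times
# as it appears in the reference, so "six" is clipped to one match.
ref_counts = Counter(reference)
clipped = sum(min(count, ref_counts[word])
              for word, count in Counter(generation).items())
print(clipped / len(generation))  # 1 / 6 ≈ 0.17
```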
65
00:02:27,660 --> 00:02:29,400
Another problem with unigram precision
66
00:02:29,400 --> 00:02:30,780
is that it doesn't take into account
67
00:02:30,780 --> 00:02:33,900
the order in which the words
appear in the translations.
68
00:02:33,900 --> 00:02:35,700
For example, suppose we had Yoda
69
00:02:35,700 --> 00:02:37,410
translate our Spanish sentence,
70
00:02:37,410 --> 00:02:39,457
then we might get
something backwards like,
71
00:02:39,457 --> 00:02:42,450
"Years sixty thirty have I."
72
00:02:42,450 --> 00:02:44,670
In this case, the
modified unigram precision
73
00:02:44,670 --> 00:02:47,393
gives a high score, which
is not really what we want.
74
00:02:48,480 --> 00:02:50,460
So to deal with word ordering problems,
75
00:02:50,460 --> 00:02:52,020
BLEU actually computes the precision
76
00:02:52,020 --> 00:02:55,410
for several different n-grams
and then averages the result.
77
00:02:55,410 --> 00:02:57,300
For example, if we compare 4-grams,
78
00:02:57,300 --> 00:02:58,830
we can see that there
are no matching chunks
79
00:02:58,830 --> 00:03:01,020
of four words in the translations,
80
00:03:01,020 --> 00:03:02,913
and so the 4-gram precision is 0.
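A sketch of the same modified precision generalized to n-grams (sentences assumed as before):

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Modified n-gram precision: clipped n-gram matches over generated n-grams.
def modified_precision(generation, reference, n):
    gen_counts = Counter(ngrams(generation, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(c, ref_counts[g]) for g, c in gen_counts.items())
    return clipped / max(sum(gen_counts.values()), 1)

generation = "I have thirty six years".split()
reference = "I am thirty six years old".split()
print(modified_precision(generation, reference, 4))  # no shared 4-grams -> 0.0
```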
81
00:03:05,460 --> 00:03:07,560
Now, computing BLEU
scores in the Datasets library
82
00:03:07,560 --> 00:03:09,120
is really very simple.
83
00:03:09,120 --> 00:03:11,100
You just use the load_metric function,
84
00:03:11,100 --> 00:03:13,290
provide your model's predictions
along with their references
85
00:03:13,290 --> 00:03:14,390
and you're good to go!
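Along the lines of what is shown on screen, a sketch of that call with load_metric from the Datasets library (the sentences are placeholders, and the "bleu" metric expects pre-tokenized inputs):

```python
from datasets import load_metric

bleu_metric = load_metric("bleu")

# Each prediction is a list of words; each prediction gets a list of
# tokenized reference translations (here just one reference).
predictions = [["I", "have", "thirty", "six", "years"]]
references = [[["I", "am", "thirty", "six", "years", "old"]]]

results = bleu_metric.compute(predictions=predictions, references=references)
print(results)
```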
86
00:03:16,470 --> 00:03:19,200
The output will contain
several fields of interest.
87
00:03:19,200 --> 00:03:20,490
The precisions field contains
88
00:03:20,490 --> 00:03:23,133
all the individual precision
scores for each n-gram.
89
00:03:25,050 --> 00:03:26,940
The BLEU score itself is then calculated
90
00:03:26,940 --> 00:03:30,090
by taking the geometric mean
of the precision scores.
91
00:03:30,090 --> 00:03:32,790
And by default, the mean of
all four n-gram precisions
92
00:03:32,790 --> 00:03:35,793
is reported, a metric that is
sometimes also called BLEU-4.
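As a rough sketch of that averaging step (with made-up precision values; the full BLEU formula also applies a brevity penalty, which is left out here):

```python
import math

# Made-up n-gram precisions for illustration (unigram through 4-gram).
precisions = [0.8, 0.5, 0.25, 0.0]

# Geometric mean of the four precisions: a single zero, for example
# a 4-gram precision of zero, drives the whole score to zero.
bleu = math.prod(precisions) ** (1 / len(precisions))
print(bleu)  # 0.0
```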
93
00:03:36,660 --> 00:03:38,880
In this example, we can
see the BLEU score is zero
94
00:03:38,880 --> 00:03:40,780
because the 4-gram precision was zero.
95
00:03:43,290 --> 00:03:45,390
Now, the BLEU metric has
some nice properties,
96
00:03:45,390 --> 00:03:47,520
but it is far from a perfect metric.
97
00:03:47,520 --> 00:03:49,440
The good properties are
that it's easy to compute
98
00:03:49,440 --> 00:03:50,970
and it's widely used in research
99
00:03:50,970 --> 00:03:52,620
so you can compare your
model against others
100
00:03:52,620 --> 00:03:54,630
on common benchmarks.
101
00:03:54,630 --> 00:03:56,670
On the other hand, there are
several big problems with BLEU,
102
00:03:56,670 --> 00:03:58,830
including the fact it
doesn't incorporate semantics
103
00:03:58,830 --> 00:04:01,920
and it struggles a lot
on non-English languages.
104
00:04:01,920 --> 00:04:02,790
Another problem with BLEU
105
00:04:02,790 --> 00:04:04,620
is that it assumes the human translations
106
00:04:04,620 --> 00:04:05,820
have already been tokenized
107
00:04:05,820 --> 00:04:07,320
and this makes it hard to compare models
108
00:04:07,320 --> 00:04:08,820
that use different tokenizers.
109
00:04:10,590 --> 00:04:12,570
So as we've seen, measuring
the quality of texts
110
00:04:12,570 --> 00:04:15,570
is still a difficult and
open problem in NLP research.
111
00:04:15,570 --> 00:04:17,580
For machine translation,
the current recommendation
112
00:04:17,580 --> 00:04:19,440
is to use the SacreBLEU metric,
113
00:04:19,440 --> 00:04:22,830
which addresses the tokenization
limitations of BLEU.
114
00:04:22,830 --> 00:04:24,360
As you can see in this example,
115
00:04:24,360 --> 00:04:26,580
computing the SacreBLEU
score is almost identical
116
00:04:26,580 --> 00:04:28,020
to the BLEU one.
117
00:04:28,020 --> 00:04:30,360
The main difference is that
we now pass a list of texts
118
00:04:30,360 --> 00:04:32,640
instead of a list of
words as the translations,
119
00:04:32,640 --> 00:04:35,640
and SacreBLEU takes care of the
tokenization under the hood.
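A sketch of that call, with placeholder sentences again; predictions are plain strings, and each prediction gets a list of reference strings:

```python
from datasets import load_metric

sacrebleu_metric = load_metric("sacrebleu")

# Raw strings go in; SacreBLEU tokenizes them under the hood.
predictions = ["I have thirty six years"]
references = [["I am thirty six years old"]]

results = sacrebleu_metric.compute(predictions=predictions, references=references)
print(results["score"])
```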
120
00:04:36,582 --> 00:04:39,499
(screen whooshing)