subtitles/en/60_what-is-the-bleu-metric.srt

1
00:00:00,147 --> 00:00:01,412
(screen whooshing)

2
00:00:01,412 --> 00:00:02,698
(sticker popping)

3
00:00:02,698 --> 00:00:05,670
(screen whooshing)

4
00:00:05,670 --> 00:00:07,650
- What is the BLEU metric?

5
00:00:07,650 --> 00:00:10,170
For many NLP tasks we can use common metrics

6
00:00:10,170 --> 00:00:12,810
like accuracy or F1 score, but what do you do

7
00:00:12,810 --> 00:00:14,340
when you want to measure the quality of text

8
00:00:14,340 --> 00:00:16,560
that's been translated by a model?

9
00:00:16,560 --> 00:00:18,750
In this video, we'll take a look at a widely used metric

10
00:00:18,750 --> 00:00:20,613
for machine translation called BLEU.

11
00:00:22,290 --> 00:00:23,940
The basic idea behind BLEU is to assign

12
00:00:23,940 --> 00:00:26,250
a single numerical score to a translation

13
00:00:26,250 --> 00:00:27,450
that tells us how good it is

14
00:00:27,450 --> 00:00:30,199
compared to one or more reference translations.

15
00:00:30,199 --> 00:00:32,130
In this example, we have a sentence in Spanish

16
00:00:32,130 --> 00:00:35,340
that has been translated into English by some model.

17
00:00:35,340 --> 00:00:37,170
If we compare the generated translation

18
00:00:37,170 --> 00:00:39,150
to some reference human translations,

19
00:00:39,150 --> 00:00:41,190
we can see that the model is actually pretty good,

20
00:00:41,190 --> 00:00:43,260
but has made a common error.

21
00:00:43,260 --> 00:00:46,050
The Spanish word "tengo" means "have" in English,

22
00:00:46,050 --> 00:00:48,700
and this one-to-one translation is not quite natural.

23
00:00:49,890 --> 00:00:51,270
So how can we measure the quality

24
00:00:51,270 --> 00:00:54,270
of a generated translation in some automatic way?

25
00:00:54,270 --> 00:00:56,730
The approach that BLEU takes is to compare the n-grams

26
00:00:56,730 --> 00:00:58,550
of the generated translation to the n-grams

27
00:00:58,550 --> 00:01:00,390
in the references.

28
00:01:00,390 --> 00:01:02,400
Now, an n-gram is just a fancy way of saying

29
00:01:02,400 --> 00:01:03,960
a chunk of n words.

30
00:01:03,960 --> 00:01:05,220
So let's start with unigrams,

31
00:01:05,220 --> 00:01:08,020
which correspond to the individual words in a sentence.

32
00:01:08,880 --> 00:01:11,250
In this example, you can see that four of the words

33
00:01:11,250 --> 00:01:13,140
in the generated translation are also found

34
00:01:13,140 --> 00:01:14,990
in one of the reference translations.

35
00:01:16,350 --> 00:01:18,240
And once we've found our matches,

36
00:01:18,240 --> 00:01:20,130
one way to assign a score to the translation

37
00:01:20,130 --> 00:01:23,070
is to compute the precision of the unigrams.

38
00:01:23,070 --> 00:01:25,200
This means we just count the number of matching words

39
00:01:25,200 --> 00:01:27,360
in the generated and reference translations

40
00:01:27,360 --> 00:01:29,660
and normalize the count by dividing by the number of words

41
00:01:29,660 --> 00:01:30,753
in the generation.

42
00:01:31,800 --> 00:01:34,080
In this example, we found four matching words

43
00:01:34,080 --> 00:01:36,033
and our generation has five words.

44
00:01:37,140 --> 00:01:39,690
Now, in general, precision ranges from zero to one,

45
00:01:39,690 --> 00:01:42,390
and higher precision scores mean a better translation.
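To make the unigram precision calculation concrete, here is a minimal Python sketch. The two sentences are hypothetical stand-ins chosen to match the description above (four of the five generated words also appear in the reference); they are not quoted from the video's slides.

# Unigram (word-level) precision: matching words divided by generated words.
generation = "I have thirty six years".split()
reference = "I am thirty six years old".split()

# Count the generated words that also appear in the reference,
# then normalize by the length of the generation.
matches = sum(1 for word in generation if word in reference)
precision = matches / len(generation)
print(matches, len(generation), precision)  # 4 5 0.8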
46
00:01:44,160 --> 00:01:45,570
But this isn't really the whole story

47
00:01:45,570 --> 00:01:47,310
because one problem with unigram precision

48
00:01:47,310 --> 00:01:49,140
is that translation models sometimes get stuck

49
00:01:49,140 --> 00:01:51,330
in repetitive patterns and just repeat the same word

50
00:01:51,330 --> 00:01:52,293
several times.

51
00:01:53,160 --> 00:01:54,690
If we just count the number of word matches,

52
00:01:54,690 --> 00:01:56,370
we can get really high precision scores

53
00:01:56,370 --> 00:01:57,840
even though the translation is terrible

54
00:01:57,840 --> 00:01:59,090
from a human perspective!

55
00:02:00,000 --> 00:02:02,970
For example, if our model just generates the word "six",

56
00:02:02,970 --> 00:02:05,020
we get a perfect unigram precision score.

57
00:02:06,960 --> 00:02:09,930
So to handle this, BLEU uses a modified precision

58
00:02:09,930 --> 00:02:12,210
that clips the number of times a word is counted,

59
00:02:12,210 --> 00:02:13,680
based on the maximum number of times

60
00:02:13,680 --> 00:02:16,399
it appears in the reference translation.

61
00:02:16,399 --> 00:02:18,630
In this example, the word "six" only appears once

62
00:02:18,630 --> 00:02:21,360
in the reference, so we clip the numerator to one

63
00:02:21,360 --> 00:02:22,710
and the modified unigram precision

64
00:02:22,710 --> 00:02:25,233
now gives a much lower score, as expected.

65
00:02:27,660 --> 00:02:29,400
Another problem with unigram precision

66
00:02:29,400 --> 00:02:30,780
is that it doesn't take into account

67
00:02:30,780 --> 00:02:33,900
the order in which the words appear in the translations.

68
00:02:33,900 --> 00:02:35,700
For example, suppose we had Yoda

69
00:02:35,700 --> 00:02:37,410
translate our Spanish sentence,

70
00:02:37,410 --> 00:02:39,457
then we might get something backwards like,

71
00:02:39,457 --> 00:02:42,450
"Years sixty thirty have I."

72
00:02:42,450 --> 00:02:44,670
In this case, the modified unigram precision

73
00:02:44,670 --> 00:02:47,393
gives a high precision, which is not really what we want.

74
00:02:48,480 --> 00:02:50,460
So to deal with word ordering problems,

75
00:02:50,460 --> 00:02:52,020
BLEU actually computes the precision

76
00:02:52,020 --> 00:02:55,410
for several different n-grams and then averages the result.

77
00:02:55,410 --> 00:02:57,300
For example, if we compare 4-grams,

78
00:02:57,300 --> 00:02:58,830
we can see that there are no matching chunks

79
00:02:58,830 --> 00:03:01,020
of four words in the translations,

80
00:03:01,020 --> 00:03:02,913
and so the 4-gram precision is 0.

81
00:03:05,460 --> 00:03:07,560
Now, computing BLEU scores with the Datasets library

82
00:03:07,560 --> 00:03:09,120
is really simple.

83
00:03:09,120 --> 00:03:11,100
You just use the load_metric function,

84
00:03:11,100 --> 00:03:13,290
provide your model's predictions along with their references,

85
00:03:13,290 --> 00:03:14,390
and you're good to go!

86
00:03:16,470 --> 00:03:19,200
The output will contain several fields of interest.

87
00:03:19,200 --> 00:03:20,490
The precisions field contains

88
00:03:20,490 --> 00:03:23,133
all the individual precision scores for each n-gram.

89
00:03:25,050 --> 00:03:26,940
The BLEU score itself is then calculated

90
00:03:26,940 --> 00:03:30,090
by taking the geometric mean of the precision scores.

91
00:03:30,090 --> 00:03:32,790
And by default, the mean of all four n-gram precisions

92
00:03:32,790 --> 00:03:35,793
is reported, a metric that is sometimes also called BLEU-4.
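As a rough sketch of that workflow, the snippet below loads the BLEU metric with the Datasets library's load_metric function and scores one prediction against one reference. The sentences are illustrative placeholders, and note that this metric expects pre-tokenized inputs, i.e. lists of words.

from datasets import load_metric

bleu = load_metric("bleu")

# One tokenized prediction, with its list of tokenized references.
predictions = [["I", "have", "thirty", "six", "years"]]
references = [[["I", "am", "thirty", "six", "years", "old"]]]

results = bleu.compute(predictions=predictions, references=references)
print(results["precisions"])  # individual 1-gram to 4-gram precisions
print(results["bleu"])        # overall BLEU score (BLEU-4 by default)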
93
00:03:36,660 --> 00:03:38,880
In this example, we can see the BLEU score is zero

94
00:03:38,880 --> 00:03:40,780
because the 4-gram precision was zero.

95
00:03:43,290 --> 00:03:45,390
Now, the BLEU metric has some nice properties,

96
00:03:45,390 --> 00:03:47,520
but it is far from a perfect metric.

97
00:03:47,520 --> 00:03:49,440
The good properties are that it's easy to compute

98
00:03:49,440 --> 00:03:50,970
and it's widely used in research,

99
00:03:50,970 --> 00:03:52,620
so you can compare your model against others

100
00:03:52,620 --> 00:03:54,630
on common benchmarks.

101
00:03:54,630 --> 00:03:56,670
On the other hand, there are several big problems with BLEU,

102
00:03:56,670 --> 00:03:58,830
including the fact that it doesn't incorporate semantics

103
00:03:58,830 --> 00:04:01,920
and it struggles a lot on non-English languages.

104
00:04:01,920 --> 00:04:02,790
Another problem with BLEU

105
00:04:02,790 --> 00:04:04,620
is that it assumes the human translations

106
00:04:04,620 --> 00:04:05,820
have already been tokenized,

107
00:04:05,820 --> 00:04:07,320
and this makes it hard to compare models

108
00:04:07,320 --> 00:04:08,820
that use different tokenizers.

109
00:04:10,590 --> 00:04:12,570
So as we've seen, measuring the quality of texts

110
00:04:12,570 --> 00:04:15,570
is still a difficult and open problem in NLP research.

111
00:04:15,570 --> 00:04:17,580
For machine translation, the current recommendation

112
00:04:17,580 --> 00:04:19,440
is to use the SacreBLEU metric,

113
00:04:19,440 --> 00:04:22,830
which addresses the tokenization limitations of BLEU.

114
00:04:22,830 --> 00:04:24,360
As you can see in this example,

115
00:04:24,360 --> 00:04:26,580
computing the SacreBLEU score is almost identical

116
00:04:26,580 --> 00:04:28,020
to the BLEU one.

117
00:04:28,020 --> 00:04:30,360
The main difference is that we now pass a list of texts

118
00:04:30,360 --> 00:04:32,640
instead of a list of words as the translations,

119
00:04:32,640 --> 00:04:35,640
and SacreBLEU takes care of the tokenization under the hood.

120
00:04:36,582 --> 00:04:39,499
(screen whooshing)
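Here is a comparable sketch for SacreBLEU, again with placeholder sentences. The predictions and references are now plain strings, and the metric tokenizes them under the hood; its score is reported on a 0-100 scale.

from datasets import load_metric

sacrebleu = load_metric("sacrebleu")

# Raw strings this time; SacreBLEU handles tokenization internally.
predictions = ["I have thirty six years"]
references = [["I am thirty-six years old"]]

results = sacrebleu.compute(predictions=predictions, references=references)
print(results["score"])  # SacreBLEU score on a 0-100 scale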