1
00:00:00,286 --> 00:00:02,869
(subtle blast)
2
00:00:04,694 --> 00:00:07,380
- In this video, we will see together
3
00:00:07,380 --> 00:00:09,930
what is the normalizer component
4
00:00:09,930 --> 00:00:13,023
that we'd find at the
beginning of each tokenizer.
5
00:00:14,550 --> 00:00:16,830
The normalization operation consists
6
00:00:16,830 --> 00:00:19,890
in applying a succession
of normalization rules
7
00:00:19,890 --> 00:00:20,853
to the raw text.
8
00:00:21,870 --> 00:00:25,710
We choose normalization rules
to remove noise in the text
9
00:00:25,710 --> 00:00:27,900
which seems useless for the learning
10
00:00:27,900 --> 00:00:30,363
and use of our language model.
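
As a minimal sketch of such a succession of rules, here is how a chain of normalizers might be composed with the Hugging Face Tokenizers library; the particular rules used here (NFD, lowercasing, accent stripping) are illustrative choices, not the ones any specific model prescribes.

```python
# Sketch of a succession of normalization rules with the Tokenizers
# library; the chosen rules are illustrative, not prescribed.
from tokenizers import normalizers
from tokenizers.normalizers import NFD, Lowercase, StripAccents

# The rules run in order: decompose, lowercase, then drop combining accents.
normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])
print(normalizer.normalize_str("Héllo Wórld"))  # "hello world"
```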
11
00:00:33,090 --> 00:00:37,470
Let's take a very diverse
sentence with different fonts,
12
00:00:37,470 --> 00:00:39,780
upper and lower case characters,
13
00:00:39,780 --> 00:00:43,083
accents, punctuation and multiple spaces,
14
00:00:43,920 --> 00:00:46,683
to see how several tokenizers normalize it.
15
00:00:48,488 --> 00:00:50,730
The tokenizer from the FNet model
16
00:00:50,730 --> 00:00:53,700
has transformed the letters
with font variants
17
00:00:53,700 --> 00:00:57,480
or circled characters into their basic versions
18
00:00:57,480 --> 00:00:59,733
and has removed the multiple spaces.
19
00:01:00,960 --> 00:01:03,960
And now if we look at the normalization
20
00:01:03,960 --> 00:01:05,880
with RetriBERT's tokenizer,
21
00:01:05,880 --> 00:01:08,010
we can see that it keeps characters
22
00:01:08,010 --> 00:01:12,090
with several font variants
and keeps the multiple spaces,
23
00:01:12,090 --> 00:01:14,223
but it removes all the accents.
24
00:01:16,170 --> 00:01:18,870
And if we continue to
test this normalization
25
00:01:18,870 --> 00:01:23,040
of many other tokenizers
associated with models
26
00:01:23,040 --> 00:01:25,110
that we can find on the Hub,
27
00:01:25,110 --> 00:01:28,833
we see that they also propose
other kinds of normalization.
28
00:01:33,900 --> 00:01:35,850
With the fast tokenizers,
29
00:01:35,850 --> 00:01:39,060
it's very easy to observe
the normalization chosen
30
00:01:39,060 --> 00:01:41,193
for the currently loaded tokenizer.
31
00:01:42,330 --> 00:01:46,140
Indeed, each instance of a fast tokenizer
32
00:01:46,140 --> 00:01:48,030
has an underlying tokenizer
33
00:01:48,030 --> 00:01:51,390
from the HuggingFace
Tokenizers library stored
34
00:01:51,390 --> 00:01:53,643
in the backend_tokenizer attribute.
35
00:01:54,690 --> 00:01:58,470
This object itself has
a normalizer attribute
36
00:01:58,470 --> 00:02:01,830
that we can use thanks to
the normalize_str method
37
00:02:01,830 --> 00:02:03,153
to normalize a string.
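
For example, a small sketch of these two attributes in action, assuming the real albert-base-v2 checkpoint:

```python
# Sketch: inspect the normalizer of a fast tokenizer through its
# backend_tokenizer attribute (albert-base-v2 is an example checkpoint).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")
backend = tokenizer.backend_tokenizer  # a tokenizers.Tokenizer instance
print(backend.normalizer.normalize_str("Héllo hôw are ü?"))
# With this checkpoint: "hello how are u?" (lowercased, accents stripped)
```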
38
00:02:04,560 --> 00:02:08,700
It is thus very practical
that this normalization,
39
00:02:08,700 --> 00:02:11,070
which was used at the time of the training
40
00:02:11,070 --> 00:02:12,903
of the tokenizer, was saved,
41
00:02:13,857 --> 00:02:16,200
and that it applies automatically
42
00:02:16,200 --> 00:02:19,233
when you ask a trained
tokenizer to tokenize a text.
43
00:02:21,000 --> 00:02:25,500
For example, if we hadn't
included the ALBERT normalizer,
44
00:02:25,500 --> 00:02:28,770
we would have had a lot of unknown tokens
45
00:02:28,770 --> 00:02:30,930
by tokenizing this sentence
46
00:02:30,930 --> 00:02:33,213
with accents and capital letters.
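
A hedged sketch of this behavior, again assuming albert-base-v2: because the saved normalizer runs automatically inside tokenization, the accented, capitalized sentence comes out as ordinary subword tokens rather than unknown tokens.

```python
# Sketch, assuming albert-base-v2: the saved normalizer is applied
# automatically, so the accented sentence should yield no unknown tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")
tokens = tokenizer.tokenize("Héllo hôw are ü?")
print(tokens)                         # subword tokens for "hello how are u?"
print(tokenizer.unk_token in tokens)  # expected False, thanks to the normalizer
```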
47
00:02:35,730 --> 00:02:38,370
This transformation can
also be undetectable
48
00:02:38,370 --> 00:02:40,050
with a simple print.
49
00:02:40,050 --> 00:02:42,810
Indeed, keep in mind that for a computer,
50
00:02:42,810 --> 00:02:45,840
text is only a succession of 0s and 1s,
51
00:02:45,840 --> 00:02:47,820
and it happens that different successions
52
00:02:47,820 --> 00:02:51,363
of 0s and 1s render the
same printed character.
53
00:02:52,380 --> 00:02:56,403
The 0s and 1s come in groups
of 8, forming bytes.
54
00:02:57,480 --> 00:03:00,690
The computer must then
decode this sequence of bytes
55
00:03:00,690 --> 00:03:02,493
into a sequence of code points.
56
00:03:04,530 --> 00:03:09,530
In our example, the two bytes
are decoded using UTF-8
57
00:03:09,900 --> 00:03:11,403
into a single code point.
58
00:03:12,450 --> 00:03:15,090
The Unicode standard then allows us
59
00:03:15,090 --> 00:03:18,191
to find the character
corresponding to this code point,
60
00:03:18,191 --> 00:03:20,283
the c cedilla.
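
This decoding step can be reproduced in plain Python: the two bytes 0xC3 0xA7 are the UTF-8 encoding of the single code point U+00E7.

```python
# The two bytes 0xC3 0xA7, decoded as UTF-8, give one code point: U+00E7.
raw = bytes([0xC3, 0xA7])
text = raw.decode("utf-8")
print(text)            # ç
print(len(text))       # 1 -- a single code point
print(hex(ord(text)))  # 0xe7 -- LATIN SMALL LETTER C WITH CEDILLA
```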
61
00:03:21,499 --> 00:03:23,790
Let's repeat the same operation
62
00:03:23,790 --> 00:03:26,577
with this new sequence
composed of 3 bytes.
63
00:03:27,420 --> 00:03:30,543
This time it is transformed
into two code points,
64
00:03:31,410 --> 00:03:35,280
which also correspond to
the c cedilla character.
65
00:03:35,280 --> 00:03:36,780
It is in fact the composition
66
00:03:36,780 --> 00:03:39,810
of the Unicode Latin Small Letter C
67
00:03:39,810 --> 00:03:42,240
and the combining cedilla.
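
The same check on the three-byte sequence shows two code points that print identically but compare as different strings.

```python
# The three bytes 0x63 0xCC 0xA7 decode to two code points: 'c'
# followed by U+0327 COMBINING CEDILLA.
import unicodedata

raw = bytes([0x63, 0xCC, 0xA7])
text = raw.decode("utf-8")
print(text)                                 # ç -- looks the same
print(len(text))                            # 2 -- two code points
print([unicodedata.name(c) for c in text])  # ['LATIN SMALL LETTER C',
                                            #  'COMBINING CEDILLA']
print(text == "\u00e7")                     # False -- different for the computer
```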
68
00:03:42,240 --> 00:03:45,000
But it's annoying because
what appears to us
69
00:03:45,000 --> 00:03:46,680
to be a single character
70
00:03:46,680 --> 00:03:49,653
is not at all the same
thing for the computer.
71
00:03:52,470 --> 00:03:57,240
Fortunately, there are Unicode
normalization standards
72
00:03:57,240 --> 00:04:02,130
known as NFC, NFD, NFKC or NFKD
73
00:04:02,130 --> 00:04:04,893
that allow erasing some
of these differences.
74
00:04:05,730 --> 00:04:08,223
These standards are
often used by tokenizers.
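
In Python, these standards are exposed through the standard library's unicodedata module; a quick sketch:

```python
# Unicode normalization erases the difference between the two encodings:
# NFC composes 'c' + combining cedilla back into the single code point.
import unicodedata

composed = "\u00e7"      # ç as one code point
decomposed = "c\u0327"   # ç as two code points
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```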
75
00:04:09,900 --> 00:04:12,090
In all these previous examples,
76
00:04:12,090 --> 00:04:15,510
even if the normalizations
changed the look of the text,
77
00:04:15,510 --> 00:04:17,970
they did not change the content;
78
00:04:17,970 --> 00:04:19,177
you could still read,
79
00:04:19,177 --> 00:04:21,987
"Hello world, let's
normalize this sentence."
80
00:04:22,980 --> 00:04:25,980
However, you must be aware
that some normalizations
81
00:04:25,980 --> 00:04:30,363
can be very harmful if they are
not adapted to their corpus.
82
00:04:31,620 --> 00:04:34,387
For example, if you take
the French sentence,
83
00:04:34,387 --> 00:04:38,790
"Un pere indigne," which
means "An indignant father,"
84
00:04:38,790 --> 00:04:42,510
and normalize it with the
bert-base-uncased tokenizer
85
00:04:42,510 --> 00:04:44,313
which removes the accents,
86
00:04:45,150 --> 00:04:48,000
then the sentence
becomes "Un pere indigne"
87
00:04:48,000 --> 00:04:49,707
which means "An unworthy father".
88
00:04:53,460 --> 00:04:56,760
If you watched this video
to build your own tokenizer,
89
00:04:56,760 --> 00:04:59,610
there are no absolute rules
for choosing whether or not to apply
90
00:04:59,610 --> 00:05:02,970
a normalization to a new tokenizer,
91
00:05:02,970 --> 00:05:06,210
but I advise you to take
the time to select them
92
00:05:06,210 --> 00:05:10,743
so that they do not make you
lose important information.
93
00:05:12,296 --> 00:05:14,879
(subtle blast)