subtitles/zh-CN/49_what-is-normalization.srt

1
00:00:00,286 --> 00:00:02,869
(微妙的爆炸)
(subtle blast)

2
00:00:04,694 --> 00:00:07,380
- 在这个视频中,我们将一起看到
- In this video, we will see together

3
00:00:07,380 --> 00:00:09,930
什么是规范化组件
what is the normalizer component

4
00:00:09,930 --> 00:00:13,023
我们会在每个 tokenizer 的开头找到它。
that we'd find at the beginning of each tokenizer.

5
00:00:14,550 --> 00:00:16,830
规范化操作包括
The normalization operation consists

6
00:00:16,830 --> 00:00:19,890
将一系列规范化规则依次应用
in applying a succession of normalization rules

7
00:00:19,890 --> 00:00:20,853
到原始文本上。
to the raw text.

8
00:00:21,870 --> 00:00:25,710
我们选择规范化规则来去除文本中的噪音
We choose normalization rules to remove noise in the text

9
00:00:25,710 --> 00:00:27,900
这些噪音对语言模型的学习
which seems useless for the learning

10
00:00:27,900 --> 00:00:30,363
和使用似乎没有用处。
and use of our language model.

11
00:00:33,090 --> 00:00:37,470
让我们来看一个非常多样化的句子,包含不同的字体、
Let's take a very diverse sentence with different fonts,

12
00:00:37,470 --> 00:00:39,780
大写和小写的字符、
upper and lower case characters,

13
00:00:39,780 --> 00:00:43,083
重音符号、标点符号和多个空格,
accents, punctuation and multiple spaces,

14
00:00:43,920 --> 00:00:46,683
看看几个 tokenizer 是如何规范化它的。
to see how several tokenizers normalize it.

15
00:00:48,488 --> 00:00:50,730
来自 FNet 模型的 tokenizer
The tokenizer from the FNet model

16
00:00:50,730 --> 00:00:53,700
将带有字体变体
has transformed the letters with font variants

17
00:00:53,700 --> 00:00:57,480
或带圆圈的字母转换为其基本形式,
or circled into their basic version,

18
00:00:57,480 --> 00:00:59,733
并删除了多个空格。
and has removed the multiple spaces.

19
00:01:00,960 --> 00:01:03,960
现在,如果我们看一下
And now if we look at the normalization

20
00:01:03,960 --> 00:01:05,880
Retribert 分词器的规范化,
with Retribert's tokenizer,

21
00:01:05,880 --> 00:01:08,010
我们可以看到它保留了
we can see that it keeps characters

22
00:01:08,010 --> 00:01:12,090
带有多种字体变体的字符,并保留多个空格,
with several font variants and keeps the multiple spaces,

23
00:01:12,090 --> 00:01:14,223
但它删除了所有的重音符号。
but it removes all the accents.
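The FNet-style folding of font variants and circled characters into their basic versions can be sketched with Python's standard `unicodedata` module. NFKC "compatibility" normalization is one common way such mappings are implemented; the exact FNet pipeline may include additional steps (this is a sketch, not the model's actual normalizer):

```python
import unicodedata

# NFKC compatibility normalization maps font variants and
# enclosed (circled, fullwidth) characters to their basic forms.
variants = ["ℍ", "Ⓗ", "Ｈ"]  # double-struck, circled, fullwidth "H"
for char in variants:
    print(char, "->", unicodedata.normalize("NFKC", char))  # each prints "H"
```

Collapsing multiple spaces, by contrast, is not a Unicode normalization and is handled by a separate rule in such pipelines.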
24
00:01:16,170 --> 00:01:18,870
如果我们继续测试这种规范化,
And if we continue to test this normalization

25
00:01:18,870 --> 00:01:23,040
测试与模型相关的许多其他 tokenizer,
of many other tokenizers associated to models

26
00:01:23,040 --> 00:01:25,110
这些模型可以在 Hub 上找到,
that we can find on the Hub,

27
00:01:25,110 --> 00:01:28,833
我们会看到它们还提供了其他类型的规范化。
we see that they also propose other kinds of normalization.

28
00:01:33,900 --> 00:01:35,850
使用快速 tokenizer,
With the fast tokenizers,

29
00:01:35,850 --> 00:01:39,060
很容易观察到为当前加载的分词器
it's very easy to observe the normalization chosen

30
00:01:39,060 --> 00:01:41,193
所选择的规范化。
for the currently loaded tokenizer.

31
00:01:42,330 --> 00:01:46,140
事实上,每个快速 tokenizer 的实例
Indeed, each instance of a fast tokenizer

32
00:01:46,140 --> 00:01:48,030
都有一个底层 tokenizer,
has an underlying tokenizer

33
00:01:48,030 --> 00:01:51,390
它来自 HuggingFace Tokenizers 库,存储
from the HuggingFace Tokenizers library, stored

34
00:01:51,390 --> 00:01:53,643
在 backend_tokenizer 属性中。
in the backend_tokenizer attribute.

35
00:01:54,690 --> 00:01:58,470
这个对象本身有一个 normalizer 属性,
This object itself has a normalizer attribute

36
00:01:58,470 --> 00:02:01,830
我们可以通过 normalize_str 方法使用它
that we can use thanks to the normalize_str method

37
00:02:01,830 --> 00:02:03,153
来规范化一个字符串。
to normalize a string.

38
00:02:04,560 --> 00:02:08,700
因此,非常实用的一点是,
It is thus very practical that this normalization,

39
00:02:08,700 --> 00:02:11,070
在训练分词器时使用的这种规范化
which was used at the time of the training

40
00:02:11,070 --> 00:02:12,903
被保存了下来,
of the tokenizer, was saved,

41
00:02:13,857 --> 00:02:16,200
并且它会自动应用
and that it applies automatically

42
00:02:16,200 --> 00:02:19,233
当你要求训练过的 tokenizer 对文本进行分词时。
when you ask a trained tokenizer to tokenize a text.
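The inspection described above uses `tokenizer.backend_tokenizer.normalizer.normalize_str(...)` from the Hugging Face libraries (shown in comments below, since it needs a downloaded checkpoint). As a network-free sketch, here is roughly the kind of pipeline an uncased tokenizer's normalizer applies — NFD decomposition, accent stripping, lowercasing is an assumption about the steps, not the exact saved configuration:

```python
import unicodedata

def normalize_str(text: str) -> str:
    """Mimic an uncased-BERT-style normalizer (sketch): NFD decomposition,
    drop combining marks (Unicode category Mn), then lowercase. Real
    normalizers may add more rules, e.g. control-character cleanup."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return stripped.lower()

print(normalize_str("Héllo Wörld"))  # -> "hello world"

# With transformers installed, the equivalent on a real checkpoint is:
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("bert-base-uncased")
#   tok.backend_tokenizer.normalizer.normalize_str("Héllo Wörld")
```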
43
00:02:21,000 --> 00:02:25,500
例如,如果我们没有包含 albert 规范化器,
For example, if we hadn't included the albert normalizer,

44
00:02:25,500 --> 00:02:28,770
我们就会得到很多未知的 token,
we would have had a lot of unknown tokens

45
00:02:28,770 --> 00:02:30,930
当我们对这个带有重音符号
by tokenizing this sentence

46
00:02:30,930 --> 00:02:33,213
和大写字母的句子进行分词时。
with accents and capital letters.

47
00:02:35,730 --> 00:02:38,370
这种转换也可能是
This transformation can also be undetectable

48
00:02:38,370 --> 00:02:40,050
简单的打印无法察觉的。
with a simple print.

49
00:02:40,050 --> 00:02:42,810
确实,请记住,对于计算机来说,
Indeed, keep in mind that for a computer,

50
00:02:42,810 --> 00:02:45,840
文本只是一连串的 0 和 1,
text is only a succession of 0s and 1s,

51
00:02:45,840 --> 00:02:47,820
而不同的 0 和 1 序列
and it happens that different successions

52
00:02:47,820 --> 00:02:51,363
可能呈现为相同的打印字符。
of 0s and 1s render the same printed character.

53
00:02:52,380 --> 00:02:56,403
0 和 1 以 8 个为一组组成一个字节。
The 0s and 1s go in groups of 8 to form a byte.

54
00:02:57,480 --> 00:03:00,690
然后计算机必须解码这个字节序列
The computer must then decode this sequence of bytes

55
00:03:00,690 --> 00:03:02,493
成一系列码点。
into a sequence of code points.

56
00:03:04,530 --> 00:03:09,530
在我们的示例中,这 2 个字节通过 UTF-8 解码
In our example, the 2 bytes are decoded using UTF-8

57
00:03:09,900 --> 00:03:11,403
成一个单一的码点。
into a single code point.

58
00:03:12,450 --> 00:03:15,090
然后 unicode 标准允许我们
The unicode standard then allows us

59
00:03:15,090 --> 00:03:18,191
找到与此码点对应的字符,
to find the character corresponding to this code point,

60
00:03:18,191 --> 00:03:20,283
即带软音符的 c(c cedilla)。
the c cedilla.

61
00:03:21,499 --> 00:03:23,790
让我们重复同样的操作,
Let's repeat the same operation

62
00:03:23,790 --> 00:03:26,577
对这个由 3 个字节组成的新序列。
with this new sequence composed of 3 bytes.

63
00:03:27,420 --> 00:03:30,543
这次它被转换为两个码点,
This time it is transformed into two code points,

64
00:03:31,410 --> 00:03:35,280
它们也对应于 c cedilla 字符。
which also correspond to the c cedilla character.
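The two byte sequences discussed above can be decoded directly in Python, showing that the same printed "ç" can come from one code point (U+00E7, two UTF-8 bytes) or from two code points ("c" plus combining cedilla U+0327, three UTF-8 bytes):

```python
# "ç" as one precomposed code point: UTF-8 bytes 0xC3 0xA7.
precomposed = b"\xc3\xa7".decode("utf-8")
# "ç" as "c" + COMBINING CEDILLA (U+0327): UTF-8 bytes 0x63 0xCC 0xA7.
combining = b"c\xcc\xa7".decode("utf-8")

print(precomposed, combining)            # both render as "ç"
print(len(precomposed), len(combining))  # 1 2  (code points per string)
print(precomposed == combining)          # False: different code point sequences
```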
65
00:03:35,280 --> 00:03:36,780
它实际上是
It is in fact the composition

66
00:03:36,780 --> 00:03:39,810
unicode 拉丁小写字母 c
of the unicode Latin Small Letter C

67
00:03:39,810 --> 00:03:42,240
与组合用软音符(combining cedilla)的组合。
and the combining cedilla.

68
00:03:42,240 --> 00:03:45,000
但这很烦人,因为在我们看来
But it's annoying, because what appears to us

69
00:03:45,000 --> 00:03:46,680
是单个字符的东西,
to be a single character

70
00:03:46,680 --> 00:03:49,653
对于计算机来说完全不是一回事。
is not at all the same thing for the computer.

71
00:03:52,470 --> 00:03:57,240
幸好有一些 unicode 规范化标准,
Fortunately, there are unicode standardization standards

72
00:03:57,240 --> 00:04:02,130
称为 NFC、NFD、NFKC 或 NFKD,
known as NFC, NFD, NFKC or NFKD,

73
00:04:02,130 --> 00:04:04,893
它们可以消除其中的一些差异。
that allow erasing some of these differences.

74
00:04:05,730 --> 00:04:08,223
这些标准通常由 tokenizer 使用。
These standards are often used by tokenizers.

75
00:04:09,900 --> 00:04:12,090
在前面所有这些例子中,
In all these previous examples,

76
00:04:12,090 --> 00:04:15,510
即使规范化改变了文本的外观,
even if the normalizations changed the look of the text,

77
00:04:15,510 --> 00:04:17,970
它们并没有改变内容;
they did not change the content;

78
00:04:17,970 --> 00:04:19,177
你仍然可以读出:
you could still read,

79
00:04:19,177 --> 00:04:21,987
“Hello world,让我们规范化这个句子。”
"Hello world, let's normalize this sentence."

80
00:04:22,980 --> 00:04:25,980
但是,你必须意识到,某些规范化
However, you must be aware that some normalizations

81
00:04:25,980 --> 00:04:30,363
如果不适配所处理的语料库,可能会非常有害。
can be very harmful if they are not adapted to their corpus.
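The NFC/NFD forms mentioned above can be applied with the standard `unicodedata` module: NFC composes "c" + combining cedilla into the single code point ç, while NFD decomposes it back, so either form makes the two spellings compare equal:

```python
import unicodedata

single = "\u00e7"     # ç as one precomposed code point
composed = "c\u0327"  # c + COMBINING CEDILLA, two code points

print(single == composed)  # False: same glyph, different code point sequences

# After normalizing both strings to a common form, they compare equal.
print(unicodedata.normalize("NFC", composed) == single)  # True (compose)
print(unicodedata.normalize("NFD", single) == composed)  # True (decompose)
```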
82
00:04:31,620 --> 00:04:34,387
例如,如果你拿一个法语句子,
For example, if you take the French sentence,

83
00:04:34,387 --> 00:04:38,790
“Un père indigné”,意思是“愤怒的父亲”,
"Un père indigné," which means "An indignant father,"

84
00:04:38,790 --> 00:04:42,510
并使用 bert-base-uncased tokenizer 对其进行规范化,
and normalize it with the bert-base-uncased tokenizer,

85
00:04:42,510 --> 00:04:44,313
它会去掉重音符号,
which removes the accents,

86
00:04:45,150 --> 00:04:48,000
然后句子就变成了“un pere indigne”,
then the sentence becomes "un pere indigne,"

87
00:04:48,000 --> 00:04:49,707
意思是“一个不称职的父亲”。
which means "An unworthy father."

88
00:04:53,460 --> 00:04:56,760
如果你观看此视频是为了构建自己的 tokenizer,
If you watched this video to build your own tokenizer,

89
00:04:56,760 --> 00:04:59,610
对于是否为新的 tokenizer
there are no absolute rules for choosing

90
00:04:59,610 --> 00:05:02,970
选择某种规范化,并没有绝对的规则,
a normalization for a new tokenizer or not,

91
00:05:02,970 --> 00:05:06,210
但我建议你花时间选择它们,
but I advise you to take the time to select them

92
00:05:06,210 --> 00:05:10,743
这样它们就不会让你丢失重要信息。
so that they do not make you lose important information.

93
00:05:12,296 --> 00:05:14,879
(微妙的爆炸)
(subtle blast)
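The accent loss described in the French example can be reproduced with a stdlib sketch of an uncased-style normalization (the real bert-base-uncased normalizer applies additional rules; this only illustrates how stripping accents changes the meaning of the sentence):

```python
import unicodedata

def strip_accents_lower(text: str) -> str:
    """Sketch of an uncased-style normalization: NFD-decompose,
    drop combining marks (category Mn), lowercase."""
    nfd = unicodedata.normalize("NFD", text)
    return "".join(c for c in nfd if unicodedata.category(c) != "Mn").lower()

# "indigné" (indignant) loses its accent and becomes "indigne" (unworthy).
print(strip_accents_lower("Un père indigné"))  # -> "un pere indigne"
```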