1
00:00:06,450 --> 00:00:09,540
Let's take a look at subword-based tokenization.

2
00:00:09,540 --> 00:00:11,610
Understanding why subword-based tokenization

3
00:00:11,610 --> 00:00:13,980
is interesting requires understanding the flaws

4
00:00:13,980 --> 00:00:17,340
of word-based and character-based tokenization.

5
00:00:17,340 --> 00:00:18,780
If you haven't seen the first videos

6
00:00:18,780 --> 00:00:22,020
on word-based and character-based tokenization,

7
00:00:22,020 --> 00:00:23,130
we recommend you check them out

8
00:00:23,130 --> 00:00:24,780
before watching this video.

9
00:00:27,840 --> 00:00:31,493
Subword-based tokenization lies in between character-based

10
00:00:31,493 --> 00:00:35,280
and word-based tokenization algorithms.

11
00:00:35,280 --> 00:00:37,410
The idea is to find a middle ground

12
00:00:37,410 --> 00:00:39,486
between the very large vocabularies,

13
00:00:39,486 --> 00:00:42,600
large quantity of out-of-vocabulary tokens,

14
00:00:42,600 --> 00:00:45,360
and loss of meaning across very similar words

15
00:00:45,360 --> 00:00:48,630
of word-based tokenizers, and the very long sequences

16
00:00:48,630 --> 00:00:51,330
and less meaningful individual tokens

17
00:00:51,330 --> 00:00:53,133
of character-based tokenizers.

18
00:00:54,840 --> 00:00:57,960
These algorithms rely on the following principle:

19
00:00:57,960 --> 00:01:00,000
frequently used words should not be split

20
00:01:00,000 --> 00:01:01,500
into smaller subwords,

21
00:01:01,500 --> 00:01:03,433
while rare words should be decomposed

22
00:01:03,433 --> 00:01:05,103
into meaningful subwords.

23
00:01:06,510 --> 00:01:08,460
An example is the word "dog".

24
00:01:08,460 --> 00:01:11,190
We would like our tokenizer to have a single ID

25
00:01:11,190 --> 00:01:12,600
for the word "dog", rather

26
00:01:12,600 --> 00:01:15,363
than splitting it into the characters "d", "o", and "g".

27
00:01:16,650 --> 00:01:19,260
However, when encountering the word "dogs",

28
00:01:19,260 --> 00:01:22,710
we would like our tokenizer to understand that at the root,

29
00:01:22,710 --> 00:01:24,120
this is still the word "dog",

30
00:01:24,120 --> 00:01:27,030
with an added "s" that slightly changes the meaning

31
00:01:27,030 --> 00:01:28,923
while keeping the original idea.

32
00:01:30,600 --> 00:01:34,080
Another example is a complex word like "tokenization",

33
00:01:34,080 --> 00:01:37,140
which can be split into meaningful subwords.

34
00:01:37,140 --> 00:01:37,973
The root of the word

35
00:01:37,973 --> 00:01:40,590
is "token", and "-ization" completes the root

36
00:01:40,590 --> 00:01:42,870
to give it a slightly different meaning.
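A minimal sketch of this frequent-versus-rare behavior, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (both illustrative choices; the exact splits depend on the vocabulary the tokenizer was trained with):

from transformers import AutoTokenizer

# Load a pretrained subword tokenizer (WordPiece, as used by BERT).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A frequent word is kept whole as a single token...
print(tokenizer.tokenize("dog"))           # e.g. ['dog']

# ...while a rarer word is decomposed into meaningful subwords.
print(tokenizer.tokenize("tokenization"))  # e.g. ['token', '##ization']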
37
00:01:42,870 --> 00:01:44,430
It makes sense to split the word

38
00:01:44,430 --> 00:01:47,640
in two: "token" as the root of the word,

39
00:01:47,640 --> 00:01:49,950
labeled as the start of the word,

40
00:01:49,950 --> 00:01:52,530
and "-ization" as additional information, labeled

41
00:01:52,530 --> 00:01:54,393
as a completion of the word.

42
00:01:55,826 --> 00:01:58,740
In turn, the model will now be able to make sense

43
00:01:58,740 --> 00:02:01,080
of "token" in different situations.

44
00:02:01,080 --> 00:02:04,602
It will understand that the words "token", "tokens", "tokenizing",

45
00:02:04,602 --> 00:02:08,760
and "tokenization" have a similar meaning and are linked.

46
00:02:08,760 --> 00:02:12,450
It will also understand that "tokenization", "modernization",

47
00:02:12,450 --> 00:02:16,200
and "immunization", which all have the same suffix,

48
00:02:16,200 --> 00:02:19,383
are probably used in the same syntactic situations.

49
00:02:20,610 --> 00:02:23,130
Subword-based tokenizers generally have a way to

50
00:02:23,130 --> 00:02:25,890
identify which tokens are starts of words

51
00:02:25,890 --> 00:02:28,443
and which tokens complete starts of words.

52
00:02:29,520 --> 00:02:31,140
So here, "token" is marked as the start of a word,

53
00:02:31,140 --> 00:02:35,100
and "##ization" as the completion of a word.

54
00:02:35,100 --> 00:02:38,103
Here, the "##" prefix indicates that "ization" is part of a word

55
00:02:38,103 --> 00:02:41,013
rather than the beginning of it.

56
00:02:41,910 --> 00:02:43,110
The "##" comes

57
00:02:43,110 --> 00:02:47,013
from the BERT tokenizer, based on the WordPiece algorithm.

58
00:02:47,850 --> 00:02:50,700
Other tokenizers use other prefixes, which can be

59
00:02:50,700 --> 00:02:52,200
placed to indicate parts of words

60
00:02:52,200 --> 00:02:55,083
like here, or starts of words instead.

61
00:02:56,250 --> 00:02:57,083
There are a lot

62
00:02:57,083 --> 00:02:58,740
of different algorithms that can be used

63
00:02:58,740 --> 00:03:00,090
for subword tokenization,

64
00:03:00,090 --> 00:03:02,670
and most models obtaining state-of-the-art results

65
00:03:02,670 --> 00:03:03,780
in English today

66
00:03:03,780 --> 00:03:06,663
use some kind of subword tokenization algorithm.

67
00:03:07,620 --> 00:03:10,953
These approaches help in reducing vocabulary sizes

68
00:03:10,953 --> 00:03:13,636
by sharing information across different words,

69
00:03:13,636 --> 00:03:15,960
having the ability to have prefixes

70
00:03:15,960 --> 00:03:18,630
and suffixes understood as such.

71
00:03:18,630 --> 00:03:20,700
They keep meaning across very similar words

72
00:03:20,700 --> 00:03:23,103
by recognizing the similar tokens making them up.
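Under the same assumptions as the sketch above (Hugging Face transformers with the bert-base-uncased WordPiece tokenizer; splits shown are illustrative), a short sketch of separating word starts from word completions via the "##" convention:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("tokenization and modernization")
# e.g. ['token', '##ization', 'and', 'modern', '##ization']

# WordPiece marks completion tokens with "##"; every other token starts a word.
for token in tokens:
    kind = "completes a word" if token.startswith("##") else "starts a word"
    print(f"{token}: {kind}")

Tokenizers built on other algorithms invert this convention, marking the start-of-word tokens with a prefix instead, so the check above is specific to WordPiece-style vocabularies.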