1
00:00:00,234 --> 00:00:02,901
(page whirring)

2
00:00:04,260 --> 00:00:07,200
- Before diving into character-based tokenization,

3
00:00:07,200 --> 00:00:10,350
understanding why this kind of tokenization is interesting

4
00:00:10,350 --> 00:00:13,533
requires understanding the flaws of word-based tokenization.

5
00:00:14,640 --> 00:00:16,320
If you haven't seen the first video

6
00:00:16,320 --> 00:00:17,880
on word-based tokenization,

7
00:00:17,880 --> 00:00:21,450
we recommend you check it out before watching this video.

8
00:00:21,450 --> 00:00:24,250
Okay, let's take a look at character-based tokenization.

9
00:00:25,650 --> 00:00:28,560
We now split our text into individual characters,

10
00:00:28,560 --> 00:00:29,673
rather than words.

11
00:00:32,850 --> 00:00:35,550
There are generally a lot of different words in languages,

12
00:00:35,550 --> 00:00:37,743
while the number of characters stays low.

13
00:00:38,610 --> 00:00:41,313
To begin, let's take a look at the English language:

14
00:00:42,210 --> 00:00:45,540
it has an estimated 170,000 different words,

15
00:00:45,540 --> 00:00:47,730
so we would need a very large vocabulary

16
00:00:47,730 --> 00:00:49,413
to encompass all words.

17
00:00:50,280 --> 00:00:52,200
With a character-based vocabulary,

18
00:00:52,200 --> 00:00:55,440
we can get by with only 256 characters,

19
00:00:55,440 --> 00:00:58,683
which includes letters, numbers and special characters.

20
00:00:59,760 --> 00:01:02,190
Even languages with a lot of different characters,

21
00:01:02,190 --> 00:01:04,800
like Chinese, can have dictionaries

22
00:01:04,800 --> 00:01:08,130
with up to 20,000 different characters

23
00:01:08,130 --> 00:01:11,523
but more than 375,000 different words.

24
00:01:12,480 --> 00:01:14,310
So character-based vocabularies

25
00:01:14,310 --> 00:01:16,293
let us use fewer different tokens

26
00:01:16,293 --> 00:01:19,050
than the word-based tokenization dictionaries

27
00:01:19,050 --> 00:01:20,523
we would otherwise use.

28
00:01:23,250 --> 00:01:25,830
These vocabularies are also more complete

29
00:01:25,830 --> 00:01:28,950
than their word-based counterparts.

30
00:01:28,950 --> 00:01:31,410
As our vocabulary contains all the characters

31
00:01:31,410 --> 00:01:33,960
used in a language, even words unseen

32
00:01:33,960 --> 00:01:36,990
during the tokenizer training can still be tokenized,

33
00:01:36,990 --> 00:01:39,633
so out-of-vocabulary tokens will be less frequent.

34
00:01:40,680 --> 00:01:42,840
This includes the ability to correctly tokenize

35
00:01:42,840 --> 00:01:45,210
misspelled words, rather than discarding them

36
00:01:45,210 --> 00:01:46,623
as unknown straight away.

37
00:01:48,240 --> 00:01:52,380
However, this algorithm isn't perfect either.

38
00:01:52,380 --> 00:01:54,360
Intuitively, characters do not hold

39
00:01:54,360 --> 00:01:57,990
as much information individually as a word would.
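To make the vocabulary-size and coverage points above concrete, here is a minimal sketch in plain Python. The training sentence, the misspelled word, and the whitespace/character splitting are made-up assumptions for this note only; they are not the course's tokenizers, and real tokenizers are considerably more sophisticated.

# Toy comparison of word-based vs character-based vocabularies
# (illustrative sketch only; the "training" text is made up).
training_text = "the quick brown fox jumps over the lazy dog"

word_vocab = set(training_text.split())   # one entry per distinct word
char_vocab = set(training_text)           # one entry per distinct character

print(len(word_vocab), "word entries vs", len(char_vocab), "character entries")

# A misspelled, unseen word is out of vocabulary for the word-based approach...
new_word = "qick"
print(new_word in word_vocab)                   # False -> would map to an unknown token

# ...but every one of its characters is already in the character vocabulary.
print(all(c in char_vocab for c in new_word))   # True -> can still be tokenized

The price, as the video explains next, is that an individual character carries much less information than a whole word.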
40
00:01:57,990 --> 00:02:00,930
For example, "Let's" holds more information

41
00:02:00,930 --> 00:02:03,570
than its first letter "l".

42
00:02:03,570 --> 00:02:05,880
Of course, this is not true for all languages,

43
00:02:05,880 --> 00:02:08,880
as some, like ideogram-based languages,

44
00:02:08,880 --> 00:02:11,523
have a lot of information held in single characters,

45
00:02:12,750 --> 00:02:15,360
but for others, like Roman-alphabet-based languages,

46
00:02:15,360 --> 00:02:17,760
the model will have to make sense of multiple tokens at a time

47
00:02:17,760 --> 00:02:20,670
to get the information otherwise held

48
00:02:20,670 --> 00:02:21,753
in a single word.

49
00:02:23,760 --> 00:02:27,000
This leads to another issue with character-based tokenizers:

50
00:02:27,000 --> 00:02:29,520
their sequences are translated into a very large number

51
00:02:29,520 --> 00:02:31,593
of tokens to be processed by the model.

52
00:02:33,090 --> 00:02:36,810
And this can have an impact on the size of the context

53
00:02:36,810 --> 00:02:40,020
the model will carry around, and will reduce the size

54
00:02:40,020 --> 00:02:42,030
of the text we can use as input for our model,

55
00:02:42,030 --> 00:02:43,233
which is often limited.

56
00:02:44,100 --> 00:02:46,650
This tokenization, while it has some issues,

57
00:02:46,650 --> 00:02:48,720
has seen some very good results in the past,

58
00:02:48,720 --> 00:02:50,490
and so it should be considered

59
00:02:50,490 --> 00:02:52,680
when approaching a new problem, as it solves issues

60
00:02:52,680 --> 00:02:54,843
encountered in the word-based algorithm.

61
00:02:56,107 --> 00:02:58,774
(page whirring)
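As a rough illustration of the sequence-length issue described above, the sketch below is a made-up example (not part of the course materials): it splits the same sentence by whitespace and by character and compares the resulting token counts.

# Toy comparison of sequence lengths (illustrative sketch only).
text = "Character-based tokenizers trade a small vocabulary for much longer sequences."

word_tokens = text.split()   # word-based splitting
char_tokens = list(text)     # character-based splitting

print(f"word-based:      {len(word_tokens)} tokens")
print(f"character-based: {len(char_tokens)} tokens")

# The character-based sequence is several times longer for the same text,
# so a model with a fixed context length can fit far less input at once.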