Let's take a look at word-based tokenization.

Word-based tokenization is the idea of splitting the raw text into words, by splitting on spaces or other specific rules, such as punctuation. In this algorithm, each word is attributed a specific number, or ID. Here, "Let's" has the ID 250, "do" has 861, and "tokenization" followed by an exclamation mark has 345. This approach is interesting because the model's representations are based on entire words: the information held in a single number is high, since a word carries a lot of contextual and semantic information.

However, this approach does have its limits. For example, the word "dog" and the word "dogs" are very similar, and their meanings are close. Word-based tokenization, however, will attribute entirely different IDs to these two words, so the model will learn two different embeddings for them. This is unfortunate, because we would like the model to understand that these words are indeed related, and that "dogs" is simply the plural form of "dog".

Another issue with this approach is that there are a lot of different words in a language. If we want our model to understand all possible sentences in that language, then we need an ID for each different word, and the total number of words, also known as the vocabulary size, can quickly become very large.
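To make this concrete, here is a minimal Python sketch of a word-based tokenizer, not the course's actual implementation: it splits on whitespace and punctuation with a simple regular expression and assigns each distinct word an arbitrary ID. The splitting rule and the example sentence are illustrative assumptions; the point is that "dog" and "dogs" end up with unrelated IDs.

```python
import re

def word_tokenize(text):
    # Split on whitespace, keeping punctuation as separate tokens
    # (the exact rule is an illustrative assumption, not the course's tokenizer).
    return re.findall(r"[\w']+|[^\w\s]", text)

# Build a toy vocabulary: one arbitrary ID per distinct word.
corpus = "Let's do tokenization! The dog saw the dogs."
vocab = {}
for token in word_tokenize(corpus):
    if token not in vocab:
        vocab[token] = len(vocab)

print(word_tokenize(corpus))
# ["Let's", 'do', 'tokenization', '!', 'The', 'dog', 'saw', 'the', 'dogs', '.']
print(vocab["dog"], vocab["dogs"])  # two unrelated IDs, even though the words are related
```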
A large vocabulary is an issue because each ID is mapped to a large vector that represents the word's meaning, and keeping track of these mappings requires an enormous number of weights when the vocabulary size is very large.

If we want our models to stay lean, we can opt for our tokenizer to ignore certain words that we don't necessarily need. For example, when training our tokenizer on a text, we might want to take only the 10,000 most frequent words in that text, rather than taking every word in that text, or the words of every language, to create our base vocabulary. The tokenizer will know how to convert those 10,000 words into numbers, but any other word will be converted to the out-of-vocabulary word, or, as shown here, the "unknown" word.

Unfortunately, this is a compromise: the model will have the exact same representation for all the words it doesn't know, which can result in a lot of lost information if many unknown words are present.
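As a sketch of that compromise, the snippet below caps the vocabulary at the most frequent words and maps everything else to a single unknown token. The "[UNK]" string, the tiny max_size, the helper names build_vocab and encode, and the example sentences are all illustrative assumptions rather than the course's code.

```python
from collections import Counter

UNK = "[UNK]"  # placeholder name for the unknown token (an assumption, not a fixed convention)

def build_vocab(tokens, max_size):
    # Reserve ID 0 for the unknown token, then keep only the most frequent words.
    vocab = {UNK: 0}
    for word, _ in Counter(tokens).most_common(max_size - 1):
        vocab[word] = len(vocab)
    return vocab

def encode(tokens, vocab):
    # Any word outside the capped vocabulary collapses to the unknown token's ID.
    return [vocab.get(token, vocab[UNK]) for token in tokens]

training_tokens = "the dog saw the cat and the dog ran".split()
vocab = build_vocab(training_tokens, max_size=4)  # a tiny cap standing in for the 10,000 above
print(vocab)                                      # {'[UNK]': 0, 'the': 1, 'dog': 2, 'saw': 3}
print(encode("the zebra ran".split(), vocab))     # [1, 0, 0] -- 'zebra' and 'ran' both become [UNK]
```

Because every out-of-vocabulary word maps to the same ID, the model cannot tell them apart, which is exactly the information loss described above.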