1
00:00:00,000 --> 00:00:02,667
(air whooshing)

2
00:00:05,310 --> 00:00:08,700
- In this video, we will see together

3
00:00:08,700 --> 00:00:11,820
what the purpose of training a tokenizer is,

4
00:00:11,820 --> 00:00:14,400
what the key steps to follow are,

5
00:00:14,400 --> 00:00:16,953
and the easiest way to do it.

6
00:00:18,690 --> 00:00:20,677
You will ask yourself the question,

7
00:00:20,677 --> 00:00:23,040
"Should I train a new tokenizer?",

8
00:00:23,040 --> 00:00:25,773
when you plan to train a new model from scratch.

9
00:00:29,520 --> 00:00:34,020
A trained tokenizer would not be suitable for your corpus

10
00:00:34,020 --> 00:00:37,080
if your corpus is in a different language,

11
00:00:37,080 --> 00:00:42,060
uses new characters, such as accents or uppercase letters,

12
00:00:42,060 --> 00:00:47,060
has a specific vocabulary, for example medical or legal,

13
00:00:47,100 --> 00:00:49,050
or uses a different style,

14
00:00:49,050 --> 00:00:51,873
the language of another century, for example.

15
00:00:56,490 --> 00:00:58,320
If I take the tokenizer trained on

16
00:00:58,320 --> 00:01:00,780
the bert-base-uncased model

17
00:01:00,780 --> 00:01:03,213
and ignore its normalization step,

18
00:01:04,260 --> 00:01:07,650
then we can see that the tokenization operation

19
00:01:07,650 --> 00:01:09,277
on the English sentence,

20
00:01:09,277 --> 00:01:12,480
"Here is a sentence adapted to our tokenizer",

21
00:01:12,480 --> 00:01:15,600
produces a rather satisfactory list of tokens,

22
00:01:15,600 --> 00:01:18,510
in the sense that this eight-word sentence

23
00:01:18,510 --> 00:01:20,643
is tokenized into nine tokens.

24
00:01:22,920 --> 00:01:26,340
On the other hand, if I use this same tokenizer

25
00:01:26,340 --> 00:01:29,370
on a sentence in Bengali, we see that

26
00:01:29,370 --> 00:01:33,690
either a word is divided into many subtokens,

27
00:01:33,690 --> 00:01:36,270
or the tokenizer does not recognize one of

28
00:01:36,270 --> 00:01:39,873
the Unicode characters and returns only an unknown token.

29
00:01:41,220 --> 00:01:44,970
The fact that a common word is split into many tokens

30
00:01:44,970 --> 00:01:47,910
can be problematic, because language models

31
00:01:47,910 --> 00:01:51,903
can only handle token sequences of limited length.

32
00:01:52,830 --> 00:01:55,830
A tokenizer that excessively splits your initial text

33
00:01:55,830 --> 00:01:58,503
may even impact the performance of your model.

34
00:01:59,760 --> 00:02:02,280
Unknown tokens are also problematic,

35
00:02:02,280 --> 00:02:04,530
because the model will not be able to extract

36
00:02:04,530 --> 00:02:07,563
any information from the unknown part of the text.
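
A quick way to check this for your own corpus is to run the checkpoint's tokenizer on a few of your sentences and inspect the output. Here is a minimal sketch using the bert-base-uncased tokenizer and the English sentence quoted above; the exact splits depend on the checkpoint's vocabulary, so treat the comments as typical rather than guaranteed results.

from transformers import AutoTokenizer

# Tokenizer trained for the bert-base-uncased checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The eight-word English sentence from the video typically comes out as
# nine tokens, with "tokenizer" split into "token" and "##izer".
print(tokenizer.tokenize("Here is a sentence adapted to our tokenizer"))

# Running the same call on a Bengali sentence instead would show the
# problem described above: many subtokens per word, or the unknown token
# "[UNK]" for characters missing from the vocabulary.
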
37
00:02:11,430 --> 00:02:13,440
In this other example, we can see that

38
00:02:13,440 --> 00:02:17,100
the tokenizer replaces words containing

39
00:02:17,100 --> 00:02:20,973
accented characters and capital letters with unknown tokens.

40
00:02:22,050 --> 00:02:24,770
Finally, if we use this same tokenizer again

41
00:02:24,770 --> 00:02:28,170
to tokenize medical vocabulary, we see once more that

42
00:02:28,170 --> 00:02:31,800
a single word is divided into many subtokens:

43
00:02:31,800 --> 00:02:34,803
four for "paracetamol" and four for "pharyngitis".

44
00:02:37,110 --> 00:02:39,360
Most of the tokenizers used by the current

45
00:02:39,360 --> 00:02:42,540
state-of-the-art language models need to be trained

46
00:02:42,540 --> 00:02:45,360
on a corpus that is similar to the one used

47
00:02:45,360 --> 00:02:47,463
to pre-train the language model.

48
00:02:49,140 --> 00:02:51,150
This training consists of learning rules

49
00:02:51,150 --> 00:02:53,250
for dividing text into tokens,

50
00:02:53,250 --> 00:02:56,160
and the way these rules are learned and used

51
00:02:56,160 --> 00:02:58,233
depends on the chosen tokenizer model.

52
00:03:00,630 --> 00:03:04,590
Thus, to train a new tokenizer, it is first necessary

53
00:03:04,590 --> 00:03:07,653
to build a training corpus composed of raw texts.

54
00:03:08,910 --> 00:03:12,423
Then, you have to choose an architecture for your tokenizer.

55
00:03:13,410 --> 00:03:14,763
Here there are two options.

56
00:03:15,900 --> 00:03:19,710
The simplest is to reuse the same architecture as that

57
00:03:19,710 --> 00:03:22,863
of a tokenizer used by another, already-trained model.

58
00:03:24,210 --> 00:03:25,980
Otherwise, it is also possible

59
00:03:25,980 --> 00:03:28,560
to design your tokenizer entirely from scratch,

60
00:03:28,560 --> 00:03:31,683
but this requires more experience and attention.

61
00:03:33,750 --> 00:03:36,660
Once the architecture is chosen, you can

62
00:03:36,660 --> 00:03:39,513
train this tokenizer on the corpus you have built.

63
00:03:40,650 --> 00:03:43,440
Finally, the last thing you need to do is save

64
00:03:43,440 --> 00:03:46,443
the learned rules so that this tokenizer can be used later.

65
00:03:49,530 --> 00:03:51,330
Let's take an example.

66
00:03:51,330 --> 00:03:54,873
Let's say you want to train a GPT-2 model on Python code.

67
00:03:56,160 --> 00:03:59,640
Even though Python code is usually written in English,

68
00:03:59,640 --> 00:04:02,386
this type of text is very specific

69
00:04:02,386 --> 00:04:04,473
and deserves a tokenizer trained on it.

70
00:04:05,340 --> 00:04:07,980
To convince you of this, we will see at the end

71
00:04:07,980 --> 00:04:10,023
the difference it produces on an example.
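
To get a feel for why Python code deserves its own tokenizer, here is a minimal sketch of how the stock GPT-2 tokenizer handles a short snippet; the snippet and the behaviour noted in the comments are illustrative assumptions, not the exact example used in the video.

from transformers import AutoTokenizer

# The original GPT-2 tokenizer, trained mostly on English web text.
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")

code = """def add(a, b):
    return a + b
"""

# Each space of the indentation typically shows up as its own token,
# wasting a lot of the model's limited sequence length on whitespace.
print(gpt2_tokenizer.tokenize(code))
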
72
00:04:11,400 --> 00:04:13,747
For that, we are going to use the method

73
00:04:13,747 --> 00:04:18,240
"train_new_from_iterator", which all the fast tokenizers

74
00:04:18,240 --> 00:04:20,040
of the library have, and thus,

75
00:04:20,040 --> 00:04:22,503
in particular, GPT2TokenizerFast.

76
00:04:23,880 --> 00:04:26,100
This is the simplest method in our case

77
00:04:26,100 --> 00:04:28,983
to obtain a tokenizer adapted to Python code.

78
00:04:30,180 --> 00:04:34,140
Remember, the first thing is to gather a training corpus.

79
00:04:34,140 --> 00:04:37,320
We will use a subset of the CodeSearchNet dataset

80
00:04:37,320 --> 00:04:39,360
containing only Python functions

81
00:04:39,360 --> 00:04:42,360
from open-source libraries on GitHub.

82
00:04:42,360 --> 00:04:43,650
Conveniently,

83
00:04:43,650 --> 00:04:46,980
this dataset is known to the datasets library,

84
00:04:46,980 --> 00:04:49,203
and we can load it in two lines of code.

85
00:04:50,760 --> 00:04:55,230
Then, as the "train_new_from_iterator" method expects

86
00:04:55,230 --> 00:04:57,150
an iterator of lists of texts,

87
00:04:57,150 --> 00:04:59,970
we create the "get_training_corpus" function,

88
00:04:59,970 --> 00:05:01,743
which will return an iterator.

89
00:05:03,870 --> 00:05:05,430
Now that we have our iterator

90
00:05:05,430 --> 00:05:09,630
over our corpus of Python functions, we can load

91
00:05:09,630 --> 00:05:12,351
the GPT-2 tokenizer architecture.

92
00:05:12,351 --> 00:05:16,560
Here, old_tokenizer is not adapted to our corpus,

93
00:05:16,560 --> 00:05:17,700
but we only need

94
00:05:17,700 --> 00:05:20,733
one more line to train it on our new corpus.

95
00:05:21,780 --> 00:05:24,720
An argument common to most of the tokenization

96
00:05:24,720 --> 00:05:28,980
algorithms used at the moment is the size of the vocabulary;

97
00:05:28,980 --> 00:05:31,773
here we choose the value 52,000.

98
00:05:32,820 --> 00:05:35,760
Finally, once the training is finished,

99
00:05:35,760 --> 00:05:38,850
we just have to save our new tokenizer locally,

100
00:05:38,850 --> 00:05:41,730
or send it to the Hub to be able to reuse it

101
00:05:41,730 --> 00:05:43,593
very easily afterwards.

102
00:05:45,270 --> 00:05:48,990
Finally, let's see together on an example whether it was useful

103
00:05:48,990 --> 00:05:53,073
to retrain a tokenizer similar to the GPT-2 one.

104
00:05:55,110 --> 00:05:57,660
With the original tokenizer of GPT-2,

105
00:05:57,660 --> 00:06:00,330
we see that all spaces are isolated,

106
00:06:00,330 --> 00:06:01,920
and that the method name "randn",

107
00:06:01,920 --> 00:06:04,833
relatively common in Python code, is split in two.
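
Before looking at the new tokenizer's output, here is a sketch that gathers the steps just described into one place. The dataset identifier, its "python" configuration, and the "whole_func_string" column holding each function's source are assumptions based on the public CodeSearchNet dataset and may need adjusting to your setup.

from datasets import load_dataset
from transformers import AutoTokenizer

# Step 1: gather the training corpus, here the Python subset of CodeSearchNet.
raw_datasets = load_dataset("code_search_net", "python")

# "train_new_from_iterator" expects an iterator yielding batches (lists) of
# texts, so we stream the functions in chunks rather than building one huge list.
def get_training_corpus():
    dataset = raw_datasets["train"]
    for start_idx in range(0, len(dataset), 1000):
        yield dataset[start_idx : start_idx + 1000]["whole_func_string"]

# Step 2: reuse the GPT-2 tokenizer architecture (a fast tokenizer).
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Step 3: train it on our corpus with a vocabulary size of 52,000.
new_tokenizer = old_tokenizer.train_new_from_iterator(get_training_corpus(), 52000)

# Step 4: save the learned rules locally, or push them to the Hub.
new_tokenizer.save_pretrained("code-search-net-tokenizer")
# new_tokenizer.push_to_hub("code-search-net-tokenizer")

With old_tokenizer and new_tokenizer in hand, the comparison described next is easy to reproduce.
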
108
00:06:05,730 --> 00:06:09,060
With our new tokenizer, single and double indentations

109
00:06:09,060 --> 00:06:10,890
have been learned, and the method "randn"

110
00:06:10,890 --> 00:06:13,770
is tokenized into a single token.

111
00:06:13,770 --> 00:06:15,000
And with that,

112
00:06:15,000 --> 00:06:18,123
you now know how to train your very own tokenizers.

113
00:06:19,498 --> 00:06:22,165
(air whooshing)
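
For completeness, here is a small, hypothetical sketch of that final comparison, reusing old_tokenizer and new_tokenizer from the training sketch above; the snippet and the behaviour noted in the comments are illustrative, and the exact token lists will depend on your own training run.

# A hypothetical snippet using np.random.randn, similar in spirit to the
# example discussed in the video.
example = '''def random_matrix(rows, cols):
    """Return a matrix of the given shape filled with random values."""
    return np.random.randn(rows, cols)
'''

# Original GPT-2 tokenizer: indentation spaces are isolated and "randn"
# is split into more than one token.
print(old_tokenizer.tokenize(example))

# Retrained tokenizer: indentation levels and "randn" get their own tokens,
# giving a noticeably shorter sequence.
print(new_tokenizer.tokenize(example))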