subtitles/zh-CN/54_building-a-new-tokenizer.srt
1
00:00:00,188 --> 00:00:02,855
(空气呼啸)
(air whooshing)
2
00:00:05,400 --> 00:00:07,500
在本视频中,我们将看到如何
In this video, we will see how
3
00:00:07,500 --> 00:00:11,310
你可以从头开始创建自己的 tokenizer(分词器) 。
you can create your own tokenizer from scratch.
4
00:00:11,310 --> 00:00:15,000
要创建自己的 tokenizer ,你必须考虑
To create your own tokenizer, you will have to think about
5
00:00:15,000 --> 00:00:18,180
分词化中涉及的每个操作。
each of the operations involved in tokenization.
6
00:00:18,180 --> 00:00:22,440
即,规范化,预标记化,
Namely, the normalization, the pre-tokenization,
7
00:00:22,440 --> 00:00:25,233
模型、后处理和解码。
the model, the post processing, and the decoding.
8
00:00:26,100 --> 00:00:28,350
如果你不知道什么是规范化,
If you don't know what normalization,
9
00:00:28,350 --> 00:00:30,900
预标记化和模型,
pre-tokenization, and the model are,
10
00:00:30,900 --> 00:00:34,531
我建议你去看下面链接的视频。
I advise you to go and see the videos linked below.
11
00:00:34,531 --> 00:00:37,110
后处理汇集了所有
The post processing gathers all the modifications
12
00:00:37,110 --> 00:00:40,860
我们将对分词后的文本所做的修改。
that we will carry out on the tokenized text.
13
00:00:40,860 --> 00:00:43,890
它可以包括添加特殊 token ,
It can include the addition of special tokens,
14
00:00:43,890 --> 00:00:46,290
创建一个注意力掩码(attention mask),
the creation of an attention mask,
15
00:00:46,290 --> 00:00:48,903
还会生成 token 的 ID 列表。
but also the generation of a list of token IDs.
16
00:00:50,220 --> 00:00:53,487
解码操作发生在最后,
The decoding operation occurs at the very end,
17
00:00:53,487 --> 00:00:54,660
并将允许我们
and will allow going
18
00:00:54,660 --> 00:00:57,753
从 ID 序列还原回一个句子。
from the sequence of IDs back to a sentence.
19
00:00:58,890 --> 00:01:01,800
例如,你可以看到 ## 前缀
For example, you can see that the hashtags
20
00:01:01,800 --> 00:01:04,260
已被删除,并且组成
have been removed, and the tokens
21
00:01:04,260 --> 00:01:07,323
单词 today 的那些 token 被合并在了一起。
composing the word today have been grouped together.
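As a hedged illustration of this decoding behavior (not code from the video), here is what the off-the-shelf bert-base-uncased fast tokenizer does with a word that gets split into WordPiece sub-tokens:

from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = bert_tok("I love tokenization today")
print(bert_tok.convert_ids_to_tokens(enc["input_ids"]))
# e.g. ['[CLS]', 'i', 'love', 'token', '##ization', 'today', '[SEP]']
print(bert_tok.decode(enc["input_ids"]))
# "[CLS] i love tokenization today [SEP]" -- the '##' is gone and the pieces are merged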
22
00:01:10,860 --> 00:01:13,440
在快速 tokenizer 中,所有这些组件
In a fast tokenizer, all these components
23
00:01:13,440 --> 00:01:16,413
收集在 backend_tokenizer 属性中。
are gathered in the backend_tokenizer attribute.
24
00:01:17,370 --> 00:01:20,070
正如你在这段小代码片段中看到的那样,
As you can see with this small code snippet,
25
00:01:20,070 --> 00:01:22,020
它是 tokenizer 的一个实例
it is an instance of a tokenizer
26
00:01:22,020 --> 00:01:23,763
来自 tokenizers 库。
from the tokenizers library.
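For instance, a minimal check (assuming an off-the-shelf BERT fast tokenizer) looks like this:

from transformers import AutoTokenizer

fast_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# the fast tokenizer wraps a tokenizers.Tokenizer instance
print(type(fast_tokenizer.backend_tokenizer))
# <class 'tokenizers.Tokenizer'>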
27
00:01:25,740 --> 00:01:28,263
因此,要创建自己的 tokenizer ,
So, to create your own tokenizer,
28
00:01:29,970 --> 00:01:31,770
你将必须遵循这些步骤。
you will have to follow these steps.
29
00:01:33,270 --> 00:01:35,433
第一,创建一个训练数据集。
First, create a training dataset.
30
00:01:36,690 --> 00:01:39,000
第二, 创建和训练 tokenizer
Second, create and train a tokenizer
31
00:01:39,000 --> 00:01:41,700
使用 tokenizers 库。
with the tokenizers library.
32
00:01:41,700 --> 00:01:46,700
第三,将此 tokenizer 加载到 transformers 的 tokenizer 中。
And third, load this tokenizer into a transformers tokenizer.
33
00:01:49,350 --> 00:01:50,850
要了解这些步骤,
To understand these steps,
34
00:01:50,850 --> 00:01:54,573
我建议我们一起重新创建一个 BERT 分词器。
I propose that we recreate a BERT tokenizer together.
35
00:01:56,460 --> 00:01:58,893
首先要做的是创建一个数据集。
The first thing to do is to create a dataset.
36
00:01:59,970 --> 00:02:02,460
使用此代码片段,你可以创建一个迭代器
With this code snippet you can create an iterator
37
00:02:02,460 --> 00:02:05,430
在数据集 wikitext-2-raw-v1 上,
on the dataset wikitext-2-raw-v1,
38
00:02:05,430 --> 00:02:08,160
这是一个相当小的英语数据集,
which is a rather small dataset in English,
39
00:02:08,160 --> 00:02:09,730
非常适合用作示例。
perfect for the example.
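A sketch of such an iterator, along the lines of the snippet shown in the video (the batch size of 1,000 is an illustrative choice):

from datasets import load_dataset

dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

def get_training_corpus():
    # yield batches of raw text lines for the tokenizer trainer
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]["text"]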
40
00:02:12,210 --> 00:02:13,920
现在我们来处理重头戏,
Now we tackle the big part,
41
00:02:13,920 --> 00:02:17,373
用 tokenizers 库设计我们的 tokenizer。
the design of our tokenizer with the tokenizers library.
42
00:02:18,750 --> 00:02:22,020
我们首先初始化一个 tokenizer 实例
We start by initializing a tokenizer instance
43
00:02:22,020 --> 00:02:26,133
使用 WordPiece 模型,因为它是 BERT 使用的模型。
with a WordPiece model because it is the model used by BERT.
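A minimal sketch of this initialization, assuming [UNK] as the unknown token (as in BERT):

from tokenizers import Tokenizer, models

# start from a bare WordPiece model; the vocabulary will be learned later
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))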
44
00:02:29,100 --> 00:02:32,190
然后我们可以定义我们的规范化器。
Then we can define our normalizer.
45
00:02:32,190 --> 00:02:35,891
我们会把它定义为一系列规范化:两个用于
We will define it as a succession of two normalizations
46
00:02:35,891 --> 00:02:39,453
清理文本中不可见字符的规范化,
used to clean up characters not visible in the text,
47
00:02:40,590 --> 00:02:43,440
一个小写化规范化,
one lowercasing normalization,
48
00:02:43,440 --> 00:02:47,253
以及最后两个用于去除重音符号的规范化。
and two final normalizations used to remove accents.
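One way to express such a sequence with the tokenizers library; the exact cleanup normalizers used in the video may differ, but NFD plus StripAccents is a common pair for accent removal:

from tokenizers import normalizers

tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]
)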
49
00:02:49,500 --> 00:02:53,553
对于预标记化,我们将串联两个 pre_tokenizer。
For the pre-tokenization, we will chain two pre_tokenizers.
50
00:02:54,390 --> 00:02:58,200
第一个在空格级别分隔文本,
The first one separating the text at the level of spaces,
51
00:02:58,200 --> 00:03:01,533
第二个隔离标点符号。
and the second one isolating the punctuation marks.
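A sketch of this chaining with the tokenizers library:

from tokenizers import pre_tokenizers

# split on whitespace first, then isolate punctuation marks as their own tokens
tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
    [pre_tokenizers.WhitespaceSplit(), pre_tokenizers.Punctuation()]
)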
52
00:03:03,360 --> 00:03:06,360
现在,我们可以定义训练器(trainer),它将允许我们
Now, we can define the trainer that will allow us
53
00:03:06,360 --> 00:03:09,753
训练一开始选择的 WordPiece 模型。
to train the WordPiece model chosen at the beginning.
54
00:03:11,160 --> 00:03:12,600
为了开展训练,
To carry out the training,
55
00:03:12,600 --> 00:03:14,853
我们将不得不选择词汇量。
we will have to choose a vocabulary size.
56
00:03:16,050 --> 00:03:17,910
这里我们选择 25,000。
Here we choose 25,000.
57
00:03:17,910 --> 00:03:21,270
我们还需要声明特殊 token,
And we also need to announce the special tokens
58
00:03:21,270 --> 00:03:24,663
也就是我们一定要添加到词汇表中的那些。
that we absolutely want to add to our vocabulary.
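A sketch of the trainer definition; the list of special tokens below is the usual BERT set and is assumed here:

from tokenizers import trainers

special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(
    vocab_size=25000,
    special_tokens=special_tokens,
)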
59
00:03:29,160 --> 00:03:33,000
在一行代码中,我们可以训练我们的 WordPiece 模型
In one line of code, we can train our WordPiece model
60
00:03:33,000 --> 00:03:35,553
使用我们之前定义的迭代器。
using the iterator we defined earlier.
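That one line, assuming the get_training_corpus iterator and the trainer defined above:

# train the WordPiece model on the batches yielded by the iterator
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)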
61
00:03:39,060 --> 00:03:42,570
模型训练完成后,我们可以检索
Once the model has been trained, we can retrieve
62
00:03:42,570 --> 00:03:46,560
特殊的类别 token 和分隔 token 的 ID,
the IDs of the special class and separation tokens,
63
00:03:46,560 --> 00:03:49,413
因为我们需要用它们来对序列进行后处理。
because we will need them to post-process our sequence.
64
00:03:50,820 --> 00:03:52,860
感谢 TemplateProcessing 类,
Thanks to the TemplateProcessing class,
65
00:03:52,860 --> 00:03:57,210
我们可以在每个序列的开头添加 CLS token ,
we can add the CLS token at the beginning of each sequence,
66
00:03:57,210 --> 00:04:00,120
和序列末尾的 SEP token ,
and the SEP token at the end of the sequence,
67
00:04:00,120 --> 00:04:03,873
如果我们对一对文本进行分词,还会添加在两个句子之间。
and between two sentences if we tokenize a pair of text.
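A sketch of this post-processing setup; the template strings follow the standard BERT layout ([CLS] sentence [SEP], with the second segment getting type ID 1):

from tokenizers import processors

cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")

tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS]:0 $A:0 [SEP]:0",
    pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[("[CLS]", cls_token_id), ("[SEP]", sep_token_id)],
)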
68
00:04:07,260 --> 00:04:10,500
最后,我们只需要定义我们的解码器,
Finally, we just have to define our decoder,
69
00:04:10,500 --> 00:04:12,690
这将允许我们删除 token 开头的
which will allow us to remove the hashtags
70
00:04:12,690 --> 00:04:14,610
## 前缀,这些 token 必须
at the beginning of the tokens
71
00:04:14,610 --> 00:04:17,193
重新连接到前一个 token 上。
that must be reattached to the previous token.
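A one-line sketch of that decoder:

from tokenizers import decoders

# strip the '##' prefix and glue each sub-token back onto the previous token
tokenizer.decoder = decoders.WordPiece(prefix="##")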
72
00:04:21,300 --> 00:04:22,260
就是这样。
And there it is.
73
00:04:22,260 --> 00:04:25,110
你拥有所有必要的代码行
You have all the necessary lines of code
74
00:04:25,110 --> 00:04:29,403
用 tokenizers 库定义你自己的 tokenizer 了。
to define your own tokenizer with the tokenizers library.
75
00:04:30,960 --> 00:04:32,280
现在我们有了一个用 tokenizers 库
Now that we have a brand new tokenizer
76
00:04:32,280 --> 00:04:35,400
构建的全新 tokenizer,我们只需将它
with the tokenizers library, we just have to load it
77
00:04:35,400 --> 00:04:38,463
加载到 transformers 库的快速 tokenizer 中。
into a fast tokenizer from the transformers library.
78
00:04:39,960 --> 00:04:42,630
同样,我们有几种可能性。
Here again, we have several possibilities.
79
00:04:42,630 --> 00:04:44,430
我们可以把它加载到通用类
We can load it in the generic class,
80
00:04:44,430 --> 00:04:48,330
PreTrainedTokenizerFast 中,或者加载到 BertTokenizerFast 类中,
PreTrainedTokenizerFast, or in the BertTokenizerFast class
81
00:04:48,330 --> 00:04:52,353
因为我们在这里构建了一个类似 BERT 的分词器。
since we have built a BERT-like tokenizer here.
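A sketch of both options; with the generic class the special tokens have to be declared explicitly, while BertTokenizerFast already knows them:

from transformers import BertTokenizerFast, PreTrainedTokenizerFast

# option 1: generic wrapper, declaring the special tokens by hand
wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

# option 2: the BERT-specific class, since this is a BERT-like tokenizer
wrapped_tokenizer = BertTokenizerFast(tokenizer_object=tokenizer)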
82
00:04:57,000 --> 00:04:59,670
我真的希望这个视频能帮助你理解
I really hope this video has helped you understand
83
00:04:59,670 --> 00:05:02,133
如何创建自己的 tokenizer ,
how you can create your own tokenizer,
84
00:05:03,178 --> 00:05:06,240
并且你现在已经准备好浏览
and that you are ready now to navigate
85
00:05:06,240 --> 00:05:08,070
tokenizers 库的文档,
the tokenizers library documentation
86
00:05:08,070 --> 00:05:11,367
为你的全新 tokenizer 选择组件。
to choose the components for your brand new tokenizer.
87
00:05:12,674 --> 00:05:15,341
(空气呼啸)
(air whooshing)