(air whooshing)

In this video, we will see how you can create your own tokenizer from scratch. To create your own tokenizer, you will have to think about each of the operations involved in tokenization: the normalization, the pre-tokenization, the model, the post-processing, and the decoding. If you don't know what the normalization, the pre-tokenization, and the model are, I advise you to go and see the videos linked below.

The post-processing gathers all the modifications that we will carry out on the tokenized text. It can include the addition of special tokens, the creation of an attention mask, but also the generation of the list of token IDs. The decoding operation occurs at the very end, and will allow us to go back from the sequence of IDs to a sentence. For example, you can see that the '##' prefixes have been removed, and the tokens composing the word "today" have been grouped back together.

In a fast tokenizer, all these components are gathered in the backend_tokenizer attribute. As you can see with this small code snippet, it is an instance of a Tokenizer from the tokenizers library.

So, to create your own tokenizer, you will have to follow these steps: first, create a training dataset; second, create and train a tokenizer with the tokenizers library; and third, load this tokenizer into a transformers tokenizer.

To understand these steps, I propose that we recreate a BERT tokenizer together. The first thing to do is to create a dataset. With this code snippet you can create an iterator over the dataset wikitext-2-raw-v1, which is a rather small English dataset, perfect for the example.
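As a minimal sketch of this first step, here is what such an iterator could look like. The "text" column name and the batch size of 1,000 are assumptions on my part (they are not spelled out in the video), and the backend_tokenizer check at the top is only illustrative, using the bert-base-uncased checkpoint as an assumed example.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Illustrative check (checkpoint name assumed): the backend_tokenizer
# attribute of a fast tokenizer is an instance of tokenizers.Tokenizer.
print(type(AutoTokenizer.from_pretrained("bert-base-uncased").backend_tokenizer))

# Step 1: build an iterator over the training corpus.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

def get_training_corpus():
    # Yield the raw text in batches so the whole corpus never has to sit
    # in memory as one giant Python list.
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]["text"]
```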
Now we come to the big part: the design of our tokenizer with the tokenizers library. We start by initializing a tokenizer instance with a WordPiece model, because it is the model used by BERT.

Then we can define our normalizer. We will define it as a succession of normalizations: two normalizations used to clean up characters that are not visible in the text, one lowercasing normalization, and two last normalizations used to remove accents.

For the pre-tokenization, we will chain two pre-tokenizers: the first one separating the text at the level of spaces, and the second one isolating the punctuation marks.

Now we can define the trainer that will allow us to train the WordPiece model chosen at the beginning. To carry out the training, we will have to choose a vocabulary size; here we choose 25,000. And we also need to announce the special tokens that we absolutely want to add to our vocabulary.

In one line of code, we can train our WordPiece model using the iterator we defined earlier. Once the model has been trained, we can retrieve the IDs of the special class and separation tokens, because we will need them to post-process our sequences. Thanks to the TemplateProcessing class, we can add the CLS token at the beginning of each sequence, the SEP token at the end of the sequence, and one between the two sentences if we tokenize a pair of texts.

Finally, we just have to define our decoder, which will allow us to remove the '##' prefixes at the beginning of the tokens that must be reattached to the previous token.

And there it is: you have all the necessary lines of code to define your own tokenizer with the tokenizers library.
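Putting those steps together, here is a minimal sketch of the whole construction with the tokenizers library. The overall structure (WordPiece model, chained normalizers and pre-tokenizers, WordPiece trainer, TemplateProcessing post-processor, WordPiece decoder) follows what is described above; the exact Replace patterns used to strip invisible characters, the special-token list, and the template strings are my assumptions, not values read off the video.

```python
from tokenizers import (
    Tokenizer, Regex, models, normalizers, pre_tokenizers,
    trainers, processors, decoders,
)

# A WordPiece model, since that is what BERT uses.
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))

# Normalizer: two Replace steps to clean up invisible characters
# (patterns assumed), one lowercasing step, and NFD + StripAccents
# to remove accents.
tokenizer.normalizer = normalizers.Sequence([
    normalizers.Replace(Regex(r"[\p{Other}&&[^\n\t\r]]"), ""),
    normalizers.Replace(Regex(r"[\s]"), " "),
    normalizers.Lowercase(),
    normalizers.NFD(),
    normalizers.StripAccents(),
])

# Pre-tokenizer: split on whitespace, then isolate punctuation marks.
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.WhitespaceSplit(),
    pre_tokenizers.Punctuation(),
])

# Trainer: a vocabulary of 25,000 plus the special tokens we want to keep.
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)

# One line of code to train from the iterator defined earlier.
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)

# Retrieve the IDs of the class and separation tokens for post-processing.
cls_id = tokenizer.token_to_id("[CLS]")
sep_id = tokenizer.token_to_id("[SEP]")

# Post-processor: [CLS] at the start, [SEP] at the end and between a pair.
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS]:0 $A:0 [SEP]:0",
    pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[("[CLS]", cls_id), ("[SEP]", sep_id)],
)

# Decoder: strip the '##' prefixes and reattach sub-tokens to the previous token.
tokenizer.decoder = decoders.WordPiece(prefix="##")
```

With this in place, encoding a sentence already produces BERT-style inputs, and decoding the resulting IDs gives back readable text with the sub-words merged.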
Now that we have a brand new tokenizer built with the tokenizers library, we just have to load it into a fast tokenizer from the transformers library. Here again, we have several possibilities: we can load it into the generic class, PreTrainedTokenizerFast, or into the BertTokenizerFast class, since we have built a BERT-like tokenizer here; both options are sketched after this transcript.

I really hope this video has helped you understand how you can create your own tokenizer, and that you are now ready to navigate the tokenizers library documentation to choose the components for your brand new tokenizer.

(air whooshing)
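As referenced above, here is a minimal sketch of that last loading step. The special-token arguments passed to the generic wrapper are assumed to match the tokens we trained with; they are not listed in the video itself.

```python
from transformers import PreTrainedTokenizerFast, BertTokenizerFast

# Option 1: the generic fast-tokenizer class. The special tokens must be
# declared explicitly here (names assumed to match the ones trained above).
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

# Option 2: the BERT-specific class, since we built a BERT-like tokenizer;
# it already knows the standard BERT special tokens.
bert_fast_tokenizer = BertTokenizerFast(tokenizer_object=tokenizer)
```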