subtitles/zh-CN/54_building-a-new-tokenizer.srt
1
00:00:00,188 --> 00:00:02,855
(空气呼啸)
(air whooshing)
2
00:00:05,400 --> 00:00:07,500
在本视频中,我们将看到如何
In this video, we will see how
3
00:00:07,500 --> 00:00:11,310
你可以从头开始创建自己的 tokenizer(分词器) 。
you can create your own tokenizer from scratch.
4
00:00:11,310 --> 00:00:15,000
要创建自己的 tokenizer ,你必须考虑
To create your own tokenizer, you will have to think about
5
00:00:15,000 --> 00:00:18,180
分词化中涉及的每个操作。
each of the operations involved in tokenization.
6
00:00:18,180 --> 00:00:22,440
即,规范化,预标记化,
Namely, the normalization, the pre-tokenization,
7
00:00:22,440 --> 00:00:25,233
模型、后处理和解码。
the model, the post processing, and the decoding.
8
00:00:26,100 --> 00:00:28,350
如果你不知道什么是规范化,
If you don't know what normalization,
9
00:00:28,350 --> 00:00:30,900
预标记化和模型,
pre-tokenization, and the model are,
10
00:00:30,900 --> 00:00:34,531
我建议你去看下面链接的视频。
I advise you to go and see the videos linked below.
11
00:00:34,531 --> 00:00:37,110
后处理汇集了所有
The post processing gathers all the modifications
12
00:00:37,110 --> 00:00:40,860
我们将对分词后的文本所做的修改。
that we will carry out on the tokenized text.
13
00:00:40,860 --> 00:00:43,890
它可以包括添加特殊 token ,
It can include the addition of special tokens,
14
00:00:43,890 --> 00:00:46,290
创建一个注意力掩码(attention mask),
the creation of an attention mask,
15
00:00:46,290 --> 00:00:48,903
还会生成 token 的 ID 列表。
but also the generation of a list of token IDs.
16
00:00:50,220 --> 00:00:53,487
解码操作发生在最后,
The decoding operation occurs at the very end,
17
00:00:53,487 --> 00:00:54,660
并将允许我们
and will allow going
18
00:00:54,660 --> 00:00:57,753
从 ID 序列还原回一个句子。
from the sequence of IDs back to a sentence.
19
00:00:58,890 --> 00:01:01,800
例如,你可以看到 ## 前缀
For example, you can see that the hashtags
20
00:01:01,800 --> 00:01:04,260
已被删除,并且组成
have been removed, and the tokens
21
00:01:04,260 --> 00:01:07,323
单词 today 的那些 token 被合并在了一起。
composing the word today have been grouped together.
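As a hedged illustration of this decoding behavior (not code from the video), here is what the off-the-shelf bert-base-uncased fast tokenizer does with a word that gets split into WordPiece sub-tokens:

from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = bert_tok("I love tokenization today")
print(bert_tok.convert_ids_to_tokens(enc["input_ids"]))
# e.g. ['[CLS]', 'i', 'love', 'token', '##ization', 'today', '[SEP]']
print(bert_tok.decode(enc["input_ids"]))
# "[CLS] i love tokenization today [SEP]" -- the '##' is gone and the pieces are merged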
22
00:01:10,860 --> 00:01:13,440
在快速 tokenizer 中,所有这些组件
In a fast tokenizer, all these components
23
00:01:13,440 --> 00:01:16,413
收集在 backend_tokenizer 属性中。
are gathered in the backend_tokenizer attribute.
24
00:01:17,370 --> 00:01:20,070
正如你在这段小代码片段中看到的那样,
As you can see with this small code snippet,
25
00:01:20,070 --> 00:01:22,020
它是 tokenizer 的一个实例
it is an instance of a tokenizer
26
00:01:22,020 --> 00:01:23,763
来自 tokenizers 库。
from the tokenizers library.
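For instance, a minimal check (assuming an off-the-shelf BERT fast tokenizer) looks like this:

from transformers import AutoTokenizer

fast_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# the fast tokenizer wraps a tokenizers.Tokenizer instance
print(type(fast_tokenizer.backend_tokenizer))
# <class 'tokenizers.Tokenizer'>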
27
00:01:25,740 --> 00:01:28,263
因此,要创建自己的 tokenizer ,
So, to create your own tokenizer,
28
00:01:29,970 --> 00:01:31,770
你将必须遵循这些步骤。
you will have to follow these steps.
29
00:01:33,270 --> 00:01:35,433
第一,创建一个训练数据集。
First, create a training dataset.
30
00:01:36,690 --> 00:01:39,000
第二, 创建和训练 tokenizer
Second, create and train a tokenizer
31
00:01:39,000 --> 00:01:41,700
使用 tokenizers 库。
with the tokenizers library.
32
00:01:41,700 --> 00:01:46,700
第三,将此 tokenizer 加载到 transformers 的 tokenizer 中。
And third, load this tokenizer into a transformers tokenizer.
33
00:01:49,350 --> 00:01:50,850
要了解这些步骤,
To understand these steps,
34
00:01:50,850 --> 00:01:54,573
我建议我们一起重新创建一个 BERT 分词器。
I propose that we recreate a BERT tokenizer together.
35
00:01:56,460 --> 00:01:58,893
首先要做的是创建一个数据集。
The first thing to do is to create a dataset.
36
00:01:59,970 --> 00:02:02,460
使用此代码片段,你可以创建一个迭代器
With this code snippet you can create an iterator
37
00:02:02,460 --> 00:02:05,430
在数据集 wikitext-2-raw-v1 上,
on the dataset wikitext-2-raw-v1,
38
00:02:05,430 --> 00:02:08,160
这是一个相当小的英语数据集,
which is a rather small dataset in English,
39
00:02:08,160 --> 00:02:09,730
非常适合用作示例。
perfect for the example.
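A sketch of such an iterator, along the lines of the snippet shown in the video (the batch size of 1,000 is an illustrative choice):

from datasets import load_dataset

dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

def get_training_corpus():
    # yield batches of raw text lines for the tokenizer trainer
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]["text"]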
40
00:02:12,210 --> 00:02:13,920
现在我们来处理重头戏,
Now we tackle the big part,
41
00:02:13,920 --> 00:02:17,373
用 tokenizers 库设计我们的 tokenizer。
the design of our tokenizer with the tokenizers library.
42
00:02:18,750 --> 00:02:22,020
我们首先初始化一个 tokenizer 实例
We start by initializing a tokenizer instance
43
00:02:22,020 --> 00:02:26,133
使用 WordPiece 模型,因为它是 BERT 使用的模型。
with a WordPiece model because it is the model used by BERT.
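A minimal sketch of this initialization, assuming [UNK] as the unknown token (as in BERT):

from tokenizers import Tokenizer, models

# start from a bare WordPiece model; the vocabulary will be learned later
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))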
44
00:02:29,100 --> 00:02:32,190
然后我们可以定义我们的规范化器。
Then we can define our normalizer.
45
00:02:32,190 --> 00:02:35,891
我们会把它定义为一系列规范化:两个用于
We will define it as a succession of two normalizations
46
00:02:35,891 --> 00:02:39,453
清理文本中不可见字符的规范化,
used to clean up characters not visible in the text,
47
00:02:40,590 --> 00:02:43,440
一个小写化规范化,
one lowercasing normalization,
48
00:02:43,440 --> 00:02:47,253
以及最后两个用于去除重音符号的规范化。
and two final normalizations used to remove accents.
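One way to express such a sequence with the tokenizers library; the exact cleanup normalizers used in the video may differ, but NFD plus StripAccents is a common pair for accent removal:

from tokenizers import normalizers

tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]
)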
49
00:02:49,500 --> 00:02:53,553
对于预标记化,我们将串联两个 pre_tokenizer。
For the pre-tokenization, we will chain two pre_tokenizers.
50
00:02:54,390 --> 00:02:58,200
第一个在空格级别分隔文本,
The first one separating the text at the level of spaces,
51
00:02:58,200 --> 00:03:01,533
第二个隔离标点符号。
and the second one isolating the punctuation marks.
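A sketch of this chaining with the tokenizers library:

from tokenizers import pre_tokenizers

# split on whitespace first, then isolate punctuation marks as their own tokens
tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
    [pre_tokenizers.WhitespaceSplit(), pre_tokenizers.Punctuation()]
)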
52
00:03:03,360 --> 00:03:06,360
现在,我们可以定义训练器(trainer),它将允许我们
Now, we can define the trainer that will allow us
53
00:03:06,360 --> 00:03:09,753
训练一开始选择的 WordPiece 模型。
to train the WordPiece model chosen at the beginning.
54
00:03:11,160 --> 00:03:12,600
为了开展训练,
To carry out the training,
55
00:03:12,600 --> 00:03:14,853
我们将不得不选择词汇量。
we will have to choose a vocabulary size.
56
00:03:16,050 --> 00:03:17,910
这里我们选择 25,000。
Here we choose 25,000.
57
00:03:17,910 --> 00:03:21,270
我们还需要声明特殊 token,
And we also need to announce the special tokens
58
00:03:21,270 --> 00:03:24,663
也就是我们一定要添加到词汇表中的那些。
that we absolutely want to add to our vocabulary.
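A sketch of the trainer definition; the list of special tokens below is the usual BERT set and is assumed here:

from tokenizers import trainers

special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(
    vocab_size=25000,
    special_tokens=special_tokens,
)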
59
00:03:29,160 --> 00:03:33,000
在一行代码中,我们可以训练我们的 WordPiece 模型
In one line of code, we can train our WordPiece model
60
00:03:33,000 --> 00:03:35,553
使用我们之前定义的迭代器。
using the iterator we defined earlier.
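That one line, assuming the get_training_corpus iterator and the trainer defined above:

# train the WordPiece model on the batches yielded by the iterator
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)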
61
00:03:39,060 --> 00:03:42,570
模型训练完成后,我们可以检索
Once the model has been trained, we can retrieve
62
00:03:42,570 --> 00:03:46,560
特殊的类别 token 和分隔 token 的 ID,
the IDs of the special class and separation tokens,
63
00:03:46,560 --> 00:03:49,413
因为我们需要用它们来对序列进行后处理。
because we will need them to post-process our sequence.
64
00:03:50,820 --> 00:03:52,860
感谢 TemplateProcessing 类,
Thanks to the TemplateProcessing class,
65
00:03:52,860 --> 00:03:57,210
我们可以在每个序列的开头添加 CLS token ,
we can add the CLS token at the beginning of each sequence,
66
00:03:57,210 --> 00:04:00,120
和序列末尾的 SEP token ,
and the SEP token at the end of the sequence,
67
00:04:00,120 --> 00:04:03,873
如果我们对一对文本进行分词,还会添加在两个句子之间。
and between two sentences if we tokenize a pair of text.
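A sketch of this post-processing setup; the template strings follow the standard BERT layout ([CLS] sentence [SEP], with the second segment getting type ID 1):

from tokenizers import processors

cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")

tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS]:0 $A:0 [SEP]:0",
    pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[("[CLS]", cls_token_id), ("[SEP]", sep_token_id)],
)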
68
00:04:07,260 --> 00:04:10,500
最后,我们只需要定义我们的解码器,
Finally, we just have to define our decoder,
69
00:04:10,500 --> 00:04:12,690
这将允许我们删除 token 开头的
which will allow us to remove the hashtags
70
00:04:12,690 --> 00:04:14,610
## 前缀,这些 token 必须
at the beginning of the tokens
71
00:04:14,610 --> 00:04:17,193
重新连接到前一个 token 上。
that must be reattached to the previous token.
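A one-line sketch of that decoder:

from tokenizers import decoders

# strip the '##' prefix and glue each sub-token back onto the previous token
tokenizer.decoder = decoders.WordPiece(prefix="##")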
72
00:04:21,300 --> 00:04:22,260
就是这样。
And there it is.
73
00:04:22,260 --> 00:04:25,110
你拥有所有必要的代码行
You have all the necessary lines of code
74
00:04:25,110 --> 00:04:29,403
用 tokenizers 库定义你自己的 tokenizer 了。
to define your own tokenizer with the tokenizers library.
75
00:04:30,960 --> 00:04:32,280
现在我们有了一个用 tokenizers 库
Now that we have a brand new tokenizer
76
00:04:32,280 --> 00:04:35,400
构建的全新 tokenizer,我们只需将它
with the tokenizers library, we just have to load it
77
00:04:35,400 --> 00:04:38,463
加载到 transformers 库的快速 tokenizer 中。
into a fast tokenizer from the transformers library.
78
00:04:39,960 --> 00:04:42,630
同样,我们有几种可能性。
Here again, we have several possibilities.
79
00:04:42,630 --> 00:04:44,430
我们可以把它加载到通用类
We can load it in the generic class,
80
00:04:44,430 --> 00:04:48,330
PreTrainedTokenizerFast 中,或者加载到 BertTokenizerFast 类中,
PreTrainedTokenizerFast, or in the BertTokenizerFast class
81
00:04:48,330 --> 00:04:52,353
因为我们在这里构建了一个类似 BERT 的分词器。
since we have built a BERT-like tokenizer here.
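A sketch of both options; with the generic class the special tokens have to be declared explicitly, while BertTokenizerFast already knows them:

from transformers import BertTokenizerFast, PreTrainedTokenizerFast

# option 1: generic wrapper, declaring the special tokens by hand
wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

# option 2: the BERT-specific class, since this is a BERT-like tokenizer
wrapped_tokenizer = BertTokenizerFast(tokenizer_object=tokenizer)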
82
00:04:57,000 --> 00:04:59,670
我真的希望这个视频能帮助你理解
I really hope this video has helped you understand
83
00:04:59,670 --> 00:05:02,133
如何创建自己的 tokenizer ,
how you can create your own tokenizer,
84
00:05:03,178 --> 00:05:06,240
并且你现在已经准备好浏览
and that you are ready now to navigate
85
00:05:06,240 --> 00:05:08,070
tokenizers 库的文档,
the tokenizers library documentation
86
00:05:08,070 --> 00:05:11,367
为你的全新 tokenizer 选择组件。
to choose the components for your brand new tokenizer.
87
00:05:12,674 --> 00:05:15,341
(空气呼啸)
(air whooshing)