subtitles/zh-CN/42_training-a-new-tokenizer.srt
1
00:00:00,000 --> 00:00:02,667
(空气呼啸)
(air whooshing)
2
00:00:05,310 --> 00:00:08,700
- 在这段视频中,我们将一起看到
- In this video we will see together
3
00:00:08,700 --> 00:00:11,820
训练 tokenizer 的目的是什么,
what is the purpose of training a tokenizer,
4
00:00:11,820 --> 00:00:14,400
要遵循的关键步骤是什么,
what are the key steps to follow,
5
00:00:14,400 --> 00:00:16,953
以及最简单的实现方法是什么。
and what is the easiest way to do it.
6
00:00:18,690 --> 00:00:20,677
你会问自己这个问题,
You will ask yourself the question,
7
00:00:20,677 --> 00:00:23,040
“我应该训练一个新的 tokenizer 吗?”,
"Should I train a new tokenizer?",
8
00:00:23,040 --> 00:00:25,773
当你计划从头开始训练新模型时。
when you plan to train a new model from scratch.
9
00:00:29,520 --> 00:00:34,020
一个已经训练过的 tokenizer 将不适合你的语料库
A trained tokenizer would not be suitable for your corpus
10
00:00:34,020 --> 00:00:37,080
如果你的语料库使用的是不同的语言,
if your corpus is in a different language,
11
00:00:37,080 --> 00:00:42,060
使用新字符,比如重音符号或大写字母,
uses new characters, such as accents or upper cased letters,
12
00:00:42,060 --> 00:00:47,060
有特定的词汇,例如医学的或法律的,
has a specific vocabulary, for example medical or legal,
13
00:00:47,100 --> 00:00:49,050
或使用一个不同的风格,
or uses a different style,
14
00:00:49,050 --> 00:00:51,873
例如来自另一个世纪的语言。
a language from another century for example.
15
00:00:56,490 --> 00:00:58,320
如果我采用在
If I take the tokenizer trained on
16
00:00:58,320 --> 00:01:00,780
bert-base-uncased 模型上训练的 tokenizer ,
the bert-base-uncased model,
17
00:01:00,780 --> 00:01:03,213
并忽略其归一化步骤,
and ignore its normalization step,
18
00:01:04,260 --> 00:01:07,650
然后我们就能看到 tokenization 操作
then we can see that the tokenization operation
19
00:01:07,650 --> 00:01:09,277
在英语句子上,
on the English sentence,
20
00:01:09,277 --> 00:01:12,480
“这是一个适合我们 tokenizer 的句子”,
"Here is a sentence adapted to our tokenizer",
21
00:01:12,480 --> 00:01:15,600
产生一个相当令人满意的 token 列表,
produces a rather satisfactory list of tokens,
22
00:01:15,600 --> 00:01:18,510
从某种意义上说,这个包含八个单词的句子
in the sense that this sentence of eight words
23
00:01:18,510 --> 00:01:20,643
被分词成了九个 token 。
is tokenized into nine tokens.
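A minimal sketch of this check, assuming the transformers library; the example sentence is the one quoted above, and the nine-token result is the one reported in the video.

from transformers import AutoTokenizer

# Load the bert-base-uncased tokenizer and tokenize the example sentence.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("Here is a sentence adapted to our tokenizer"))
# Per the video, this eight-word sentence comes out as nine tokens.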
24
00:01:22,920 --> 00:01:26,340
另一方面,如果我使用相同的 tokenizer
On the other hand, if I use this same tokenizer
25
00:01:26,340 --> 00:01:29,370
来处理孟加拉语的一个句子,我们会看到
on a sentence in Bengali, we see that
26
00:01:29,370 --> 00:01:33,690
要么一个词被分成许多子 token ,
either a word is divided into many sub tokens,
27
00:01:33,690 --> 00:01:36,270
要么 tokenizer 不认识其中的某个
or that the tokenizer does not know one of
28
00:01:36,270 --> 00:01:39,873
unicode 字符,只返回一个未知 token 。
the unicode characters and returns only an unknown token.
29
00:01:41,220 --> 00:01:44,970
一个常用词被分成许多 token 的事实
The fact that a common word is split into many tokens
30
00:01:44,970 --> 00:01:47,910
可能会带来问题,因为语言模型
can be problematic, because language models
31
00:01:47,910 --> 00:01:51,903
只能处理有限长度的 token 序列。
can only handle a sequence of tokens of limited length.
32
00:01:52,830 --> 00:01:55,830
过度拆分初始文本的 tokenizer
A tokenizer that excessively splits your initial text
33
00:01:55,830 --> 00:01:58,503
甚至可能影响模型的性能。
may even impact the performance of your model.
34
00:01:59,760 --> 00:02:02,280
未知的 token 也有问题,
Unknown tokens are also problematic,
35
00:02:02,280 --> 00:02:04,530
因为模型将无法提取
because the model will not be able to extract
36
00:02:04,530 --> 00:02:07,563
来自文本未知部分的任何信息。
any information from the unknown part of the text.
37
00:02:11,430 --> 00:02:13,440
在另一个例子中,我们可以看到
In this other example, we can see that
38
00:02:13,440 --> 00:02:17,100
tokenizer 会把那些包含重音字符
the tokenizer replaces words containing characters
39
00:02:17,100 --> 00:02:20,973
和大写字母的单词替换为未知 token 。
with accents and capital letters with unknown tokens.
40
00:02:22,050 --> 00:02:24,770
最后,如果我们再次使用这个 tokenizer
Finally, if we use again this tokenizer
41
00:02:24,770 --> 00:02:28,170
来对医学词汇进行分词,我们再次看到
to tokenize medical vocabulary, we see again that
42
00:02:28,170 --> 00:02:31,800
一个单词被分成许多子 token ,
a single word is divided into many sub tokens,
43
00:02:31,800 --> 00:02:34,803
四个用于 "paracetamol" ,四个用于 "pharyngitis" 。
four for paracetamol, and four for pharyngitis.
44
00:02:37,110 --> 00:02:39,360
目前最先进的语言模型所使用的
Most of the tokenizers used by the current
45
00:02:39,360 --> 00:02:42,540
大多数 tokenizer ,都需要在
state of the art language models need to be trained
46
00:02:42,540 --> 00:02:45,360
一个语料库上进行训练,而这个语料库要类似于
on a corpus that is similar to the one used
47
00:02:45,360 --> 00:02:47,463
用来预训练语言模型的那个语料库。
to pre-train the language model.
48
00:02:49,140 --> 00:02:51,150
这种训练包括学习规则
This training consists in learning rules
49
00:02:51,150 --> 00:02:53,250
来将文本切分成 token 。
to divide the text into tokens.
50
00:02:53,250 --> 00:02:56,160
而学习和使用这些规则的方式
And the way to learn these rules and use them
51
00:02:56,160 --> 00:02:58,233
取决于所选的 tokenizer 模型。
depends on the chosen tokenizer model.
52
00:03:00,630 --> 00:03:04,590
因此,要训练一个新的 tokenizer ,首先需要
Thus, to train a new tokenizer, it is first necessary
53
00:03:04,590 --> 00:03:07,653
构建由原始文本组成的训练语料库。
to build a training corpus composed of raw texts.
54
00:03:08,910 --> 00:03:12,423
然后,你必须为你的 tokenizer 选择一种结构。
Then, you have to choose an architecture for your tokenizer.
55
00:03:13,410 --> 00:03:14,763
这里有两个选项。
Here there are two options.
56
00:03:15,900 --> 00:03:19,710
最简单的方法是直接重用
The simplest is to reuse the same architecture as the one
57
00:03:19,710 --> 00:03:22,863
另一个已经训练过的模型的 tokenizer 所使用的结构。
of a tokenizer used by another model already trained.
58
00:03:24,210 --> 00:03:25,980
否则也可以
Otherwise it is also possible
59
00:03:25,980 --> 00:03:28,560
完全从头设计你自己的 tokenizer 。
to completely design your tokenizer.
60
00:03:28,560 --> 00:03:31,683
但这需要更多的经验和细心。
But it requires more experience and attention.
61
00:03:33,750 --> 00:03:36,660
一旦选择了结构,你就可以
Once the architecture is chosen, you can thus
62
00:03:36,660 --> 00:03:39,513
在你构建的语料库上训练这个 tokenizer 。
train this tokenizer on your constituted corpus.
63
00:03:40,650 --> 00:03:43,440
最后,你需要做的最后一件事就是保存
Finally, the last thing that you need to do is to save
64
00:03:43,440 --> 00:03:46,443
学习到的规则,以便之后能够使用这个 tokenizer 。
the learned rules to be able to use this tokenizer.
65
00:03:49,530 --> 00:03:51,330
让我们举个例子。
Let's take an example.
66
00:03:51,330 --> 00:03:54,873
假设你想在 Python 代码上训练 GPT-2 模型。
Let's say you want to train a GPT-2 model on Python code.
67
00:03:56,160 --> 00:03:59,640
即使 Python 代码通常是英文的
Even if the Python code is usually in English
68
00:03:59,640 --> 00:04:02,386
这类文本非常特殊,
this type of text is very specific,
69
00:04:02,386 --> 00:04:04,473
值得专门为它训练一个 tokenizer 。
and deserves a tokenizer trained on it.
70
00:04:05,340 --> 00:04:07,980
为了让你相信这一点,我们将在最后看到
To convince you of this, we will see at the end
71
00:04:07,980 --> 00:04:10,023
一个例子产生的差异。
the difference produced on an example.
72
00:04:11,400 --> 00:04:13,747
为此,我们将使用该方法
For that we are going to use the method
73
00:04:13,747 --> 00:04:18,240
“train_new_from_iterator” ,这个库中所有的快速 tokenizer
"train_new_from_iterator" that all the fast tokenizers
74
00:04:18,240 --> 00:04:20,040
都拥有这个方法,因此,
of the library have and thus,
75
00:04:20,040 --> 00:04:22,503
特别是 GPT2TokenizerFast。
in particular GPT2TokenizerFast.
76
00:04:23,880 --> 00:04:26,100
这是我们案例中最简单的方法
This is the simplest method in our case
77
00:04:26,100 --> 00:04:28,983
来获得一个适合 Python 代码的 tokenizer 。
to have a tokenizer adapted to Python code.
78
00:04:30,180 --> 00:04:34,140
请记住,第一件事是收集训练语料库。
Remember, the first thing is to gather a training corpus.
79
00:04:34,140 --> 00:04:37,320
我们将使用 CodeSearchNet 数据集的一个子部分
We will use a subpart of the CodeSearchNet dataset
80
00:04:37,320 --> 00:04:39,360
仅包含 Python 函数
containing only Python functions
81
00:04:39,360 --> 00:04:42,360
来自 Github 上的开源库。
from open source libraries on Github.
82
00:04:42,360 --> 00:04:43,650
这是个好时机。
It's good timing.
83
00:04:43,650 --> 00:04:46,980
这个数据集可以在 datasets 库中直接找到,
This dataset is known by the datasets library
84
00:04:46,980 --> 00:04:49,203
我们可以用两行代码加载它。
and we can load it in two lines of code.
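For reference, those two lines could look like the following minimal sketch; the dataset name "code_search_net" and its "python" configuration are assumptions about how the corpus is published on the Hub.

from datasets import load_dataset

# A minimal sketch (names assumed): load only the Python portion of CodeSearchNet.
raw_datasets = load_dataset("code_search_net", "python")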
85
00:04:50,760 --> 00:04:55,230
然后,由于 “train_new_from_iterator” 方法需要
Then, as the "train_new_from_iterator" method expects
86
00:04:55,230 --> 00:04:57,150
一个文本列表的迭代器,
an iterator of lists of texts,
87
00:04:57,150 --> 00:04:59,970
我们创建 “get_training_corpus” 函数,
we create the "get_training_corpus" function,
88
00:04:59,970 --> 00:05:01,743
这将返回一个迭代器。
which will return an iterator.
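A minimal sketch of such a function, assuming the raw_datasets object from the sketch above and a "whole_func_string" column holding the source of each Python function (the column name is an assumption about the CodeSearchNet schema).

def get_training_corpus():
    # Yield batches of 1,000 function bodies at a time so the whole
    # corpus never has to be materialized as one giant list.
    dataset = raw_datasets["train"]
    for start in range(0, len(dataset), 1000):
        yield dataset[start : start + 1000]["whole_func_string"]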
89
00:05:03,870 --> 00:05:05,430
现在我们有了迭代器
Now that we have our iterator
90
00:05:05,430 --> 00:05:09,630
来遍历我们的 Python 函数语料库,我们就可以加载
on our Python functions corpus, we can load
91
00:05:09,630 --> 00:05:12,351
GPT-2 分词器结构。
the GPT-2 tokenizer architecture.
92
00:05:12,351 --> 00:05:16,560
这里 old_tokenizer 不适合我们的语料库。
Here old_tokenizer is not adapted to our corpus.
93
00:05:16,560 --> 00:05:17,700
但我们只需要
But we only need
94
00:05:17,700 --> 00:05:20,733
再用一行在我们的新语料库上训练它。
one more line to train it on our new corpus.
95
00:05:21,780 --> 00:05:24,720
目前使用的大多数分词
An argument that is common to most of the tokenization
96
00:05:24,720 --> 00:05:28,980
算法都有一个共同的参数,那就是词汇表的大小。
algorithms used at the moment is the size of the vocabulary.
97
00:05:28,980 --> 00:05:31,773
我们在这里选择值 52,000。
We choose here the value 52,000.
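Putting the last two steps together, a minimal sketch could look like this; loading through AutoTokenizer (which returns the fast GPT-2 tokenizer) is an assumption, and 52,000 is the vocabulary size chosen above.

from transformers import AutoTokenizer

# Reuse the GPT-2 tokenizer architecture, then retrain it on our corpus
# with a vocabulary of 52,000 tokens.
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")
new_tokenizer = old_tokenizer.train_new_from_iterator(
    get_training_corpus(), vocab_size=52000
)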
98
00:05:32,820 --> 00:05:35,760
最后,一旦训练结束,
Finally, once the training is finished,
99
00:05:35,760 --> 00:05:38,850
我们只需要在本地保存我们的新 tokenizer ,
we just have to save our new tokenizer locally,
100
00:05:38,850 --> 00:05:41,730
或将其上传到 hub ,以便之后
or send it to the hub to be able to reuse it
101
00:05:41,730 --> 00:05:43,593
能够非常方便地重用它。
very easily afterwards.
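As a sketch, saving locally and sharing on the Hub could look like this; the repository name "code-search-net-tokenizer" is only an example.

# Save the learned rules locally...
new_tokenizer.save_pretrained("code-search-net-tokenizer")
# ...or push them to the Hub (requires being logged in, e.g. via huggingface-cli login).
new_tokenizer.push_to_hub("code-search-net-tokenizer")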
102
00:05:45,270 --> 00:05:48,990
最后,让我们通过一个例子一起看看,
Finally, let's see together on an example if it was useful
103
00:05:48,990 --> 00:05:53,073
重新训练一个类似 GPT-2 的 tokenizer 是否有用。
to re-train a tokenizer similar to the GPT-2 one.
104
00:05:55,110 --> 00:05:57,660
使用 GPT-2 的原始 tokenizer
With the original tokenizer of GPT-2
105
00:05:57,660 --> 00:06:00,330
我们看到所有的空格都被单独切分出来,
we see that all spaces are isolated,
106
00:06:00,330 --> 00:06:01,920
而方法名 "randn" ,
and the method name randn,
107
00:06:01,920 --> 00:06:04,833
虽然在 Python 代码中相当常见,却被分成了两部分。
relatively common in Python code, is split in two.
108
00:06:05,730 --> 00:06:09,060
使用我们新的分词器,单缩进和双缩进
With our new tokenizer, single and double indentations
109
00:06:09,060 --> 00:06:10,890
都已经被学习到,而方法 "randn"
have been learned and the method randn
110
00:06:10,890 --> 00:06:13,770
被分词化为一个 token 。
is tokenized into one token.
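To try this comparison yourself, a minimal sketch along the lines of the video could be the following; the exact example line is an assumption, and old_tokenizer / new_tokenizer are the objects from the sketches above.

example = "x = np.random.randn(10, 10)"
# Per the video, the original GPT-2 tokenizer isolates every space and splits
# "randn" in two, while the retrained tokenizer keeps "randn" as a single token.
print(old_tokenizer.tokenize(example))
print(new_tokenizer.tokenize(example))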
111
00:06:13,770 --> 00:06:15,000
有了这个,
And with that,
112
00:06:15,000 --> 00:06:18,123
你现在知道如何训练你自己的分词器了。
you now know how to train your very own tokenizers.
113
00:06:19,498 --> 00:06:22,165
(空气呼啸)
(air whooshing)