1
00:00:00,125 --> 00:00:05,125
(空气呼啸)
(air whooshing)
2
00:00:05,190 --> 00:00:06,720
- 你来对地方了
- You are at the right place
3
00:00:06,720 --> 00:00:10,464
如果你想了解字节对编码(BPE)
if you want to understand what the Byte Pair Encoding
4
00:00:10,464 --> 00:00:13,263
子词分词化算法是什么,
subword tokenization algorithm is,
5
00:00:14,160 --> 00:00:15,505
如何训练它
how to train it
6
00:00:15,505 --> 00:00:17,790
以及文本的分词化
and how the tokenization of a text is done
7
00:00:17,790 --> 00:00:19,107
是如何用这个算法完成的。
with this algorithm.
8
00:00:21,417 --> 00:00:22,920
BPE 算法
The BPE algorithm
9
00:00:22,920 --> 00:00:26,820
最初被提出作为文本压缩算法
was initially proposed as a text compression algorithm
10
00:00:26,820 --> 00:00:28,770
但它也非常适合
but it is also very well suited
11
00:00:28,770 --> 00:00:31,143
作为你的语言模型的分词器。
as a tokenizer for your language models.
12
00:00:32,910 --> 00:00:34,890
BPE 的思想是将单词划分
The idea of BPE is to divide words
13
00:00:34,890 --> 00:00:36,933
为一系列“子词单元”
into a sequence of 'subword units'
14
00:00:38,100 --> 00:00:41,970
即在参考语料库中频繁出现的单元
which are units that appear frequently in a reference corpus
15
00:00:41,970 --> 00:00:44,613
也就是我们用来训练它的语料库。
that is, the corpus we used to train it.
16
00:00:46,701 --> 00:00:49,083
BPE 分词器是如何训练的?
How is a BPE tokenizer trained?
17
00:00:50,100 --> 00:00:53,340
首先,我们必须得到一个文本语料库。
First of all, we have to get a corpus of texts.
18
00:00:53,340 --> 00:00:56,940
我们不会在这个原始文本上训练我们的分词器
We will not train our tokenizer on this raw text
19
00:00:56,940 --> 00:00:59,490
而是会先将其规范化
but we will first normalize it
20
00:00:59,490 --> 00:01:00,873
然后对其进行预分词。
then pre-tokenize it.
21
00:01:01,890 --> 00:01:03,240
由于预分词
As the pre-tokenization
22
00:01:03,240 --> 00:01:05,790
将文本分成单词列表,
divides the text into a list of words,
23
00:01:05,790 --> 00:01:08,400
我们可以用另一种方式表示我们的语料库
we can represent our corpus in another way
24
00:01:08,400 --> 00:01:10,350
通过收集相同的词
by gathering together the same words
25
00:01:10,350 --> 00:01:12,450
并维护一个计数器,
and by maintaining a counter,
26
00:01:12,450 --> 00:01:14,223
这里用蓝色表示。
here represented in blue.
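As an aside, here is a minimal Python sketch of this word-counting step; the word list below is made up for illustration, and the real normalization and pre-tokenization depend on the tokenizer being used, so this is only an approximation.

from collections import Counter

# Hypothetical result of normalizing and pre-tokenizing a tiny corpus.
pre_tokenized_words = ["hugging", "face", "hugging", "hug", "hugger"]

# Gather identical words together and keep a count for each one
# (the counts are what the video shows in blue).
word_freqs = Counter(pre_tokenized_words)
print(word_freqs)  # e.g. Counter({'hugging': 2, 'face': 1, 'hug': 1, 'hugger': 1})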
27
00:01:17,340 --> 00:01:19,860
要了解训练的工作原理,
To understand how the training works,
28
00:01:19,860 --> 00:01:23,730
我们来考虑这个由以下单词组成的小语料库:
we consider this toy corpus composed of the following words:
29
00:01:23,730 --> 00:01:28,203
huggingface, hugging, hug, hugger, 等
huggingface, hugging, hug, hugger, etc.
30
00:01:29,100 --> 00:01:32,640
BPE 是一种从初始词汇表开始的算法
BPE is an algorithm that starts with an initial vocabulary
31
00:01:32,640 --> 00:01:35,583
然后将其增加到所需的大小。
and then increases it to the desired size.
32
00:01:36,450 --> 00:01:38,460
为了建立初始词汇表,
To build the initial vocabulary,
33
00:01:38,460 --> 00:01:41,550
我们先将语料库中的每个单词拆分
we start by separating each word of the corpus
34
00:01:41,550 --> 00:01:44,253
为组成它们的基本单元的列表,
into a list of elementary units that compose them,
35
00:01:45,210 --> 00:01:47,013
在这里,就是字符。
here, the characters.
36
00:01:50,850 --> 00:01:54,310
我们在词汇表中列出所有出现的字符
We list in our vocabulary all the characters that appear
37
00:01:55,218 --> 00:01:58,053
这将构成我们的初始词汇表。
and that will constitute our initial vocabulary.
38
00:02:00,420 --> 00:02:02,523
现在让我们看看如何增加它。
Let's now see how to increase it.
39
00:02:05,520 --> 00:02:08,250
我们回到我们拆分的语料库,
We return to our split corpus,
40
00:02:08,250 --> 00:02:11,340
我们将逐个遍历这些单词
we will go through the words one by one
41
00:02:11,340 --> 00:02:14,313
并计算 token 对的所有出现次数。
and count all the occurrences of token pairs.
42
00:02:15,450 --> 00:02:18,397
第一对由标记 “h” 和 “u” 组成,
The first pair is composed of the token 'h' and 'u',
43
00:02:20,130 --> 00:02:23,067
第二对是 “u” 和 “g”,
the second 'u' and 'g',
44
00:02:23,067 --> 00:02:26,253
我们继续这样,直到我们有完整的列表。
and we continue like that until we have the complete list.
45
00:02:35,580 --> 00:02:37,724
一旦我们知道所有的对
Once we know all the pairs
46
00:02:37,724 --> 00:02:40,140
以及它们出现的频率,
and their frequency of appearance,
47
00:02:40,140 --> 00:02:42,940
我们将选择出现频率最高的那个。
we will choose the one that appears the most frequently.
48
00:02:44,220 --> 00:02:47,697
这是由字母 “l” 和 “e” 组成的一对。
Here it is the pair composed of the letters 'l' and 'e'.
49
00:02:51,930 --> 00:02:53,590
我们记下我们的第一条合并规则
We note our first merging rule
50
00:02:54,593 --> 00:02:57,243
然后我们将新 token 添加到我们的词汇表中。
and we add the new token to our vocabulary.
51
00:03:00,330 --> 00:03:04,260
然后我们可以将此合并规则应用于我们的拆分。
We can then apply this merging rule to our splits.
52
00:03:04,260 --> 00:03:07,350
你可以看到我们已经合并了所有的 token 对
You can see that we have merged all the pairs of tokens
53
00:03:07,350 --> 00:03:09,793
由标记 “l” 和 “e” 组成。
composed of the tokens 'l' and 'e'.
54
00:03:14,008 --> 00:03:18,150
现在,我们只需要重复相同的步骤
And now, we just have to reproduce the same steps
55
00:03:18,150 --> 00:03:19,353
只不过这次用我们的新拆分。
with our new splits.
56
00:03:21,750 --> 00:03:23,460
我们统计每一对 token
We calculate the frequency of occurrence
57
00:03:23,460 --> 00:03:25,023
的出现频率,
of each pair of tokens,
58
00:03:27,990 --> 00:03:30,603
我们选择频率最高的一对,
we select the pair with the highest frequency,
59
00:03:32,190 --> 00:03:34,083
我们把它记在我们的合并规则中,
we note it in our merge rules,
60
00:03:36,000 --> 00:03:39,360
我们将新的 token 添加到词汇表中
we add the new token to the vocabulary
61
00:03:39,360 --> 00:03:41,880
然后我们合并所有的 token 对
and then we merge all the pairs of tokens
62
00:03:41,880 --> 00:03:46,503
也就是在我们的拆分中由 “le” 和 “a” 组成的那些。
composed of the tokens 'le' and 'a' into our splits.
63
00:03:50,323 --> 00:03:51,960
我们可以重复这个操作
And we can repeat this operation
64
00:03:51,960 --> 00:03:54,843
直到我们达到所需的词汇量。
until we reach the desired vocabulary size.
65
00:04:05,671 --> 00:04:10,671
在这里,当我们的词汇量达到 21 个 token 时,我们就停止了。
Here, we stopped when our vocabulary reached 21 tokens.
66
00:04:11,040 --> 00:04:13,920
我们现在可以看到,比起训练开始时
We can see now that the words of our corpus
67
00:04:13,920 --> 00:04:17,040
我们的语料库中的单词
are now divided into far fewer tokens
68
00:04:17,040 --> 00:04:20,280
被分成了更少的 token
than at the beginning of the training.
69
00:04:20,280 --> 00:04:21,720
而我们的算法
And that our algorithm
70
00:04:21,720 --> 00:04:24,990
学到了词根 “hug” 和 “learn”
has learned the radicals 'hug' and 'learn'
71
00:04:24,990 --> 00:04:27,537
以及动词结尾 “ing”。
and also the verbal ending 'ing'.
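For reference, here is a minimal Python sketch of the training loop described above. The word frequencies are made up for illustration and ties between equally frequent pairs are broken arbitrarily, so the exact merges may differ from the ones shown in the video.

from collections import Counter

# Made-up word frequencies standing in for the toy corpus
# (word -> count, as produced by normalization and pre-tokenization).
word_freqs = {"huggingface": 4, "hugging": 6, "hug": 8, "hugger": 2,
              "learning": 5, "learner": 3, "learn": 7}

# Initial splits: each word becomes a list of characters,
# and the initial vocabulary is the set of characters that appear.
splits = {word: list(word) for word in word_freqs}
vocab = sorted({ch for word in word_freqs for ch in word})
merge_rules = []

def apply_merge(tokens, a, b):
    # Replace every adjacent pair (a, b) in a split with the merged token a+b.
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            merged.append(a + b)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

target_vocab_size = 21
while len(vocab) < target_vocab_size:
    # Count every pair of adjacent tokens, weighted by word frequency.
    pair_counts = Counter()
    for word, freq in word_freqs.items():
        for a, b in zip(splits[word], splits[word][1:]):
            pair_counts[(a, b)] += freq
    if not pair_counts:
        break
    # Keep the most frequent pair as a new merge rule and a new vocabulary token,
    # then apply the rule to every split before counting again.
    (a, b), _ = pair_counts.most_common(1)[0]
    merge_rules.append((a, b))
    vocab.append(a + b)
    splits = {word: apply_merge(tokens, a, b) for word, tokens in splits.items()}

print(merge_rules)
print(vocab)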
72
00:04:29,880 --> 00:04:32,160
现在我们已经学到了我们的词汇表
Now that we have learned our vocabulary
73
00:04:32,160 --> 00:04:35,943
和合并规则,就可以对新文本进行分词了。
and merging rules, we can tokenize new texts.
74
00:04:37,980 --> 00:04:39,210
例如,
For example,
75
00:04:39,210 --> 00:04:41,160
如果我们想对 “hugs” 这个词进行分词,
if we want to tokenize the word 'hugs',
76
00:04:42,960 --> 00:04:46,680
首先我们将它分成基本单元
first we'll divide it into elementary units
77
00:04:46,680 --> 00:04:48,843
这样它就变成了一个字符序列。
so it becomes a sequence of characters.
78
00:04:50,040 --> 00:04:52,020
然后,我们将遍历我们的合并规则
Then, we'll go through our merge rules
79
00:04:52,020 --> 00:04:54,690
直到找到一条可以应用的规则。
until we have one we can apply.
80
00:04:54,690 --> 00:04:57,930
在这里,我们可以合并字母 “h” 和 “u”。
Here, we can merge the letters 'h' and 'u'.
81
00:04:57,930 --> 00:05:01,467
在这里,我们可以合并 2 个 token 以获得新 token “hug”。
And here, we can merge 2 tokens to get the new token 'hug'.
82
00:05:02,400 --> 00:05:05,760
当我们到达合并规则的末尾时,
When we get to the end of our merge rules,
83
00:05:05,760 --> 00:05:07,563
分词化完成。
the tokenization is finished.
84
00:05:10,650 --> 00:05:11,727
就是这样。
And that's it.
85
00:05:12,846 --> 00:05:14,850
我希望现在 BPE 算法
I hope that now the BPE algorithm
86
00:05:14,850 --> 00:05:16,413
对你而言不再有任何秘密!
has no more secrets for you!
87
00:05:17,739 --> 00:05:20,406
(空气呼啸)
(air whooshing)
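Lastly, a minimal sketch of the tokenization step described above: the learned merge rules are applied in order to a new word. The two rules passed in the example are the ones mentioned for 'hugs' in the video; a real trained tokenizer would apply its full list of merge rules.

def bpe_tokenize(word, merge_rules):
    # Start from elementary units: the characters of the word.
    tokens = list(word)
    # Apply every merge rule, in the order the rules were learned.
    for a, b in merge_rules:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

print(bpe_tokenize("hugs", [("h", "u"), ("hu", "g")]))  # ['hug', 's']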