subtitles/zh-CN/53_unigram-tokenization.srt
1
00:00:00,000 --> 00:00:02,667
(空气呼啸)
(air whooshing)
2
00:00:05,310 --> 00:00:06,420
- 在这个视频中,
- In this video,
3
00:00:06,420 --> 00:00:09,881
我们将一起研究 “Unigram 语言模型
we will study together 'the Unigram Language Model
4
00:00:09,881 --> 00:00:13,288
子词分词化算法”。
subword tokenization algorithm'.
5
00:00:13,288 --> 00:00:15,567
Unigram 语言模型分词器的
The overall training strategy
6
00:00:15,567 --> 00:00:18,450
整体训练策略
of a Unigram Language Model tokenizer
7
00:00:18,450 --> 00:00:21,480
是从一个非常大的词汇量开始
is to start with a very large vocabulary
8
00:00:21,480 --> 00:00:24,240
然后在每次迭代中删除 token
and then to remove tokens at each iteration
9
00:00:24,240 --> 00:00:27,300
直到我们达到所需的大小。
until we reach the desired size.
10
00:00:27,300 --> 00:00:28,530
在每次迭代中,
At each iteration,
11
00:00:28,530 --> 00:00:30,930
我们将计算训练语料库上的损失,
we will calculate a loss on our training corpus
12
00:00:30,930 --> 00:00:33,480
这要借助 Unigram 模型来完成。
thanks to the Unigram model.
13
00:00:33,480 --> 00:00:37,470
由于损失计算取决于可用的词汇表,
As the loss calculation depends on the available vocabulary,
14
00:00:37,470 --> 00:00:40,563
我们可以用它来选择如何减少词汇量。
we can use it to choose how to reduce the vocabulary.
15
00:00:41,550 --> 00:00:43,620
所以我们看看损失的变化
So we look at the evolution of the loss
16
00:00:43,620 --> 00:00:47,103
方法是依次从词汇表中删除每个 token 。
by removing in turn each token from the vocabulary.
17
00:00:48,000 --> 00:00:50,430
我们将选择删除那些
We will choose to remove the p-percent of tokens
18
00:00:50,430 --> 00:00:52,200
使损失增加最少的 p% 的 token 。
that increase the loss the least.
19
00:00:56,310 --> 00:00:57,540
在进一步深入
Before going further
20
00:00:57,540 --> 00:01:00,240
训练算法的解释之前,
in the explanation of the training algorithm,
21
00:01:00,240 --> 00:01:02,973
我需要解释什么是 Unigram 模型。
I need to explain what a Unigram model is.
22
00:01:04,183 --> 00:01:06,030
Unigram 语言模型
The Unigram Language Model
23
00:01:06,030 --> 00:01:08,493
是一种统计语言模型。
is a type of Statistical Language Model.
24
00:01:09,450 --> 00:01:10,980
统计语言模型
A Statistical Language Model
25
00:01:10,980 --> 00:01:13,530
将为文本分配概率
will assign a probability to a text
26
00:01:13,530 --> 00:01:18,090
考虑到文本实际上是一系列 token 。
considering that the text is in fact a sequence of tokens.
27
00:01:18,090 --> 00:01:21,090
可以想象的最简单的 token 序列
The simplest sequences of tokens to imagine
28
00:01:21,090 --> 00:01:24,753
是组成句子的单词,或者是字符。
are the words that compose the sentence or the characters.
29
00:01:26,130 --> 00:01:28,890
Unigram 语言模型的特殊之处
The particularity of the Unigram Language Model
30
00:01:28,890 --> 00:01:32,010
是它假设每个词的出现
is that it assumes that the occurrence of each word
31
00:01:32,010 --> 00:01:34,533
独立于它的前一个词。
is independent of its previous word.
32
00:01:35,400 --> 00:01:37,620
这个假设让我们可以写出:
This assumption allows us to write
33
00:01:37,620 --> 00:01:39,570
一个文本的概率
that the probability of a text
34
00:01:39,570 --> 00:01:42,210
等于其组成 token 的
is equal to the product of the probabilities
35
00:01:42,210 --> 00:01:43,953
概率的乘积。
of the tokens that compose it.
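To make the independence assumption concrete, here is a minimal Python sketch with made-up token probabilities (the values are illustrative, not taken from the video):

# Hypothetical token probabilities, for illustration only.
token_probs = {"h": 0.05, "u": 0.2, "g": 0.2, "hu": 0.05, "ug": 0.2}

def text_probability(tokens, probs):
    # Under the Unigram assumption the tokens are independent,
    # so the probability of the text is a plain product.
    result = 1.0
    for token in tokens:
        result *= probs[token]
    return result

print(text_probability(["h", "u", "g"], token_probs))  # 0.05 * 0.2 * 0.2 = 0.002
print(text_probability(["h", "ug"], token_probs))      # 0.05 * 0.2 = 0.01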
36
00:01:45,840 --> 00:01:50,220
这里需要注意的是,它是一个非常简单的模型
It should be noted here that it is a very simple model
37
00:01:50,220 --> 00:01:53,850
它并不适合用于文本生成,
which would not be adapted to the generation of text
38
00:01:53,850 --> 00:01:57,840
因为这个模型总是会生成相同的 token ,
since this model would always generate the same token,
39
00:01:57,840 --> 00:02:00,453
概率最大的那个。
the one which has the greatest probability.
40
00:02:01,320 --> 00:02:03,360
然而,要进行分词化,
Nevertheless, to do tokenization,
41
00:02:03,360 --> 00:02:05,790
这个模型对我们很有用
this model is very useful to us
42
00:02:05,790 --> 00:02:07,440
因为它可以用来
because it can be used
43
00:02:07,440 --> 00:02:10,893
估计不同短语的相对可能性。
to estimate the relative likelihood of different phrases.
44
00:02:14,100 --> 00:02:15,000
我们现在准备好了
We are now ready
45
00:02:15,000 --> 00:02:19,830
回到我们对训练算法的解释。
to return to our explanation of the training algorithm.
46
00:02:19,830 --> 00:02:21,690
假设我们有一个训练语料库,
Let's say that we have a training corpus
47
00:02:21,690 --> 00:02:23,880
其中 hug 这个词出现 10 次,
with 10 times the word hug,
48
00:02:23,880 --> 00:02:25,410
pug 这个词出现 12 次,
12 times the word pug,
49
00:02:25,410 --> 00:02:27,330
lug 这个词出现 5 次,
5 times the word lug,
50
00:02:27,330 --> 00:02:28,560
bug 出现 4 次,
4 times bug
51
00:02:28,560 --> 00:02:29,943
dug 出现 5 次。
and 5 times dug.
52
00:02:33,120 --> 00:02:34,560
如前所述,
As said earlier,
53
00:02:34,560 --> 00:02:37,473
训练从一个很大的词汇表开始。
the training starts with a big vocabulary.
54
00:02:38,460 --> 00:02:41,400
显然,由于我们使用的是玩具语料库,
Obviously, as we are using a toy corpus,
55
00:02:41,400 --> 00:02:44,430
这个词汇量不会那么大
this vocabulary will not be that big
56
00:02:44,430 --> 00:02:46,773
但它应该能展示其中的原理。
but it should show you the principle.
57
00:02:47,610 --> 00:02:51,870
第一种方法是列出所有可能的严格子串
A first method is to list all the possible strict substrings
58
00:02:51,870 --> 00:02:53,823
这就是我们在这里要做的。
and that's what we'll do here.
59
00:02:54,780 --> 00:02:58,170
我们也可以使用 BPE 算法
We could also have used the BPE algorithm
60
00:02:58,170 --> 00:03:00,010
并设定非常大的词汇量,
with a very large vocabulary size
61
00:03:01,410 --> 00:03:05,103
但就目前而言,严格子串就足够了。
but for now, the strict substrings are enough.
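As a sketch, here is one way to build that initial vocabulary in Python, reading "strict substring" as any substring of a word except the word itself (my interpretation of the toy setup):

# Toy corpus from the video: word -> frequency.
corpus = {"hug": 10, "pug": 12, "lug": 5, "bug": 4, "dug": 5}

def initial_vocabulary(corpus):
    vocab = set()
    for word in corpus:
        n = len(word)
        for start in range(n):
            for end in range(start + 1, n + 1):
                substring = word[start:end]
                if substring != word:  # "strict": exclude the full word
                    vocab.add(substring)
    return vocab

print(sorted(initial_vocabulary(corpus)))
# ['b', 'bu', 'd', 'du', 'g', 'h', 'hu', 'l', 'lu', 'p', 'pu', 'u', 'ug']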
62
00:03:06,990 --> 00:03:09,120
Unigram 分词器的训练
The training of the Unigram tokenizer
63
00:03:09,120 --> 00:03:12,093
基于期望最大化方法。
is based on the Expectation-Maximization method.
64
00:03:13,320 --> 00:03:15,120
在每次迭代中,
At each iteration,
65
00:03:15,120 --> 00:03:17,430
我们会估计词汇表中
we estimate the probabilities of the tokens
66
00:03:17,430 --> 00:03:18,430
各个 token 的概率,
of the vocabulary
67
00:03:20,130 --> 00:03:23,100
然后我们删除 p% 的 token ,
and then we remove the p-percent of tokens
68
00:03:23,100 --> 00:03:26,070
即那些使语料库损失最小化、
that minimize the loss on the corpus
69
00:03:26,070 --> 00:03:28,900
且不属于基本字符的 token ,
and which do not belong to the basic characters,
70
00:03:29,880 --> 00:03:33,150
因为我们希望在最终词汇表中保留
as we want to keep in our final vocabulary
71
00:03:33,150 --> 00:03:36,693
基本字符,以便能对任何单词进行分词。
the basic characters to be able to tokenize any word.
72
00:03:37,770 --> 00:03:39,641
让我们开始吧!
Let's go for it!
73
00:03:39,641 --> 00:03:42,360
一个 token 的概率可以简单地估计为
The probability of a token is simply estimated
74
00:03:42,360 --> 00:03:44,760
该 token 在我们的训练语料库中
by the number of appearances of this token
75
00:03:44,760 --> 00:03:46,440
出现的次数,
in our training corpus
76
00:03:46,440 --> 00:03:50,133
除以所有 token 出现次数的总和。
divided by the total number of appearances of all the tokens.
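A sketch of that estimate, assuming (my reading) that a token's count is the number of times it appears as a strict substring of a corpus word, weighted by the word's frequency:

from collections import Counter

corpus = {"hug": 10, "pug": 12, "lug": 5, "bug": 4, "dug": 5}

def token_probabilities(corpus):
    counts = Counter()
    for word, freq in corpus.items():
        n = len(word)
        for start in range(n):
            for end in range(start + 1, n + 1):
                substring = word[start:end]
                if substring != word:  # same strict-substring vocabulary as before
                    counts[substring] += freq
    total = sum(counts.values())  # 180 occurrences in this toy corpus
    return {token: count / total for token, count in counts.items()}

token_probs = token_probabilities(corpus)
print(token_probs["ug"])  # 36 / 180 = 0.2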
77
00:03:51,510 --> 00:03:54,390
我们可以使用这个词汇表对单词进行分词,
We could use this vocabulary to tokenize our words
78
00:03:54,390 --> 00:03:56,283
根据 Unigram 模型。
according to the Unigram model.
79
00:03:57,150 --> 00:04:00,892
我们将一起做一遍,以便理解两件事:
We will do it together to understand two things:
80
00:04:00,892 --> 00:04:04,110
我们如何使用 Unigram 模型对单词进行分词
how we tokenize a word with a Unigram model
81
00:04:04,110 --> 00:04:07,803
以及如何在我们的语料库上计算损失。
and how the loss is calculated on our corpus.
82
00:04:09,088 --> 00:04:12,263
我们的文本 “Hug” 的 Unigram LM 分词化
The Unigram LM tokenization of our text 'Hug'
83
00:04:12,263 --> 00:04:15,270
将是出现概率最高的那一种,
will be the one with the highest probability of occurrence
84
00:04:15,270 --> 00:04:17,403
根据我们的 Unigram 模型。
according to our Unigram model.
85
00:04:19,080 --> 00:04:21,750
要找到它,最简单的方法
To find it, the simplest way to proceed
86
00:04:21,750 --> 00:04:24,120
是列出我们的文本 “Hug”
would be to list all the possible segmentations
87
00:04:24,120 --> 00:04:25,800
所有可能的切分方式,
of our text 'Hug',
88
00:04:25,800 --> 00:04:29,340
计算每一种切分的概率,
calculate the probability of each of these segmentations
89
00:04:29,340 --> 00:04:32,043
然后选择概率最高的那个。
and then choose the one with the highest probability.
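Here is a minimal sketch of that exhaustive approach, using the token_probs dictionary from the previous sketch:

def segmentations(word, vocab):
    # Recursively enumerate every split of `word` into vocabulary tokens.
    if not word:
        return [[]]
    results = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        if prefix in vocab:
            for rest in segmentations(word[i:], vocab):
                results.append([prefix] + rest)
    return results

def best_tokenization(word, token_probs):
    best, best_prob = None, 0.0
    for candidate in segmentations(word, token_probs):
        prob = 1.0
        for token in candidate:
            prob *= token_probs[token]
        if prob > best_prob:
            best, best_prob = candidate, prob
    return best, best_prob

print(best_tokenization("hug", token_probs))
# (['h', 'ug'], 0.0111...) -- tied with ['hu', 'g'], as the video notes next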
90
00:04:33,210 --> 00:04:34,920
以当前的词汇表,
With the current vocabulary,
91
00:04:34,920 --> 00:04:38,640
有两种分词方式的概率完全相同。
two tokenizations get exactly the same probability.
92
00:04:38,640 --> 00:04:40,080
所以我们选择其中之一
So we choose one of them
93
00:04:40,080 --> 00:04:42,603
并记住相关的概率。
and keep in memory the associated probability.
94
00:04:43,710 --> 00:04:46,380
为了计算我们训练语料库的损失,
To compute the loss on our training corpus,
95
00:04:46,380 --> 00:04:48,570
我们需要像刚才那样,
we need to tokenize as we just did
96
00:04:48,570 --> 00:04:50,673
对语料库中所有剩下的单词进行分词。
all the remaining words in the corpus.
97
00:04:52,290 --> 00:04:56,430
损失就是对语料库中所有单词求和:
The loss is then the sum over all the words in the corpus
98
00:04:56,430 --> 00:04:58,920
即每个单词的出现频率
of the frequency of occurrence of the word
99
00:04:58,920 --> 00:05:02,670
乘以与该单词的分词结果相关联的
multiplied by the opposite of the log of the probability
100
00:05:02,670 --> 00:05:05,463
概率的对数的相反数。
associated with the tokenization of the word.
101
00:05:07,620 --> 00:05:10,803
我们在这里得到了 170 的损失。
We obtain here a loss of 170.
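As a sketch, reusing corpus, token_probs and best_tokenization from the sketches above, and assuming the natural logarithm (which reproduces the video's figure):

import math

def corpus_loss(corpus, token_probs):
    loss = 0.0
    for word, freq in corpus.items():
        _, prob = best_tokenization(word, token_probs)
        # Frequency of the word times the negative log of its best probability.
        loss += freq * -math.log(prob)
    return loss

print(round(corpus_loss(corpus, token_probs), 1))  # 170.4, the "170" quoted above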
102
00:05:13,830 --> 00:05:18,630
请记住,我们最初的目标是减少词汇量。
Remember, our initial goal was to reduce the vocabulary.
103
00:05:18,630 --> 00:05:21,870
为此,我们将从词汇表中删除一个 token
To do this, we will remove a token from the vocabulary
104
00:05:21,870 --> 00:05:24,213
并计算相关损失。
and calculate the associated loss.
105
00:05:27,630 --> 00:05:30,627
例如,让我们删除 token “ug”。
Let's remove for example, the token 'ug'.
106
00:05:31,920 --> 00:05:35,370
我们注意到,将 “hug” 分词为
We notice that the tokenization for 'hug'
107
00:05:35,370 --> 00:05:39,990
字母 “h” 加 “ug” 的方式现在不可能了。
with the letter 'h' and the tuple 'ug' is now impossible.
108
00:05:39,990 --> 00:05:42,240
尽管如此,正如我们之前看到的
Nevertheless, as we saw earlier
109
00:05:42,240 --> 00:05:45,180
这两种分词方式具有相同的概率,
the two tokenizations had the same probability,
110
00:05:45,180 --> 00:05:47,730
我们仍然可以选择剩下的那种分词方式,
we can still choose the remaining tokenization
111
00:05:47,730 --> 00:05:51,093
其概率为 1.10e-2。
with a probability of 1.10e-2.
112
00:05:52,410 --> 00:05:55,350
词汇表中其他词的分词
The tokenizations of the other words of the vocabulary
113
00:05:55,350 --> 00:05:57,060
也保持不变。
also remain unchanged.
114
00:05:57,060 --> 00:06:00,600
最后,即使我们把 token “ug”
And finally, even if we remove the token 'ug'
115
00:06:00,600 --> 00:06:05,403
从词汇表中删除,损失仍然等于 170。
from our vocabulary, the loss remains equal to 170.
116
00:06:06,630 --> 00:06:08,100
在第一次迭代中,
For this first iteration,
117
00:06:08,100 --> 00:06:10,080
如果我们继续计算,
if we continue the calculation,
118
00:06:10,080 --> 00:06:13,050
我们会注意到我们可以删除任何 token
we would notice that we could remove any token
119
00:06:13,050 --> 00:06:16,110
而不会影响损失。
without it impacting the loss.
120
00:06:16,110 --> 00:06:19,200
因此,我们将随机选择删除 token “ug”,
We will therefore choose at random to remove the token 'ug'
121
00:06:19,200 --> 00:06:21,843
在开始第二次迭代之前。
before starting a second iteration.
122
00:06:24,240 --> 00:06:27,300
所以我们再次估计每个 token 的概率
So we estimate again the probability of each token
123
00:06:27,300 --> 00:06:30,630
然后再计算每个 token 对损失的影响。
before calculating the impact of each token on the loss.
124
00:06:32,160 --> 00:06:33,990
例如,如果我们现在删除
For example, if we remove now
125
00:06:33,990 --> 00:06:36,290
由字母 “h” 和 “u” 组成的 token ,
the token composed of the letters 'h' and 'u',
126
00:06:37,350 --> 00:06:41,013
"hug" 只剩下一种可能的分词化。
there is only one possible tokenization left for "hug".
127
00:06:41,940 --> 00:06:44,700
词汇表中其他词的分词化
The tokenization of the other words of the vocabulary
128
00:06:44,700 --> 00:06:45,633
没有改变。
is not changed.
129
00:06:46,560 --> 00:06:47,393
最终,
In the end,
130
00:06:47,393 --> 00:06:49,200
通过从词汇表中删除
we obtain by removing the token
131
00:06:49,200 --> 00:06:52,749
由字母 “h” 和 “u” 组成的 token ,
composed of the letters 'h' and 'u' from the vocabulary,
132
00:06:52,749 --> 00:06:56,430
我们得到的损失为 168。
a loss of 168.
133
00:06:56,430 --> 00:06:59,490
最后,要选择要删除的 token ,
Finally, to choose which token to remove,
134
00:06:59,490 --> 00:07:02,490
我们将对词汇表中每个剩余的 token
we will, for each remaining token of the vocabulary,
135
00:07:02,490 --> 00:07:04,800
(只要它不是基本 token )
which is not an elementary token,
136
00:07:04,800 --> 00:07:07,380
计算相关损失。
calculate the associated loss.
137
00:07:07,380 --> 00:07:09,843
然后,将这些损失相互比较。
Then, compare these losses with each other.
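A sketch of that comparison, built on the functions above; single characters stand in for the elementary tokens that must stay in the vocabulary, and probabilities are kept as-is rather than renormalized, matching the computation in the video:

def loss_if_removed(token, corpus, token_probs):
    # Corpus loss when `token` is dropped from the vocabulary.
    reduced = {t: p for t, p in token_probs.items() if t != token}
    return corpus_loss(corpus, reduced)

# Rank the non-elementary tokens by how little their removal hurts the loss.
candidates = [t for t in token_probs if len(t) > 1]
candidates.sort(key=lambda t: loss_if_removed(t, corpus, token_probs))
print(candidates[:2])  # the two cheapest tokens to remove at this iteration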
138
00:07:11,730 --> 00:07:13,800
我们将删除的 token
The token which we will remove
139
00:07:13,800 --> 00:07:17,340
是对损失影响最小的 token ,
is the token which impacts the least the loss,
140
00:07:17,340 --> 00:07:18,870
这里是 token “bu”。
here the token 'bu'.
141
00:07:20,040 --> 00:07:22,380
我们在视频开头提到过
We had mentioned at the beginning of the video
142
00:07:22,380 --> 00:07:24,930
在每次迭代中我们可以删除
that at each iteration we could remove
143
00:07:24,930 --> 00:07:27,093
p% 的 token 。
p-percent of the tokens.
144
00:07:29,356 --> 00:07:33,000
可以在本次迭代中删除的第二个 token
The second token that could be removed at this iteration
145
00:07:33,000 --> 00:07:34,317
是 token “du”。
is the token 'du'.
146
00:07:36,510 --> 00:07:37,920
就是这样。
And that's it.
147
00:07:37,920 --> 00:07:39,720
我们只需要重复这些步骤
We just have to repeat these steps
148
00:07:39,720 --> 00:07:43,203
直到我们得到所需大小的词汇表。
until we get the vocabulary of the desired size.
149
00:07:45,030 --> 00:07:46,500
最后一件事。
One last thing.
150
00:07:46,500 --> 00:07:50,310
在实践中,当我们用 Unigram 模型对一个词分词时,
In practice, when we tokenize a word with a Unigram model,
151
00:07:50,310 --> 00:07:53,130
我们并不会计算
we don't compute the set of probabilities of
152
00:07:53,130 --> 00:07:55,500
一个词所有可能拆分方式的概率集合,
all the possible splits of a word
153
00:07:55,500 --> 00:07:58,770
然后再比较它们以保留最好的那个,
before comparing them to keep the best one
154
00:07:58,770 --> 00:08:01,440
而是使用 Viterbi 算法,
but we use the Viterbi algorithm
155
00:08:01,440 --> 00:08:04,563
这是一种高效得多的方法。
which is a much more efficient way to do it.
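Here is a minimal sketch of that idea (not the exact implementation real libraries use): a dynamic program where best[i] stores the best log-probability over tokenizations of the first i characters, so each prefix is solved once instead of enumerating every split. It assumes the token_probs dictionary from the earlier sketches:

import math

def viterbi_tokenize(word, token_probs):
    n = len(word)
    best = [0.0] + [-math.inf] * n  # best log-probability for word[:i]
    split = [0] * (n + 1)           # start index of the last token in word[:i]
    for end in range(1, n + 1):
        for start in range(end):
            token = word[start:end]
            if token in token_probs:
                score = best[start] + math.log(token_probs[token])
                if score > best[end]:
                    best[end], split[end] = score, start
    # Walk the split points backwards to recover the chosen tokens.
    tokens, i = [], n
    while i > 0:
        tokens.append(word[split[i]:i])
        i = split[i]
    return tokens[::-1]

print(viterbi_tokenize("hug", token_probs))  # e.g. ['h', 'ug']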
156
00:08:06,540 --> 00:08:07,680
就是这样!
And that's it!
157
00:08:07,680 --> 00:08:09,270
我希望这个例子
I hope that this example
158
00:08:09,270 --> 00:08:10,987
能让你更好地理解
has allowed you to better understand
159
00:08:10,987 --> 00:08:12,933
Unigram 分词算法。
the Unigram tokenization algorithm.
160
00:08:14,355 --> 00:08:17,022
(空气呼啸)
(air whooshing)