subtitles/zh-CN/52_wordpiece-tokenization.srt
1
00:00:00,151 --> 00:00:02,818
(空气呼啸)
(air whooshing)
2
00:00:05,520 --> 00:00:08,370
- 一起来看看 WordPiece 算法的训练策略,
- Let's see together what the training strategy
3
00:00:08,370 --> 00:00:11,851
以及它在训练完成后
of the WordPiece algorithm is, and how it performs
4
00:00:11,851 --> 00:00:15,150
如何对文本进行分词。
the tokenization of a text, once trained.
5
00:00:19,351 --> 00:00:23,580
WordPiece 是 Google 推出的一种分词算法。
WordPiece is a tokenization algorithm introduced by Google.
6
00:00:23,580 --> 00:00:25,653
例如,它被 BERT 使用。
It is used, for example, by BERT.
7
00:00:26,640 --> 00:00:28,020
据我们所知,
To our knowledge,
8
00:00:28,020 --> 00:00:31,590
WordPiece 的代码还没有开源。
the code of WordPiece has not been open-sourced.
9
00:00:31,590 --> 00:00:33,510
所以我们的讲解
So we base our explanations
10
00:00:33,510 --> 00:00:36,903
基于我们自己对已发表文献的理解。
on our own interpretation of the published literature.
11
00:00:42,090 --> 00:00:44,883
那么,WordPiece 的训练策略是怎样的呢?
So, what is the training strategy of WordPiece?
12
00:00:46,200 --> 00:00:48,663
与 BPE 算法类似,
Similarly to the BPE algorithm,
13
00:00:48,663 --> 00:00:52,380
WordPiece 从建立初始词汇表开始
WordPiece starts by establishing an initial vocabulary
14
00:00:52,380 --> 00:00:54,660
该词汇表由基本单元组成,
composed of elementary units,
15
00:00:54,660 --> 00:00:58,773
然后将这个词汇表扩充到所需的大小。
and then increases this vocabulary to the desired size.
16
00:00:59,970 --> 00:01:01,950
为了建立初始词汇表,
To build the initial vocabulary,
17
00:01:01,950 --> 00:01:04,920
我们将训练语料库中的每个单词
we divide each word in the training corpus
18
00:01:04,920 --> 00:01:07,443
拆分成组成它的字母序列。
into the sequence of letters that make it up.
19
00:01:08,430 --> 00:01:11,820
如你所见,有一个小细节。
As you can see, there is a small subtlety.
20
00:01:11,820 --> 00:01:14,190
我们会在字母前添加两个井号(##),
We add two hashtags in front of the letters
21
00:01:14,190 --> 00:01:16,083
前提是这些字母不位于词首。
that do not start a word.
22
00:01:17,190 --> 00:01:20,430
每种基本单元只保留一个,
By keeping only one occurrence per elementary unit,
23
00:01:20,430 --> 00:01:23,313
我们现在有了最初的词汇表。
we now have our initial vocabulary.
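As a concrete illustration, here is a minimal Python sketch of this initialization step, based on our interpretation rather than Google's unreleased code; the toy corpus and its counts are made up:

```python
# Build the initial WordPiece alphabet: split each word into letters and
# prefix every letter that does not start a word with "##".
corpus_word_freqs = {"hugs": 4, "hug": 1, "pugs": 2}  # hypothetical corpus

splits = {
    word: [c if i == 0 else "##" + c for i, c in enumerate(word)]
    for word in corpus_word_freqs
}
# Keep only one occurrence per elementary unit.
initial_vocab = sorted({unit for split in splits.values() for unit in split})
print(initial_vocab)  # ['##g', '##s', '##u', 'h', 'p']
```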
24
00:01:26,580 --> 00:01:29,823
我们将列出语料库中所有现有的对。
We will list all the existing pairs in our corpus.
25
00:01:30,990 --> 00:01:32,640
一旦我们有了这份列表,
Once we have this list,
26
00:01:32,640 --> 00:01:35,253
我们将为每一对计算一个分数。
we will calculate a score for each of these pairs.
27
00:01:36,630 --> 00:01:38,400
和 BPE 算法一样,
As with the BPE algorithm,
28
00:01:38,400 --> 00:01:40,750
我们将选择得分最高的一对。
we will select the pair with the highest score.
29
00:01:43,260 --> 00:01:44,340
以第一对为例,
Take, for example,
30
00:01:44,340 --> 00:01:47,343
它由字母 H 和 U 组成。
the first pair, composed of the letters H and U.
31
00:01:48,510 --> 00:01:51,390
一对的分数就等于这一对
The score of a pair is simply equal to the frequency
32
00:01:51,390 --> 00:01:54,510
出现的频率,除以
of appearance of the pair, divided by the product
33
00:01:54,510 --> 00:01:57,330
第一个 token 的出现频率
of the frequency of appearance of the first token,
34
00:01:57,330 --> 00:02:00,063
与第二个 token 出现频率的乘积。
and the frequency of appearance of the second token.
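In symbols (our notation, not taken from any released WordPiece code), the score of a pair of tokens $a$ and $b$ is:

$$\mathrm{score}(a, b) = \frac{\mathrm{freq}(ab)}{\mathrm{freq}(a) \times \mathrm{freq}(b)}$$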
35
00:02:01,260 --> 00:02:05,550
因此,在这一对出现频率固定的情况下,
Thus, at a fixed frequency of appearance of the pair,
36
00:02:05,550 --> 00:02:09,913
如果该对的子部分在语料库中非常频繁,
if the subparts of the pair are very frequent in the corpus,
37
00:02:09,913 --> 00:02:11,823
那么这个分数就会降低。
then this score will be decreased.
38
00:02:13,140 --> 00:02:17,460
在我们的示例中,HU 对出现了四次,
In our example, the pair HU appears four times,
39
00:02:17,460 --> 00:02:22,460
字母 H 四次,字母 U 四次。
the letter H four times, and the letter U four times.
40
00:02:24,030 --> 00:02:26,733
这给了我们 0.25 的分数。
This gives us a score of 0.25.
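The same computation in Python; the counts below are hypothetical, chosen so that the pair of H and U reproduces the 0.25 of the example:

```python
from collections import defaultdict

# Hypothetical counts: the pair ("h", "##u") appears 4 times, and so do
# "h" and "##u" individually, giving a score of 4 / (4 * 4) = 0.25.
word_freqs = {"hugs": 4}
splits = {"hugs": ["h", "##u", "##g", "##s"]}

pair_freqs = defaultdict(int)
token_freqs = defaultdict(int)
for word, freq in word_freqs.items():
    split = splits[word]
    for token in split:
        token_freqs[token] += freq
    for a, b in zip(split, split[1:]):
        pair_freqs[(a, b)] += freq

scores = {
    pair: f / (token_freqs[pair[0]] * token_freqs[pair[1]])
    for pair, f in pair_freqs.items()
}
print(scores[("h", "##u")])  # 0.25
```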
41
00:02:28,410 --> 00:02:30,960
现在我们知道如何计算这个分数了,
Now that we know how to calculate this score,
42
00:02:30,960 --> 00:02:33,360
我们就可以对所有的对进行计算。
we can do it for all pairs.
43
00:02:33,360 --> 00:02:35,217
现在我们可以将得分最高的一对
We can now add to the vocabulary
44
00:02:35,217 --> 00:02:38,973
合并之后加入词汇表中。
the pair with the highest score, after merging it of course.
45
00:02:40,140 --> 00:02:43,863
现在我们可以把同样的合并应用到拆分后的语料库上。
And now we can apply this same merge to our split corpus.
46
00:02:45,780 --> 00:02:47,490
你可以想象,
As you can imagine,
47
00:02:47,490 --> 00:02:50,130
我们只需要重复相同的操作
we just have to repeat the same operations
48
00:02:50,130 --> 00:02:53,013
直到我们拥有所需大小的词汇表。
until we have the vocabulary at the desired size.
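Putting the pieces together, a compact sketch of the full training loop might look as follows; it reflects our reading of the published literature, and the helper names, toy corpus, and target size are our own choices:

```python
from collections import defaultdict

def compute_pair_scores(splits, word_freqs):
    """Score every adjacent pair: freq(pair) / (freq(first) * freq(second))."""
    pair_freqs, token_freqs = defaultdict(int), defaultdict(int)
    for word, freq in word_freqs.items():
        split = splits[word]
        for token in split:
            token_freqs[token] += freq
        for a, b in zip(split, split[1:]):
            pair_freqs[(a, b)] += freq
    return {
        pair: f / (token_freqs[pair[0]] * token_freqs[pair[1]])
        for pair, f in pair_freqs.items()
    }

def merge_pair(a, b, splits, word_freqs):
    """Fuse every adjacent (a, b) into a single token in all word splits."""
    merged = a + (b[2:] if b.startswith("##") else b)
    for word in word_freqs:
        split = splits[word]
        i = 0
        while i < len(split) - 1:
            if split[i] == a and split[i + 1] == b:
                split = split[:i] + [merged] + split[i + 2:]
            else:
                i += 1
        splits[word] = split
    return merged

word_freqs = {"hug": 4, "pug": 2, "pun": 3, "hugs": 1}  # toy corpus
splits = {w: [c if i == 0 else "##" + c for i, c in enumerate(w)]
          for w in word_freqs}
vocab = sorted({u for s in splits.values() for u in s})

while len(vocab) < 15:  # 15 is an arbitrary target vocabulary size
    scores = compute_pair_scores(splits, word_freqs)
    if not scores:       # every word is a single token: nothing left to merge
        break
    best = max(scores, key=scores.get)
    vocab.append(merge_pair(*best, splits, word_freqs))
print(vocab)
```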
49
00:02:54,000 --> 00:02:55,800
让我们再看几个步骤
Let's look at a few more steps
50
00:02:55,800 --> 00:02:58,113
看看我们词汇表的演变,
to see the evolution of our vocabulary,
51
00:02:58,957 --> 00:03:01,773
以及拆分长度的演变。
and also the evolution of the length of the splits.
52
00:03:06,390 --> 00:03:09,180
现在我们对词汇表感到满意了,
And now that we are happy with our vocabulary,
53
00:03:09,180 --> 00:03:12,663
你可能想知道如何用它来对文本进行分词。
you are probably wondering how to use it to tokenize a text.
54
00:03:13,830 --> 00:03:17,640
假设我们想对 “huggingface” 这个词进行分词。
Let's say we want to tokenize the word "huggingface".
55
00:03:17,640 --> 00:03:20,310
WordPiece 遵循这些规则。
WordPiece follows these rules.
56
00:03:20,310 --> 00:03:22,530
我们会寻找出现在单词开头的
We will look for the longest possible token
57
00:03:22,530 --> 00:03:24,960
尽可能长的 token。
at the beginning of the word.
58
00:03:24,960 --> 00:03:28,920
然后我们对单词的剩余部分重新开始,
Then we start again on the remaining part of our word,
59
00:03:28,920 --> 00:03:31,143
依此类推,直到我们到达终点。
and so on until we reach the end.
60
00:03:32,100 --> 00:03:35,973
就是这样,“huggingface” 被分成四个子 token。
And that's it. Huggingface is divided into four sub-tokens.
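Here is a short sketch of this longest-match-first rule; the vocabulary is hand-picked so that the example word splits into four sub-tokens, since the actual pieces depend on the trained vocabulary:

```python
def wordpiece_tokenize(word, vocab):
    """Greedily take the longest vocabulary entry at the start of the word."""
    tokens = []
    while word:
        prefix = "##" if tokens else ""   # continuation pieces carry "##"
        end = len(word)
        while end > 0 and prefix + word[:end] not in vocab:
            end -= 1
        if end == 0:
            return ["[UNK]"]              # no known prefix: word is unknown
        tokens.append(prefix + word[:end])
        word = word[end:]
    return tokens

vocab = {"hugg", "##ing", "##fac", "##e"}  # hypothetical trained vocabulary
print(wordpiece_tokenize("huggingface", vocab))
# ['hugg', '##ing', '##fac', '##e']
```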
61
00:03:37,200 --> 00:03:39,180
本视频即将结束。
This video is about to end.
62
00:03:39,180 --> 00:03:41,370
我希望它能帮助你更好地理解
I hope it helped you to understand better
63
00:03:41,370 --> 00:03:43,653
WordPiece 这个词背后的原理。
what is behind the word WordPiece.
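If you want to experiment without implementing the algorithm yourself, the Hugging Face tokenizers library provides a WordPiece model and trainer; the training file and vocabulary size below are placeholders:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = WordPieceTrainer(vocab_size=5000, special_tokens=["[UNK]"])
tokenizer.train(["corpus.txt"], trainer)   # placeholder corpus file
print(tokenizer.encode("huggingface").tokens)
```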
64
00:03:45,114 --> 00:03:47,864
(空气呼啸)
(air whooshing)