subtitles/zh-CN/52_wordpiece-tokenization.srt

1
00:00:00,151 --> 00:00:02,818
(空气呼啸)
(air whooshing)

2
00:00:05,520 --> 00:00:08,370
- 一起来看看 WordPiece 算法的
- Let's see together what is the training strategy

3
00:00:08,370 --> 00:00:11,851
训练策略是什么,以及训练完成后
of the WordPiece algorithm, and how it performs

4
00:00:11,851 --> 00:00:15,150
它是如何对文本进行分词的。
the tokenization of a text, once trained.

5
00:00:19,351 --> 00:00:23,580
WordPiece 是 Google 推出的一种分词算法。
WordPiece is a tokenization algorithm introduced by Google.

6
00:00:23,580 --> 00:00:25,653
例如,它被 BERT 使用。
It is used, for example, by BERT.

7
00:00:26,640 --> 00:00:28,020
据我们所知,
To our knowledge,

8
00:00:28,020 --> 00:00:31,590
WordPiece 的代码还没有开源。
the code of WordPiece has not been open-sourced.

9
00:00:31,590 --> 00:00:33,510
所以我们的解释基于
So we base our explanations

10
00:00:33,510 --> 00:00:36,903
我们自己对已发表文献的理解。
on our own interpretation of the published literature.

11
00:00:42,090 --> 00:00:44,883
那么,WordPiece 的训练策略是怎样的呢?
So, what is the training strategy of WordPiece?

12
00:00:46,200 --> 00:00:48,663
与 BPE 算法类似,
Similarly to the BPE algorithm,

13
00:00:48,663 --> 00:00:52,380
WordPiece 从建立初始词汇表开始,
WordPiece starts by establishing an initial vocabulary

14
00:00:52,380 --> 00:00:54,660
该词汇表由基本单元组成,
composed of elementary units,

15
00:00:54,660 --> 00:00:58,773
然后将这个词汇表扩充到所需的大小。
and then increases this vocabulary to the desired size.

16
00:00:59,970 --> 00:01:01,950
为了建立初始词汇表,
To build the initial vocabulary,

17
00:01:01,950 --> 00:01:04,920
我们把训练语料库中的每个单词
we divide each word in the training corpus

18
00:01:04,920 --> 00:01:07,443
拆分成构成它的字母序列。
into the sequence of letters that make it up.

19
00:01:08,430 --> 00:01:11,820
如你所见,这里有一个小细节。
As you can see, there is a small subtlety.

20
00:01:11,820 --> 00:01:14,190
我们会在那些不位于词首的
We add two hashtags in front of the letters

21
00:01:14,190 --> 00:01:16,083
字母前添加两个井号。
that do not start a word.

22
00:01:17,190 --> 00:01:20,430
每个基本单元只保留一次出现,
By keeping only one occurrence per elementary unit,

23
00:01:20,430 --> 00:01:23,313
我们就得到了初始词汇表。
we now have our initial vocabulary.

24
00:01:26,580 --> 00:01:29,823
我们将列出语料库中所有现有的对。
We will list all the existing pairs in our corpus.

25
00:01:30,990 --> 00:01:32,640
一旦我们有了这个列表,
Once we have this list,

26
00:01:32,640 --> 00:01:35,253
我们将为每一对计算一个分数。
we will calculate a score for each of these pairs.

27
00:01:36,630 --> 00:01:38,400
与 BPE 算法一样,
As for the BPE algorithm,

28
00:01:38,400 --> 00:01:40,750
我们将选择得分最高的一对。
we will select the pair with the highest score.

29
00:01:43,260 --> 00:01:44,340
举个例子,
Taking for example,

30
00:01:44,340 --> 00:01:47,343
看由字母 H 和 U 组成的第一对。
the first pair composed of the letters H and U.

31
00:01:48,510 --> 00:01:51,390
一对的分数就等于
The score of a pair is simply equal to the frequency

32
00:01:51,390 --> 00:01:54,510
该对的出现频率,除以
of appearance of the pair, divided by the product

33
00:01:54,510 --> 00:01:57,330
第一个 token 的出现频率
of the frequency of appearance of the first token,

34
00:01:57,330 --> 00:02:00,063
乘以第二个 token 的出现频率。
by the frequency of appearance of the second token.

35
00:02:01,260 --> 00:02:05,550
因此,在该对出现频率固定的情况下,
Thus, at a fixed frequency of appearance of the pair,

36
00:02:05,550 --> 00:02:09,913
如果该对的组成部分在语料库中非常频繁,
if the subparts of the pair are very frequent in the corpus,

37
00:02:09,913 --> 00:02:11,823
那么这个分数就会降低。
then this score will be decreased.

38
00:02:13,140 --> 00:02:17,460
在我们的示例中,HU 对出现了四次,
In our example, the pair HU appears four times,

39
00:02:17,460 --> 00:02:22,460
字母 H 出现四次,字母 U 出现四次。
the letter H four times, and the letter U four times.

40
00:02:24,030 --> 00:02:26,733
这给了我们 0.25 的分数。
This gives us a score of 0.25.
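The pair score described in entries 31 to 40 can be sketched in a few lines of Python. This is only an illustration based on the explanation above, not the actual WordPiece implementation (which, as noted, is not open source); the function and variable names are chosen for this example.

from collections import Counter

def wordpiece_pair_scores(splits, word_freqs):
    # splits: each word mapped to its current list of tokens,
    #         e.g. {"hugging": ["h", "##u", "##g", "##g", "##i", "##n", "##g"]}
    # word_freqs: each word mapped to its frequency in the training corpus
    token_freqs = Counter()
    pair_freqs = Counter()
    for word, freq in word_freqs.items():
        tokens = splits[word]
        for token in tokens:
            token_freqs[token] += freq
        for first, second in zip(tokens, tokens[1:]):
            pair_freqs[(first, second)] += freq
    # score(pair) = freq(pair) / (freq(first token) * freq(second token))
    return {
        pair: freq / (token_freqs[pair[0]] * token_freqs[pair[1]])
        for pair, freq in pair_freqs.items()
    }

# Matching the example above: if the pair ("h", "##u") appears 4 times,
# "h" 4 times and "##u" 4 times, its score is 4 / (4 * 4) = 0.25.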
41
00:02:28,410 --> 00:02:30,960
现在我们知道如何计算这个分数了,
Now that we know how to calculate this score,

42
00:02:30,960 --> 00:02:33,360
就可以为所有的对计算分数了。
we can do it for all pairs.

43
00:02:33,360 --> 00:02:35,217
我们现在可以把得分最高的一对
We can now add to the vocabulary

44
00:02:35,217 --> 00:02:38,973
在合并之后添加到词汇表中。
the pair with the highest score, after merging it of course.

45
00:02:40,140 --> 00:02:43,863
现在我们可以把同样的合并应用到拆分后的语料库上。
And now we can apply this same fusion to our split corpus.

46
00:02:45,780 --> 00:02:47,490
你可以想象,
As you can imagine,

47
00:02:47,490 --> 00:02:50,130
我们只需要重复同样的操作,
we just have to repeat the same operations

48
00:02:50,130 --> 00:02:53,013
直到我们得到所需大小的词汇表。
until we have the vocabulary at the desired size.

49
00:02:54,000 --> 00:02:55,800
让我们再看几个步骤,
Let's look at a few more steps

50
00:02:55,800 --> 00:02:58,113
看看我们词汇表的演变,
to see the evolution of our vocabulary,

51
00:02:58,957 --> 00:03:01,773
以及拆分长度的演变。
and also the evolution of the length of the splits.

52
00:03:06,390 --> 00:03:09,180
现在我们对词汇表已经满意了,
And now that we are happy with our vocabulary,

53
00:03:09,180 --> 00:03:12,663
你可能想知道如何用它来对文本进行分词。
you are probably wondering how to use it to tokenize a text.

54
00:03:13,830 --> 00:03:17,640
假设我们想对 "huggingface" 这个词进行分词。
Let's say we want to tokenize the word "huggingface".

55
00:03:17,640 --> 00:03:20,310
WordPiece 遵循以下规则。
WordPiece follows these rules.

56
00:03:20,310 --> 00:03:22,530
我们会在单词的开头
We will look for the longest possible token

57
00:03:22,530 --> 00:03:24,960
寻找尽可能长的 token。
at the beginning of the word.

58
00:03:24,960 --> 00:03:28,920
然后我们对单词的剩余部分重复这一过程,
Then we start again on the remaining part of our word,

59
00:03:28,920 --> 00:03:31,143
依此类推,直到单词结束。
and so on until we reach the end.

60
00:03:32,100 --> 00:03:35,973
就是这样,huggingface 被分成了四个子 token。
And that's it. Huggingface is divided into four sub-tokens.

61
00:03:37,200 --> 00:03:39,180
本视频即将结束。
This video is about to end.

62
00:03:39,180 --> 00:03:41,370
我希望它能帮助你更好地理解
I hope it helped you to understand better

63
00:03:41,370 --> 00:03:43,653
WordPiece 这个词背后的原理。
what is behind the word, WordPiece.

64
00:03:45,114 --> 00:03:47,864
(空气呼啸)
(air whooshing)
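The longest-match-first tokenization described in entries 55 to 60 can likewise be sketched in Python. The toy vocabulary below is an assumption made for this illustration (it is not the vocabulary trained in the video), chosen so that "huggingface" splits into four sub-tokens; the [UNK] fallback for unmatched words is also an assumption based on BERT's behaviour rather than something shown in the video.

def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    tokens = []
    start = 0
    while start < len(word):
        # Look for the longest vocabulary entry starting at position `start`.
        end = len(word)
        current = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # pieces that do not start the word get the ## prefix
            if piece in vocab:
                current = piece
                break
            end -= 1
        if current is None:
            return [unk_token]  # nothing matched: treat the whole word as unknown
        tokens.append(current)
        start = end  # start again on the remaining part of the word
    return tokens

toy_vocab = {"hu", "##gging", "##fac", "##e"}
print(wordpiece_tokenize("huggingface", toy_vocab))
# ['hu', '##gging', '##fac', '##e']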