1
00:00:00,125 --> 00:00:05,125
(空气呼啸)
(air whooshing)
2
00:00:05,190 --> 00:00:06,720
- 你来对地方了
- You are at the right place
3
00:00:06,720 --> 00:00:10,464
如果你想了解字节对编码(BPE)
if you want to understand what the Byte Pair Encoding
4
00:00:10,464 --> 00:00:13,263
子词分词化算法是什么,
subword tokenization algorithm is,
5
00:00:14,160 --> 00:00:15,505
如何训练它
how to train it
6
00:00:15,505 --> 00:00:17,790
以及文本的分词化
and how the tokenization of a text is done
7
00:00:17,790 --> 00:00:19,107
是如何用这个算法完成的。
with this algorithm.
8
00:00:21,417 --> 00:00:22,920
BPE 算法
The BPE algorithm
9
00:00:22,920 --> 00:00:26,820
最初被提出作为文本压缩算法
was initially proposed as a text compression algorithm
10
00:00:26,820 --> 00:00:28,770
但它也非常适合
but it is also very well suited
11
00:00:28,770 --> 00:00:31,143
作为你的语言模型的分词器。
as a tokenizer for your language models.
12
00:00:32,910 --> 00:00:34,890
BPE 的思想是将单词划分
The idea of BPE is to divide words
13
00:00:34,890 --> 00:00:36,933
为一系列“子词单元”
into a sequence of 'subword units'
14
00:00:38,100 --> 00:00:41,970
即在参考语料库中频繁出现的单元
which are units that appear frequently in a reference corpus
15
00:00:41,970 --> 00:00:44,613
也就是我们用来训练它的语料库。
that is, the corpus we used to train it.
16
00:00:46,701 --> 00:00:49,083
BPE 分词器是如何训练的?
How is a BPE tokenizer trained?
17
00:00:50,100 --> 00:00:53,340
首先,我们必须得到一个文本语料库。
First of all, we have to get a corpus of texts.
18
00:00:53,340 --> 00:00:56,940
我们不会在这个原始文本上训练我们的分词器
We will not train our tokenizer on this raw text
19
00:00:56,940 --> 00:00:59,490
而是会先将其规范化
but we will first normalize it
20
00:00:59,490 --> 00:01:00,873
然后对其进行预分词。
then pre-tokenize it.
21
00:01:01,890 --> 00:01:03,240
由于预分词
As the pre-tokenization
22
00:01:03,240 --> 00:01:05,790
将文本分成单词列表,
divides the text into a list of words,
23
00:01:05,790 --> 00:01:08,400
我们可以用另一种方式表示我们的语料库
we can represent our corpus in another way
24
00:01:08,400 --> 00:01:10,350
通过收集相同的词
by gathering together the same words
25
00:01:10,350 --> 00:01:12,450
并维护一个计数器,
and by maintaining a counter,
26
00:01:12,450 --> 00:01:14,223
这里用蓝色表示。
here represented in blue.
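As an aside, here is a minimal Python sketch of this word-counting step; the word list below is made up for illustration, and the real normalization and pre-tokenization depend on the tokenizer being used, so this is only an approximation.

from collections import Counter

# Hypothetical result of normalizing and pre-tokenizing a tiny corpus.
pre_tokenized_words = ["hugging", "face", "hugging", "hug", "hugger"]

# Gather identical words together and keep a count for each one
# (the counts are what the video shows in blue).
word_freqs = Counter(pre_tokenized_words)
print(word_freqs)  # e.g. Counter({'hugging': 2, 'face': 1, 'hug': 1, 'hugger': 1})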
27
00:01:17,340 --> 00:01:19,860
要了解训练的工作原理,
To understand how the training works,
28
00:01:19,860 --> 00:01:23,730
我们来考虑这个由以下单词组成的小语料库:
we consider this toy corpus composed of the following words:
29
00:01:23,730 --> 00:01:28,203
huggingface, hugging, hug, hugger, 等
huggingface, hugging, hug, hugger, etc.
30
00:01:29,100 --> 00:01:32,640
BPE 是一种从初始词汇表开始的算法
BPE is an algorithm that starts with an initial vocabulary
31
00:01:32,640 --> 00:01:35,583
然后将其增加到所需的大小。
and then increases it to the desired size.
32
00:01:36,450 --> 00:01:38,460
为了建立初始词汇表,
To build the initial vocabulary,
33
00:01:38,460 --> 00:01:41,550
我们先将语料库中的每个单词拆分
we start by separating each word of the corpus
34
00:01:41,550 --> 00:01:44,253
为组成它们的基本单元的列表,
into a list of elementary units that compose them,
35
00:01:45,210 --> 00:01:47,013
在这里,就是字符。
here, the characters.
36
00:01:50,850 --> 00:01:54,310
我们在词汇表中列出所有出现的字符
We list in our vocabulary all the characters that appear
37
00:01:55,218 --> 00:01:58,053
这将构成我们的初始词汇表。
and that will constitute our initial vocabulary.
38
00:02:00,420 --> 00:02:02,523
现在让我们看看如何增加它。
Let's now see how to increase it.
39
00:02:05,520 --> 00:02:08,250
我们回到我们拆分的语料库,
We return to our split corpus,
40
00:02:08,250 --> 00:02:11,340
我们将逐个遍历这些单词
we will go through the words one by one
41
00:02:11,340 --> 00:02:14,313
并计算 token 对的所有出现次数。
and count all the occurrences of token pairs.
42
00:02:15,450 --> 00:02:18,397
第一对由标记 “h” 和 “u” 组成,
The first pair is composed of the token 'h' and 'u',
43
00:02:20,130 --> 00:02:23,067
第二对是 “u” 和 “g”,
the second 'u' and 'g',
44
00:02:23,067 --> 00:02:26,253
我们继续这样,直到我们有完整的列表。
and we continue like that until we have the complete list.
45
00:02:35,580 --> 00:02:37,724
一旦我们知道所有的对
Once we know all the pairs
46
00:02:37,724 --> 00:02:40,140
以及它们出现的频率,
and their frequency of appearance,
47
00:02:40,140 --> 00:02:42,940
我们将选择出现频率最高的那个。
we will choose the one that appears the most frequently.
48
00:02:44,220 --> 00:02:47,697
这是由字母 “l” 和 “e” 组成的一对。
Here it is the pair composed of the letters 'l' and 'e'.
49
00:02:51,930 --> 00:02:53,590
我们记下我们的第一条合并规则
We note our first merging rule
50
00:02:54,593 --> 00:02:57,243
然后我们将新 token 添加到我们的词汇表中。
and we add the new token to our vocabulary.
51
00:03:00,330 --> 00:03:04,260
然后我们可以将此合并规则应用于我们的拆分。
We can then apply this merging rule to our splits.
52
00:03:04,260 --> 00:03:07,350
你可以看到我们已经合并了所有的 token 对
You can see that we have merged all the pairs of tokens
53
00:03:07,350 --> 00:03:09,793
由标记 “l” 和 “e” 组成。
composed of the tokens 'l' and 'e'.
54
00:03:14,008 --> 00:03:18,150
现在,我们只需要重复相同的步骤
And now, we just have to reproduce the same steps
55
00:03:18,150 --> 00:03:19,353
只不过这次用我们的新拆分。
with our new splits.
56
00:03:21,750 --> 00:03:23,460
我们统计每一对 token
We calculate the frequency of occurrence
57
00:03:23,460 --> 00:03:25,023
的出现频率,
of each pair of tokens,
58
00:03:27,990 --> 00:03:30,603
我们选择频率最高的一对,
we select the pair with the highest frequency,
59
00:03:32,190 --> 00:03:34,083
我们把它记在我们的合并规则中,
we note it in our merge rules,
60
00:03:36,000 --> 00:03:39,360
我们将新的 token 添加到词汇表中
we add the new token to the vocabulary
61
00:03:39,360 --> 00:03:41,880
然后我们合并所有的 token 对
and then we merge all the pairs of tokens
62
00:03:41,880 --> 00:03:46,503
也就是在我们的拆分中由 “le” 和 “a” 组成的那些。
composed of the tokens 'le' and 'a' into our splits.
63
00:03:50,323 --> 00:03:51,960
我们可以重复这个操作
And we can repeat this operation
64
00:03:51,960 --> 00:03:54,843
直到我们达到所需的词汇量。
until we reach the desired vocabulary size.
65
00:04:05,671 --> 00:04:10,671
在这里,当我们的词汇量达到 21 个 token 时,我们就停止了。
Here, we stopped when our vocabulary reached 21 tokens.
66
00:04:11,040 --> 00:04:13,920
我们现在可以看到,比起训练开始时
We can see now that the words of our corpus
67
00:04:13,920 --> 00:04:17,040
我们的语料库中的单词
are now divided into far fewer tokens
68
00:04:17,040 --> 00:04:20,280
被分成了更少的 token
than at the beginning of the training.
69
00:04:20,280 --> 00:04:21,720
而我们的算法
And that our algorithm
70
00:04:21,720 --> 00:04:24,990
学到了词根 “hug” 和 “learn”
has learned the radicals 'hug' and 'learn'
71
00:04:24,990 --> 00:04:27,537
以及动词结尾 “ing”。
and also the verbal ending 'ing'.
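For reference, here is a minimal Python sketch of the training loop described above. The word frequencies are made up for illustration and ties between equally frequent pairs are broken arbitrarily, so the exact merges may differ from the ones shown in the video.

from collections import Counter

# Made-up word frequencies standing in for the toy corpus
# (word -> count, as produced by normalization and pre-tokenization).
word_freqs = {"huggingface": 4, "hugging": 6, "hug": 8, "hugger": 2,
              "learning": 5, "learner": 3, "learn": 7}

# Initial splits: each word becomes a list of characters,
# and the initial vocabulary is the set of characters that appear.
splits = {word: list(word) for word in word_freqs}
vocab = sorted({ch for word in word_freqs for ch in word})
merge_rules = []

def apply_merge(tokens, a, b):
    # Replace every adjacent pair (a, b) in a split with the merged token a+b.
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            merged.append(a + b)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

target_vocab_size = 21
while len(vocab) < target_vocab_size:
    # Count every pair of adjacent tokens, weighted by word frequency.
    pair_counts = Counter()
    for word, freq in word_freqs.items():
        for a, b in zip(splits[word], splits[word][1:]):
            pair_counts[(a, b)] += freq
    if not pair_counts:
        break
    # Keep the most frequent pair as a new merge rule and a new vocabulary token,
    # then apply the rule to every split before counting again.
    (a, b), _ = pair_counts.most_common(1)[0]
    merge_rules.append((a, b))
    vocab.append(a + b)
    splits = {word: apply_merge(tokens, a, b) for word, tokens in splits.items()}

print(merge_rules)
print(vocab)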
72
00:04:29,880 --> 00:04:32,160
现在我们已经学到了我们的词汇表
Now that we have learned our vocabulary
73
00:04:32,160 --> 00:04:35,943
和合并规则,就可以对新文本进行分词了。
and merging rules, we can tokenize new texts.
74
00:04:37,980 --> 00:04:39,210
例如,
For example,
75
00:04:39,210 --> 00:04:41,160
如果我们想对 “hugs” 这个词进行分词,
if we want to tokenize the word 'hugs',
76
00:04:42,960 --> 00:04:46,680
首先我们将它分成基本单元
first we'll divide it into elementary units
77
00:04:46,680 --> 00:04:48,843
这样它就变成了一个字符序列。
so it becomes a sequence of characters.
78
00:04:50,040 --> 00:04:52,020
然后,我们将遍历我们的合并规则
Then, we'll go through our merge rules
79
00:04:52,020 --> 00:04:54,690
直到找到一条可以应用的规则。
until we have one we can apply.
80
00:04:54,690 --> 00:04:57,930
在这里,我们可以合并字母 “h” 和 “u”。
Here, we can merge the letters 'h' and 'u'.
81
00:04:57,930 --> 00:05:01,467
在这里,我们可以合并 2 个 token 以获得新 token “hug”。
And here, we can merge 2 tokens to get the new token 'hug'.
82
00:05:02,400 --> 00:05:05,760
当我们到达合并规则的末尾时,
When we get to the end of our merge rules,
83
00:05:05,760 --> 00:05:07,563
分词化完成。
the tokenization is finished.
84
00:05:10,650 --> 00:05:11,727
就是这样。
And that's it.
85
00:05:12,846 --> 00:05:14,850
我希望现在 BPE 算法
I hope that now the BPE algorithm
86
00:05:14,850 --> 00:05:16,413
对你而言不再有任何秘密!
has no more secrets for you!
87
00:05:17,739 --> 00:05:20,406
(空气呼啸)
(air whooshing)
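Lastly, a minimal sketch of the tokenization step described above: the learned merge rules are applied in order to a new word. The two rules passed in the example are the ones mentioned for 'hugs' in the video; a real trained tokenizer would apply its full list of merge rules.

def bpe_tokenize(word, merge_rules):
    # Start from elementary units: the characters of the word.
    tokens = list(word)
    # Apply every merge rule, in the order the rules were learned.
    for a, b in merge_rules:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

print(bpe_tokenize("hugs", [("h", "u"), ("hu", "g")]))  # ['hug', 's']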