subtitles/zh-CN/53_unigram-tokenization.srt (640 lines of code) (raw):

1 00:00:00,000 --> 00:00:02,667 (空气呼啸) (air whooshing) 2 00:00:05,310 --> 00:00:06,420 - 在这个视频中, - In this video, 3 00:00:06,420 --> 00:00:09,881 我们将一起研究 “Unigram 语言模型 we will study together 'the Unigram Language Model 4 00:00:09,881 --> 00:00:13,288 子词分词化算法 " 。 subword tokenization algorithm'. 5 00:00:13,288 --> 00:00:15,567 整体训练策略 The overall training strategy 6 00:00:15,567 --> 00:00:18,450 对一个 Unigram 语言模型分词器 of a Unigram Language Model tokenizer 7 00:00:18,450 --> 00:00:21,480 是从一个非常大的词汇量开始 is to start with a very large vocabulary 8 00:00:21,480 --> 00:00:24,240 然后在每次迭代中删除 token and then to remove tokens at each iteration 9 00:00:24,240 --> 00:00:27,300 直到我们达到所需的大小。 until we reach the desired size. 10 00:00:27,300 --> 00:00:28,530 在每次迭代中, At each iteration, 11 00:00:28,530 --> 00:00:30,930 我们将计算训练语料库的损失 we will calculate a loss on our training corpus 12 00:00:30,930 --> 00:00:33,480 多亏了 Unigram 模型。 thanks to the Unigram model. 13 00:00:33,480 --> 00:00:37,470 由于损失计算取决于可用的词汇表, As the loss calculation depends on the available vocabulary, 14 00:00:37,470 --> 00:00:40,563 我们可以用它来选择如何减少词汇量。 we can use it to choose how to reduce the vocabulary. 15 00:00:41,550 --> 00:00:43,620 所以我们看看损失的变化 So we look at the evolution of the loss 16 00:00:43,620 --> 00:00:47,103 通过依次从词汇表中删除每个 token 。 by removing in turn each token from the vocabulary. 17 00:00:48,000 --> 00:00:50,430 我们将选择删除 p 百分比 We will choose to remove the p-percents 18 00:00:50,430 --> 00:00:52,200 增加的损失越少。 which increase the loss the less. 19 00:00:56,310 --> 00:00:57,540 在进一步之前 Before going further 20 00:00:57,540 --> 00:01:00,240 在训练算法的解释中, in the explanation of the training algorithm, 21 00:01:00,240 --> 00:01:02,973 我需要解释什么是 Unigram 模型。 I need to explain what is an Unigram model. 22 00:01:04,183 --> 00:01:06,030 Unigram 语言模型 The Unigram Language Model 23 00:01:06,030 --> 00:01:08,493 是一种统计语言模型。 is a type of Statistical Language Model. 24 00:01:09,450 --> 00:01:10,980 统计语言模型 A Statistical Language Model 25 00:01:10,980 --> 00:01:13,530 将为文本分配概率 will assign a probability to a text 26 00:01:13,530 --> 00:01:18,090 考虑到文本实际上是一系列 token 。 considering that the text is in fact a sequence of tokens. 27 00:01:18,090 --> 00:01:21,090 可以想象的最简单的 token 序列 The simplest sequences of tokens to imagine 28 00:01:21,090 --> 00:01:24,753 是组成句子或字符的单词。 are the words that compose the sentence or the characters. 29 00:01:26,130 --> 00:01:28,890 Unigram 语言模型的特殊性 The particularity of Unigram Language Model 30 00:01:28,890 --> 00:01:32,010 是它假设每个词的出现 is that it assumes that the occurrence of each word 31 00:01:32,010 --> 00:01:34,533 独立于它的前一个词。 is independent of its previous word. 32 00:01:35,400 --> 00:01:37,620 这个假设允许我们写 This assumption allows us to write 33 00:01:37,620 --> 00:01:39,570 一个文本的概率 that the probability of a text 34 00:01:39,570 --> 00:01:42,210 等于概率的乘积 is equal to the product of the probabilities 35 00:01:42,210 --> 00:01:43,953 对组成它的 token 。 of the tokens that compose it. 36 00:01:45,840 --> 00:01:50,220 这里需要注意的是,它是一个非常简单的模型 It should be noted here that it is a very simple model 37 00:01:50,220 --> 00:01:53,850 这不会适应文本的生成 which would not be adapted to the generation of text 38 00:01:53,850 --> 00:01:57,840 因为这个模型总是会生成相同的 token , since this model would always generate the same token, 39 00:01:57,840 --> 00:02:00,453 概率最大的那个。 the one which has the greatest probability. 40 00:02:01,320 --> 00:02:03,360 然而,要进行分词化, Nevertheless, to do tokenization, 41 00:02:03,360 --> 00:02:05,790 这个模型对我们很有用 this model is very useful to us 42 00:02:05,790 --> 00:02:07,440 因为它可以使用 because it can be used 43 00:02:07,440 --> 00:02:10,893 估计不同短语的相对可能性。 to estimate the relative likelihood of different phrases. 44 00:02:14,100 --> 00:02:15,000 我们现在准备好了 We are now ready 45 00:02:15,000 --> 00:02:19,830 回到我们对训练算法的解释。 to return to our explanation of the training algorithm. 46 00:02:19,830 --> 00:02:21,690 假设我们有一个训练语料库 Let's say that we have as a training corpus 47 00:02:21,690 --> 00:02:23,880 用 10 次 hug 这个词, with 10 times the word hug, 48 00:02:23,880 --> 00:02:25,410 12 次 pug 这个词, 12 times the word pug, 49 00:02:25,410 --> 00:02:27,330 5 次 lug 这个词, 5 times the word lug, 50 00:02:27,330 --> 00:02:28,560 4 次 bug 4 times bug 51 00:02:28,560 --> 00:02:29,943 5 次 dug。 and 5 times dug. 52 00:02:33,120 --> 00:02:34,560 如前所述, As said earlier, 53 00:02:34,560 --> 00:02:37,473 训练从大量词汇开始。 the training starts with a big vocabulary. 54 00:02:38,460 --> 00:02:41,400 显然,由于我们使用的是玩具语料库, Obviously, as we are using a toy corpus, 55 00:02:41,400 --> 00:02:44,430 这个词汇量不会那么大 this vocabulary will not be that big 56 00:02:44,430 --> 00:02:46,773 但它应该告诉你原理。 but it should show you the principle. 57 00:02:47,610 --> 00:02:51,870 第一种方法是列出所有可能的严格子串 A first method is to list all the possible strict substrings 58 00:02:51,870 --> 00:02:53,823 这就是我们在这里要做的。 and that's what we'll do here. 59 00:02:54,780 --> 00:02:58,170 我们也可以使用 BPE 算法 We could also have used the BPE algorithm 60 00:02:58,170 --> 00:03:00,010 词汇量非常大 with a very large vocabulary size 61 00:03:01,410 --> 00:03:05,103 但就目前而言,严格的子字符串就足够了。 but for now, the strict substrings are enough. 62 00:03:06,990 --> 00:03:09,120 Unigram 分词器的训练 The training of the Unigram tokenizer 63 00:03:09,120 --> 00:03:12,093 基于期望最大化方法。 is based on the Expectation-Maximization method. 64 00:03:13,320 --> 00:03:15,120 在每次迭代中, At each iteration, 65 00:03:15,120 --> 00:03:17,430 我们估计 token 的概率 we estimate the probabilities of the tokens 66 00:03:17,430 --> 00:03:18,430 对词汇 of the vocabulary 67 00:03:20,130 --> 00:03:23,100 然后我们删除 p 百分比的 token and then we remove the p-percent of tokens 68 00:03:23,100 --> 00:03:26,070 最小化语料库的损失 that minimize the loss on the corpus 69 00:03:26,070 --> 00:03:28,900 而哪些不属于基本字 and which do not belong to the basic character 70 00:03:29,880 --> 00:03:33,150 因为我们想保留在我们的最终词汇表中 as we want to keep in our final vocabulary 71 00:03:33,150 --> 00:03:36,693 能够标记任何单词的基本字符。 the basic characters to be able to tokenize any word. 72 00:03:37,770 --> 00:03:39,641 让我们开始吧! Let's go for it! 73 00:03:39,641 --> 00:03:42,360 简单估计一个 token 的概率 The probability of a token simply estimated 74 00:03:42,360 --> 00:03:44,760 按此 token 出现的次数 by the number of appearance of this token 75 00:03:44,760 --> 00:03:46,440 在我们的训练语料库中 in our training corpus 76 00:03:46,440 --> 00:03:50,133 除以所有 token 出现的总数。 divided by the total number of appearance of all the tokens. 77 00:03:51,510 --> 00:03:54,390 我们可以使用这个词汇表来标记我们的单词 We could use this vocabulary to tokenize our words 78 00:03:54,390 --> 00:03:56,283 根据 Unigram 模型。 according to the Unigram model. 79 00:03:57,150 --> 00:04:00,892 我们将一起做,以了解两件事: We will do it together to understand two things: 80 00:04:00,892 --> 00:04:04,110 我们如何使用 Unigram 模型对单词进行分词 how we tokenize a word with a Unigram model 81 00:04:04,110 --> 00:04:07,803 以及如何在我们的语料库上计算损失。 and how the loss is calculated on our corpus. 82 00:04:09,088 --> 00:04:12,263 我们的文本 “Hug” 的 Unigram LM 分词化 The Unigram LM tokenization of our text 'Hug' 83 00:04:12,263 --> 00:04:15,270 将是发生概率最高的那个 will be the one with the highest probability of occurrence 84 00:04:15,270 --> 00:04:17,403 根据我们的 Unigram 模型。 according to our Unigram model. 85 00:04:19,080 --> 00:04:21,750 找到它,最简单的方法 To find it, the simplest way to proceed 86 00:04:21,750 --> 00:04:24,120 将列出所有可能的分割 would be to list all the possible segmentations 87 00:04:24,120 --> 00:04:25,800 对我们的文本 "Hug" , of our text 'Hug', 88 00:04:25,800 --> 00:04:29,340 计算每个细分的概率 calculate the probability of each of these segmentations 89 00:04:29,340 --> 00:04:32,043 然后选择概率最高的那个。 and then choose the one with the highest probability. 90 00:04:33,210 --> 00:04:34,920 以现在的词汇量, With the current vocabulary, 91 00:04:34,920 --> 00:04:38,640 两个分词化获得完全相同的概率。 two tokenizations get exactly the same probability. 92 00:04:38,640 --> 00:04:40,080 所以我们选择其中之一 So we choose one of them 93 00:04:40,080 --> 00:04:42,603 并记住相关的概率。 and keep in memory the associated probability. 94 00:04:43,710 --> 00:04:46,380 为了计算我们训练语料库的损失, To compute the loss on our training corpus, 95 00:04:46,380 --> 00:04:48,570 我们需要像刚才那样进行分词化 we need to tokenize as we just did 96 00:04:48,570 --> 00:04:50,673 语料库中所有剩余的单词。 all the remaining words in the corpus. 97 00:04:52,290 --> 00:04:56,430 损失就是所有单词的总和, 语料库中 The loss is then the sum over all the words in the corpus 98 00:04:56,430 --> 00:04:58,920 词的出现频率 of the frequency of occurrence of the word 99 00:04:58,920 --> 00:05:02,670 乘以概率对数的相反数 multiplied by the opposite of the log of the probability 100 00:05:02,670 --> 00:05:05,463 与单词的分词化相关联的。 associated with the tokenization of the word. 101 00:05:07,620 --> 00:05:10,803 我们在这里得到了 170 的损失。 We obtain here a loss of 170. 102 00:05:13,830 --> 00:05:18,630 请记住,我们最初的目标是减少词汇量。 Remember, our initial goal was to reduce the vocabulary. 103 00:05:18,630 --> 00:05:21,870 为此,我们将从词汇表中删除一个 token To do this, we will remove a token from the vocabulary 104 00:05:21,870 --> 00:05:24,213 并计算相关损失。 and calculate the associated loss. 105 00:05:27,630 --> 00:05:30,627 例如,让我们删除标记 “ug”。 Let's remove for example, the token 'ug'. 106 00:05:31,920 --> 00:05:35,370 我们注意到 “hug” 的分词化 We notice that the tokenization for 'hug' 107 00:05:35,370 --> 00:05:39,990 字母 "h" 和元组 "ug" 现在是不可能的。 with the letter 'h' and the tuple 'ug' is now impossible. 108 00:05:39,990 --> 00:05:42,240 尽管如此,正如我们之前看到的 Nevertheless, as we saw earlier 109 00:05:42,240 --> 00:05:45,180 两个分词化具有相同的概率, that two tokenizations had the same probability, 110 00:05:45,180 --> 00:05:47,730 我们仍然可以选择剩余的分词化 we can still choose the remaining tokenization 111 00:05:47,730 --> 00:05:51,093 概率为 1.10e-2。 with a probability of 1.10e-2. 112 00:05:52,410 --> 00:05:55,350 词汇表中其他词的分词 The tokenizations of the other words of the vocabulary 113 00:05:55,350 --> 00:05:57,060 也保持不变。 also remain unchanged. 114 00:05:57,060 --> 00:06:00,600 最后,即使我们删除了标记 “ug” And finally, even if we remove the token 'ug' 115 00:06:00,600 --> 00:06:05,403 从我们的词汇表来看,损失仍然等于 170。 from our vocabulary the loss remains equal to 170. 116 00:06:06,630 --> 00:06:08,100 对于第一次迭代, For this first iteration, 117 00:06:08,100 --> 00:06:10,080 如果我们继续计算, if we continue the calculation, 118 00:06:10,080 --> 00:06:13,050 我们会注意到我们可以删除任何 token we would notice that we could remove any token 119 00:06:13,050 --> 00:06:16,110 在不影响损失的情况下。 without it impacting the loss. 120 00:06:16,110 --> 00:06:19,200 因此,我们将随机选择删除标记 “ug” We will therefore choose at random to remove the token 'ug' 121 00:06:19,200 --> 00:06:21,843 在开始第二次迭代之前。 before starting a second iteration. 122 00:06:24,240 --> 00:06:27,300 所以我们再次估计每个 token 的概率 So we estimate again the probability of each token 123 00:06:27,300 --> 00:06:30,630 在计算每个 token 对损失的影响之前。 before calculating the impact of each token on the loss. 124 00:06:32,160 --> 00:06:33,990 例如,如果我们现在删除 For example, if we remove now 125 00:06:33,990 --> 00:06:36,290 由字母 “h” 和 “u” 组成的 token , the token composed of the letters 'h' and 'u', 126 00:06:37,350 --> 00:06:41,013 "hug" 只剩下一种可能的分词化。 there is only one possible tokenization left for "hug". 127 00:06:41,940 --> 00:06:44,700 词汇表中其他词的分词化 The tokenization of the other words of the vocabulary 128 00:06:44,700 --> 00:06:45,633 没有改变。 is not changed. 129 00:06:46,560 --> 00:06:47,393 到底, In the end, 130 00:06:47,393 --> 00:06:49,200 我们通过删除 token 获得 we obtain by removing the token 131 00:06:49,200 --> 00:06:52,749 由词汇表中的字母 “h” 和 “u” 组成, composed of the letters 'h' and 'u' from the vocabulary, 132 00:06:52,749 --> 00:06:56,430 损失为 168。 a loss of 168. 133 00:06:56,430 --> 00:06:59,490 最后,要选择要删除的 token , Finally, to choose which token to remove, 134 00:06:59,490 --> 00:07:02,490 我们将为词汇表的每个剩余 token , we will for each remaining token of the vocabulary, 135 00:07:02,490 --> 00:07:04,800 这不是基本 token , which is not an elementary token, 136 00:07:04,800 --> 00:07:07,380 计算相关损失。 calculate the associated loss. 137 00:07:07,380 --> 00:07:09,843 然后,比较它们之间的这些损失。 Then, compare these losses between them. 138 00:07:11,730 --> 00:07:13,800 我们将删除的 token The token which we will remove 139 00:07:13,800 --> 00:07:17,340 是对损失影响最小的 token , is the token which impacts the least the loss, 140 00:07:17,340 --> 00:07:18,870 这里是 token “bu”。 here the token 'bu'. 141 00:07:20,040 --> 00:07:22,380 我们在视频开头提到过 We had mentioned at the beginning of the video 142 00:07:22,380 --> 00:07:24,930 在每次迭代中我们可以删除 that at each iteration we could remove 143 00:07:24,930 --> 00:07:27,093 迭代中标记的 p 百分比。 p-percent of the tokens by iteration. 144 00:07:29,356 --> 00:07:33,000 可以在本次迭代中删除的第二个 token The second token that could be removed at this iteration 145 00:07:33,000 --> 00:07:34,317 是 token “du”。 is the token 'du'. 146 00:07:36,510 --> 00:07:37,920 就是这样。 And that's it. 147 00:07:37,920 --> 00:07:39,720 我们只需要重复这些步骤 We just have to repeat these steps 148 00:07:39,720 --> 00:07:43,203 直到我们得到所需大小的词汇表。 until we get the vocabulary of the desired size. 149 00:07:45,030 --> 00:07:46,500 最后一件事。 One last thing. 150 00:07:46,500 --> 00:07:50,310 在实践中,当我们用 Unigram 模型标记一个词时, In practice, when we tokenize a word with a Unigram model, 151 00:07:50,310 --> 00:07:53,130 我们不计算概率的集合 we don't compute the set of probabilities of 152 00:07:53,130 --> 00:07:55,500 一个词所有可能的拆分 all the possible splits of a word 153 00:07:55,500 --> 00:07:58,770 在比较它们以保留最好的之前 before comparing them to keep the best one 154 00:07:58,770 --> 00:08:01,440 但是我们使用 Viterbi 算法 but we use the Viterbi algorithm 155 00:08:01,440 --> 00:08:04,563 这是更有效的方法。 which is much more efficient way to do it. 156 00:08:06,540 --> 00:08:07,680 就是这样! And that's it! 157 00:08:07,680 --> 00:08:09,270 我希望这个例子 I hope that this example 158 00:08:09,270 --> 00:08:10,987 已经让你更好的了解 has allowed you to better understand 159 00:08:10,987 --> 00:08:12,933 Unigram 分词算法。 the Unigram tokenization algorithm. 160 00:08:14,355 --> 00:08:17,022 (空气呼啸) (air whooshing)