subtitles/zh-CN/16_the-tokenization-pipeline.srt
1
00:00:00,479 --> 00:00:03,396
(物体呼啸)
(object whooshing)
2
00:00:05,610 --> 00:00:06,873
- 分词器的 pipeline。
[译者注: token, tokenization, tokenizer 等词均译成了“分词”,实则不翻译最佳]
- The tokenizer pipeline.
3
00:00:07,920 --> 00:00:10,570
在本视频中,我们将了解分词器如何转换
In this video, we'll look at how a tokenizer converts
4
00:00:11,433 --> 00:00:12,480
从原始文本到数字,
raw texts to numbers,
5
00:00:12,480 --> 00:00:14,970
这些数字是 Transformer 模型能够理解的,
that a Transformer model can make sense of,
6
00:00:14,970 --> 00:00:16,520
就像我们执行这段代码时一样。
like when we execute this code.
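[Note: a minimal sketch of the kind of code meant here, assuming the "bert-base-uncased" checkpoint used throughout the course; the exact snippet shown on screen may differ:]

    from transformers import AutoTokenizer

    # Load a pretrained tokenizer and call it directly on raw text.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    inputs = tokenizer("Let's try to tokenize!")
    print(inputs["input_ids"])
    # A list of integers the model can consume; for this checkpoint it
    # starts with 101 ([CLS]) and ends with 102 ([SEP]).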
7
00:00:17,760 --> 00:00:18,690
这是一个快速概述
Here is a quick overview
8
00:00:18,690 --> 00:00:21,630
对于 tokenizer 对象内部发生的事情:
of what happens inside the tokenizer object:
9
00:00:21,630 --> 00:00:24,360
首先,文本被分成分词,
first, the text is split into tokens,
10
00:00:24,360 --> 00:00:27,453
它们是单词、单词的一部分或标点符号。
which are words, parts of words, or punctuation symbols.
11
00:00:28,440 --> 00:00:31,500
然后分词器会添加可能需要的特殊分词
Then the tokenizer adds potential special tokens
12
00:00:31,500 --> 00:00:34,680
并将每个分词转换为各自唯一的 ID
and converts each token to its unique respective ID
13
00:00:34,680 --> 00:00:36,843
由分词器的词汇表定义。
as defined by the tokenizer's vocabulary.
14
00:00:37,710 --> 00:00:40,380
正如我们将要看到的,它并不完全是按照这个顺序发生的,
As we'll see, it doesn't quite happen in this order,
15
00:00:40,380 --> 00:00:43,233
但这样更利于理解。
but doing it like this is better for understanding.
16
00:00:44,280 --> 00:00:47,670
第一步是将我们的输入文本拆分为分词。
The first step is to split our input text into tokens.
17
00:00:47,670 --> 00:00:49,653
为此,我们使用 tokenize 方法。
We use the tokenize method for this.
18
00:00:50,550 --> 00:00:54,030
为此,分词器可能首先执行一些操作,
To do that, the tokenizer may first perform some operations,
19
00:00:54,030 --> 00:00:56,880
比如将所有单词小写,然后遵循一组规则
like lowercasing all words, then follow a set of rules
20
00:00:56,880 --> 00:00:59,283
将结果分成小块文本。
to split the result in small chunks of text.
21
00:01:00,480 --> 00:01:02,286
大多数 Transformer 模型使用
Most Transformer models use
22
00:01:02,286 --> 00:01:04,890
子词(subword)分词化算法,这意味着
a subword tokenization algorithm, which means
23
00:01:04,890 --> 00:01:06,750
一个给定的单词可以被拆分
that one given word can be split
24
00:01:06,750 --> 00:01:10,050
成几个分词,例如这里的 tokenize。
into several tokens, like tokenize here.
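[Note: a sketch of this first step, continuing with the bert-base-uncased tokenizer loaded above; the printed tokens are illustrative:]

    tokens = tokenizer.tokenize("Let's try to tokenize!")
    print(tokens)
    # ['let', "'", 's', 'try', 'to', 'token', '##ize', '!']
    # "tokenize" is split into "token" and "##ize" by the subword algorithm.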
25
00:01:10,050 --> 00:01:12,570
查看下面的 “分词化算法” 视频链接
Look at the "Tokenization algorithms" video link below
26
00:01:12,570 --> 00:01:13,743
了解更多信息。
for more information.
27
00:01:14,760 --> 00:01:17,820
我们在 ize 前面看到的 ## 前缀是
The ## prefix we see in front of ize is
28
00:01:17,820 --> 00:01:19,830
BERT 的约定, 用来表示
a convention used by BERT to indicate
29
00:01:19,830 --> 00:01:22,762
这个分词不是单词的开头。
this token is not the beginning of the word.
30
00:01:22,762 --> 00:01:26,310
然而,其他分词器可能使用不同的约定:
Other tokenizers may use different conventions however:
31
00:01:26,310 --> 00:01:29,984
例如,ALBERT 分词器会添加一个长下划线
for instance, ALBERT tokenizers will add a long underscore
32
00:01:29,984 --> 00:01:31,620
在所有分词前
in front of all the tokens
33
00:01:31,620 --> 00:01:34,920
即前面带有空格的那些分词,这是一个共同的约定
that had a space before them, which is a convention shared
34
00:01:34,920 --> 00:01:37,700
为所有 sentencepiece 分词器所共用。
by all sentencepiece tokenizers.
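[Note: a sketch of the same step with a sentencepiece-based tokenizer, assuming the "albert-base-v1" checkpoint; tokens that had a space before them carry the long-underscore prefix:]

    from transformers import AutoTokenizer

    albert_tokenizer = AutoTokenizer.from_pretrained("albert-base-v1")
    print(albert_tokenizer.tokenize("Let's try to tokenize!"))
    # Tokens such as '▁let', '▁try', '▁to' start with the '▁' prefix.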
35
00:01:38,580 --> 00:01:41,040
分词化管道的第二步是
The second step of the tokenization pipeline is
36
00:01:41,040 --> 00:01:43,470
将这些分词映射到它们各自的 ID
to map those tokens to their respective IDs
37
00:01:43,470 --> 00:01:45,770
由分词器的词汇表所定义的。
as defined by the vocabulary of the tokenizer.
38
00:01:46,770 --> 00:01:48,690
这就是我们需要下载文件的原因
This is why we need to download the file
39
00:01:48,690 --> 00:01:50,580
当我们实例化一个分词器时
when we instantiate a tokenizer
40
00:01:50,580 --> 00:01:52,400
使用 from_pretrained 方法。
with the from_pretrained method.
41
00:01:52,400 --> 00:01:54,390
我们必须确保我们使用相同的映射
We have to make sure we use the same mapping
42
00:01:54,390 --> 00:01:56,520
和模型被预训练时一致。
as when the model was pretrained.
43
00:01:56,520 --> 00:01:59,703
为此,我们使用 convert_tokens_to_ids 方法。
To do this, we use the convert_tokens_to_ids method.
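[Note: a sketch of this second step, continuing from the token list above:]

    ids = tokenizer.convert_tokens_to_ids(tokens)
    print(ids)
    # One integer per token, looked up in the vocabulary downloaded by
    # from_pretrained; no special tokens are added at this stage.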
44
00:02:01,050 --> 00:02:01,883
我们可能已经注意到
We may have noticed
45
00:02:01,883 --> 00:02:03,540
我们没有完全相同的结果
that we don't have the exact same results
46
00:02:03,540 --> 00:02:05,580
和第一张幻灯片中的一样;或者你没注意到,
as in our first slide, or not,
47
00:02:05,580 --> 00:02:07,920
毕竟这看起来就像一串随机数字,
as this looks like a list of random numbers anyway,
48
00:02:07,920 --> 00:02:10,680
如果是这样,请允许我帮你回忆一下。
in which case, allow me to refresh your memory.
49
00:02:10,680 --> 00:02:15,350
之前开头和结尾各有一个数字,现在它们不见了,
We had a number at the beginning and a number at the end that are missing,
50
00:02:15,350 --> 00:02:17,130
它们就是特殊分词。
those are the special tokens.
51
00:02:17,130 --> 00:02:20,340
特殊分词由 prepare_for_model 方法添加
The special tokens are added by the prepare_for_model method
52
00:02:20,340 --> 00:02:22,350
它知道这些分词的索引
which knows the indices of these tokens
53
00:02:22,350 --> 00:02:25,680
在词汇表中,并将适当的数字添加
in the vocabulary and just adds the proper numbers
54
00:02:25,680 --> 00:02:27,243
到输入 ID 列表中。
in the input IDs list.
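[Note: a sketch of this last step, continuing from the ids above:]

    final = tokenizer.prepare_for_model(ids)
    print(final["input_ids"])
    # The ids above with the special-token ids added: for BERT,
    # 101 ([CLS]) prepended and 102 ([SEP]) appended.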
55
00:02:28,590 --> 00:02:29,541
你可以查看特殊分词
You can look at the special tokens
56
00:02:29,541 --> 00:02:30,990
更一般地说,
and, more generally,
57
00:02:30,990 --> 00:02:33,870
关于分词器如何更改你的文本,
at how the tokenizer has changed your text,
58
00:02:33,870 --> 00:02:35,280
通过使用 decode 方法
by using the decode method
59
00:02:35,280 --> 00:02:37,503
在分词器对象的输出上。
on the outputs of the tokenizer object.
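[Note: a sketch of inspecting the result with decode, continuing from above:]

    print(tokenizer.decode(final["input_ids"]))
    # Something like: "[CLS] let's try to tokenize! [SEP]"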
60
00:02:38,490 --> 00:02:39,423
至于表示词首的前缀
As for the prefix for beginning
61
00:02:39,423 --> 00:02:44,160
或词中部分的前缀,以及特殊分词,它们都取决于
of words / parts of words, and the special tokens, they vary depending
62
00:02:44,160 --> 00:02:46,500
你正在使用哪个分词器。
on which tokenizer you're using.
63
00:02:46,500 --> 00:02:48,810
所以这个分词器使用 CLS 和 SEP,
So this tokenizer uses CLS and SEP,
64
00:02:48,810 --> 00:02:52,417
但是 RoBERTa 分词器使用类似 HTML 的锚点
but the RoBERTa tokenizer uses HTML-like anchors
65
00:02:52,417 --> 00:02:55,230
<s> 和 </s>
<s> and </s>.
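[Note: a sketch contrasting the special-token conventions, assuming the "roberta-base" checkpoint:]

    from transformers import AutoTokenizer

    roberta_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    enc = roberta_tokenizer("Let's try to tokenize!")
    print(roberta_tokenizer.decode(enc["input_ids"]))
    # Something like: "<s>Let's try to tokenize!</s>"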
66
00:02:55,230 --> 00:02:57,090
既然你知道分词器是如何工作的,
Now that you know how the tokenizer works,
67
00:02:57,090 --> 00:02:59,390
你可以忘记所有那些中间方法,
you can forget all those intermediate methods,
68
00:03:00,283 --> 00:03:01,650
只需记住直接调用它
and just remember that you only have to call it
69
00:03:01,650 --> 00:03:02,913
在你的输入文本上。
on your input texts.
70
00:03:03,870 --> 00:03:05,310
不过,分词器的输出并不
The output of a tokenizer doesn't
71
00:03:05,310 --> 00:03:07,853
只包含输入 ID。
just contain the input IDs, however.
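[Note: a sketch of the full output, continuing with the bert-base-uncased tokenizer; BERT-style tokenizers also return token_type_ids and an attention_mask:]

    print(tokenizer("Let's try to tokenize!"))
    # Something like:
    # {'input_ids': [101, ..., 102], 'token_type_ids': [0, ...], 'attention_mask': [1, ...]}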
72
00:03:07,853 --> 00:03:09,750
要了解注意力掩码是什么,
To learn what the attention mask is,
73
00:03:09,750 --> 00:03:12,360
查看 “一起批量输入” 视频。
check out the "Batch input together" video.
74
00:03:12,360 --> 00:03:14,220
要了解分词类型 ID,
To learn about token type IDs,
75
00:03:14,220 --> 00:03:16,570
观看 “处理句子对” 视频。
look at the "Process pairs of sentences" video.
76
00:03:18,003 --> 00:03:20,920
(物体呼啸)
(object whooshing)