1
00:00:00,418 --> 00:00:03,251
(dramatic whoosh)

2
00:00:05,340 --> 00:00:08,460
- Why are fast tokenizers called fast?

3
00:00:08,460 --> 00:00:10,950
In this video, we'll see exactly how much faster

4
00:00:10,950 --> 00:00:13,800
the so-called fast tokenizers are, compared

5
00:00:13,800 --> 00:00:15,153
to their slow counterparts.

6
00:00:16,200 --> 00:00:19,260
For this benchmark, we'll use the GLUE MNLI dataset,

7
00:00:19,260 --> 00:00:23,160
which contains 432,000 pairs of texts.

8
00:00:23,160 --> 00:00:25,890
We'll see how long it takes for the fast and slow versions

9
00:00:25,890 --> 00:00:28,143
of a BERT tokenizer to process them all.

10
00:00:29,670 --> 00:00:31,380
We define our fast and slow tokenizers

11
00:00:31,380 --> 00:00:33,717
using the AutoTokenizer API.

12
00:00:33,717 --> 00:00:37,110
The fast tokenizer is the default when available,

13
00:00:37,110 --> 00:00:40,443
so we pass use_fast=False to define the slow one.

14
00:00:41,430 --> 00:00:43,530
In a notebook, we can time the execution

15
00:00:43,530 --> 00:00:46,800
of a cell with the time magic command, like this.

16
00:00:46,800 --> 00:00:49,350
Processing the whole dataset is four times faster

17
00:00:49,350 --> 00:00:50,970
with the fast tokenizer.

18
00:00:50,970 --> 00:00:54,000
That's quicker indeed, but not very impressive.

19
00:00:54,000 --> 00:00:55,380
This is because we passed the texts

20
00:00:55,380 --> 00:00:57,240
to the tokenizer one at a time.

21
00:00:57,240 --> 00:00:59,730
This is a common mistake with fast tokenizers,

22
00:00:59,730 --> 00:01:02,550
which are backed by Rust and thus able to parallelize

23
00:01:02,550 --> 00:01:05,370
the tokenization of multiple texts.

24
00:01:05,370 --> 00:01:07,290
Passing them only one text at a time

25
00:01:07,290 --> 00:01:09,720
is like sending a cargo ship between two continents

26
00:01:09,720 --> 00:01:13,140
with just one container: it's very inefficient.

27
00:01:13,140 --> 00:01:15,810
To unleash the full speed of our fast tokenizers,

28
00:01:15,810 --> 00:01:18,840
we need to send them batches of texts, which we can do

29
00:01:18,840 --> 00:01:21,423
with the batched=True argument of the map method.

30
00:01:22,620 --> 00:01:25,950
Now those are impressive results: the fast tokenizer

31
00:01:25,950 --> 00:01:28,410
takes 12 seconds to process a dataset that takes

32
00:01:28,410 --> 00:01:30,093
4 minutes with the slow tokenizer.

33
00:01:31,440 --> 00:01:33,510
Summarizing the results in this table,

34
00:01:33,510 --> 00:01:36,630
you can see why we call these tokenizers fast.

35
00:01:36,630 --> 00:01:38,760
And this is only for tokenizing texts.

36
00:01:38,760 --> 00:01:40,710
If you ever need to train a new tokenizer,

37
00:01:40,710 --> 00:01:42,523
they do this very quickly too.
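For readers who want to reproduce the benchmark described in the video, here is a minimal sketch. It assumes the `transformers` and `datasets` libraries are installed; the checkpoint name "bert-base-cased" and the use of time.time() (in place of the notebook time magic shown on screen) are assumptions, since the video does not name the exact checkpoint or show the script form.

```python
# Minimal sketch of the fast-vs-slow tokenizer benchmark (assumptions noted above).
import time

from datasets import load_dataset
from transformers import AutoTokenizer

# GLUE MNLI: roughly 432,000 pairs of texts across its splits.
raw_datasets = load_dataset("glue", "mnli")

# The fast (Rust-backed) tokenizer is the default when available;
# use_fast=False gives the slow, pure-Python implementation.
fast_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)


def tokenize_function(examples, tokenizer):
    # Each MNLI example is a premise/hypothesis pair.
    return tokenizer(examples["premise"], examples["hypothesis"], truncation=True)


def benchmark(tokenizer, batched):
    """Time a full pass of the tokenizer over the dataset via Dataset.map."""
    start = time.time()
    raw_datasets.map(
        lambda examples: tokenize_function(examples, tokenizer),
        batched=batched,              # batched=True sends lists of texts at once
        load_from_cache_file=False,   # re-run instead of reusing a cached result
    )
    return time.time() - start


# One text at a time: the fast tokenizer is only about 4x faster.
print("slow, one at a time:", benchmark(slow_tokenizer, batched=False))
print("fast, one at a time:", benchmark(fast_tokenizer, batched=False))

# Batched: the fast tokenizer can parallelize across texts and pulls far ahead
# (roughly 12 seconds vs. 4 minutes in the run shown in the video).
print("slow, batched:", benchmark(slow_tokenizer, batched=True))
print("fast, batched:", benchmark(fast_tokenizer, batched=True))
```

Exact timings will vary with hardware; the point of the sketch is the pattern: pass batches of texts (batched=True) so the Rust-backed fast tokenizer can work on many texts at once instead of one container per cargo ship.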