1
00:00:00,418 --> 00:00:03,251
(dramatic whoosh)

2
00:00:05,340 --> 00:00:08,460
- Why are fast tokenizers called fast?

3
00:00:08,460 --> 00:00:10,950
In this video, we'll see exactly how much faster

4
00:00:10,950 --> 00:00:13,800
the so-called fast tokenizers are, compared

5
00:00:13,800 --> 00:00:15,153
to their slow counterparts.

6
00:00:16,200 --> 00:00:19,260
For this benchmark, we'll use the GLUE MNLI dataset,

7
00:00:19,260 --> 00:00:23,160
which contains 432,000 pairs of texts.

8
00:00:23,160 --> 00:00:25,890
We'll see how long it takes for the fast and slow versions

9
00:00:25,890 --> 00:00:28,143
of a BERT tokenizer to process them all.

10
00:00:29,670 --> 00:00:31,380
We define our fast and slow tokenizers

11
00:00:31,380 --> 00:00:33,717
using the AutoTokenizer API.

12
00:00:33,717 --> 00:00:37,110
The fast tokenizer is the default when available,

13
00:00:37,110 --> 00:00:40,443
so we pass use_fast=False to define the slow one.

14
00:00:41,430 --> 00:00:43,530
In a notebook, we can time the execution

15
00:00:43,530 --> 00:00:46,800
of a cell with the time magic command, like this.

16
00:00:46,800 --> 00:00:49,350
Processing the whole dataset is four times faster

17
00:00:49,350 --> 00:00:50,970
with the fast tokenizer.

18
00:00:50,970 --> 00:00:54,000
That's quicker indeed, but not very impressive.

19
00:00:54,000 --> 00:00:55,380
This is because we passed the texts

20
00:00:55,380 --> 00:00:57,240
to the tokenizer one at a time.

21
00:00:57,240 --> 00:00:59,730
This is a common mistake with fast tokenizers,

22
00:00:59,730 --> 00:01:02,550
which are backed by Rust and thus able to parallelize

23
00:01:02,550 --> 00:01:05,370
the tokenization of multiple texts.

24
00:01:05,370 --> 00:01:07,290
Passing them only one text at a time

25
00:01:07,290 --> 00:01:09,720
is like sending a cargo ship between two continents

26
00:01:09,720 --> 00:01:13,140
with just one container: it's very inefficient.

27
00:01:13,140 --> 00:01:15,810
To unleash the full speed of our fast tokenizers,

28
00:01:15,810 --> 00:01:18,840
we need to send them batches of texts, which we can do

29
00:01:18,840 --> 00:01:21,423
with the batched=True argument of the map method.

30
00:01:22,620 --> 00:01:25,950
Now those are impressive results: the fast tokenizer

31
00:01:25,950 --> 00:01:28,410
takes 12 seconds to process a dataset that takes

32
00:01:28,410 --> 00:01:30,093
4 minutes with the slow tokenizer.

33
00:01:31,440 --> 00:01:33,510
Summarizing the results in this table,

34
00:01:33,510 --> 00:01:36,630
you can see why we call these tokenizers fast.

35
00:01:36,630 --> 00:01:38,760
And this is only for tokenizing texts.

36
00:01:38,760 --> 00:01:40,710
If you ever need to train a new tokenizer,

37
00:01:40,710 --> 00:01:42,523
they do this very quickly too.
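For readers who want to reproduce the benchmark described in the video, here is a minimal sketch. It assumes the `transformers` and `datasets` libraries are installed; the checkpoint name "bert-base-cased" and the use of time.time() (in place of the notebook time magic shown on screen) are assumptions, since the video does not name the exact checkpoint or show the script form.

```python
# Minimal sketch of the fast-vs-slow tokenizer benchmark (assumptions noted above).
import time

from datasets import load_dataset
from transformers import AutoTokenizer

# GLUE MNLI: roughly 432,000 pairs of texts across its splits.
raw_datasets = load_dataset("glue", "mnli")

# The fast (Rust-backed) tokenizer is the default when available;
# use_fast=False gives the slow, pure-Python implementation.
fast_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)


def tokenize_function(examples, tokenizer):
    # Each MNLI example is a premise/hypothesis pair.
    return tokenizer(examples["premise"], examples["hypothesis"], truncation=True)


def benchmark(tokenizer, batched):
    """Time a full pass of the tokenizer over the dataset via Dataset.map."""
    start = time.time()
    raw_datasets.map(
        lambda examples: tokenize_function(examples, tokenizer),
        batched=batched,              # batched=True sends lists of texts at once
        load_from_cache_file=False,   # re-run instead of reusing a cached result
    )
    return time.time() - start


# One text at a time: the fast tokenizer is only about 4x faster.
print("slow, one at a time:", benchmark(slow_tokenizer, batched=False))
print("fast, one at a time:", benchmark(fast_tokenizer, batched=False))

# Batched: the fast tokenizer can parallelize across texts and pulls far ahead
# (roughly 12 seconds vs. 4 minutes in the run shown in the video).
print("slow, batched:", benchmark(slow_tokenizer, batched=True))
print("fast, batched:", benchmark(fast_tokenizer, batched=True))
```

Exact timings will vary with hardware; the point of the sketch is the pattern: pass batches of texts (batched=True) so the Rust-backed fast tokenizer can work on many texts at once instead of one container per cargo ship.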