subtitles/zh-CN/43_why-are-fast-tokenizers-called-fast.srt
1
00:00:00,418 --> 00:00:03,251
(戏剧性的嗖嗖声)
(dramatic whoosh)
2
00:00:05,340 --> 00:00:08,460
- 为什么快速 tokenizer 被称为 "快速"?
- Why are fast tokenizers called fast?
3
00:00:08,460 --> 00:00:10,950
在这个视频中,我们将确切看到
In this video, we'll see exactly how much faster
4
00:00:10,950 --> 00:00:13,800
所谓的快速 tokenizer 究竟
the so-called fast tokenizers are
5
00:00:13,800 --> 00:00:15,153
比它们的慢速版本快多少。
compared to their slow counterparts.
6
00:00:16,200 --> 00:00:19,260
对于这个基准测试,我们将使用 GLUE MNLI 数据集
For this benchmark, we'll use the GLUE MNLI dataset
7
00:00:19,260 --> 00:00:23,160
其中包含 432,000 对文本。
which contains 432,000 pairs of text.
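A minimal sketch of loading this dataset with the Datasets library (the exact code shown on screen in the video is assumed; "glue"/"mnli" are the standard Hub configuration names):

from datasets import load_dataset

# GLUE MNLI: roughly 432,000 premise/hypothesis pairs across its splits.
raw_datasets = load_dataset("glue", "mnli")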
8
00:00:23,160 --> 00:00:25,890
我们将看看 BERT tokenizer 的快速和慢速版本
We'll see how long it takes for the fast and slow versions
9
00:00:25,890 --> 00:00:28,143
处理完全部文本各需要多长时间。
of a BERT tokenizer to process them all.
10
00:00:29,670 --> 00:00:31,380
我们定义快速和慢速两个 tokenizer
We define our fast and slow tokenizers
11
00:00:31,380 --> 00:00:33,717
使用 AutoTokenizer API。
using the AutoTokenizer API.
12
00:00:33,717 --> 00:00:37,110
快速 tokenizer 在可用时是默认选项。
The fast tokenizer is the default when available.
13
00:00:37,110 --> 00:00:40,443
所以我们传入 use_fast=False 来定义慢速版本。
So we pass along use_fast=False to define the slow one.
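A sketch of the two tokenizer definitions described above (the "bert-base-cased" checkpoint is an assumption; any BERT checkpoint works the same way):

from transformers import AutoTokenizer

# The fast, Rust-backed tokenizer is the default when available.
fast_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# use_fast=False forces the slow, pure-Python implementation.
slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)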
14
00:00:41,430 --> 00:00:43,530
在 notebook 中,我们可以用 time 魔术命令
In a notebook, we can time the execution
15
00:00:43,530 --> 00:00:46,800
像这样为单元格的执行计时。
of a cell with the time magic command, like this.
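The timed cell looks roughly like this (a sketch; tokenize_function and the column names "premise"/"hypothesis" are assumptions based on the MNLI schema, and %time is an IPython magic):

def tokenize_function(example):
    # One premise/hypothesis pair per call: no batching yet.
    return fast_tokenizer(example["premise"], example["hypothesis"])

# %time reports how long the statement takes to run.
# Swap in slow_tokenizer to time the slow version the same way.
%time tokenized = raw_datasets["train"].map(tokenize_function)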
16
00:00:46,800 --> 00:00:49,350
处理整个数据集快四倍
Processing the whole dataset is four times faster
17
00:00:49,350 --> 00:00:50,970
使用快速 tokenizer 。
with a fast tokenizer.
18
00:00:50,970 --> 00:00:54,000
确实更快,但不算非常惊人。
That's quicker indeed, but not very impressive.
19
00:00:54,000 --> 00:00:55,380
这是因为我们传递了文本
This is because we passed along the texts
20
00:00:55,380 --> 00:00:57,240
一次一个地传给 tokenizer 。
to the tokenizer one at a time.
21
00:00:57,240 --> 00:00:59,730
这是使用快速 tokenizer 时的一个常见错误
This is a common mistake with fast tokenizers
22
00:00:59,730 --> 00:01:02,550
它们由 Rust 支持,因此能够并行执行
which are backed by Rust, and thus able to parallelize
23
00:01:02,550 --> 00:01:05,370
多个文本的 tokenization 。
the tokenization of multiple texts.
24
00:01:05,370 --> 00:01:07,290
一次只传给它们一个文本,
Passing them only one text at a time,
25
00:01:07,290 --> 00:01:09,720
就像在两大洲之间发送一艘货船
is like sending a cargo ship between two continents
26
00:01:09,720 --> 00:01:13,140
只装一个集装箱,效率非常低。
with just one container, it's very inefficient.
27
00:01:13,140 --> 00:01:15,810
为了释放我们快速 tokenizer 的全部速度,
To unleash the full speed of our fast tokenizers,
28
00:01:15,810 --> 00:01:18,840
我们需要向它们发送成批的文本,这可以
we need to send them batches of texts, which we can do
29
00:01:18,840 --> 00:01:21,423
通过 map 方法的 batched=True 参数来实现。
with the batched=True argument of the map method.
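A sketch of the batched version (same assumed column names as above):

def tokenize_function(examples):
    # With batched=True, `examples` holds lists of texts, which the
    # Rust-backed tokenizer can encode in parallel.
    return fast_tokenizer(examples["premise"], examples["hypothesis"])

%time tokenized = raw_datasets["train"].map(tokenize_function, batched=True)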
30
00:01:22,620 --> 00:01:25,950
现在这些结果令人印象深刻:快速 tokenizer
Now those are impressive results: the fast tokenizer
31
00:01:25,950 --> 00:01:28,410
只需 12 秒就能处理完这个数据集,而慢速
takes 12 seconds to process the dataset that takes four
32
00:01:28,410 --> 00:01:30,093
tokenizer 则需要 4 分钟。
minutes for the slow tokenizer.
33
00:01:31,440 --> 00:01:33,510
结果总结于此表中,
Summarizing the results in this table,
34
00:01:33,510 --> 00:01:36,630
你就能看出为什么我们称这些 tokenizer 为快速了。
you can see why we have called those tokenizers fast.
35
00:01:36,630 --> 00:01:38,760
而这还只是对文本做 tokenize 而已。
And this is only for tokenizing texts.
36
00:01:38,760 --> 00:01:40,710
如果你需要训练一个新的 tokenizer ,
If you ever need to train a new tokenizer,
37
00:01:40,710 --> 00:01:42,523
它们做这件事也非常快。
they do this very quickly too.
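For instance, fast tokenizers expose train_new_from_iterator, which retrains the tokenization algorithm on a new corpus (a sketch; the corpus construction and the vocab_size value are assumptions):

# Stream the training texts in chunks instead of loading them all at once.
def get_training_corpus():
    train = raw_datasets["train"]
    for i in range(0, len(train), 1000):
        yield train[i : i + 1000]["premise"]

# train_new_from_iterator is only available on fast tokenizers.
new_tokenizer = fast_tokenizer.train_new_from_iterator(
    get_training_corpus(), vocab_size=25000
)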