1
00:00:00,418 --> 00:00:03,251
(dramatic whoosh)
2
00:00:05,340 --> 00:00:08,460
- Why are fast tokenizers called fast?
3
00:00:08,460 --> 00:00:10,950
In this video, we'll see
exactly how much faster,
4
00:00:10,950 --> 00:00:13,800
the so-called fast
tokenizers are, compared
5
00:00:13,800 --> 00:00:15,153
to their slow counterparts.
6
00:00:16,200 --> 00:00:19,260
For this benchmark, we'll
use the GLUE MNLI dataset
7
00:00:19,260 --> 00:00:23,160
which contains 432,000 pairs of texts.
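(For reference, a minimal sketch of loading that dataset with the datasets library; the "glue"/"mnli" identifiers are an assumption, they are not shown in these subtitles.)

from datasets import load_dataset

# GLUE MNLI: roughly 432,000 pairs of texts across its splits
raw_datasets = load_dataset("glue", "mnli")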
8
00:00:23,160 --> 00:00:25,890
We'll see how long it takes
for the fast and slow versions
9
00:00:25,890 --> 00:00:28,143
of a BERT tokenizer to process them all.
10
00:00:29,670 --> 00:00:31,380
We define our fast and
slow tokenizers
11
00:00:31,380 --> 00:00:33,717
using the AutoTokenizer API.
12
00:00:33,717 --> 00:00:37,110
The fast tokenizer is the
default when available,
13
00:00:37,110 --> 00:00:40,443
so we pass along use_fast=False
to define the slow one.
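(A minimal sketch of defining the two tokenizers; the bert-base-cased checkpoint is an assumption.)

from transformers import AutoTokenizer

checkpoint = "bert-base-cased"

# The Rust-backed fast tokenizer is returned by default when available
fast_tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# use_fast=False forces the pure-Python slow implementation
slow_tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=False)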
14
00:00:41,430 --> 00:00:43,530
In a notebook, we can time the execution
15
00:00:43,530 --> 00:00:46,800
of a cell with the time
magic command, like this.
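(A sketch of timing the one-text-at-a-time case in a notebook; the function name and the premise/hypothesis columns are assumptions based on MNLI.)

def tokenize_one(example):
    # Called once per row: each call tokenizes a single pair of texts
    return fast_tokenizer(example["premise"], example["hypothesis"])

%time tokenized = raw_datasets["train"].map(tokenize_one)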
16
00:00:46,800 --> 00:00:49,350
Processing the whole data
set is four times faster
17
00:00:49,350 --> 00:00:50,970
with a fast tokenizer.
18
00:00:50,970 --> 00:00:54,000
That's quicker indeed,
but not very impressive.
19
00:00:54,000 --> 00:00:55,380
This is because we passed along the texts
20
00:00:55,380 --> 00:00:57,240
to the tokenizer one at a time.
21
00:00:57,240 --> 00:00:59,730
This is a common mistake
to make with fast tokenizers,
22
00:00:59,730 --> 00:01:02,550
which are backed by Rust,
and thus able to parallelize
23
00:01:02,550 --> 00:01:05,370
the tokenization of multiple texts.
24
00:01:05,370 --> 00:01:07,290
Passing them only one text at a time,
25
00:01:07,290 --> 00:01:09,720
is like sending a cargo
ship between two continents
26
00:01:09,720 --> 00:01:13,140
with just one container,
it's very inefficient.
27
00:01:13,140 --> 00:01:15,810
To unleash the full speed
of our fast tokenizers,
28
00:01:15,810 --> 00:01:18,840
we need to send them batches
of texts, which we can do
29
00:01:18,840 --> 00:01:21,423
with the batched=True
argument of the map method.
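(A sketch of the batched call; with batched=True the Rust backend can tokenize many texts in parallel. Function and column names are assumptions.)

def tokenize_batch(examples):
    # With batched=True, `examples` holds lists of values for many rows at once
    return fast_tokenizer(examples["premise"], examples["hypothesis"])

%time tokenized = raw_datasets["train"].map(tokenize_batch, batched=True)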
30
00:01:22,620 --> 00:01:25,950
Now those are impressive
results, as the fast tokenizer
31
00:01:25,950 --> 00:01:28,410
takes 12 seconds to process
a dataset that takes four
32
00:01:28,410 --> 00:01:30,093
minute to the slow tokenizer.
33
00:01:31,440 --> 00:01:33,510
Summarizing the results in this table,
34
00:01:33,510 --> 00:01:36,630
you can see why we have
called those tokenizers fast.
35
00:01:36,630 --> 00:01:38,760
And this is only for tokenizing texts.
36
00:01:38,760 --> 00:01:40,710
If you ever need to train a new tokenizer,
37
00:01:40,710 --> 00:01:42,523
they do this very quickly too.
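(A minimal sketch of training a new tokenizer from an existing fast one with train_new_from_iterator; the corpus column and vocabulary size are assumptions.)

def get_training_corpus():
    dataset = raw_datasets["train"]
    # Yield the corpus in chunks of 1,000 texts
    for start in range(0, len(dataset), 1000):
        yield dataset[start : start + 1000]["premise"]

new_tokenizer = fast_tokenizer.train_new_from_iterator(
    get_training_corpus(), vocab_size=30522
)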