1
00:00:00,418 --> 00:00:03,251
(dramatic whoosh)
2
00:00:05,340 --> 00:00:08,460
- Why are fast tokenizers called fast?
3
00:00:08,460 --> 00:00:10,950
In this video, we'll see
exactly how much faster,
4
00:00:10,950 --> 00:00:13,800
the so-called fast
tokenizers are, compared
5
00:00:13,800 --> 00:00:15,153
to their slow counterparts.
6
00:00:16,200 --> 00:00:19,260
For this benchmark, we'll
use the GLUE MNLI dataset
7
00:00:19,260 --> 00:00:23,160
which contains 432,000 pairs of texts.
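(For reference, a minimal sketch of loading that dataset with the datasets library; the "glue"/"mnli" identifiers are an assumption, they are not shown in these subtitles.)

from datasets import load_dataset

# GLUE MNLI: roughly 432,000 pairs of texts across its splits
raw_datasets = load_dataset("glue", "mnli")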
8
00:00:23,160 --> 00:00:25,890
We'll see how long it takes
for the fast and slow versions
9
00:00:25,890 --> 00:00:28,143
of a BERT tokenizer to process them all.
10
00:00:29,670 --> 00:00:31,380
We define our fast and
slow tokenizers
11
00:00:31,380 --> 00:00:33,717
using the AutoTokenizer API.
12
00:00:33,717 --> 00:00:37,110
The fast tokenizer is the
default when available,
13
00:00:37,110 --> 00:00:40,443
so we pass along use_fast=False
to define the slow one.
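(A minimal sketch of defining the two tokenizers; the bert-base-cased checkpoint is an assumption.)

from transformers import AutoTokenizer

checkpoint = "bert-base-cased"

# The Rust-backed fast tokenizer is returned by default when available
fast_tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# use_fast=False forces the pure-Python slow implementation
slow_tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=False)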
14
00:00:41,430 --> 00:00:43,530
In a notebook, we can time the execution
15
00:00:43,530 --> 00:00:46,800
of a cell with the time
magic command, like this.
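(A sketch of timing the one-text-at-a-time case in a notebook; the function name and the premise/hypothesis columns are assumptions based on MNLI.)

def tokenize_one(example):
    # Called once per row: each call tokenizes a single pair of texts
    return fast_tokenizer(example["premise"], example["hypothesis"])

%time tokenized = raw_datasets["train"].map(tokenize_one)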
16
00:00:46,800 --> 00:00:49,350
Processing the whole data
set is four times faster
17
00:00:49,350 --> 00:00:50,970
with a fast tokenizer.
18
00:00:50,970 --> 00:00:54,000
That's quicker indeed,
but not very impressive.
19
00:00:54,000 --> 00:00:55,380
This is because we passed along the texts
20
00:00:55,380 --> 00:00:57,240
to the tokenizer one at a time.
21
00:00:57,240 --> 00:00:59,730
This is a common mistake
to make with fast tokenizers,
22
00:00:59,730 --> 00:01:02,550
which are backed by Rust,
and thus able to parallelize
23
00:01:02,550 --> 00:01:05,370
the tokenization of multiple texts.
24
00:01:05,370 --> 00:01:07,290
Passing them only one text at a time,
25
00:01:07,290 --> 00:01:09,720
is like sending a cargo
ship between two continents
26
00:01:09,720 --> 00:01:13,140
with just one container,
it's very inefficient.
27
00:01:13,140 --> 00:01:15,810
To unleash the full speed
of our fast tokenizers,
28
00:01:15,810 --> 00:01:18,840
we need to send them batches
of texts, which we can do
29
00:01:18,840 --> 00:01:21,423
with the batched=True
argument of the map method.
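(A sketch of the batched call; with batched=True the Rust backend can tokenize many texts in parallel. Function and column names are assumptions.)

def tokenize_batch(examples):
    # With batched=True, `examples` holds lists of values for many rows at once
    return fast_tokenizer(examples["premise"], examples["hypothesis"])

%time tokenized = raw_datasets["train"].map(tokenize_batch, batched=True)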
30
00:01:22,620 --> 00:01:25,950
Now those are impressive
results, as the fast tokenizer
31
00:01:25,950 --> 00:01:28,410
takes 12 seconds to process
a dataset that takes four
32
00:01:28,410 --> 00:01:30,093
minute to the slow tokenizer.
33
00:01:31,440 --> 00:01:33,510
Summarizing the results in this table,
34
00:01:33,510 --> 00:01:36,630
you can see why we have
called those tokenizers fast.
35
00:01:36,630 --> 00:01:38,760
And this is only for tokenizing texts.
36
00:01:38,760 --> 00:01:40,710
If you ever need to train a new tokenizer,
37
00:01:40,710 --> 00:01:42,523
they do this very quickly too.
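(A minimal sketch of training a new tokenizer from an existing fast one with train_new_from_iterator; the corpus column and vocabulary size are assumptions.)

def get_training_corpus():
    dataset = raw_datasets["train"]
    # Yield the corpus in chunks of 1,000 texts
    for start in range(0, len(dataset), 1000):
        yield dataset[start : start + 1000]["premise"]

new_tokenizer = fast_tokenizer.train_new_from_iterator(
    get_training_corpus(), vocab_size=30522
)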