subtitles/en/39_memory-mapping-&-streaming.srt

1
00:00:00,511 --> 00:00:01,784
(air whooshing)

2
00:00:01,784 --> 00:00:02,964
(logo popping)

3
00:00:02,964 --> 00:00:05,640
(metal sliding)

4
00:00:05,640 --> 00:00:07,203
- Memory mapping and streaming.

5
00:00:08,040 --> 00:00:09,180
In this video, we'll take a look

6
00:00:09,180 --> 00:00:11,520
at two core features of the Datasets library

7
00:00:11,520 --> 00:00:14,220
that allow you to load and process huge datasets

8
00:00:14,220 --> 00:00:16,263
without blowing up your laptop's RAM.

9
00:00:18,300 --> 00:00:20,280
Nowadays, it's not uncommon to find yourself

10
00:00:20,280 --> 00:00:22,950
working with multi-GB datasets,

11
00:00:22,950 --> 00:00:24,420
especially if you're planning to pretrain

12
00:00:24,420 --> 00:00:28,110
a transformer like BERT or GPT-2 from scratch.

13
00:00:28,110 --> 00:00:31,260
In these cases, even loading the data can be a challenge.

14
00:00:31,260 --> 00:00:34,680
For example, the C4 corpus used to pretrain T5

15
00:00:34,680 --> 00:00:36,903
consists of over two terabytes of data.

16
00:00:38,400 --> 00:00:40,050
To handle these large datasets,

17
00:00:40,050 --> 00:00:42,990
the Datasets library is built on two core features:

18
00:00:42,990 --> 00:00:46,350
the Apache Arrow format and a streaming API.

19
00:00:46,350 --> 00:00:49,110
Arrow is designed for high-performance data processing

20
00:00:49,110 --> 00:00:51,360
and represents each table-like dataset

21
00:00:51,360 --> 00:00:52,773
with a column-based format.

22
00:00:53,730 --> 00:00:56,130
As you can see in this example, column-based formats

23
00:00:56,130 --> 00:00:59,280
group the elements of a table in consecutive blocks of RAM,

24
00:00:59,280 --> 00:01:01,563
and this unlocks fast access and processing.

25
00:01:02,760 --> 00:01:05,550
Arrow is great at processing data at any scale,

26
00:01:05,550 --> 00:01:07,110
but some datasets are so large

27
00:01:07,110 --> 00:01:09,600
that you can't even fit them on your hard disk.

28
00:01:09,600 --> 00:01:11,730
So for these cases, the Datasets library provides

29
00:01:11,730 --> 00:01:14,820
a streaming API that allows you to progressively download

30
00:01:14,820 --> 00:01:17,700
the raw data one element at a time.

31
00:01:17,700 --> 00:01:20,430
The result is a special object called an IterableDataset

32
00:01:20,430 --> 00:01:22,180
that we'll see in more detail soon.

33
00:01:23,700 --> 00:01:26,670
Let's start by looking at why Arrow is so powerful.

34
00:01:26,670 --> 00:01:28,860
The first feature is that it treats every dataset

35
00:01:28,860 --> 00:01:30,153
as a memory-mapped file.

36
00:01:31,020 --> 00:01:32,430
Now, memory mapping is a mechanism

37
00:01:32,430 --> 00:01:35,400
that maps a portion of a file, or an entire file on disk,

38
00:01:35,400 --> 00:01:37,410
to a chunk of virtual memory.

39
00:01:37,410 --> 00:01:38,520
This allows applications

40
00:01:38,520 --> 00:01:41,280
to access segments of an extremely large file

41
00:01:41,280 --> 00:01:44,080
without having to read the whole file into memory first.

42
00:01:45,150 --> 00:01:48,120
Another cool feature of Arrow's memory mapping capabilities

43
00:01:48,120 --> 00:01:49,860
is that it allows multiple processes

44
00:01:49,860 --> 00:01:51,840
to work with the same large dataset

45
00:01:51,840 --> 00:01:54,333
without moving it or copying it in any way.

46
00:01:55,680 --> 00:01:57,570
This zero-copy feature of Arrow

47
00:01:57,570 --> 00:02:00,600
makes it extremely fast for iterating over a dataset.
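A rough sketch of the kind of timing experiment the next cues refer to. The data file, column layout, and batch size are placeholders for illustration, not the exact code shown on the video's slide; any large dataset loaded with load_dataset() is cached as memory-mapped Arrow files and can be iterated this way.

```python
import timeit

from datasets import load_dataset

# Hypothetical large corpus stored as compressed JSON Lines: Datasets
# downloads it once, converts it to Arrow files on disk, and memory-maps
# those files, so iterating does not load everything into RAM.
dataset = load_dataset(
    "json", data_files="pubmed_abstracts.jsonl.gz", split="train"
)

def iterate_all():
    # Read the whole table in slices; zero-copy memory mapping keeps this
    # fast even when the dataset has millions of rows.
    batch_size = 1000
    for idx in range(0, len(dataset), batch_size):
        _ = dataset[idx : idx + batch_size]

print(f"Iterated over {len(dataset)} rows in {timeit.timeit(iterate_all, number=1):.1f}s")
```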
48
00:02:00,600 --> 00:02:02,640
In this example, you can see that we iterate

49
00:02:02,640 --> 00:02:05,160
over 15 million rows in about a minute

50
00:02:05,160 --> 00:02:06,780
just using a standard laptop.

51
00:02:06,780 --> 00:02:08,080
That's not too bad at all.

52
00:02:09,750 --> 00:02:12,660
Let's now take a look at how we can stream a large dataset.

53
00:02:12,660 --> 00:02:14,520
The only change you need to make is to set

54
00:02:14,520 --> 00:02:17,910
the streaming=True argument in the load_dataset() function.

55
00:02:17,910 --> 00:02:20,580
This will return a special IterableDataset object,

56
00:02:20,580 --> 00:02:22,260
which is a bit different from the Dataset objects

57
00:02:22,260 --> 00:02:24,330
we've seen in other videos.

58
00:02:24,330 --> 00:02:25,980
This object is an iterable,

59
00:02:25,980 --> 00:02:28,530
which means we can't index it to access elements,

60
00:02:28,530 --> 00:02:30,180
but instead we iterate over it

61
00:02:30,180 --> 00:02:32,850
using the iter() and next() functions.

62
00:02:32,850 --> 00:02:34,050
This will download and access

63
00:02:34,050 --> 00:02:35,850
a single example from the dataset,

64
00:02:35,850 --> 00:02:37,410
which means you can progressively iterate

65
00:02:37,410 --> 00:02:40,360
through a huge dataset without having to download it first.

66
00:02:42,150 --> 00:02:43,590
Tokenizing text with the map() method

67
00:02:43,590 --> 00:02:45,660
also works in a similar way.

68
00:02:45,660 --> 00:02:47,160
We first stream the dataset

69
00:02:47,160 --> 00:02:49,830
and then apply the map() method with the tokenizer.

70
00:02:49,830 --> 00:02:53,283
To get the first tokenized example, we apply iter() and next().

71
00:02:54,750 --> 00:02:57,210
The main difference with an IterableDataset is that,

72
00:02:57,210 --> 00:02:59,970
instead of using the select() method to return examples,

73
00:02:59,970 --> 00:03:01,530
we use the take() and skip() methods,

74
00:03:01,530 --> 00:03:03,573
because we can't index into the dataset.

75
00:03:04,470 --> 00:03:05,460
The take() method returns

76
00:03:05,460 --> 00:03:07,500
the first N examples in the dataset,

77
00:03:07,500 --> 00:03:09,270
while skip(), as you can imagine,

78
00:03:09,270 --> 00:03:12,480
skips the first N and returns the rest.

79
00:03:12,480 --> 00:03:15,300
You can see examples of both of these methods in action,

80
00:03:15,300 --> 00:03:16,710
where we create a validation set

81
00:03:16,710 --> 00:03:18,660
from the first 1,000 examples

82
00:03:18,660 --> 00:03:21,010
and then skip those to create the training set.

83
00:03:23,012 --> 00:03:25,762
(air whooshing)
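An end-to-end sketch of the streaming workflow described in the cues above. The data file, the "text" column, and the tokenizer checkpoint are assumptions chosen for illustration; streaming=True, iter()/next(), lazy map(), and take()/skip() are the pieces the video actually demonstrates.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# streaming=True returns an IterableDataset: examples are downloaded
# lazily, one at a time, instead of being cached on disk first.
# The data_files path is a placeholder for any large corpus.
streamed = load_dataset(
    "json", data_files="pubmed_abstracts.jsonl.gz", split="train", streaming=True
)

# An IterableDataset can't be indexed; pull examples with iter()/next().
first_example = next(iter(streamed))

# map() is applied lazily, so each example is tokenized as it streams by.
# The "text" column and checkpoint are assumptions for this sketch.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
tokenized = streamed.map(lambda example: tokenizer(example["text"]))
print(next(iter(tokenized)))

# take() and skip() stand in for select(): the first 1,000 examples form
# a validation set, and everything after them forms the training set.
validation_dataset = streamed.take(1000)
train_dataset = streamed.skip(1000)
```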