subtitles/en/39_memory-mapping-&-streaming.srt
1
00:00:00,511 --> 00:00:01,784
(air whooshing)
2
00:00:01,784 --> 00:00:02,964
(logo popping)
3
00:00:02,964 --> 00:00:05,640
(metal sliding)
4
00:00:05,640 --> 00:00:07,203
- Memory mapping and streaming.
5
00:00:08,040 --> 00:00:09,180
In this video, we'll take a look
6
00:00:09,180 --> 00:00:11,520
at two core features
of the Datasets library
7
00:00:11,520 --> 00:00:14,220
that allow you to load
and process huge datasets
8
00:00:14,220 --> 00:00:16,263
without blowing up your laptop's RAM.
9
00:00:18,300 --> 00:00:20,280
Nowadays, it's not
uncommon to find yourself
10
00:00:20,280 --> 00:00:22,950
working with multi-gigabyte datasets,
11
00:00:22,950 --> 00:00:24,420
especially if you're planning to pretrain
12
00:00:24,420 --> 00:00:28,110
a transformer like BERT
or GPT-2 from scratch.
13
00:00:28,110 --> 00:00:31,260
In these cases, even loading
the data can be a challenge.
14
00:00:31,260 --> 00:00:34,680
For example, the C4
corpus used to pretrain T5
15
00:00:34,680 --> 00:00:36,903
consists of over two terabytes of data.
16
00:00:38,400 --> 00:00:40,050
To handle these large datasets,
17
00:00:40,050 --> 00:00:42,990
the Datasets library is
built on two core features:
18
00:00:42,990 --> 00:00:46,350
the Apache Arrow format
and a streaming API.
19
00:00:46,350 --> 00:00:49,110
Arrow is designed for
high-performance data processing
20
00:00:49,110 --> 00:00:51,360
and represents each table-like dataset
21
00:00:51,360 --> 00:00:52,773
with a column-based format.
22
00:00:53,730 --> 00:00:56,130
As you can see in this
example, column-based formats
23
00:00:56,130 --> 00:00:59,280
group the elements of a table
in consecutive blocks of RAM
24
00:00:59,280 --> 00:01:01,563
and this unlocks fast
access and processing.
25
00:01:02,760 --> 00:01:05,550
Arrow is great at
processing data at any scale
26
00:01:05,550 --> 00:01:07,110
but some datasets are so large
27
00:01:07,110 --> 00:01:09,600
that you can't even fit
them on your hard disk.
28
00:01:09,600 --> 00:01:11,730
So for these cases, the
Datasets library provides
29
00:01:11,730 --> 00:01:14,820
a streaming API that allows
you to progressively download
30
00:01:14,820 --> 00:01:17,700
the raw data one element at a time.
31
00:01:17,700 --> 00:01:20,430
The result is a special object
called an IterableDataset
32
00:01:20,430 --> 00:01:22,180
that we'll see in more detail soon.
33
00:01:23,700 --> 00:01:26,670
Let's start by looking at
why Arrow is so powerful.
34
00:01:26,670 --> 00:01:28,860
The first feature is that
it treats every dataset
35
00:01:28,860 --> 00:01:30,153
as a memory-mapped file.
36
00:01:31,020 --> 00:01:32,430
Now, memory mapping is a mechanism
37
00:01:32,430 --> 00:01:35,400
that maps a portion of a file,
or an entire file on disk,
38
00:01:35,400 --> 00:01:37,410
to a chunk of virtual memory.
39
00:01:37,410 --> 00:01:38,520
This allows applications
40
00:01:38,520 --> 00:01:41,280
to access segments of
an extremely large file
41
00:01:41,280 --> 00:01:44,080
without having to read the
whole file into memory first.
42
00:01:45,150 --> 00:01:48,120
Another cool feature of Arrow's
memory mapping capabilities
43
00:01:48,120 --> 00:01:49,860
is that it allows multiple processes
44
00:01:49,860 --> 00:01:51,840
to work with the same large dataset
45
00:01:51,840 --> 00:01:54,333
without moving it or
copying it in any way.
46
00:01:55,680 --> 00:01:57,570
This zero-copy feature of Arrow
47
00:01:57,570 --> 00:02:00,600
makes it extremely fast for
iterating over a dataset.
48
00:02:00,600 --> 00:02:02,640
In this example, you
can see that we iterate
49
00:02:02,640 --> 00:02:05,160
over 15 million rows in about a minute
50
00:02:05,160 --> 00:02:06,780
just using a standard laptop.
51
00:02:06,780 --> 00:02:08,080
That's not too bad at all.
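As a rough sketch, iterating over a memory-mapped Dataset might look like this (the dataset name and batch size here are just placeholders, not the exact benchmark from the video):

from datasets import load_dataset

# The dataset is stored as memory-mapped Arrow files on disk, so slicing
# reads rows from disk instead of loading the whole table into RAM.
dataset = load_dataset("squad", split="train")  # placeholder dataset

batch_size = 1000
for idx in range(0, len(dataset), batch_size):
    # Each slice is a dict of columns backed by the Arrow file (zero-copy)
    batch = dataset[idx : idx + batch_size]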
52
00:02:09,750 --> 00:02:12,660
Let's now take a look at how
we can stream a large dataset.
53
00:02:12,660 --> 00:02:14,520
The only change you need to make is to set
54
00:02:14,520 --> 00:02:17,910
the streaming=True argument in
the load_dataset() function.
55
00:02:17,910 --> 00:02:20,580
This will return a special
IterableDataset object
56
00:02:20,580 --> 00:02:22,260
which is a bit different
to the Dataset objects
57
00:02:22,260 --> 00:02:24,330
we've seen in other videos.
58
00:02:24,330 --> 00:02:25,980
This object is an iterable,
59
00:02:25,980 --> 00:02:28,530
which means we can't index
it to access elements,
60
00:02:28,530 --> 00:02:30,180
but instead we iterate on it
61
00:02:30,180 --> 00:02:32,850
using Python's iter() and next() functions.
62
00:02:32,850 --> 00:02:34,050
This will download and access
63
00:02:34,050 --> 00:02:35,850
a single example from the dataset,
64
00:02:35,850 --> 00:02:37,410
which means you can progressively iterate
65
00:02:37,410 --> 00:02:40,360
through a huge dataset without
having to download it first.
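A minimal streaming sketch might look like this (the C4 repository name on the Hub is an assumption):

from datasets import load_dataset

# streaming=True returns an IterableDataset that downloads examples lazily
streamed_dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)

# An IterableDataset can't be indexed, so we pull one example with iter() and next()
first_example = next(iter(streamed_dataset))
print(first_example)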
66
00:02:42,150 --> 00:02:43,590
Tokenizing text with the map() method
67
00:02:43,590 --> 00:02:45,660
also works in a similar way.
68
00:02:45,660 --> 00:02:47,160
We first stream the dataset
69
00:02:47,160 --> 00:02:49,830
and then apply the map()
method with the tokenizer.
70
00:02:49,830 --> 00:02:53,283
To get the first tokenized
example, we apply iter() and next().
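Roughly, that tokenization step could look like this (the BERT checkpoint and the C4 dataset are just example choices):

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # example checkpoint

streamed_dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)

# map() on an IterableDataset is lazy: each example is tokenized as it streams in
tokenized_dataset = streamed_dataset.map(lambda example: tokenizer(example["text"]))

# Grab the first tokenized example
print(next(iter(tokenized_dataset)))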
71
00:02:54,750 --> 00:02:57,210
The main difference with
an IterableDataset is that
72
00:02:57,210 --> 00:02:59,970
instead of using the select()
method to return examples,
73
00:02:59,970 --> 00:03:01,530
we use the take() and skip() methods
74
00:03:01,530 --> 00:03:03,573
because we can't index into the dataset.
75
00:03:04,470 --> 00:03:05,460
The take() method returns
76
00:03:05,460 --> 00:03:07,500
the first N examples in the dataset,
77
00:03:07,500 --> 00:03:09,270
while skip(), as you can imagine,
78
00:03:09,270 --> 00:03:12,480
skips the first N and returns the rest.
79
00:03:12,480 --> 00:03:15,300
You can see examples of both
of these methods in action
80
00:03:15,300 --> 00:03:16,710
where we create a validation set
81
00:03:16,710 --> 00:03:18,660
from the first 1000 examples
82
00:03:18,660 --> 00:03:21,010
and then skip those to
create the training set.
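A sketch of that split with take() and skip(), assuming the same streamed dataset as above:

from datasets import load_dataset

streamed_dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)

# The first 1,000 streamed examples become the validation set...
validation_dataset = streamed_dataset.take(1000)
# ...and skipping those same 1,000 examples gives the training set
train_dataset = streamed_dataset.skip(1000)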
83
00:03:23,012 --> 00:03:25,762
(air whooshing)