subtitles/zh-CN/39_memory-mapping-&-streaming.srt
1
00:00:00,511 --> 00:00:01,784
(空气呼啸)
(air whooshing)
2
00:00:01,784 --> 00:00:02,964
(徽标弹出)
(logo popping)
3
00:00:02,964 --> 00:00:05,640
(金属滑动)
(metal sliding)
4
00:00:05,640 --> 00:00:07,203
- 内存映射和流式数据。
- Memory mapping and streaming.
5
00:00:08,040 --> 00:00:09,180
在本视频中,我们将了解
In this video, we'll take a look
6
00:00:09,180 --> 00:00:11,520
Datasets 库的两个核心特性
at two core features of the Datasets library
7
00:00:11,520 --> 00:00:14,220
在不耗尽笔记本电脑的 CPU 资源的前提下
that allow you to load and process huge datasets
8
00:00:14,220 --> 00:00:16,263
允许你加载和处理庞大的数据集。
without blowing up your laptop's CPU.
9
00:00:18,300 --> 00:00:20,280
如今,工作上处理多达数个 GB 体量的数据集
Nowadays, it's not uncommon to find yourself
10
00:00:20,280 --> 00:00:22,950
已经不是什么新鲜事了,
working with multi-GB sized datasets,
11
00:00:22,950 --> 00:00:24,420
特别是如果你打算从头开始预训练
especially if you're planning to pretrain
12
00:00:24,420 --> 00:00:28,110
类似 BERT 或 GPT-2 这样的 transformer。
a transformer like BERT or GPT-2 from scratch.
13
00:00:28,110 --> 00:00:31,260
在这些场景下,即使加载数据也可能是一个挑战。
In these cases, even loading the data can be a challenge.
14
00:00:31,260 --> 00:00:34,680
例如,用于预训练 T5 的 C4 语料库
For example, the C4 corpus used to pretrain T5
15
00:00:34,680 --> 00:00:36,903
包含超过 2 TB 的数据。
consists of over two terabytes of data.
16
00:00:38,400 --> 00:00:40,050
为了处理这些大型数据集,
To handle these large datasets,
17
00:00:40,050 --> 00:00:42,990
Datasets 库建立在两个核心特性之上:
the Datasets library is built on two core features:
18
00:00:42,990 --> 00:00:46,350
Apache Arrow 格式和流式 API。
the Apache Arrow format and a streaming API.
19
00:00:46,350 --> 00:00:49,110
Arrow 专为高性能数据处理而设计
Arrow is designed for high-performance data processing
20
00:00:49,110 --> 00:00:51,360
并将每个类似表格的数据集
and represents each table-like dataset
21
00:00:51,360 --> 00:00:52,773
表示为基于列的格式。
with a column-based format.
22
00:00:53,730 --> 00:00:56,130
正如你在此示例中所见,基于列的格式
As you can see in this example, column-based formats
23
00:00:56,130 --> 00:00:59,280
将表格的元素分组存放在连续的 RAM 块中
group the elements of a table in consecutive blocks of RAM
24
00:00:59,280 --> 00:01:01,563
这实现了快速访问和处理。
and this unlocks fast access and processing.
25
00:01:02,760 --> 00:01:05,550
Arrow 擅长处理任何规模的数据
Arrow is great at processing data at any scale
26
00:01:05,550 --> 00:01:07,110
但有些数据集很大
but some datasets are so large
27
00:01:07,110 --> 00:01:09,600
你甚至不能把它们完全放在你的硬盘上。
that you can't even fit them on your hard disk.
28
00:01:09,600 --> 00:01:11,730
所以对于这些情况,Datasets 库提供了
So for these cases, the Datasets library provides
29
00:01:11,730 --> 00:01:14,820
一个流式 API,允许你逐步地
a streaming API that allows you to progressively download
30
00:01:14,820 --> 00:01:17,700
每次下载原始数据中的一个元素。
the raw data one element at a time.
31
00:01:17,700 --> 00:01:20,430
结果是一个称为 IterableDataset 的特殊对象
The result is a special object called an IterableDataset
32
00:01:20,430 --> 00:01:22,180
我们接下来就会看到更多细节。
that we'll see in more detail soon.
33
00:01:23,700 --> 00:01:26,670
让我们先来看看为什么 Arrow 如此强大。
Let's start by looking at why Arrow is so powerful.
34
00:01:26,670 --> 00:01:28,860
第一个特点是它将每个数据集
The first feature is that it treats every dataset
35
00:01:28,860 --> 00:01:30,153
作为内存映射文件处理。
as a memory-mapped file.
36
00:01:31,020 --> 00:01:32,430
现在,内存映射是一种机制
Now, memory mapping is a mechanism
37
00:01:32,430 --> 00:01:35,400
将磁盘上文件的一部分或整个文件
that maps a portion of a file, or an entire file on disk,
38
00:01:35,400 --> 00:01:37,410
映射到一块虚拟内存。
to a chunk of virtual memory.
39
00:01:37,410 --> 00:01:38,520
这允许应用程序
This allows applications
40
00:01:38,520 --> 00:01:41,280
访问一个非常大的文件的片段
to access segments of an extremely large file
41
00:01:41,280 --> 00:01:44,080
而无需先将整个文件读入内存。
without having to read the whole file into memory first.
42
00:01:45,150 --> 00:01:48,120
Arrow 内存映射功能的另一个很酷的特性
Another cool feature of Arrow's memory mapping capabilities
43
00:01:48,120 --> 00:01:49,860
是它允许多个进程
is that it allows multiple processes
44
00:01:49,860 --> 00:01:51,840
使用相同的大型数据集
to work with the same large dataset
45
00:01:51,840 --> 00:01:54,333
而无需以任何方式移动或复制它。
without moving it or copying it in any way.
46
00:01:55,680 --> 00:01:57,570
Arrow 的这种零拷贝功能
This zero-copy feature of Arrow
47
00:01:57,570 --> 00:02:00,600
使得迭代数据集的速度非常快。
makes it extremely fast for iterating over a dataset.
48
00:02:00,600 --> 00:02:02,640
在这个例子中,你可以看到我们仅用
In this example, you can see that we iterate
49
00:02:02,640 --> 00:02:05,160
一台普通的笔记本电脑,就在大约一分钟内
over 15 million rows in about a minute
50
00:02:05,160 --> 00:02:06,780
迭代了超过 1500 万行数据。
just using a standard laptop.
51
00:02:06,780 --> 00:02:08,080
这一点也不差。
That's not too bad at all.
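For reference, that timing can be reproduced with a sketch like this (adapted from the batched-iteration pattern in the Datasets docs; `dataset` is assumed to be the memory-mapped Dataset loaded above):

import timeit

# Iterate over the whole dataset in batches of 1000 rows and time one full pass.
elapsed = timeit.timeit(
    stmt="""
batch_size = 1000
for idx in range(0, len(dataset), batch_size):
    _ = dataset[idx : idx + batch_size]
""",
    number=1,
    globals=globals(),
)
print(f"Iterated over {len(dataset)} rows in {elapsed:.1f}s")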
52
00:02:09,750 --> 00:02:12,660
现在让我们看一下如何流式传输大型数据集。
Let's now take a look at how we can stream a large dataset.
53
00:02:12,660 --> 00:02:14,520
你需要做的唯一修改是
The only change you need to make is to set
54
00:02:14,520 --> 00:02:17,910
设置 load_dataset() 函数中的 streaming=True 参数。
the streaming=True argument in the load_dataset() function.
55
00:02:17,910 --> 00:02:20,580
这将返回一个特殊的 IterableDataset 对象
This will return a special IterableDataset object
56
00:02:20,580 --> 00:02:22,260
这与 Dataset 对象有点不同
which is a bit different to the Dataset objects
57
00:02:22,260 --> 00:02:24,330
我们在其他视频中看到过。
we've seen in other videos.
58
00:02:24,330 --> 00:02:25,980
这个对象是一个可迭代对象,
This object is an iterable,
59
00:02:25,980 --> 00:02:28,530
这意味着我们不能索引它来访问元素,
which means we can't index it to access elements,
60
00:02:28,530 --> 00:02:30,180
但我们可以转而使用 iter()
but instead we iterate on it
61
00:02:30,180 --> 00:02:32,850
和 next() 函数来迭代它。
using the iter() and next() functions.
62
00:02:32,850 --> 00:02:34,050
这将下载并访问
This will download and access
63
00:02:34,050 --> 00:02:35,850
来自数据集的单个示例,
a single example from the dataset,
64
00:02:35,850 --> 00:02:37,410
这意味着你可以逐步迭代
which means you can progressively iterate
65
00:02:37,410 --> 00:02:40,360
庞大的数据集,而无需提前下载它。
through a huge dataset without having to download it first.
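In code, this looks roughly as follows (a sketch: the "en" config and the split name are assumptions, and the C4 dataset ID on the Hub may differ):

from datasets import load_dataset

# streaming=True returns an IterableDataset; nothing is downloaded up front.
streamed_dataset = load_dataset("c4", "en", split="train", streaming=True)

# An IterableDataset can't be indexed, so we fetch the first example with iter()/next().
first_example = next(iter(streamed_dataset))
print(first_example)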
66
00:02:42,150 --> 00:02:43,590
使用 map() 方法对文本进行词元化
Tokenizing text with the map() method
67
00:02:43,590 --> 00:02:45,660
也以类似的方式工作。
also works in a similar way.
68
00:02:45,660 --> 00:02:47,160
我们首先流式传输数据集
We first stream the dataset
69
00:02:47,160 --> 00:02:49,830
然后将 map() 方法与分词器一起应用。
and then apply the map() method with the tokenizer.
70
00:02:49,830 --> 00:02:53,283
要获得第一个词元化示例,我们调用 iter() 和 next()。
To get the first tokenized example, we apply iter() and next().
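As a sketch of those two steps (the BERT checkpoint and the "text" column name are assumptions; map() on an IterableDataset is lazy, so tokenization happens only as you iterate):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Apply the tokenizer lazily to the streamed dataset defined earlier.
tokenized_dataset = streamed_dataset.map(lambda example: tokenizer(example["text"]))

# Pull the first tokenized example with iter() and next().
print(next(iter(tokenized_dataset)))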
71
00:02:54,750 --> 00:02:57,210
与 IterableDataset 的主要区别在于
The main difference with an IterableDataset is that
72
00:02:57,210 --> 00:02:59,970
并未使用 select() 方法返回示例,
instead of using a select() method to return examples,
73
00:02:59,970 --> 00:03:01,530
而是使用 take() 和 skip() 方法
we use the take() and skip() methods
74
00:03:01,530 --> 00:03:03,573
因为我们无法索引数据集。
because we can't index into the dataset.
75
00:03:04,470 --> 00:03:05,460
take() 方法返回
The take() method returns
76
00:03:05,460 --> 00:03:07,500
数据集中的前 N 个示例,
the first N examples in the dataset,
77
00:03:07,500 --> 00:03:09,270
而 skip(),如你所想,
while skip(), as you can imagine,
78
00:03:09,270 --> 00:03:12,480
跳过前 N 个示例并返回其余的。
skips the first N and returns the rest.
79
00:03:12,480 --> 00:03:15,300
你可以看到这两种方法的实际示例
You can see examples of both of these methods in action
80
00:03:15,300 --> 00:03:16,710
其中我们创建了一个验证集
where we create a validation set
81
00:03:16,710 --> 00:03:18,660
来自前 1000 个示例
from the first 1000 examples
82
00:03:18,660 --> 00:03:21,010
然后跳过这些示例来创建训练集。
and then skip those to create the training set.
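Putting the two methods together, a minimal sketch of that split (continuing with the streamed dataset from above):

# take() yields the first 1000 examples for the validation set;
# skip() drops those same 1000 and yields the rest for the training set.
validation_dataset = streamed_dataset.take(1000)
train_dataset = streamed_dataset.skip(1000)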
83
00:03:23,012 --> 00:03:25,762
(空气呼啸)
(air whooshing)