(air whooshing)
(logo popping)
(metal sliding)

- Memory mapping and streaming.

In this video, we'll take a look at two core features of the Datasets library that allow you to load and process huge datasets without exhausting your laptop's memory.

Nowadays, it's not uncommon to find yourself working with multi-gigabyte datasets, especially if you're planning to pretrain a transformer like BERT or GPT-2 from scratch. In these cases, even loading the data can be a challenge. For example, the C4 corpus used to pretrain T5 consists of over two terabytes of data.

To handle these large datasets, the Datasets library is built on two core features: the Apache Arrow format and a streaming API. Arrow is designed for high-performance data processing and represents each table-like dataset with a column-based format. As you can see in this example, column-based formats group the elements of a table in consecutive blocks of RAM, which unlocks fast access and processing.

Arrow is great at processing data at any scale, but some datasets are so large that you can't even fit them on your hard disk. For these cases, the Datasets library provides a streaming API that lets you progressively download the raw data one element at a time. The result is a special object called an IterableDataset, which we'll see in more detail soon.

Let's start by looking at why Arrow is so powerful. The first feature is that it treats every dataset as a memory-mapped file. Memory mapping is a mechanism that maps a portion of a file, or an entire file on disk, to a chunk of virtual memory.
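As a concrete illustration of memory mapping, here is a minimal sketch (not part of the video) that loads an Arrow-backed dataset and compares its on-disk size to the RAM the process actually uses. The dataset name is an arbitrary example, and psutil is an extra dependency assumed for the memory check.

    import psutil
    from datasets import load_dataset

    # Any large Hub dataset works here; "wikitext" is just an example choice.
    dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

    # The Arrow table lives in memory-mapped .arrow cache files on disk.
    print(dataset.cache_files)
    print(f"Dataset size on disk: {dataset.dataset_size / 1e6:.1f} MB")

    # Because the file is memory-mapped rather than read into memory,
    # the process's RAM footprint stays small even for large datasets.
    ram_mb = psutil.Process().memory_info().rss / 1e6
    print(f"RAM used by this process: {ram_mb:.1f} MB")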
This allows applications to access segments of an extremely large file without having to read the whole file into memory first. Another cool feature of Arrow's memory-mapping capabilities is that it allows multiple processes to work with the same large dataset without moving or copying it in any way.

This zero-copy feature of Arrow makes it extremely fast to iterate over a dataset. In this example, you can see that we iterate over 15 million rows in about a minute, just using a standard laptop. That's not too bad at all.

Let's now take a look at how we can stream a large dataset. The only change you need to make is to set the streaming=True argument in the load_dataset() function. This returns a special IterableDataset object, which is a bit different from the Dataset objects we've seen in other videos. This object is an iterable, which means we can't index into it to access elements; instead, we iterate over it using the iter() and next() functions. This downloads and accesses a single example from the dataset, which means you can progressively iterate through a huge dataset without having to download it all first.

Tokenizing text with the map() method also works in a similar way: we first stream the dataset and then apply the map() method with the tokenizer. To get the first tokenized example, we apply iter() and next().

The main difference with an IterableDataset is that instead of using a select() method to return examples, we use the take() and skip() methods, because we can't index into the dataset.
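The streaming workflow above might look like the following sketch. The dataset and tokenizer checkpoint names are illustrative assumptions, not choices made in the video.

    from datasets import load_dataset
    from transformers import AutoTokenizer

    # streaming=True returns an IterableDataset that downloads examples lazily.
    streamed = load_dataset("allenai/c4", "en", split="train", streaming=True)

    # An IterableDataset can't be indexed; fetch one example with iter()/next().
    first_example = next(iter(streamed))

    # map() is applied on the fly, example by example, as the data streams in.
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    tokenized = streamed.map(lambda example: tokenizer(example["text"]))
    print(next(iter(tokenized)))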
The take() method returns the first N examples in the dataset, while skip(), as you can imagine, skips the first N and returns the rest. You can see examples of both of these methods in action, where we create a validation set from the first 1,000 examples and then skip those to create the training set.

(air whooshing)
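As a short sketch of that split, assuming streamed is the IterableDataset from the previous example (the shuffle step is an optional assumption, useful so the validation set isn't just the file's first rows):

    # Shuffle with a buffer, then carve off a validation set with take()
    # and keep the rest for training with skip().
    shuffled = streamed.shuffle(buffer_size=10_000, seed=42)
    validation_dataset = shuffled.take(1000)
    train_dataset = shuffled.skip(1000)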