(air whooshing)
(logo popping)
(metal sliding)

- Memory mapping and streaming.

In this video, we'll take a look at two core features of the Datasets library that allow you to load and process huge datasets without exhausting your laptop's memory.

Nowadays, it's not uncommon to find yourself working with multi-gigabyte datasets, especially if you're planning to pretrain a transformer like BERT or GPT-2 from scratch. In these cases, even loading the data can be a challenge. For example, the C4 corpus used to pretrain T5 consists of over two terabytes of data.

To handle these large datasets, the Datasets library is built on two core features: the Apache Arrow format and a streaming API. Arrow is designed for high-performance data processing and represents each table-like dataset with a column-based format. As you can see in this example, column-based formats group the elements of a table in consecutive blocks of RAM, which unlocks fast access and processing.

Arrow is great at processing data at any scale, but some datasets are so large that you can't even fit them on your hard disk. For these cases, the Datasets library provides a streaming API that lets you progressively download the raw data one element at a time. The result is a special object called an IterableDataset, which we'll see in more detail soon.

Let's start by looking at why Arrow is so powerful. The first feature is that it treats every dataset as a memory-mapped file. Memory mapping is a mechanism that maps a portion of a file, or an entire file on disk, to a chunk of virtual memory.
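As a concrete illustration of memory mapping, here is a minimal sketch (not part of the video) that loads an Arrow-backed dataset and compares its on-disk size to the RAM the process actually uses. The dataset name is an arbitrary example, and psutil is an extra dependency assumed for the memory check.

    import psutil
    from datasets import load_dataset

    # Any large Hub dataset works here; "wikitext" is just an example choice.
    dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

    # The Arrow table lives in memory-mapped .arrow cache files on disk.
    print(dataset.cache_files)
    print(f"Dataset size on disk: {dataset.dataset_size / 1e6:.1f} MB")

    # Because the file is memory-mapped rather than read into memory,
    # the process's RAM footprint stays small even for large datasets.
    ram_mb = psutil.Process().memory_info().rss / 1e6
    print(f"RAM used by this process: {ram_mb:.1f} MB")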
This allows applications to access segments of an extremely large file without having to read the whole file into memory first. Another cool feature of Arrow's memory-mapping capabilities is that it allows multiple processes to work with the same large dataset without moving or copying it in any way.

This zero-copy feature of Arrow makes it extremely fast to iterate over a dataset. In this example, you can see that we iterate over 15 million rows in about a minute, just using a standard laptop. That's not too bad at all.

Let's now take a look at how we can stream a large dataset. The only change you need to make is to set the streaming=True argument in the load_dataset() function. This returns a special IterableDataset object, which is a bit different from the Dataset objects we've seen in other videos. This object is an iterable, which means we can't index into it to access elements; instead, we iterate over it using the iter() and next() functions. This downloads and accesses a single example from the dataset, which means you can progressively iterate through a huge dataset without having to download it all first.

Tokenizing text with the map() method also works in a similar way: we first stream the dataset and then apply the map() method with the tokenizer. To get the first tokenized example, we apply iter() and next().

The main difference with an IterableDataset is that instead of using a select() method to return examples, we use the take() and skip() methods, because we can't index into the dataset.
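The streaming workflow above might look like the following sketch. The dataset and tokenizer checkpoint names are illustrative assumptions, not choices made in the video.

    from datasets import load_dataset
    from transformers import AutoTokenizer

    # streaming=True returns an IterableDataset that downloads examples lazily.
    streamed = load_dataset("allenai/c4", "en", split="train", streaming=True)

    # An IterableDataset can't be indexed; fetch one example with iter()/next().
    first_example = next(iter(streamed))

    # map() is applied on the fly, example by example, as the data streams in.
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    tokenized = streamed.map(lambda example: tokenizer(example["text"]))
    print(next(iter(tokenized)))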
The take() method returns the first N examples in the dataset, while skip(), as you can imagine, skips the first N and returns the rest. You can see examples of both of these methods in action, where we create a validation set from the first 1,000 examples and then skip those to create the training set.

(air whooshing)
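As a short sketch of that split, assuming streamed is the IterableDataset from the previous example (the shuffle step is an optional assumption, useful so the validation set isn't just the file's first rows):

    # Shuffle with a buffer, then carve off a validation set with take()
    # and keep the rest for training with skip().
    shuffled = streamed.shuffle(buffer_size=10_000, seed=42)
    validation_dataset = shuffled.take(1000)
    train_dataset = shuffled.skip(1000)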