(transition music)

Saving and reloading a dataset.

In this video, we'll take a look at saving a dataset in various formats and explore the ways to reload the saved data.

When you download a dataset, the processing scripts and data are stored locally on your computer. The cache allows the Datasets library to avoid re-downloading or processing the entire dataset every time you use it. The data is stored in the form of Arrow tables, whose location can be found by accessing the dataset's cache_files attribute. In this example, we've downloaded the allocine dataset from the Hugging Face Hub, and you can see that there are three Arrow files stored in the cache, one for each split; a short sketch of this check appears below.

But in many cases, you'll want to save your dataset in a different location or format. As shown in the table, the Datasets library provides four main functions to achieve this. You're probably already familiar with the CSV and JSON formats, both of which are great if you just want to quickly save a small or medium-sized dataset. But if your dataset is huge, you'll want to save it in either the Arrow or Parquet format. Arrow files are great if you plan to reload or process the data in the near future, while Parquet files are designed for long-term storage and are very space-efficient. Let's take a closer look at each format.
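First, here is a minimal sketch of the cache inspection mentioned above, assuming the dataset has already been downloaded as in the video; the variable names are only illustrative:

from datasets import load_dataset

# Download (or fetch from the local cache) the allocine dataset from the Hub.
dataset = load_dataset("allocine")

# Each split keeps its data in one or more Arrow files on disk;
# cache_files lists their locations as dicts with a "filename" key.
for split, ds in dataset.items():
    print(split, [f["filename"] for f in ds.cache_files])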
To save a Dataset or DatasetDict object in the Arrow format, we use the save_to_disk function. As you can see in this example, we simply provide the path we wish to save the data to, and the Datasets library will automatically create a directory for each split to store the Arrow table and the metadata. Since we're dealing with a DatasetDict object that has multiple splits, this information is also stored in the dataset_dict.json file.

Now, when we want to reload the Arrow datasets, we use the load_from_disk function. We simply pass the path of our dataset directory, and voila, the original dataset is recovered.

If we want to save our dataset in the CSV format, we use the to_csv function. In this case, you'll need to loop over the splits of the DatasetDict object and save each dataset as an individual CSV file. Since the to_csv function is based on the one from Pandas, you can pass keyword arguments to configure the output. In this example, we've set the index argument to None to prevent the dataset's index column from being included in the CSV files.

To reload our CSV files, we then use the familiar load_dataset function together with the CSV loading script and the data_files argument, which specifies the file names associated with each split. As you can see in this example, by providing all the splits and their file names, we've recovered the original DatasetDict object.

Now, saving a dataset in the JSON or Parquet format is very similar to the CSV case. We use either the to_json function for JSON files or the to_parquet function for Parquet ones. And just like the CSV case, we need to loop over the splits to save each one as an individual file.
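Here is a minimal sketch of the Arrow and CSV round trips described above (the JSON and Parquet case is sketched at the end); the output paths are only illustrative:

from datasets import load_dataset, load_from_disk

dataset = load_dataset("allocine")

# Arrow format: one sub-directory per split, plus dataset_dict.json metadata.
dataset.save_to_disk("allocine-arrow")
arrow_reloaded = load_from_disk("allocine-arrow")

# CSV format: one file per split; extra keyword arguments such as index=None
# are forwarded to pandas.DataFrame.to_csv.
for split, ds in dataset.items():
    ds.to_csv(f"allocine-{split}.csv", index=None)

# Reload with the CSV loading script; data_files maps each split to its file.
data_files = {split: f"allocine-{split}.csv" for split in dataset}
csv_reloaded = load_dataset("csv", data_files=data_files)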
And once our datasets are saved as JSON or Parquet files, we can reload them again with the appropriate script in the load_dataset function, and we just need to provide a data_files argument as before. This example shows how we can reload our saved datasets in either format.

And with that, you now know how to save your datasets in various formats.

(transition music)
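Finally, a minimal sketch of the JSON and Parquet round trip described above, again with purely illustrative file names:

from datasets import load_dataset

dataset = load_dataset("allocine")

# One file per split, in JSON Lines and Parquet formats.
for split, ds in dataset.items():
    ds.to_json(f"allocine-{split}.jsonl")
    ds.to_parquet(f"allocine-{split}.parquet")

# Reload either format with load_dataset and a data_files mapping.
json_reloaded = load_dataset(
    "json", data_files={split: f"allocine-{split}.jsonl" for split in dataset}
)
parquet_reloaded = load_dataset(
    "parquet", data_files={split: f"allocine-{split}.parquet" for split in dataset}
)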