(transition music)

Saving and reloading a dataset.

In this video, we'll take a look at saving a dataset in various formats and explore the ways to reload the saved data.

When you download a dataset, the processing scripts and data are stored locally on your computer. The cache allows the Datasets library to avoid re-downloading or processing the entire dataset every time you use it. The data is stored in the form of Arrow tables, whose location can be found by accessing the dataset's cache_files attribute. In this example, we've downloaded the allocine dataset from the Hugging Face Hub, and you can see that there are three Arrow files stored in the cache, one for each split; a short sketch of this check appears below.

But in many cases, you'll want to save your dataset in a different location or format. As shown in the table, the Datasets library provides four main functions to achieve this. You're probably already familiar with the CSV and JSON formats, both of which are great if you just want to quickly save a small or medium-sized dataset. But if your dataset is huge, you'll want to save it in either the Arrow or Parquet format. Arrow files are great if you plan to reload or process the data in the near future, while Parquet files are designed for long-term storage and are very space-efficient. Let's take a closer look at each format.
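First, here is a minimal sketch of the cache inspection mentioned above, assuming the dataset has already been downloaded as in the video; the variable names are only illustrative:

from datasets import load_dataset

# Download (or fetch from the local cache) the allocine dataset from the Hub.
dataset = load_dataset("allocine")

# Each split keeps its data in one or more Arrow files on disk;
# cache_files lists their locations as dicts with a "filename" key.
for split, ds in dataset.items():
    print(split, [f["filename"] for f in ds.cache_files])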
To save a Dataset or DatasetDict object in the Arrow format, we use the save_to_disk function. As you can see in this example, we simply provide the path we wish to save the data to, and the Datasets library will automatically create a directory for each split to store the Arrow table and the metadata. Since we're dealing with a DatasetDict object that has multiple splits, this information is also stored in the dataset_dict.json file.

Now, when we want to reload the Arrow datasets, we use the load_from_disk function. We simply pass the path of our dataset directory, and voila, the original dataset is recovered.

If we want to save our dataset in the CSV format, we use the to_csv function. In this case, you'll need to loop over the splits of the DatasetDict object and save each dataset as an individual CSV file. Since the to_csv function is based on the one from Pandas, you can pass keyword arguments to configure the output. In this example, we've set the index argument to None to prevent the dataset's index column from being included in the CSV files.

To reload our CSV files, we then use the familiar load_dataset function together with the CSV loading script and the data_files argument, which specifies the file names associated with each split. As you can see in this example, by providing all the splits and their file names, we've recovered the original DatasetDict object.

Now, saving a dataset in the JSON or Parquet format is very similar to the CSV case. We use either the to_json function for JSON files or the to_parquet function for Parquet ones. And just like the CSV case, we need to loop over the splits to save each one as an individual file.
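Here is a minimal sketch of the Arrow and CSV round trips described above (the JSON and Parquet case is sketched at the end); the output paths are only illustrative:

from datasets import load_dataset, load_from_disk

dataset = load_dataset("allocine")

# Arrow format: one sub-directory per split, plus dataset_dict.json metadata.
dataset.save_to_disk("allocine-arrow")
arrow_reloaded = load_from_disk("allocine-arrow")

# CSV format: one file per split; extra keyword arguments such as index=None
# are forwarded to pandas.DataFrame.to_csv.
for split, ds in dataset.items():
    ds.to_csv(f"allocine-{split}.csv", index=None)

# Reload with the CSV loading script; data_files maps each split to its file.
data_files = {split: f"allocine-{split}.csv" for split in dataset}
csv_reloaded = load_dataset("csv", data_files=data_files)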
And once our datasets are saved as JSON or Parquet files, we can reload them again with the appropriate script in the load_dataset function, and we just need to provide a data_files argument as before. This example shows how we can reload our saved datasets in either format.

And with that, you now know how to save your datasets in various formats.

(transition music)
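Finally, a minimal sketch of the JSON and Parquet round trip described above, again with purely illustrative file names:

from datasets import load_dataset

dataset = load_dataset("allocine")

# One file per split, in JSON Lines and Parquet formats.
for split, ds in dataset.items():
    ds.to_json(f"allocine-{split}.jsonl")
    ds.to_parquet(f"allocine-{split}.parquet")

# Reload either format with load_dataset and a data_files mapping.
json_reloaded = load_dataset(
    "json", data_files={split: f"allocine-{split}.jsonl" for split in dataset}
)
parquet_reloaded = load_dataset(
    "parquet", data_files={split: f"allocine-{split}.parquet" for split in dataset}
)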