subtitles/zh-CN/38_saving-and-reloading-a-dataset.srt
1
00:00:00,000 --> 00:00:02,917
(过渡音乐)
(transition music)
2
00:00:06,600 --> 00:00:08,283
- 保存和重新加载数据集。
- Saving and reloading a dataset.
3
00:00:09,210 --> 00:00:10,320
在本视频中,我们将了解
In this video, we'll take a look
4
00:00:10,320 --> 00:00:12,360
以各种格式保存数据集
at saving a dataset in various formats
5
00:00:12,360 --> 00:00:14,660
并探索重新加载已保存数据的方法。
and explore the ways to reload the saved data.
6
00:00:17,310 --> 00:00:20,100
下载数据集时,所需的处理脚本和数据都会本地存储
When you download a dataset, the processing scripts and data
7
00:00:20,100 --> 00:00:22,470
在你的计算机上。
are stored locally on your computer.
8
00:00:22,470 --> 00:00:24,000
缓存允许 Datasets 库
The cache allows the Datasets library
9
00:00:24,000 --> 00:00:25,230
避免重新下载
to avoid re-downloading
10
00:00:25,230 --> 00:00:28,620
或每次使用时处理整个数据集。
or processing the entire dataset every time you use it.
11
00:00:28,620 --> 00:00:31,170
现在,数据以 Arrow 表的形式存储
Now, the data is stored in the form of Arrow tables
12
00:00:31,170 --> 00:00:32,490
通过访问数据集的
whose location can be found
13
00:00:32,490 --> 00:00:35,730
cache_files 属性可以找到它的位置。
by accessing the dataset's cache_files attribute.
14
00:00:35,730 --> 00:00:38,430
在这个例子中,我们从 Hugging Face Hub
In this example, we've downloaded the allocine dataset
15
00:00:38,430 --> 00:00:40,080
下载了 allocine 数据集,你可以看到
from the Hugging Face Hub, and you can see
16
00:00:40,080 --> 00:00:41,430
一共有三个 Arrow 文件
that there are three Arrow files
17
00:00:41,430 --> 00:00:43,473
存储在缓存中,每个文件对应一个分片数据。
stored in the cache, one for each split.
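As a rough sketch of this cache inspection (the exact cache paths will differ on your machine):

from datasets import load_dataset

# Download the allocine dataset (or reuse the locally cached copy)
raw_datasets = load_dataset("allocine")

# Each split maps to the Arrow file(s) backing it in the local cache
print(raw_datasets.cache_files)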
18
00:00:45,360 --> 00:00:47,460
但在很多情况下,你会希望将数据集
But in many cases, you'll wanna save your dataset
19
00:00:47,460 --> 00:00:49,890
保存到不同的位置或保存为不同的格式。
in a different location or format.
20
00:00:49,890 --> 00:00:51,900
如表所示,Datasets 库
As shown in the table, the Datasets library
21
00:00:51,900 --> 00:00:54,870
提供了四个主要功能来实现这一点。
provides four main functions to achieve this.
22
00:00:54,870 --> 00:00:56,130
现在,你可能已经很熟悉了
Now, you're probably already familiar
23
00:00:56,130 --> 00:00:58,770
使用 CSV 和 JSON 格式,这两种格式都很棒
with the CSV and JSON formats, both of which are great
24
00:00:58,770 --> 00:01:00,810
如果你只是想快速保存一个小规模
if you just wanna quickly save a small
25
00:01:00,810 --> 00:01:02,790
或中等规模的数据集。
or medium-sized dataset.
26
00:01:02,790 --> 00:01:03,976
但是如果你的数据集很大,
But if your dataset is huge,
27
00:01:03,976 --> 00:01:07,860
你需要将其保存为 Arrow 或 Parquet 格式。
you'll wanna save it in either the Arrow or Parquet formats.
28
00:01:07,860 --> 00:01:09,660
如果你打算重新加载或在不久的将来处理数据,
Arrow files are great if you plan to reload
29
00:01:09,660 --> 00:01:11,850
Arrow 文件就很棒。
or process the data in the near future.
30
00:01:11,850 --> 00:01:13,290
而 Parquet 文件则被设计成
While Parquet files are designed
31
00:01:13,290 --> 00:01:16,140
用于长期存储并且非常节省空间。
for long-term storage and are very space-efficient.
32
00:01:16,140 --> 00:01:18,140
让我们仔细看看每种格式。
Let's take a closer look at each format.
33
00:01:19,800 --> 00:01:21,750
要将数据集或 dataset_dict 对象保存
To save a dataset or a dataset_dict object
34
00:01:21,750 --> 00:01:25,560
为 Arrow 格式,我们使用 save_to_disk 函数。
in the Arrow format, we use the save_to_disk function.
35
00:01:25,560 --> 00:01:26,910
正如你在此示例中所见,
As you can see in this example,
36
00:01:26,910 --> 00:01:29,790
只需提供我们希望将数据保存到的路径
we simply provide the path we wish to save the data to
37
00:01:29,790 --> 00:01:30,720
然后 Datasets 库
and the Datasets library
38
00:01:30,720 --> 00:01:32,340
会针对每个分片数据自动创建一个目录
will automatically create a directory
39
00:01:32,340 --> 00:01:35,790
来存储 Arrow 表和元数据。
for each split to store the Arrow table and the metadata.
40
00:01:35,790 --> 00:01:37,680
因为我们正在处理一个 dataset_dict 对象
Since we're dealing with a dataset_dict object
41
00:01:37,680 --> 00:01:39,090
其中包含多个分片数据,
that has multiple splits,
42
00:01:39,090 --> 00:01:40,590
此信息也被存储
this information is also stored
43
00:01:40,590 --> 00:01:42,243
在 dataset_dict.json 文件中。
in the dataset_dict.json file.
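Roughly, this saving step might look as follows; the output directory name "allocine-dataset" is just an illustrative choice:

from datasets import load_dataset

raw_datasets = load_dataset("allocine")

# Writes one sub-directory per split (Arrow table + metadata)
# plus a top-level dataset_dict.json describing the splits
raw_datasets.save_to_disk("allocine-dataset")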
44
00:01:44,250 --> 00:01:46,710
现在,当我们想要重新加载 Arrow 数据集时,
Now, when we wanna reload the Arrow datasets,
45
00:01:46,710 --> 00:01:48,870
我们使用 load_from_disk 函数。
we use the load_from_disk function.
46
00:01:48,870 --> 00:01:51,210
只需传递数据集目录的路径,
We simply pass the path of our dataset directory,
47
00:01:51,210 --> 00:01:53,583
瞧,原始数据集就恢复了。
and voila, the original dataset is recovered.
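A minimal sketch of reloading, assuming the directory created in the sketch above:

from datasets import load_from_disk

# Point load_from_disk at the directory written by save_to_disk
arrow_datasets_reloaded = load_from_disk("allocine-dataset")
print(arrow_datasets_reloaded)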
48
00:01:55,594 --> 00:01:57,180
如果我们想将数据集保存
If we wanna save our dataset
49
00:01:57,180 --> 00:02:00,990
为 CSV 格式,我们就使用 to_csv 函数。
in the CSV format, we use the to_csv function.
50
00:02:00,990 --> 00:02:02,280
在这种情况下,你需要遍历
In this case, you'll need to loop
51
00:02:02,280 --> 00:02:04,170
dataset_dict 对象的分片数据
over the splits of the dataset_dict object
52
00:02:04,170 --> 00:02:07,710
并将每个数据集保存为单独的 CSV 文件。
and save each dataset as an individual CSV file.
53
00:02:07,710 --> 00:02:10,950
由于 to_csv 函数基于 Pandas 中的函数,
Since the to_csv function is based on the one from Pandas,
54
00:02:10,950 --> 00:02:13,980
你可以传递关键字参数来配置输出。
you can pass keyword arguments to configure the output.
55
00:02:13,980 --> 00:02:16,230
在这个例子中,我们将 index 参数设置为 None
In this example, we've set the index argument
56
00:02:16,230 --> 00:02:18,480
以防止数据集的索引列
to None to prevent the dataset's index column
57
00:02:18,480 --> 00:02:20,553
被包含在 CSV 文件中。
from being included in the CSV files.
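A sketch of the CSV export loop described above; the file naming pattern is an illustrative choice:

from datasets import load_dataset

raw_datasets = load_dataset("allocine")

# Write each split to its own CSV file; keyword arguments such as
# index=None are forwarded to pandas' to_csv under the hood
for split, dataset in raw_datasets.items():
    dataset.to_csv(f"allocine-{split}.csv", index=None)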
58
00:02:22,470 --> 00:02:24,240
要重新加载我们的 CSV 文件,
To reload our CSV files,
59
00:02:24,240 --> 00:02:27,180
就使用熟悉的 load_dataset 函数
we just then use the familiar load_dataset function
60
00:02:27,180 --> 00:02:29,160
连同 CSV 加载脚本
together with the CSV loading script
61
00:02:29,160 --> 00:02:30,360
和 data_files 参数,
and the data_files argument,
62
00:02:30,360 --> 00:02:34,020
它指定与每个分片数据关联的文件名。
which specifies the file names associated with each split.
63
00:02:34,020 --> 00:02:35,400
正如你在此示例中所见,
As you can see in this example,
64
00:02:35,400 --> 00:02:37,320
通过提供所有分片数据及其文件名,
by providing all the splits and their file names,
65
00:02:37,320 --> 00:02:39,770
我们已经恢复了原始的 dataset_dict 对象。
we've recovered the original dataset_dict object.
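A sketch of the CSV reload, assuming the dataset's three splits are named train, validation, and test, and that the files use the names from the export sketch above:

from datasets import load_dataset

# Map each split name to the CSV file it was saved to
data_files = {
    "train": "allocine-train.csv",
    "validation": "allocine-validation.csv",
    "test": "allocine-test.csv",
}
csv_datasets_reloaded = load_dataset("csv", data_files=data_files)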
66
00:02:41,880 --> 00:02:43,560
现在,将数据集保存为 JSON 格式
Now, to save a dataset in the JSON
67
00:02:43,560 --> 00:02:46,710
或 Parquet 格式,与 CSV 的情况非常相似。
or Parquet formats is very similar to the CSV case.
68
00:02:46,710 --> 00:02:49,890
我们对 JSON 文件使用 to_json 函数
We use either the to_json function for JSON files
69
00:02:49,890 --> 00:02:52,740
或 Parquet 的 to_parquet 函数。
or the to_parquet function for Parquet ones.
70
00:02:52,740 --> 00:02:55,740
就像 CSV 案例一样,我们需要遍历分片数据
And just like the CSV case, we need to loop over the splits
71
00:02:55,740 --> 00:02:57,753
将每个保存为单独的文件。
to save each one as an individual file.
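The JSON and Parquet export loops might look like this; the file names are again illustrative:

from datasets import load_dataset

raw_datasets = load_dataset("allocine")

# One output file per split, just like the CSV case
for split, dataset in raw_datasets.items():
    dataset.to_json(f"allocine-{split}.jsonl")       # JSON Lines by default
    dataset.to_parquet(f"allocine-{split}.parquet")  # columnar Parquet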
72
00:02:59,580 --> 00:03:02,940
一旦我们的数据集被保存为 JSON 或 Parquet 文件,
And once our datasets are saved as JSON or Parquet files,
73
00:03:02,940 --> 00:03:03,990
我们可以重新加载它们
we can reload them again
74
00:03:03,990 --> 00:03:06,960
方法是在 load_dataset 函数中使用相应的脚本。
with the appropriate script in the load_dataset function.
75
00:03:06,960 --> 00:03:09,993
我们只需要像以前一样提供一个 data_files 参数。
And we just need to provide a data_files argument as before.
76
00:03:10,860 --> 00:03:11,910
这个例子表明
This example shows
77
00:03:11,910 --> 00:03:14,560
我们如何以任何一种格式重新加载我们保存的数据集。
how we can reload our saved datasets in either format.
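A sketch of reloading either format, assuming the split names and file names from the export sketches above:

from datasets import load_dataset

splits = ["train", "validation", "test"]

json_datasets_reloaded = load_dataset(
    "json", data_files={split: f"allocine-{split}.jsonl" for split in splits}
)
parquet_datasets_reloaded = load_dataset(
    "parquet", data_files={split: f"allocine-{split}.parquet" for split in splits}
)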
78
00:03:16,620 --> 00:03:17,970
有了这个,你现在知道
And with that, you now know
79
00:03:17,970 --> 00:03:20,220
如何以各种格式保存数据集。
how to save your datasets in various formats.
80
00:03:21,441 --> 00:03:24,358
(过渡音乐)
(transition music)