1
00:00:00,000 --> 00:00:02,917
(transition music)
2
00:00:06,600 --> 00:00:08,283
- Saving and reloading a dataset.
3
00:00:09,210 --> 00:00:10,320
In this video, we'll take a look
4
00:00:10,320 --> 00:00:12,360
at saving a dataset in various formats
5
00:00:12,360 --> 00:00:14,660
and explore the ways to
reload the saved data.
6
00:00:17,310 --> 00:00:20,100
When you download a dataset,
the processing scripts and data
7
00:00:20,100 --> 00:00:22,470
are stored locally on your computer.
8
00:00:22,470 --> 00:00:24,000
The cache allows the Datasets library
9
00:00:24,000 --> 00:00:25,230
to avoid re-downloading
10
00:00:25,230 --> 00:00:28,620
or processing the entire
dataset every time you use it.
11
00:00:28,620 --> 00:00:31,170
Now, the data is stored in
the form of Arrow tables
12
00:00:31,170 --> 00:00:32,490
whose location can be found
13
00:00:32,490 --> 00:00:35,730
by accessing the dataset's
cache_files attribute.
14
00:00:35,730 --> 00:00:38,430
In this example, we've
downloaded the allocine dataset
15
00:00:38,430 --> 00:00:40,080
from the Hugging Face Hub, and you can see
16
00:00:40,080 --> 00:00:41,430
that there are three Arrow files
17
00:00:41,430 --> 00:00:43,473
stored in the cache, one for each split.
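A minimal sketch of inspecting the cache, assuming the allocine dataset used in the video; the cached paths will differ on your machine:

from datasets import load_dataset

# Download the dataset; the data is cached locally as Arrow tables
dataset = load_dataset("allocine")

# Each split is backed by its own Arrow file in the cache
for split, ds in dataset.items():
    print(split, ds.cache_files)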
18
00:00:45,360 --> 00:00:47,460
But in many cases, you'll
wanna save your dataset
19
00:00:47,460 --> 00:00:49,890
in a different location or format.
20
00:00:49,890 --> 00:00:51,900
As shown in the table,
the Datasets library
21
00:00:51,900 --> 00:00:54,870
provides four main
functions to achieve this.
22
00:00:54,870 --> 00:00:56,130
Now, you're probably already familiar
23
00:00:56,130 --> 00:00:58,770
with the CSV and JSON formats,
both of which are great
24
00:00:58,770 --> 00:01:00,810
if you just wanna quickly save a small
25
00:01:00,810 --> 00:01:02,790
or medium-sized dataset.
26
00:01:02,790 --> 00:01:03,976
But if your dataset is huge,
27
00:01:03,976 --> 00:01:07,860
you'll wanna save it in either
the Arrow or Parquet formats.
28
00:01:07,860 --> 00:01:09,660
Arrow files are great
if you plan to reload
29
00:01:09,660 --> 00:01:11,850
or process the data in the near future.
30
00:01:11,850 --> 00:01:13,290
Parquet files, meanwhile, are designed
31
00:01:13,290 --> 00:01:16,140
for long-term storage and
are very space-efficient.
32
00:01:16,140 --> 00:01:18,140
Let's take a closer look at each format.
33
00:01:19,800 --> 00:01:21,750
To save a Dataset or a DatasetDict object
34
00:01:21,750 --> 00:01:25,560
in the Arrow format, we use
the save_to_disk function.
35
00:01:25,560 --> 00:01:26,910
As you can see in this example,
36
00:01:26,910 --> 00:01:29,790
we simply provide the path
we wish to save the data to
37
00:01:29,790 --> 00:01:30,720
and the Datasets library
38
00:01:30,720 --> 00:01:32,340
will automatically create a directory
39
00:01:32,340 --> 00:01:35,790
for each split to store the
Arrow table and the metadata.
40
00:01:35,790 --> 00:01:37,680
Since we're dealing with
a DatasetDict object
41
00:01:37,680 --> 00:01:39,090
that has multiple splits,
42
00:01:39,090 --> 00:01:40,590
this information is also stored
43
00:01:40,590 --> 00:01:42,243
in the dataset_dict.json file.
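A sketch of the save_to_disk call described above; the output path is just an illustrative name:

# Save the whole DatasetDict in the Arrow format: one directory per
# split (Arrow table plus metadata) and a dataset_dict.json file
# recording the split names
dataset.save_to_disk("my-arrow-datasets")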
44
00:01:44,250 --> 00:01:46,710
Now, when we wanna reload
the Arrow datasets,
45
00:01:46,710 --> 00:01:48,870
we use the load_from_disk function.
46
00:01:48,870 --> 00:01:51,210
We simply pass the path
of our dataset directory,
47
00:01:51,210 --> 00:01:53,583
and voila, the original
dataset is recovered.
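And the matching reload, using the same illustrative path:

from datasets import load_from_disk

# Recover the original DatasetDict from the saved directory
dataset = load_from_disk("my-arrow-datasets")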
48
00:01:55,594 --> 00:01:57,180
If we wanna save our dataset
49
00:01:57,180 --> 00:02:00,990
in the CSV format, we
use the to_csv function.
50
00:02:00,990 --> 00:02:02,280
In this case, you'll need to loop
51
00:02:02,280 --> 00:02:04,170
over the splits of the DatasetDict object
52
00:02:04,170 --> 00:02:07,710
and save each dataset as
an individual CSV file.
53
00:02:07,710 --> 00:02:10,950
Since the to_csv function is
based on the one from Pandas,
54
00:02:10,950 --> 00:02:13,980
you can pass keyword arguments
to configure the output.
55
00:02:13,980 --> 00:02:16,230
In this example, we've
set the index argument
56
00:02:16,230 --> 00:02:18,480
to None to prevent the
dataset's index column
57
00:02:18,480 --> 00:02:20,553
from being included in the CSV files.
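A sketch of the CSV export loop, with illustrative file names; the index keyword is forwarded to pandas to keep the index column out of the output:

# Save each split as its own CSV file
for split, ds in dataset.items():
    ds.to_csv(f"allocine-{split}.csv", index=None)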
58
00:02:22,470 --> 00:02:24,240
To reload our CSV files,
59
00:02:24,240 --> 00:02:27,180
we then use the familiar
load_dataset function
60
00:02:27,180 --> 00:02:29,160
together with the CSV loading script
61
00:02:29,160 --> 00:02:30,360
and the data_files argument,
62
00:02:30,360 --> 00:02:34,020
which specifies the file names
associated with each split.
63
00:02:34,020 --> 00:02:35,400
As you can see in this example,
64
00:02:35,400 --> 00:02:37,320
by providing all the splits
and their file names,
65
00:02:37,320 --> 00:02:39,770
we've recovered the original
DatasetDict object.
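Reloading those CSV files might look like this, matching the illustrative file names above:

from datasets import load_dataset

# Map each split name to its CSV file to recover the DatasetDict
data_files = {
    "train": "allocine-train.csv",
    "validation": "allocine-validation.csv",
    "test": "allocine-test.csv",
}
csv_dataset = load_dataset("csv", data_files=data_files)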
66
00:02:41,880 --> 00:02:43,560
Now, saving a dataset in the JSON
67
00:02:43,560 --> 00:02:46,710
or Parquet formats is very
similar to the CSV case.
68
00:02:46,710 --> 00:02:49,890
We use either the to_json
function for JSON files
69
00:02:49,890 --> 00:02:52,740
or the to_parquet
function for Parquet ones.
70
00:02:52,740 --> 00:02:55,740
And just like the CSV case, we
need to loop over the splits
71
00:02:55,740 --> 00:02:57,753
to save each one as an individual file.
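The same pattern, sketched for JSON Lines and Parquet with illustrative file names:

# Save each split as an individual JSON Lines or Parquet file
for split, ds in dataset.items():
    ds.to_json(f"allocine-{split}.jsonl")
    ds.to_parquet(f"allocine-{split}.parquet")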
72
00:02:59,580 --> 00:03:02,940
And once our datasets are
saved as JSON or Parquet files,
73
00:03:02,940 --> 00:03:03,990
we can reload them
74
00:03:03,990 --> 00:03:06,960
with the appropriate script
in the load_dataset function.
75
00:03:06,960 --> 00:03:09,993
And we just need to provide a
data_files argument as before.
76
00:03:10,860 --> 00:03:11,910
This example shows
77
00:03:11,910 --> 00:03:14,560
how we can reload our saved
datasets in either format.
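A sketch of reloading either format, matching the file names used above:

from datasets import load_dataset

splits = ["train", "validation", "test"]

# The "json" loader handles JSON Lines files...
json_dataset = load_dataset(
    "json",
    data_files={split: f"allocine-{split}.jsonl" for split in splits},
)

# ...and the "parquet" loader handles Parquet files
parquet_dataset = load_dataset(
    "parquet",
    data_files={split: f"allocine-{split}.parquet" for split in splits},
)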
78
00:03:16,620 --> 00:03:17,970
And with that, you now know
79
00:03:17,970 --> 00:03:20,220
how to save your datasets
in various formats.
80
00:03:21,441 --> 00:03:24,358
(transition music)