(transition music)

- Saving and reloading a dataset. In this video, we'll take a look at saving a dataset in various formats and explore the ways to reload the saved data.

When you download a dataset, the processing scripts and data are stored locally on your computer. The cache allows the Datasets library to avoid re-downloading or processing the entire dataset every time you use it. The data is stored in the form of Arrow tables, whose location can be found by accessing the dataset's cache_files attribute. In this example, we've downloaded the allocine dataset from the Hugging Face Hub, and you can see that there are three Arrow files stored in the cache, one for each split.

But in many cases, you'll want to save your dataset in a different location or format. As shown in the table, the Datasets library provides four main functions to achieve this. You're probably already familiar with the CSV and JSON formats, both of which are great if you just want to quickly save a small or medium-sized dataset. But if your dataset is huge, you'll want to save it in either the Arrow or Parquet format. Arrow files are great if you plan to reload or process the data in the near future, while Parquet files are designed for long-term storage and are very space-efficient.

Let's take a closer look at each format. To save a Dataset or DatasetDict object in the Arrow format, we use the save_to_disk function. As you can see in this example, we simply provide the path we wish to save the data to, and the Datasets library will automatically create a directory for each split to store the Arrow table and the metadata. Since we're dealing with a DatasetDict object that has multiple splits, this information is also stored in the dataset_dict.json file.

Now, when we want to reload the Arrow datasets, we use the load_from_disk function. We simply pass the path of our dataset directory, and voilà, the original dataset is recovered.
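As a minimal sketch of the Arrow round trip described above (the video's on-screen code isn't reproduced in the transcript, and the directory name "allocine-arrow" is just an illustrative choice):

```python
from datasets import load_dataset, load_from_disk

# Download the allocine dataset from the Hub; the Arrow files land in the local cache
dataset = load_dataset("allocine")

# Inspect where the cached Arrow tables live (one file per split)
print(dataset.cache_files)

# Save the whole DatasetDict in the Arrow format: one subdirectory per split,
# plus a dataset_dict.json file recording the split names
dataset.save_to_disk("allocine-arrow")

# Reload it later by pointing load_from_disk at that same directory
dataset = load_from_disk("allocine-arrow")
```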
If we want to save our dataset in the CSV format, we use the to_csv function. In this case, you'll need to loop over the splits of the DatasetDict object and save each dataset as an individual CSV file. Since the to_csv function is based on the one from Pandas, you can pass keyword arguments to configure the output. In this example, we've set the index argument to None to prevent the dataset's index column from being included in the CSV files.

To reload our CSV files, we then use the familiar load_dataset function together with the csv loading script and the data_files argument, which specifies the file names associated with each split. As you can see in this example, by providing all the splits and their file names, we've recovered the original DatasetDict object.

Saving a dataset in the JSON or Parquet format is very similar to the CSV case: we use either the to_json function for JSON files or the to_parquet function for Parquet ones. And just like the CSV case, we need to loop over the splits to save each one as an individual file.

Once our datasets are saved as JSON or Parquet files, we can reload them with the appropriate script in the load_dataset function, and we just need to provide a data_files argument as before. This example shows how we can reload our saved datasets in either format.

And with that, you now know how to save your datasets in various formats.

(transition music)
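Here's a minimal sketch of those CSV, JSON, and Parquet round trips, assuming raw_datasets is a DatasetDict with train, validation, and test splits (for example, the allocine dataset from earlier); the file names are illustrative, not the exact ones from the video:

```python
from datasets import load_dataset

raw_datasets = load_dataset("allocine")

# Save each split as an individual CSV file; index=None keeps the
# dataset's index column out of the output (a Pandas to_csv keyword)
for split, dataset in raw_datasets.items():
    dataset.to_csv(f"my-dataset-{split}.csv", index=None)

# Reload with the csv loading script; data_files maps each split to its file
csv_datasets = load_dataset(
    "csv",
    data_files={
        "train": "my-dataset-train.csv",
        "validation": "my-dataset-validation.csv",
        "test": "my-dataset-test.csv",
    },
)

# JSON and Parquet work the same way: loop over the splits to save...
for split, dataset in raw_datasets.items():
    dataset.to_json(f"my-dataset-{split}.jsonl")  # JSON Lines by default
    dataset.to_parquet(f"my-dataset-{split}.parquet")

# ...and reload with the matching loading script and data_files argument
json_datasets = load_dataset(
    "json",
    data_files={split: f"my-dataset-{split}.jsonl" for split in raw_datasets},
)
parquet_datasets = load_dataset(
    "parquet",
    data_files={split: f"my-dataset-{split}.parquet" for split in raw_datasets},
)
```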