1
00:00:00,000 --> 00:00:02,917
(transition music)
2
00:00:06,600 --> 00:00:08,283
- Saving and reloading a dataset.
3
00:00:09,210 --> 00:00:10,320
In this video, we'll take a look
4
00:00:10,320 --> 00:00:12,360
at saving a dataset in various formats
5
00:00:12,360 --> 00:00:14,660
and explore the ways to
reload the saved data.
6
00:00:17,310 --> 00:00:20,100
When you download a dataset,
the processing scripts and data
7
00:00:20,100 --> 00:00:22,470
are stored locally on your computer.
8
00:00:22,470 --> 00:00:24,000
The cache allows the Datasets library
9
00:00:24,000 --> 00:00:25,230
to avoid re-downloading
10
00:00:25,230 --> 00:00:28,620
or processing the entire
dataset every time you use it.
11
00:00:28,620 --> 00:00:31,170
Now, the data is stored in
the form of Arrow tables
12
00:00:31,170 --> 00:00:32,490
whose location can be found
13
00:00:32,490 --> 00:00:35,730
by accessing the dataset's
cache_files attribute.
14
00:00:35,730 --> 00:00:38,430
In this example, we've
downloaded the allocine dataset
15
00:00:38,430 --> 00:00:40,080
from the Hugging Face Hub, and you can see
16
00:00:40,080 --> 00:00:41,430
that there are three Arrow files
17
00:00:41,430 --> 00:00:43,473
stored in the cache, one for each split.
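A minimal sketch of inspecting the cache, assuming the allocine dataset used in the video; the cached paths will differ on your machine:

from datasets import load_dataset

# Download the dataset; the data is cached locally as Arrow tables
dataset = load_dataset("allocine")

# Each split is backed by its own Arrow file in the cache
for split, ds in dataset.items():
    print(split, ds.cache_files)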
18
00:00:45,360 --> 00:00:47,460
But in many cases, you'll
wanna save your dataset
19
00:00:47,460 --> 00:00:49,890
in a different location or format.
20
00:00:49,890 --> 00:00:51,900
As shown in the table,
the Datasets library
21
00:00:51,900 --> 00:00:54,870
provides four main
functions to achieve this.
22
00:00:54,870 --> 00:00:56,130
Now, you're probably already familiar
23
00:00:56,130 --> 00:00:58,770
with the CSV and JSON formats,
both of which are great
24
00:00:58,770 --> 00:01:00,810
if you just wanna quickly save a small
25
00:01:00,810 --> 00:01:02,790
or medium-sized dataset.
26
00:01:02,790 --> 00:01:03,976
But if your dataset is huge,
27
00:01:03,976 --> 00:01:07,860
you'll wanna save it in either
the Arrow or Parquet formats.
28
00:01:07,860 --> 00:01:09,660
Arrow files are great
if you plan to reload
29
00:01:09,660 --> 00:01:11,850
or process the data in the near future.
30
00:01:11,850 --> 00:01:13,290
Parquet files, meanwhile, are designed
31
00:01:13,290 --> 00:01:16,140
for long-term storage and
are very space-efficient.
32
00:01:16,140 --> 00:01:18,140
Let's take a closer look at each format.
33
00:01:19,800 --> 00:01:21,750
To save a Dataset or a DatasetDict object
34
00:01:21,750 --> 00:01:25,560
in the Arrow format, we use
the save_to_disk function.
35
00:01:25,560 --> 00:01:26,910
As you can see in this example,
36
00:01:26,910 --> 00:01:29,790
we simply provide the path
we wish to save the data to
37
00:01:29,790 --> 00:01:30,720
and the Datasets library
38
00:01:30,720 --> 00:01:32,340
will automatically create a directory
39
00:01:32,340 --> 00:01:35,790
for each split to store the
Arrow table and the metadata.
40
00:01:35,790 --> 00:01:37,680
Since we're dealing with
a DatasetDict object
41
00:01:37,680 --> 00:01:39,090
that has multiple splits,
42
00:01:39,090 --> 00:01:40,590
this information is also stored
43
00:01:40,590 --> 00:01:42,243
in the dataset_dict.json file.
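A sketch of the save_to_disk call described above; the output path is just an illustrative name:

# Save the whole DatasetDict in the Arrow format: one directory per
# split (Arrow table plus metadata) and a dataset_dict.json file
# recording the split names
dataset.save_to_disk("my-arrow-datasets")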
44
00:01:44,250 --> 00:01:46,710
Now, when we wanna reload
the Arrow datasets,
45
00:01:46,710 --> 00:01:48,870
we use the load_from_disk function.
46
00:01:48,870 --> 00:01:51,210
We simply pass the path
of our dataset directory,
47
00:01:51,210 --> 00:01:53,583
and voila, the original
dataset is recovered.
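And the matching reload, using the same illustrative path:

from datasets import load_from_disk

# Recover the original DatasetDict from the saved directory
dataset = load_from_disk("my-arrow-datasets")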
48
00:01:55,594 --> 00:01:57,180
If we wanna save our dataset
49
00:01:57,180 --> 00:02:00,990
in the CSV format, we
use the to_csv function.
50
00:02:00,990 --> 00:02:02,280
In this case, you'll need to loop
51
00:02:02,280 --> 00:02:04,170
over the splits of the DatasetDict object
52
00:02:04,170 --> 00:02:07,710
and save each dataset as
an individual CSV file.
53
00:02:07,710 --> 00:02:10,950
Since the to_csv function is
based on the one from Pandas,
54
00:02:10,950 --> 00:02:13,980
you can pass keyword arguments
to configure the output.
55
00:02:13,980 --> 00:02:16,230
In this example, we've
set the index argument
56
00:02:16,230 --> 00:02:18,480
to None to prevent the
dataset's index column
57
00:02:18,480 --> 00:02:20,553
from being included in the CSV files.
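A sketch of the CSV export loop, with illustrative file names; the index keyword is forwarded to pandas to keep the index column out of the output:

# Save each split as its own CSV file
for split, ds in dataset.items():
    ds.to_csv(f"allocine-{split}.csv", index=None)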
58
00:02:22,470 --> 00:02:24,240
To reload our CSV files,
59
00:02:24,240 --> 00:02:27,180
we then use the familiar
load_dataset function
60
00:02:27,180 --> 00:02:29,160
together with the CSV loading script
61
00:02:29,160 --> 00:02:30,360
and the data_files argument,
62
00:02:30,360 --> 00:02:34,020
which specifies the file names
associated with each split.
63
00:02:34,020 --> 00:02:35,400
As you can see in this example,
64
00:02:35,400 --> 00:02:37,320
by providing all the splits
and their file names,
65
00:02:37,320 --> 00:02:39,770
we've recovered the original
DatasetDict object.
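Reloading those CSV files might look like this, matching the illustrative file names above:

from datasets import load_dataset

# Map each split name to its CSV file to recover the DatasetDict
data_files = {
    "train": "allocine-train.csv",
    "validation": "allocine-validation.csv",
    "test": "allocine-test.csv",
}
csv_dataset = load_dataset("csv", data_files=data_files)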
66
00:02:41,880 --> 00:02:43,560
Now, saving a dataset in the JSON
67
00:02:43,560 --> 00:02:46,710
or Parquet formats is very
similar to the CSV case.
68
00:02:46,710 --> 00:02:49,890
We use either the to_json
function for JSON files
69
00:02:49,890 --> 00:02:52,740
or the to_parquet
function for Parquet ones.
70
00:02:52,740 --> 00:02:55,740
And just like the CSV case, we
need to loop over the splits
71
00:02:55,740 --> 00:02:57,753
to save each one as an individual file.
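The same pattern, sketched for JSON Lines and Parquet with illustrative file names:

# Save each split as an individual JSON Lines or Parquet file
for split, ds in dataset.items():
    ds.to_json(f"allocine-{split}.jsonl")
    ds.to_parquet(f"allocine-{split}.parquet")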
72
00:02:59,580 --> 00:03:02,940
And once our datasets are
saved as JSON or Parquet files,
73
00:03:02,940 --> 00:03:03,990
we can reload them
74
00:03:03,990 --> 00:03:06,960
with the appropriate script
in the load_dataset function.
75
00:03:06,960 --> 00:03:09,993
And we just need to provide a
data_files argument as before.
76
00:03:10,860 --> 00:03:11,910
This example shows
77
00:03:11,910 --> 00:03:14,560
how we can reload our saved
datasets in either format.
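A sketch of reloading either format, matching the file names used above:

from datasets import load_dataset

splits = ["train", "validation", "test"]

# The "json" loader handles JSON Lines files...
json_dataset = load_dataset(
    "json",
    data_files={split: f"allocine-{split}.jsonl" for split in splits},
)

# ...and the "parquet" loader handles Parquet files
parquet_dataset = load_dataset(
    "parquet",
    data_files={split: f"allocine-{split}.parquet" for split in splits},
)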
78
00:03:16,620 --> 00:03:17,970
And with that, you now know
79
00:03:17,970 --> 00:03:20,220
how to save your datasets
in various formats.
80
00:03:21,441 --> 00:03:24,358
(transition music)