subtitles/en/35_loading-a-custom-dataset.srt

1
00:00:00,195 --> 00:00:01,426
(screen whooshing)

2
00:00:01,426 --> 00:00:02,614
(sticker popping)

3
00:00:02,614 --> 00:00:06,150
(screen whooshing)

4
00:00:06,150 --> 00:00:08,430
- Loading a custom dataset.

5
00:00:08,430 --> 00:00:09,750
Although the Hugging Face Hub hosts

6
00:00:09,750 --> 00:00:11,730
over a thousand public datasets,

7
00:00:11,730 --> 00:00:12,930
you'll often need to work with data

8
00:00:12,930 --> 00:00:15,900
that is stored on your laptop or some remote server.

9
00:00:15,900 --> 00:00:18,060
In this video, we'll explore how the Datasets library

10
00:00:18,060 --> 00:00:20,310
can be used to load datasets that aren't available

11
00:00:20,310 --> 00:00:21,510
on the Hugging Face Hub.

12
00:00:22,980 --> 00:00:25,290
As you can see in this table, the Datasets library

13
00:00:25,290 --> 00:00:26,700
provides several in-built scripts

14
00:00:26,700 --> 00:00:29,370
to load datasets in several formats.

15
00:00:29,370 --> 00:00:31,200
To load a dataset in one of these formats,

16
00:00:31,200 --> 00:00:32,730
you just need to provide the name of the format

17
00:00:32,730 --> 00:00:34,350
to the load_dataset function,

18
00:00:34,350 --> 00:00:35,790
along with a data_files argument

19
00:00:35,790 --> 00:00:37,610
that points to one or more filepaths or URLs.

20
00:00:40,350 --> 00:00:43,590
To see this in action, let's start by loading a CSV file.

21
00:00:43,590 --> 00:00:45,960
In this example, we first download a dataset

22
00:00:45,960 --> 00:00:48,963
about wine quality from the UCI machine learning repository.

23
00:00:50,220 --> 00:00:52,590
Since this is a CSV file, we then specify

24
00:00:52,590 --> 00:00:53,943
the CSV loading script.

25
00:00:55,320 --> 00:00:57,570
Now, this script needs to know where our data is located,

26
00:00:57,570 --> 00:00:58,650
so we provide the filename

27
00:00:58,650 --> 00:01:00,483
as part of the data_files argument.

28
00:01:01,860 --> 00:01:03,360
And the loading script also allows you

29
00:01:03,360 --> 00:01:05,040
to pass several keyword arguments,

30
00:01:05,040 --> 00:01:06,750
so here we've also specified

31
00:01:06,750 --> 00:01:09,030
that the separator is a semi-colon.

32
00:01:09,030 --> 00:01:10,380
And with that, we can see the dataset

33
00:01:10,380 --> 00:01:13,020
is loaded automatically as a DatasetDict object,

34
00:01:13,020 --> 00:01:15,920
with each column in the CSV file represented as a feature.

35
00:01:17,610 --> 00:01:20,280
If your dataset is located on some remote server like GitHub

36
00:01:20,280 --> 00:01:22,050
or some other repository,

37
00:01:22,050 --> 00:01:23,700
the process is actually very similar.

38
00:01:23,700 --> 00:01:25,980
The only difference is that now the data_files argument

39
00:01:25,980 --> 00:01:28,623
points to a URL instead of a local filepath.

40
00:01:30,330 --> 00:01:33,270
Let's now take a look at loading raw text files.

41
00:01:33,270 --> 00:01:35,100
This format is quite common in NLP,

42
00:01:35,100 --> 00:01:36,750
and you'll typically find books and plays

43
00:01:36,750 --> 00:01:39,393
are just a single file with raw text inside.

44
00:01:40,410 --> 00:01:43,020
In this example, we have a text file of Shakespeare plays

45
00:01:43,020 --> 00:01:45,330
that's stored on a GitHub repository.

46
00:01:45,330 --> 00:01:47,040
And as we did for CSV files,

47
00:01:47,040 --> 00:01:49,020
we simply choose the text loading script

48
00:01:49,020 --> 00:01:51,423
and point the data_files argument to the URL.
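
A minimal sketch of the load_dataset calls described so far, covering the CSV, remote-URL, and raw-text cases; the file names and URLs below are illustrative placeholders rather than the exact ones shown on screen.

from datasets import load_dataset

# Load a local CSV file: name the format, point data_files at the file,
# and forward extra keyword arguments (here the semicolon separator).
wine = load_dataset("csv", data_files="winequality-white.csv", sep=";")

# The same call works with a URL instead of a local path (illustrative URL).
wine_url = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/"
    "wine-quality/winequality-white.csv"
)
wine_remote = load_dataset("csv", data_files=wine_url, sep=";")

# Load raw text: each line of the file becomes one row of the dataset
# (illustrative URL for a plain-text file of Shakespeare plays).
shakespeare_url = (
    "https://raw.githubusercontent.com/karpathy/char-rnn/"
    "master/data/tinyshakespeare/input.txt"
)
plays = load_dataset("text", data_files=shakespeare_url)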
49
00:01:52,260 --> 00:01:55,110
As you can see, these files are processed line-by-line,

50
00:01:55,110 --> 00:01:57,690
so empty lines in the raw text are also represented

51
00:01:57,690 --> 00:01:58,953
as a row in the dataset.

52
00:02:00,810 --> 00:02:04,230
For JSON files, there are two main formats to know about.

53
00:02:04,230 --> 00:02:06,060
The first one is called JSON Lines,

54
00:02:06,060 --> 00:02:09,510
where every row in the file is a separate JSON object.

55
00:02:09,510 --> 00:02:11,100
For these files, you can load the dataset

56
00:02:11,100 --> 00:02:13,020
by selecting the JSON loading script

57
00:02:13,020 --> 00:02:16,143
and pointing the data_files argument to the file or URL.

58
00:02:17,160 --> 00:02:19,410
In this example, we've loaded a JSON Lines file

59
00:02:19,410 --> 00:02:21,710
based on Stack Exchange questions and answers.

60
00:02:23,490 --> 00:02:26,610
The other format is nested JSON files.

61
00:02:26,610 --> 00:02:29,100
These files basically look like one huge dictionary,

62
00:02:29,100 --> 00:02:31,200
so the load_dataset function allows you to specify

63
00:02:31,200 --> 00:02:32,733
which specific key to load.

64
00:02:33,630 --> 00:02:35,910
For example, the SQuAD dataset for question answering

65
00:02:35,910 --> 00:02:38,340
has this format, and we can load it by specifying

66
00:02:38,340 --> 00:02:40,340
that we're interested in the data field.

67
00:02:41,400 --> 00:02:42,780
There is just one last thing to mention

68
00:02:42,780 --> 00:02:44,910
about all of these loading scripts.

69
00:02:44,910 --> 00:02:46,410
If you have more than one split,

70
00:02:46,410 --> 00:02:49,080
you can load them by treating data_files as a dictionary

71
00:02:49,080 --> 00:02:52,140
that maps each split name to its corresponding file.

72
00:02:52,140 --> 00:02:53,970
Everything else stays completely unchanged,

73
00:02:53,970 --> 00:02:55,350
and you can see an example of loading

74
00:02:55,350 --> 00:02:58,283
both the training and validation splits of SQuAD here.

75
00:02:59,550 --> 00:03:02,310
And with that, you can now load datasets from your laptop,

76
00:03:02,310 --> 00:03:04,653
the Hugging Face Hub, or anywhere else you want.

77
00:03:06,277 --> 00:03:09,194
(screen whooshing)
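
Likewise, a minimal sketch of the JSON loading calls just described; the file names (a JSON Lines file of Stack Exchange posts, SQuAD-style train and validation files) are assumed placeholders, not the exact files used in the video.

from datasets import load_dataset

# JSON Lines: every line of the file is a separate JSON object
# (placeholder filename).
qa = load_dataset("json", data_files="stack-exchange.jsonl")

# Nested JSON: use the field argument to pick the key that holds the records,
# e.g. the "data" field of a SQuAD-style file (placeholder filename).
squad = load_dataset("json", data_files="squad-train.json", field="data")

# Several splits: pass data_files as a dictionary mapping each split name
# to its file; everything else stays unchanged.
data_files = {"train": "squad-train.json", "validation": "squad-validation.json"}
squad_splits = load_dataset("json", data_files=data_files, field="data")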