1
00:00:00,195 --> 00:00:01,426
(screen whooshing)
2
00:00:01,426 --> 00:00:02,614
(sticker popping)
3
00:00:02,614 --> 00:00:06,150
(screen whooshing)
4
00:00:06,150 --> 00:00:08,430
- Loading a custom dataset.
5
00:00:08,430 --> 00:00:09,750
Although the Hugging Face Hub hosts
6
00:00:09,750 --> 00:00:11,730
over a thousand public datasets,
7
00:00:11,730 --> 00:00:12,930
you'll often need to work with data
8
00:00:12,930 --> 00:00:15,900
that is stored on your
laptop or some remote server.
9
00:00:15,900 --> 00:00:18,060
In this video, we'll explore
how the Datasets library
10
00:00:18,060 --> 00:00:20,310
can be used to load datasets
that aren't available
11
00:00:20,310 --> 00:00:21,510
on the Hugging Face Hub.
12
00:00:22,980 --> 00:00:25,290
As you can see in this
table, the Datasets library
13
00:00:25,290 --> 00:00:26,700
provides several built-in scripts
14
00:00:26,700 --> 00:00:29,370
to load datasets in various formats.
15
00:00:29,370 --> 00:00:31,200
To load a dataset in one of these formats,
16
00:00:31,200 --> 00:00:32,730
you just need to provide
the name of the format
17
00:00:32,730 --> 00:00:34,350
to the load_dataset function,
18
00:00:34,350 --> 00:00:35,790
along with a data_files argument
19
00:00:35,790 --> 00:00:37,610
that points to one or
more filepaths or URLs.
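As a quick illustration of this pattern, here is a minimal sketch; the format name and file path below are placeholders, not files from the video:

```python
from datasets import load_dataset

# Generic pattern: the first argument names the format ("csv", "text", "json", ...)
# and data_files points to one or more local paths or URLs (placeholder path shown).
dataset = load_dataset("csv", data_files="path/to/my_file.csv")
```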
20
00:00:40,350 --> 00:00:43,590
To see this in action, let's
start by loading a CSV file.
21
00:00:43,590 --> 00:00:45,960
In this example, we
first download a dataset
22
00:00:45,960 --> 00:00:48,963
about wine quality from the UCI
Machine Learning Repository.
23
00:00:50,220 --> 00:00:52,590
Since this is a CSV file, we then specify
24
00:00:52,590 --> 00:00:53,943
the CSV loading script.
25
00:00:55,320 --> 00:00:57,570
Now, this script needs to know
where our data is located,
26
00:00:57,570 --> 00:00:58,650
so we provide the filename
27
00:00:58,650 --> 00:01:00,483
as part of the data_files argument.
28
00:01:01,860 --> 00:01:03,360
And the loading script also allows you
29
00:01:03,360 --> 00:01:05,040
to pass several keyword arguments,
30
00:01:05,040 --> 00:01:06,750
so here we've also specified
31
00:01:06,750 --> 00:01:09,030
that the separator is a semicolon.
32
00:01:09,030 --> 00:01:10,380
And with that, we can see the dataset
33
00:01:10,380 --> 00:01:13,020
is loaded automatically
as a DatasetDict object,
34
00:01:13,020 --> 00:01:15,920
with each column in the CSV
file represented as a feature.
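A minimal sketch of this step, assuming the UCI file has already been downloaded locally; the filename is an assumption:

```python
from datasets import load_dataset

# Load the local CSV with the csv loading script; sep=";" tells it the
# columns are separated by semicolons (hypothetical local filename).
wine_dataset = load_dataset("csv", data_files="winequality-white.csv", sep=";")
print(wine_dataset)  # a DatasetDict with a "train" split, one feature per CSV column
```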
35
00:01:17,610 --> 00:01:20,280
If your dataset is located on
some remote server like GitHub
36
00:01:20,280 --> 00:01:22,050
or some other repository,
37
00:01:22,050 --> 00:01:23,700
the process is actually very similar.
38
00:01:23,700 --> 00:01:25,980
The only difference is that
now the data_files argument
39
00:01:25,980 --> 00:01:28,623
points to a URL instead
of a local filepath.
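A sketch of the remote variant; the URL below is a placeholder standing in for wherever the CSV is hosted:

```python
from datasets import load_dataset

# Placeholder URL: data_files can point at a remote file instead of a local path.
url = "https://example.com/path/to/winequality-white.csv"
remote_wine_dataset = load_dataset("csv", data_files=url, sep=";")
```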
40
00:01:30,330 --> 00:01:33,270
Let's now take a look at
loading raw text files.
41
00:01:33,270 --> 00:01:35,100
This format is quite common in NLP,
42
00:01:35,100 --> 00:01:36,750
and you'll typically find books and plays
43
00:01:36,750 --> 00:01:39,393
are just a single file
with raw text inside.
44
00:01:40,410 --> 00:01:43,020
In this example, we have a
text file of Shakespeare plays
45
00:01:43,020 --> 00:01:45,330
that's stored on a GitHub repository.
46
00:01:45,330 --> 00:01:47,040
And as we did for CSV files,
47
00:01:47,040 --> 00:01:49,020
we simply choose the text loading script
48
00:01:49,020 --> 00:01:51,423
and point the data_files
argument to the URL.
49
00:01:52,260 --> 00:01:55,110
As you can see, these files
are processed line-by-line,
50
00:01:55,110 --> 00:01:57,690
so empty lines in the raw
text are also represented
51
00:01:57,690 --> 00:01:58,953
as a row in the dataset.
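A minimal sketch of loading a raw text file; the URL is a placeholder for the Shakespeare file mentioned above:

```python
from datasets import load_dataset

# Placeholder URL for a plain-text file such as the Shakespeare plays.
shakespeare_url = "https://example.com/path/to/shakespeare.txt"
plays = load_dataset("text", data_files=shakespeare_url)
# Each line of the file becomes one row in a single "text" column,
# so empty lines in the source show up as empty rows.
```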
52
00:02:00,810 --> 00:02:04,230
For JSON files, there are two
main formats to know about.
53
00:02:04,230 --> 00:02:06,060
The first one is called JSON Lines,
54
00:02:06,060 --> 00:02:09,510
where every row in the file
is a separate JSON object.
55
00:02:09,510 --> 00:02:11,100
For these files, you can load the dataset
56
00:02:11,100 --> 00:02:13,020
by selecting the JSON loading script
57
00:02:13,020 --> 00:02:16,143
and pointing the data_files
argument to the file or URL.
58
00:02:17,160 --> 00:02:19,410
In this example, we've
loaded a JSON Lines file
59
00:02:19,410 --> 00:02:21,710
based on Stack Exchange
questions and answers.
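A sketch of the JSON Lines case, assuming a local file where each line is one JSON object; the filename is hypothetical:

```python
from datasets import load_dataset

# Hypothetical JSON Lines file: one {"question": ..., "answer": ...} object per line.
qa_dataset = load_dataset("json", data_files="stack-exchange-qa.jsonl")
```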
60
00:02:23,490 --> 00:02:26,610
The other format is nested JSON files.
61
00:02:26,610 --> 00:02:29,100
These files basically look
like one huge dictionary,
62
00:02:29,100 --> 00:02:31,200
so the load_dataset function
allows you to specify
63
00:02:31,200 --> 00:02:32,733
which specific key to load.
64
00:02:33,630 --> 00:02:35,910
For example, the SQuAD dataset
for question answering
65
00:02:35,910 --> 00:02:38,340
has this format, and we
can load it by specifying
66
00:02:38,340 --> 00:02:40,340
that we're interested in the data field.
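A sketch of loading a nested JSON file with the field argument; the filename is an assumption based on the usual SQuAD training file:

```python
from datasets import load_dataset

# The field argument selects the key inside the nested JSON that holds the records;
# for SQuAD that key is "data" (the filename here is an assumption).
squad = load_dataset("json", data_files="train-v2.0.json", field="data")
```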
67
00:02:41,400 --> 00:02:42,780
There is just one last thing to mention
68
00:02:42,780 --> 00:02:44,910
about all of these loading scripts.
69
00:02:44,910 --> 00:02:46,410
If you have more than one split,
70
00:02:46,410 --> 00:02:49,080
you can load them by treating
data_files as a dictionary,
71
00:02:49,080 --> 00:02:52,140
and mapping each split name
to its corresponding file.
72
00:02:52,140 --> 00:02:53,970
Everything else stays completely unchanged
73
00:02:53,970 --> 00:02:55,350
and you can see an example of loading
74
00:02:55,350 --> 00:02:58,283
both the training and validation
splits of SQuAD here.
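A sketch of the multi-split case, with assumed SQuAD filenames mapped to split names:

```python
from datasets import load_dataset

# Map each split name to its file; the filenames here are assumptions.
data_files = {
    "train": "train-v2.0.json",
    "validation": "dev-v2.0.json",
}
squad = load_dataset("json", data_files=data_files, field="data")
# Result: a DatasetDict with "train" and "validation" splits.
```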
75
00:02:59,550 --> 00:03:02,310
And with that, you can now
load datasets from your laptop,
76
00:03:02,310 --> 00:03:04,653
the Hugging Face Hub,
or anywhere else you want.
77
00:03:06,277 --> 00:03:09,194
(screen whooshing)