subtitles/en/19_hugging-face-datasets-overview-(pytorch).srt
1
00:00:00,213 --> 00:00:02,963
(slide whooshes)
2
00:00:05,340 --> 00:00:08,373
- The Hugging Face Datasets
library, a quick overview.
3
00:00:09,990 --> 00:00:11,670
The Hugging Face Datasets library
4
00:00:11,670 --> 00:00:14,310
provides an API
to quickly download
5
00:00:14,310 --> 00:00:17,610
many public datasets and preprocess them.
6
00:00:17,610 --> 00:00:20,614
In this video we will
explore how to do that.
7
00:00:20,614 --> 00:00:21,780
The downloading part is easy,
8
00:00:21,780 --> 00:00:23,760
with the load_dataset function.
9
00:00:23,760 --> 00:00:26,460
You can directly download
and cache a dataset
10
00:00:26,460 --> 00:00:28,473
from its identifier on the Dataset Hub.
11
00:00:29,640 --> 00:00:33,570
Here, we fetch the MRPC dataset
from the GLUE benchmark,
12
00:00:33,570 --> 00:00:36,390
which is a dataset
containing pairs of sentences
13
00:00:36,390 --> 00:00:38,740
where the task is to determine
if the sentences are paraphrases.
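
A minimal sketch of this download step, assuming the same GLUE MRPC identifier mentioned in the video:

from datasets import load_dataset

# Downloads (or reuses the cached copy of) the MRPC subset of GLUE
raw_datasets = load_dataset("glue", "mrpc")
print(raw_datasets)  # a DatasetDict with train/validation/test splits
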
14
00:00:39,810 --> 00:00:42,420
The object returned by
the load_dataset function
15
00:00:42,420 --> 00:00:45,600
is a DatasetDict, which
is a sort of dictionary
16
00:00:45,600 --> 00:00:47,463
containing each split of our dataset.
17
00:00:48,946 --> 00:00:52,170
We can access each split
by indexing with its name.
18
00:00:52,170 --> 00:00:55,047
This split is then an
instance of the Dataset class,
19
00:00:55,047 --> 00:00:58,590
with columns, here sentence1, sentence2,
20
00:00:58,590 --> 00:01:01,233
label and idx, and rows.
21
00:01:02,400 --> 00:01:04,563
We can access a given
element by its index.
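
A short sketch of split and element access, reusing the raw_datasets object from the previous snippet:

raw_train_dataset = raw_datasets["train"]  # a Dataset with sentence1, sentence2, label, idx columns
print(raw_train_dataset[0])                # a single example, returned as a dictionary
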
22
00:01:05,460 --> 00:01:08,220
The amazing thing about the
Hugging Face Datasets library
23
00:01:08,220 --> 00:01:11,880
is that everything is saved
to disk using Apache Arrow,
24
00:01:11,880 --> 00:01:14,550
which means that even
if your dataset is huge,
25
00:01:14,550 --> 00:01:16,350
you won't run out of RAM.
26
00:01:16,350 --> 00:01:19,113
Only the elements you
request are loaded in memory.
27
00:01:20,340 --> 00:01:23,940
Accessing a slice of your dataset
is as easy as accessing one element.
28
00:01:23,940 --> 00:01:26,220
The result is then a
dictionary with lists of values
29
00:01:26,220 --> 00:01:27,480
for each key.
30
00:01:27,480 --> 00:01:29,070
Here the list of labels,
31
00:01:29,070 --> 00:01:30,147
the list of first sentences
32
00:01:30,147 --> 00:01:31,923
and the list of second sentences.
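
Slicing works the same way; a quick sketch:

# Returns a dictionary mapping each column name to a list of values
first_five = raw_datasets["train"][:5]
print(first_five["label"])
print(first_five["sentence1"])
print(first_five["sentence2"])
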
33
00:01:33,690 --> 00:01:35,580
The features attribute of a Dataset
34
00:01:35,580 --> 00:01:37,470
gives us more information
about its columns.
35
00:01:37,470 --> 00:01:40,020
In particular, we can see here
36
00:01:40,020 --> 00:01:41,400
it gives us the correspondence
37
00:01:41,400 --> 00:01:44,810
between the integers and
names for the labels.
38
00:01:44,810 --> 00:01:48,543
Zero stands for not equivalent
and one for equivalent.
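
A sketch of inspecting the label names through the features attribute:

label_feature = raw_datasets["train"].features["label"]
print(label_feature)        # a ClassLabel for MRPC
print(label_feature.names)  # ['not_equivalent', 'equivalent']
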
39
00:01:49,830 --> 00:01:52,020
To preprocess all the
elements of our dataset,
40
00:01:52,020 --> 00:01:53,850
we need to tokenize them.
41
00:01:53,850 --> 00:01:56,160
Have a look at the video
"Preprocess sentence pairs"
42
00:01:56,160 --> 00:01:57,570
for a refresher,
43
00:01:57,570 --> 00:01:59,430
but you just have to
send the two sentences
44
00:01:59,430 --> 00:02:02,733
to the tokenizer with some
additional keyword arguments.
45
00:02:03,780 --> 00:02:06,600
Here we indicate a maximum length of 128
46
00:02:06,600 --> 00:02:08,820
and pad inputs shorter than this length,
47
00:02:08,820 --> 00:02:10,420
and truncate inputs that are longer.
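
As a sketch, the tokenizer call with those keyword arguments might look like this (the checkpoint name is only an example, not fixed by the video):

from transformers import AutoTokenizer

# "bert-base-uncased" is an assumed example checkpoint
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer(
    "This is the first sentence.",
    "This is the second one.",
    padding="max_length",  # pad shorter inputs up to max_length
    truncation=True,       # truncate longer inputs
    max_length=128,
)
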
48
00:02:11,460 --> 00:02:13,470
We put all of this in a tokenize_function
49
00:02:13,470 --> 00:02:16,710
that we can directly apply to
all the splits in our dataset
50
00:02:16,710 --> 00:02:17,710
with the map method.
51
00:02:18,840 --> 00:02:22,110
As long as the function returns
a dictionary-like object,
52
00:02:22,110 --> 00:02:24,300
the map method will add
new columns as needed
53
00:02:24,300 --> 00:02:26,043
or update existing ones.
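
Putting it together, a sketch of the tokenize_function applied with map, reusing the tokenizer defined above:

def tokenize_function(examples):
    return tokenizer(
        examples["sentence1"],
        examples["sentence2"],
        padding="max_length",
        truncation=True,
        max_length=128,
    )

# map applies the function to every split and adds the new columns it returns
tokenized_datasets = raw_datasets.map(tokenize_function)
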
54
00:02:27,315 --> 00:02:28,830
To speed up preprocessing
55
00:02:28,830 --> 00:02:30,870
and take advantage of
the fact that our tokenizer
56
00:02:30,870 --> 00:02:32,040
is backed by Rust,
57
00:02:32,040 --> 00:02:34,770
thanks to the Hugging
Face Tokenizers library,
58
00:02:34,770 --> 00:02:37,110
we can pass several
elements at the same time
59
00:02:37,110 --> 00:02:40,710
to our tokenize_function, using
the batched=True argument.
60
00:02:40,710 --> 00:02:42,120
Since the tokenizer can handle
61
00:02:42,120 --> 00:02:44,610
lists of first sentences and
lists of second sentences,
62
00:02:44,610 --> 00:02:47,493
the tokenize_function does
not need to change for this.
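
The batched version is the same call with one extra argument; a sketch:

# Each call now receives a batch (a dict of lists) instead of a single example
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
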
63
00:02:48,360 --> 00:02:51,180
You can also use multiprocessing
with the map method.
64
00:02:51,180 --> 00:02:53,583
Check out its documentation
in the linked video.
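
For multiprocessing, map accepts a num_proc argument; the value below is arbitrary:

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True, num_proc=4)
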
65
00:02:54,840 --> 00:02:57,990
Once this is done, we are
almost ready for training.
66
00:02:57,990 --> 00:02:59,970
We just remove the columns
we don't need anymore
67
00:02:59,970 --> 00:03:02,190
with the remove_columns method,
68
00:03:02,190 --> 00:03:03,750
rename label to labels,
69
00:03:03,750 --> 00:03:05,790
since the models from the
Hugging Face Transformers
70
00:03:05,790 --> 00:03:07,710
library expect that,
71
00:03:07,710 --> 00:03:10,470
and set the output format
to our desired backend,
72
00:03:10,470 --> 00:03:12,053
Torch, TensorFlow or NumPy.
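
A sketch of those three post-processing steps (the exact columns to drop are an assumption based on the MRPC columns shown earlier):

tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")  # or "numpy", "tf" for the other backends
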
73
00:03:13,440 --> 00:03:16,800
If needed, we can also generate
a small sample of the dataset
74
00:03:16,800 --> 00:03:18,000
using the select method.
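
A sketch of select for a small sample; the size chosen here is arbitrary:

small_train_dataset = tokenized_datasets["train"].select(range(100))
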
75
00:03:20,211 --> 00:03:22,961
(slide whooshes)