subtitles/en/20_hugging-face-datasets-overview-(tensorflow).srt
1
00:00:00,170 --> 00:00:03,087
(screen whooshing)
2
00:00:05,371 --> 00:00:09,690
- The Hugging Face Datasets
library: a quick overview.
3
00:00:09,690 --> 00:00:10,917
The Hugging Face Datasets library
4
00:00:10,917 --> 00:00:12,870
provides an API
5
00:00:12,870 --> 00:00:15,150
to quickly download many public datasets
6
00:00:15,150 --> 00:00:16,200
and pre-process them.
7
00:00:17,070 --> 00:00:19,473
In this video, we will explore how to do that.
8
00:00:20,520 --> 00:00:23,730
The downloading part is easy
with the load_dataset function:
9
00:00:23,730 --> 00:00:26,010
you can directly download
and cache a dataset
10
00:00:26,010 --> 00:00:28,023
from its identifier on the Dataset Hub.
11
00:00:29,160 --> 00:00:33,690
Here we fetch the MRPC dataset
from the GLUE benchmark,
12
00:00:33,690 --> 00:00:36,030
which is a dataset containing pairs of sentences
13
00:00:36,030 --> 00:00:38,380
where the task is to
determine whether they are paraphrases.
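As a minimal sketch of the download step described above (assuming the datasets library is installed; the variable name raw_datasets is just illustrative):

from datasets import load_dataset

# Download and cache MRPC from the GLUE benchmark on the Hub
raw_datasets = load_dataset("glue", "mrpc")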
14
00:00:39,720 --> 00:00:42,120
The object returned by
the load_dataset function
15
00:00:42,120 --> 00:00:45,090
is a DatasetDict, which
is a sort of dictionary
16
00:00:45,090 --> 00:00:46,940
containing each split of our dataset.
17
00:00:48,600 --> 00:00:51,780
We can access each split
by indexing with its name.
18
00:00:51,780 --> 00:00:54,540
This split is then an
instance of the Dataset class,
19
00:00:54,540 --> 00:00:57,423
with columns, here sentence1, sentence2,
20
00:00:58,350 --> 00:01:00,813
label and idx, and rows.
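A quick sketch of inspecting the splits and columns, reusing the illustrative raw_datasets name from above:

print(raw_datasets)  # DatasetDict with train, validation and test splits

# Index the DatasetDict with a split name to get a Dataset
raw_train_dataset = raw_datasets["train"]
print(raw_train_dataset.column_names)  # ['sentence1', 'sentence2', 'label', 'idx']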
21
00:01:02,160 --> 00:01:05,220
We can access a given
element by its index.
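For example (illustrative names, arbitrary index):

print(raw_train_dataset[0])  # a dict with the sentence1, sentence2, label and idx values of row 0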
22
00:01:05,220 --> 00:01:08,220
The amazing thing about the
Hugging Face Datasets library
23
00:01:08,220 --> 00:01:11,700
is that everything is saved
to disk using Apache Arrow,
24
00:01:11,700 --> 00:01:14,460
which means that even
if your dataset is huge
25
00:01:14,460 --> 00:01:16,219
you won't run out of RAM,
26
00:01:16,219 --> 00:01:18,769
only the elements you
request are loaded in memory.
27
00:01:19,920 --> 00:01:24,510
Accessing a slice of your dataset
is as easy as accessing one element.
28
00:01:24,510 --> 00:01:27,150
The result is then a
dictionary with lists of values
29
00:01:27,150 --> 00:01:30,630
for each key, here the list of labels,
30
00:01:30,630 --> 00:01:32,190
the list of first sentences,
31
00:01:32,190 --> 00:01:33,840
and the list of second sentences.
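A small sketch of slicing, under the same assumptions as above:

batch = raw_train_dataset[:5]
print(batch["label"])      # list of the first five labels
print(batch["sentence1"])  # list of the first five first sentences
print(batch["sentence2"])  # list of the first five second sentences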
32
00:01:35,100 --> 00:01:37,080
The features attribute of a Dataset
33
00:01:37,080 --> 00:01:39,840
gives us more information
about its columns.
34
00:01:39,840 --> 00:01:42,150
In particular, we can see here it gives us
35
00:01:42,150 --> 00:01:43,980
a correspondence between the integers
36
00:01:43,980 --> 00:01:46,110
and names for the labels.
37
00:01:46,110 --> 00:01:49,623
0 stands for not equivalent
and 1 for equivalent.
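For instance, inspecting the label feature (illustrative names):

label_feature = raw_train_dataset.features["label"]
print(label_feature.names)       # ['not_equivalent', 'equivalent']
print(label_feature.int2str(1))  # 'equivalent'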
38
00:01:51,630 --> 00:01:54,090
To pre-process all the
elements of our dataset,
39
00:01:54,090 --> 00:01:55,980
we need to tokenize them.
40
00:01:55,980 --> 00:01:58,470
Have a look at the video
"Pre-process sentence pairs"
41
00:01:58,470 --> 00:02:01,800
for a refresher, but you just
have to send the two sentences
42
00:02:01,800 --> 00:02:04,833
to the tokenizer with some
additional keyword arguments.
43
00:02:05,880 --> 00:02:09,300
Here we indicate a maximum length of 128
44
00:02:09,300 --> 00:02:11,460
and pad inputs shorter than this length,
45
00:02:11,460 --> 00:02:13,060
and truncate inputs that are longer.
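A sketch of the tokenization call described here (the checkpoint name is an assumption, not specified in the narration):

from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"  # assumed checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

tokens = tokenizer(
    raw_train_dataset[0]["sentence1"],
    raw_train_dataset[0]["sentence2"],
    padding="max_length",
    truncation=True,
    max_length=128,
)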
46
00:02:14,040 --> 00:02:16,170
We put all of this in a tokenize_function
47
00:02:16,170 --> 00:02:18,510
that we can directly
apply to all the splits
48
00:02:18,510 --> 00:02:20,260
in our dataset with the map method.
49
00:02:21,210 --> 00:02:24,120
As long as the function returns
a dictionary-like object,
50
00:02:24,120 --> 00:02:26,580
the map method will add
new columns as needed
51
00:02:26,580 --> 00:02:28,113
or update existing ones.
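Putting this together as described in the narration (names reused from the sketches above):

def tokenize_function(example):
    # Returns a dict of new columns (input_ids, attention_mask, ...)
    return tokenizer(
        example["sentence1"],
        example["sentence2"],
        padding="max_length",
        truncation=True,
        max_length=128,
    )

tokenized_datasets = raw_datasets.map(tokenize_function)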
52
00:02:30,060 --> 00:02:32,520
To speed up pre-processing
and take advantage
53
00:02:32,520 --> 00:02:35,130
of the fact that our tokenizer
is backed by Rust
54
00:02:35,130 --> 00:02:38,160
thanks to the Hugging
Face Tokenizers library,
55
00:02:38,160 --> 00:02:40,590
we can process several
elements at the same time
56
00:02:40,590 --> 00:02:43,923
in our tokenize function, using
the batched=True argument.
57
00:02:45,300 --> 00:02:46,980
Since the tokenizer can handle a list
58
00:02:46,980 --> 00:02:50,280
of first or second sentences,
the tokenize_function
59
00:02:50,280 --> 00:02:52,740
does not need to change for this.
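The batched version only changes the map call, since the tokenizer also accepts lists of first and second sentences:

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)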
60
00:02:52,740 --> 00:02:55,410
You can also use multiprocessing
with the map method,
61
00:02:55,410 --> 00:02:57,460
check out its documentation linked below.
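For example, map accepts a num_proc argument (the value 4 here is arbitrary):

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True, num_proc=4)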
62
00:02:58,740 --> 00:03:02,130
Once this is done, we are
almost ready for training:
63
00:03:02,130 --> 00:03:04,020
we just remove the columns
we don't need anymore
64
00:03:04,020 --> 00:03:06,120
with the remove_columns method,
65
00:03:06,120 --> 00:03:08,580
rename label to labels, since the models
66
00:03:08,580 --> 00:03:11,430
from the transformers library expect that,
67
00:03:11,430 --> 00:03:14,040
and set the output format
to our desired backend,
68
00:03:14,040 --> 00:03:15,893
torch, tensorflow or numpy.
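A sketch of these three post-processing steps (rename_column and set_format are one way to do the renaming and formatting described here):

tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("tensorflow")  # or "torch" / "numpy"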
69
00:03:16,800 --> 00:03:19,050
If needed, we can also
generate a short sample
70
00:03:19,050 --> 00:03:21,377
of a dataset using the select method.
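For example (the sample size here is arbitrary):

small_train_dataset = tokenized_datasets["train"].select(range(100))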
71
00:03:22,817 --> 00:03:25,734
(screen whooshing)