subtitles/en/36_slice-and-dice-a-dataset-🔪.srt
1
00:00:00,215 --> 00:00:02,882
(air whooshing)
2
00:00:05,760 --> 00:00:07,623
- How to slice and dice a dataset?
3
00:00:08,760 --> 00:00:10,410
Most of the time, the data you work with
4
00:00:10,410 --> 00:00:13,230
won't be perfectly prepared
for training models.
5
00:00:13,230 --> 00:00:15,810
In this video, we'll
explore various features
6
00:00:15,810 --> 00:00:18,660
that the Datasets library
provides to clean up your data.
7
00:00:19,915 --> 00:00:22,500
The Datasets library provides
several built-in methods
8
00:00:22,500 --> 00:00:25,350
that allow you to wrangle
your data in various ways.
9
00:00:25,350 --> 00:00:27,360
In this video, we'll
see how you can shuffle
10
00:00:27,360 --> 00:00:30,750
and split your data, select
the rows you're interested in,
11
00:00:30,750 --> 00:00:32,070
tweak the columns,
12
00:00:32,070 --> 00:00:34,620
and apply processing
functions with the map method.
13
00:00:35,640 --> 00:00:37,620
Let's start with shuffling.
14
00:00:37,620 --> 00:00:38,520
It is generally a good idea
15
00:00:38,520 --> 00:00:40,140
to apply shuffling to your training set
16
00:00:40,140 --> 00:00:41,250
so that your model doesn't learn
17
00:00:41,250 --> 00:00:43,590
any artificial ordering in the data.
18
00:00:43,590 --> 00:00:45,360
If you wanna shuffle the whole dataset,
19
00:00:45,360 --> 00:00:48,390
you can apply the appropriately
named shuffle method.
20
00:00:48,390 --> 00:00:50,730
You can see an example of
this method in action here,
21
00:00:50,730 --> 00:00:52,200
where we've downloaded the training split
22
00:00:52,200 --> 00:00:55,000
of the SQuAD dataset and
shuffled all the rows randomly.
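A minimal sketch of this step with 🤗 Datasets, assuming the SQuAD training split mentioned above (the seed value is an illustrative choice):

from datasets import load_dataset

# Download only the training split of SQuAD
squad = load_dataset("squad", split="train")

# Shuffle all rows; a fixed seed keeps the shuffle reproducible
squad_shuffled = squad.shuffle(seed=0)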
23
00:00:56,880 --> 00:00:58,230
Another way to shuffle the data
24
00:00:58,230 --> 00:01:00,930
is to create random train and test splits.
25
00:01:00,930 --> 00:01:02,280
This can be useful if you have to create
26
00:01:02,280 --> 00:01:04,620
your own test splits from raw data.
27
00:01:04,620 --> 00:01:07,620
To do this, you just apply
the train_test_split method
28
00:01:07,620 --> 00:01:10,740
and specify how large
the test split should be.
29
00:01:10,740 --> 00:01:14,310
In this example, we specify
that the test set should be 10%
30
00:01:14,310 --> 00:01:15,963
of the total dataset size.
31
00:01:16,890 --> 00:01:19,140
You can see that the output
of the train_test_split method
32
00:01:19,140 --> 00:01:20,610
is a DatasetDict object
33
00:01:20,610 --> 00:01:22,743
whose keys correspond to the new splits.
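A minimal sketch, continuing from the SQuAD split loaded above (the 10% test size matches the example):

# Create random train/test splits; the result is a DatasetDict
# with "train" and "test" keys
dataset = squad.train_test_split(test_size=0.1)
print(dataset)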
34
00:01:25,170 --> 00:01:27,210
Now that we know how
to shuffle the dataset,
35
00:01:27,210 --> 00:01:30,060
let's take a look at returning
the rows we're interested in.
36
00:01:30,060 --> 00:01:33,180
The most common way to do this
is with the select method.
37
00:01:33,180 --> 00:01:34,590
This method expects a list
38
00:01:34,590 --> 00:01:36,750
or a generator of the dataset's indices,
39
00:01:36,750 --> 00:01:38,670
and will then return a new dataset object
40
00:01:38,670 --> 00:01:40,143
containing just those rows.
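A minimal sketch of select; the indices are illustrative choices of my own:

# Keep only the rows at these positions; the result is a new Dataset
indices = [0, 10, 20, 40, 80]
sample = squad.select(indices)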
41
00:01:41,490 --> 00:01:43,740
If you wanna create a
random sample of rows,
42
00:01:43,740 --> 00:01:45,360
you can do this by chaining the shuffle
43
00:01:45,360 --> 00:01:47,310
and select methods together.
44
00:01:47,310 --> 00:01:48,450
In this example,
45
00:01:48,450 --> 00:01:50,250
we've created a sample of five elements
46
00:01:50,250 --> 00:01:51,423
from the SQuAD dataset.
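A minimal sketch of the chained call (the seed is an illustrative choice):

# Shuffle first, then take the first five rows of the shuffled dataset
sample = squad.shuffle(seed=42).select(range(5))
print(sample["title"])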
47
00:01:53,550 --> 00:01:56,010
The last way to pick out
specific rows in a dataset
48
00:01:56,010 --> 00:01:58,290
is by applying the filter method.
49
00:01:58,290 --> 00:02:00,120
This method checks whether each row
50
00:02:00,120 --> 00:02:02,310
fulfills some condition or not.
51
00:02:02,310 --> 00:02:05,130
For example, here we've
created a small lambda function
52
00:02:05,130 --> 00:02:08,460
that checks whether the title
starts with the letter L.
53
00:02:08,460 --> 00:02:11,040
Once we apply this function
with the filter method,
54
00:02:11,040 --> 00:02:14,283
we get a subset of the data
just containing these rows.
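A minimal sketch of the filter call described above:

# Keep only rows whose title starts with the letter "L"
squad_filtered = squad.filter(lambda row: row["title"].startswith("L"))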
55
00:02:16,200 --> 00:02:18,600
So far, we've been talking
about the rows of a dataset,
56
00:02:18,600 --> 00:02:20,490
but what about the columns?
57
00:02:20,490 --> 00:02:22,320
The Datasets library has two main methods
58
00:02:22,320 --> 00:02:24,060
for transforming columns,
59
00:02:24,060 --> 00:02:26,760
a rename_column method to
change the name of a column
60
00:02:26,760 --> 00:02:29,460
and a remove_columns
method to delete columns.
61
00:02:29,460 --> 00:02:31,860
You can see examples of
both these methods here.
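A minimal sketch of both column methods; the new column name "passages" and the dropped columns are illustrative assumptions, not taken from the video:

# Rename a single column
squad_renamed = squad.rename_column("context", "passages")

# Drop columns you no longer need
squad_trimmed = squad.remove_columns(["id", "title"])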
62
00:02:34,140 --> 00:02:36,060
Some datasets have nested columns,
63
00:02:36,060 --> 00:02:39,360
and you can expand these by
applying the flatten method.
64
00:02:39,360 --> 00:02:41,430
For example, in the SQuAD dataset,
65
00:02:41,430 --> 00:02:45,150
the answers column contains
text and answer_start fields.
66
00:02:45,150 --> 00:02:47,430
If we wanna promote them to
their own separate columns,
67
00:02:47,430 --> 00:02:49,383
we can apply flatten as shown here.
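A minimal sketch of flatten on SQuAD:

# Promote the nested fields to top-level columns named
# "answers.text" and "answers.answer_start"
squad_flat = squad.flatten()
print(squad_flat.column_names)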
68
00:02:51,300 --> 00:02:53,760
Now of course, no discussion
of the Datasets library
69
00:02:53,760 --> 00:02:56,880
would be complete without
mentioning the famous map method.
70
00:02:56,880 --> 00:02:59,160
This method applies a
custom processing function
71
00:02:59,160 --> 00:03:01,140
to each row in the dataset.
72
00:03:01,140 --> 00:03:03,360
For example, here we first define
73
00:03:03,360 --> 00:03:04,890
a lowercase_title function
74
00:03:04,890 --> 00:03:07,503
that simply lowercases the
text in the title column.
75
00:03:08,640 --> 00:03:11,700
And then we feed that
function to the map method,
76
00:03:11,700 --> 00:03:14,223
and voila, we now have lowercase titles.
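A minimal sketch of the map call; the function name mirrors the one described above:

def lowercase_title(example):
    # Return only the column we want to overwrite
    return {"title": example["title"].lower()}

squad_lowercase = squad.map(lowercase_title)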
77
00:03:16,020 --> 00:03:18,360
The map method can also be
used to feed batches of rows
78
00:03:18,360 --> 00:03:20,100
to the processing function.
79
00:03:20,100 --> 00:03:22,410
This is especially useful for tokenization
80
00:03:22,410 --> 00:03:25,290
where the tokenizer is backed
by the Tokenizers library,
81
00:03:25,290 --> 00:03:26,910
and it can use fast multithreading
82
00:03:26,910 --> 00:03:28,563
to process batches in parallel.
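A minimal sketch of batched tokenization; the checkpoint and the column being tokenized are illustrative assumptions:

from transformers import AutoTokenizer

# A fast tokenizer backed by the Tokenizers library
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_questions(examples):
    # "examples" is a batch (a dict of lists), so the fast tokenizer
    # can process many rows in parallel
    return tokenizer(examples["question"], truncation=True)

squad_tokenized = squad.map(tokenize_questions, batched=True)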
83
00:03:30,056 --> 00:03:32,723
(air whooshing)