1
00:00:00,215 --> 00:00:02,882
(air whooshing)

2
00:00:05,760 --> 00:00:07,623
- How to slice and dice a dataset.

3
00:00:08,760 --> 00:00:10,410
Most of the time, the data you work with

4
00:00:10,410 --> 00:00:13,230
won't be perfectly prepared
for training models.

5
00:00:13,230 --> 00:00:15,810
In this video, we'll
explore various features

6
00:00:15,810 --> 00:00:18,660
that the datasets library
provides to clean up your data.

7
00:00:19,915 --> 00:00:22,500
The datasets library provides
several built-in methods

8
00:00:22,500 --> 00:00:25,350
that allow you to wrangle
your data in various ways.

9
00:00:25,350 --> 00:00:27,360
In this video, we'll
see how you can shuffle

10
00:00:27,360 --> 00:00:30,750
and split your data, select
the rows you're interested in,

11
00:00:30,750 --> 00:00:32,070
tweak the columns,

12
00:00:32,070 --> 00:00:34,620
and apply processing
functions with the map method.

13
00:00:35,640 --> 00:00:37,620
Let's start with shuffling.

14
00:00:37,620 --> 00:00:38,520
It is generally a good idea

15
00:00:38,520 --> 00:00:40,140
to apply shuffling to your training set

16
00:00:40,140 --> 00:00:41,250
so that your model doesn't learn

17
00:00:41,250 --> 00:00:43,590
any artificial ordering in the data.

18
00:00:43,590 --> 00:00:45,360
If you want to shuffle the whole dataset,

19
00:00:45,360 --> 00:00:48,390
you can apply the appropriately
named shuffle method.

20
00:00:48,390 --> 00:00:50,730
You can see an example of
this method in action here,

21
00:00:50,730 --> 00:00:52,200
where we've downloaded the training split

22
00:00:52,200 --> 00:00:55,000
of the SQuAD dataset and
shuffled all the rows randomly.

23
00:00:56,880 --> 00:00:58,230
Another way to shuffle the data

24
00:00:58,230 --> 00:01:00,930
is to create random train and test splits.

25
00:01:00,930 --> 00:01:02,280
This can be useful if you have to create

26
00:01:02,280 --> 00:01:04,620
your own test splits from raw data.

27
00:01:04,620 --> 00:01:07,620
To do this, you just apply
the train_test_split method

28
00:01:07,620 --> 00:01:10,740
and specify how large
the test split should be.

29
00:01:10,740 --> 00:01:14,310
In this example, we specify
that the test set should be 10%

30
00:01:14,310 --> 00:01:15,963
of the total dataset size.

31
00:01:16,890 --> 00:01:19,140
You can see that the output
of the train_test_split method

32
00:01:19,140 --> 00:01:20,610
is a DatasetDict object

33
00:01:20,610 --> 00:01:22,743
whose keys correspond to the new splits.

34
00:01:25,170 --> 00:01:27,210
Now that we know how
to shuffle the dataset,

35
00:01:27,210 --> 00:01:30,060
let's take a look at returning
the rows we're interested in.

36
00:01:30,060 --> 00:01:33,180
The most common way to do this
is with the select method.

37
00:01:33,180 --> 00:01:34,590
This method expects a list

38
00:01:34,590 --> 00:01:36,750
or a generator of the dataset's indices,

39
00:01:36,750 --> 00:01:38,670
and will then return a new dataset object

40
00:01:38,670 --> 00:01:40,143
containing just those rows.

41
00:01:41,490 --> 00:01:43,740
If you want to create a
random sample of rows,

42
00:01:43,740 --> 00:01:45,360
you can do this by chaining the shuffle

43
00:01:45,360 --> 00:01:47,310
and select methods together.

44
00:01:47,310 --> 00:01:48,450
In this example,

45
00:01:48,450 --> 00:01:50,250
we've created a sample of five elements

46
00:01:50,250 --> 00:01:51,423
from the SQuAD dataset.

47
00:01:53,550 --> 00:01:56,010
The last way to pick out
specific rows in a dataset

48
00:01:56,010 --> 00:01:58,290
is by applying the filter method.

49
00:01:58,290 --> 00:02:00,120
This method checks whether each row

50
00:02:00,120 --> 00:02:02,310
fulfills some condition or not.

51
00:02:02,310 --> 00:02:05,130
For example, here we've
created a small lambda function

52
00:02:05,130 --> 00:02:08,460
that checks whether the title
starts with the letter L.

53
00:02:08,460 --> 00:02:11,040
Once we apply this function
with the filter method,

54
00:02:11,040 --> 00:02:14,283
we get a subset of the data
containing just these rows.

55
00:02:16,200 --> 00:02:18,600
So far, we've been talking
about the rows of a dataset,

56
00:02:18,600 --> 00:02:20,490
but what about the columns?

57
00:02:20,490 --> 00:02:22,320
The datasets library has two main methods

58
00:02:22,320 --> 00:02:24,060
for transforming columns,

59
00:02:24,060 --> 00:02:26,760
a rename_column method to
change the name of the column

60
00:02:26,760 --> 00:02:29,460
and a remove_columns
method to delete them.

61
00:02:29,460 --> 00:02:31,860
You can see examples of
both these methods here.

62
00:02:34,140 --> 00:02:36,060
Some datasets have nested columns,

63
00:02:36,060 --> 00:02:39,360
and you can expand these by
applying the flatten method.

64
00:02:39,360 --> 00:02:41,430
For example, in the SQuAD dataset,

65
00:02:41,430 --> 00:02:45,150
the answers column contains
text and answer_start fields.

66
00:02:45,150 --> 00:02:47,430
If we want to promote them to
their own separate columns,

67
00:02:47,430 --> 00:02:49,383
we can apply flatten as shown here.

68
00:02:51,300 --> 00:02:53,760
Now of course, no discussion
of the datasets library

69
00:02:53,760 --> 00:02:56,880
would be complete without
mentioning the famous map method.

70
00:02:56,880 --> 00:02:59,160
This method applies a
custom processing function

71
00:02:59,160 --> 00:03:01,140
to each row in the dataset.

72
00:03:01,140 --> 00:03:03,360
For example, here we first define

73
00:03:03,360 --> 00:03:04,890
a lowercase title function

74
00:03:04,890 --> 00:03:07,503
that simply lowercases the
text in the title column.

75
00:03:08,640 --> 00:03:11,700
And then we feed that
function to the map method,

76
00:03:11,700 --> 00:03:14,223
and voilà, we now have lowercase titles.

77
00:03:16,020 --> 00:03:18,360
The map method can also be
used to feed batches of rows

78
00:03:18,360 --> 00:03:20,100
to the processing function.

79
00:03:20,100 --> 00:03:22,410
This is especially useful for tokenization

80
00:03:22,410 --> 00:03:25,290
where the tokenizer is backed
by the Tokenizers library,

81
00:03:25,290 --> 00:03:26,910
which can use fast multithreading

82
00:03:26,910 --> 00:03:28,563
to process batches in parallel.

83
00:03:30,056 --> 00:03:32,723
(air whooshing)