subtitles/en/36_slice-and-dice-a-dataset-🔪.srt
1
00:00:00,215 --> 00:00:02,882
(air whooshing)
2
00:00:05,760 --> 00:00:07,623
- How to slice and dice a dataset?
3
00:00:08,760 --> 00:00:10,410
Most of the time, the data you work with
4
00:00:10,410 --> 00:00:13,230
won't be perfectly prepared
for training models.
5
00:00:13,230 --> 00:00:15,810
In this video, we'll
explore various features
6
00:00:15,810 --> 00:00:18,660
that the Datasets library
provides to clean up your data.
7
00:00:19,915 --> 00:00:22,500
The Datasets library provides
several built-in methods
8
00:00:22,500 --> 00:00:25,350
that allow you to wrangle
your data in various ways.
9
00:00:25,350 --> 00:00:27,360
In this video, we'll
see how you can shuffle
10
00:00:27,360 --> 00:00:30,750
and split your data, select
the rows you're interested in,
11
00:00:30,750 --> 00:00:32,070
tweak the columns,
12
00:00:32,070 --> 00:00:34,620
and apply processing
functions with the map method.
13
00:00:35,640 --> 00:00:37,620
Let's start with shuffling.
14
00:00:37,620 --> 00:00:38,520
It is generally a good idea
15
00:00:38,520 --> 00:00:40,140
to apply shuffling to your training set
16
00:00:40,140 --> 00:00:41,250
so that your model doesn't learn
17
00:00:41,250 --> 00:00:43,590
any artificial ordering in the data.
18
00:00:43,590 --> 00:00:45,360
If you wanna shuffle the whole dataset,
19
00:00:45,360 --> 00:00:48,390
you can apply the appropriately
named shuffle method.
20
00:00:48,390 --> 00:00:50,730
You can see an example of
this method in action here,
21
00:00:50,730 --> 00:00:52,200
where we've downloaded the training split
22
00:00:52,200 --> 00:00:55,000
of the SQuAD dataset and
shuffled all the rows randomly.
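A minimal sketch of this step with 🤗 Datasets, assuming the SQuAD training split mentioned above (the seed value is an illustrative choice):

from datasets import load_dataset

# Download only the training split of SQuAD
squad = load_dataset("squad", split="train")

# Shuffle all rows; a fixed seed keeps the shuffle reproducible
squad_shuffled = squad.shuffle(seed=0)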
23
00:00:56,880 --> 00:00:58,230
Another way to shuffle the data
24
00:00:58,230 --> 00:01:00,930
is to create random train and test splits.
25
00:01:00,930 --> 00:01:02,280
This can be useful if you have to create
26
00:01:02,280 --> 00:01:04,620
your own test splits from raw data.
27
00:01:04,620 --> 00:01:07,620
To do this, you just apply
the train_test_split method
28
00:01:07,620 --> 00:01:10,740
and specify how large
the test split should be.
29
00:01:10,740 --> 00:01:14,310
In this example, we specify
that the test set should be 10%
30
00:01:14,310 --> 00:01:15,963
of the total dataset size.
31
00:01:16,890 --> 00:01:19,140
You can see that the output
of the train_test_split method
32
00:01:19,140 --> 00:01:20,610
is a DatasetDict object
33
00:01:20,610 --> 00:01:22,743
whose keys correspond to the new splits.
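A minimal sketch, continuing from the SQuAD split loaded above (the 10% test size matches the example):

# Create random train/test splits; the result is a DatasetDict
# with "train" and "test" keys
dataset = squad.train_test_split(test_size=0.1)
print(dataset)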
34
00:01:25,170 --> 00:01:27,210
Now that we know how
to shuffle the dataset,
35
00:01:27,210 --> 00:01:30,060
let's take a look at returning
the rows we're interested in.
36
00:01:30,060 --> 00:01:33,180
The most common way to do this
is with the select method.
37
00:01:33,180 --> 00:01:34,590
This method expects a list
38
00:01:34,590 --> 00:01:36,750
or a generator of the dataset's indices,
39
00:01:36,750 --> 00:01:38,670
and will then return a new dataset object
40
00:01:38,670 --> 00:01:40,143
containing just those rows.
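A minimal sketch of select; the indices are illustrative choices of my own:

# Keep only the rows at these positions; the result is a new Dataset
indices = [0, 10, 20, 40, 80]
sample = squad.select(indices)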
41
00:01:41,490 --> 00:01:43,740
If you wanna create a
random sample of rows,
42
00:01:43,740 --> 00:01:45,360
you can do this by chaining the shuffle
43
00:01:45,360 --> 00:01:47,310
and select methods together.
44
00:01:47,310 --> 00:01:48,450
In this example,
45
00:01:48,450 --> 00:01:50,250
we've created a sample of five elements
46
00:01:50,250 --> 00:01:51,423
from the SQuAD dataset.
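A minimal sketch of the chained call (the seed is an illustrative choice):

# Shuffle first, then take the first five rows of the shuffled dataset
sample = squad.shuffle(seed=42).select(range(5))
print(sample["title"])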
47
00:01:53,550 --> 00:01:56,010
The last way to pick out
specific rows in a dataset
48
00:01:56,010 --> 00:01:58,290
is by applying the filter method.
49
00:01:58,290 --> 00:02:00,120
This method checks whether each row
50
00:02:00,120 --> 00:02:02,310
fulfills some condition or not.
51
00:02:02,310 --> 00:02:05,130
For example, here we've
created a small lambda function
52
00:02:05,130 --> 00:02:08,460
that checks whether the title
starts with the letter L.
53
00:02:08,460 --> 00:02:11,040
Once we apply this function
with the filter method,
54
00:02:11,040 --> 00:02:14,283
we get a subset of the data
just containing these rows.
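A minimal sketch of the filter call described above:

# Keep only rows whose title starts with the letter "L"
squad_filtered = squad.filter(lambda row: row["title"].startswith("L"))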
55
00:02:16,200 --> 00:02:18,600
So far, we've been talking
about the rows of a dataset,
56
00:02:18,600 --> 00:02:20,490
but what about the columns?
57
00:02:20,490 --> 00:02:22,320
The Datasets library has two main methods
58
00:02:22,320 --> 00:02:24,060
for transforming columns,
59
00:02:24,060 --> 00:02:26,760
a rename_column method to
change the name of a column
60
00:02:26,760 --> 00:02:29,460
and a remove_columns
method to delete columns.
61
00:02:29,460 --> 00:02:31,860
You can see examples of
both these methods here.
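A minimal sketch of both column methods; the new column name "passages" and the dropped columns are illustrative assumptions, not taken from the video:

# Rename a single column
squad_renamed = squad.rename_column("context", "passages")

# Drop columns you no longer need
squad_trimmed = squad.remove_columns(["id", "title"])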
62
00:02:34,140 --> 00:02:36,060
Some datasets have nested columns,
63
00:02:36,060 --> 00:02:39,360
and you can expand these by
applying the flatten method.
64
00:02:39,360 --> 00:02:41,430
For example, in the SQuAD dataset,
65
00:02:41,430 --> 00:02:45,150
the answers column contains
text and answer_start fields.
66
00:02:45,150 --> 00:02:47,430
If we wanna promote them to
their own separate columns,
67
00:02:47,430 --> 00:02:49,383
we can apply flatten as shown here.
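A minimal sketch of flatten on SQuAD:

# Promote the nested fields to top-level columns named
# "answers.text" and "answers.answer_start"
squad_flat = squad.flatten()
print(squad_flat.column_names)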
68
00:02:51,300 --> 00:02:53,760
Now of course, no discussion
of the Datasets library
69
00:02:53,760 --> 00:02:56,880
would be complete without
mentioning the famous map method.
70
00:02:56,880 --> 00:02:59,160
This method applies a
custom processing function
71
00:02:59,160 --> 00:03:01,140
to each row in the dataset.
72
00:03:01,140 --> 00:03:03,360
For example, here we first define
73
00:03:03,360 --> 00:03:04,890
a lowercase_title function
74
00:03:04,890 --> 00:03:07,503
that simply lowercases the
text in the title column.
75
00:03:08,640 --> 00:03:11,700
And then we feed that
function to the map method,
76
00:03:11,700 --> 00:03:14,223
and voila, we now have lowercase titles.
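A minimal sketch of the map call; the function name mirrors the one described above:

def lowercase_title(example):
    # Return only the column we want to overwrite
    return {"title": example["title"].lower()}

squad_lowercase = squad.map(lowercase_title)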
77
00:03:16,020 --> 00:03:18,360
The map method can also be
used to feed batches of rows
78
00:03:18,360 --> 00:03:20,100
to the processing function.
79
00:03:20,100 --> 00:03:22,410
This is especially useful for tokenization
80
00:03:22,410 --> 00:03:25,290
where the tokenizer is backed
by the Tokenizers library,
81
00:03:25,290 --> 00:03:26,910
and it can use fast multithreading
82
00:03:26,910 --> 00:03:28,563
to process batches in parallel.
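A minimal sketch of batched tokenization; the checkpoint and the column being tokenized are illustrative assumptions:

from transformers import AutoTokenizer

# A fast tokenizer backed by the Tokenizers library
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_questions(examples):
    # "examples" is a batch (a dict of lists), so the fast tokenizer
    # can process many rows in parallel
    return tokenizer(examples["question"], truncation=True)

squad_tokenized = squad.map(tokenize_questions, batched=True)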
83
00:03:30,056 --> 00:03:32,723
(air whooshing)