subtitles/en/19_hugging-face-datasets-overview-(pytorch).srt

1
00:00:00,213 --> 00:00:02,963
(slide whooshes)

2
00:00:05,340 --> 00:00:08,373
- The Hugging Face Datasets library, a quick overview.

3
00:00:09,990 --> 00:00:11,670
The Hugging Face Datasets library

4
00:00:11,670 --> 00:00:14,310
is a library that provides an API to quickly download

5
00:00:14,310 --> 00:00:17,610
many public datasets and preprocess them.

6
00:00:17,610 --> 00:00:20,614
In this video, we will explore how to do that.

7
00:00:20,614 --> 00:00:21,780
The downloading part is easy,

8
00:00:21,780 --> 00:00:23,760
with the load_dataset function.

9
00:00:23,760 --> 00:00:26,460
You can directly download and cache a dataset

10
00:00:26,460 --> 00:00:28,473
from its identifier on the Dataset Hub.

11
00:00:29,640 --> 00:00:33,570
Here, we fetch the MRPC dataset from the GLUE benchmark,

12
00:00:33,570 --> 00:00:36,390
which is a dataset containing pairs of sentences

13
00:00:36,390 --> 00:00:38,740
where the task is to determine whether they are paraphrases.

14
00:00:39,810 --> 00:00:42,420
The object returned by the load_dataset function

15
00:00:42,420 --> 00:00:45,600
is a DatasetDict, which is a sort of dictionary

16
00:00:45,600 --> 00:00:47,463
containing each split of our dataset.

17
00:00:48,946 --> 00:00:52,170
We can access each split by indexing with its name.

18
00:00:52,170 --> 00:00:55,047
This split is then an instance of the Dataset class,

19
00:00:55,047 --> 00:00:58,590
with columns, here sentence1, sentence2,

20
00:00:58,590 --> 00:01:01,233
label and idx, and rows.

21
00:01:02,400 --> 00:01:04,563
We can access a given element by its index.

22
00:01:05,460 --> 00:01:08,220
The amazing thing about the Hugging Face Datasets library

23
00:01:08,220 --> 00:01:11,880
is that everything is saved to disk using Apache Arrow,

24
00:01:11,880 --> 00:01:14,550
which means that even if your dataset is huge,

25
00:01:14,550 --> 00:01:16,350
you won't run out of RAM.

26
00:01:16,350 --> 00:01:19,113
Only the elements you request are loaded in memory.

27
00:01:20,340 --> 00:01:23,940
Accessing a slice of your dataset is as easy as accessing one element.

28
00:01:23,940 --> 00:01:26,220
The result is then a dictionary with a list of values

29
00:01:26,220 --> 00:01:27,480
for each key.

30
00:01:27,480 --> 00:01:29,070
Here, the list of labels,

31
00:01:29,070 --> 00:01:30,147
the list of first sentences

32
00:01:30,147 --> 00:01:31,923
and the list of second sentences.

33
00:01:33,690 --> 00:01:35,580
The features attribute of a Dataset

34
00:01:35,580 --> 00:01:37,470
gives us more information about its columns.

35
00:01:37,470 --> 00:01:40,020
In particular, we can see here

36
00:01:40,020 --> 00:01:41,400
it gives us the correspondence

37
00:01:41,400 --> 00:01:44,810
between the integers and names for the labels.

38
00:01:44,810 --> 00:01:48,543
Zero stands for not equivalent and one for equivalent.

39
00:01:49,830 --> 00:01:52,020
To preprocess all the elements of our dataset,

40
00:01:52,020 --> 00:01:53,850
we need to tokenize them.

41
00:01:53,850 --> 00:01:56,160
Have a look at the video "Preprocess sentence pairs"

42
00:01:56,160 --> 00:01:57,570
for a refresher,

43
00:01:57,570 --> 00:01:59,430
but you just have to send the two sentences

44
00:01:59,430 --> 00:02:02,733
to the tokenizer with some additional keyword arguments.

45
00:02:03,780 --> 00:02:06,600
Here we indicate a maximum length of 128,

46
00:02:06,600 --> 00:02:08,820
pad inputs shorter than this length,

47
00:02:08,820 --> 00:02:10,420
and truncate inputs that are longer.
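
[Editor's sketch] A minimal example of the loading and tokenization steps described so far, assuming the datasets and transformers libraries are installed; the bert-base-uncased checkpoint is only an illustrative choice and is not named in the video.

from datasets import load_dataset
from transformers import AutoTokenizer

# Download and cache the MRPC dataset from the GLUE benchmark.
raw_datasets = load_dataset("glue", "mrpc")

# Index a split by name, then an element by its position.
print(raw_datasets["train"][0])

# A slice returns a dictionary with a list of values for each key.
print(raw_datasets["train"][:3]["sentence1"])

# The features attribute maps the label integers to their names.
print(raw_datasets["train"].features["label"].names)

# Illustrative checkpoint (assumption, not specified in the transcript).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(example):
    # Pad inputs to 128 tokens and truncate anything longer.
    return tokenizer(
        example["sentence1"],
        example["sentence2"],
        padding="max_length",
        truncation=True,
        max_length=128,
    )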
48
00:02:11,460 --> 00:02:13,470
We put all of this in a tokenize_function

49
00:02:13,470 --> 00:02:16,710
that we can directly apply to all the splits in our dataset

50
00:02:16,710 --> 00:02:17,710
with the map method.

51
00:02:18,840 --> 00:02:22,110
As long as the function returns a dictionary-like object,

52
00:02:22,110 --> 00:02:24,300
the map method will add new columns as needed

53
00:02:24,300 --> 00:02:26,043
or update existing ones.

54
00:02:27,315 --> 00:02:28,830
To speed up preprocessing

55
00:02:28,830 --> 00:02:30,870
and take advantage of the fact that our tokenizer

56
00:02:30,870 --> 00:02:32,040
is backed by Rust,

57
00:02:32,040 --> 00:02:34,770
thanks to the Hugging Face Tokenizers library,

58
00:02:34,770 --> 00:02:37,110
we can pass several elements at the same time

59
00:02:37,110 --> 00:02:40,710
to our tokenize_function, using the batched=True argument.

60
00:02:40,710 --> 00:02:42,120
Since the tokenizer can handle

61
00:02:42,120 --> 00:02:44,610
lists of first sentences and lists of second sentences,

62
00:02:44,610 --> 00:02:47,493
the tokenize_function does not need to change for this.

63
00:02:48,360 --> 00:02:51,180
You can also use multiprocessing with the map method.

64
00:02:51,180 --> 00:02:53,583
Check out its documentation in the linked video.

65
00:02:54,840 --> 00:02:57,990
Once this is done, we are almost ready for training.

66
00:02:57,990 --> 00:02:59,970
We just remove the columns we don't need anymore

67
00:02:59,970 --> 00:03:02,190
with the remove_columns method,

68
00:03:02,190 --> 00:03:03,750
rename label to labels,

69
00:03:03,750 --> 00:03:05,790
since the models from the Hugging Face Transformers

70
00:03:05,790 --> 00:03:07,710
library expect that,

71
00:03:07,710 --> 00:03:10,470
and set the output format to our desired backend:

72
00:03:10,470 --> 00:03:12,053
Torch, TensorFlow or NumPy.

73
00:03:13,440 --> 00:03:16,800
If needed, we can also generate a short sample of the dataset

74
00:03:16,800 --> 00:03:18,000
using the select method.

75
00:03:20,211 --> 00:03:22,961
(slide whooshes)
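
[Editor's sketch] A minimal example of these post-processing steps, continuing from the raw_datasets and tokenize_function defined in the previous sketch; the column names follow the MRPC example above.

# Apply the tokenizer to every split, several elements at a time.
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

# Drop the raw text columns, rename "label" to "labels",
# and set the output format to PyTorch tensors.
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

# Optionally keep only a small sample, e.g. for a quick debugging run.
small_train_dataset = tokenized_datasets["train"].select(range(100))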