subtitles/en/20_hugging-face-datasets-overview-(tensorflow).srt

1
00:00:00,170 --> 00:00:03,087
(screen whooshing)

2
00:00:05,371 --> 00:00:09,690
- The Hugging Face Datasets library: A quick overview.

3
00:00:09,690 --> 00:00:10,917
The Hugging Face Datasets library

4
00:00:10,917 --> 00:00:12,870
is a library that provides an API

5
00:00:12,870 --> 00:00:15,150
to quickly download many public datasets

6
00:00:15,150 --> 00:00:16,200
and pre-process them.

7
00:00:17,070 --> 00:00:19,473
In this video we will explore how to do that.

8
00:00:20,520 --> 00:00:23,730
The downloading part is easy: with the load_dataset function,

9
00:00:23,730 --> 00:00:26,010
you can directly download and cache a dataset

10
00:00:26,010 --> 00:00:28,023
from its identifier on the Dataset Hub.

11
00:00:29,160 --> 00:00:33,690
Here we fetch the MRPC dataset from the GLUE benchmark,

12
00:00:33,690 --> 00:00:36,030
which is a dataset containing pairs of sentences

13
00:00:36,030 --> 00:00:38,380
where the task is to determine if they are paraphrases.

14
00:00:39,720 --> 00:00:42,120
The object returned by the load_dataset function

15
00:00:42,120 --> 00:00:45,090
is a DatasetDict, which is a sort of dictionary

16
00:00:45,090 --> 00:00:46,940
containing each split of our dataset.

17
00:00:48,600 --> 00:00:51,780
We can access each split by indexing with its name.

18
00:00:51,780 --> 00:00:54,540
This split is then an instance of the Dataset class,

19
00:00:54,540 --> 00:00:57,423
with columns, here sentence1, sentence2,

20
00:00:58,350 --> 00:01:00,813
label and idx, and rows.

21
00:01:02,160 --> 00:01:05,220
We can access a given element by its index.

22
00:01:05,220 --> 00:01:08,220
The amazing thing about the Hugging Face Datasets library

23
00:01:08,220 --> 00:01:11,700
is that everything is saved to disk using Apache Arrow,

24
00:01:11,700 --> 00:01:14,460
which means that even if your dataset is huge,

25
00:01:14,460 --> 00:01:16,219
you won't run out of RAM:

26
00:01:16,219 --> 00:01:18,769
only the elements you request are loaded in memory.

27
00:01:19,920 --> 00:01:24,510
Accessing a slice of your dataset is as easy as accessing one element.

28
00:01:24,510 --> 00:01:27,150
The result is then a dictionary with a list of values

29
00:01:27,150 --> 00:01:30,630
for each key: here the list of labels,

30
00:01:30,630 --> 00:01:32,190
the list of first sentences,

31
00:01:32,190 --> 00:01:33,840
and the list of second sentences.

32
00:01:35,100 --> 00:01:37,080
The features attribute of a Dataset

33
00:01:37,080 --> 00:01:39,840
gives us more information about its columns.

34
00:01:39,840 --> 00:01:42,150
In particular, we can see here that it gives us

35
00:01:42,150 --> 00:01:43,980
the correspondence between the integers

36
00:01:43,980 --> 00:01:46,110
and the names for the labels:

37
00:01:46,110 --> 00:01:49,623
0 stands for not equivalent and 1 for equivalent.

38
00:01:51,630 --> 00:01:54,090
To pre-process all the elements of our dataset,

39
00:01:54,090 --> 00:01:55,980
we need to tokenize them.

40
00:01:55,980 --> 00:01:58,470
Have a look at the video "Pre-process sentence pairs"

41
00:01:58,470 --> 00:02:01,800
for a refresher, but you just have to send the two sentences

42
00:02:01,800 --> 00:02:04,833
to the tokenizer with some additional keyword arguments.

43
00:02:05,880 --> 00:02:09,300
Here we indicate a maximum length of 128,

44
00:02:09,300 --> 00:02:11,460
pad inputs shorter than this length,

45
00:02:11,460 --> 00:02:13,060
and truncate inputs that are longer.
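
The steps described up to this point can be sketched in a few lines of Python. This is a minimal, illustrative sketch rather than the exact code shown on screen; in particular, the bert-base-uncased checkpoint name is an assumption.

from datasets import load_dataset
from transformers import AutoTokenizer

# Download and cache MRPC from the GLUE benchmark; the result is a DatasetDict
# with one Dataset per split (train, validation, test).
raw_datasets = load_dataset("glue", "mrpc")

print(raw_datasets["train"][0])        # one element: sentence1, sentence2, label, idx
print(raw_datasets["train"][:5])       # a slice: a dict with a list of values per key
print(raw_datasets["train"].features)  # label is a ClassLabel: 0 = not_equivalent, 1 = equivalent

# The checkpoint name is illustrative; any compatible tokenizer works.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(example):
    # Pad to, and truncate at, a maximum length of 128.
    return tokenizer(
        example["sentence1"],
        example["sentence2"],
        padding="max_length",
        truncation=True,
        max_length=128,
    )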
46
00:02:14,040 --> 00:02:16,170
We put all of this in a tokenize_function

47
00:02:16,170 --> 00:02:18,510
that we can directly apply to all the splits

48
00:02:18,510 --> 00:02:20,260
in our dataset with the map method.

49
00:02:21,210 --> 00:02:24,120
As long as the function returns a dictionary-like object,

50
00:02:24,120 --> 00:02:26,580
the map method will add new columns as needed

51
00:02:26,580 --> 00:02:28,113
or update existing ones.

52
00:02:30,060 --> 00:02:32,520
To speed up pre-processing and take advantage

53
00:02:32,520 --> 00:02:35,130
of the fact that our tokenizer is backed by Rust,

54
00:02:35,130 --> 00:02:38,160
thanks to the Hugging Face Tokenizers library,

55
00:02:38,160 --> 00:02:40,590
we can process several elements at the same time

56
00:02:40,590 --> 00:02:43,923
in our tokenize function, using the batched=True argument.

57
00:02:45,300 --> 00:02:46,980
Since the tokenizer can handle a list

58
00:02:46,980 --> 00:02:50,280
of first or second sentences, the tokenize_function

59
00:02:50,280 --> 00:02:52,740
does not need to change for this.

60
00:02:52,740 --> 00:02:55,410
You can also use multiprocessing with the map method;

61
00:02:55,410 --> 00:02:57,460
check out its documentation linked below.

62
00:02:58,740 --> 00:03:02,130
Once this is done, we are almost ready for training:

63
00:03:02,130 --> 00:03:04,020
we just remove the columns we don't need anymore

64
00:03:04,020 --> 00:03:06,120
with the remove_columns method,

65
00:03:06,120 --> 00:03:08,580
rename label to labels, since the models

66
00:03:08,580 --> 00:03:11,430
from the Transformers library expect that,

67
00:03:11,430 --> 00:03:14,040
and set the output format to our desired backend:

68
00:03:14,040 --> 00:03:15,893
torch, tensorflow or numpy.

69
00:03:16,800 --> 00:03:19,050
If needed, we can also generate a short sample

70
00:03:19,050 --> 00:03:21,377
of our dataset using the select method.

71
00:03:22,817 --> 00:03:25,734
(screen whooshing)
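
As a rough sketch of the post-processing just described, continuing from the tokenize_function above: the column names follow MRPC, and "tensorflow" is the backend this version of the video targets.

# batched=True lets the Rust-backed fast tokenizer process many elements at once.
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

# Drop the raw text columns, rename "label" to "labels" as Transformers models expect,
# and ask for TensorFlow tensors when rows are accessed.
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("tensorflow")

# If needed, take a short sample with select, e.g. the first 100 training examples.
small_train_dataset = tokenized_datasets["train"].select(range(100))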