1
00:00:00,213 --> 00:00:02,963
(slide whooshes)

2
00:00:05,340 --> 00:00:08,373
- The Hugging Face Datasets library, a quick overview.

3
00:00:09,990 --> 00:00:11,670
The Hugging Face Datasets library

4
00:00:11,670 --> 00:00:14,310
is a library that provides an API to quickly download

5
00:00:14,310 --> 00:00:17,610
many public datasets and preprocess them.

6
00:00:17,610 --> 00:00:20,614
In this video we will explore how to do that.

7
00:00:20,614 --> 00:00:21,780
The downloading part is easy,

8
00:00:21,780 --> 00:00:23,760
with the load_dataset function.

9
00:00:23,760 --> 00:00:26,460
You can directly download and cache a dataset

10
00:00:26,460 --> 00:00:28,473
from its identifier on the Dataset Hub.

11
00:00:29,640 --> 00:00:33,570
Here, we fetch the MRPC dataset from the GLUE benchmark,

12
00:00:33,570 --> 00:00:36,390
which is a dataset containing pairs of sentences

13
00:00:36,390 --> 00:00:38,740
where the task is to determine if they are paraphrases.

14
00:00:39,810 --> 00:00:42,420
The object returned by the load_dataset function

15
00:00:42,420 --> 00:00:45,600
is a DatasetDict, which is a sort of dictionary

16
00:00:45,600 --> 00:00:47,463
containing each split of our dataset.

17
00:00:48,946 --> 00:00:52,170
We can access each split by indexing with its name.

18
00:00:52,170 --> 00:00:55,047
This split is then an instance of the Dataset class,

19
00:00:55,047 --> 00:00:58,590
with columns, here sentence1, sentence2,

20
00:00:58,590 --> 00:01:01,233
label and idx, and rows.

21
00:01:02,400 --> 00:01:04,563
We can access a given element by its index.

22
00:01:05,460 --> 00:01:08,220
The amazing thing about the Hugging Face Datasets library

23
00:01:08,220 --> 00:01:11,880
is that everything is saved to disk using Apache Arrow,

24
00:01:11,880 --> 00:01:14,550
which means that even if your dataset is huge,

25
00:01:14,550 --> 00:01:16,350
you won't run out of RAM.

26
00:01:16,350 --> 00:01:19,113
Only the elements you request are loaded into memory.

27
00:01:20,340 --> 00:01:23,940
Accessing a slice of your dataset is as easy as accessing one element.

28
00:01:23,940 --> 00:01:26,220
The result is then a dictionary with a list of values

29
00:01:26,220 --> 00:01:27,480
for each key.

30
00:01:27,480 --> 00:01:29,070
Here, the list of labels,

31
00:01:29,070 --> 00:01:30,147
the list of first sentences

32
00:01:30,147 --> 00:01:31,923
and the list of second sentences.

33
00:01:33,690 --> 00:01:35,580
The features attribute of a Dataset

34
00:01:35,580 --> 00:01:37,470
gives us more information about its columns.

35
00:01:37,470 --> 00:01:40,020
In particular, we can see here

36
00:01:40,020 --> 00:01:41,400
that it gives us the correspondence

37
00:01:41,400 --> 00:01:44,810
between the integers and the names of the labels.

38
00:01:44,810 --> 00:01:48,543
Zero stands for not equivalent and one for equivalent.
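[The loading and inspection steps above, as a minimal runnable sketch; it assumes the datasets library is installed, and the comments on the print calls are illustrative.]

from datasets import load_dataset

# Download and cache the MRPC dataset from the GLUE benchmark
raw_datasets = load_dataset("glue", "mrpc")

# load_dataset returns a DatasetDict with one Dataset per split
print(raw_datasets)  # train, validation and test splits

# Index a split by its name, then an element by its integer index
raw_train = raw_datasets["train"]
print(raw_train[0])  # one example: sentence1, sentence2, label, idx

# A slice returns a dictionary with a list of values for each key
print(raw_train[:3]["sentence1"])

# The features attribute maps the label integers to their names
print(raw_train.features["label"].names)  # ['not_equivalent', 'equivalent']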
39
00:01:49,830 --> 00:01:52,020
To preprocess all the elements of our dataset,

40
00:01:52,020 --> 00:01:53,850
we need to tokenize them.

41
00:01:53,850 --> 00:01:56,160
Have a look at the video "Preprocess sentence pairs"

42
00:01:56,160 --> 00:01:57,570
for a refresher,

43
00:01:57,570 --> 00:01:59,430
but you just have to send the two sentences

44
00:01:59,430 --> 00:02:02,733
to the tokenizer with some additional keyword arguments.

45
00:02:03,780 --> 00:02:06,600
Here we indicate a maximum length of 128,

46
00:02:06,600 --> 00:02:08,820
pad inputs shorter than this length,

47
00:02:08,820 --> 00:02:10,420
and truncate inputs that are longer.

48
00:02:11,460 --> 00:02:13,470
We put all of this in a tokenize_function

49
00:02:13,470 --> 00:02:16,710
that we can directly apply to all the splits in our dataset

50
00:02:16,710 --> 00:02:17,710
with the map method.

51
00:02:18,840 --> 00:02:22,110
As long as the function returns a dictionary-like object,

52
00:02:22,110 --> 00:02:24,300
the map method will add new columns as needed

53
00:02:24,300 --> 00:02:26,043
or update existing ones.

54
00:02:27,315 --> 00:02:28,830
To speed up preprocessing

55
00:02:28,830 --> 00:02:30,870
and take advantage of the fact that our tokenizer

56
00:02:30,870 --> 00:02:32,040
is backed by Rust,

57
00:02:32,040 --> 00:02:34,770
thanks to the Hugging Face Tokenizers library,

58
00:02:34,770 --> 00:02:37,110
we can pass several elements at the same time

59
00:02:37,110 --> 00:02:40,710
to our tokenize function, using the batched=True argument.

60
00:02:40,710 --> 00:02:42,120
Since the tokenizer can handle

61
00:02:42,120 --> 00:02:44,610
lists of first sentences and lists of second sentences,

62
00:02:44,610 --> 00:02:47,493
the tokenize_function does not need to change for this.

63
00:02:48,360 --> 00:02:51,180
You can also use multiprocessing with the map method.

64
00:02:51,180 --> 00:02:53,583
Check out its documentation in the linked video.

65
00:02:54,840 --> 00:02:57,990
Once this is done, we are almost ready for training.

66
00:02:57,990 --> 00:02:59,970
We just remove the columns we don't need anymore

67
00:02:59,970 --> 00:03:02,190
with the remove_columns method,

68
00:03:02,190 --> 00:03:03,750
rename label to labels,

69
00:03:03,750 --> 00:03:05,790
since the models from the Hugging Face Transformers

70
00:03:05,790 --> 00:03:07,710
library expect that,

71
00:03:07,710 --> 00:03:10,470
and set the output format to our desired backend:

72
00:03:10,470 --> 00:03:12,053
Torch, TensorFlow or NumPy.

73
00:03:13,440 --> 00:03:16,800
If needed, we can also generate a short sample of a dataset

74
00:03:16,800 --> 00:03:18,000
using the select method.

75
00:03:20,211 --> 00:03:22,961
(slide whooshes)
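[The preprocessing pipeline described above, as a minimal sketch; the bert-base-uncased checkpoint is an assumption chosen for illustration, since the video does not name one.]

from datasets import load_dataset
from transformers import AutoTokenizer

raw_datasets = load_dataset("glue", "mrpc")
# Hypothetical checkpoint, used here only for illustration
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(examples):
    # Works on one example or on a batch, since the tokenizer accepts
    # lists of first sentences and lists of second sentences
    return tokenizer(
        examples["sentence1"],
        examples["sentence2"],
        max_length=128,
        padding="max_length",  # pad inputs shorter than 128
        truncation=True,       # truncate inputs that are longer
    )

# map applies the function to every split; batched=True passes several
# elements at a time to the Rust-backed tokenizer
# (adding, e.g., num_proc=4 would also enable multiprocessing)
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

# Drop the columns the model no longer needs, rename label to labels,
# and set the output format to the desired backend
tokenized_datasets = tokenized_datasets.remove_columns(
    ["sentence1", "sentence2", "idx"]
)
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")  # or "tensorflow" / "numpy"

# If needed, generate a short sample with the select method
small_train = tokenized_datasets["train"].select(range(100))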