subtitles/zh-CN/20_hugging-face-datasets-overview-(tensorflow).srt

1
00:00:00,170 --> 00:00:03,087
(屏幕呼啸)
(screen whooshing)

2
00:00:05,371 --> 00:00:09,690
- Hugging Face Datasets 库:快速概览。
- The Hugging Face Datasets library: A quick overview.

3
00:00:09,690 --> 00:00:10,917
Hugging Face Datasets 库
The Hugging Face Datasets library

4
00:00:10,917 --> 00:00:12,870
是一个提供 API 的库,
is a library that provides an API

5
00:00:12,870 --> 00:00:15,150
可以快速下载许多公共数据集
to quickly download many public datasets

6
00:00:15,150 --> 00:00:16,200
并对它们进行预处理。
and pre-process them.

7
00:00:17,070 --> 00:00:19,473
在本视频中,我们将探索如何做到这一点。
In this video we will explore how to do that.

8
00:00:20,520 --> 00:00:23,730
使用 load_dataset 函数,下载部分很容易:
The downloading part is easy with the load_dataset function:

9
00:00:23,730 --> 00:00:26,010
你可以直接下载并缓存数据集,
you can directly download and cache a dataset

10
00:00:26,010 --> 00:00:28,023
通过其在 Dataset Hub 上的标识符。
from its identifier on the Dataset Hub.

11
00:00:29,160 --> 00:00:33,690
这里我们从 GLUE benchmark 中获取 MRPC 数据集,
Here we fetch the MRPC dataset from the GLUE benchmark,

12
00:00:33,690 --> 00:00:36,030
这是一个包含句子对的数据集,
which is a dataset containing pairs of sentences

13
00:00:36,030 --> 00:00:38,380
任务是判断它们是否互为释义。
where the task is to determine whether they are paraphrases.

14
00:00:39,720 --> 00:00:42,120
load_dataset 函数返回的对象
The object returned by the load_dataset function

15
00:00:42,120 --> 00:00:45,090
是一个 DatasetDict,它是一种字典,
is a DatasetDict, which is a sort of dictionary

16
00:00:45,090 --> 00:00:46,940
包含我们数据集的每个拆分。
containing each split of our dataset.

17
00:00:48,600 --> 00:00:51,780
我们可以用拆分的名称进行索引来访问它。
We can access each split by indexing with its name.

18
00:00:51,780 --> 00:00:54,540
这样得到的拆分是 Dataset 类的一个实例,
This split is then an instance of the Dataset class,

19
00:00:54,540 --> 00:00:57,423
它有列,这里是 sentence1、sentence2、
with columns, here sentence1, sentence2,

20
00:00:58,350 --> 00:01:00,813
label 和 idx,还有行。
label and idx, and rows.

21
00:01:02,160 --> 00:01:05,220
我们可以通过索引访问给定的元素。
We can access a given element by its index.

22
00:01:05,220 --> 00:01:08,220
Hugging Face Datasets 库的神奇之处
The amazing thing about the Hugging Face Datasets library

23
00:01:08,220 --> 00:01:11,700
在于所有内容都使用 Apache Arrow 保存到磁盘,
is that everything is saved to disk using Apache Arrow,

24
00:01:11,700 --> 00:01:14,460
这意味着即使你的数据集很大,
which means that even if your dataset is huge,

25
00:01:14,460 --> 00:01:16,219
你也不会耗尽内存:
you won't run out of RAM:

26
00:01:16,219 --> 00:01:18,769
只有你请求的元素才会加载到内存中。
only the elements you request are loaded in memory.

27
00:01:19,920 --> 00:01:24,510
访问数据集的一个切片就像访问一个元素一样简单。
Accessing a slice of your dataset is as easy as accessing one element.

28
00:01:24,510 --> 00:01:27,150
结果是一个字典,其中每个键
The result is then a dictionary with a list of values

29
00:01:27,150 --> 00:01:30,630
对应一个值列表:这里是标签列表、
for each key: here the list of labels,

30
00:01:30,630 --> 00:01:32,190
第一句的列表,
the list of first sentences,

31
00:01:32,190 --> 00:01:33,840
和第二句的列表。
and the list of second sentences.

32
00:01:35,100 --> 00:01:37,080
Dataset 的 features 属性
The features attribute of a Dataset

33
00:01:37,080 --> 00:01:39,840
为我们提供有关其列的更多信息。
gives us more information about its columns.

34
00:01:39,840 --> 00:01:42,150
特别是,我们可以在这里看到它给了我们
In particular, we can see here it gives us

35
00:01:42,150 --> 00:01:43,980
整数和标签名称
a correspondence between the integers

36
00:01:43,980 --> 00:01:46,110
之间的对应关系:
and the names for the labels:

37
00:01:46,110 --> 00:01:49,623
0 代表不相等,1 代表相等。
0 stands for not equivalent and 1 for equivalent.
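As a reference for the loading and inspection steps narrated above, here is a minimal Python sketch. It assumes the datasets library is installed; the variable names are illustrative, not from the video.

from datasets import load_dataset

# Download and cache MRPC from the GLUE benchmark.
raw_datasets = load_dataset("glue", "mrpc")

# The result is a DatasetDict; index a split by its name.
train_dataset = raw_datasets["train"]

print(train_dataset[0])        # one element, read from the Arrow data on disk
print(train_dataset[:3])       # a slice: a dict mapping each column to a list of values
print(train_dataset.features)  # column metadata, including the integer-to-name label mapping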
38
00:01:51,630 --> 00:01:54,090
要预处理数据集的所有元素,
To pre-process all the elements of our dataset,

39
00:01:54,090 --> 00:01:55,980
我们需要对它们进行 token 化。
we need to tokenize them.

40
00:01:55,980 --> 00:01:58,470
看看 “Pre-process sentence pairs” 视频
Have a look at the video "Pre-process sentence pairs"

41
00:01:58,470 --> 00:02:01,800
复习一下,你只需要把两个句子
for a refresher, but you just have to send the two sentences

42
00:02:01,800 --> 00:02:04,833
发送给 tokenizer,并带上一些额外的关键字参数。
to the tokenizer with some additional keyword arguments.

43
00:02:05,880 --> 00:02:09,300
这里我们指定最大长度为 128,
Here we indicate a maximum length of 128,

44
00:02:09,300 --> 00:02:11,460
并填充短于该长度的输入,
pad inputs shorter than this length,

45
00:02:11,460 --> 00:02:13,060
截断更长的输入。
and truncate inputs that are longer.

46
00:02:14,040 --> 00:02:16,170
我们把所有这些放进一个 tokenize_function 中,
We put all of this in a tokenize_function

47
00:02:16,170 --> 00:02:18,510
然后用 map 方法将它直接应用于
that we can directly apply to all the splits

48
00:02:18,510 --> 00:02:20,260
数据集中的所有拆分。
in our dataset with the map method.

49
00:02:21,210 --> 00:02:24,120
只要函数返回一个类似字典的对象,
As long as the function returns a dictionary-like object,

50
00:02:24,120 --> 00:02:26,580
map 方法就会根据需要添加新列
the map method will add new columns as needed

51
00:02:26,580 --> 00:02:28,113
或更新现有的列。
or update existing ones.

52
00:02:30,060 --> 00:02:32,520
为了加快预处理速度,并利用
To speed up pre-processing and take advantage

53
00:02:32,520 --> 00:02:35,130
我们的分词器由 Rust 支持这一事实
of the fact that our tokenizer is backed by Rust,

54
00:02:35,130 --> 00:02:38,160
(这要归功于 Hugging Face Tokenizers 库),
thanks to the Hugging Face Tokenizers library,

55
00:02:38,160 --> 00:02:40,590
我们可以同时处理多个元素,
we can process several elements at the same time

56
00:02:40,590 --> 00:02:43,923
方法是在 tokenize 函数中使用 batched=True 参数。
in our tokenize function, using the batched=True argument.

57
00:02:45,300 --> 00:02:46,980
由于 tokenizer 可以处理
Since the tokenizer can handle a list

58
00:02:46,980 --> 00:02:50,280
第一句或第二句的列表,tokenize_function
of first or second sentences, the tokenize_function

59
00:02:50,280 --> 00:02:52,740
不需要为此做任何更改。
does not need to change for this.

60
00:02:52,740 --> 00:02:55,410
你还可以将多进程与 map 方法一起使用,
You can also use multiprocessing with the map method,

61
00:02:55,410 --> 00:02:57,460
请查看下方链接的文档。
check out its documentation linked below.

62
00:02:58,740 --> 00:03:02,130
完成后,我们就几乎准备好训练了:
Once this is done, we are almost ready for training:

63
00:03:02,130 --> 00:03:04,020
我们只需删除不再需要的列,
we just remove the columns we don't need anymore

64
00:03:04,020 --> 00:03:06,120
使用 remove_columns 方法,
with the remove_columns method,

65
00:03:06,120 --> 00:03:08,580
将 label 重命名为 labels,因为
rename label to labels, since the models

66
00:03:08,580 --> 00:03:11,430
transformers 库中的模型期望如此,
from the transformers library expect that,

67
00:03:11,430 --> 00:03:14,040
并将输出格式设置为我们想要的后端:
and set the output format to our desired backend:

68
00:03:14,040 --> 00:03:15,893
torch、tensorflow 或 numpy。
torch, tensorflow or numpy.

69
00:03:16,800 --> 00:03:19,050
如果需要,我们还可以用 select 方法
If needed, we can also generate a short sample

70
00:03:19,050 --> 00:03:21,377
生成数据集的一个小样本。
of a dataset using the select method.

71
00:03:22,817 --> 00:03:25,734
(屏幕呼啸)
(screen whooshing)
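As a reference for the preprocessing steps narrated above, here is a minimal Python sketch. It assumes the transformers library is installed; the bert-base-cased checkpoint is an illustrative assumption (the video does not name one), and raw_datasets is the DatasetDict from the earlier sketch.

from transformers import AutoTokenizer

# Illustrative checkpoint choice, not specified in the video.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples):
    # The tokenizer accepts lists of sentences, so the same code
    # works on single examples and on batches (batched=True).
    return tokenizer(
        examples["sentence1"],
        examples["sentence2"],
        padding="max_length",
        truncation=True,
        max_length=128,
    )

# Apply to every split; batched=True feeds many elements per call
# to the Rust-backed tokenizer for speed.
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

# Drop the raw text columns, rename label -> labels,
# and pick the backend (TensorFlow, to match this video).
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("tensorflow")

# Optionally take a small sample of a split with select.
small_train = tokenized_datasets["train"].select(range(100))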