subtitles/zh-CN/35_loading-a-custom-dataset.srt

1
00:00:00,195 --> 00:00:01,426
(屏幕呼啸)
(screen whooshing)

2
00:00:01,426 --> 00:00:02,614
(贴纸弹出)
(sticker popping)

3
00:00:02,614 --> 00:00:06,150
(屏幕呼啸)
(screen whooshing)

4
00:00:06,150 --> 00:00:08,430
- 加载自定义数据集。
- Loading a custom dataset.

5
00:00:08,430 --> 00:00:09,750
尽管 Hugging Face Hub 上承载了
Although the Hugging Face Hub hosts

6
00:00:09,750 --> 00:00:11,730
超过一千个公共数据集,
over a thousand public datasets,

7
00:00:11,730 --> 00:00:12,930
你仍然经常需要处理存储在你的笔记本电脑
you'll often need to work with data

8
00:00:12,930 --> 00:00:15,900
或远程服务器上的数据。
that is stored on your laptop or some remote server.

9
00:00:15,900 --> 00:00:18,060
在本视频中,我们将探讨如何利用 Datasets 库
In this video, we'll explore how the Datasets library

10
00:00:18,060 --> 00:00:20,310
加载 Hugging Face Hub 以外
can be used to load datasets that aren't available

11
00:00:20,310 --> 00:00:21,510
的数据集。
on the Hugging Face Hub.

12
00:00:22,980 --> 00:00:25,290
正如你在此表中所见,Datasets 库
As you can see in this table, the Datasets library

13
00:00:25,290 --> 00:00:26,700
提供了几个内置脚本
provides several in-built scripts

14
00:00:26,700 --> 00:00:29,370
以多种格式加载数据集。
to load datasets in several formats.

15
00:00:29,370 --> 00:00:31,200
要以其中一种格式加载数据集,
To load a dataset in one of these formats,

16
00:00:31,200 --> 00:00:32,730
你只需要向 load_dataset 函数
you just need to provide the name of the format

17
00:00:32,730 --> 00:00:34,350
提供格式的名称,
to the load_dataset function,

18
00:00:34,350 --> 00:00:35,790
并连同 data_files 参数一起传入,
along with a data_files argument

19
00:00:35,790 --> 00:00:37,610
该参数指向一个或多个文件路径或 URL。
that points to one or more filepaths or URLs.

20
00:00:40,350 --> 00:00:43,590
要查看实际效果,让我们从加载 CSV 文件开始。
To see this in action, let's start by loading a CSV file.

21
00:00:43,590 --> 00:00:45,960
在这个例子中,我们首先下载一个数据集,
In this example, we first download a dataset

22
00:00:45,960 --> 00:00:48,963
即来自 UCI 机器学习库的葡萄酒质量数据。
about wine quality from the UCI machine learning repository.

23
00:00:50,220 --> 00:00:52,590
由于这是一个 CSV 文件,因此我们指定
Since this is a CSV file, we then specify

24
00:00:52,590 --> 00:00:53,943
CSV 加载脚本。
the CSV loading script.

25
00:00:55,320 --> 00:00:57,570
现在,这个脚本需要知道我们的数据在哪里,
Now, this script needs to know where our data is located,

26
00:00:57,570 --> 00:00:58,650
所以我们提供文件名
so we provide the filename

27
00:00:58,650 --> 00:01:00,483
作为 data_files 参数的一部分。
as part of the data_files argument.

28
00:01:01,860 --> 00:01:03,360
并且加载脚本还允许你
And the loading script also allows you

29
00:01:03,360 --> 00:01:05,040
传递几个关键字参数,
to pass several keyword arguments,

30
00:01:05,040 --> 00:01:06,750
所以在这里我们也指定了
so here we've also specified

31
00:01:06,750 --> 00:01:09,030
分号作为分隔符。
that the separator is a semicolon.

32
00:01:09,030 --> 00:01:10,380
这样,我们就可以看到数据集
And with that, we can see the dataset

33
00:01:10,380 --> 00:01:13,020
作为 DatasetDict 对象自动加载,
is loaded automatically as a DatasetDict object,

34
00:01:13,020 --> 00:01:15,920
CSV 文件中的每一列都表示为一个特征。
with each column in the CSV file represented as a feature.

35
00:01:17,610 --> 00:01:20,280
如果你的数据集位于 GitHub 等远程服务器上
If your dataset is located on some remote server like GitHub

36
00:01:20,280 --> 00:01:22,050
或其他一些数据仓库,
or some other repository,

37
00:01:22,050 --> 00:01:23,700
这个过程实际上非常相似。
the process is actually very similar.

38
00:01:23,700 --> 00:01:25,980
唯一的区别是现在 data_files 参数
The only difference is that now the data_files argument

39
00:01:25,980 --> 00:01:28,623
指向 URL 而不是本地文件路径。
points to a URL instead of a local filepath.
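
A minimal sketch of the CSV loading step described above, using the Datasets library; the filename and URL below are illustrative placeholders rather than the exact paths shown in the video:

from datasets import load_dataset

# Local CSV file: the wine quality data uses ";" as its column separator,
# so we forward that keyword argument to the CSV loading script.
local_csv = load_dataset("csv", data_files="winequality-white.csv", sep=";")
print(local_csv)  # DatasetDict with a "train" split; each CSV column becomes a feature

# Remote CSV file: the only change is that data_files now points to a URL.
url = "https://example.com/winequality-white.csv"  # placeholder URL
remote_csv = load_dataset("csv", data_files=url, sep=";")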
40
00:01:30,330 --> 00:01:33,270
现在让我们看一下加载原始文本文件。
Let's now take a look at loading raw text files.

41
00:01:33,270 --> 00:01:35,100
这种格式在 NLP 中很常见,
This format is quite common in NLP,

42
00:01:35,100 --> 00:01:36,750
你常常会发现书籍和戏剧
and you'll typically find books and plays

43
00:01:36,750 --> 00:01:39,393
只是一个包含原始文本的独立文件。
are just a single file with raw text inside.

44
00:01:40,410 --> 00:01:43,020
在这个例子中,我们有一个莎士比亚戏剧的文本文件,
In this example, we have a text file of Shakespeare plays

45
00:01:43,020 --> 00:01:45,330
存储在 GitHub 仓库中。
that's stored on a GitHub repository.

46
00:01:45,330 --> 00:01:47,040
正如我们对 CSV 文件所做的那样,
And as we did for CSV files,

47
00:01:47,040 --> 00:01:49,020
我们只需选择文本加载脚本
we simply choose the text loading script

48
00:01:49,020 --> 00:01:51,423
并将 data_files 参数指向 URL。
and point the data_files argument to the URL.

49
00:01:52,260 --> 00:01:55,110
如你所见,这些文件是逐行处理的,
As you can see, these files are processed line-by-line,

50
00:01:55,110 --> 00:01:57,690
所以原始文本中的空行
so empty lines in the raw text are also represented

51
00:01:57,690 --> 00:01:58,953
也会表示为数据集中的一行。
as a row in the dataset.

52
00:02:00,810 --> 00:02:04,230
对于 JSON 文件,有两种主要格式需要了解。
For JSON files, there are two main formats to know about.

53
00:02:04,230 --> 00:02:06,060
第一种叫做 JSON Lines,
The first one is called JSON Lines,

54
00:02:06,060 --> 00:02:09,510
文件中的每一行都是一个单独的 JSON 对象。
where every row in the file is a separate JSON object.

55
00:02:09,510 --> 00:02:11,100
对于这些文件,你可以通过选择 JSON 加载脚本
For these files, you can load the dataset

56
00:02:11,100 --> 00:02:13,020
来加载数据集,
by selecting the JSON loading script

57
00:02:13,020 --> 00:02:16,143
并将 data_files 参数指向文件或 URL。
and pointing the data_files argument to the file or URL.

58
00:02:17,160 --> 00:02:19,410
在这个例子中,我们加载了一个 JSON Lines 文件,
In this example, we've loaded a JSON Lines file

59
00:02:19,410 --> 00:02:21,710
其内容基于 Stack Exchange 的问题和答案。
based on Stack Exchange questions and answers.

60
00:02:23,490 --> 00:02:26,610
另一种格式是嵌套的 JSON 文件。
The other format is nested JSON files.

61
00:02:26,610 --> 00:02:29,100
这些文件基本上看起来像一个巨大的字典,
These files basically look like one huge dictionary,

62
00:02:29,100 --> 00:02:31,200
所以 load_dataset 函数允许你指定
so the load_dataset function allows you to specify

63
00:02:31,200 --> 00:02:32,733
要加载哪个特定的键。
which specific key to load.

64
00:02:33,630 --> 00:02:35,910
例如,用于问答的 SQuAD 数据集
For example, the SQuAD dataset for question answering

65
00:02:35,910 --> 00:02:38,340
就是这种格式,我们可以通过指定
has this format, and we can load it by specifying

66
00:02:38,340 --> 00:02:40,340
我们感兴趣的是 data 字段来加载它。
that we're interested in the data field.

67
00:02:41,400 --> 00:02:42,780
最后还有一点要和大家分享,
There is just one last thing to mention

68
00:02:42,780 --> 00:02:44,910
是关于所有这些加载脚本的。
about all of these loading scripts.

69
00:02:44,910 --> 00:02:46,410
你的数据可以有不止一个切分,
You can have more than one split,

70
00:02:46,410 --> 00:02:49,080
你可以通过将 data_files 作为字典传入来加载它们,
you can load them by treating data_files as a dictionary,

71
00:02:49,080 --> 00:02:52,140
并将每个切分的名称映射到其对应的文件。
and map each split name to its corresponding file.

72
00:02:52,140 --> 00:02:53,970
其他一切都保持完全不变,
Everything else stays completely unchanged

73
00:02:53,970 --> 00:02:55,350
这里你可以看到一个
and you can see an example of loading

74
00:02:55,350 --> 00:02:58,283
同时加载 SQuAD 训练和验证切分的例子。
both the training and validation splits for this SQuAD here.
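
A sketch of the text and JSON loading calls described above; the paths and URLs are placeholders, not necessarily the files used in the video:

from datasets import load_dataset

# Raw text: every line of the file becomes one row in the dataset.
shakespeare = load_dataset(
    "text",
    data_files="https://example.com/shakespeare.txt",  # placeholder URL
)

# JSON Lines: one JSON object per line, loaded with the same pattern.
qa_pairs = load_dataset("json", data_files="stack_exchange.jsonl")  # placeholder filename

# Nested JSON: pass `field` to select the key that holds the records,
# e.g. the "data" field used by SQuAD-style files.
squad = load_dataset("json", data_files="squad-train.json", field="data")  # placeholder filename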
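
And a sketch of loading more than one split at once by passing a dictionary to data_files, as described above (the filenames are again placeholders):

from datasets import load_dataset

# Map each split name to its corresponding file; everything else stays the same.
data_files = {
    "train": "squad-train.json",            # placeholder filenames
    "validation": "squad-validation.json",
}
squad = load_dataset("json", data_files=data_files, field="data")
print(squad)  # DatasetDict with "train" and "validation" splits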
75
00:02:59,550 --> 00:03:02,310
这样,你现在就可以加载来自笔记本电脑、
And with that, you can now load datasets from your laptop,

76
00:03:02,310 --> 00:03:04,653
Hugging Face Hub 或任何其他地方的数据集。
the Hugging Face Hub, or anywhere else you want.

77
00:03:06,277 --> 00:03:09,194
(屏幕呼啸)
(screen whooshing)