subtitles/zh-CN/35_loading-a-custom-dataset.srt
1
00:00:00,195 --> 00:00:01,426
(屏幕呼啸)
(screen whooshing)
2
00:00:01,426 --> 00:00:02,614
(贴纸弹出)
(sticker popping)
3
00:00:02,614 --> 00:00:06,150
(屏幕呼啸)
(screen whooshing)
4
00:00:06,150 --> 00:00:08,430
- 加载自定义数据集。
- Loading a custom dataset.
5
00:00:08,430 --> 00:00:09,750
尽管 Hugging Face Hub 上承载了
Although the Hugging Face Hub hosts
6
00:00:09,750 --> 00:00:11,730
超过一千个公共数据集,
over a thousand public datasets,
7
00:00:11,730 --> 00:00:12,930
你仍然会经常需要处理存储在你的笔记本电脑
you'll often need to work with data
8
00:00:12,930 --> 00:00:15,900
或存储在远程服务器上的数据。
that is stored on your laptop or some remote server.
9
00:00:15,900 --> 00:00:18,060
在本视频中,我们将探讨如何利用 Datasets 库
In this video, we'll explore how the Datasets library
10
00:00:18,060 --> 00:00:20,310
加载 Hugging Face Hub 以外
can be used to load datasets that aren't available
11
00:00:20,310 --> 00:00:21,510
的数据集。
on the Hugging Face Hub.
12
00:00:22,980 --> 00:00:25,290
正如你在此表中所见,Datasets 库
As you can see in this table, the Datasets library
13
00:00:25,290 --> 00:00:26,700
提供了几个内置脚本
provides several built-in scripts
14
00:00:26,700 --> 00:00:29,370
以多种格式加载数据集。
to load datasets in several formats.
15
00:00:29,370 --> 00:00:31,200
要以其中一种格式加载数据集,
To load a dataset in one of these formats,
16
00:00:31,200 --> 00:00:32,730
你只需要向 load_dataset 函数
you just need to provide the name of the format
17
00:00:32,730 --> 00:00:34,350
提供格式的名称,
to the load_dataset function,
18
00:00:34,350 --> 00:00:35,790
并且连同 data_files 参数一起传入
along with a data_files argument
19
00:00:35,790 --> 00:00:37,610
该参数指向一个或多个文件路径或 URL。
that points to one or more filepaths or URLs.
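As a rough sketch of this pattern (the format name "csv" and the filename my_file.csv below are placeholder assumptions, not taken from the video):

```python
from datasets import load_dataset

# Generic pattern: pass the format name plus a data_files argument
# pointing to one or more filepaths or URLs.
# "csv" and "my_file.csv" are placeholders for your own format and file.
dataset = load_dataset("csv", data_files="my_file.csv")
```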
20
00:00:40,350 --> 00:00:43,590
要查看实际效果,让我们从加载 CSV 文件开始。
To see this in action, let's start by loading a CSV file.
21
00:00:43,590 --> 00:00:45,960
在这个例子中,我们首先下载一个数据集
In this example, we first download a dataset
22
00:00:45,960 --> 00:00:48,963
该数据集是来自 UCI 机器学习仓库的葡萄酒质量数据。
about wine quality from the UCI machine learning repository.
23
00:00:50,220 --> 00:00:52,590
由于这是一个 CSV 文件,因此我们指定
Since this is a CSV file, we then specify
24
00:00:52,590 --> 00:00:53,943
CSV 加载脚本。
the CSV loading script.
25
00:00:55,320 --> 00:00:57,570
现在,这个脚本需要知道我们的数据在哪里,
Now, this script needs to know where our data is located,
26
00:00:57,570 --> 00:00:58,650
所以我们提供文件名
so we provide the filename
27
00:00:58,650 --> 00:01:00,483
作为 data_files 参数的一部分。
as part of the data_files argument.
28
00:01:01,860 --> 00:01:03,360
并且加载脚本还允许你
And the loading script also allows you
29
00:01:03,360 --> 00:01:05,040
传递几个关键字参数,
to pass several keyword arguments,
30
00:01:05,040 --> 00:01:06,750
所以在这里我们也指定了
so here we've also specified
31
00:01:06,750 --> 00:01:09,030
分号作为分隔符。
that the separator is a semicolon.
32
00:01:09,030 --> 00:01:10,380
这样,我们就可以看到数据集
And with that, we can see the dataset
33
00:01:10,380 --> 00:01:13,020
作为 DatasetDict 对象自动加载,
is loaded automatically as a DatasetDict object,
34
00:01:13,020 --> 00:01:15,920
CSV 文件中的每一列都代表一个特征。
with each column in the CSV file represented as a feature.
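Putting the CSV steps together, a minimal sketch (assuming winequality-white.csv from the UCI repository has already been downloaded into the working directory):

```python
from datasets import load_dataset

# Assumes winequality-white.csv (the UCI wine quality data) sits in the
# current directory; this file uses semicolons as field separators.
wine_dataset = load_dataset("csv", data_files="winequality-white.csv", sep=";")
print(wine_dataset)
# DatasetDict({'train': ...}) - each CSV column becomes a feature
```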
35
00:01:17,610 --> 00:01:20,280
如果你的数据集位于 GitHub 等远程服务器上
If your dataset is located on some remote server like GitHub
36
00:01:20,280 --> 00:01:22,050
或其他一些数据仓库,
or some other repository,
37
00:01:22,050 --> 00:01:23,700
这个过程实际上非常相似。
the process is actually very similar.
38
00:01:23,700 --> 00:01:25,980
唯一的区别是现在 data_files 参数
The only difference is that now the data_files argument
39
00:01:25,980 --> 00:01:28,623
指向 URL 而不是本地文件路径。
points to a URL instead of a local filepath.
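A sketch of the remote variant; the UCI URL below is an assumption about where the file lives, and any direct link to a raw CSV works the same way:

```python
from datasets import load_dataset

# Same loading script; only data_files changes from a path to a URL.
url = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/"
    "wine-quality/winequality-white.csv"
)
wine_dataset = load_dataset("csv", data_files=url, sep=";")
```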
40
00:01:30,330 --> 00:01:33,270
现在让我们看一下加载原始文本文件。
Let's now take a look at loading raw text files.
41
00:01:33,270 --> 00:01:35,100
这种格式在 NLP 中很常见,
This format is quite common in NLP,
42
00:01:35,100 --> 00:01:36,750
你常常会发现书籍和戏剧
and you'll typically find books and plays
43
00:01:36,750 --> 00:01:39,393
只是一个包含原始文本的独立文件。
are just a single file with raw text inside.
44
00:01:40,410 --> 00:01:43,020
在这个例子中,我们有一个莎士比亚戏剧的文本文件
In this example, we have a text file of Shakespeare plays
45
00:01:43,020 --> 00:01:45,330
存储在 GitHub 仓库中。
that's stored on a GitHub repository.
46
00:01:45,330 --> 00:01:47,040
正如我们对 CSV 文件所做的那样,
And as we did for CSV files,
47
00:01:47,040 --> 00:01:49,020
我们只需选择文本加载脚本
we simply choose the text loading script
48
00:01:49,020 --> 00:01:51,423
并将 data_files 参数指向 URL。
and point the data_files argument to the URL.
49
00:01:52,260 --> 00:01:55,110
如你所见,这些文件是逐行处理的,
As you can see, these files are processed line-by-line,
50
00:01:55,110 --> 00:01:57,690
所以原始文本中的空行
so empty lines in the raw text are also represented
51
00:01:57,690 --> 00:01:58,953
也会被表示为数据集中的一行。
as a row in the dataset.
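A sketch of the text case; the GitHub URL below is a placeholder for the raw text file the video refers to:

```python
from datasets import load_dataset

# Placeholder URL; point it at any raw .txt file, local or remote.
url = "https://raw.githubusercontent.com/<user>/<repo>/main/shakespeare.txt"
shakespeare = load_dataset("text", data_files=url)
# The file is processed line by line: every line (even an empty one)
# becomes one row with a single "text" column.
```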
52
00:02:00,810 --> 00:02:04,230
对于 JSON 文件,有两种主要格式需要了解。
For JSON files, there are two main formats to know about.
53
00:02:04,230 --> 00:02:06,060
第一个叫做 JSON 行,
The first one is called JSON Lines,
54
00:02:06,060 --> 00:02:09,510
文件中的每一行都是一个单独的 JSON 对象。
where every row in the file is a separate JSON object.
55
00:02:09,510 --> 00:02:11,100
对于这些文件,你可以通过选择 JSON 加载脚本
For these files, you can load the dataset
56
00:02:11,100 --> 00:02:13,020
来加载数据集
by selecting the JSON loading script
57
00:02:13,020 --> 00:02:16,143
并将 data_files 参数指向文件或 URL。
and pointing the data_files argument to the file or URL.
58
00:02:17,160 --> 00:02:19,410
在这个例子中,我们加载了一个 JSON Lines 文件
In this example, we've loaded a JSON Lines file
59
00:02:19,410 --> 00:02:21,710
其内容基于 Stack Exchange 问题和答案。
based on Stack Exchange questions and answers.
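For the JSON Lines case, a minimal sketch (my_data.jsonl is a hypothetical file with one JSON object per line):

```python
from datasets import load_dataset

# "my_data.jsonl" is a hypothetical JSON Lines file:
# each line holds one complete JSON object.
dataset = load_dataset("json", data_files="my_data.jsonl")
```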
60
00:02:23,490 --> 00:02:26,610
另一种格式是嵌套的 JSON 文件。
The other format is nested JSON files.
61
00:02:26,610 --> 00:02:29,100
这些文件基本上看起来像一本巨大的字典,
These files basically look like one huge dictionary,
62
00:02:29,100 --> 00:02:31,200
所以 load_dataset 函数允许你指定
so the load_dataset function allows you to specify
63
00:02:31,200 --> 00:02:32,733
要加载哪个特定的键。
which specific key to load.
64
00:02:33,630 --> 00:02:35,910
例如,用于问答的 SQuAD 数据集就是这种格式,
For example, the SQuAD dataset for question answering
65
00:02:35,910 --> 00:02:38,340
我们可以通过指定
is in this format, and we can load it by specifying
66
00:02:38,340 --> 00:02:40,340
我们感兴趣的是 data 字段来加载它。
that we're interested in the data field.
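A sketch of the nested-JSON case; squad-train.json is a hypothetical local copy of SQuAD, whose examples sit under a top-level "data" key:

```python
from datasets import load_dataset

# field="data" tells the JSON loader which top-level key holds the examples.
squad = load_dataset("json", data_files="squad-train.json", field="data")
```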
67
00:02:41,400 --> 00:02:42,780
最后要和大家分享的内容是
There is just one last thing to mention
68
00:02:42,780 --> 00:02:44,910
关于所有这些加载脚本。
about all of these loading scripts.
69
00:02:44,910 --> 00:02:46,410
你的数据集可以有不止一个切分,
You can have more than one split,
70
00:02:46,410 --> 00:02:49,080
你可以通过将 data_files 视为字典来加载它们,
you can load them by treating data_files as a dictionary,
71
00:02:49,080 --> 00:02:52,140
并将每个切分的名称映射到其对应的文件。
and map each split name to its corresponding file.
72
00:02:52,140 --> 00:02:53,970
其他一切都保持完全不变
Everything else stays completely unchanged
73
00:02:53,970 --> 00:02:55,350
你可以在这里看到一个
and you can see an example of loading
74
00:02:55,350 --> 00:02:58,283
同时加载此 SQuAD 训练和验证切分的例子。
both the training and validation splits for this SQuAD here.
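A sketch of loading both splits; the filenames are hypothetical stand-ins for the train and validation files shown in the video:

```python
from datasets import load_dataset

# Map each split name to its corresponding (hypothetical) file.
data_files = {
    "train": "squad-train.json",
    "validation": "squad-validation.json",
}
squad = load_dataset("json", data_files=data_files, field="data")
# squad["train"] and squad["validation"] are now separate splits.
```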
75
00:02:59,550 --> 00:03:02,310
这样,你现在就可以加载来自你的笔记本电脑、
And with that, you can now load datasets from your laptop,
76
00:03:02,310 --> 00:03:04,653
Hugging Face Hub 或任何其他地方的数据集了。
the Hugging Face Hub, or anywhere else you want.
77
00:03:06,277 --> 00:03:09,194
(屏幕呼啸)
(screen whooshing)