subtitles/zh-CN/40_uploading-a-dataset-to-the-hub.srt (208 lines of code) (raw):
1
00:00:00,000 --> 00:00:02,917
(过渡音乐)
(transition music)
2
00:00:05,490 --> 00:00:07,950
- 将数据集上传到 hub。
- Uploading a dataset to the hub.
3
00:00:07,950 --> 00:00:09,060
在本视频中,我们将了解
In this video, we'll take a look
4
00:00:09,060 --> 00:00:10,860
如何上传你自己的数据集
at how you can upload your very own dataset
5
00:00:10,860 --> 00:00:12,060
到 Hugging Face Hub。
to the Hugging Face Hub.
6
00:00:13,680 --> 00:00:14,670
你需要做的第一件事
The first thing you need to do
7
00:00:14,670 --> 00:00:17,400
是在 hub 上创建一个新的数据集仓库。
is create a new dataset repository on the hub.
8
00:00:17,400 --> 00:00:19,260
所以,只需点击你的个人资料图标
So, just click on your profile icon
9
00:00:19,260 --> 00:00:21,750
并选择 New Dataset 按钮。
and select the New Dataset button.
10
00:00:21,750 --> 00:00:24,750
接下来,我们需要指定数据集的所有者。
Next, we need to assign an owner of the dataset.
11
00:00:24,750 --> 00:00:26,970
默认情况下,所有者是你的 hub 帐户,
By default, this will be your hub account,
12
00:00:26,970 --> 00:00:28,170
但你也可以
but you can also create datasets
13
00:00:28,170 --> 00:00:30,585
以你所属的组织的名义创建数据集。
under any organization that you belong to.
14
00:00:30,585 --> 00:00:33,780
然后,我们只需要给数据集起个好名字
Then, we just need to give the dataset a good name
15
00:00:33,780 --> 00:00:36,513
并指定它是公共数据集还是私有数据集。
and specify whether it is a public or private dataset.
16
00:00:37,410 --> 00:00:39,810
任何人都可以访问公共数据集
Public datasets can be accessed by anyone
17
00:00:39,810 --> 00:00:41,670
而私人数据集只能
while private datasets can only be accessed
18
00:00:41,670 --> 00:00:43,653
允许你或你的组织成员访问。
by you or members of your organization.
19
00:00:44,580 --> 00:00:47,280
这样,我们就可以继续创建数据集了。
And with that, we can go ahead and create the dataset.
20
00:00:48,690 --> 00:00:51,060
现在你在 hub 上有一个空的数据集仓库,
Now that you have an empty dataset repository on the hub,
21
00:00:51,060 --> 00:00:53,880
接下来要做的是向其中添加一些实际数据。
the next thing to do is add some actual data to it.
22
00:00:53,880 --> 00:00:55,050
你可以用 git 做到这一点,
You can do this with git,
23
00:00:55,050 --> 00:00:57,960
但最简单的方法是选择 Upload file 按钮。
but the easiest way is by selecting the Upload file button.
24
00:00:57,960 --> 00:00:59,160
然后,你可以继续下一步
And then, you can just go ahead
25
00:00:59,160 --> 00:01:02,243
直接从你的机器上传文件。
and upload the files directly from your machine.
26
00:01:02,243 --> 00:01:03,846
上传文件后,
After you've uploaded your files,
27
00:01:03,846 --> 00:01:05,670
你会在 Files and versions 选项卡下
you'll see them appear in the repository
28
00:01:05,670 --> 00:01:07,320
看到它们出现在仓库中。
under the Files and versions tab.
29
00:01:08,550 --> 00:01:11,370
最后一步是创建数据集卡。
The last step is to create a dataset card.
30
00:01:11,370 --> 00:01:13,590
对其他人来说,归档良好的数据集貌似更有用,
Well-documented datasets are more likely to be useful
31
00:01:13,590 --> 00:01:15,600
因为他们提供了影响决策的上下文信息
to others as they provide the context to decide
32
00:01:15,600 --> 00:01:17,370
包括数据集是否与需求相关
whether the dataset is relevant
33
00:01:17,370 --> 00:01:18,450
或者有没有误差
or whether there are any biases
34
00:01:18,450 --> 00:01:20,673
或使用该数据可能遇到的风险。
or risks associated with using the dataset.
35
00:01:21,540 --> 00:01:22,710
在 Hugging Face Hub 上,
On the Hugging Face Hub,
36
00:01:22,710 --> 00:01:25,650
此信息存储在每个仓库的 README 文件中。
this information is stored in each repositories README file.
37
00:01:25,650 --> 00:01:27,988
你需要执行两个主要步骤。
There are two main steps that you should take.
38
00:01:27,988 --> 00:01:30,651
首先,你需要创建一些元数据
First, you need to create some metadata
39
00:01:30,651 --> 00:01:32,010
这将允许其他人在 hub 上
that will allow your dataset
40
00:01:32,010 --> 00:01:34,590
轻松找到你的数据集。
to be easily found by others on the hub.
41
00:01:34,590 --> 00:01:35,670
你可以使用数据集标记应用程序
You can create this metadata
42
00:01:35,670 --> 00:01:37,860
创建此元数据,
using the datasets tagging application,
43
00:01:37,860 --> 00:01:40,620
我们将在视频说明信息中包含它的链接。
which we'll link to in the video description.
44
00:01:40,620 --> 00:01:42,240
创建元数据后,
Once you've created the metadata,
45
00:01:42,240 --> 00:01:44,190
你可以填写数据集卡的其余部分,
you can fill out the rest of the dataset card,
46
00:01:44,190 --> 00:01:45,240
我们提供了一个模板
and we provide a template
47
00:01:45,240 --> 00:01:47,090
相关链接也会包含在下面的视频信息内容中。
that we'll also link to in the video.
48
00:01:48,480 --> 00:01:50,280
一旦你的数据集在 hub 上,
And once your dataset is on the hub,
49
00:01:50,280 --> 00:01:53,400
你可以使用可信赖的 load_dataset 函数加载它。
you can load it using the trusty load_dataset function.
50
00:01:53,400 --> 00:01:55,015
只需提供你的仓库的名称
Just provide the name of your repository
51
00:01:55,015 --> 00:01:57,843
和一个 data_files 参数,你就可以开始使用了。
and a data_files argument, and you're good to go.
52
00:01:59,619 --> 00:02:02,536
(过渡音乐)
(transition music)