1
00:00:00,000 --> 00:00:02,917
(transition music)

2
00:00:05,490 --> 00:00:07,950
- Uploading a dataset to the hub.

3
00:00:07,950 --> 00:00:09,060
In this video, we'll take a look

4
00:00:09,060 --> 00:00:10,860
at how you can upload
your very own dataset

5
00:00:10,860 --> 00:00:12,060
to the Hugging Face Hub.

6
00:00:13,680 --> 00:00:14,670
The first thing you need to do

7
00:00:14,670 --> 00:00:17,400
is create a new dataset
repository on the hub.

8
00:00:17,400 --> 00:00:19,260
So, just click on your profile icon

9
00:00:19,260 --> 00:00:21,750
and select the New Dataset button.

10
00:00:21,750 --> 00:00:24,750
Next, we need to assign
an owner to the dataset.

11
00:00:24,750 --> 00:00:26,970
By default, this will be your hub account,

12
00:00:26,970 --> 00:00:28,170
but you can also create datasets

13
00:00:28,170 --> 00:00:30,585
under any organization that you belong to.

14
00:00:30,585 --> 00:00:33,780
Then, we just need to give
the dataset a good name

15
00:00:33,780 --> 00:00:36,513
and specify whether it is a
public or private dataset.

16
00:00:37,410 --> 00:00:39,810
Public datasets can be accessed by anyone

17
00:00:39,810 --> 00:00:41,670
while private datasets
can only be accessed

18
00:00:41,670 --> 00:00:43,653
by you or members of your organization.

19
00:00:44,580 --> 00:00:47,280
And with that, we can go
ahead and create the dataset.

20
00:00:48,690 --> 00:00:51,060
Now that you have an empty
dataset repository on the hub,

21
00:00:51,060 --> 00:00:53,880
the next thing to do is
add some actual data to it.

22
00:00:53,880 --> 00:00:55,050
You can do this with git,

23
00:00:55,050 --> 00:00:57,960
but the easiest way is by
selecting the Upload file button.

24
00:00:57,960 --> 00:00:59,160
And then, you can just go ahead

25
00:00:59,160 --> 00:01:02,243
and upload the files
directly from your machine.

26
00:01:02,243 --> 00:01:03,846
After you've uploaded your files,

27
00:01:03,846 --> 00:01:05,670
you'll see them appear in the repository

28
00:01:05,670 --> 00:01:07,320
under the Files and versions tab.

29
00:01:08,550 --> 00:01:11,370
The last step is to create a dataset card.

30
00:01:11,370 --> 00:01:13,590
Well-documented datasets
are more likely to be useful

31
00:01:13,590 --> 00:01:15,600
to others as they provide
the context to decide

32
00:01:15,600 --> 00:01:17,370
whether the dataset is relevant

33
00:01:17,370 --> 00:01:18,450
or whether there are any biases

34
00:01:18,450 --> 00:01:20,673
or risks associated
with using the dataset.

35
00:01:21,540 --> 00:01:22,710
On the Hugging Face Hub,

36
00:01:22,710 --> 00:01:25,650
this information is stored in
each repository's README file.

37
00:01:25,650 --> 00:01:27,988
There are two main steps
that you should take.

38
00:01:27,988 --> 00:01:30,651
First, you need to create some metadata

39
00:01:30,651 --> 00:01:32,010
that will allow your dataset

40
00:01:32,010 --> 00:01:34,590
to be easily found by others on the hub.

41
00:01:34,590 --> 00:01:35,670
You can create this metadata

42
00:01:35,670 --> 00:01:37,860
using the datasets tagging application,

43
00:01:37,860 --> 00:01:40,620
which we'll link to in
the video description.

44
00:01:40,620 --> 00:01:42,240
Once you've created the metadata,

45
00:01:42,240 --> 00:01:44,190
you can fill out the
rest of the dataset card,

46
00:01:44,190 --> 00:01:45,240
and we provide a template

47
00:01:45,240 --> 00:01:47,090
that we'll also link to in the video description.
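
To make the two steps a bit more concrete, here is a rough sketch of what a dataset card can look like: a YAML metadata block at the top of the README, followed by the written description. The helper name build_dataset_card and every value passed to it are hypothetical placeholders, not output of the tagging application or the official template.

```python
# Sketch only: assemble README.md text for a dataset card.
# The function name and all example values are hypothetical.

def build_dataset_card(pretty_name, license_id, languages, task_categories, description):
    """Return README.md text: a YAML metadata block followed by the card body."""
    lines = ["---"]
    lines.append(f"pretty_name: {pretty_name}")
    lines.append(f"license: {license_id}")
    lines.append("language:")
    lines.extend(f"- {lang}" for lang in languages)
    lines.append("task_categories:")
    lines.extend(f"- {task}" for task in task_categories)
    lines.append("---")  # closes the metadata block
    lines.append("")
    lines.append(f"# Dataset Card for {pretty_name}")
    lines.append("")
    lines.append(description)
    return "\n".join(lines) + "\n"


card = build_dataset_card(
    pretty_name="My Demo Dataset",
    license_id="mit",
    languages=["en"],
    task_categories=["text-classification"],
    description="A short description of what the data contains and where it came from.",
)
print(card)
```

The printed text could be saved as README.md in the dataset repository; the body would normally be much longer and cover things like known biases and risks.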

48
00:01:48,480 --> 00:01:50,280
And once your dataset is on the hub,

49
00:01:50,280 --> 00:01:53,400
you can load it using the
trusty load_dataset function.

50
00:01:53,400 --> 00:01:55,015
Just provide the name of your repository

51
00:01:55,015 --> 00:01:57,843
and a data_files argument,
and you're good to go.

52
00:01:59,619 --> 00:02:02,536
(transition music)