subtitles/en/40_uploading-a-dataset-to-the-hub.srt

1
00:00:00,000 --> 00:00:02,917
(transition music)

2
00:00:05,490 --> 00:00:07,950
- Uploading a dataset to the hub.

3
00:00:07,950 --> 00:00:09,060
In this video, we'll take a look

4
00:00:09,060 --> 00:00:10,860
at how you can upload your very own dataset

5
00:00:10,860 --> 00:00:12,060
to the Hugging Face Hub.

6
00:00:13,680 --> 00:00:14,670
The first thing you need to do

7
00:00:14,670 --> 00:00:17,400
is create a new dataset repository on the hub.

8
00:00:17,400 --> 00:00:19,260
So, just click on your profile icon

9
00:00:19,260 --> 00:00:21,750
and select the New Dataset button.

10
00:00:21,750 --> 00:00:24,750
Next, we need to assign an owner of the dataset.

11
00:00:24,750 --> 00:00:26,970
By default, this will be your hub account,

12
00:00:26,970 --> 00:00:28,170
but you can also create datasets

13
00:00:28,170 --> 00:00:30,585
under any organization that you belong to.

14
00:00:30,585 --> 00:00:33,780
Then, we just need to give the dataset a good name

15
00:00:33,780 --> 00:00:36,513
and specify whether it is a public or private dataset.

16
00:00:37,410 --> 00:00:39,810
Public datasets can be accessed by anyone

17
00:00:39,810 --> 00:00:41,670
while private datasets can only be accessed

18
00:00:41,670 --> 00:00:43,653
by you or members of your organization.

19
00:00:44,580 --> 00:00:47,280
And with that, we can go ahead and create the dataset.

20
00:00:48,690 --> 00:00:51,060
Now that you have an empty dataset repository on the hub,

21
00:00:51,060 --> 00:00:53,880
the next thing to do is add some actual data to it.

22
00:00:53,880 --> 00:00:55,050
You can do this with git,

23
00:00:55,050 --> 00:00:57,960
but the easiest way is by selecting the Upload file button.

24
00:00:57,960 --> 00:00:59,160
And then, you can just go ahead

25
00:00:59,160 --> 00:01:02,243
and upload the files directly from your machine.

26
00:01:02,243 --> 00:01:03,846
After you've uploaded your files,

27
00:01:03,846 --> 00:01:05,670
you'll see them appear in the repository

28
00:01:05,670 --> 00:01:07,320
under the Files and versions tab.

29
00:01:08,550 --> 00:01:11,370
The last step is to create a dataset card.

30
00:01:11,370 --> 00:01:13,590
Well-documented datasets are more likely to be useful

31
00:01:13,590 --> 00:01:15,600
to others as they provide the context to decide

32
00:01:15,600 --> 00:01:17,370
whether the dataset is relevant

33
00:01:17,370 --> 00:01:18,450
or whether there are any biases

34
00:01:18,450 --> 00:01:20,673
or risks associated with using the dataset.

35
00:01:21,540 --> 00:01:22,710
On the Hugging Face Hub,

36
00:01:22,710 --> 00:01:25,650
this information is stored in each repository's README file.

37
00:01:25,650 --> 00:01:27,988
There are two main steps that you should take.

38
00:01:27,988 --> 00:01:30,651
First, you need to create some metadata

39
00:01:30,651 --> 00:01:32,010
that will allow your dataset

40
00:01:32,010 --> 00:01:34,590
to be easily found by others on the hub.

41
00:01:34,590 --> 00:01:35,670
You can create this metadata

42
00:01:35,670 --> 00:01:37,860
using the datasets tagging application,

43
00:01:37,860 --> 00:01:40,620
which we'll link to in the video description.

44
00:01:40,620 --> 00:01:42,240
Once you've created the metadata,

45
00:01:42,240 --> 00:01:44,190
you can fill out the rest of the dataset card,

46
00:01:44,190 --> 00:01:45,240
and we provide a template

47
00:01:45,240 --> 00:01:47,090
that we'll also link to in the video.

48
00:01:48,480 --> 00:01:50,280
And once your dataset is on the hub,

49
00:01:50,280 --> 00:01:53,400
you can load it using the trusty load_dataset function.

50
00:01:53,400 --> 00:01:55,015
Just provide the name of your repository

51
00:01:55,015 --> 00:01:57,843
and a data_files argument, and you're good to go.

52
00:01:59,619 --> 00:02:02,536
(transition music)
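
The loading step described at the end of the video can be sketched roughly as follows. This is a minimal sketch, assuming the `datasets` library is installed and that the repository holds CSV files; the repository name `my-username/my-dataset`, the file names `train.csv` and `test.csv`, and the helper `load_from_hub` are hypothetical placeholders, not names from the video.

```python
# Hypothetical placeholders: replace with your own repository and file names.
REPO_ID = "my-username/my-dataset"
DATA_FILES = {"train": "train.csv", "test": "test.csv"}


def load_from_hub(repo_id, data_files):
    """Load files from a dataset repository on the hub as named splits."""
    # Imported inside the function so this sketch can be read and imported
    # even where the `datasets` library is not installed.
    from datasets import load_dataset

    # data_files maps each split name to a file path inside the repository.
    return load_dataset(repo_id, data_files=data_files)
```

Calling `load_from_hub(REPO_ID, DATA_FILES)` needs network access and a repository that actually exists under that name; for a private repository you would also need to be logged in to the hub first.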
