subtitles/zh-CN/19_hugging-face-datasets-overview-(pytorch).srt
1
00:00:00,213 --> 00:00:02,963
(滑动嗖嗖声)
(slide whooshes)
2
00:00:05,340 --> 00:00:08,373
- 本节将带来 Hugging Face Datasets 库的快速概览。
- The Hugging Face Datasets library, a quick overview.
3
00:00:09,990 --> 00:00:11,670
Hugging Face Datasets 库
The Hugging Face Datasets library
4
00:00:11,670 --> 00:00:14,310
是一个库,它提供的 API 可以快速下载
is a library that provides an API to quickly download
5
00:00:14,310 --> 00:00:17,610
许多公共数据集并对其进行预处理。
many public datasets and preprocess them.
6
00:00:17,610 --> 00:00:20,614
在本视频中,我们将探索如何做到这一点。
In this video we will explore how to do that.
7
00:00:20,614 --> 00:00:21,780
下载部分很简单,
The downloading part is easy,
8
00:00:21,780 --> 00:00:23,760
使用 load_dataset 函数。
with the load_dataset function.
9
00:00:23,760 --> 00:00:26,460
你可以直接下载并缓存数据集
You can directly download and cache a dataset
10
00:00:26,460 --> 00:00:28,473
只需提供它在 Dataset Hub 上的标识符。
from its identifier on the Dataset hub.
11
00:00:29,640 --> 00:00:33,570
在这里,我们从 GLUE 基准中获取 MRPC 数据集,
Here, we fetch the MRPC dataset from the GLUE benchmark,
12
00:00:33,570 --> 00:00:36,390
这是一个包含成对句子的数据集
which is a dataset containing pairs of sentences
13
00:00:36,390 --> 00:00:38,740
任务是判断它们是否互为释义。
where the task is to determine if they are paraphrases.
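A minimal sketch of this step (the variable name raw_datasets is ours, not from the video):

```python
from datasets import load_dataset

# Downloads MRPC from the GLUE benchmark and caches it locally
raw_datasets = load_dataset("glue", "mrpc")
```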
14
00:00:39,810 --> 00:00:42,420
load_dataset 函数返回的对象
The object returned by the load_dataset function
15
00:00:42,420 --> 00:00:45,600
是一个 DatasetDict,它是一种字典
is a DatasetDict, which is a sort of dictionary
16
00:00:45,600 --> 00:00:47,463
包含我们数据集的每个拆分。
containing each split of our dataset.
17
00:00:48,946 --> 00:00:52,170
我们可以通过使用其名称进行索引来访问每个拆分。
We can access each split by indexing with its name.
18
00:00:52,170 --> 00:00:55,047
这个拆分是 Dataset 类的一个实例,
This split is then an instance of the Dataset class,
19
00:00:55,047 --> 00:00:58,590
有列,这里是 sentence1,sentence2,
with columns, here sentence1, sentence2,
20
00:00:58,590 --> 00:01:01,233
label 和 idx,以及行。
label and idx, and rows.
21
00:01:02,400 --> 00:01:04,563
我们可以通过索引访问给定的元素。
We can access a given element by its index.
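Roughly what the indexing shown here looks like, continuing from the raw_datasets sketch above:

```python
train_dataset = raw_datasets["train"]  # one split, a Dataset instance
print(train_dataset.column_names)      # ['sentence1', 'sentence2', 'label', 'idx']
print(train_dataset[0])                # a single element, returned as a dict
```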
22
00:01:05,460 --> 00:01:08,220
Hugging Face Datasets 库的神奇之处
The amazing thing about the Hugging Face Datasets library
23
00:01:08,220 --> 00:01:11,880
是所有内容都使用 Apache Arrow 保存到磁盘,
is that everything is saved to disk using Apache Arrow,
24
00:01:11,880 --> 00:01:14,550
这意味着即使你的数据集很大,
which means that even if your dataset is huge,
25
00:01:14,550 --> 00:01:16,350
你也不会耗尽 RAM。
you won't run out of RAM.
26
00:01:16,350 --> 00:01:19,113
只有你请求的元素才会加载到内存中。
Only the elements you request are loaded in memory.
27
00:01:20,340 --> 00:01:23,940
访问数据集的一部分就像访问一个元素一样简单。
Accessing a slice of your dataset is as easy as accessing one element.
28
00:01:23,940 --> 00:01:26,220
结果是一个字典,其中每个键
The result is then a dictionary with lists of values
29
00:01:26,220 --> 00:01:27,480
都对应一个值的列表。
for each key.
30
00:01:27,480 --> 00:01:29,070
这里是标签列表,
Here the list of labels,
31
00:01:29,070 --> 00:01:30,147
第一个句子的列表
the list of first sentences
32
00:01:30,147 --> 00:01:31,923
以及第二个句子的列表。
and the list of second sentences.
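A sketch of slicing, assuming the same raw_datasets as above:

```python
batch = raw_datasets["train"][:5]  # a dict mapping each column name to a list of 5 values
print(batch["label"])              # the list of labels
print(batch["sentence1"])          # the list of first sentences
```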
33
00:01:33,690 --> 00:01:35,580
Dataset 的 features 属性
The features attribute of a Dataset
34
00:01:35,580 --> 00:01:37,470
为我们提供有关其列的更多信息。
gives us more information about its columns.
35
00:01:37,470 --> 00:01:40,020
特别是,我们可以在这里看到
In particular, we can see here
36
00:01:40,020 --> 00:01:41,400
它给出了标签的整数值与名称
it gives us the correspondence
37
00:01:41,400 --> 00:01:44,810
之间的对应关系。
between the integers and names for the labels.
38
00:01:44,810 --> 00:01:48,543
零代表不等价,一代表等价。
Zero stands for not equivalent and one for equivalent.
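What inspecting the features might look like (the exact repr can differ by library version):

```python
features = raw_datasets["train"].features
print(features["label"])             # ClassLabel(names=['not_equivalent', 'equivalent'])
print(features["label"].int2str(0))  # 'not_equivalent'
```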
39
00:01:49,830 --> 00:01:52,020
要预处理数据集的所有元素,
To preprocess all the elements of our dataset,
40
00:01:52,020 --> 00:01:53,850
我们需要对它们进行分词。
we need to tokenize them.
41
00:01:53,850 --> 00:01:56,160
看看视频 “预处理句子对”
Have a look at the video "Preprocess sentence pairs"
42
00:01:56,160 --> 00:01:57,570
复习一下,
for a refresher,
43
00:01:57,570 --> 00:01:59,430
但你只需要将这两个句子
but you just have to send the two sentences
44
00:01:59,430 --> 00:02:02,733
连同一些额外的关键字参数一起传给分词器。
to the tokenizer with some additional keyword arguments.
45
00:02:03,780 --> 00:02:06,600
这里我们指定最大长度为 128,
Here we indicate a maximum length of 128
46
00:02:06,600 --> 00:02:08,820
对短于该长度的输入进行填充,
and pad inputs shorter than this length,
47
00:02:08,820 --> 00:02:10,420
并截断更长的输入。
truncate inputs that are longer.
48
00:02:11,460 --> 00:02:13,470
我们把所有这些都放在一个 tokenize_function 中
We put all of this in a tokenize_function
49
00:02:13,470 --> 00:02:16,710
这样就可以直接应用于数据集中的所有拆分,
that we can directly apply to all the splits in our dataset
50
00:02:16,710 --> 00:02:17,710
只需使用 map 方法。
with the map method.
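A sketch of the tokenize_function and the map call; the checkpoint name is an assumption, any checkpoint that handles sentence pairs works:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # assumed checkpoint

def tokenize_function(examples):
    # Tokenize the sentence pair, padding to 128 tokens and truncating longer inputs
    return tokenizer(
        examples["sentence1"],
        examples["sentence2"],
        padding="max_length",
        truncation=True,
        max_length=128,
    )

tokenized_datasets = raw_datasets.map(tokenize_function)  # applied to every split
```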
51
00:02:18,840 --> 00:02:22,110
只要函数返回一个类似字典的对象,
As long as the function returns a dictionary-like object,
52
00:02:22,110 --> 00:02:24,300
map 方法将根据需要添加新列
the map method will add new columns as needed
53
00:02:24,300 --> 00:02:26,043
或更新现有的。
or update existing ones.
54
00:02:27,315 --> 00:02:28,830
为了加快预处理速度,
To speed up preprocessing
55
00:02:28,830 --> 00:02:30,870
并利用我们的分词器
and take advantage of the fact our tokenizer
56
00:02:30,870 --> 00:02:32,040
由 Rust 支持这一事实,
is backed by Rust,
57
00:02:32,040 --> 00:02:34,770
这要归功于 Hugging Face Tokenizers 库,
thanks to the Hugging Face Tokenizers library,
58
00:02:34,770 --> 00:02:37,110
我们可以一次传入多个元素
we can pass several elements at the same time
59
00:02:37,110 --> 00:02:40,710
给我们的 tokenize 函数,只需使用 batched=True 参数。
to our tokenize function, using the batched=True argument.
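The batched version, reusing the same tokenize_function sketched above:

```python
# batched=True hands the function whole batches, so the Rust-backed tokenizer
# can encode lists of first and second sentences in parallel
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
```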
60
00:02:40,710 --> 00:02:42,120
由于分词器可以处理
Since the tokenizer can handle
61
00:02:42,120 --> 00:02:44,610
第一个句子的列表和第二个句子的列表,
lists of first sentences and lists of second sentences,
62
00:02:44,610 --> 00:02:47,493
tokenize_function 不需要为此更改。
the tokenize_function does not need to change for this.
63
00:02:48,360 --> 00:02:51,180
你还可以将多处理与 map 方法一起使用。
You can also use multiprocessing with the map method.
64
00:02:51,180 --> 00:02:53,583
在链接的视频中查看其文档。
Check out its documentation in the linked video.
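A sketch of multiprocessing with map; num_proc=4 is an arbitrary choice:

```python
# Runs the preprocessing in 4 worker processes
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True, num_proc=4)
```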
65
00:02:54,840 --> 00:02:57,990
完成后,我们就几乎可以开始训练了。
Once this is done, we are almost ready for training.
66
00:02:57,990 --> 00:02:59,970
我们只是删除不再需要的列
We just remove the columns we don't need anymore
67
00:02:59,970 --> 00:03:02,190
使用 remove_columns 方法,
with the remove_columns method,
68
00:03:02,190 --> 00:03:03,750
将 label 重命名为 labels,
rename label to labels,
69
00:03:03,750 --> 00:03:05,790
因为来自 Hugging Face Transformers
since the models from the Hugging Face Transformers
70
00:03:05,790 --> 00:03:07,710
库的模型需要这种格式,
library expect that,
71
00:03:07,710 --> 00:03:10,470
并将输出格式设置为我们想要的后端,
and set the output format to our desired backend,
72
00:03:10,470 --> 00:03:12,053
PyTorch、TensorFlow 或 NumPy。
Torch, TensorFlow or NumPy.
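Roughly the cleanup steps described here, continuing the sketch:

```python
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")  # or "tensorflow" / "numpy"
```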
73
00:03:13,440 --> 00:03:16,800
如果需要,我们还可以生成数据集的一个小样本
If needed, we can also generate a short sample of a dataset
74
00:03:16,800 --> 00:03:18,000
使用 select 方法。
using the select method.
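A sketch of select; the sample size of 100 is arbitrary:

```python
# Keep only the examples at the given indices
small_train_dataset = tokenized_datasets["train"].select(range(100))
```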
75
00:03:20,211 --> 00:03:22,961
(滑动嗖嗖声)
(slide whooshes)