subtitles/zh-CN/20_hugging-face-datasets-overview-(tensorflow).srt
1
00:00:00,170 --> 00:00:03,087
(屏幕呼啸)
(screen whooshing)
2
00:00:05,371 --> 00:00:09,690
- Hugging Face Datasets 库: 快速概览。
- The Hugging Face Datasets library: A quick overview.
3
00:00:09,690 --> 00:00:10,917
Hugging Face Datasets 库
The Hugging Face Datasets library
4
00:00:10,917 --> 00:00:12,870
是一个提供 API 的库
is a library that provides an API
5
00:00:12,870 --> 00:00:15,150
来快速下载许多公共数据集
to quickly download many public datasets
6
00:00:15,150 --> 00:00:16,200
并对它们进行预处理。
and pre-process them.
7
00:00:17,070 --> 00:00:19,473
在本视频中,我们将探索如何做到这一点。
In this video we will explore how to do that.
8
00:00:20,520 --> 00:00:23,730
下载部分使用 load_dataset 函数很容易,
The downloading part is easy with the load_dataset function,
9
00:00:23,730 --> 00:00:26,010
你可以直接下载并缓存数据集
you can directly download and cache a dataset
10
00:00:26,010 --> 00:00:28,023
只需使用它在 Dataset Hub 上的标识符。
from its identifier on the Dataset Hub.
11
00:00:29,160 --> 00:00:33,690
这里我们从 GLUE benchmark 中获取 MRPC 数据集,
Here we fetch the MRPC dataset from the GLUE benchmark,
12
00:00:33,690 --> 00:00:36,030
这是一个包含句子对的数据集
which is a dataset containing pairs of sentences
13
00:00:36,030 --> 00:00:38,380
任务是判断这些句子是否互为释义。
where the task is to determine if the sentences are paraphrases.
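A minimal sketch of this loading step, not shown in the captions themselves (assumes the datasets library is installed; "glue" and "mrpc" are the identifiers mentioned in the video):

from datasets import load_dataset

# Downloads MRPC from the GLUE benchmark, or reuses the local cache
raw_datasets = load_dataset("glue", "mrpc")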
14
00:00:39,720 --> 00:00:42,120
load_dataset 函数返回的对象
The object returned by the load_dataset function
15
00:00:42,120 --> 00:00:45,090
是一个 DatasetDict,它是一种字典
is a DatasetDict, which is a sort of dictionary
16
00:00:45,090 --> 00:00:46,940
包含我们数据集的每个拆分。
containing each split of our dataset.
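For instance, printing the returned object shows one entry per split (output abbreviated; a sketch of what you would see):

print(raw_datasets)
# DatasetDict({
#     train: Dataset({features: ['sentence1', 'sentence2', 'label', 'idx'], num_rows: ...})
#     validation: Dataset({...})
#     test: Dataset({...})
# })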
17
00:00:48,600 --> 00:00:51,780
我们可以通过使用其名称进行索引来访问每个拆分。
We can access each split by indexing with its name.
18
00:00:51,780 --> 00:00:54,540
这个拆分是 Dataset 类的一个实例,
This split is then an instance of the Dataset class,
19
00:00:54,540 --> 00:00:57,423
包含一些列,这里是 sentence1、sentence2,
with columns, here sentence1, sentence2,
20
00:00:58,350 --> 00:01:00,813
label 和 idx,以及行。
label and idx, and rows.
21
00:01:02,160 --> 00:01:05,220
我们可以通过索引访问给定的元素。
We can access a given element by its index.
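A sketch combining the two access patterns just described (raw_datasets is the object loaded above):

train_dataset = raw_datasets["train"]   # an instance of the Dataset class
print(train_dataset.column_names)        # ['sentence1', 'sentence2', 'label', 'idx']
print(train_dataset[0])                  # dict with one value per column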
22
00:01:05,220 --> 00:01:08,220
Hugging Face Datasets 库的神奇之处
The amazing thing about the Hugging Face Datasets library
23
00:01:08,220 --> 00:01:11,700
是所有内容都使用 Apache Arrow 保存到磁盘,
is that everything is saved to disk using Apache Arrow,
24
00:01:11,700 --> 00:01:14,460
这意味着即使你的数据集很大
which means that even if your dataset is huge
25
00:01:14,460 --> 00:01:16,219
你也不会耗尽内存,
you won't run out of RAM,
26
00:01:16,219 --> 00:01:18,769
只有你请求的元素才会加载到内存中。
only the elements you request are loaded in memory.
27
00:01:19,920 --> 00:01:24,510
访问数据集的一个切片就像访问一个元素一样简单。
Accessing a slice of your dataset is as easy as accessing one element.
28
00:01:24,510 --> 00:01:27,150
结果是一个包含值列表的字典
The result is then a dictionary with lists of values
29
00:01:27,150 --> 00:01:30,630
对于每个键,这里是标签列表,
for each key, here the list of labels,
30
00:01:30,630 --> 00:01:32,190
第一句话列表,
the list of first sentences,
31
00:01:32,190 --> 00:01:33,840
和第二句话的列表。
and the list of second sentences.
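As a sketch, slicing the train split returns exactly that dictionary of lists:

batch = train_dataset[:5]    # first five elements
print(batch["label"])        # list of 5 labels
print(batch["sentence1"])    # list of 5 first sentences
print(batch["sentence2"])    # list of 5 second sentences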
32
00:01:35,100 --> 00:01:37,080
Dataset 的 features 属性
The features attribute of a Dataset
33
00:01:37,080 --> 00:01:39,840
为我们提供有关其列的更多信息。
gives us more information about its columns.
34
00:01:39,840 --> 00:01:42,150
特别是,我们可以在这里看到它给了我们
In particular, we can see here it gives us
35
00:01:42,150 --> 00:01:43,980
整数和
a correspondence between the integers
36
00:01:43,980 --> 00:01:46,110
标签名称之间的对应关系。
and names for the labels.
37
00:01:46,110 --> 00:01:49,623
0 代表不等价,1 代表等价。
0 stands for not equivalent and 1 for equivalent.
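A sketch of reading that correspondence from the features attribute (for MRPC the label names are not_equivalent and equivalent, matching the video):

label_feature = train_dataset.features["label"]
print(label_feature)             # ClassLabel(names=['not_equivalent', 'equivalent'], ...)
print(label_feature.int2str(1))  # 'equivalent'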
38
00:01:51,630 --> 00:01:54,090
要预处理数据集的所有元素,
To pre-process all the elements of our dataset,
39
00:01:54,090 --> 00:01:55,980
我们需要对它们进行分词。
we need to tokenize them.
40
00:01:55,980 --> 00:01:58,470
看看视频 “Pre-process sentence pairs”
Have a look at the video "Pre-process sentence pairs"
41
00:01:58,470 --> 00:02:01,800
复习一下,但你只需要发送这两个句子
for a refresher, but you just have to send the two sentences
42
00:02:01,800 --> 00:02:04,833
给 tokenizer,并附带一些额外的关键字参数。
to the tokenizer with some additional keyword arguments.
43
00:02:05,880 --> 00:02:09,300
这里我们指定最大长度为 128
Here we indicate a maximum length of 128
44
00:02:09,300 --> 00:02:11,460
并填充短于这个长度的输入,
and pad inputs shorter than this length,
45
00:02:11,460 --> 00:02:13,060
截断更长的输入。
truncate inputs that are longer.
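The call described here might look like the following sketch (the checkpoint name is an assumption; any compatible tokenizer works):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

encoded = tokenizer(
    train_dataset[0]["sentence1"],
    train_dataset[0]["sentence2"],
    padding="max_length",  # pad inputs shorter than max_length
    truncation=True,       # truncate inputs that are longer
    max_length=128,
)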
46
00:02:14,040 --> 00:02:16,170
我们把所有这些都放在一个 tokenize_function 中
We put all of this in a tokenize_function
47
00:02:16,170 --> 00:02:18,510
我们可以直接应用于所有拆分
that we can directly apply to all the splits
48
00:02:18,510 --> 00:02:20,260
在我们的数据集中使用 map 方法。
in our dataset with the map method.
49
00:02:21,210 --> 00:02:24,120
只要函数返回一个类似字典的对象,
As long as the function returns a dictionary-like object,
50
00:02:24,120 --> 00:02:26,580
map 方法将根据需要添加新列
the map method will add new columns as needed
51
00:02:26,580 --> 00:02:28,113
或更新现有的。
or update existing ones.
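Putting this together, a sketch of the tokenize_function and the map call:

def tokenize_function(example):
    return tokenizer(
        example["sentence1"],
        example["sentence2"],
        padding="max_length",
        truncation=True,
        max_length=128,
    )

# map applies the function to every split and adds the returned
# keys (input_ids, attention_mask, ...) as new columns
tokenized_datasets = raw_datasets.map(tokenize_function)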
52
00:02:30,060 --> 00:02:32,520
为加快预处理并利用
To speed up pre-processing and take advantage
53
00:02:32,520 --> 00:02:35,130
我们的分词器由 Rust 支持的事实
of the fact our tokenizer is backed by Rust
54
00:02:35,130 --> 00:02:38,160
感谢 Hugging Face Tokenizers 库,
thanks to the Hugging Face Tokenizers library,
55
00:02:38,160 --> 00:02:40,590
我们可以同时处理多个元素
we can process several elements at the same time
56
00:02:40,590 --> 00:02:43,923
在我们的 tokenize 函数中,使用 batched=True 参数。
in our tokenize function, using the batched=True argument.
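With batching enabled, the same function receives lists of sentences instead of single ones (a sketch; tokenize_function is unchanged):

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)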
57
00:02:45,300 --> 00:02:46,980
由于 tokenizer 可以处理
Since the tokenizer can handle a list
58
00:02:46,980 --> 00:02:50,280
第一句或第二句组成的列表,tokenize_function
of first or second sentences, the tokenize_function
59
00:02:50,280 --> 00:02:52,740
不需要为此改变。
does not need to change for this.
60
00:02:52,740 --> 00:02:55,410
你还可以将多进程与 map 方法一起使用,
You can also use multiprocessing with the map method,
61
00:02:55,410 --> 00:02:57,460
查看下面链接的文档。
check out its documentation linked below.
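For example, map accepts a num_proc argument to spread the work across processes (the value 4 is arbitrary):

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True, num_proc=4)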
62
00:02:58,740 --> 00:03:02,130
完成后,我们几乎准备好训练了,
Once this is done, we are almost ready for training,
63
00:03:02,130 --> 00:03:04,020
我们只需删除不再需要的列
we just remove the columns we don't need anymore
64
00:03:04,020 --> 00:03:06,120
使用 remove_columns 方法,
with the remove_columns method,
65
00:03:06,120 --> 00:03:08,580
将 label 重命名为 labels,因为
rename label to labels, since the models
66
00:03:08,580 --> 00:03:11,430
transformers 库中的模型期望如此,
from the transformers library expect that,
67
00:03:11,430 --> 00:03:14,040
并将输出格式设置为我们想要的后端,
and set the output format to our desired backend,
68
00:03:14,040 --> 00:03:15,893
torch、tensorflow 或 numpy。
torch, tensorflow or numpy.
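A sketch of those three post-processing steps (column names as in the MRPC example above; since this is the TensorFlow version of the video, "tensorflow" is used as the format):

tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("tensorflow")  # or "torch" / "numpy"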
69
00:03:16,800 --> 00:03:19,050
如果需要,我们还可以生成一个小样本
If needed, we can also generate a short sample
70
00:03:19,050 --> 00:03:21,377
使用 select 方法从数据集中选取。
of a dataset using the select method.
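For instance, to keep only the first 100 training examples (a sketch; the number 100 is arbitrary):

small_train = tokenized_datasets["train"].select(range(100))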
71
00:03:22,817 --> 00:03:25,734
(屏幕呼啸)
(screen whooshing)