1
00:00:00,213 --> 00:00:02,963
(slide whooshes)

2
00:00:05,340 --> 00:00:08,373
- The Hugging Face Datasets library, a quick overview.

3
00:00:09,990 --> 00:00:11,670
The Hugging Face Datasets library

4
00:00:11,670 --> 00:00:14,310
is a library that provides an API to quickly download

5
00:00:14,310 --> 00:00:17,610
many public datasets and preprocess them.

6
00:00:17,610 --> 00:00:20,614
In this video we will explore how to do that.

7
00:00:20,614 --> 00:00:21,780
The downloading part is easy,

8
00:00:21,780 --> 00:00:23,760
with the load_dataset function.

9
00:00:23,760 --> 00:00:26,460
You can directly download and cache a dataset

10
00:00:26,460 --> 00:00:28,473
from its identifier on the Dataset Hub.

11
00:00:29,640 --> 00:00:33,570
Here, we fetch the MRPC dataset from the GLUE benchmark,

12
00:00:33,570 --> 00:00:36,390
which is a dataset containing pairs of sentences

13
00:00:36,390 --> 00:00:38,740
where the task is to determine if they are paraphrases.

14
00:00:39,810 --> 00:00:42,420
The object returned by the load_dataset function

15
00:00:42,420 --> 00:00:45,600
is a DatasetDict, which is a sort of dictionary

16
00:00:45,600 --> 00:00:47,463
containing each split of our dataset.

17
00:00:48,946 --> 00:00:52,170
We can access each split by indexing with its name.

18
00:00:52,170 --> 00:00:55,047
This split is then an instance of the Dataset class,

19
00:00:55,047 --> 00:00:58,590
with columns, here sentence1, sentence2,

20
00:00:58,590 --> 00:01:01,233
label and idx, and rows.

21
00:01:02,400 --> 00:01:04,563
We can access a given element by its index.

22
00:01:05,460 --> 00:01:08,220
The amazing thing about the Hugging Face Datasets library

23
00:01:08,220 --> 00:01:11,880
is that everything is saved to disk using Apache Arrow,

24
00:01:11,880 --> 00:01:14,550
which means that even if your dataset is huge,

25
00:01:14,550 --> 00:01:16,350
you won't run out of RAM.

26
00:01:16,350 --> 00:01:19,113
Only the elements you request are loaded into memory.

27
00:01:20,340 --> 00:01:23,940
Accessing a slice of your dataset is as easy as accessing one element.

28
00:01:23,940 --> 00:01:26,220
The result is then a dictionary with a list of values

29
00:01:26,220 --> 00:01:27,480
for each key.

30
00:01:27,480 --> 00:01:29,070
Here, the list of labels,

31
00:01:29,070 --> 00:01:30,147
the list of first sentences

32
00:01:30,147 --> 00:01:31,923
and the list of second sentences.

33
00:01:33,690 --> 00:01:35,580
The features attribute of a Dataset

34
00:01:35,580 --> 00:01:37,470
gives us more information about its columns.

35
00:01:37,470 --> 00:01:40,020
In particular, we can see here

36
00:01:40,020 --> 00:01:41,400
that it gives us the correspondence

37
00:01:41,400 --> 00:01:44,810
between the integers and the names of the labels.

38
00:01:44,810 --> 00:01:48,543
Zero stands for not equivalent and one for equivalent.
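[The loading and inspection steps above, as a minimal runnable sketch; it assumes the datasets library is installed, and the comments on the print calls are illustrative.]

from datasets import load_dataset

# Download and cache the MRPC dataset from the GLUE benchmark
raw_datasets = load_dataset("glue", "mrpc")

# load_dataset returns a DatasetDict with one Dataset per split
print(raw_datasets)  # train, validation and test splits

# Index a split by its name, then an element by its integer index
raw_train = raw_datasets["train"]
print(raw_train[0])  # one example: sentence1, sentence2, label, idx

# A slice returns a dictionary with a list of values for each key
print(raw_train[:3]["sentence1"])

# The features attribute maps the label integers to their names
print(raw_train.features["label"].names)  # ['not_equivalent', 'equivalent']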
39
00:01:49,830 --> 00:01:52,020
To preprocess all the elements of our dataset,

40
00:01:52,020 --> 00:01:53,850
we need to tokenize them.

41
00:01:53,850 --> 00:01:56,160
Have a look at the video "Preprocess sentence pairs"

42
00:01:56,160 --> 00:01:57,570
for a refresher,

43
00:01:57,570 --> 00:01:59,430
but you just have to send the two sentences

44
00:01:59,430 --> 00:02:02,733
to the tokenizer with some additional keyword arguments.

45
00:02:03,780 --> 00:02:06,600
Here we indicate a maximum length of 128,

46
00:02:06,600 --> 00:02:08,820
pad inputs shorter than this length,

47
00:02:08,820 --> 00:02:10,420
and truncate inputs that are longer.

48
00:02:11,460 --> 00:02:13,470
We put all of this in a tokenize_function

49
00:02:13,470 --> 00:02:16,710
that we can directly apply to all the splits in our dataset

50
00:02:16,710 --> 00:02:17,710
with the map method.

51
00:02:18,840 --> 00:02:22,110
As long as the function returns a dictionary-like object,

52
00:02:22,110 --> 00:02:24,300
the map method will add new columns as needed

53
00:02:24,300 --> 00:02:26,043
or update existing ones.

54
00:02:27,315 --> 00:02:28,830
To speed up preprocessing

55
00:02:28,830 --> 00:02:30,870
and take advantage of the fact that our tokenizer

56
00:02:30,870 --> 00:02:32,040
is backed by Rust,

57
00:02:32,040 --> 00:02:34,770
thanks to the Hugging Face Tokenizers library,

58
00:02:34,770 --> 00:02:37,110
we can pass several elements at the same time

59
00:02:37,110 --> 00:02:40,710
to our tokenize function, using the batched=True argument.

60
00:02:40,710 --> 00:02:42,120
Since the tokenizer can handle

61
00:02:42,120 --> 00:02:44,610
lists of first sentences and lists of second sentences,

62
00:02:44,610 --> 00:02:47,493
the tokenize_function does not need to change for this.

63
00:02:48,360 --> 00:02:51,180
You can also use multiprocessing with the map method.

64
00:02:51,180 --> 00:02:53,583
Check out its documentation in the linked video.

65
00:02:54,840 --> 00:02:57,990
Once this is done, we are almost ready for training.

66
00:02:57,990 --> 00:02:59,970
We just remove the columns we don't need anymore

67
00:02:59,970 --> 00:03:02,190
with the remove_columns method,

68
00:03:02,190 --> 00:03:03,750
rename label to labels,

69
00:03:03,750 --> 00:03:05,790
since the models from the Hugging Face Transformers

70
00:03:05,790 --> 00:03:07,710
library expect that,

71
00:03:07,710 --> 00:03:10,470
and set the output format to our desired backend:

72
00:03:10,470 --> 00:03:12,053
Torch, TensorFlow or NumPy.

73
00:03:13,440 --> 00:03:16,800
If needed, we can also generate a short sample of a dataset

74
00:03:16,800 --> 00:03:18,000
using the select method.

75
00:03:20,211 --> 00:03:22,961
(slide whooshes)
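[The preprocessing pipeline described above, as a minimal sketch; the bert-base-uncased checkpoint is an assumption chosen for illustration, since the video does not name one.]

from datasets import load_dataset
from transformers import AutoTokenizer

raw_datasets = load_dataset("glue", "mrpc")
# Hypothetical checkpoint, used here only for illustration
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(examples):
    # Works on one example or on a batch, since the tokenizer accepts
    # lists of first sentences and lists of second sentences
    return tokenizer(
        examples["sentence1"],
        examples["sentence2"],
        max_length=128,
        padding="max_length",  # pad inputs shorter than 128
        truncation=True,       # truncate inputs that are longer
    )

# map applies the function to every split; batched=True passes several
# elements at a time to the Rust-backed tokenizer
# (adding, e.g., num_proc=4 would also enable multiprocessing)
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

# Drop the columns the model no longer needs, rename label to labels,
# and set the output format to the desired backend
tokenized_datasets = tokenized_datasets.remove_columns(
    ["sentence1", "sentence2", "idx"]
)
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")  # or "tensorflow" / "numpy"

# If needed, generate a short sample with the select method
small_train = tokenized_datasets["train"].select(range(100))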