subtitles/zh-CN/36_slice-and-dice-a-dataset-🔪.srt
1
00:00:00,215 --> 00:00:02,882
(空气呼啸)
(air whooshing)
2
00:00:05,760 --> 00:00:07,623
- 如何对数据集进行切片和切块?
- How to slice and dice the dataset?
3
00:00:08,760 --> 00:00:10,410
大多数时候,你使用的数据
Most of the time, the data you work with
4
00:00:10,410 --> 00:00:13,230
不会为训练模型做好充分准备。
won't be perfectly prepared for training models.
5
00:00:13,230 --> 00:00:15,810
在本视频中,我们将探索各种功能
In this video, we'll explore various features
6
00:00:15,810 --> 00:00:18,660
数据集库提供的用于清理数据的功能。
that the datasets library provides to clean up your data.
7
00:00:19,915 --> 00:00:22,500
数据集库提供了几种内置方法
The datasets library provides several built-in methods
8
00:00:22,500 --> 00:00:25,350
允许你以各种方式处理数据。
that allow you to wrangle your data in various ways.
9
00:00:25,350 --> 00:00:27,360
在本视频中,我们将了解如何打乱
In this video, we'll see how you can shuffle
10
00:00:27,360 --> 00:00:30,750
并拆分你的数据,选择你感兴趣的行,
and split your data, select the rows you're interested in,
11
00:00:30,750 --> 00:00:32,070
调整列,
tweak the columns,
12
00:00:32,070 --> 00:00:34,620
并使用 map 方法应用处理函数。
and apply processing functions with the map method.
13
00:00:35,640 --> 00:00:37,620
让我们从打乱数据开始。
Let's start with shuffling.
14
00:00:37,620 --> 00:00:38,520
通常来说,一个好的做法是
It is generally a good idea
15
00:00:38,520 --> 00:00:40,140
对你的训练集进行打乱
to apply shuffling to your training set
16
00:00:40,140 --> 00:00:41,250
这样你的模型就不会学习
so that your model doesn't learn
17
00:00:41,250 --> 00:00:43,590
数据中任何人为的排序。
any artificial ordering in the data.
18
00:00:43,590 --> 00:00:45,360
如果你想打乱整个数据集,
If you wanna shuffle the whole dataset,
19
00:00:45,360 --> 00:00:48,390
你可以应用适当命名的 shuffle 方法。
you can apply the appropriately named shuffle method.
20
00:00:48,390 --> 00:00:50,730
你可以在这里看到这个方法的一个例子,
You can see an example of this method in action here,
21
00:00:50,730 --> 00:00:52,200
其中我们下载了 SQuAD 数据集的训练拆分
where we've downloaded the training split
22
00:00:52,200 --> 00:00:55,000
并随机打乱了所有行。
of the squad dataset and shuffled all the rows randomly.
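A minimal sketch of the shuffle step described above, assuming the SQuAD dataset from the Hugging Face Hub and an arbitrary seed:

    from datasets import load_dataset

    # Load the training split of SQuAD and shuffle its rows
    squad = load_dataset("squad", split="train")
    squad_shuffled = squad.shuffle(seed=0)  # seed chosen only for reproducibility
    print(squad_shuffled[0])  # no longer the original first example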
23
00:00:56,880 --> 00:00:58,230
另一种打乱数据的方法
Another way to shuffle the data
24
00:00:58,230 --> 00:01:00,930
是创建随机训练和测试拆分。
is to create random train and test splits.
25
00:01:00,930 --> 00:01:02,280
如果你必须从原始数据中创建
This can be useful if you have to create
26
00:01:02,280 --> 00:01:04,620
你自己的测试拆分,这将很有用。
your own test splits from raw data.
27
00:01:04,620 --> 00:01:07,620
为此,你只需应用 train_test_split 方法
To do this, you just apply the train_test_split method
28
00:01:07,620 --> 00:01:10,740
并指定测试拆分应该有多大。
and specify how large the test split should be.
29
00:01:10,740 --> 00:01:14,310
在这个例子中,我们指定测试集应占
In this example, we specify that the test set should be 10%
30
00:01:14,310 --> 00:01:15,963
总数据集大小的 10%。
of the total dataset size.
31
00:01:16,890 --> 00:01:19,140
可以看到 train_test_split 方法的输出
You can see that the output of the train_test_split method
32
00:01:19,140 --> 00:01:20,610
是一个 DatasetDict 对象
is a DatasetDict object
33
00:01:20,610 --> 00:01:22,743
其键对应于新的拆分。
whose keys correspond to the new splits.
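A sketch of the random split described above, reusing the squad Dataset from the previous snippet; the 10% test size mirrors the example on screen and the seed is an assumption:

    # Carve a random 10% test split out of the training data
    splits = squad.train_test_split(test_size=0.1, seed=0)
    print(splits)
    # DatasetDict with "train" and "test" keys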
34
00:01:25,170 --> 00:01:27,210
现在我们知道如何打乱数据集了,
Now that we know how to shuffle the dataset,
35
00:01:27,210 --> 00:01:30,060
让我们来看看如何返回我们感兴趣的行。
let's take a look at returning the rows we're interested in.
36
00:01:30,060 --> 00:01:33,180
最常见的方法是使用 select 方法。
The most common way to do this is with the select method.
37
00:01:33,180 --> 00:01:34,590
此方法需要一个列表
This method expects a list
38
00:01:34,590 --> 00:01:36,750
或数据集索引的生成器,
or a generator of the dataset's indices,
39
00:01:36,750 --> 00:01:38,670
然后将返回一个新的数据集对象
and will then return a new dataset object
40
00:01:38,670 --> 00:01:40,143
只包含那些行。
containing just those rows.
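As a sketch, select can be called with an explicit list of indices (the indices themselves are arbitrary here):

    # Keep only the rows at these positions
    sample = squad.select([0, 10, 20, 30, 40])
    print(len(sample))  # 5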
41
00:01:41,490 --> 00:01:43,740
如果你想创建一个随机的行样本,
If you wanna create a random sample of rows,
42
00:01:43,740 --> 00:01:45,360
你可以通过链式调用 shuffle
you can do this by chaining the shuffle
43
00:01:45,360 --> 00:01:47,310
和 select 方法来做到这一点。
and select methods together.
44
00:01:47,310 --> 00:01:48,450
在这个例子中,
In this example,
45
00:01:48,450 --> 00:01:50,250
我们创建了一个包含五个元素的样本
we've created a sample of five elements
46
00:01:50,250 --> 00:01:51,423
来自 SQuAD 数据集。
from the squad dataset.
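A sketch of the chained call that produces the five-element sample mentioned above (the seed is an assumption):

    # Shuffle first, then keep the first five rows of the shuffled dataset
    sample = squad.shuffle(seed=42).select(range(5))
    print(sample["title"])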
47
00:01:53,550 --> 00:01:56,010
在数据集中挑选特定行的最后一种方法
The last way to pick out specific rows in a dataset
48
00:01:56,010 --> 00:01:58,290
是通过应用 filter 方法。
is by applying the filter method.
49
00:01:58,290 --> 00:02:00,120
此方法检查每一行
This method checks whether each row
50
00:02:00,120 --> 00:02:02,310
是否满足某种条件。
fulfills some condition or not.
51
00:02:02,310 --> 00:02:05,130
例如,这里我们创建了一个小的 lambda 函数
For example, here we've created a small lambda function
52
00:02:05,130 --> 00:02:08,460
检查标题是否以字母 L 开头。
that checks whether the title starts with the letter L.
53
00:02:08,460 --> 00:02:11,040
一旦我们用 filter 方法应用这个函数,
Once we apply this function with the filter method,
54
00:02:11,040 --> 00:02:14,283
我们得到仅包含这些行的数据子集。
we get a subset of the data just containing these rows.
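The filter example described above might look like this; the title column exists in SQuAD and the letter "L" follows the narration:

    # Keep only the rows whose title starts with the letter "L"
    squad_l = squad.filter(lambda row: row["title"].startswith("L"))
    print(squad_l["title"][:3])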
55
00:02:16,200 --> 00:02:18,600
到目前为止,我们一直在谈论数据集的行,
So far, we've been talking about the rows of a dataset,
56
00:02:18,600 --> 00:02:20,490
但是列呢?
but what about the columns?
57
00:02:20,490 --> 00:02:22,320
数据集库有两个主要方法
The datasets library has two main methods
58
00:02:22,320 --> 00:02:24,060
用于转换列,
for transforming columns,
59
00:02:24,060 --> 00:02:26,760
用于更改列名称的 rename_column 方法
a rename_column method to change the name of the column
60
00:02:26,760 --> 00:02:29,460
以及删除它们的 remove_columns 方法。
and a remove_columns method to delete them.
61
00:02:29,460 --> 00:02:31,860
你可以在此处查看这两种方法的示例。
You can see examples of both these methods here.
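A sketch of both column operations; the new column name and the columns being dropped are illustrative choices, not taken from the video:

    # Rename one column, then drop two others
    renamed = squad.rename_column("context", "passage")
    smaller = renamed.remove_columns(["id", "title"])
    print(smaller.column_names)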
62
00:02:34,140 --> 00:02:36,060
一些数据集有嵌套的列,
Some datasets have nested columns,
63
00:02:36,060 --> 00:02:39,360
你可以通过应用 flatten 方法来展开它们。
and you can expand these by applying the flatten method.
64
00:02:39,360 --> 00:02:41,430
例如,在 SQuAD 数据集中,
For example, in the squad dataset,
65
00:02:41,430 --> 00:02:45,150
answers 列包含 text 和 answer_start 字段。
the answers column contains a text and answer_start field.
66
00:02:45,150 --> 00:02:47,430
如果我们想将它们提升为各自独立的列,
If we wanna promote them to their own separate columns,
67
00:02:47,430 --> 00:02:49,383
我们可以如此处所示应用 flatten 方法。
we can apply flatten as shown here.
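Flattening the nested answers column of SQuAD, as a sketch:

    # Promote answers.text and answers.answer_start to top-level columns
    flat = squad.flatten()
    print(flat.column_names)
    # [..., 'answers.text', 'answers.answer_start']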
68
00:02:51,300 --> 00:02:53,760
当然,关于数据集库的讨论
Now of course, no discussion of the datasets library
69
00:02:53,760 --> 00:02:56,880
如果不提著名的 map 方法就不完整。
would be complete without mentioning the famous map method.
70
00:02:56,880 --> 00:02:59,160
此方法应用自定义处理功能
This method applies a custom processing function
71
00:02:59,160 --> 00:03:01,140
到数据集中的每一行。
to each row in the dataset.
72
00:03:01,140 --> 00:03:03,360
比如这里我们先定义
For example, here we first define
73
00:03:03,360 --> 00:03:04,890
一个将标题转为小写的函数,
a lowercase title function,
74
00:03:04,890 --> 00:03:07,503
它只是将 title 列中的文本转为小写。
that simply lowercases the text in the title column.
75
00:03:08,640 --> 00:03:11,700
然后我们将该函数提供给 map 方法,
And then we feed that function to the map method,
76
00:03:11,700 --> 00:03:14,223
瞧,我们现在有了小写标题。
and voila, we now have lowercase titles.
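A sketch of the lowercasing step; the function name is an assumption, and the returned dict only needs to contain the columns being overwritten:

    def lowercase_title(example):
        # Overwrite the title column; all other columns are kept unchanged
        return {"title": example["title"].lower()}

    squad_lower = squad.map(lowercase_title)
    print(squad_lower["title"][0])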
77
00:03:16,020 --> 00:03:18,360
map 方法也可用于将成批的行输入
The map method can also be used to feed batches of rows
78
00:03:18,360 --> 00:03:20,100
到处理函数。
to the processing function.
79
00:03:20,100 --> 00:03:22,410
这对于分词特别有用
This is especially useful for tokenization
80
00:03:22,410 --> 00:03:25,290
其中分词器由 Tokenizers 库支持,
where the tokenizer is backed by the Tokenizers library,
81
00:03:25,290 --> 00:03:26,910
它们可以使用快速多线程
and they can use fast multithreading
82
00:03:26,910 --> 00:03:28,563
并行处理批次。
to process batches in parallel.
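A sketch of batched tokenization with map; the checkpoint name and the choice to tokenize the question column are assumptions:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # assumed checkpoint

    def tokenize(batch):
        # With batched=True, batch["question"] is a list of strings
        return tokenizer(batch["question"], truncation=True)

    tokenized = squad.map(tokenize, batched=True)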
83
00:03:30,056 --> 00:03:32,723
(空气呼啸)
(air whooshing)