subtitles/zh-CN/36_slice-and-dice-a-dataset-🔪.srt

1
00:00:00,215 --> 00:00:02,882
(空气呼啸)
(air whooshing)

2
00:00:05,760 --> 00:00:07,623
- 如何对数据集进行切片和切块?
- How to slice and dice the dataset?

3
00:00:08,760 --> 00:00:10,410
大多数时候,你使用的数据
Most of the time, the data you work with

4
00:00:10,410 --> 00:00:13,230
不会为训练模型做好充分准备。
won't be perfectly prepared for training models.

5
00:00:13,230 --> 00:00:15,810
在本视频中,我们将探索 datasets 库
In this video, we'll explore various features

6
00:00:15,810 --> 00:00:18,660
提供的各种用于清理数据的功能。
that the datasets library provides to clean up your data.

7
00:00:19,915 --> 00:00:22,500
datasets 库提供了几种内置方法,
The datasets library provides several built-in methods

8
00:00:22,500 --> 00:00:25,350
允许你以各种方式处理数据。
that allow you to wrangle your data in various ways.

9
00:00:25,350 --> 00:00:27,360
在本视频中,我们将了解如何打乱
In this video, we'll see how you can shuffle

10
00:00:27,360 --> 00:00:30,750
并拆分你的数据,选择你感兴趣的行,
and split your data, select the rows you're interested in,

11
00:00:30,750 --> 00:00:32,070
调整列,
tweak the columns,

12
00:00:32,070 --> 00:00:34,620
并使用 map 方法应用处理函数。
and apply processing functions with the map method.

13
00:00:35,640 --> 00:00:37,620
让我们从打乱数据开始。
Let's start with shuffling.

14
00:00:37,620 --> 00:00:38,520
通常来说,最好
It is generally a good idea

15
00:00:38,520 --> 00:00:40,140
对你的训练集进行打乱,
to apply shuffling to your training set

16
00:00:40,140 --> 00:00:41,250
这样你的模型就不会学习到
so that your model doesn't learn

17
00:00:41,250 --> 00:00:43,590
数据中任何人为的排序。
any artificial ordering in the data.

18
00:00:43,590 --> 00:00:45,360
如果你想打乱整个数据集,
If you wanna shuffle the whole dataset,

19
00:00:45,360 --> 00:00:48,390
你可以应用恰如其名的 shuffle 方法。
you can apply the appropriately named shuffle method.

20
00:00:48,390 --> 00:00:50,730
你可以在这里看到这个方法的实际例子,
You can see an example of this method in action here,

21
00:00:50,730 --> 00:00:52,200
这里我们下载了 squad 数据集的训练拆分,
where we've downloaded the training split

22
00:00:52,200 --> 00:00:55,000
并随机打乱了所有行。
of the squad dataset and shuffled all the rows randomly.

23
00:00:56,880 --> 00:00:58,230
另一种打乱数据的方法
Another way to shuffle the data

24
00:00:58,230 --> 00:01:00,930
是创建随机的训练和测试拆分。
is to create random train and test splits.

25
00:01:00,930 --> 00:01:02,280
如果你需要从原始数据中创建
This can be useful if you have to create

26
00:01:02,280 --> 00:01:04,620
自己的测试拆分,这会很有用。
your own test splits from raw data.

27
00:01:04,620 --> 00:01:07,620
为此,你只需应用 train_test_split 方法
To do this, you just apply the train_test_split method

28
00:01:07,620 --> 00:01:10,740
并指定测试拆分应该有多大。
and specify how large the test split should be.

29
00:01:10,740 --> 00:01:14,310
在这个例子中,我们指定测试集应为
In this example, we specify that the test set should be 10%

30
00:01:14,310 --> 00:01:15,963
整个数据集大小的 10%。
of the total dataset size.

31
00:01:16,890 --> 00:01:19,140
可以看到 train_test_split 方法的输出
You can see that the output of the train_test_split method

32
00:01:19,140 --> 00:01:20,610
是一个 DatasetDict 对象,
is a DatasetDict object

33
00:01:20,610 --> 00:01:22,743
其键对应于新的拆分。
whose keys correspond to the new splits.

34
00:01:25,170 --> 00:01:27,210
现在我们知道如何打乱数据集了,
Now that we know how to shuffle the dataset,

35
00:01:27,210 --> 00:01:30,060
让我们来看看如何返回我们感兴趣的行。
let's take a look at returning the rows we're interested in.

36
00:01:30,060 --> 00:01:33,180
最常见的方法是使用 select 方法。
The most common way to do this is with the select method.

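To make the shuffling and splitting steps above concrete, here is a minimal sketch using the datasets library, assuming the "squad" dataset from the Hugging Face Hub; the seed and variable names are illustrative, not taken from the video.

from datasets import load_dataset

# Load the training split of the squad dataset.
squad = load_dataset("squad", split="train")

# Shuffle all rows; passing a seed keeps the order reproducible.
squad_shuffled = squad.shuffle(seed=42)

# Create random train/test splits, reserving 10% of the rows for the test set.
dataset = squad.train_test_split(test_size=0.1)
print(dataset)  # DatasetDict whose keys are the new "train" and "test" splits
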
37
00:01:33,180 --> 00:01:34,590
此方法需要一个由数据集索引组成的
This method expects a list

38
00:01:34,590 --> 00:01:36,750
列表或生成器,
or a generator of the dataset's indices,

39
00:01:36,750 --> 00:01:38,670
然后会返回一个新的数据集对象,
and will then return a new dataset object

40
00:01:38,670 --> 00:01:40,143
其中只包含那些行。
containing just those rows.

41
00:01:41,490 --> 00:01:43,740
如果你想创建一个随机的行样本,
If you wanna create a random sample of rows,

42
00:01:43,740 --> 00:01:45,360
你可以通过把 shuffle 方法
you can do this by chaining the shuffle

43
00:01:45,360 --> 00:01:47,310
和 select 方法链接起来做到这一点。
and select methods together.

44
00:01:47,310 --> 00:01:48,450
在这个例子中,
In this example,

45
00:01:48,450 --> 00:01:50,250
我们创建了一个包含五个元素的样本,
we've created a sample of five elements

46
00:01:50,250 --> 00:01:51,423
它们来自 squad 数据集。
from the squad dataset.

47
00:01:53,550 --> 00:01:56,010
在数据集中挑选特定行的最后一种方法
The last way to pick out specific rows in a dataset

48
00:01:56,010 --> 00:01:58,290
是应用 filter 方法。
is by applying the filter method.

49
00:01:58,290 --> 00:02:00,120
此方法检查每一行
This method checks whether each row

50
00:02:00,120 --> 00:02:02,310
是否满足某个条件。
fulfills some condition or not.

51
00:02:02,310 --> 00:02:05,130
例如,这里我们创建了一个小的 lambda 函数,
For example, here we've created a small lambda function

52
00:02:05,130 --> 00:02:08,460
用于检查标题是否以字母 L 开头。
that checks whether the title starts with the letter L.

53
00:02:08,460 --> 00:02:11,040
一旦我们用 filter 方法应用这个函数,
Once we apply this function with the filter method,

54
00:02:11,040 --> 00:02:14,283
我们就会得到仅包含这些行的数据子集。
we get a subset of the data just containing these rows.

55
00:02:16,200 --> 00:02:18,600
到目前为止,我们一直在谈论数据集的行,
So far, we've been talking about the rows of a dataset,

56
00:02:18,600 --> 00:02:20,490
但是列呢?
but what about the columns?

57
00:02:20,490 --> 00:02:22,320
datasets 库有两个主要方法
The datasets library has two main methods

58
00:02:22,320 --> 00:02:24,060
用于转换列,
for transforming columns,

59
00:02:24,060 --> 00:02:26,760
用于更改列名的 rename_column 方法,
a rename_column method to change the name of the column

60
00:02:26,760 --> 00:02:29,460
以及用于删除列的 remove_columns 方法。
and a remove_columns method to delete them.

61
00:02:29,460 --> 00:02:31,860
你可以在此处查看这两种方法的示例。
You can see examples of both these methods here.

62
00:02:34,140 --> 00:02:36,060
一些数据集有嵌套的列,
Some datasets have nested columns,

63
00:02:36,060 --> 00:02:39,360
你可以通过应用 flatten 方法来展开它们。
and you can expand these by applying the flatten method.

64
00:02:39,360 --> 00:02:41,430
例如,在 squad 数据集中,
For example, in the squad dataset,

65
00:02:41,430 --> 00:02:45,150
answers 列包含 text 和 answer_start 字段。
the answers column contains a text and answer_start field.

66
00:02:45,150 --> 00:02:47,430
如果我们想把它们提升为各自独立的列,
If we wanna promote them to their own separate columns,

67
00:02:47,430 --> 00:02:49,383
我们可以如这里所示应用 flatten。
we can apply flatten as shown here.

68
00:02:51,300 --> 00:02:53,760
当然,关于 datasets 库的讨论
Now of course, no discussion of the datasets library

69
00:02:53,760 --> 00:02:56,880
不提到著名的 map 方法就不算完整。
would be complete without mentioning the famous map method.

70
00:02:56,880 --> 00:02:59,160
此方法将自定义处理函数应用
This method applies a custom processing function

71
00:02:59,160 --> 00:03:01,140
到数据集中的每一行。
to each row in the dataset.

72
00:03:01,140 --> 00:03:03,360
例如,这里我们首先定义
For example, here we first define

73
00:03:03,360 --> 00:03:04,890
一个将标题小写的函数,
a lowercase title function,

74
00:03:04,890 --> 00:03:07,503
它只是将 title 列中的文本转为小写。
that simply lowercases the text in the title column.

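The row- and column-level operations covered above can be sketched as follows, again assuming the "squad" training split; the lambda condition, the new column name, and the lowercase_title helper are illustrative choices rather than exact code from the video.

from datasets import load_dataset

squad = load_dataset("squad", split="train")

# Pick out rows by index, or chain shuffle and select for a random sample of 5 rows.
sample = squad.shuffle(seed=42).select(range(5))

# Keep only the rows whose title starts with the letter L.
squad_l = squad.filter(lambda x: x["title"].startswith("L"))

# Rename a column, or drop columns entirely.
squad_renamed = squad.rename_column("context", "passage")
squad_slim = squad.remove_columns(["id", "title"])

# Expand the nested answers column into answers.text and answers.answer_start.
squad_flat = squad.flatten()

# Apply a custom processing function to every row with map.
def lowercase_title(example):
    return {"title": example["title"].lower()}

squad_lower = squad.map(lowercase_title)
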
75
00:03:08,640 --> 00:03:11,700
然后我们将该函数提供给 map 方法,
And then we feed that function to the map method,

76
00:03:11,700 --> 00:03:14,223
瞧,我们现在有了小写的标题。
and voila, we now have lowercase titles.

77
00:03:16,020 --> 00:03:18,360
map 方法还可以将成批的行输入
The map method can also be used to feed batches of rows

78
00:03:18,360 --> 00:03:20,100
到处理函数中。
to the processing function.

79
00:03:20,100 --> 00:03:22,410
这对分词 (tokenization) 特别有用,
This is especially useful for tokenization

80
00:03:22,410 --> 00:03:25,290
因为分词器由 Tokenizers 库支持,
where the tokenizer is backed by the Tokenizers library,

81
00:03:25,290 --> 00:03:26,910
它们可以使用快速多线程
and they can use fast multithreading

82
00:03:26,910 --> 00:03:28,563
并行处理批次。
to process batches in parallel.

83
00:03:30,056 --> 00:03:32,723
(空气呼啸)
(air whooshing)

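As a closing sketch of the batched map usage mentioned above, the example below tokenizes the question column with a fast tokenizer backed by the Tokenizers library; the bert-base-cased checkpoint and the tokenize_questions name are assumptions for illustration.

from datasets import load_dataset
from transformers import AutoTokenizer

squad = load_dataset("squad", split="train")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_questions(batch):
    # With batched=True, `batch` contains lists of values, so the fast
    # tokenizer can encode many rows at once using multithreading.
    return tokenizer(batch["question"], truncation=True)

tokenized_squad = squad.map(tokenize_questions, batched=True)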