
1
00:00:00,000 --> 00:00:02,917
(过渡音乐)
(transition music)

2
00:00:05,364 --> 00:00:08,310
- 在这个视频中,我们来看看
- In this video, we take a look at the data processing

3
00:00:08,310 --> 00:00:10,803
训练因果语言模型所必需的数据处理。
necessary to train causal language models.

4
00:00:12,690 --> 00:00:14,400
因果语言建模是
Causal language modeling is the task

5
00:00:14,400 --> 00:00:17,820
基于先前的词元预测下一个词元的任务。
of predicting the next token based on the previous ones.

6
00:00:17,820 --> 00:00:19,680
因果语言建模的另一个术语
Another term for causal language modeling

7
00:00:19,680 --> 00:00:21,000
是自回归建模。
is autoregressive modeling.

8
00:00:21,000 --> 00:00:23,940
在这里的示例中,
In the example that you can see here,

9
00:00:23,940 --> 00:00:25,560
例如,下一个词元可能
the next token could, for example,

10
00:00:25,560 --> 00:00:28,263
是 NLP,也可能是机器学习。
be NLP or it could be machine learning.

11
00:00:29,460 --> 00:00:31,457
因果语言模型的一个流行示例
A popular example of causal language models

12
00:00:31,457 --> 00:00:33,693
是 GPT 系列模型。
is the GPT family of models.

13
00:00:35,561 --> 00:00:38,010
要训练 GPT 这样的模型,
To train models such as GPT,

14
00:00:38,010 --> 00:00:41,460
我们通常从大量文本文件组成的语料库开始。
we usually start with a large corpus of text files.

15
00:00:41,460 --> 00:00:43,890
这些文件可以是从互联网上抓取的网页
These files can be webpages scraped from the internet

16
00:00:43,890 --> 00:00:46,020
例如 Common Crawl 数据集
such as the Common Crawl dataset

17
00:00:46,020 --> 00:00:47,940
也可以是来自 GitHub 的 Python 文件,
or they can be Python files from GitHub,

18
00:00:47,940 --> 00:00:49,490
就像你在这里看到的这些。
like the ones you can see here.

19
00:00:50,400 --> 00:00:52,680
第一步,我们需要对这些文件进行词元化,
As a first step, we need to tokenize these files

20
00:00:52,680 --> 00:00:55,380
这样才能将它们输入给模型。
such that we can feed them through the model.

21
00:00:55,380 --> 00:00:58,500
在这里,我们将词元化的文本显示为不同长度的条,
Here, we show the tokenized texts as bars of various length,

22
00:00:58,500 --> 00:01:02,188
表明它们有的较短,有的较长。
illustrating that there are shorter and longer ones.

23
00:01:02,188 --> 00:01:05,910
这在处理文本时很常见。
This is very common when working with text.

24
00:01:05,910 --> 00:01:09,270
但是,Transformer 模型的上下文窗口有限,
However, Transformer models have a limited context window

25
00:01:09,270 --> 00:01:10,770
而根据数据来源的不同,
and depending on the data source,

26
00:01:10,770 --> 00:01:13,140
词元化后的文本可能
it is possible that the tokenized texts

27
00:01:13,140 --> 00:01:15,183
比这个窗口长得多。
are much longer than this window.

28
00:01:16,080 --> 00:01:18,870
在这种情况下,我们可以直接把序列
In this case, we could just truncate the sequences

29
00:01:18,870 --> 00:01:20,182
截断为上下文长度,
to the context length,

30
00:01:20,182 --> 00:01:22,650
但这意味着第一个上下文窗口之后的
but this would mean that we lose everything

31
00:01:22,650 --> 00:01:24,513
所有内容都会丢失。
after the first context window.

32
00:01:25,500 --> 00:01:28,410
使用 return_overflowing_tokens 标志,
Using the return_overflowing_tokens flag,

33
00:01:28,410 --> 00:01:30,960
我们可以使用分词器来创建块,
we can use the tokenizer to create chunks

34
00:01:30,960 --> 00:01:33,510
其中每个块的大小都是上下文长度。
with each one being the size of the context length.

35
00:01:34,860 --> 00:01:36,180
有时,如果没有足够的词元来填充它,
Sometimes, it can still happen

36
00:01:36,180 --> 00:01:37,590
仍然会出现
that the last chunk is too short

37
00:01:37,590 --> 00:01:39,900
最后一个块太短的情况。
if there aren't enough tokens to fill it.

38
00:01:39,900 --> 00:01:41,793
在这种情况下,我们可以直接将它删除。
In this case, we can just remove it.

39
00:01:42,990 --> 00:01:45,960
使用 return_length 关键字,
With the return_length keyword,

40
00:01:45,960 --> 00:01:49,173
我们还能从分词器获取每个块的长度。
we also get the length of each chunk from the tokenizer.
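A minimal sketch of the chunking described in the cues above. The checkpoint name ("gpt2"), the context length of 128, and the example texts are placeholder assumptions, not values from the video; only the two tokenizer flags it mentions are taken from the transcript.

```python
from transformers import AutoTokenizer

# Placeholder checkpoint and context length, chosen only for illustration.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
context_length = 128

raw_texts = ["def add(a, b):\n    return a + b", "print('hello world')"]

outputs = tokenizer(
    raw_texts,
    truncation=True,
    max_length=context_length,
    return_overflowing_tokens=True,  # keep the chunks beyond the first window
    return_length=True,              # also report the length of each chunk
)

# Keep only chunks that exactly fill the context window; a too-short last chunk is dropped.
input_batch = [
    ids
    for length, ids in zip(outputs["length"], outputs["input_ids"])
    if length == context_length
]
```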
41
00:01:51,960 --> 00:01:53,640
此函数展示了准备数据集
This function shows all the steps

42
00:01:53,640 --> 00:01:56,280
所必需的所有步骤。
necessary to prepare the dataset.

43
00:01:56,280 --> 00:01:57,960
首先,我们用我刚才提到的标志
First, we tokenize the dataset

44
00:01:57,960 --> 00:02:00,330
词元化数据集。
with the flags I just mentioned.

45
00:02:00,330 --> 00:02:02,190
然后,我们遍历每个块,
Then, we go through each chunk

46
00:02:02,190 --> 00:02:04,680
如果它的长度与上下文长度匹配,
and if its length matches the context length,

47
00:02:04,680 --> 00:02:06,663
我们就将它添加到要返回的输入中。
we add it to the inputs we return.

48
00:02:07,590 --> 00:02:10,260
我们可以将此函数应用于整个数据集。
We can apply this function to the whole dataset.

49
00:02:10,260 --> 00:02:11,700
此外,我们确保
In addition, we make sure

50
00:02:11,700 --> 00:02:15,450
使用批处理并删除现有列。
to use batches and remove the existing columns.

51
00:02:15,450 --> 00:02:17,670
我们之所以需要删除现有的列,
We need to remove the existing columns,

52
00:02:17,670 --> 00:02:21,330
是因为每个文本可能创建出多个样本,
because we can create multiple samples per text,

53
00:02:21,330 --> 00:02:22,890
而在那种情况下,数据集中的形状
and the shapes in the dataset

54
00:02:22,890 --> 00:02:24,753
将不再匹配。
would not match anymore in that case.

55
00:02:26,832 --> 00:02:30,330
如果上下文长度与文件长度相近,
If the context length is of similar length as the files,

56
00:02:30,330 --> 00:02:32,733
这种方法就不再那么有效了。
this approach doesn't work so well anymore.

57
00:02:33,660 --> 00:02:36,420
在这个例子中,样本 1 和样本 2
In this example, both sample 1 and 2

58
00:02:36,420 --> 00:02:38,400
都比上下文大小短,
are shorter than the context size

59
00:02:38,400 --> 00:02:41,610
按照之前的方法处理时将会被丢弃。
and will be discarded with the previous approach.

60
00:02:41,610 --> 00:02:45,150
在这种情况下,最好先词元化每个样本
In this case, it is better to first tokenize each sample

61
00:02:45,150 --> 00:02:46,590
而不进行截断,
without truncation

62
00:02:46,590 --> 00:02:49,290
然后把词元化后的样本连接起来,
and then concatenate the tokenized samples

63
00:02:49,290 --> 00:02:52,353
并在它们之间加上字符串结束符,即 EOS 词元。
with an end of string or EOS token in between.

64
00:02:53,546 --> 00:02:56,220
最后,我们可以按照上下文长度
Finally, we can chunk this long sequence

65
00:02:56,220 --> 00:02:59,490
对这个长序列分块,这样就不会再
with the context length and we don't lose too many sequences

66
00:02:59,490 --> 00:03:01,263
因为序列太短而丢失太多序列。
because they're too short anymore.

67
00:03:04,170 --> 00:03:05,760
到目前为止,我们只讨论了
So far, we have only talked

68
00:03:05,760 --> 00:03:08,370
因果语言建模的输入,
about the inputs for causal language modeling,

69
00:03:08,370 --> 00:03:11,850
还没有提到监督训练所需的标签。
but not the labels needed for supervised training.

70
00:03:11,850 --> 00:03:13,380
当我们进行因果语言建模时,
When we do causal language modeling,

71
00:03:13,380 --> 00:03:16,710
我们不需要为输入序列提供任何额外的标签,
we don't require any extra labels for the input sequences

72
00:03:16,710 --> 00:03:20,610
因为输入序列本身就是标签。
as the input sequences themselves are the labels.

73
00:03:20,610 --> 00:03:24,240
在这个例子中,当我们把词元 Trans 输入给模型时,
In this example, when we feed the token Trans to the model,

74
00:03:24,240 --> 00:03:27,510
我们想要预测的下一个词元是 formers。
the next token we want to predict is formers.

75
00:03:27,510 --> 00:03:30,780
在下一步中,我们把 Trans 和 formers 输入给模型,
In the next step, we feed Trans and formers to the model

76
00:03:30,780 --> 00:03:33,903
我们想要预测的标签是 are。
and the label we want to predict is are.

77
00:03:35,460 --> 00:03:38,130
这种模式不断延续,如你所见,
This pattern continues, and as you can see,

78
00:03:38,130 --> 00:03:41,220
输入序列就是仅移动了一个位置的
the input sequence is the label sequence

79
00:03:41,220 --> 00:03:42,663
标签序列。
just shifted by one.
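A sketch of the two preparation strategies narrated above: applying the tokenize function over the whole dataset with batches while removing the existing columns, and the variant for short samples that concatenates everything with an EOS token before chunking. The dataset name and its "content" column are assumptions used only for illustration; `tokenizer` and `context_length` are the ones from the sketch above.

```python
from datasets import load_dataset

# Hypothetical corpus and column name; substitute your own text dataset.
raw_dataset = load_dataset("huggingface-course/codeparrot-ds-train", split="train")


def tokenize(batch):
    outputs = tokenizer(
        batch["content"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )
    # Keep only chunks whose length matches the context length.
    input_batch = [
        ids
        for length, ids in zip(outputs["length"], outputs["input_ids"])
        if length == context_length
    ]
    # One text can yield several chunks, so the original columns no longer
    # line up with the new rows and must be removed in map().
    return {"input_ids": input_batch}


tokenized_dataset = raw_dataset.map(
    tokenize, batched=True, remove_columns=raw_dataset.column_names
)


def tokenize_and_concat(batch):
    # Variant for samples shorter than the context window: tokenize without
    # truncation, join everything with the EOS token, then slice into chunks.
    all_ids = []
    for text in batch["content"]:
        all_ids.extend(tokenizer(text)["input_ids"] + [tokenizer.eos_token_id])
    chunks = [
        all_ids[i : i + context_length]
        for i in range(0, len(all_ids), context_length)
    ]
    # Drop a trailing chunk that does not fill the context window.
    if chunks and len(chunks[-1]) < context_length:
        chunks = chunks[:-1]
    return {"input_ids": chunks}
```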
80
00:03:43,590 --> 00:03:47,310
由于模型只在第一个词元之后进行预测,
Since the model only makes predictions after the first token,

81
00:03:47,310 --> 00:03:49,350
输入序列的第一个元素,
the first element of the input sequence,

82
00:03:49,350 --> 00:03:52,980
在本例中就是 Trans,不会被用作标签。
in this case, Trans, is not used as a label.

83
00:03:52,980 --> 00:03:55,530
同样,对于序列中的最后一个词元,
Similarly, we don't have a label

84
00:03:55,530 --> 00:03:57,600
我们也没有标签,
for the last token in the sequence

85
00:03:57,600 --> 00:04:00,843
因为序列结束后没有词元了。
since there is no token after the sequence ends.

86
00:04:04,110 --> 00:04:06,300
让我们看看在代码中为因果语言建模
Let's have a look at what we need to do

87
00:04:06,300 --> 00:04:10,200
创建标签需要做些什么。
to create the labels for causal language modeling in code.

88
00:04:10,200 --> 00:04:12,360
如果我们想在一个批次上计算损失,
If we want to calculate a loss on a batch,

89
00:04:12,360 --> 00:04:15,120
只需将 input_ids 作为标签传入,
we can just pass the input_ids as labels

90
00:04:15,120 --> 00:04:18,933
所有的移位都会在模型内部处理。
and all the shifting is handled in the model internally.

91
00:04:20,032 --> 00:04:22,170
所以,你看,在处理因果语言建模的数据时,
So, you see, there's no matching involved

92
00:04:22,170 --> 00:04:24,870
不涉及任何匹配,
in processing data for causal language modeling,

93
00:04:24,870 --> 00:04:27,723
只需要几个简单的步骤。
and it only requires a few simple steps.

94
00:04:28,854 --> 00:04:31,771
(过渡音乐)
(transition music)
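A minimal sketch of the labeling step described in cues 88-90, again assuming a GPT-2 checkpoint and a made-up input sentence; passing the input_ids as the labels is the only part taken from the transcript.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint, chosen only for illustration.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

batch = tokenizer(["Transformers are great!"], return_tensors="pt")

# The labels are simply the input_ids; the model shifts them internally so
# that each position is trained to predict the next token.
outputs = model(input_ids=batch["input_ids"], labels=batch["input_ids"])
print(outputs.loss)
```

When training with the Trainer, `DataCollatorForLanguageModeling(tokenizer, mlm=False)` performs this labels-equal-to-input_ids copy for each batch.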