subtitles/zh-CN/63_data-processing-for-causal-language-modeling.srt
1
00:00:00,000 --> 00:00:02,917
(过渡音乐)
(transition music)
2
00:00:05,364 --> 00:00:08,310
- 在这个视频中,我们来看看
- In this video, we take a look at the data processing
3
00:00:08,310 --> 00:00:10,803
训练因果语言模型所必需的数据处理。
necessary to train causal language models.
4
00:00:12,690 --> 00:00:14,400
因果语言建模是
Causal language modeling is the task
5
00:00:14,400 --> 00:00:17,820
基于先前的词元预测下一个词元的任务。
of predicting the next token based on the previous ones.
6
00:00:17,820 --> 00:00:19,680
因果语言建模的另一个术语
Another term for causal language modeling
7
00:00:19,680 --> 00:00:21,000
是自回归建模。
is autoregressive modeling.
8
00:00:21,000 --> 00:00:23,940
在这里的示例中,
In the example that you can see here,
9
00:00:23,940 --> 00:00:25,560
例如,下一个词元可以
the next token could, for example,
10
00:00:25,560 --> 00:00:28,263
是 NLP,也可能是机器学习。
be NLP or it could be machine learning.
11
00:00:29,460 --> 00:00:31,457
因果语言模型的一个流行示例
A popular example of causal language models
12
00:00:31,457 --> 00:00:33,693
是 GPT 系列模型。
is the GPT family of models.
13
00:00:35,561 --> 00:00:38,010
训练 GPT 等模型,
To train models such as GPT,
14
00:00:38,010 --> 00:00:41,460
我们通常从大量文本文件组成的语料库开始。
we usually start with a large corpus of text files.
15
00:00:41,460 --> 00:00:43,890
这些文件可以是从互联网上抓取的网页
These files can be webpages scraped from the internet
16
00:00:43,890 --> 00:00:46,020
例如 Common Crawl 数据集
such as the Common Crawl dataset
17
00:00:46,020 --> 00:00:47,940
或者它们可以是来自 GitHub 的 Python 文件,
or they can be Python files from GitHub,
18
00:00:47,940 --> 00:00:49,490
就像你在这里看到的一样。
like the ones you can see here.
19
00:00:50,400 --> 00:00:52,680
作为第一步,我们需要词元化这些文件
As a first step, we need to tokenize these files
20
00:00:52,680 --> 00:00:55,380
这样我们就可以将它们输入给模型。
such that we can feed them through the model.
21
00:00:55,380 --> 00:00:58,500
在这里,我们将词元化的文本显示为不同长度的条,
Here, we show the tokenized texts as bars of various length,
22
00:00:58,500 --> 00:01:02,188
表明其中有些较短,有些较长。
illustrating that there are shorter and longer ones.
23
00:01:02,188 --> 00:01:05,910
这在处理文本时很常见。
This is very common when working with text.
24
00:01:05,910 --> 00:01:09,270
但是,Transformer 模型的上下文窗口有限
However, transformer models have a limited context window
25
00:01:09,270 --> 00:01:10,770
而且根据数据源的不同,
and depending on the data source,
26
00:01:10,770 --> 00:01:13,140
词元化的文本可能
it is possible that the tokenized texts
27
00:01:13,140 --> 00:01:15,183
比这个窗口长得多。
are much longer than this window.
28
00:01:16,080 --> 00:01:18,870
在这种情况下,我们可以将序列
In this case, we could just truncate the sequences
29
00:01:18,870 --> 00:01:20,182
截断为上下文长度,
to the context length,
30
00:01:20,182 --> 00:01:22,650
但这意味着在第一个上下文窗口之后
but this would mean that we lose everything
31
00:01:22,650 --> 00:01:24,513
的所有内容都将丢失。
after the first context window.
32
00:01:25,500 --> 00:01:28,410
使用 return_overflowing_tokens 标志,
Using the return_overflowing_tokens flag,
33
00:01:28,410 --> 00:01:30,960
我们可以使用分词器来创建块
we can use the tokenizer to create chunks
34
00:01:30,960 --> 00:01:33,510
其中每个块都是上下文长度的大小。
with each one being the size of the context length.
35
00:01:34,860 --> 00:01:36,180
有时仍然可能出现
Sometimes, it can still happen
36
00:01:36,180 --> 00:01:37,590
最后一个块太短的情况,
that the last chunk is too short
37
00:01:37,590 --> 00:01:39,900
如果没有足够的词元来填充它。
if there aren't enough tokens to fill it.
38
00:01:39,900 --> 00:01:41,793
在这种情况下,我们可以将其删除。
In this case, we can just remove it.
39
00:01:42,990 --> 00:01:45,960
使用 return_length 关键字,
With the return_length keyword,
40
00:01:45,960 --> 00:01:49,173
我们还从分词器中获取每个块的长度。
we also get the length of each chunk from the tokenizer.
41
00:01:51,960 --> 00:01:53,640
此函数显示准备数据集
This function shows all the steps
42
00:01:53,640 --> 00:01:56,280
所必需的所有步骤。
necessary to prepare the dataset.
43
00:01:56,280 --> 00:01:57,960
首先,我们用我刚才提到的标志
First, we tokenize the dataset
44
00:01:57,960 --> 00:02:00,330
词元化数据集。
with the flags I just mentioned.
45
00:02:00,330 --> 00:02:02,190
然后,我们遍历每个块
Then, we go through each chunk
46
00:02:02,190 --> 00:02:04,680
如果它的长度与上下文长度匹配,
and if its length matches the context length,
47
00:02:04,680 --> 00:02:06,663
我们将它添加到我们返回的输入中。
we add it to the inputs we return.
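A minimal sketch of such a function, assuming a GPT-2 tokenizer, a context length of 128, and a hypothetical "content" text column:

from transformers import AutoTokenizer

context_length = 128  # illustrative; use your model's context window
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize(batch):
    outputs = tokenizer(
        batch["content"],                # hypothetical column holding the raw text
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,  # split long texts into chunks
        return_length=True,              # also return each chunk's length
    )
    input_batch = []
    # keep only chunks that exactly fill the context window
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if length == context_length:
            input_batch.append(input_ids)
    return {"input_ids": input_batch}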
48
00:02:07,590 --> 00:02:10,260
我们可以将此函数应用于整个数据集。
We can apply this function to the whole dataset.
49
00:02:10,260 --> 00:02:11,700
此外,我们确保
In addition, we make sure
50
00:02:11,700 --> 00:02:15,450
使用批处理并删除现有列。
to use batches and remove the existing columns.
51
00:02:15,450 --> 00:02:17,670
我们之所以需要删除现有的列,
We need to remove the existing columns,
52
00:02:17,670 --> 00:02:21,330
是因为我们可以为每个文本创建多个样本,
because we can create multiple samples per text,
53
00:02:21,330 --> 00:02:22,890
和数据集中的形状
and the shapes in the dataset
54
00:02:22,890 --> 00:02:24,753
在那种情况下将不再匹配。
would not match anymore in that case.
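Applied with the datasets map method, this could look like the following; raw_datasets is an assumed DatasetDict loaded earlier:

tokenized_datasets = raw_datasets.map(
    tokenize,
    batched=True,                                       # process batches of texts at once
    remove_columns=raw_datasets["train"].column_names,  # drop old columns so shapes match
)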
55
00:02:26,832 --> 00:02:30,330
如果上下文长度与文件长度相似,
If the context length is similar to the length of the files,
56
00:02:30,330 --> 00:02:32,733
这种方法不再那么有效了。
this approach doesn't work so well anymore.
57
00:02:33,660 --> 00:02:36,420
在这个例子中,样本 1 和 2
In this example, both sample 1 and 2
58
00:02:36,420 --> 00:02:38,400
比上下文大小短
are shorter than the context size
59
00:02:38,400 --> 00:02:41,610
按照之前的方法处理的话,它们将被丢弃。
and will be discarded with the previous approach.
60
00:02:41,610 --> 00:02:45,150
在这种情况下,最好先词元化每个样本
In this case, it is better to first tokenize each sample
61
00:02:45,150 --> 00:02:46,590
而不去截断
without truncation
62
00:02:46,590 --> 00:02:49,290
然后连接词元化后的样本
and then concatenate the tokenized samples
63
00:02:49,290 --> 00:02:52,353
并在它们之间加上字符串结束符,即 EOS 词元。
with an end of string or EOS token in between.
64
00:02:53,546 --> 00:02:56,220
最后,我们可以按照上下文长度分块这个长序列,
Finally, we can chunk this long sequence
65
00:02:56,220 --> 00:02:59,490
这样就不会再因为太短
with the context length and we don't lose too many sequences
66
00:02:59,490 --> 00:03:01,263
而丢失太多序列了。
because they're too short anymore.
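A sketch of this concatenate-then-chunk variant, reusing the tokenizer, context_length, and "content" column assumed above:

def tokenize_and_chunk(batch):
    all_ids = []
    # tokenize without truncation and join the samples with the EOS token
    for ids in tokenizer(batch["content"])["input_ids"]:
        all_ids.extend(ids + [tokenizer.eos_token_id])
    # cut the long token stream into context-sized chunks
    chunks = [
        all_ids[i : i + context_length]
        for i in range(0, len(all_ids), context_length)
    ]
    # drop a trailing chunk that is shorter than the context length
    if chunks and len(chunks[-1]) < context_length:
        chunks = chunks[:-1]
    return {"input_ids": chunks}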
67
00:03:04,170 --> 00:03:05,760
到目前为止,我们只介绍了
So far, we have only talked
68
00:03:05,760 --> 00:03:08,370
因果语言建模的输入,
about the inputs for causal language modeling,
69
00:03:08,370 --> 00:03:11,850
但还没有提到监督训练所需的标签。
but not the labels needed for supervised training.
70
00:03:11,850 --> 00:03:13,380
当我们进行因果语言建模时,
When we do causal language modeling,
71
00:03:13,380 --> 00:03:16,710
我们不需要输入序列的任何额外标签
we don't require any extra labels for the input sequences
72
00:03:16,710 --> 00:03:20,610
因为输入序列本身就是标签。
as the input sequences themselves are the labels.
73
00:03:20,610 --> 00:03:24,240
在这个例子中,当我们将词元 Trans 提供给模型时,
In this example, when we feed the token Trans to the model,
74
00:03:24,240 --> 00:03:27,510
我们要预测的下一个词元是 formers 。
the next token we want to predict is formers.
75
00:03:27,510 --> 00:03:30,780
在下一步中,我们将 Trans 和 formers 提供给模型
In the next step, we feed Trans and formers to the model
76
00:03:30,780 --> 00:03:33,903
我们想要预测的标签是 are。
and the label we want to predict is are.
77
00:03:35,460 --> 00:03:38,130
这种模式仍在继续,如你所见,
This pattern continues, and as you can see,
78
00:03:38,130 --> 00:03:41,220
输入序列是前移了一个位置的
the input sequence is the label sequence
79
00:03:41,220 --> 00:03:42,663
标签序列。
just shifted by one.
80
00:03:43,590 --> 00:03:47,310
由于模型仅在第一个词元之后进行预测,
Since the model only makes predictions after the first token,
81
00:03:47,310 --> 00:03:49,350
输入序列的第一个元素,
the first element of the input sequence,
82
00:03:49,350 --> 00:03:52,980
在本例中,就是 Trans,不会作为标签使用。
in this case, Trans, is not used as a label.
83
00:03:52,980 --> 00:03:55,530
同样,对于序列中的最后一个词元
Similarly, we don't have a label
84
00:03:55,530 --> 00:03:57,600
我们也没有标签
for the last token in the sequence
85
00:03:57,600 --> 00:04:00,843
因为序列结束后没有词元。
since there is no token after the sequence ends.
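Conceptually, the shift described here looks like this, with an illustrative sequence (not the one from the video):

token_ids = tokenizer("Transformers are great").input_ids
inputs = token_ids[:-1]  # there is no label for the last token
labels = token_ids[1:]   # each label is simply the next token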
86
00:04:04,110 --> 00:04:06,300
让我们看看我们需要做什么
Let's have a look at what we need to do
87
00:04:06,300 --> 00:04:10,200
才能在代码中为因果语言建模创建标签。
to create the labels for causal language modeling in code.
88
00:04:10,200 --> 00:04:12,360
如果我们想计算一批的损失,
If we want to calculate a loss on a batch,
89
00:04:12,360 --> 00:04:15,120
我们可以将 input_ids 作为标签传递
we can just pass the input_ids as labels
90
00:04:15,120 --> 00:04:18,933
所有的移位操作都在模型内部处理。
and all the shifting is handled in the model internally.
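For example, sketched with a GPT-2 model (an assumption) and a batch of the equal-length chunks prepared above:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
batch = torch.tensor(tokenized_datasets["train"][:4]["input_ids"])
# pass the inputs as labels; the shift by one happens inside the model
outputs = model(input_ids=batch, labels=batch)
print(outputs.loss)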
91
00:04:20,032 --> 00:04:22,170
所以,你看,在处理因果语言建模的数据时,
So, you see, there's no magic involved
92
00:04:22,170 --> 00:04:24,870
并没有什么神奇之处,
in processing data for causal language modeling,
93
00:04:24,870 --> 00:04:27,723
它只需要几个简单的步骤。
and it only requires a few simple steps.
94
00:04:28,854 --> 00:04:31,771
(过渡音乐)
(transition music)