subtitles/en/63_data-processing-for-causal-language-modeling.srt
1
00:00:00,000 --> 00:00:02,917
(transition music)
2
00:00:05,364 --> 00:00:08,310
- In this video, we take a
look at the data processing
3
00:00:08,310 --> 00:00:10,803
necessary to train causal language models.
4
00:00:12,690 --> 00:00:14,400
Causal language modeling is the task
5
00:00:14,400 --> 00:00:17,820
of predicting the next token
based on the previous ones.
6
00:00:17,820 --> 00:00:19,680
Another term for causal language modeling
7
00:00:19,680 --> 00:00:21,000
is autoregressive modeling.
8
00:00:21,000 --> 00:00:23,940
In the example that you can see here,
9
00:00:23,940 --> 00:00:25,560
the next token could, for example,
10
00:00:25,560 --> 00:00:28,263
be NLP or it could be machine learning.
11
00:00:29,460 --> 00:00:31,457
A popular example of
causal language models
12
00:00:31,457 --> 00:00:33,693
is the GPT family of models.
13
00:00:35,561 --> 00:00:38,010
To train models such as GPT,
14
00:00:38,010 --> 00:00:41,460
we usually start with a
large corpus of text files.
15
00:00:41,460 --> 00:00:43,890
These files can be webpages
scraped from the internet
16
00:00:43,890 --> 00:00:46,020
such as the Common Crawl dataset
17
00:00:46,020 --> 00:00:47,940
or they can be Python files from GitHub,
18
00:00:47,940 --> 00:00:49,490
like the ones you can see here.
19
00:00:50,400 --> 00:00:52,680
As a first step, we need
to tokenize these files
20
00:00:52,680 --> 00:00:55,380
such that we can feed
them through the model.
21
00:00:55,380 --> 00:00:58,500
Here, we show the tokenized
texts as bars of various lengths,
22
00:00:58,500 --> 00:01:02,188
illustrating that there are
shorter and longer ones.
23
00:01:02,188 --> 00:01:05,910
This is very common
when working with text.
24
00:01:05,910 --> 00:01:09,270
However, transformer models
have a limited context window
25
00:01:09,270 --> 00:01:10,770
and depending on the data source,
26
00:01:10,770 --> 00:01:13,140
it is possible that the tokenized texts
27
00:01:13,140 --> 00:01:15,183
are much longer than this window.
28
00:01:16,080 --> 00:01:18,870
In this case, we could
just truncate the sequences
29
00:01:18,870 --> 00:01:20,182
to the context length,
30
00:01:20,182 --> 00:01:22,650
but this would mean
that we lose everything
31
00:01:22,650 --> 00:01:24,513
after the first context window.
32
00:01:25,500 --> 00:01:28,410
Using the return_overflowing_tokens flag,
33
00:01:28,410 --> 00:01:30,960
we can use the tokenizer to create chunks
34
00:01:30,960 --> 00:01:33,510
with each one being the
size of the context length.
35
00:01:34,860 --> 00:01:36,180
Sometimes, it can still happen
36
00:01:36,180 --> 00:01:37,590
that the last chunk is too short
37
00:01:37,590 --> 00:01:39,900
if there aren't enough tokens to fill it.
38
00:01:39,900 --> 00:01:41,793
In this case, we can just remove it.
39
00:01:42,990 --> 00:01:45,960
With the return_length keyword,
40
00:01:45,960 --> 00:01:49,173
we also get the length of
each chunk from the tokenizer.
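
As a rough sketch of how these two flags might be combined (the "gpt2" checkpoint, the context length of 128, and the example string are assumptions for illustration, not taken from the video):

```python
from transformers import AutoTokenizer

# Any GPT-style tokenizer works; "gpt2" is just an illustrative choice.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
context_length = 128  # hypothetical context window size

outputs = tokenizer(
    ["a very long training text ..."],   # a batch of raw texts
    truncation=True,
    max_length=context_length,
    return_overflowing_tokens=True,      # split long texts into several chunks
    return_length=True,                  # also report the length of each chunk
)

print(outputs["length"])  # e.g. [128, 128, 57] if the text yields three chunks
```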
41
00:01:51,960 --> 00:01:53,640
This function shows all the steps
42
00:01:53,640 --> 00:01:56,280
necessary to prepare the dataset.
43
00:01:56,280 --> 00:01:57,960
First, we tokenize the dataset
44
00:01:57,960 --> 00:02:00,330
with the flags I just mentioned.
45
00:02:00,330 --> 00:02:02,190
Then, we go through each chunk
46
00:02:02,190 --> 00:02:04,680
and if its length matches
the context length,
47
00:02:04,680 --> 00:02:06,663
we add it to the inputs we return.
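
A sketch of such a preparation function, in the spirit of what is shown on screen (it reuses the tokenizer and context_length from the sketch above; the "content" column name is an assumption):

```python
def tokenize(element):
    # Tokenize with the flags discussed above.
    outputs = tokenizer(
        element["content"],              # assumed name of the text column
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )
    input_batch = []
    # Keep only the chunks that exactly fill the context window.
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if length == context_length:
            input_batch.append(input_ids)
    return {"input_ids": input_batch}
```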
48
00:02:07,590 --> 00:02:10,260
We can apply this function
to the whole dataset.
49
00:02:10,260 --> 00:02:11,700
In addition, we make sure
50
00:02:11,700 --> 00:02:15,450
to use batching and
remove the existing columns.
51
00:02:15,450 --> 00:02:17,670
We need to remove the existing columns,
52
00:02:17,670 --> 00:02:21,330
because we can create
multiple samples per text,
53
00:02:21,330 --> 00:02:22,890
and the shapes in the dataset
54
00:02:22,890 --> 00:02:24,753
would not match anymore in that case.
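
Applied with the datasets library's map method, this could look roughly as follows (raw_dataset is a hypothetical DatasetDict loaded beforehand):

```python
# Batched processing is needed because one text can yield several chunks,
# and the original columns are dropped so the new number of rows does not
# clash with the lengths of the old columns.
tokenized_dataset = raw_dataset.map(
    tokenize,
    batched=True,
    remove_columns=raw_dataset["train"].column_names,
)
```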
55
00:02:26,832 --> 00:02:30,330
If the context length is of
similar length to the files,
56
00:02:30,330 --> 00:02:32,733
this approach doesn't
work so well anymore.
57
00:02:33,660 --> 00:02:36,420
In this example, both sample 1 and 2
58
00:02:36,420 --> 00:02:38,400
are shorter than the context size
59
00:02:38,400 --> 00:02:41,610
and will be discarded with
the previous approach.
60
00:02:41,610 --> 00:02:45,150
In this case, it is better
to first tokenize each sample
61
00:02:45,150 --> 00:02:46,590
without truncation
62
00:02:46,590 --> 00:02:49,290
and then concatenate the tokenized samples
63
00:02:49,290 --> 00:02:52,353
with an end of string
or EOS token in between.
64
00:02:53,546 --> 00:02:56,220
Finally, we can chunk this long sequence
65
00:02:56,220 --> 00:02:59,490
with the context length and we
don't lose too many sequences
66
00:02:59,490 --> 00:03:01,263
because they're too short anymore.
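
A minimal sketch of this concatenate-then-chunk strategy (again reusing the tokenizer and context_length from above; the "content" column is assumed):

```python
def tokenize_and_chunk(batch):
    # Tokenize every sample without truncation and join them into one long
    # stream, separated by the end-of-string (EOS) token.
    all_ids = []
    for text in batch["content"]:
        all_ids.extend(tokenizer(text)["input_ids"])
        all_ids.append(tokenizer.eos_token_id)
    # Slice the stream into chunks of exactly context_length tokens;
    # only the short remainder at the very end is dropped.
    chunks = [
        all_ids[i : i + context_length]
        for i in range(0, len(all_ids) - context_length + 1, context_length)
    ]
    return {"input_ids": chunks}
```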
67
00:03:04,170 --> 00:03:05,760
So far, we have only talked
68
00:03:05,760 --> 00:03:08,370
about the inputs for
causal language modeling,
69
00:03:08,370 --> 00:03:11,850
but not the labels needed
for supervised training.
70
00:03:11,850 --> 00:03:13,380
When we do causal language modeling,
71
00:03:13,380 --> 00:03:16,710
we don't require any extra
labels for the input sequences
72
00:03:16,710 --> 00:03:20,610
as the input sequences
themselves are the labels.
73
00:03:20,610 --> 00:03:24,240
In this example, when we feed
the token trans to the model,
74
00:03:24,240 --> 00:03:27,510
the next token we want
to predict is formers.
75
00:03:27,510 --> 00:03:30,780
In the next step, we feed
trans and formers to the model
76
00:03:30,780 --> 00:03:33,903
and the label we want to predict is are.
77
00:03:35,460 --> 00:03:38,130
This pattern continues,
and as you can see,
78
00:03:38,130 --> 00:03:41,220
the input sequence is the label sequence
79
00:03:41,220 --> 00:03:42,663
just shifted by one.
80
00:03:43,590 --> 00:03:47,310
Since the model only makes
predictions after the first token,
81
00:03:47,310 --> 00:03:49,350
the first element of the input sequence,
82
00:03:49,350 --> 00:03:52,980
in this case, trans,
is not used as a label.
83
00:03:52,980 --> 00:03:55,530
Similarly, we don't have a label
84
00:03:55,530 --> 00:03:57,600
for the last token in the sequence
85
00:03:57,600 --> 00:04:00,843
since there is no token
after the sequence ends.
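
As a toy illustration of this shift (the token strings follow the example in the video; the final token "great" is an assumption added to complete it):

```python
# The labels are simply the inputs shifted by one position.
tokens = ["Trans", "formers", "are", "great"]

inputs = tokens[:-1]   # the last token has no label
labels = tokens[1:]    # the first token is never used as a label

for inp, lab in zip(inputs, labels):
    print(f"input ends with {inp!r:12} -> predict {lab!r}")
```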
86
00:04:04,110 --> 00:04:06,300
Let's have a look at what we need to do
87
00:04:06,300 --> 00:04:10,200
to create the labels for causal
language modeling in code.
88
00:04:10,200 --> 00:04:12,360
If we want to calculate a loss on a batch,
89
00:04:12,360 --> 00:04:15,120
we can just pass the input_ids as labels
90
00:04:15,120 --> 00:04:18,933
and all the shifting is handled
in the model internally.
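
A minimal sketch of that loss computation (the "gpt2" checkpoint and the example sentence are assumptions):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# A tiny hypothetical batch of token ids.
batch = torch.tensor([tokenizer("Transformers are great!")["input_ids"]])

# Passing the inputs as labels is enough: the model shifts them internally
# before computing the causal language modeling loss.
outputs = model(input_ids=batch, labels=batch)
print(outputs.loss)
```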
91
00:04:20,032 --> 00:04:22,170
So, you see, there's no magic involved
92
00:04:22,170 --> 00:04:24,870
in processing data for
causal language modeling,
93
00:04:24,870 --> 00:04:27,723
and it only requires a few simple steps.
94
00:04:28,854 --> 00:04:31,771
(transition music)