(transition music)

- In this video, we take a look at the data processing necessary to train causal language models.

Causal language modeling is the task of predicting the next token based on the previous ones. Another term for causal language modeling is autoregressive modeling. In the example that you can see here, the next token could, for example, be NLP, or it could be machine learning. A popular example of causal language models is the GPT family of models.

To train models such as GPT, we usually start with a large corpus of text files. These files can be webpages scraped from the internet, such as the Common Crawl dataset, or they can be Python files from GitHub, like the ones you can see here.

As a first step, we need to tokenize these files so that we can feed them through the model. Here, we show the tokenized texts as bars of various lengths, illustrating that there are shorter and longer ones. This is very common when working with text. However, Transformer models have a limited context window, and depending on the data source, it is possible that the tokenized texts are much longer than this window.

In this case, we could just truncate the sequences to the context length, but this would mean that we lose everything after the first context window. Using the return_overflowing_tokens flag, we can instead use the tokenizer to create chunks, with each one being the size of the context length. Sometimes, it can still happen that the last chunk is too short if there aren't enough tokens to fill it. In this case, we can just remove it. With the return_length keyword, we also get the length of each chunk from the tokenizer.

This function shows all the steps necessary to prepare the dataset. First, we tokenize the dataset with the flags I just mentioned. Then, we go through each chunk, and if its length matches the context length, we add it to the inputs we return. We can apply this function to the whole dataset.
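To make this concrete, here is a minimal sketch of what such a function could look like, assuming a GPT-2 tokenizer, a context length of 128 tokens, and a dataset whose text lives in a "content" column (these names and values are illustrative, not taken from the video):

```python
from transformers import AutoTokenizer

# Assumed setup: a GPT-2 tokenizer and a context window of 128 tokens.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
context_length = 128

def tokenize(element):
    # Tokenize with truncation, but keep the overflowing tokens so that each
    # long text is split into chunks of at most `context_length` tokens.
    outputs = tokenizer(
        element["content"],  # assumed name of the text column
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )
    # Keep only the chunks that exactly fill the context window;
    # the last, shorter chunk of each text is simply dropped.
    input_batch = []
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if length == context_length:
            input_batch.append(input_ids)
    return {"input_ids": input_batch}
```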
In addition, we make sure to use batches and to remove the existing columns. We need to remove the existing columns because we can create multiple samples per text, and the shapes in the dataset would not match anymore in that case.

If the context length is similar to the length of the files, this approach doesn't work so well anymore. In this example, both sample 1 and sample 2 are shorter than the context size and would be discarded with the previous approach. In this case, it is better to first tokenize each sample without truncation and then concatenate the tokenized samples with an end-of-string, or EOS, token in between. Finally, we can chunk this long sequence with the context length, and we no longer lose many sequences because they are too short.

So far, we have only talked about the inputs for causal language modeling, but not the labels needed for supervised training. When we do causal language modeling, we don't require any extra labels for the input sequences, as the input sequences themselves are the labels. In this example, when we feed the token "Trans" to the model, the next token we want it to predict is "formers". In the next step, we feed "Trans" and "formers" to the model, and the label we want it to predict is "are".

This pattern continues, and as you can see, the label sequence is just the input sequence shifted by one. Since the model only makes predictions after the first token, the first element of the input sequence, in this case "Trans", is not used as a label. Similarly, we don't have a label for the last token in the sequence, since there is no token after the sequence ends.

Let's have a look at what we need to do to create the labels for causal language modeling in code. If we want to calculate a loss on a batch, we can just pass the input_ids as labels, and all the shifting is handled in the model internally.

So, you see, there's no magic involved in processing data for causal language modeling, and it only requires a few simple steps.

(transition music)
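Continuing the sketch above (it reuses the hypothetical tokenize, tokenizer, and context_length from there; raw_datasets stands for an assumed datasets.DatasetDict with a "train" split, and "gpt2" is just a placeholder checkpoint), the remaining steps could look roughly like this:

```python
import torch
from transformers import AutoModelForCausalLM

# Apply the tokenize function to the whole dataset in batches and drop the
# original columns, since each text can produce several fixed-size chunks
# and the old columns would no longer line up with the new rows.
tokenized_datasets = raw_datasets.map(
    tokenize,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)

# Alternative for short files: tokenize without truncation, join everything
# with the EOS token, then cut the long sequence into context-sized chunks.
def tokenize_and_concat(element):
    all_ids = []
    for ids in tokenizer(element["content"])["input_ids"]:
        all_ids.extend(ids + [tokenizer.eos_token_id])
    chunks = [
        all_ids[i : i + context_length]
        for i in range(0, len(all_ids), context_length)
    ]
    # Drop a trailing chunk that doesn't fill the context window.
    if chunks and len(chunks[-1]) < context_length:
        chunks = chunks[:-1]
    return {"input_ids": chunks}

# For the loss, we simply pass the input_ids as the labels as well;
# the model shifts them by one position internally.
model = AutoModelForCausalLM.from_pretrained("gpt2")
batch = torch.tensor(tokenized_datasets["train"][:4]["input_ids"])
outputs = model(input_ids=batch, labels=batch)
print(outputs.loss)
```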