subtitles/en/56_data-processing-for-masked-language-modeling.srt
1
00:00:00,000 --> 00:00:02,333
(whooshing)
2
00:00:05,250 --> 00:00:07,230
- Let's see how we can preprocess our data
3
00:00:07,230 --> 00:00:08,703
for masked language modeling.
4
00:00:10,230 --> 00:00:12,570
As a reminder, masked language modeling
5
00:00:12,570 --> 00:00:15,333
is when a model needs to fill
in the blanks in a sentence.
6
00:00:16,530 --> 00:00:19,650
To do this, you just
need texts, no labels,
7
00:00:19,650 --> 00:00:22,200
as this is a self-supervised problem.
8
00:00:22,200 --> 00:00:23,670
To apply this on your own data,
9
00:00:23,670 --> 00:00:25,740
just make sure you have
all your texts gathered
10
00:00:25,740 --> 00:00:27,603
in one column of your dataset.
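As an illustration, here is a minimal sketch of that starting point, using the imdb dataset purely as an example; any dataset works as long as the raw text sits in a single column:

```python
from datasets import load_dataset

# imdb is just an illustrative choice; only the "text" column matters here,
# the labels are not needed for masked language modeling.
raw_datasets = load_dataset("imdb")
print(raw_datasets["train"].column_names)  # ['text', 'label']
```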
11
00:00:28,440 --> 00:00:30,480
Before we start randomly masking things,
12
00:00:30,480 --> 00:00:33,090
we will need to somehow make
all those texts the same length
13
00:00:33,090 --> 00:00:34,263
to batch them together.
14
00:00:35,640 --> 00:00:38,490
The first way to make all
the texts the same length
15
00:00:38,490 --> 00:00:40,590
is the one we used in text classification.
16
00:00:41,430 --> 00:00:44,163
Let's pad the short texts
and truncate the long ones.
17
00:00:45,030 --> 00:00:45,900
As we have seen
18
00:00:45,900 --> 00:00:48,690
when we processed data
for text classification,
19
00:00:48,690 --> 00:00:49,923
this is all done by our tokenizer
20
00:00:49,923 --> 00:00:53,130
with the right options for
padding and truncation.
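In code, this first approach could look like the sketch below, assuming a bert-base-cased tokenizer, a context length of 128 and a "text" column; these are illustrative choices, not the only possible ones:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

raw_datasets = load_dataset("imdb")  # illustrative dataset with a "text" column
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
context_length = 128  # illustrative context length

def tokenize_pad_and_truncate(examples):
    # Short texts are padded up to context_length and long ones are truncated,
    # so every sample ends up with the same length and can be batched.
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=context_length,
    )

tokenized_datasets = raw_datasets.map(tokenize_pad_and_truncate, batched=True)
```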
21
00:00:53,130 --> 00:00:56,100
This will, however, make
us lose a lot of text
22
00:00:56,100 --> 00:00:58,620
if the examples in our
dataset are very long,
23
00:00:58,620 --> 00:01:00,960
compared to the context length we picked.
24
00:01:00,960 --> 00:01:03,393
Here, all the portion in gray is lost.
25
00:01:04,410 --> 00:01:06,660
This is why a second way
to generate samples of text
26
00:01:06,660 --> 00:01:08,820
with the same length is to chunk our text
27
00:01:08,820 --> 00:01:10,560
into pieces of context length,
28
00:01:10,560 --> 00:01:14,010
instead of discarding everything
after the first chunk.
29
00:01:14,010 --> 00:01:15,420
There will probably be a remainder
30
00:01:15,420 --> 00:01:17,700
of length smaller than the context size,
31
00:01:17,700 --> 00:01:20,493
which we can choose to
keep and pad or ignore.
32
00:01:21,570 --> 00:01:23,790
Here is how we can apply this in practice,
33
00:01:23,790 --> 00:01:26,460
by just adding the
return_overflowing_tokens option
34
00:01:26,460 --> 00:01:28,200
in our tokenizer call.
35
00:01:28,200 --> 00:01:30,243
Note how this gives us a bigger dataset!
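A sketch of this second approach, with the same illustrative checkpoint and context length as before; the old columns have to be dropped because one text can now yield several samples:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

raw_datasets = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
context_length = 128

def tokenize_and_chunk(examples):
    # return_overflowing_tokens=True keeps every chunk of up to context_length
    # tokens instead of discarding everything after the first one.
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
    )

# The number of rows grows, so the original columns must be removed.
tokenized_datasets = raw_datasets.map(
    tokenize_and_chunk,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)
print(len(raw_datasets["train"]), "->", len(tokenized_datasets["train"]))
```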
36
00:01:31,560 --> 00:01:34,260
This second way of chunking
is ideal if all your texts
37
00:01:34,260 --> 00:01:36,270
are very long, but it won't work
38
00:01:36,270 --> 00:01:39,900
as nicely if you have a variety
of lengths in the texts.
39
00:01:39,900 --> 00:01:41,040
In this case,
40
00:01:41,040 --> 00:01:44,280
the best option is to concatenate
all your tokenized texts
41
00:01:44,280 --> 00:01:46,560
in one big stream, with special tokens
42
00:01:46,560 --> 00:01:49,800
to indicate when you pass from
one document to the other,
43
00:01:49,800 --> 00:01:52,503
and only then split the
big stream into chunks.
44
00:01:53,760 --> 00:01:55,620
Here is how it can be done with code,
45
00:01:55,620 --> 00:01:58,230
with one loop to concatenate all the texts
46
00:01:58,230 --> 00:01:59,673
and another one to chunk it.
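A sketch of that code, again with illustrative choices (bert-base-cased tokenizer, imdb dataset, chunks of 128 tokens); the special tokens the tokenizer adds around each text, such as [SEP], are what marks the passage from one document to the next in the stream:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

raw_datasets = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
chunk_size = 128  # illustrative chunk size

def tokenize_function(examples):
    # No truncation here: full texts are kept and split later.
    return tokenizer(examples["text"])

tokenized_datasets = raw_datasets.map(
    tokenize_function, batched=True, remove_columns=raw_datasets["train"].column_names
)

def group_texts(examples):
    # First loop: concatenate each field of the batch into one big stream.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated["input_ids"])
    # Drop the remainder shorter than chunk_size (it could also be kept and padded).
    total_length = (total_length // chunk_size) * chunk_size
    # Second loop: split the stream into chunks of chunk_size tokens.
    return {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated.items()
    }

lm_datasets = tokenized_datasets.map(group_texts, batched=True)
```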
47
00:02:00,780 --> 00:02:02,850
Notice how it reduces
the number of samples
48
00:02:02,850 --> 00:02:04,230
in our dataset here;
49
00:02:04,230 --> 00:02:06,580
there must have been
quite a few short entries!
50
00:02:07,710 --> 00:02:11,130
Once this is done, the
masking is the easy part.
51
00:02:11,130 --> 00:02:13,400
There is a data collator
designed specifically for this
52
00:02:13,400 --> 00:02:15,540
in the Transformers library.
53
00:02:15,540 --> 00:02:17,700
You can use it directly in the Trainer,
54
00:02:17,700 --> 00:02:20,400
or when converting your
datasets to TensorFlow datasets
55
00:02:20,400 --> 00:02:23,703
before calling Keras fit, with
the to_tf_dataset method.
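For instance, the masking step could look like the sketch below, with an illustrative 15% masking probability; the Trainer and to_tf_dataset usages are shown as commented hints rather than a full training script:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Randomly masks tokens in each batch and builds the matching labels;
# 0.15 is the usual masking probability, adjust as needed.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# With the Trainer (model, training_args and lm_datasets defined elsewhere):
# trainer = Trainer(
#     model=model,
#     args=training_args,
#     train_dataset=lm_datasets["train"],
#     data_collator=data_collator,
# )

# With Keras, converting the dataset before calling fit:
# tf_train = lm_datasets["train"].to_tf_dataset(
#     columns=["input_ids", "attention_mask"],
#     collate_fn=data_collator,
#     shuffle=True,
#     batch_size=32,
# )
```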
56
00:02:24,992 --> 00:02:27,325
(whooshing)