subtitles/en/59_data-processing-for-translation.srt
1
00:00:00,449 --> 00:00:01,559
(air whooshing)
2
00:00:01,559 --> 00:00:02,767
(logo popping)
3
00:00:02,767 --> 00:00:05,670
(metal sliding)
4
00:00:05,670 --> 00:00:08,470
- Let's see how to preprocess
a dataset for translation.
5
00:00:09,540 --> 00:00:12,420
This is the task of
translating a sentence
6
00:00:12,420 --> 00:00:14,310
into another language.
7
00:00:14,310 --> 00:00:17,100
This video will focus on how
to preprocess your dataset
8
00:00:17,100 --> 00:00:19,950
once you've managed to put
it in the following format:
9
00:00:19,950 --> 00:00:23,730
one column for the input texts
and one for the target texts.
10
00:00:23,730 --> 00:00:25,980
Here is how we can achieve
this with the Datasets library
11
00:00:25,980 --> 00:00:29,643
and the KDE4 dataset for
English to French translation.
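As a sketch, assuming the kde4 dataset on the Hugging Face Hub, loading those English-French pairs looks like this:

```python
from datasets import load_dataset

# Load the English-French pairs of the KDE4 dataset
raw_datasets = load_dataset("kde4", lang1="en", lang2="fr")

# Each example has a "translation" field holding
# one entry per language
print(raw_datasets["train"][0]["translation"])
# e.g. {'en': '...', 'fr': '...'}
```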
12
00:00:30,870 --> 00:00:33,240
As long as you manage to have
your data look like this,
13
00:00:33,240 --> 00:00:35,440
you should be able to
follow the same steps.
14
00:00:36,630 --> 00:00:39,210
For once, our labels are not integers
15
00:00:39,210 --> 00:00:42,210
corresponding to some
classes, but plain text.
16
00:00:42,210 --> 00:00:45,810
We will thus need to tokenize
them, like our inputs.
17
00:00:45,810 --> 00:00:47,370
There is a trap there though,
18
00:00:47,370 --> 00:00:49,890
because if you tokenize your
targets like your inputs,
19
00:00:49,890 --> 00:00:51,690
you will hit a problem.
20
00:00:51,690 --> 00:00:54,090
Even if you don't speak
French, you might notice
21
00:00:54,090 --> 00:00:57,270
some weird things in the
tokenization of the targets.
22
00:00:57,270 --> 00:01:00,510
Most of the words are
tokenized into several subtokens,
23
00:01:00,510 --> 00:01:03,180
while "fish", one of the only English words,
24
00:01:03,180 --> 00:01:05,670
is tokenized as a single word.
25
00:01:05,670 --> 00:01:08,703
That's because our targets have
been tokenized as English.
26
00:01:09,660 --> 00:01:11,430
Since our model knows two languages,
27
00:01:11,430 --> 00:01:13,800
you have to warn it when
tokenizing the targets
28
00:01:13,800 --> 00:01:16,230
so it switches to French mode.
29
00:01:16,230 --> 00:01:20,010
This is done with the
as_target_tokenizer context manager.
30
00:01:20,010 --> 00:01:23,343
You can see how it results in
a more compact tokenization.
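A minimal sketch, assuming a Marian English-to-French checkpoint such as Helsinki-NLP/opus-mt-en-fr:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-fr")

# Inputs are tokenized in (source) English mode by default
inputs = tokenizer("I like fish.")

# Switch the tokenizer to target (French) mode for the labels
with tokenizer.as_target_tokenizer():
    targets = tokenizer("J'aime le poisson.")
```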
31
00:01:24,810 --> 00:01:25,890
Processing the whole dataset
32
00:01:25,890 --> 00:01:28,440
is then super easy with the map function.
33
00:01:28,440 --> 00:01:30,207
You can pick different maximum lengths
34
00:01:30,207 --> 00:01:32,100
for the inputs and targets,
35
00:01:32,100 --> 00:01:34,530
and choose to pad at this
stage to that maximum length
36
00:01:34,530 --> 00:01:36,273
by setting padding=max_length.
37
00:01:37,230 --> 00:01:39,300
Here we'll show you how to pad dynamically,
38
00:01:39,300 --> 00:01:41,013
as it requires one more step.
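A sketch of such a preprocessing function passed to map, following the KDE4 column format above; the 128-token maximum lengths are arbitrary choices:

```python
max_input_length = 128
max_target_length = 128

def preprocess_function(examples):
    inputs = [ex["en"] for ex in examples["translation"]]
    targets = [ex["fr"] for ex in examples["translation"]]

    # Truncate but don't pad here; padding is done dynamically later
    model_inputs = tokenizer(
        inputs, max_length=max_input_length, truncation=True
    )

    # Tokenize the targets in French mode
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            targets, max_length=max_target_length, truncation=True
        )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
```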
39
00:01:42,450 --> 00:01:43,470
Your inputs and targets
40
00:01:43,470 --> 00:01:46,080
are all sentences of various lengths.
41
00:01:46,080 --> 00:01:48,510
We will pad the inputs
and targets separately,
42
00:01:48,510 --> 00:01:50,460
as the maximum lengths
of the inputs and targets
43
00:01:50,460 --> 00:01:51,483
might be different.
44
00:01:52,620 --> 00:01:54,540
Then we pad the inputs with the pad token
45
00:01:54,540 --> 00:01:57,060
and the targets with the -100 index
46
00:01:57,060 --> 00:01:58,890
to make sure they're
not taken into account
47
00:01:58,890 --> 00:02:00,123
in the loss computation.
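To illustrate what that padding looks like, a hand-rolled sketch on a toy batch (the data collator shown next does this for you):

```python
# Toy token IDs, for illustration only
batch_inputs = [[54, 87, 2], [54, 21, 66, 9, 2]]
batch_labels = [[117, 4, 2], [117, 58, 9, 2]]

max_in = max(len(ids) for ids in batch_inputs)
max_lab = max(len(ids) for ids in batch_labels)

# Inputs are padded with the tokenizer's pad token
padded_inputs = [ids + [tokenizer.pad_token_id] * (max_in - len(ids))
                 for ids in batch_inputs]

# Labels are padded with -100, the index ignored by the
# cross-entropy loss, so padding never affects the loss
padded_labels = [ids + [-100] * (max_lab - len(ids))
                 for ids in batch_labels]
```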
48
00:02:01,320 --> 00:02:02,153
Once this is done,
49
00:02:02,153 --> 00:02:04,340
batching inputs and
targets becomes super easy.
50
00:02:05,670 --> 00:02:08,220
The Transformers library
provides us with a data collator
51
00:02:08,220 --> 00:02:10,500
to do this all automatically.
52
00:02:10,500 --> 00:02:13,800
You can then pass it to the
Trainer with your datasets
53
00:02:13,800 --> 00:02:15,960
or use it in the to_tf_dataset method
54
00:02:15,960 --> 00:02:18,560
before using model.fit()
on your Keras model.
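A sketch of that last step with DataCollatorForSeq2Seq; passing the model lets the collator also prepare the decoder inputs:

```python
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-fr")

# Pads inputs with the pad token and labels with -100, batch by batch
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# PyTorch: pass it to the Trainer through the data_collator argument.
# Keras: use it as the collate_fn of to_tf_dataset, for example:
# tf_train = tokenized_datasets["train"].to_tf_dataset(
#     columns=["input_ids", "attention_mask", "labels"],
#     collate_fn=data_collator, shuffle=True, batch_size=16,
# )
# then call model.fit(tf_train) on your Keras model.
```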
55
00:02:21,057 --> 00:02:23,724
(air whooshing)