subtitles/en/61_data-processing-for-summarization.srt
1
00:00:00,227 --> 00:00:01,359
(air whooshing)
2
00:00:01,359 --> 00:00:02,610
(smiley clicking)
3
00:00:02,610 --> 00:00:05,550
(air whooshing)
4
00:00:05,550 --> 00:00:08,450
- Let's see how to preprocess
a dataset for summarization.
5
00:00:09,750 --> 00:00:13,083
This is the task of, well,
summarizing a long document.
6
00:00:14,040 --> 00:00:16,830
This video will focus on how
to preprocess your dataset
7
00:00:16,830 --> 00:00:19,680
once you have managed to put
it in the following format:
8
00:00:19,680 --> 00:00:21,510
one column for the long documents,
9
00:00:21,510 --> 00:00:23,610
and one for the summaries.
10
00:00:23,610 --> 00:00:24,930
Here is how we can achieve this
11
00:00:24,930 --> 00:00:27,573
with the Datasets library
on the XSUM dataset.
12
00:00:28,650 --> 00:00:30,810
As long as you manage to have
your data look like this,
13
00:00:30,810 --> 00:00:33,690
you should be able to
follow the same steps.
14
00:00:33,690 --> 00:00:35,880
For once, our labels are not integers
15
00:00:35,880 --> 00:00:39,150
corresponding to some
classes, but plain text.
16
00:00:39,150 --> 00:00:42,480
We will thus need to tokenize
them, like our inputs.
17
00:00:42,480 --> 00:00:43,920
There is a small trap there though,
18
00:00:43,920 --> 00:00:45,360
as we need to tokenize our targets
19
00:00:45,360 --> 00:00:48,690
inside the as_target_tokenizer
context manager.
20
00:00:48,690 --> 00:00:51,030
This is because the special tokens we add
21
00:00:51,030 --> 00:00:54,000
might be slightly different
for the inputs and the target,
22
00:00:54,000 --> 00:00:57,300
so the tokenizer has to know
which one it is processing.
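A minimal sketch of that step, assuming a t5-small checkpoint for illustration (any seq2seq tokenizer works the same way); as_target_tokenizer is the context manager referred to here.

```python
from transformers import AutoTokenizer

# "t5-small" is just an example checkpoint; use the one matching your model.
tokenizer = AutoTokenizer.from_pretrained("t5-small")

model_inputs = tokenizer("A very long article to summarize...")

# Targets go through the same tokenizer, but inside this context manager so
# the special tokens added match what the model expects for its labels.
with tokenizer.as_target_tokenizer():
    labels = tokenizer("A short summary.")

model_inputs["labels"] = labels["input_ids"]
```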
23
00:00:57,300 --> 00:00:59,550
Processing the whole
dataset is then super easy
24
00:00:59,550 --> 00:01:01,290
with the map function.
25
00:01:01,290 --> 00:01:03,450
Since the summaries are
usually much shorter
26
00:01:03,450 --> 00:01:05,400
than the documents, you
should definitely pick
27
00:01:05,400 --> 00:01:08,880
different maximum lengths
for the inputs and targets.
28
00:01:08,880 --> 00:01:11,730
You can choose to pad at this
stage to that maximum length
29
00:01:11,730 --> 00:01:14,070
by setting padding=max_length.
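Putting those pieces together, a possible preprocessing function applied with map; the max_input_length and max_target_length values are arbitrary choices for illustration, and the commented-out padding argument shows the fixed-padding option before we switch to dynamic padding.

```python
# Illustrative maximum lengths; pick values suited to your model and data.
max_input_length = 512
max_target_length = 128

def preprocess_function(examples):
    model_inputs = tokenizer(
        examples["document"],
        max_length=max_input_length,
        truncation=True,
        # padding="max_length",  # uncomment to pad to max length here instead of dynamically
    )
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples["summary"],
            max_length=max_target_length,
            truncation=True,
        )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = raw_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)
```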
30
00:01:14,070 --> 00:01:16,170
Here we'll show you
how to pad dynamically,
31
00:01:16,170 --> 00:01:17,620
as it requires one more step.
32
00:01:18,840 --> 00:01:20,910
Your inputs and targets are all texts
33
00:01:20,910 --> 00:01:22,620
of various lengths.
34
00:01:22,620 --> 00:01:24,960
We'll pad the inputs
and targets separately
35
00:01:24,960 --> 00:01:27,030
as the maximum lengths
of the inputs and targets
36
00:01:27,030 --> 00:01:28,280
are completely different.
37
00:01:29,130 --> 00:01:31,170
Then, we pad the inputs
to the maximum length
38
00:01:31,170 --> 00:01:33,813
among the inputs, and the same for the targets.
39
00:01:34,860 --> 00:01:36,630
We pad the inputs with the pad token,
40
00:01:36,630 --> 00:01:39,000
and the targets with the -100 index
41
00:01:39,000 --> 00:01:40,980
to make sure they are
not taken into account
42
00:01:40,980 --> 00:01:42,180
in the loss computation.
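As a sketch of what that dynamic padding looks like by hand, here is a hypothetical helper (not part of any library) showing where the pad token and the -100 index go for one batch.

```python
def pad_batch(features, tokenizer):
    # Hypothetical helper: pad inputs with the pad token and labels with -100
    # so padded label positions are ignored by the loss.
    max_input_length = max(len(f["input_ids"]) for f in features)
    max_label_length = max(len(f["labels"]) for f in features)
    for f in features:
        input_padding = max_input_length - len(f["input_ids"])
        label_padding = max_label_length - len(f["labels"])
        f["input_ids"] = f["input_ids"] + [tokenizer.pad_token_id] * input_padding
        f["attention_mask"] = f["attention_mask"] + [0] * input_padding
        f["labels"] = f["labels"] + [-100] * label_padding
    return features
```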
43
00:01:43,440 --> 00:01:45,180
The Transformers library provides us
44
00:01:45,180 --> 00:01:48,510
with a data collator to
do this all automatically.
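That collator is DataCollatorForSeq2Seq; a minimal sketch, assuming a model loaded with AutoModelForSeq2SeqLM. It pads inputs with the tokenizer's pad token and labels with -100 by default.

```python
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# label_pad_token_id defaults to -100, so padded label positions are
# ignored in the loss; inputs are padded with the tokenizer's pad token.
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
```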
45
00:01:48,510 --> 00:01:51,690
You can then pass it to the
Trainer with your datasets,
46
00:01:51,690 --> 00:01:55,710
or use it in the to_tf_dataset
method before using model.fit
47
00:01:55,710 --> 00:01:56,823
on your current model.
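Two hedged sketches of those last steps: passing the collator to a Trainer, or wrapping the dataset with to_tf_dataset before model.fit. The training arguments and batch size are placeholders.

```python
# Option 1: PyTorch, pass the collator to the Trainer along with the datasets.
from transformers import Trainer, TrainingArguments

trainer = Trainer(
    model=model,
    args=TrainingArguments("xsum-summarization"),  # placeholder arguments
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)
trainer.train()

# Option 2: TensorFlow, use the collator in to_tf_dataset before model.fit.
# (For this path the model would be a TFAutoModelForSeq2SeqLM and the collator
# would be created with return_tensors="tf".)
tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
    columns=["input_ids", "attention_mask", "labels"],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=8,
)
# model.fit(tf_train_dataset, epochs=3)
```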
48
00:01:58,339 --> 00:02:01,520
(air whooshing)
49
00:02:01,520 --> 00:02:02,876
(air whooshing)