- Let's see how to preprocess a dataset for summarization. This is the task of, well, summarizing a long document.

This video will focus on how to preprocess your dataset once you have managed to put it in the following format: one column for the long documents, and one for the summaries. Here is how we can achieve this with the Datasets library on the XSUM dataset. As long as you manage to have your data look like this, you should be able to follow the same steps.

For once, our labels are not integers corresponding to some classes, but plain text. We will thus need to tokenize them, like our inputs. There is a small trap there though, as we need to tokenize our targets inside the as_target_tokenizer context manager. This is because the special tokens we add might be slightly different for the inputs and the targets, so the tokenizer has to know which one it is processing. Processing the whole dataset is then super easy with the map function.

Since the summaries are usually much shorter than the documents, you should definitely pick different maximum lengths for the inputs and targets. You can choose to pad at this stage to that maximum length by setting padding=max_length. Here we'll show you how to pad dynamically, as it requires one more step.

Your inputs and targets are all sentences of various lengths. We'll pad the inputs and targets separately, as their maximum lengths are completely different. We then pad the inputs to the longest length among the inputs, and do the same for the targets. We pad the inputs with the pad token, and the targets with the -100 index, to make sure they are not taken into account in the loss computation.

The Transformers library provides us with a data collator to do all of this automatically. You can then pass it to the Trainer with your datasets, or use it in the to_tf_dataset method before calling model.fit on your Keras model.
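Here is a minimal sketch of the preprocessing described above, using the Datasets and Transformers libraries on XSUM. The "t5-small" checkpoint and the two maximum lengths are illustrative assumptions, not values prescribed in the video:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

raw_datasets = load_dataset("xsum")  # columns: "document" (long text) and "summary"
tokenizer = AutoTokenizer.from_pretrained("t5-small")  # illustrative checkpoint

max_input_length = 512   # the documents are long
max_target_length = 64   # the summaries are much shorter


def preprocess_function(examples):
    # Tokenize the long documents (the inputs).
    model_inputs = tokenizer(
        examples["document"], max_length=max_input_length, truncation=True
    )
    # Tokenize the summaries (the targets) inside the context manager,
    # so the tokenizer knows it is processing targets and adds the right
    # special tokens.
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples["summary"], max_length=max_target_length, truncation=True
        )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


# Process the whole dataset with the map function.
tokenized_datasets = raw_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)
```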
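And here is a sketch of the dynamic-padding step, reusing the tokenizer, the tokenized datasets, and the "t5-small" checkpoint assumed in the previous snippet. DataCollatorForSeq2Seq pads the inputs with the pad token and the labels with -100; the output directory name and batch size are placeholders:

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Pads the inputs with the pad token and the labels with -100, so the padded
# label positions are ignored in the loss computation.
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# Option 1: pass the collator to the Trainer along with the datasets.
trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments("summarization-test"),  # placeholder output dir
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

# Option 2 (Keras): use the collator in to_tf_dataset before calling model.fit.
# With a TF model you would build the collator with return_tensors="tf".
# tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
#     columns=["input_ids", "attention_mask", "labels"],
#     collate_fn=data_collator,
#     shuffle=True,
#     batch_size=8,
# )
```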