In a lot of our examples, you're going to see data collators popping up over and over. They're used in both PyTorch and TensorFlow workflows, and maybe even in JAX, but no one really knows what's happening in JAX. We do have a research team working on it, though, so maybe they'll tell us soon.

But coming back on topic: what are data collators? Data collators collate data. That's not that helpful, so to be more specific: they put together a list of samples into a single training minibatch. For some tasks, the data collator can be very straightforward. For example, when you're doing sequence classification, all you really need from your data collator is to pad your samples to the same length and concatenate them into a single tensor. For other workflows, though, data collators can be quite complex, as they handle some of the preprocessing needed for that particular task.

So, if you want to use a data collator: for PyTorch users, you usually pass the data collator to your Trainer object. In TensorFlow, it's a bit different; the easiest way to use a data collator is to pass it to the to_tf_dataset method of your dataset. This will give you a tf.data.Dataset that you can then pass to model.fit. You'll see these approaches used in the examples and notebooks throughout this course.

Also note that all of our collators take a return_tensors argument. You can set this to "pt" to get PyTorch tensors, "tf" to get TensorFlow tensors, or "np" to get NumPy arrays. For backward compatibility reasons, the default value is "pt", so PyTorch users don't even have to set this argument most of the time. As a result, they're often totally unaware that this argument even exists. We can learn something from this: the beneficiaries of privilege are often the most blind to its existence. But okay, coming back.
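To make this concrete, here's a minimal sketch of both workflows. The checkpoint name, sample sentences, and dataset variables are just placeholder assumptions; DataCollatorWithPadding and to_tf_dataset are the real APIs being illustrated:

```python
from transformers import AutoTokenizer, DataCollatorWithPadding

# Any checkpoint works here; bert-base-cased is just an example
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Pads every sample in a batch to the length of the longest one;
# return_tensors="pt" is already the default, shown here for clarity
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")

# Two tokenized samples of different lengths
samples = [
    tokenizer("Short sentence."),
    tokenizer("A noticeably longer sentence that needs more tokens."),
]
batch = data_collator(samples)
print(batch["input_ids"].shape)  # one padded tensor, e.g. torch.Size([2, 12])

# PyTorch: pass the collator to Trainer(..., data_collator=data_collator).
# TensorFlow: pass it to your dataset instead, for example:
# tf_dataset = tokenized_dataset.to_tf_dataset(
#     columns=["input_ids", "attention_mask"],
#     label_cols=["labels"],
#     batch_size=16,
#     shuffle=True,
#     collate_fn=DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf"),
# )
# model.fit(tf_dataset)
```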
Let's see how some specific data collators work in action. Although, again, remember: if none of the built-in data collators do what you need, you can always write your own, and they're often quite short.

First, we'll see the "basic" data collators: DefaultDataCollator and DataCollatorWithPadding. These are the ones you should use if your labels are straightforward and your data doesn't need any special processing before being ready for training. Notice that because different models have different padding tokens, DataCollatorWithPadding needs your model's tokenizer so it knows how to pad sequences properly. The default data collator doesn't need a tokenizer to work, but as a result it will throw an error unless all of your sequences are already the same length, so you should be aware of that.

Moving on, though: a lot of the other data collators, aside from the basic two, are designed to handle one specific task, and I'm going to show a couple here. These are DataCollatorForTokenClassification and DataCollatorForSeq2Seq. The reason these tasks need special collators is that their labels are variable in length. In token classification there's one label for each token, so the length of the labels is the length of the sequence. In sequence-to-sequence tasks, the labels are a sequence of tokens of variable length, which can be very different from the length of the input sequence. So in both of these cases, we handle collating the batch by padding the labels as well, as you can see in the example below. Both the inputs and the labels need to be padded if we want to join samples of variable length into the same minibatch, and that's exactly what these data collators will do for us for these tasks.
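Here's a minimal sketch of that label padding, using DataCollatorForTokenClassification. The token and label ids are made up for illustration; the key point is just that the two samples have different lengths:

```python
from transformers import AutoTokenizer, DataCollatorForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Pads input_ids with the tokenizer's pad token, and labels with -100,
# the index that PyTorch's cross-entropy loss ignores by default
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

# Illustrative token/label ids; one label per token, so lengths vary
samples = [
    {"input_ids": [101, 1188, 102], "labels": [0, 1, 0]},
    {"input_ids": [101, 1188, 1110, 2039, 102], "labels": [0, 1, 2, 2, 0]},
]

batch = data_collator(samples)
print(batch["input_ids"])  # the shorter sample is padded to length 5
print(batch["labels"])     # its labels are padded with -100 to match
```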
There's one final data collator I want to show you in this lecture, and that's the DataCollatorForLanguageModeling. It's very important, firstly because language models are so foundational to everything we do with NLP these days, and secondly because it has two modes that do two very different things. You choose which mode you want with the mlm argument: set it to True for masked language modeling, and set it to False for causal language modeling.

Collating data for causal language modeling is actually quite straightforward. The model is just making predictions for what token comes next, so your labels are more or less just a copy of your inputs, and the collator will handle that and ensure that the inputs and labels are padded correctly.

When you set mlm to True, though, you get quite different behavior, different from any other data collator, because setting mlm to True means masked language modeling, and that means the inputs need to be masked. So, what does that look like?

Recall that in masked language modeling, the model is not predicting the next word; instead, we randomly mask out some tokens, and the model predicts all of them at once, trying to fill in the blanks for those masked tokens. But the process of random masking is surprisingly complex. If we follow the protocol from the original BERT paper, we need to replace some tokens with the mask token, replace some other tokens with a random token, and keep a third set of tokens unchanged. This is not the lecture to go into the specifics of that, or why we do it; you can always check out the original BERT paper if you're curious. It's well written and easy to understand. The main thing to know here is that it can be a real pain, and quite complex, to implement that yourself, but DataCollatorForLanguageModeling will do it for you when you set mlm to True. That's an example of the more intricate preprocessing that some of our data collators do.
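Here's a minimal sketch of both modes, assuming a BERT-style checkpoint; the sample sentence is a placeholder, and mlm_probability=0.15 is just the library's default spelled out:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Masked language modeling: masks tokens at random following the BERT
# recipe (mostly [MASK], sometimes a random token, sometimes unchanged)
mlm_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# Causal language modeling: labels are simply a copy of the inputs
clm_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

samples = [tokenizer("Data collators collate data.")]

batch = mlm_collator(samples)
print(batch["input_ids"])  # some token ids replaced by [MASK] or random ids
print(batch["labels"])     # original ids at masked positions, -100 elsewhere
```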
And that's it! So, this covers the most commonly used data collators and the tasks they're used for. Hopefully, you now know when to use data collators, and which one to choose for your specific task.