1
00:00:00,212 --> 00:00:02,879
(air whooshing)
2
00:00:04,680 --> 00:00:08,130
- Some bugs in your code
are very straightforward.
3
00:00:08,130 --> 00:00:11,580
You try running it, you get
a syntax error somewhere,
4
00:00:11,580 --> 00:00:14,490
Python tells you exactly
where, and you fix it.
5
00:00:14,490 --> 00:00:17,760
This is great, it's simple
and it's satisfying.
6
00:00:17,760 --> 00:00:20,310
Sometimes, though, things crash
7
00:00:20,310 --> 00:00:23,670
and the error is impossible to understand.
8
00:00:23,670 --> 00:00:26,700
This happens a lot in machine
learning for a few reasons,
9
00:00:26,700 --> 00:00:29,310
you're working with big data structures,
10
00:00:29,310 --> 00:00:31,440
you're using these big, complex libraries
11
00:00:31,440 --> 00:00:33,420
with a lot of moving parts,
12
00:00:33,420 --> 00:00:35,310
and also you're doing
a lot of GPU computing,
13
00:00:35,310 --> 00:00:38,490
and that in general is much
more difficult to debug.
14
00:00:38,490 --> 00:00:40,260
In Keras there's the additional problem
15
00:00:40,260 --> 00:00:43,140
that your models are often
compiled before execution,
16
00:00:43,140 --> 00:00:44,400
which is great for performance
17
00:00:44,400 --> 00:00:47,430
but it makes debugging them
very difficult as well.
18
00:00:47,430 --> 00:00:50,370
So, this is going to be
a video about what to do
19
00:00:50,370 --> 00:00:52,410
when you run into one
of those nightmare bugs
20
00:00:52,410 --> 00:00:55,210
and you just have no idea
where to begin with fixing it.
21
00:00:56,370 --> 00:00:58,920
So, to give you some intuitions for
22
00:00:58,920 --> 00:01:01,530
the most common things that go wrong
23
00:01:01,530 --> 00:01:03,573
and cause these weird issues,
24
00:01:04,800 --> 00:01:07,530
and show you where to look
for the sources of bugs
25
00:01:07,530 --> 00:01:10,560
that you encounter, let's
use this example script.
26
00:01:10,560 --> 00:01:12,900
So, I'll show it to you here in two parts.
27
00:01:12,900 --> 00:01:16,410
First, we do all our
imports, we load a dataset,
28
00:01:16,410 --> 00:01:20,280
we create our tokenizer and
we tokenize the dataset.
29
00:01:20,280 --> 00:01:23,640
Next, we convert our datasets
to TensorFlow datasets,
30
00:01:23,640 --> 00:01:26,100
so that's tf.data.Dataset,
31
00:01:26,100 --> 00:01:28,500
and that's so that we can run fit on them,
32
00:01:28,500 --> 00:01:31,170
and then we load our model
from a pretrained checkpoint,
33
00:01:31,170 --> 00:01:33,870
we compile it and we fit
it with those datasets.
34
00:01:33,870 --> 00:01:35,970
So, this seems straightforward enough,
35
00:01:35,970 --> 00:01:38,220
it's similar to what we've
done in the course before.
36
00:01:38,220 --> 00:01:40,650
But beware, this is spooky code
37
00:01:40,650 --> 00:01:43,590
and hides many dark
and mysterious secrets.
38
00:01:43,590 --> 00:01:46,050
So, what happens when we run it?
39
00:01:46,050 --> 00:01:48,840
Well, it's not great.
40
00:01:48,840 --> 00:01:52,320
So, we get this error message,
but what does it mean?
41
00:01:52,320 --> 00:01:55,470
We tried to train on our
data, but we got no gradient?
42
00:01:55,470 --> 00:01:59,130
It's pretty perplexing, I
mean, how do we even begin
43
00:01:59,130 --> 00:02:01,500
to debug not getting a gradient?
44
00:02:01,500 --> 00:02:03,930
So, when the error you get
doesn't immediately suggest
45
00:02:03,930 --> 00:02:06,630
where the problem is, the best solution
46
00:02:06,630 --> 00:02:09,180
is often to walk through
things in sequence,
47
00:02:09,180 --> 00:02:12,900
making sure at each stage
that the outputs look right,
48
00:02:12,900 --> 00:02:15,300
that everything looks okay at that point.
49
00:02:15,300 --> 00:02:17,730
And, of course, that
means the place to start
50
00:02:17,730 --> 00:02:19,473
is always to check your data.
51
00:02:20,670 --> 00:02:22,050
So, the best way to make sure
52
00:02:22,050 --> 00:02:24,480
that the data you're
giving the model is good,
53
00:02:24,480 --> 00:02:27,690
is to grab a batch from
the tf.data.Dataset
54
00:02:27,690 --> 00:02:29,520
that your model is training on,
55
00:02:29,520 --> 00:02:31,560
and that's because it's right at the end
56
00:02:31,560 --> 00:02:33,990
of the data pipeline.
57
00:02:33,990 --> 00:02:36,990
And so that means that if
those outputs are good,
58
00:02:36,990 --> 00:02:39,990
you're guaranteed that your
data pipeline is working well.
59
00:02:39,990 --> 00:02:42,600
So, we can do that by
looping over the dataset
60
00:02:42,600 --> 00:02:44,790
for one iteration and then breaking,
61
00:02:44,790 --> 00:02:46,980
and that gives us a single batch.
62
00:02:46,980 --> 00:02:49,443
So, what do we get when
we inspect that batch?
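In code, the loop-and-break pattern looks roughly like this; here a plain list of dicts stands in for the real tf.data.Dataset, so the sketch runs without TensorFlow, and the field values are made up for illustration:

```python
# Stand-in for the tf.data.Dataset the model trains on;
# each element is a dict of model inputs (illustrative values).
train_dataset = [
    {"input_ids": [101, 2023, 102], "attention_mask": [1, 1, 1], "labels": 2},
    {"input_ids": [101, 2045, 102], "attention_mask": [1, 1, 1], "labels": 0},
]

# Loop for one iteration, then break: that grabs a single batch.
for batch in train_dataset:
    break

# Inspect what the model will actually receive.
print(sorted(batch.keys()))  # -> ['attention_mask', 'input_ids', 'labels']
```

Note that "labels" here is a key inside the input dict, not a separate tensor passed to Keras, which is exactly the situation described next.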
63
00:02:50,460 --> 00:02:52,500
We'll see that we're
not getting any gradient
64
00:02:52,500 --> 00:02:55,530
because we're not passing labels to Keras.
65
00:02:55,530 --> 00:02:57,510
So, our labels are in the batch,
66
00:02:57,510 --> 00:02:59,670
but they're a key in the input dictionary
67
00:02:59,670 --> 00:03:02,340
and they're not a separate
label as Keras expects,
68
00:03:02,340 --> 00:03:04,830
so this is one of the most
common issues you'll encounter
69
00:03:04,830 --> 00:03:07,590
when training Transformers
models with TensorFlow.
70
00:03:07,590 --> 00:03:10,980
Our models can all
compute loss internally,
71
00:03:10,980 --> 00:03:13,140
but to use that loss for training
72
00:03:13,140 --> 00:03:15,960
the labels need to be passed
in the input dictionary,
73
00:03:15,960 --> 00:03:17,940
where the model can see them.
74
00:03:17,940 --> 00:03:20,280
This internal loss is the loss that we use
75
00:03:20,280 --> 00:03:23,760
when we don't specify a
loss when we call compile,
76
00:03:23,760 --> 00:03:25,660
that is, when compile gets no loss argument.
77
00:03:26,520 --> 00:03:27,870
So, Keras, on the other hand,
78
00:03:27,870 --> 00:03:30,570
usually expects labels
to be passed separately
79
00:03:30,570 --> 00:03:32,130
from the input dictionary,
80
00:03:32,130 --> 00:03:34,110
and not to be visible to the model,
81
00:03:34,110 --> 00:03:36,600
and loss computations will usually fail
82
00:03:36,600 --> 00:03:38,220
if you don't do that.
83
00:03:38,220 --> 00:03:40,380
So we need to choose one or the other,
84
00:03:40,380 --> 00:03:42,780
either we use the model's internal loss
85
00:03:42,780 --> 00:03:44,940
and keep the labels where they are,
86
00:03:44,940 --> 00:03:46,980
or we keep using Keras losses
87
00:03:46,980 --> 00:03:50,520
but we move the labels to
the place Keras expects them.
88
00:03:50,520 --> 00:03:53,310
So, for simplicity here,
let's fix this issue
89
00:03:53,310 --> 00:03:55,860
by using the model's internal losses,
90
00:03:55,860 --> 00:03:57,900
and we do that by
removing the loss argument
91
00:03:57,900 --> 00:03:59,343
from the call to compile.
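As a configuration sketch (the checkpoint name here is just a placeholder, not the one from the video), using the internal loss means calling compile with no loss argument at all:

```python
from transformers import TFAutoModelForSequenceClassification

# Placeholder checkpoint for illustration.
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased")

# No `loss` argument: the model's internal loss is used,
# computed from the "labels" key inside the input dict.
model.compile(optimizer="adam")
```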
92
00:04:00,540 --> 00:04:03,000
So, what happens if we try training now?
93
00:04:03,000 --> 00:04:08,000
So we recompile with that, we
call model.fit, what happens?
94
00:04:08,220 --> 00:04:13,050
Well, it runs this time but
now we get a loss of NaN.
95
00:04:13,050 --> 00:04:16,440
So, that's not good,
NaN means not a number
96
00:04:16,440 --> 00:04:19,140
and it's not a good
loss to have in general.
97
00:04:19,140 --> 00:04:21,000
In fact, if we inspect our model now,
98
00:04:21,000 --> 00:04:23,970
we'll see that not only
are all the outputs NaN,
99
00:04:23,970 --> 00:04:27,600
all the weights and the loss
are NaN as well.
100
00:04:27,600 --> 00:04:30,810
So once a single NaN creeps
into your computations,
101
00:04:30,810 --> 00:04:34,530
it tends to spread, because
it propagates from the loss
102
00:04:34,530 --> 00:04:36,420
and once it's in the loss
103
00:04:36,420 --> 00:04:37,530
it gets to the gradient,
104
00:04:37,530 --> 00:04:38,910
and then once it's in the gradient
105
00:04:38,910 --> 00:04:41,280
it enters the weight updates,
106
00:04:41,280 --> 00:04:43,980
and then all your weight
updates end up as NaN as well.
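This spread is easy to reproduce in miniature: any arithmetic involving NaN yields NaN, so one bad loss value poisons everything downstream. A toy gradient-descent step, with made-up numbers:

```python
import math

loss = float("nan")                 # one bad sample makes the batch loss NaN
gradient = loss * 2.0               # any arithmetic with NaN gives NaN
weight = 0.5
weight = weight - 0.01 * gradient   # the weight update is now NaN too

print(math.isnan(gradient), math.isnan(weight))  # -> True True
```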
107
00:04:43,980 --> 00:04:46,950
So NaN just completely
destroyed our model here,
108
00:04:46,950 --> 00:04:49,560
but where did it creep in first?
109
00:04:49,560 --> 00:04:52,140
So to find out, we need to go back to a point
110
00:04:52,140 --> 00:04:53,490
before the model was destroyed,
111
00:04:53,490 --> 00:04:55,440
we need to re-initialize the model
112
00:04:55,440 --> 00:04:58,590
and look at the outputs
for just the first batch.
113
00:04:58,590 --> 00:04:59,850
And when we do that,
114
00:04:59,850 --> 00:05:02,790
we see that NaN first appears in the loss,
115
00:05:02,790 --> 00:05:04,980
but only in some samples.
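One way to narrow this down, sketched with plain Python lists standing in for the per-sample loss tensor and the batch labels (all values here are hypothetical):

```python
import math

# Hypothetical per-sample losses and labels from the first batch.
per_sample_loss = [0.69, float("nan"), 0.71, float("nan")]
labels = [0, 2, 1, 2]

# Indices of the samples whose loss is NaN...
nan_indices = [i for i, l in enumerate(per_sample_loss) if math.isnan(l)]
# ...and the labels of just those samples.
nan_labels = [labels[i] for i in nan_indices]

print(nan_indices)  # -> [1, 3]
print(nan_labels)   # -> [2, 2]
```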
116
00:05:04,980 --> 00:05:06,540
So you can see this in more detail
117
00:05:06,540 --> 00:05:09,090
in the accompanying section
of the course notes,
118
00:05:09,090 --> 00:05:11,220
I am moving fairly quickly here,
119
00:05:11,220 --> 00:05:13,500
but we find that if we look at the labels,
120
00:05:13,500 --> 00:05:17,790
the samples with a loss of
NaN all have a label of two.
121
00:05:17,790 --> 00:05:19,950
So this gives us a very strong clue,
122
00:05:19,950 --> 00:05:24,060
if we check the model with
model.config.num_labels,
123
00:05:24,060 --> 00:05:26,760
we see that the model thinks
there's only two labels,
124
00:05:26,760 --> 00:05:28,950
but if we see a value of two,
125
00:05:28,950 --> 00:05:31,200
that means there's at least three labels
126
00:05:31,200 --> 00:05:33,630
because 0 is a label as well.
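The same sanity check in miniature: since labels are 0-indexed, the data needs at least max(label) + 1 classes (the label values here are made up for illustration):

```python
labels = [0, 1, 2, 1, 0, 2]  # hypothetical label column from the dataset

num_labels_in_data = max(labels) + 1  # labels are 0-indexed
model_num_labels = 2                  # what model.config.num_labels reported

print(num_labels_in_data)  # -> 3
# Any label >= model_num_labels is "impossible" for the model,
# and produces a NaN loss; the fix is to pass num_labels=3
# when loading the model with from_pretrained.
```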
127
00:05:33,630 --> 00:05:35,070
So we got a loss of NaN
128
00:05:35,070 --> 00:05:37,887
because we got an "impossible"
label in our label set,
129
00:05:37,887 --> 00:05:41,010
and to fix that we need to
go back and set the model
130
00:05:41,010 --> 00:05:43,650
to expect the right number of labels,
131
00:05:43,650 --> 00:05:45,870
so we can set num_labels=3
132
00:05:45,870 --> 00:05:48,540
when we initialize the
model with from_pretrained,
133
00:05:48,540 --> 00:05:51,450
and now hopefully we can avoid this issue.
134
00:05:51,450 --> 00:05:54,660
So, now we think our data is
good and our model is good
135
00:05:54,660 --> 00:05:56,220
and so training should work
136
00:05:56,220 --> 00:06:00,510
but if we try running
model.fit, we, well...
137
00:06:00,510 --> 00:06:02,040
I mean, we do get a loss,
138
00:06:02,040 --> 00:06:03,930
it is a number and it is going down
139
00:06:03,930 --> 00:06:06,090
but it's not going down very quickly
140
00:06:06,090 --> 00:06:07,770
and if we keep running this out,
141
00:06:07,770 --> 00:06:10,980
we'll find that it stalls
at a fairly high loss value.
142
00:06:10,980 --> 00:06:12,450
So, what's going on?
143
00:06:12,450 --> 00:06:14,130
Well, when things are mostly working,
144
00:06:14,130 --> 00:06:16,620
but training is just slow or a bit odd,
145
00:06:16,620 --> 00:06:19,470
that can often be a good time
to look at your optimizer
146
00:06:19,470 --> 00:06:22,020
and your training hyperparameters.
147
00:06:22,020 --> 00:06:23,460
And this is where I want to mention
148
00:06:23,460 --> 00:06:25,320
one of the most common sources of issues
149
00:06:25,320 --> 00:06:27,000
when you're working with Keras,
150
00:06:27,000 --> 00:06:30,870
you can specify things like
optimizers with strings,
151
00:06:30,870 --> 00:06:33,180
so Keras supports that
and it's very convenient,
152
00:06:33,180 --> 00:06:35,460
but if you do that all of the options
153
00:06:35,460 --> 00:06:38,400
get silently set to their default values.
154
00:06:38,400 --> 00:06:41,190
So we specified our optimizer as Adam,
155
00:06:41,190 --> 00:06:43,110
but in the process we invisibly got
156
00:06:43,110 --> 00:06:46,260
the default learning rate, which is 1e-3,
157
00:06:46,260 --> 00:06:48,630
or 10 to the power of -3.
158
00:06:48,630 --> 00:06:50,550
So this learning rate is way too high
159
00:06:50,550 --> 00:06:52,530
for training transformer models,
160
00:06:52,530 --> 00:06:55,620
we should go back and specify
the learning rate directly,
161
00:06:55,620 --> 00:06:57,060
not using a string.
162
00:06:57,060 --> 00:07:01,290
So, good values here are
between 1e-5 and 1e-4
163
00:07:01,290 --> 00:07:04,233
so let's split the
difference and pick 5e-5.
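As a configuration sketch (again with a placeholder checkpoint), the fix is to construct the optimizer object yourself instead of passing the string "adam":

```python
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

# Placeholder checkpoint; num_labels=3 matches the earlier fix.
model = TFAutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=3
)

# An optimizer object lets us override the default 1e-3 learning rate.
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer)  # still no loss arg: internal loss is used
```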
164
00:07:05,310 --> 00:07:06,990
So if you recompile with that,
165
00:07:06,990 --> 00:07:09,840
you'll find that training
actually works, at last.
166
00:07:09,840 --> 00:07:11,700
The loss goes down efficiently
167
00:07:11,700 --> 00:07:14,070
and it converges to a lower value.
168
00:07:14,070 --> 00:07:16,410
So, again, I did go
through this quite quickly
169
00:07:16,410 --> 00:07:18,720
and I strongly recommend
checking out the course notes
170
00:07:18,720 --> 00:07:20,040
to see this in more detail,
171
00:07:20,040 --> 00:07:21,600
and to experiment with the code yourself
172
00:07:21,600 --> 00:07:23,490
and see what the errors look like
173
00:07:23,490 --> 00:07:25,380
and how you can approach them,
174
00:07:25,380 --> 00:07:27,930
but I hope I've given
you here a quick summary
175
00:07:27,930 --> 00:07:30,510
of the most common bugs
176
00:07:30,510 --> 00:07:32,880
and maybe the most common
debugging approaches
177
00:07:32,880 --> 00:07:33,960
to dealing with them.
178
00:07:33,960 --> 00:07:37,020
So, good luck, and remember
to take plenty of breaks
179
00:07:37,020 --> 00:07:38,970
if your code is giving you a hard time.
180
00:07:39,805 --> 00:07:42,472
(air whooshing)