subtitles/en/73_debugging-the-training-pipeline-(pytorch).srt
1
00:00:06,210 --> 00:00:08,760
- In this video, we will
see how to debug an error
2
00:00:08,760 --> 00:00:11,896
you encounter when running Trainer.train
3
00:00:11,896 --> 00:00:15,066
As an example, we will use
this script that fine-tunes
4
00:00:15,066 --> 00:00:17,760
a BERT model on the GLUE MNLI dataset.
5
00:00:17,760 --> 00:00:19,470
Check out the videos linked below
6
00:00:19,470 --> 00:00:21,840
to see how we came to such a script.
7
00:00:21,840 --> 00:00:24,540
Here we want to learn how
to debug the problems in it.
8
00:00:25,470 --> 00:00:28,110
Running the script gives
us an error pretty quickly.
9
00:00:28,110 --> 00:00:29,040
It happens at the line
10
00:00:29,040 --> 00:00:30,990
where we feed the inputs to the model,
11
00:00:30,990 --> 00:00:32,850
according to the traceback.
12
00:00:32,850 --> 00:00:34,702
That tells us there is a problem there,
13
00:00:34,702 --> 00:00:37,881
but the problem could have
many different causes.
14
00:00:37,881 --> 00:00:39,330
To debug an error in a training run,
15
00:00:39,330 --> 00:00:41,760
you need to make sure each
step of the training pipeline
16
00:00:41,760 --> 00:00:43,440
works as intended.
17
00:00:43,440 --> 00:00:45,780
This means checking that
the inputs of your dataset
18
00:00:45,780 --> 00:00:47,040
are correct,
19
00:00:47,040 --> 00:00:48,720
you can batch them together,
20
00:00:48,720 --> 00:00:50,790
feed them through the model to get a loss,
21
00:00:50,790 --> 00:00:52,500
then compute the gradients of that loss
22
00:00:52,500 --> 00:00:54,303
before performing an optimizer step.
23
00:00:55,470 --> 00:00:57,810
So let's start by looking
at the training dataset
24
00:00:57,810 --> 00:00:59,043
this Trainer is using.
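A minimal sketch of that check, assuming the Trainer object from the script is called trainer:

    # Look at the first sample the Trainer will actually train on.
    # A correctly preprocessed sample contains input_ids, attention_mask
    # and label, not raw text columns.
    print(trainer.train_dataset[0])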
25
00:00:59,910 --> 00:01:02,190
There is definitely a problem here.
26
00:01:02,190 --> 00:01:04,293
We see texts and not numbers.
27
00:01:05,130 --> 00:01:06,660
The error message was telling us the model
28
00:01:06,660 --> 00:01:08,220
did not get input IDs
29
00:01:08,220 --> 00:01:11,100
and indeed we do not
have those in the dataset.
30
00:01:11,100 --> 00:01:12,660
Looking back at our code,
31
00:01:12,660 --> 00:01:14,400
we can see we made a mistake
32
00:01:14,400 --> 00:01:17,400
and passed the wrong
datasets to the Trainer.
33
00:01:17,400 --> 00:01:19,173
So let's fix that and run again.
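A sketch of that fix, assuming the script stores the preprocessed data in tokenized_datasets, names its model and training arguments model and args, and was mistakenly given the raw text datasets:

    from transformers import Trainer

    # Pass the tokenized datasets, not the raw text ones.
    trainer = Trainer(
        model,
        args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["validation_matched"],
    )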
34
00:01:20,490 --> 00:01:21,840
Now we have a new error.
35
00:01:21,840 --> 00:01:23,130
Inspecting the traceback
36
00:01:23,130 --> 00:01:25,860
tells us it happens when
we try to create a batch,
37
00:01:25,860 --> 00:01:28,743
specifically to group
the features in a tensor.
38
00:01:29,700 --> 00:01:32,610
We can confirm this by asking
the Trainer to get us a batch
39
00:01:32,610 --> 00:01:34,230
of the training data loader,
40
00:01:34,230 --> 00:01:35,913
which reproduces the same error.
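A sketch of that check; get_train_dataloader is the Trainer method that builds the training dataloader:

    # Grab one batch the way the Trainer would; this raises the same
    # error because the unpadded features cannot be stacked into a tensor.
    for batch in trainer.get_train_dataloader():
        break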
41
00:01:36,780 --> 00:01:39,064
Either by inspecting
the inputs or debugging,
42
00:01:39,064 --> 00:01:42,870
we can then see they are
not all of the same size.
43
00:01:42,870 --> 00:01:45,120
This is because we have not
passed the Trainer a data collator
44
00:01:45,120 --> 00:01:46,890
to do the padding,
45
00:01:46,890 --> 00:01:49,443
and didn't pad when
preprocessing the data either.
46
00:01:50,430 --> 00:01:52,710
Padding inside the Trainer
is normally the default,
47
00:01:52,710 --> 00:01:55,380
but only if you provide your
tokenizer to the Trainer,
48
00:01:55,380 --> 00:01:57,270
and we forgot to do that.
49
00:01:57,270 --> 00:01:59,120
So let's fix the issue and run again.
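A sketch of that fix, assuming the tokenizer object is called tokenizer; with a tokenizer provided, the Trainer defaults to a data collator that pads each batch (passing DataCollatorWithPadding explicitly would also work):

    trainer = Trainer(
        model,
        args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["validation_matched"],
        tokenizer=tokenizer,  # enables dynamic padding when batching
    )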
50
00:02:00,510 --> 00:02:02,883
This time we get a nasty CUDA error.
51
00:02:03,765 --> 00:02:06,285
They are very difficult
to debug because for one,
52
00:02:06,285 --> 00:02:10,530
they put your kernel in a
state that is not recoverable
53
00:02:10,530 --> 00:02:13,260
so you have to restart your
notebook from the beginning
54
00:02:13,260 --> 00:02:16,950
and two, the traceback is
completely useless for those.
55
00:02:16,950 --> 00:02:19,230
Here the traceback tells
us the error happens
56
00:02:19,230 --> 00:02:22,500
when we do the gradient
computation with loss.backward,
57
00:02:22,500 --> 00:02:25,113
but as we will see later on,
that is not the case.
58
00:02:26,520 --> 00:02:28,920
This is because everything
that happens on the GPU
59
00:02:28,920 --> 00:02:30,720
is done asynchronously.
60
00:02:30,720 --> 00:02:32,880
When you execute the model call,
61
00:02:32,880 --> 00:02:34,457
what the program does
is just stacking that
62
00:02:34,457 --> 00:02:36,600
in the queue of the GPU,
63
00:02:36,600 --> 00:02:39,856
then if the GPU doesn't
have any current job to do,
64
00:02:39,856 --> 00:02:41,850
the work will start on
the GPU at the same time
65
00:02:41,850 --> 00:02:45,000
as the CPU moves to the next instruction.
66
00:02:45,000 --> 00:02:47,040
Continuing with the
extraction of the loss,
67
00:02:47,040 --> 00:02:49,170
this is stacked into the GPU queue
68
00:02:49,170 --> 00:02:51,953
while the CPU moves to the
instruction loss.backward.
69
00:02:51,953 --> 00:02:54,180
But the GPU still hasn't finished
70
00:02:54,180 --> 00:02:55,710
the forward pass of the model
71
00:02:55,710 --> 00:02:57,603
since all that took no time at all.
72
00:02:58,440 --> 00:03:00,210
The CPU stops moving forward,
73
00:03:00,210 --> 00:03:03,240
because loss.backward is an
instruction telling it to wait
74
00:03:03,240 --> 00:03:04,830
for the GPU to be finished,
75
00:03:04,830 --> 00:03:06,780
to make sure the gradients are correct.
76
00:03:07,650 --> 00:03:09,570
When the GPU encounters an error,
77
00:03:09,570 --> 00:03:13,140
it gives it back to the
CPU with a cryptic message
78
00:03:13,140 --> 00:03:15,423
which raises the error at the wrong place.
79
00:03:16,350 --> 00:03:18,720
So to debug this, we will
need to execute the next steps
80
00:03:18,720 --> 00:03:21,211
of the training pipeline on the CPU.
81
00:03:21,211 --> 00:03:22,380
It is very easy to do,
82
00:03:22,380 --> 00:03:25,350
and we get a traceback
we can trust this time.
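A minimal sketch of that CPU run, after restarting the notebook and rebuilding the Trainer:

    # Grab one batch (it comes out of the dataloader on the CPU)
    # and run the forward pass with the model on the CPU as well;
    # the traceback now points at the operation that really fails.
    for batch in trainer.get_train_dataloader():
        break
    outputs = trainer.model.cpu()(**batch)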
83
00:03:25,350 --> 00:03:26,520
As we said before,
84
00:03:26,520 --> 00:03:28,620
the error actually happens
during the forward pass
85
00:03:28,620 --> 00:03:29,453
of the model,
86
00:03:29,453 --> 00:03:30,993
and not loss.backward.
87
00:03:31,920 --> 00:03:33,680
It's an index error.
88
00:03:33,680 --> 00:03:34,950
With a bit of debugging,
89
00:03:34,950 --> 00:03:37,410
we see we have labels ranging from 0 to 2,
90
00:03:37,410 --> 00:03:39,000
so three different values,
91
00:03:39,000 --> 00:03:42,191
but our outputs have a
shape of batch size by 2.
92
00:03:42,191 --> 00:03:45,600
It looks like our model has
the wrong number of labels.
93
00:03:45,600 --> 00:03:47,190
We can indeed confirm that,
94
00:03:47,190 --> 00:03:49,860
and now that we know, it's
easy to fix in the code
95
00:03:49,860 --> 00:03:53,969
by adding num_labels=3
when we create the model.
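A sketch of that fix, assuming the checkpoint name is stored in checkpoint:

    from transformers import AutoModelForSequenceClassification

    # MNLI has three classes (entailment, neutral, contradiction),
    # so the classification head needs three outputs.
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=3
    )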
96
00:03:53,969 --> 00:03:56,883
Now the training script
will run to completion.
97
00:03:58,440 --> 00:03:59,430
We did not need it here,
98
00:03:59,430 --> 00:04:00,960
but here is how we would
debug the next steps
99
00:04:00,960 --> 00:04:02,944
of the pipeline, gradient computation,
100
00:04:02,944 --> 00:04:05,850
as well as the optimizer step.
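A sketch of those two checks, reusing the CPU batch and outputs from above:

    # Gradient computation: backward through the loss of one batch.
    loss = outputs.loss
    loss.backward()

    # Optimizer step: let the Trainer build its optimizer, then step once.
    trainer.create_optimizer()
    trainer.optimizer.step()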
101
00:04:05,850 --> 00:04:08,823
With all of this, good luck
debugging your own training runs!