1
00:00:00,298 --> 00:00:01,511
(air whooshing)
2
00:00:01,511 --> 00:00:02,769
(smiley face popping)
3
00:00:02,769 --> 00:00:05,460
(air whooshing)
4
00:00:05,460 --> 00:00:08,486
- Write your own training
loop with PyTorch.
5
00:00:08,486 --> 00:00:09,960
In this video, we'll look at
6
00:00:09,960 --> 00:00:12,750
how we can do the same fine-tuning
as in the Trainer video,
7
00:00:12,750 --> 00:00:14,760
but without relying on that class.
8
00:00:14,760 --> 00:00:17,790
This way, you'll be able to
easily customize each step
9
00:00:17,790 --> 00:00:20,310
of the training loop to your needs.
10
00:00:20,310 --> 00:00:21,660
This is also very useful
11
00:00:21,660 --> 00:00:22,740
to manually debug something
12
00:00:22,740 --> 00:00:24,590
that went wrong with the Trainer API.
13
00:00:26,220 --> 00:00:28,020
Before we dive into the code,
14
00:00:28,020 --> 00:00:30,481
here is a sketch of a training loop.
15
00:00:30,481 --> 00:00:33,381
We take a batch of training
data and feed it to the model.
16
00:00:34,223 --> 00:00:36,960
With the labels, we can
then compute a loss.
17
00:00:36,960 --> 00:00:39,316
That number is not useful on its own;
18
00:00:39,316 --> 00:00:40,260
it is used to compute
19
00:00:40,260 --> 00:00:42,150
the gradients of our model weights,
20
00:00:42,150 --> 00:00:43,440
that is the derivative of the loss
21
00:00:44,610 --> 00:00:47,160
with respect to each model weight.
22
00:00:47,160 --> 00:00:49,800
Those gradients are then
used by the optimizer
23
00:00:49,800 --> 00:00:51,210
to update the model weights,
24
00:00:51,210 --> 00:00:53,550
and make them a little bit better.
25
00:00:53,550 --> 00:00:54,510
We then repeat the process
26
00:00:54,510 --> 00:00:56,880
with a new batch of training data.
27
00:00:56,880 --> 00:00:58,620
If any of this isn't clear,
28
00:00:58,620 --> 00:01:00,270
don't hesitate to take a refresher
29
00:01:00,270 --> 00:01:02,170
on your favorite deep learning course.
30
00:01:03,210 --> 00:01:06,000
We'll use the GLUE MRPC
data set here again,
31
00:01:06,000 --> 00:01:07,680
and we've seen how to preprocess the data
32
00:01:07,680 --> 00:01:11,130
using the Datasets library
with dynamic padding.
33
00:01:11,130 --> 00:01:12,630
Check out the videos linked below
34
00:01:12,630 --> 00:01:14,280
if you haven't seen them already.
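As a reminder, the preprocessing from those videos looks roughly like this (a sketch assuming the GLUE MRPC dataset and the bert-base-uncased checkpoint used in the course):

from datasets import load_dataset
from transformers import AutoTokenizer

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(examples):
    # Tokenize the sentence pairs; padding is deferred to the data collator.
    return tokenizer(examples["sentence1"], examples["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
# Keep only the columns the model expects and rename the label column.
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")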
35
00:01:15,480 --> 00:01:18,930
With this done, we only have
to define PyTorch DataLoaders
36
00:01:18,930 --> 00:01:20,610
which will be responsible for converting
37
00:01:20,610 --> 00:01:23,253
the elements of our dataset into batches.
38
00:01:24,450 --> 00:01:27,960
We use our DataCollatorWithPadding
as a collate function,
39
00:01:27,960 --> 00:01:29,460
and shuffle the training set
40
00:01:29,460 --> 00:01:31,080
to make sure we don't go over the samples
41
00:01:31,080 --> 00:01:33,870
in the same order at each epoch.
42
00:01:33,870 --> 00:01:36,390
To check that everything
works as intended,
43
00:01:36,390 --> 00:01:38,883
we try to grab a batch
of data, and inspect it.
44
00:01:40,080 --> 00:01:43,050
Like our data set elements,
it's a dictionary,
45
00:01:43,050 --> 00:01:46,260
but this time the values are
not a single list of integers
46
00:01:46,260 --> 00:01:49,053
but a tensor of shape batch
size by sequence length.
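Here is a minimal sketch of those DataLoaders, reusing the tokenized_datasets and tokenizer from the preprocessing sketch above; the batch size of 8 is just an assumption:

from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)

# Grab one batch and inspect it: each value is a tensor of shape
# (batch_size, sequence_length).
batch = next(iter(train_dataloader))
print({k: v.shape for k, v in batch.items()})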
47
00:01:50,460 --> 00:01:53,580
The next step is to send the
training data to our model.
48
00:01:53,580 --> 00:01:56,730
For that, we'll need to
actually create a model.
49
00:01:56,730 --> 00:01:58,740
As seen in the Model API video,
50
00:01:58,740 --> 00:02:00,540
we use the from_pretrained method,
51
00:02:00,540 --> 00:02:03,270
and adjust the number of
labels to the number of classes
52
00:02:03,270 --> 00:02:06,810
we have in this dataset, here two.
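A sketch of that model creation, assuming the same checkpoint as in the preprocessing sketch:

from transformers import AutoModelForSequenceClassification

# num_labels=2 matches the two classes of MRPC.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)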
53
00:02:06,810 --> 00:02:08,940
Again, to be sure everything is going well,
54
00:02:08,940 --> 00:02:11,100
we pass the batch we grabbed to our model,
55
00:02:11,100 --> 00:02:13,320
and check there is no error.
56
00:02:13,320 --> 00:02:14,940
If the labels are provided,
57
00:02:14,940 --> 00:02:16,590
the models of the Transformers library
58
00:02:16,590 --> 00:02:18,273
always return a loss directly.
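That sanity check can look like this, reusing the batch grabbed earlier:

# Since the batch contains labels, the model output includes the loss.
outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)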
59
00:02:19,525 --> 00:02:21,090
We will be able to do loss.backward()
60
00:02:21,090 --> 00:02:22,860
to compute all the gradients,
61
00:02:22,860 --> 00:02:26,460
and will then need an optimizer
to do the training step.
62
00:02:26,460 --> 00:02:28,860
We use the AdamW optimizer here,
63
00:02:28,860 --> 00:02:31,440
which is a variant of Adam
with proper weight decay,
64
00:02:31,440 --> 00:02:33,840
but you can pick any
PyTorch optimizer you like.
65
00:02:34,830 --> 00:02:36,150
Using the previous loss,
66
00:02:36,150 --> 00:02:39,060
and computing the gradients
with loss.backward(),
67
00:02:39,060 --> 00:02:41,130
we check that we can do the optimizer step
68
00:02:41,130 --> 00:02:42,030
without any error.
69
00:02:43,380 --> 00:02:45,870
Don't forget to zero
your gradients afterwards,
70
00:02:45,870 --> 00:02:46,890
or at the next step,
71
00:02:46,890 --> 00:02:49,343
they will get added to the
gradients you computed.
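A sketch of that single optimization step, using AdamW from PyTorch with a learning rate of 5e-5 as an example value:

from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

loss = model(**batch).loss
loss.backward()        # compute the gradients
optimizer.step()       # update the weights
optimizer.zero_grad()  # reset the gradients for the next step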
72
00:02:50,490 --> 00:02:52,080
We could already write our training loop,
73
00:02:52,080 --> 00:02:53,220
but we will add two more things
74
00:02:53,220 --> 00:02:55,620
to make it as good as it can be.
75
00:02:55,620 --> 00:02:57,690
The first one is a
learning rate scheduler,
76
00:02:57,690 --> 00:03:00,140
to progressively decay
our learning rate to zero.
77
00:03:01,195 --> 00:03:04,590
The get_scheduler function
from the Transformers library
78
00:03:04,590 --> 00:03:06,150
is just a convenience function
79
00:03:06,150 --> 00:03:07,800
to easily build such a scheduler.
80
00:03:08,850 --> 00:03:09,683
You can again use
81
00:03:09,683 --> 00:03:11,860
any PyTorch learning
rate scheduler instead.
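For instance, a linear decay schedule could be built like this (the three epochs are an assumption):

from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)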
82
00:03:13,110 --> 00:03:14,850
Finally, if we want our training
83
00:03:14,850 --> 00:03:17,610
to take a couple of minutes
instead of a few hours,
84
00:03:17,610 --> 00:03:19,530
we will need to use a GPU.
85
00:03:19,530 --> 00:03:21,270
The first step is to get one,
86
00:03:21,270 --> 00:03:23,283
for instance by using a Colab notebook.
87
00:03:24,180 --> 00:03:26,040
Then you need to actually send your model,
88
00:03:26,040 --> 00:03:28,923
and training data to it
by using a torch device.
89
00:03:29,790 --> 00:03:30,840
Double-check that the following lines
90
00:03:30,840 --> 00:03:32,340
print a CUDA device for you,
91
00:03:32,340 --> 00:03:35,640
or be prepared for your training
to last more than an hour.
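A sketch of that device setup; the print should show a CUDA device if a GPU is available:

import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
print(device)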
92
00:03:35,640 --> 00:03:37,390
We can now put everything together.
93
00:03:38,550 --> 00:03:40,860
First, we put our model in training mode
94
00:03:40,860 --> 00:03:42,240
which will activate the training behavior
95
00:03:42,240 --> 00:03:44,790
for some layers, like Dropout.
96
00:03:44,790 --> 00:03:46,860
Then we go through the number
of epochs we picked,
97
00:03:46,860 --> 00:03:50,070
and all the data in our
training dataloader.
98
00:03:50,070 --> 00:03:52,410
Then we go through all the
steps we have seen already:
99
00:03:52,410 --> 00:03:54,240
send the data to the GPU,
100
00:03:54,240 --> 00:03:55,560
compute the model outputs,
101
00:03:55,560 --> 00:03:57,720
and in particular the loss.
102
00:03:57,720 --> 00:03:59,850
Use the loss to compute gradients,
103
00:03:59,850 --> 00:04:02,880
then make a training
step with the optimizer.
104
00:04:02,880 --> 00:04:04,500
Update the learning rate in our scheduler
105
00:04:04,500 --> 00:04:05,970
for the next iteration,
106
00:04:05,970 --> 00:04:07,763
and zero the gradients of the optimizer.
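Putting it together, the training loop described above can be sketched like this, reusing the objects defined in the earlier sketches (the tqdm progress bar is optional):

from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()  # activate training behavior for layers like Dropout
for epoch in range(num_epochs):
    for batch in train_dataloader:
        # Send the data to the GPU.
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()        # compute the gradients

        optimizer.step()       # optimizer step
        lr_scheduler.step()    # update the learning rate
        optimizer.zero_grad()  # reset the gradients
        progress_bar.update(1)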
107
00:04:09,240 --> 00:04:10,500
Once this is finished,
108
00:04:10,500 --> 00:04:12,150
we can evaluate our model very easily
109
00:04:12,150 --> 00:04:14,283
with a metric from the Datasets library.
110
00:04:15,180 --> 00:04:17,880
First, we put our model
in evaluation mode,
111
00:04:17,880 --> 00:04:20,550
to deactivate layers like Dropout,
112
00:04:20,550 --> 00:04:23,850
then go through all the data
in the evaluation data loader.
113
00:04:23,850 --> 00:04:25,530
As we have seen in the Trainer video,
114
00:04:25,530 --> 00:04:26,850
the model outputs logits,
115
00:04:26,850 --> 00:04:28,530
and we need to apply the argmax function
116
00:04:28,530 --> 00:04:30,213
to convert them into predictions.
117
00:04:31,350 --> 00:04:33,420
The metric object then
has an add_batch method
118
00:04:33,420 --> 00:04:36,810
we can use to send it those
intermediate predictions.
119
00:04:36,810 --> 00:04:38,700
Once the evaluation loop is finished,
120
00:04:38,700 --> 00:04:40,320
we just have to call the compute method
121
00:04:40,320 --> 00:04:42,180
to get our final results.
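A sketch of that evaluation loop, loading the metric with load_metric from the Datasets library:

import torch
from datasets import load_metric

metric = load_metric("glue", "mrpc")

model.eval()  # deactivate layers like Dropout
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)
    # Convert the logits into predictions with argmax.
    predictions = torch.argmax(outputs.logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

print(metric.compute())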
122
00:04:42,180 --> 00:04:44,490
Congratulations, you have
now fine-tuned a model
123
00:04:44,490 --> 00:04:45,633
all by yourself.
124
00:04:47,253 --> 00:04:49,920
(air whooshing)