In our other videos, we talked about the basics of fine-tuning a language model with TensorFlow, and as always, when I refer to videos I'll link them below. Still, can we do better?

So here's the code from our model fine-tuning video, and while it works, we could definitely tweak a couple of things. By far the most important thing is the learning rate. In this video we'll talk about how to change it, which will make your training much more consistently successful.

In fact, there are really two things we want to change about the default learning rate for Adam. The first is that it's way too high for our models: by default, Adam uses a learning rate of 10 to the minus 3 (1e-3), and that's very high for training transformer models. We're going to start at 5 times 10 to the minus 5 (5e-5), which is 20 times lower than the default.

And secondly, we don't just want a constant learning rate: we can get even better performance if we decay the learning rate down to a tiny value, or even to zero, over the course of training. That's what this PolynomialDecay schedule is doing. I'll show you what that decay looks like in a second, but first we need to tell the scheduler how long training is going to be, so that it decays at the right speed, and that's what this code here is doing.

We're computing how many minibatches the model is going to see over its entire training run, which is the number of batches in the training set, and then we multiply that by the number of epochs to get the total number of batches across the whole training run. Once we know how many training steps we're taking, we just pass all that information to the scheduler and we're ready to go.
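A rough sketch of what that scheduler setup might look like (the dataset name tf_train_dataset and the epoch count are placeholders, not taken from the video):

```python
from tensorflow.keras.optimizers.schedules import PolynomialDecay

num_epochs = 3  # placeholder: however many epochs you plan to train for

# len() of a batched tf.data.Dataset is the number of batches per epoch,
# so batches per epoch times epochs gives the total number of training steps.
num_train_steps = len(tf_train_dataset) * num_epochs

# Linear decay (power=1.0 is the default) from 5e-5 down to zero over training.
lr_schedule = PolynomialDecay(
    initial_learning_rate=5e-5,
    end_learning_rate=0.0,
    decay_steps=num_train_steps,
)
```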
So what does the polynomial decay schedule look like? Well, it looks like this: it starts at 5e-5, which means 5 times 10 to the minus 5, and then decays down at a constant rate until it hits zero right at the very end of training.

Now hold on, I can already hear you yelling at your monitor, and yes, I know: this is actually a constant, linear decay, and I know the name is polynomial, and you're feeling cheated that you were promised a polynomial and haven't gotten one. Calm down, though, it's okay, because linear functions are just first-order special cases of general polynomial functions, and if you tweak the options to this class, you can get a truly polynomial, higher-order decay schedule. But this linear schedule will work fine for us for now; we don't actually need all those fancy tweaks and gadgets.

So, coming back: how do we actually use this learning rate schedule once we've created it? It's simple, we just pass it to Adam. The first time we compiled the model, we just passed the string "adam" to get our optimizer. Keras recognizes the names of common optimizers and loss functions if you pass them as strings, so it saves time to do that if you only want the default settings. But now we're professional machine learners, and, you know, that salary review is upcoming, so we've got our very own learning rate schedule, and we're going to do things properly.

So the first thing we do is import the optimizer, and then we initialize it with the scheduler, which gets passed to the learning_rate argument of that optimizer. Now we compile the model with this new optimizer, and again, whatever loss function you want; this is going to be sparse categorical crossentropy if you're following along from the fine-tuning video. And then we're ready to go: now we have a high-performance model, ready for training.
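A minimal sketch of that compile step, assuming model is the TF Transformers model from the fine-tuning video and lr_schedule is the scheduler created above (the from_logits=True flag reflects that Transformers models output logits rather than probabilities):

```python
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy

# Pass the schedule object instead of a fixed float learning rate.
optimizer = Adam(learning_rate=lr_schedule)

# The model outputs logits, so tell the loss not to expect probabilities.
model.compile(
    optimizer=optimizer,
    loss=SparseCategoricalCrossentropy(from_logits=True),
)
```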
All that remains is to fit the model just like we did before. Remember, because we compiled the model with the new optimizer and the new learning rate schedule, we don't need to change anything at all when we call fit; we just call it again, with exactly the same command as before, but now we get a beautiful training run, with a nice, smooth learning rate decay, starting from a good value and decaying down to zero.
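For completeness, the fit call is unchanged; something like the following, where the dataset names are assumed placeholders:

```python
# Exactly the same call as before: the new optimizer and schedule were
# baked in at compile time, so fit() itself needs no changes.
model.fit(
    tf_train_dataset,
    validation_data=tf_validation_dataset,
    epochs=num_epochs,
)
```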