subtitles/en/03_what-is-transfer-learning.srt

1
00:00:00,189 --> 00:00:02,856
(air whooshing)

2
00:00:05,550 --> 00:00:07,293
- What is transfer learning?

3
00:00:09,480 --> 00:00:10,920
The idea of transfer learning

4
00:00:10,920 --> 00:00:12,570
is to leverage the knowledge acquired

5
00:00:12,570 --> 00:00:15,543
by a model trained with lots of data on another task.

6
00:00:16,410 --> 00:00:20,130
Model A will be trained specifically for task A.

7
00:00:20,130 --> 00:00:22,200
Now let's say you want to train a model B

8
00:00:22,200 --> 00:00:23,970
for a different task.

9
00:00:23,970 --> 00:00:27,330
One option would be to train the model from scratch.

10
00:00:27,330 --> 00:00:30,633
This could take lots of computation, time and data.

11
00:00:31,470 --> 00:00:34,260
Instead, we could initialize model B

12
00:00:34,260 --> 00:00:36,570
with the same weights as model A,

13
00:00:36,570 --> 00:00:39,213
transferring the knowledge of model A to task B.

14
00:00:41,040 --> 00:00:42,690
When training from scratch,

15
00:00:42,690 --> 00:00:45,870
all the model's weights are initialized randomly.

16
00:00:45,870 --> 00:00:48,870
In this example, we are training a BERT model

17
00:00:48,870 --> 00:00:50,220
on the task of recognizing

18
00:00:50,220 --> 00:00:52,203
if two sentences are similar or not.

19
00:00:54,116 --> 00:00:56,730
On the left, it's trained from scratch,

20
00:00:56,730 --> 00:01:00,000
and on the right it's fine-tuning a pretrained model.

21
00:01:00,000 --> 00:01:02,220
As we can see, using transfer learning

22
00:01:02,220 --> 00:01:05,160
and the pretrained model yields better results.

23
00:01:05,160 --> 00:01:07,140
And it doesn't matter if we train longer.

24
00:01:07,140 --> 00:01:10,620
The model trained from scratch is capped around 70% accuracy,

25
00:01:10,620 --> 00:01:13,293
while the pretrained model easily beats 86%.

26
00:01:14,460 --> 00:01:16,140
This is because pretrained models

27
00:01:16,140 --> 00:01:18,420
are usually trained on large amounts of data

28
00:01:18,420 --> 00:01:21,000
that provide the model with a statistical understanding

29
00:01:21,000 --> 00:01:23,413
of the language used during pretraining.

30
00:01:24,450 --> 00:01:25,950
In computer vision,

31
00:01:25,950 --> 00:01:28,080
transfer learning has been applied successfully

32
00:01:28,080 --> 00:01:30,060
for almost ten years.

33
00:01:30,060 --> 00:01:32,850
Models are frequently pretrained on ImageNet,

34
00:01:32,850 --> 00:01:36,153
a dataset containing 1.2 million images.

35
00:01:37,170 --> 00:01:41,130
Each image is classified with one of 1,000 labels.

36
00:01:41,130 --> 00:01:44,010
Training like this, on labeled data,

37
00:01:44,010 --> 00:01:45,663
is called supervised learning.

38
00:01:47,340 --> 00:01:49,140
In Natural Language Processing,

39
00:01:49,140 --> 00:01:51,870
transfer learning is a bit more recent.

40
00:01:51,870 --> 00:01:54,480
A key difference with ImageNet is that the pretraining

41
00:01:54,480 --> 00:01:56,460
is usually self-supervised,

42
00:01:56,460 --> 00:01:58,770
which means it doesn't require human annotations

43
00:01:58,770 --> 00:01:59,673
for the labels.

44
00:02:00,780 --> 00:02:02,700
A very common pretraining objective

45
00:02:02,700 --> 00:02:05,310
is to guess the next word in a sentence,

46
00:02:05,310 --> 00:02:07,710
which only requires lots and lots of text.
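As a concrete illustration of this guess-the-next-word objective (causal language modeling), here is a minimal sketch using the Hugging Face transformers library; the "gpt2" checkpoint and the example sentence are assumptions chosen for illustration, not code from the video.

# Minimal sketch of the "guess the next word" (causal language modeling) objective.
# Assumes `transformers` and `torch` are installed; "gpt2" is an illustrative checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Any raw text works as training data: the labels are the inputs themselves,
# shifted internally by one token so the model predicts each next word.
inputs = tokenizer("Transfer learning leverages knowledge acquired on another task.", return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])

print(outputs.loss)  # cross-entropy loss of the next-word predictions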
47
00:02:07,710 --> 00:02:10,710
GPT-2, for instance, was pretrained this way

48
00:02:10,710 --> 00:02:12,900
using the content of 45 million links

49
00:02:12,900 --> 00:02:14,673
posted by users on Reddit.

50
00:02:16,560 --> 00:02:19,590
Another example of a self-supervised pretraining objective

51
00:02:19,590 --> 00:02:22,470
is to predict the value of randomly masked words,

52
00:02:22,470 --> 00:02:24,540
which is similar to fill-in-the-blank tests

53
00:02:24,540 --> 00:02:26,760
you may have done in school.

54
00:02:26,760 --> 00:02:29,880
BERT was pretrained this way using the English Wikipedia

55
00:02:29,880 --> 00:02:31,893
and 11,000 unpublished books.

56
00:02:33,120 --> 00:02:36,450
In practice, transfer learning is applied on a given model

57
00:02:36,450 --> 00:02:39,090
by throwing away its head, that is,

58
00:02:39,090 --> 00:02:42,150
its last layers focused on the pretraining objective,

59
00:02:42,150 --> 00:02:45,360
and replacing it with a new, randomly initialized head

60
00:02:45,360 --> 00:02:46,860
suitable for the task at hand.

61
00:02:47,970 --> 00:02:51,570
For instance, when we fine-tuned a BERT model earlier,

62
00:02:51,570 --> 00:02:54,060
we removed the head that classified masked words

63
00:02:54,060 --> 00:02:56,790
and replaced it with a classifier with 2 outputs,

64
00:02:56,790 --> 00:02:58,563
since our task had two labels.

65
00:02:59,700 --> 00:03:02,490
To be as efficient as possible, the pretrained model used

66
00:03:02,490 --> 00:03:03,770
should be as close as possible

67
00:03:03,770 --> 00:03:06,270
to the task it's fine-tuned on.

68
00:03:06,270 --> 00:03:08,190
For instance, if the problem

69
00:03:08,190 --> 00:03:10,860
is to classify German sentences,

70
00:03:10,860 --> 00:03:13,053
it's best to use a German pretrained model.

71
00:03:14,370 --> 00:03:16,649
But with the good comes the bad.

72
00:03:16,649 --> 00:03:19,380
The pretrained model does not only transfer its knowledge,

73
00:03:19,380 --> 00:03:21,693
but also any bias it may contain.

74
00:03:22,530 --> 00:03:24,300
ImageNet mostly contains images

75
00:03:24,300 --> 00:03:26,850
coming from the United States and Western Europe,

76
00:03:26,850 --> 00:03:28,020
so models fine-tuned with it

77
00:03:28,020 --> 00:03:31,710
will usually perform better on images from these countries.

78
00:03:31,710 --> 00:03:33,690
OpenAI also studied the bias

79
00:03:33,690 --> 00:03:36,120
in the predictions of its GPT-3 model,

80
00:03:36,120 --> 00:03:36,953
which was pretrained

81
00:03:36,953 --> 00:03:38,750
using the guess-the-next-word objective.

82
00:03:39,720 --> 00:03:41,040
Changing the gender of the prompt

83
00:03:41,040 --> 00:03:44,250
from "he was very" to "she was very"

84
00:03:44,250 --> 00:03:47,550
changed the predictions from mostly neutral adjectives

85
00:03:47,550 --> 00:03:49,233
to almost only physical ones.

86
00:03:50,400 --> 00:03:52,367
In the model card of the GPT-2 model,

87
00:03:52,367 --> 00:03:54,990
OpenAI also acknowledges its bias

88
00:03:54,990 --> 00:03:56,730
and discourages its use

89
00:03:56,730 --> 00:03:58,803
in systems that interact with humans.

90
00:04:01,040 --> 00:04:03,707
(air whooshing)
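In code, the head swap described above amounts to loading the pretrained body with a freshly initialized classification head. Below is a minimal sketch with the Hugging Face transformers library; the "bert-base-uncased" checkpoint, num_labels=2 and the example sentence pair mirror the two-label similarity task but are assumptions, not the exact code used in the course.

# Minimal sketch: reuse a pretrained BERT body and attach a new, randomly
# initialized classification head with 2 outputs (one per label).
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "bert-base-uncased"  # assumed checkpoint, for illustration
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
# transformers warns that the classifier weights are newly initialized:
# the model still needs to be fine-tuned on the downstream task.

# A sentence pair, as in the similarity task; the 2 logits correspond to the 2 labels.
inputs = tokenizer("A man is playing a guitar.", "Someone plays an instrument.", return_tensors="pt")
logits = model(**inputs).logits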