subtitles/zh-CN/03_what-is-transfer-learning.srt

1
00:00:00,189 --> 00:00:02,856
(空气呼啸) (air whooshing)

2
00:00:05,550 --> 00:00:07,293
- 什么是迁移学习? - What is transfer learning?

3
00:00:09,480 --> 00:00:10,920
迁移学习的思想 The idea of transfer learning

4
00:00:10,920 --> 00:00:12,570
是利用另一个模型所学到的知识, is to leverage the knowledge acquired

5
00:00:12,570 --> 00:00:15,543
该模型是在另一项任务上用大量数据训练的。 by a model trained with lots of data on another task.

6
00:00:16,410 --> 00:00:20,130
模型 A 将专门针对任务 A 进行训练。 Model A will be trained specifically for task A.

7
00:00:20,130 --> 00:00:22,200
现在假设您想训练一个模型 B, Now let's say you want to train a model B

8
00:00:22,200 --> 00:00:23,970
用于另一项不同的任务。 for a different task.

9
00:00:23,970 --> 00:00:27,330
一种选择是从头开始训练模型。 One option would be to train the model from scratch.

10
00:00:27,330 --> 00:00:30,633
但这可能需要大量的计算、时间和数据。 This could take lots of computation, time and data.

11
00:00:31,470 --> 00:00:34,260
另一种做法是,我们可以初始化模型 B, Instead, we could initialize model B

12
00:00:34,260 --> 00:00:36,570
使其具有与模型 A 相同的权重, with the same weights as model A,

13
00:00:36,570 --> 00:00:39,213
从而将模型 A 的知识迁移到任务 B 上。 transferring the knowledge of model A to task B.

14
00:00:41,040 --> 00:00:42,690
从头开始训练时, When training from scratch,

15
00:00:42,690 --> 00:00:45,870
模型的所有权重都是随机初始化的。 all the model's weights are initialized randomly.

16
00:00:45,870 --> 00:00:48,870
在这个例子中,我们正在训练一个 BERT 模型, In this example, we are training a BERT model

17
00:00:48,870 --> 00:00:50,220
任务是识别 on the task of recognizing

18
00:00:50,220 --> 00:00:52,203
两个句子是否相似。 if two sentences are similar or not.

19
00:00:54,116 --> 00:00:56,730
左边是从头开始训练的, On the left, it's trained from scratch,

20
00:00:56,730 --> 00:01:00,000
右边则是在微调一个预训练模型。 and on the right it's fine-tuning a pretrained model.

21
00:01:00,000 --> 00:01:02,220
正如我们所见,使用迁移学习 As we can see, using transfer learning

22
00:01:02,220 --> 00:01:05,160
和预训练模型会产生更好的结果。 and the pretrained model yields better results.

23
00:01:05,160 --> 00:01:07,140
即使训练更长时间,情况也是如此。 And it doesn't matter if we train longer.

24
00:01:07,140 --> 00:01:10,620
从头开始训练的准确率上限在 70% 左右, The training from scratch is capped at around 70% accuracy,

25
00:01:10,620 --> 00:01:13,293
而预训练模型轻松超过了 86%。 while the pretrained model easily beats 86%.

26
00:01:14,460 --> 00:01:16,140
这是因为预训练模型 This is because pretrained models

27
00:01:16,140 --> 00:01:18,420
通常是基于大量数据训练的, are usually trained on large amounts of data

28
00:01:18,420 --> 00:01:21,000
这些数据为模型提供了 that provide the model with a statistical understanding

29
00:01:21,000 --> 00:01:23,413
对预训练期间所用语言的统计理解。 of the language used during pretraining.

30
00:01:24,450 --> 00:01:25,950
在计算机视觉中, In computer vision,

31
00:01:25,950 --> 00:01:28,080
迁移学习已经被成功应用了 transfer learning has been applied successfully

32
00:01:28,080 --> 00:01:30,060
将近十年。 for almost ten years.

33
00:01:30,060 --> 00:01:32,850
模型经常在 ImageNet 上进行预训练, Models are frequently pretrained on ImageNet,

34
00:01:32,850 --> 00:01:36,153
这是一个包含 120 万张图像的数据集。 a dataset containing 1.2 million images.

35
00:01:37,170 --> 00:01:41,130
每张图像都按 1000 个标签中的一个进行分类。 Each image is classified by one of 1000 labels.

36
00:01:41,130 --> 00:01:44,010
像这样在有标注的数据上训练 Training like this, on labeled data,

37
00:01:44,010 --> 00:01:45,663
称为监督学习。 is called supervised learning.

38
00:01:47,340 --> 00:01:49,140
在自然语言处理中, In Natural Language Processing,

39
00:01:49,140 --> 00:01:51,870
迁移学习则是最近才兴起的。 transfer learning is a bit more recent.

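A minimal sketch of the two setups compared above (training from scratch vs. starting from pretrained weights), using the 🤗 Transformers library and assuming the bert-base-uncased checkpoint, which the video does not name explicitly:

from transformers import AutoConfig, AutoModel

checkpoint = "bert-base-uncased"  # assumed example checkpoint

# Fine-tuning setup: reuse the weights learned during pretraining (model A)
pretrained_model = AutoModel.from_pretrained(checkpoint)

# From-scratch setup: same architecture, but every weight initialized randomly
config = AutoConfig.from_pretrained(checkpoint)
scratch_model = AutoModel.from_config(config)
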
40
00:01:51,870 --> 00:01:54,480
它与 ImageNet 的一个关键区别是,这里的预训练 A key difference with ImageNet is that the pretraining

41
00:01:54,480 --> 00:01:56,460
通常是自监督的, is usually self-supervised,

42
00:01:56,460 --> 00:01:58,770
这意味着它不需要人工 which means it doesn't require human annotations

43
00:01:58,770 --> 00:01:59,673
对标签进行标注。 for the labels.

44
00:02:00,780 --> 00:02:02,700
一个非常常见的预训练目标 A very common pretraining objective

45
00:02:02,700 --> 00:02:05,310
是猜测句子中的下一个单词, is to guess the next word in a sentence,

46
00:02:05,310 --> 00:02:07,710
这只需要大量的文本。 which only requires lots and lots of text.

47
00:02:07,710 --> 00:02:10,710
例如,GPT-2 就是以这种方式预训练的, GPT-2, for instance, was pretrained this way

48
00:02:10,710 --> 00:02:12,900
它使用了 4500 万个链接的内容, using the content of 45 million links

49
00:02:12,900 --> 00:02:14,673
这些链接由用户发布在 Reddit 上。 posted by users on Reddit.

50
00:02:16,560 --> 00:02:19,590
自监督预训练目标的另一个例子 Another example of a self-supervised pretraining objective

51
00:02:19,590 --> 00:02:22,470
是预测被随机掩码的词。 is to predict the value of randomly masked words.

52
00:02:22,470 --> 00:02:24,540
这类似于填空测试, Which is similar to fill-in-the-blank tests

53
00:02:24,540 --> 00:02:26,760
您可能在学校做过。 you may have done in school.

54
00:02:26,760 --> 00:02:29,880
BERT 就是以这种方式,使用英文维基百科 BERT was pretrained this way, using the English Wikipedia

55
00:02:29,880 --> 00:02:31,893
和 11,000 本未出版的书籍预训练的。 and 11,000 unpublished books.

56
00:02:33,120 --> 00:02:36,450
在实践中,要在给定模型上应用迁移学习, In practice, transfer learning is applied on a given model

57
00:02:36,450 --> 00:02:39,090
需要丢弃它的头部,也就是 by throwing away its head, that is,

58
00:02:39,090 --> 00:02:42,150
针对预训练目标的最后几层, its last layers focused on the pretraining objective,

59
00:02:42,150 --> 00:02:45,360
并换上一个新的、随机初始化的头部, and replacing it with a new, randomly initialized head

60
00:02:45,360 --> 00:02:46,860
以适配当前的任务。 suitable for the task at hand.

61
00:02:47,970 --> 00:02:51,570
例如,之前我们微调 BERT 模型时, For instance, when we fine-tuned a BERT model earlier,

62
00:02:51,570 --> 00:02:54,060
我们删除了用于分类掩码词的头部, we removed the head that classified masked words

63
00:02:54,060 --> 00:02:56,790
并将其替换为具有 2 个输出的分类器, and replaced it with a classifier with 2 outputs,

64
00:02:56,790 --> 00:02:58,563
因为我们的任务有两个标签。 since our task had two labels.

65
00:02:59,700 --> 00:03:02,490
为了尽可能高效, To be as efficient as possible, the pretrained model used

66
00:03:02,490 --> 00:03:03,770
所使用的预训练模型 should be as similar as possible

67
00:03:03,770 --> 00:03:06,270
应与要微调的任务尽可能相似。 to the task it's fine-tuned on.

68
00:03:06,270 --> 00:03:08,190
例如,如果要解决的问题 For instance, if the problem

69
00:03:08,190 --> 00:03:10,860
是对德语句子进行分类, is to classify German sentences,

70
00:03:10,860 --> 00:03:13,053
最好使用德语的预训练模型。 it's best to use a German pretrained model.

71
00:03:14,370 --> 00:03:16,649
但好处也伴随着坏处。 But with the good comes the bad.

72
00:03:16,649 --> 00:03:19,380
预训练模型不仅迁移了它的知识, The pretrained model does not only transfer its knowledge,

73
00:03:19,380 --> 00:03:21,693
也迁移了它可能包含的任何偏见。 but also any bias it may contain.

74
00:03:22,530 --> 00:03:24,300
ImageNet 中的图像主要 ImageNet mostly contains images

75
00:03:24,300 --> 00:03:26,850
来自美国和西欧。 coming from the United States and Western Europe.

76
00:03:26,850 --> 00:03:28,020
所以用它微调出来的模型 So models fine-tuned with it

77
00:03:28,020 --> 00:03:31,710
通常在来自这些国家的图像上表现更好。 usually will perform better on images from these countries.

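A minimal sketch of the head replacement described above, again assuming bert-base-uncased; in 🤗 Transformers, loading the checkpoint under a sequence-classification architecture is what discards the pretraining head:

from transformers import AutoModelForSequenceClassification

# The masked-language-modeling head is dropped and a new, randomly
# initialized classification head with 2 outputs (one per label) is attached.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
)

Transformers prints a warning that some weights were newly initialized and that the model should be fine-tuned before use, which is exactly the situation the video describes.
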
78
00:03:31,710 --> 00:03:33,690
OpenAI 还研究了 OpenAI also studied the bias

79
00:03:33,690 --> 00:03:36,120
其 GPT-3 模型预测中的偏见, in the predictions of its GPT-3 model,

80
00:03:36,120 --> 00:03:36,953
该模型是使用 which was pretrained

81
00:03:36,953 --> 00:03:38,750
猜测下一个单词的目标预训练的。 using the guess-the-next-word objective.

82
00:03:39,720 --> 00:03:41,040
将提示中的性别 Changing the gender of the prompt

83
00:03:41,040 --> 00:03:44,250
从 "he was very" 改为 "she was very", from "he was very" to "she was very"

84
00:03:44,250 --> 00:03:47,550
会使预测从以中性形容词为主 changed the predictions from mostly neutral adjectives

85
00:03:47,550 --> 00:03:49,233
变为几乎全是描述外貌的形容词。 to almost only physical ones.

86
00:03:50,400 --> 00:03:52,367
在 GPT-2 的模型卡中, In their model card for the GPT-2 model,

87
00:03:52,367 --> 00:03:54,990
OpenAI 也承认了它的偏见, OpenAI also acknowledges its bias

88
00:03:54,990 --> 00:03:56,730
并且不鼓励在与人类交互的系统中 and discourages its use

89
00:03:56,730 --> 00:03:58,803
使用它。 in systems that interact with humans.

90
00:04:01,040 --> 00:04:03,707
(空气呼啸) (air whooshing)
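The GPT-3 prompt experiment mentioned above cannot be rerun directly, since GPT-3 is not openly available; a minimal sketch that probes the same kind of bias with GPT-2 as a stand-in (the prompts below are illustrative, not the exact ones OpenAI used):

from transformers import pipeline

# Generate a few completions for each prompt and compare the adjectives produced.
generator = pipeline("text-generation", model="gpt2")
for prompt in ["He was very", "She was very"]:
    outputs = generator(prompt, max_new_tokens=10, do_sample=True, num_return_sequences=3)
    print(prompt, [o["generated_text"] for o in outputs])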