1
00:00:00,189 --> 00:00:02,856
(空气呼啸)
(air whooshing)
2
00:00:05,550 --> 00:00:07,293
- 什么是迁移学习?
- What is transfer learning?
3
00:00:09,480 --> 00:00:10,920
迁移学习的思想
The idea of transfer learning
4
00:00:10,920 --> 00:00:12,570
是利用一个模型所获得的知识,
is to leverage the knowledge acquired
5
00:00:12,570 --> 00:00:15,543
该模型已在另一项任务上用大量数据训练过。
by a model trained with lots of data on another task.
6
00:00:16,410 --> 00:00:20,130
模型 A 将专门针对任务 A 进行训练。
The model A will be trained specifically for task A.
7
00:00:20,130 --> 00:00:22,200
现在假设您想为了另一个任务
Now let's say you want to train a model B
8
00:00:22,200 --> 00:00:23,970
训练模型 B。
for a different task.
9
00:00:23,970 --> 00:00:27,330
一种选择是从头开始训练模型。
One option would be to train the model from scratch.
10
00:00:27,330 --> 00:00:30,633
但这可能需要大量的计算、时间和数据。
This could take lots of computation, time and data.
11
00:00:31,470 --> 00:00:34,260
另一种做法是,我们可以初始化模型 B,
Instead, we could initialize model B
12
00:00:34,260 --> 00:00:36,570
使其具有与模型 A 相同的权重,
with the same weights as model A,
13
00:00:36,570 --> 00:00:39,213
将模型 A 的知识转移到任务 B 上。
transferring the knowledge of model A on task B.
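As a rough sketch of this difference in code (a minimal example using the Hugging Face Transformers library; the bert-base-cased checkpoint is just an illustrative choice):

from transformers import BertConfig, BertModel

# Training from scratch: a model built from a bare config starts with random weights.
config = BertConfig()
model_from_scratch = BertModel(config)

# Transfer learning: initialize from the weights of a model already trained on another task.
model_pretrained = BertModel.from_pretrained("bert-base-cased")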
14
00:00:41,040 --> 00:00:42,690
从头开始训练时,
When training from scratch,
15
00:00:42,690 --> 00:00:45,870
模型的所有权重都是随机初始化的。
all of the model's weights are initialized randomly.
16
00:00:45,870 --> 00:00:48,870
在这个例子中,我们正在训练一个 BERT 模型,
In this example, we are training a BERT model
17
00:00:48,870 --> 00:00:50,220
任务是识别
on the task of recognizing
18
00:00:50,220 --> 00:00:52,203
两个句子是否相似。
if two sentences are similar or not.
19
00:00:54,116 --> 00:00:56,730
左边的例子是从头开始训练的,
On the left, it's trained from scratch,
20
00:00:56,730 --> 00:01:00,000
右边则代表正在微调预训练模型。
and on the right it's fine-tuning a pretrained model.
21
00:01:00,000 --> 00:01:02,220
正如我们所见,使用迁移学习
As we can see, using transfer learning
22
00:01:02,220 --> 00:01:05,160
和预训练模型产生了更好的结果。
and the pretrained model yields better results.
23
00:01:05,160 --> 00:01:07,140
即使我们训练更长时间也是如此。
And it doesn't matter if we train longer.
24
00:01:07,140 --> 00:01:10,620
从头开始训练的准确率上限在 70% 左右,
Training from scratch is capped at around 70% accuracy,
25
00:01:10,620 --> 00:01:13,293
而预训练模型则轻松超过了 86%。
while the pretrained model easily exceeds 86%.
26
00:01:14,460 --> 00:01:16,140
这是因为预训练模型
This is because pretrained models
27
00:01:16,140 --> 00:01:18,420
通常基于大量数据进行训练
are usually trained on large amounts of data
28
00:01:18,420 --> 00:01:21,000
这些数据使模型获得了对预训练期间
that provide the model with a statistical understanding
29
00:01:21,000 --> 00:01:23,413
所用语言的统计理解。
of the language used during pretraining.
30
00:01:24,450 --> 00:01:25,950
在计算机视觉中,
In computer vision,
31
00:01:25,950 --> 00:01:28,080
迁移学习已成功应用
transfer learning has been applied successfully
32
00:01:28,080 --> 00:01:30,060
将近十年。
for almost ten years.
33
00:01:30,060 --> 00:01:32,850
模型经常在 ImageNet 上进行预训练,
Models are frequently pretrained on ImageNet,
34
00:01:32,850 --> 00:01:36,153
它是一个包含 120 万张图像的数据集。
a dataset containing 1.2 million images.
35
00:01:37,170 --> 00:01:41,130
每个图像都按 1000 个标签中的一个进行分类。
Each image is classified by one of 1000 labels.
36
00:01:41,130 --> 00:01:44,010
像这样在标记数据上训练
Training like this, on labeled data
37
00:01:44,010 --> 00:01:45,663
称为监督学习。
is called supervised learning.
38
00:01:47,340 --> 00:01:49,140
在自然语言处理中,
In Natural Language Processing,
39
00:01:49,140 --> 00:01:51,870
迁移学习的应用则要晚一些。
transfer learning is a bit more recent.
40
00:01:51,870 --> 00:01:54,480
它与 ImageNet 的一个关键区别是预训练
A key difference with ImageNet is that the pretraining
41
00:01:54,480 --> 00:01:56,460
通常是自监督的,
is usually self-supervised,
42
00:01:56,460 --> 00:01:58,770
这意味着它不需要人工
which means it doesn't require human annotations
43
00:01:58,770 --> 00:01:59,673
来标注标签。
for the labels.
44
00:02:00,780 --> 00:02:02,700
一个非常常见的预训练目标
A very common pretraining objective
45
00:02:02,700 --> 00:02:05,310
是猜测句子中的下一个单词,
is to guess the next word in a sentence,
46
00:02:05,310 --> 00:02:07,710
这只需要大量的文本。
which only requires lots and lots of text.
47
00:02:07,710 --> 00:02:10,710
例如 GPT-2,就是这样预训练的
GPT-2 for instance, was pretrained this way
48
00:02:10,710 --> 00:02:12,900
它使用了用户在 Reddit 上发布的
using the content of 45 million links
49
00:02:12,900 --> 00:02:14,673
4500 万个链接的内容。
posted by users on Reddit.
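As a hedged illustration of this next-word objective (the prompt text is invented; gpt2 is the publicly released checkpoint in the Transformers library), a pretrained GPT-2 can continue a prompt out of the box:

from transformers import pipeline

# GPT-2 was pretrained to guess the next word, so it can continue any prompt.
generator = pipeline("text-generation", model="gpt2")
print(generator("Transfer learning is useful because", max_new_tokens=15))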
50
00:02:16,560 --> 00:02:19,590
自监督预训练目标的另一个例子
Another example of self-supervised pretraining objective
51
00:02:19,590 --> 00:02:22,470
是预测被随机遮盖的单词,
is to predict the value of randomly masked words,
52
00:02:22,470 --> 00:02:24,540
这类似于您在学校里
which is similar to fill-in-the-blank tests
53
00:02:24,540 --> 00:02:26,760
可能做过的填空测试。
you may have done in school.
54
00:02:26,760 --> 00:02:29,880
BERT 就是以这种方式预训练的,使用了英文维基百科
BERT was pretrained this way using the English Wikipedia
55
00:02:29,880 --> 00:02:31,893
和 11,000 本未出版的书籍。
and 11,000 unpublished books.
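A minimal sketch of this masked-word objective, assuming the Transformers fill-mask pipeline and the bert-base-uncased checkpoint (the example sentence is invented):

from transformers import pipeline

# BERT's pretraining task, exposed as a fill-in-the-blank style pipeline.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
print(unmasker("Transfer learning reuses the [MASK] of a pretrained model."))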
56
00:02:33,120 --> 00:02:36,450
在实践中,迁移学习应用于给定模型的方式是
In practice, transfer learning is applied on a given model
57
00:02:36,450 --> 00:02:39,090
抛弃它的头部,也就是
by throwing away its head, that is,
58
00:02:39,090 --> 00:02:42,150
针对预训练目标的最后几层,
its last layers focused on the pretraining objective,
59
00:02:42,150 --> 00:02:45,360
并用一个新的、随机初始化的、
and replacing it with a new, randomly initialized head
60
00:02:45,360 --> 00:02:46,860
适用于当前任务的头部来替换它。
suitable for the task at hand.
61
00:02:47,970 --> 00:02:51,570
例如,当我们之前微调 BERT 模型时,
For instance, when we fine-tuned a BERT model earlier,
62
00:02:51,570 --> 00:02:54,060
我们去掉了用于对被掩码单词分类的头部,
we removed the head that classified masked words
63
00:02:54,060 --> 00:02:56,790
并将其替换为具有 2 个输出的分类器,
and replaced it with a classifier with 2 outputs,
64
00:02:56,790 --> 00:02:58,563
因为我们的任务有两个标签。
since our task had two labels.
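A minimal sketch of this head replacement, assuming the Transformers auto classes (bert-base-uncased stands in for whichever checkpoint is being fine-tuned):

from transformers import AutoModelForSequenceClassification

# Loads the pretrained BERT body and attaches a new, randomly initialized
# classification head with 2 outputs; the warning about newly initialized
# weights is expected, since the pretraining head is thrown away.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)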
65
00:02:59,700 --> 00:03:02,490
为了尽可能高效,
To be as efficient as possible, the pretrained model used
66
00:03:02,490 --> 00:03:03,770
所使用的预训练模型
should be as similar as possible
67
00:03:03,770 --> 00:03:06,270
应尽可能与其微调的任务相似。
to the task it's fine-tuned on.
68
00:03:06,270 --> 00:03:08,190
例如,如果当前需要
For instance, if the problem
69
00:03:08,190 --> 00:03:10,860
对德语句子进行分类,
is to classify German sentences,
70
00:03:10,860 --> 00:03:13,053
最好使用德语预训练模型。
it's best to use a German pretrained model.
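For example, a sketch of that choice (bert-base-german-cased is one checkpoint pretrained on German text on the Hugging Face Hub; the two-label setup is simply carried over from the earlier example):

from transformers import AutoModelForSequenceClassification

# Start from a body pretrained on German text rather than an English one.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-german-cased", num_labels=2
)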
71
00:03:14,370 --> 00:03:16,649
但好处也伴随着坏处。
But with the good comes the bad.
72
00:03:16,649 --> 00:03:19,380
预训练模型不仅转移了它的知识,
The pretrained model does not only transfer its knowledge,
73
00:03:19,380 --> 00:03:21,693
同时也转移了它可能包含的任何偏见。
but also any bias it may contain.
74
00:03:22,530 --> 00:03:24,300
ImageNet 包含的图像主要
ImageNet mostly contains images
75
00:03:24,300 --> 00:03:26,850
来自美国和西欧。
coming from the United States and Western Europe.
76
00:03:26,850 --> 00:03:28,020
所以基于它进行微调的模型
So models fine-tuned with it
77
00:03:28,020 --> 00:03:31,710
通常会在来自这些国家或地区的图像上表现更好。
usually will perform better on images from these countries.
78
00:03:31,710 --> 00:03:33,690
OpenAI 还研究了
OpenAI also studied the bias
79
00:03:33,690 --> 00:03:36,120
其使用猜测下一个单词目标
in the predictions of its GPT-3 model
80
00:03:36,120 --> 00:03:36,953
预训练的 GPT-3 模型中
which was pretrained
81
00:03:36,953 --> 00:03:38,750
预测的偏差。
using the guess the next word objective.
82
00:03:39,720 --> 00:03:41,040
将提示的性别
Changing the gender of the prompt
83
00:03:41,040 --> 00:03:44,250
从“他”更改到“她”
from he was very to she was very
84
00:03:44,250 --> 00:03:47,550
会使预测从主要是中性形容词
changed the predictions from mostly neutral adjectives
85
00:03:47,550 --> 00:03:49,233
变为几乎只有描述外貌的形容词。
to almost only physical ones.
86
00:03:50,400 --> 00:03:52,367
在其 GPT-2 模型的模型卡中,
In the model card of their GPT-2 model,
87
00:03:52,367 --> 00:03:54,990
OpenAI 也承认了它的偏见
OpenAI also acknowledges its bias
88
00:03:54,990 --> 00:03:56,730
并且不鼓励在与人类交互的系统中
and discourages its use
89
00:03:56,730 --> 00:03:58,803
使用它。
in systems that interact with humans.
90
00:04:01,040 --> 00:04:03,707
(空气呼啸)
(air whooshing)