subtitles/zh-CN/34_the-push-to-hub-api-(tensorflow).srt
1
00:00:00,587 --> 00:00:02,670
(嗖嗖)
(swoosh)
2
00:00:05,100 --> 00:00:07,080
- [旁白] 嗨,本视频将带领大家了解
- [Narrator] Hi, this is going to be a video
3
00:00:07,080 --> 00:00:09,420
push_to_hub API 适用于
about the push_to_hub API
4
00:00:09,420 --> 00:00:10,670
TensorFlow 和 Keras 的版本。
for TensorFlow and Keras.
5
00:00:11,820 --> 00:00:14,850
那么首先,我们打开 notebook。
So, to get started, we'll open up our notebook.
6
00:00:14,850 --> 00:00:16,920
你需要做的第一件事就是
And the first thing you'll need to do is log in to
7
00:00:16,920 --> 00:00:18,170
登录你的 HuggingFace 帐户,
your HuggingFace account,
8
00:00:19,043 --> 00:00:20,663
例如调用 notebook_login 函数。
for example with the notebook login function.
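A minimal sketch of the login step mentioned here, assuming the huggingface_hub library is installed; outside a notebook, the huggingface-cli login command does the same thing.

from huggingface_hub import notebook_login

# Opens a prompt in the notebook asking for your Hugging Face credentials/token
notebook_login()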
9
00:00:21,570 --> 00:00:24,630
所以要使用它,你只需调用函数,
So to use that, you simply call the function,
10
00:00:24,630 --> 00:00:26,010
随后会弹出窗口。
the popup will emerge.
11
00:00:26,010 --> 00:00:28,800
你需要输入你的用户名和密码,
You will enter your username and password,
12
00:00:28,800 --> 00:00:31,425
这里我使用密码管理器拷贝密码,
which I'm going to pull out of my password manager here,
13
00:00:31,425 --> 00:00:33,108
然后就登录了。
and you log in.
14
00:00:33,108 --> 00:00:35,670
接下来的两个单元格只是
The next two cells are just
15
00:00:35,670 --> 00:00:37,080
为训练做好准备。
getting everything ready for training.
16
00:00:37,080 --> 00:00:38,940
所以我们只是要加载一个数据集,
So we're just going to load a dataset,
17
00:00:38,940 --> 00:00:41,100
我们要对数据集进行分词,
we're going to tokenize that dataset,
18
00:00:41,100 --> 00:00:42,990
然后我们将加载我们的模型并
and then we're going to load our model and compile
19
00:00:42,990 --> 00:00:45,660
使用标准 Adam 优化器进行编译。
it with the standard Adam optimizer.
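The preparation cells themselves are not shown in the transcript; below is a rough sketch of what they might contain, assuming the GLUE CoLA dataset (named later in the video), a bert-base-cased checkpoint, and arbitrary batch size and learning rate.

from datasets import load_dataset
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import tensorflow as tf

raw_datasets = load_dataset("glue", "cola")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_fn(batch):
    return tokenizer(batch["sentence"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_fn, batched=True)

model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

# Turn the tokenized splits into tf.data.Dataset objects
train_set = model.prepare_tf_dataset(tokenized_datasets["train"], batch_size=16, shuffle=True, tokenizer=tokenizer)
validation_set = model.prepare_tf_dataset(tokenized_datasets["validation"], batch_size=16, shuffle=False, tokenizer=tokenizer)

# Compile with the standard Adam optimizer, as in the video; Transformers TF models
# can compute their loss internally when no loss is passed to compile()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5))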
20
00:00:45,660 --> 00:00:47,560
现在我们会运行这些代码。
So I'm just going to run all of those.
21
00:00:49,830 --> 00:00:52,080
我们等几秒钟,
We'll wait a few seconds,
22
00:00:52,080 --> 00:00:54,280
一切都应该为训练做好准备。
and everything should be ready for training.
23
00:00:57,983 --> 00:00:58,816
好的。
Okay.
24
00:00:58,816 --> 00:01:01,440
所以现在我们准备好训练了。
So now we're ready to train.
25
00:01:01,440 --> 00:01:03,030
我将向你展示两种方法
I'm going to show you the two ways
26
00:01:03,030 --> 00:01:05,130
你可以将你的模型推送到 Hub。
you can push your model to the Hub.
27
00:01:05,130 --> 00:01:08,190
所以第一个是 PushToHubCallback。
So the first is with the PushToHubCallback.
28
00:01:08,190 --> 00:01:10,107
所以 Keras 中的回调
So a callback in Keras
29
00:01:10,107 --> 00:01:13,710
是一个会在训练过程中被定期调用的函数。
is a function that's called regularly during training.
30
00:01:13,710 --> 00:01:17,400
你可以将其设置为在一定数量的 step 后或者在每个 epoch 被调用,
You can set it to be called after a certain number of steps,
31
00:01:17,400 --> 00:01:21,427
甚至只是在训练结束时调用一次。
or every epoch, or even just once at the end of training.
32
00:01:21,427 --> 00:01:25,080
例如,Keras 中的很多回调
So a lot of callbacks in Keras, for example,
33
00:01:25,080 --> 00:01:28,050
用于在平稳期控制学习率衰减,
control learning rate decaying on plateau,
34
00:01:28,050 --> 00:01:30,047
诸如此类的事情。
and things like that.
35
00:01:30,047 --> 00:01:32,520
所以这个回调,默认情况下,
So this callback, by default,
36
00:01:32,520 --> 00:01:35,760
在每个 epoch 都会将你的模型保存到 Hub 一次。
will save your model to the Hub once every epoch.
37
00:01:35,760 --> 00:01:37,080
这真的非常有用,
And that's really helpful,
38
00:01:37,080 --> 00:01:38,790
特别是如果你的训练时间很长,
especially if your training is very long,
39
00:01:38,790 --> 00:01:40,800
因为那意味着你可以从保存点恢复训练,
because that means you can resume from that save,
40
00:01:40,800 --> 00:01:43,290
这样你的模型就实现了自动云端保存。
so you get this automatic cloud-saving of your model.
41
00:01:43,290 --> 00:01:45,027
你甚至可以利用
And you can even run inference
42
00:01:45,027 --> 00:01:47,730
该回调所上传的模型存储点(checkpoint)
with the checkpoints of your model
43
00:01:47,730 --> 00:01:50,208
进行推理。
that have been uploaded by this callback.
44
00:01:50,208 --> 00:01:52,260
这意味着你可以,
And that means you can,
45
00:01:52,260 --> 00:01:54,150
你懂的,运行一些测试输入
y'know, run some test inputs
46
00:01:54,150 --> 00:01:56,100
并实际查看你的模型在训练的各个阶段
and actually see how your model works
47
00:01:56,100 --> 00:01:57,990
是如何工作的,
at various stages during training,
48
00:01:57,990 --> 00:01:59,540
这是一个非常好的功能。
which is a really nice feature.
49
00:02:00,390 --> 00:02:03,960
所以我们要添加 PushToHubCallback,
So we're going to add the PushToHubCallback,
50
00:02:03,960 --> 00:02:05,670
它只需要几个参数。
and it takes just a few arguments.
51
00:02:05,670 --> 00:02:08,250
第一个参数是临时目录
So the first argument is the temporary directory
52
00:02:08,250 --> 00:02:10,260
文件在上传到 Hub 之前
that files are going to be saved to
53
00:02:10,260 --> 00:02:12,150
将被保存到该目录。
before they're uploaded to the Hub.
54
00:02:12,150 --> 00:02:14,127
第二个参数是分词器,
The second argument is the tokenizer,
55
00:02:14,127 --> 00:02:15,808
第三个参数在这里
and the third argument here
56
00:02:15,808 --> 00:02:19,080
是关键字参数 hub_model_id。
is the keyword argument hub_model_id.
57
00:02:19,080 --> 00:02:21,330
也就是在 HuggingFace Hub 上
So that's the name it's going to be saved under
58
00:02:21,330 --> 00:02:23,006
它将被保存的名称。
on the HuggingFace Hub.
59
00:02:23,006 --> 00:02:26,267
你还可以上传到组织帐户
You can also upload to an organization account
60
00:02:26,267 --> 00:02:29,370
只需在仓库名称之前
just by adding the organization name
61
00:02:29,370 --> 00:02:32,460
加上组织名称和一个斜杠,就像这样。
before the repository name with a slash, like this.
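A sketch of the callback just described; the output directory and repository names below are placeholders.

from transformers import PushToHubCallback

callback = PushToHubCallback(
    output_dir="model_output",            # temporary directory files are saved to before upload
    tokenizer=tokenizer,                  # the tokenizer is uploaded alongside the model
    hub_model_id="bert-fine-tuned-cola",  # repo name on the Hugging Face Hub
)

# To push to an organization instead of your personal account,
# prefix the repo name with the organization name and a slash:
# hub_model_id="my-org/bert-fine-tuned-cola"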
62
00:02:32,460 --> 00:02:34,020
所以你可能没有权限
So you probably don't have permissions
63
00:02:34,020 --> 00:02:36,000
上传到 HuggingFace 组织,
to upload to the HuggingFace organization,
64
00:02:36,000 --> 00:02:37,170
如果你有的话,请提交 bug
if you do please file a bug
65
00:02:37,170 --> 00:02:38,973
并火速通知我们。
and let us know extremely urgently.
66
00:02:40,830 --> 00:02:42,960
但如果你确实有权限访问你自己的组织,
But if you do have access to your own organization,
67
00:02:42,960 --> 00:02:44,730
那么你可以使用相同的方法
then you can use that same approach
68
00:02:44,730 --> 00:02:46,650
将模型上传到该组织的帐户
to upload models to their account
69
00:02:46,650 --> 00:02:50,760
而不是你自己的个人模型集。
instead of to your own personal set of models.
70
00:02:50,760 --> 00:02:53,520
所以,一旦你创建好了回调,
So, once you've made your callback,
71
00:02:53,520 --> 00:02:56,310
你只需在调用 model.fit 时
you simply add it to the callbacks list
72
00:02:56,310 --> 00:02:58,080
将它添加到回调列表。
when you're calling model.fit.
73
00:02:58,080 --> 00:03:01,110
之后所有内容都会自动为你上传,
And everything is uploaded for you from there,
74
00:03:01,110 --> 00:03:02,610
没有什么可担心的。
there's nothing else to worry about.
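A sketch of how the callback is passed to model.fit, reusing the train_set and validation_set names assumed in the earlier sketch.

# The callback uploads a checkpoint to the Hub once per epoch by default
model.fit(
    train_set,
    validation_data=validation_set,
    epochs=2,
    callbacks=[callback],
)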
75
00:03:02,610 --> 00:03:04,530
上传模型的第二种方式,
The second way to upload a model, though,
76
00:03:04,530 --> 00:03:07,020
就是调用 model.push_to_hub。
is to call model.push_to_hub.
77
00:03:07,020 --> 00:03:09,086
所以这更像是一种一次性的方法。
So this is more of a once-off method.
78
00:03:09,086 --> 00:03:11,550
在训练期间不会定期调用它。
It's not called regularly during training.
79
00:03:11,550 --> 00:03:13,680
你可以随时手动调用它
You can just call this manually whenever you want to
80
00:03:13,680 --> 00:03:15,240
将模型上传到 hub。
upload a model to the hub.
81
00:03:15,240 --> 00:03:18,949
所以我们建议在训练结束后运行这个,
So we recommend running this after the end of training,
82
00:03:18,949 --> 00:03:21,870
只是为了确保你有一条 commit 消息
just to make sure that you have a commit message
83
00:03:21,870 --> 00:03:24,060
保证这是训练结束时
to guarantee that this was the final version
84
00:03:24,060 --> 00:03:26,143
模型的最终版本。
of the model at the end of training.
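A sketch of the one-off upload described here; the repo name and commit message are placeholders.

# Push the final weights (and the tokenizer) once training is done
model.push_to_hub("bert-fine-tuned-cola", commit_message="End of training")
tokenizer.push_to_hub("bert-fine-tuned-cola")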
85
00:03:26,143 --> 00:03:27,930
它只是为了确保,你懂的,
And it just makes sure that, you know,
86
00:03:27,930 --> 00:03:30,480
你使用的是训练结束时的最终模型
you're working with the definitive end-of-training model
87
00:03:30,480 --> 00:03:32,190
而不是意外从某个地方拿到的
and not accidentally using a checkpoint
88
00:03:32,190 --> 00:03:34,224
模型的某个存储点的版本。
from somewhere along the way.
89
00:03:34,224 --> 00:03:37,173
所以我要运行这两个单元格。
So I'm going to run both of these cells.
90
00:03:39,299 --> 00:03:41,716
然后我要在这里剪视频,
And then I'm going to cut the video here,
91
00:03:41,716 --> 00:03:43,080
只是因为训练需要几分钟。
just because training is going to take a couple of minutes.
92
00:03:43,080 --> 00:03:44,580
所以我会直接跳到最后,
So I'll skip forward to the end of that,
93
00:03:44,580 --> 00:03:46,320
当模型全部上传后,
when the models have all been uploaded,
94
00:03:46,320 --> 00:03:48,390
我会告诉你
and I'm gonna show you how you can
95
00:03:48,390 --> 00:03:50,010
如何访问 Hub 中的模型,
access the models in the Hub,
96
00:03:50,010 --> 00:03:52,713
以及你还可以在那里对模型做的其他事情。
and the other things you can do with them from there.
97
00:03:55,440 --> 00:03:56,700
好的,我们回来了,
Okay, we're back,
98
00:03:56,700 --> 00:03:59,160
我们的模型已上传。
and our model was uploaded.
99
00:03:59,160 --> 00:04:00,750
既通过 PushToHubCallback 上传,
Both by the PushToHubCallback
100
00:04:00,750 --> 00:04:04,251
也通过我们训练后对 model.push_to_hub 的调用上传。
and also by our call to model.push_to_hub after training.
101
00:04:04,251 --> 00:04:05,910
所以一切看起来都很好。
So everything's looking good.
102
00:04:05,910 --> 00:04:09,960
那么现在如果我们转到我在 HuggingFace 上的个人资料,
So now if we drop over to my profile on HuggingFace,
103
00:04:09,960 --> 00:04:12,630
你只需点击下拉菜单中的
and you can get there just by clicking the profile button
104
00:04:12,630 --> 00:04:13,680
profile 按钮就能进入。
in the dropdown.
105
00:04:13,680 --> 00:04:16,860
我们可以看到 bert-fine-tuned-cola 模型在这里,
We can see that the bert-fine-tuned-cola model is here,
106
00:04:16,860 --> 00:04:18,369
并于 3 分钟前更新。
and was updated 3 minutes ago.
107
00:04:18,369 --> 00:04:20,520
所以它总是在你的列表的顶部,
So it'll always be at the top of your list,
108
00:04:20,520 --> 00:04:23,340
因为它们是按最近更新时间排序的。
because they're sorted by how recently they were updated.
109
00:04:23,340 --> 00:04:25,740
我们可以立即开始查询我们的模型。
And we can start querying our model immediately.
110
00:04:30,564 --> 00:04:32,939
所以我们训练的数据集
So the dataset we were training on
111
00:04:32,939 --> 00:04:34,320
是 GLUE CoLA 数据集,
is the GLUE CoLA dataset,
112
00:04:34,320 --> 00:04:36,210
CoLA 是 Corpus of Linguistic Acceptability(语言可接受性语料库)
and CoLA is an acronym standing for
113
00:04:36,210 --> 00:04:39,420
的首字母缩写。
the Corpus of Linguistic Acceptability.
114
00:04:39,420 --> 00:04:42,480
所以这意味着我们训练这个模型来判断
So what that means is the model is being trained to decide
115
00:04:42,480 --> 00:04:46,350
一个句子在语法或语言上是正确的呢,
if a sentence is grammatically or linguistically okay,
116
00:04:46,350 --> 00:04:48,171
还是有问题的呢。
or if there's a problem with it.
117
00:04:48,171 --> 00:04:52,890
例如,我们可以说,“This is a legitimate sentence.”
For example, we could say, "This is a legitimate sentence."
118
00:04:52,890 --> 00:04:54,180
并且希望它判断
And hopefully it realizes that
119
00:04:54,180 --> 00:04:56,080
这实际上是一个合理的句子。
this is in fact a legitimate sentence.
120
00:04:57,630 --> 00:05:00,240
所以当你第一次调用模型时,
So it might take a couple of seconds for the model to load
121
00:05:00,240 --> 00:05:03,060
加载模型可能需要几秒钟。
when you call it for the first time.
122
00:05:03,060 --> 00:05:05,960
所以我可能会在这里从这个视频中剪掉几秒钟。
So I might cut a couple of seconds out of this video here.
123
00:05:07,860 --> 00:05:09,060
好的,我们回来了。
Okay, we're back.
124
00:05:09,060 --> 00:05:12,407
所以模型加载了,我们得到了一个输出结果,
So the model loaded and we got an output,
125
00:05:12,407 --> 00:05:14,340
但这里有一个明显的问题。
but there's an obvious problem here.
126
00:05:14,340 --> 00:05:16,888
这些标签并没有真正告诉我们
So these labels aren't really telling us
127
00:05:16,888 --> 00:05:19,740
模型实际为这个输入的句子
what categories the model has actually assigned
128
00:05:19,740 --> 00:05:21,655
分配了哪些类别。
to this input sentence.
129
00:05:21,655 --> 00:05:23,520
所以如果我们想解决这个问题,
So if we want to fix that,
130
00:05:23,520 --> 00:05:26,010
我们要确保模型配置
we want to make sure the model config
131
00:05:26,010 --> 00:05:28,980
针对每个标签类别都有正确的名称,
has the correct names for each of the label classes,
132
00:05:28,980 --> 00:05:30,707
然后我们要上传该配置。
and then we want to upload that config.
133
00:05:30,707 --> 00:05:32,220
所以我们可以在这里实现。
So we can do that down here.
134
00:05:32,220 --> 00:05:34,050
要获取标签名称,
To get the label names,
135
00:05:34,050 --> 00:05:36,547
我们可以从我们加载的数据集中得到它,
we can get that from the dataset we loaded,
136
00:05:36,547 --> 00:05:39,627
也就是它的 features 属性。
from the features attribute it has.
137
00:05:39,627 --> 00:05:42,217
然后我们可以创建字典
And then we can create dictionaries
138
00:05:42,217 --> 00:05:44,865
“id2label” 和 “label2id”,
"id2label" and "label2id",
139
00:05:44,865 --> 00:05:47,452
并将它们设置到模型配置中。
and just assign them to the model config.
140
00:05:47,452 --> 00:05:50,790
然后我们可以推送我们更新的配置,
And then we can just push our updated config,
141
00:05:50,790 --> 00:05:54,690
这将覆盖 Hub 仓库中的现有配置。
and that'll override the existing config in the Hub repo.
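A sketch of the config update described above, assuming the raw_datasets object from the earlier sketch; for GLUE CoLA the label names are stored in a ClassLabel feature.

# Read the label names from the dataset's features
label_names = raw_datasets["train"].features["label"].names

model.config.id2label = {i: name for i, name in enumerate(label_names)}
model.config.label2id = {name: i for i, name in enumerate(label_names)}

# Pushing just the config overwrites config.json in the Hub repo
model.config.push_to_hub("bert-fine-tuned-cola")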
142
00:05:54,690 --> 00:05:56,368
所以这已经完成了。
So that's just been done.
143
00:05:56,368 --> 00:05:58,320
那么现在,如果我们回到这里,
So now, if we go back here,
144
00:05:58,320 --> 00:06:00,000
我要用一个稍微不同的句子
I'm going to use a slightly different sentence
145
00:06:00,000 --> 00:06:03,540
因为句子的输出有时会被缓存。
because the outputs for sentences are sometimes cached.
146
00:06:03,540 --> 00:06:06,030
所以,如果我们想生成新的结果
And so, if we want to generate new results
147
00:06:06,030 --> 00:06:07,590
我将使用一些稍微不同的东西。
I'm going to use something slightly different.
148
00:06:07,590 --> 00:06:09,783
那么,让我们尝试换一个不正确的句子。
So let's try an incorrect sentence.
149
00:06:10,830 --> 00:06:12,640
所以这不是有效的英语语法
So this is not valid English grammar
150
00:06:13,538 --> 00:06:15,030
希望模型能发现这一点。
and hopefully the model will see that.
151
00:06:15,030 --> 00:06:16,958
它会在这里重新加载,
It's going to reload here,
152
00:06:16,958 --> 00:06:18,630
所以我会在这里剪掉几秒,
so I'm going to cut a couple of seconds here,
153
00:06:18,630 --> 00:06:20,933
然后我们会看到模型会返回什么结果。
and then we'll see what the model is going to say.
154
00:06:22,860 --> 00:06:23,820
好的。
Okay.
155
00:06:23,820 --> 00:06:26,580
所以这个模型,它的置信度不是很好,
So the model, its confidence isn't very good,
156
00:06:26,580 --> 00:06:28,830
因为我们还没有真正优化
because of course we didn't really optimize
157
00:06:28,830 --> 00:06:30,630
我们的超参数。
our hyperparameters at all.
158
00:06:30,630 --> 00:06:32,190
但它返回的结果显示这句话
But it has decided that this sentence
159
00:06:32,190 --> 00:06:35,094
更有可能是不可接受的,而不是可接受的。
is more likely to be unacceptable than acceptable.
160
00:06:35,094 --> 00:06:38,160
如果我们更努力地训练
Presumably if we tried a bit harder with training
161
00:06:38,160 --> 00:06:40,080
我们可以获得更低的验证集 loss,
we could get a much lower validation loss,
162
00:06:40,080 --> 00:06:43,830
因此模型的预测会更准确。
and therefore the model's predictions would be more precise.
163
00:06:43,830 --> 00:06:46,260
但是让我们再试一次我们原来的句子。
But let's try our original sentence again.
164
00:06:46,260 --> 00:06:49,140
当然,因为缓存的问题,
Of course, because of the caching issue,
165
00:06:49,140 --> 00:06:52,740
我们看到原来的答案没有改变。
we're seeing that the original answers are unchanged.
166
00:06:52,740 --> 00:06:55,196
所以让我们尝试一个不同的,有效的句子。
So let's try a different, valid sentence.
167
00:06:55,196 --> 00:06:58,767
所以让我们试试,“This is a valid English sentence”。
So let's try, "This is a valid English sentence".
168
00:07:00,150 --> 00:07:02,100
我们看到现在模型正确地判断出
And we see that now the model correctly decides
169
00:07:02,100 --> 00:07:04,290
它可接受的概率非常高,
that it has a very high probability of being acceptable,
170
00:07:04,290 --> 00:07:06,900
而不可接受的概率非常低。
and a very low probability of being unacceptable.
171
00:07:06,900 --> 00:07:09,930
所以你可以使用这个 inference API
So you can use this inference API
172
00:07:09,930 --> 00:07:12,810
甚至可用于训练期间上传的存储点,
even with the checkpoints that are uploaded during training,
173
00:07:12,810 --> 00:07:14,546
因此,观察模型对样本输入的预测结果
so it can be very interesting to see how
174
00:07:14,546 --> 00:07:17,690
如何随着每个训练 epoch 发生变化
the model's predictions for sample inputs change
175
00:07:17,690 --> 00:07:20,579
是非常有意思的。
with each epoch of training.
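In the video the Hub's inference widget is used in the browser; as an alternative, the huggingface_hub client library can query the same hosted inference endpoints from code. The repo name below is a placeholder.

from huggingface_hub import InferenceClient

client = InferenceClient()
result = client.text_classification(
    "This is a legitimate sentence.",
    model="your-username/bert-fine-tuned-cola",  # placeholder repo id
)
print(result)  # labels with their scores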
176
00:07:20,579 --> 00:07:23,370
另外,你也可以访问
Also, the model we've uploaded
177
00:07:23,370 --> 00:07:25,740
我们上传的模型
is going to be accessible to you and,
178
00:07:25,740 --> 00:07:28,046
如果公开分享,其他任何人也都可以访问。
if it's shared publicly, to anyone else.
179
00:07:28,046 --> 00:07:29,788
所以如果你想加载那个模型,
So if you want to load that model,
180
00:07:29,788 --> 00:07:32,500
你或其他任何人所需要做的
all you or anyone else needs to do
181
00:07:34,290 --> 00:07:37,440
就是将它加载到 pipeline,
is just to load it in either a pipeline,
182
00:07:37,440 --> 00:07:40,925
或者你可以使用其它方式加载它,例如,
or you can just load it with, for example,
183
00:07:40,925 --> 00:07:43,203
TFAutoModelForSequenceClassification。
TFAutoModelForSequenceClassification.
184
00:07:46,920 --> 00:07:49,989
然后对于名称,你只需传入
And then for the name you would just simply pass
185
00:07:49,989 --> 00:07:53,325
想要上传的 repo 的路径即可。
the path to the repo you want to upload.
186
00:07:53,325 --> 00:07:55,890
或者下载,不好意思。
Or to download, excuse me.
187
00:07:55,890 --> 00:07:58,710
所以如果我想再次使用这个模型,
So if I want to use this model again,
188
00:07:58,710 --> 00:08:00,667
如果我想从 hub 加载它,
if I want to load it from the hub,
189
00:08:00,667 --> 00:08:01,763
仅需运行这一行代码。
I just run this one line of code.
190
00:08:02,813 --> 00:08:03,773
该模型将被下载。
The model will be downloaded.
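A sketch of the two loading options mentioned here; the repo id is a placeholder for your own username and model name.

from transformers import TFAutoModelForSequenceClassification, pipeline

# Load the raw model for further fine-tuning or custom inference
model = TFAutoModelForSequenceClassification.from_pretrained("your-username/bert-fine-tuned-cola")

# Or wrap it in a pipeline for quick predictions
classifier = pipeline("text-classification", model="your-username/bert-fine-tuned-cola", framework="tf")
print(classifier("This is a valid English sentence."))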
191
00:08:07,757 --> 00:08:10,080
而且,运气好的话,它会准备好
And, with any luck, it'll be ready to
192
00:08:10,080 --> 00:08:12,450
对不同的数据集进行微调,进行预测,
fine-tune on a different dataset, make predictions with,
193
00:08:12,450 --> 00:08:14,340
或者做任何你想做的事情。
or do anything else you wanna do.
194
00:08:14,340 --> 00:08:17,700
以上内容就是关于
So that was a quick overview of how,
195
00:08:17,700 --> 00:08:19,470
在你训练期间或者训练之后
after your training or during your training,
196
00:08:19,470 --> 00:08:21,420
如何将模型上传到 Hub,
you can upload models to the Hub,
197
00:08:21,420 --> 00:08:22,440
可以在那里保存存储点,
you can checkpoint there,
198
00:08:22,440 --> 00:08:24,240
可以从那里恢复训练,
you can resume training from there,
199
00:08:24,240 --> 00:08:26,790
以及从所上传模型
and you can get inference results
200
00:08:26,790 --> 00:08:28,384
获得推理结果。
from the models you've uploaded.
201
00:08:28,384 --> 00:08:31,084
谢谢大家,希望在以后的视频再会。
So thank you, and I hope to see you in a future video.
202
00:08:32,852 --> 00:08:34,935
(嗖嗖)
(swoosh)