subtitles/zh-CN/34_the-push-to-hub-api-(tensorflow).srt

1 00:00:00,587 --> 00:00:02,670 (嗖嗖) (swoosh) 2 00:00:05,100 --> 00:00:07,080 - [旁白] 嗨,本视频将带领大家了解 - [Narrator] Hi, this is going to be a video 3 00:00:07,080 --> 00:00:09,420 push_to_hub API 适用于 about the push_to_hub API 4 00:00:09,420 --> 00:00:10,670 Tensorflow 和 Keras 的版本。 for Tensorflow and Keras. 5 00:00:11,820 --> 00:00:14,850 那么首先,我们打开 notebook。 So, to get started, we'll open up our notebook. 6 00:00:14,850 --> 00:00:16,920 你需要做的第一件事就是 And the first thing you'll need to do is log in to 7 00:00:16,920 --> 00:00:18,170 登录你的 HuggingFace 帐户, your HuggingFace account, 8 00:00:19,043 --> 00:00:20,663 例如调用 notebook_login 函数。 for example with the notebook login function. 9 00:00:21,570 --> 00:00:24,630 所以要使用它,你只需调用函数, So to use that, you simply call the function, 10 00:00:24,630 --> 00:00:26,010 随后会弹出窗口。 the popup will emerge. 11 00:00:26,010 --> 00:00:28,800 你需要输入你的用户名和密码, You will enter your username and password, 12 00:00:28,800 --> 00:00:31,425 这里我使用密码管理器拷贝密码, which I'm going to pull out of my password manager here, 13 00:00:31,425 --> 00:00:33,108 然后就登录了。 and you log in. 14 00:00:33,108 --> 00:00:35,670 接下来的两个单元格只是 The next two cells are just 15 00:00:35,670 --> 00:00:37,080 为训练做好准备。 getting everything ready for training. 16 00:00:37,080 --> 00:00:38,940 所以我们只是要加载一个数据集, So we're just going to load a dataset, 17 00:00:38,940 --> 00:00:41,100 我们要对数据集进行分词, we're going to tokenize that dataset, 18 00:00:41,100 --> 00:00:42,990 然后我们将加载我们的模型并 and then we're going to load our model and compile 19 00:00:42,990 --> 00:00:45,660 使用标准 Adam 优化器进行编译。 it with the standard Adam optimizer. 20 00:00:45,660 --> 00:00:47,560 现在我们会运行这些代码。 So I'm just going to run all of those. 21 00:00:49,830 --> 00:00:52,080 我们等几秒钟, We'll wait a few seconds, 22 00:00:52,080 --> 00:00:54,280 一切都应该为训练做好准备。 and everything should be ready for training. 23 00:00:57,983 --> 00:00:58,816 好的。 Okay. 24 00:00:58,816 --> 00:01:01,440 所以现在我们准备好训练了。 So now we're ready to train. 25 00:01:01,440 --> 00:01:03,030 我将向你展示两种方法 I'm going to show you the two ways 26 00:01:03,030 --> 00:01:05,130 你可以将你的模型推送到 Hub。 you can push your model to the Hub. 27 00:01:05,130 --> 00:01:08,190 所以第一个是 PushToHubCallback。 So the first is with the PushToHubCallback. 28 00:01:08,190 --> 00:01:10,107 所以 Keras 中的回调 So a callback in Keras 29 00:01:10,107 --> 00:01:13,710 是一个在训练过程中被定期调用的函数。 is a function that's called regularly during training. 30 00:01:13,710 --> 00:01:17,400 你可以将其设置为在一定数量的 step 之后被调用, You can set it to be called after a certain number of steps, 31 00:01:17,400 --> 00:01:21,427 或者在每个 epoch 调用,甚至只是在训练结束时调用一次。 or every epoch, or even just once at the end of training. 32 00:01:21,427 --> 00:01:25,080 所以 Keras 中有很多回调,例如, So a lot of callbacks in Keras, for example, 33 00:01:25,080 --> 00:01:28,050 在平稳期控制学习率衰减, control learning rate decaying on plateau, 34 00:01:28,050 --> 00:01:30,047 诸如此类的事情。 and things like that. 35 00:01:30,047 --> 00:01:32,520 所以这个回调,默认情况下, So this callback, by default, 36 00:01:32,520 --> 00:01:35,760 在每个 epoch 都会将你的模型保存到 Hub 一次。 will save your model to the Hub once every epoch. 37 00:01:35,760 --> 00:01:37,080 这真的非常有用, And that's really helpful, 38 00:01:37,080 --> 00:01:38,790 特别是如果你的训练时间很长, especially if your training is very long, 39 00:01:38,790 --> 00:01:40,800 因为那意味着你可以从那个存储点恢复训练, because that means you can resume from that save, 40 00:01:40,800 --> 00:01:43,290 这样你的模型就有了自动的云端保存。 so you get this automatic cloud-saving of your model.
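上面提到的几个准备单元格,大致可以写成下面这个最小示意(假设使用 bert-base-cased 检查点和 GLUE CoLA 数据集,批大小、学习率等超参数仅作示例,并非视频笔记本的原始代码):

from datasets import load_dataset
from huggingface_hub import notebook_login
from transformers import (
    AutoTokenizer,
    DataCollatorWithPadding,
    TFAutoModelForSequenceClassification,
)
import tensorflow as tf

# 登录 HuggingFace 帐户(会弹出输入凭据的窗口)
notebook_login()

# 加载 GLUE CoLA 数据集并分词(检查点名称仅为假设示例)
checkpoint = "bert-base-cased"
raw_datasets = load_dataset("glue", "cola")
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_fn(batch):
    return tokenizer(batch["sentence"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_fn, batched=True)

# 转成 tf.data.Dataset,用动态填充来整理每个批次
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
train_set = tokenized_datasets["train"].to_tf_dataset(
    columns=["input_ids", "attention_mask", "token_type_ids"],
    label_cols=["label"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

# 加载模型并用标准 Adam 优化器编译(未指定损失时由模型内部计算)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5))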
41 00:01:43,290 --> 00:01:45,027 你甚至可以利用 And you can even run inference 42 00:01:45,027 --> 00:01:47,730 该回调所上传的模型存储点(checkpoint) with the checkpoints of your model 43 00:01:47,730 --> 00:01:50,208 进行推断过程。 that have been uploaded by this callback. 44 00:01:50,208 --> 00:01:52,260 这意味着你可以, And that means you can, 45 00:01:52,260 --> 00:01:54,150 你懂的,运行一些测试输入 y'know, run some test inputs 46 00:01:54,150 --> 00:01:56,100 并实际查看你的模型在训练的各个阶段 and actually see how your model works 47 00:01:56,100 --> 00:01:57,990 是如何工作的, at various stages during training, 48 00:01:57,990 --> 00:01:59,540 这是一个非常好的功能。 which is a really nice feature. 49 00:02:00,390 --> 00:02:03,960 所以我们要添加 PushToHubCallback, So we're going to add the PushToHubCallback, 50 00:02:03,960 --> 00:02:05,670 它只需要几个参数。 and it takes just a few arguments. 51 00:02:05,670 --> 00:02:08,250 第一个参数是临时目录 So the first argument is the temporary directory 52 00:02:08,250 --> 00:02:10,260 文件在上传到 Hub 之前 that files are going to be saved to 53 00:02:10,260 --> 00:02:12,150 将被保存到该目录。 before they're uploaded to the Hub. 54 00:02:12,150 --> 00:02:14,127 第二个参数是分词器, The second argument is the tokenizer, 55 00:02:14,127 --> 00:02:15,808 第三个参数在这里 and the third argument here 56 00:02:15,808 --> 00:02:19,080 是关键字参数 hub_model_id。 is the keyword argument hub_model_id. 57 00:02:19,080 --> 00:02:21,330 也就是在 HuggingFace Hub 上 So that's the name it's going to be saved under 58 00:02:21,330 --> 00:02:23,006 它将被保存的名称。 on the HuggingFace Hub. 59 00:02:23,006 --> 00:02:26,267 你还可以上传到组织帐户 You can also upload to an organization account 60 00:02:26,267 --> 00:02:29,370 只需在带有斜杠的仓库名称之前 just by adding the organization name 61 00:02:29,370 --> 00:02:32,460 添加组织名称,就像这样。 before the repository name with a slash, like this. 62 00:02:32,460 --> 00:02:34,020 所以你可能没有权限 So you probably don't have permissions 63 00:02:34,020 --> 00:02:36,000 上传到 HuggingFace 组织, to upload to the HuggingFace organization, 64 00:02:36,000 --> 00:02:37,170 如果你有权限可以上传, if you do please file a bug 65 00:02:37,170 --> 00:02:38,973 请提交 bug 并尽快通知我们。 and let us know extremely urgently. 66 00:02:40,830 --> 00:02:42,960 但如果你确实有权限访问你自己的组织, But if you do have access to your own organization, 67 00:02:42,960 --> 00:02:44,730 那么你可以使用相同的方法 then you can use that same approach 68 00:02:44,730 --> 00:02:46,650 将模型上传到他们的帐户 to upload models to their account 69 00:02:46,650 --> 00:02:50,760 而不是你自己的个人模型集。 instead of to your own personal set of models. 70 00:02:50,760 --> 00:02:53,520 所以,一旦你实现了回调, So, once you've made your callback, 71 00:02:53,520 --> 00:02:56,310 你只需在调用 model.fit 时 you simply add it to the callbacks list 72 00:02:56,310 --> 00:02:58,080 将它添加到回调列表。 when you're calling model.fit. 73 00:02:58,080 --> 00:03:01,110 所有内容均为从那里上传, And everything is uploaded for you from there, 74 00:03:01,110 --> 00:03:02,610 没有什么可担心的。 there's nothing else to worry about. 75 00:03:02,610 --> 00:03:04,530 上传模型的第二种方式, The second way to upload a model, though, 76 00:03:04,530 --> 00:03:07,020 就是调用 model.push_to_hub。 is to call model.push_to_hub. 77 00:03:07,020 --> 00:03:09,086 所以这更像是一种一次性的方法。 So this is more of a once-off method. 78 00:03:09,086 --> 00:03:11,550 在训练期间不会定期调用它。 It's not called regularly during training. 79 00:03:11,550 --> 00:03:13,680 你可以随时手动调用它 You can just call this manually whenever you want to 80 00:03:13,680 --> 00:03:15,240 将模型上传到 hub。 upload a model to the hub. 
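把上面两种上传方式写成代码,大致是下面这个样子(沿用前面示意中的 model、tokenizer 和 train_set;目录名、仓库名和 epoch 数都只是假设的示例):

from transformers.keras_callbacks import PushToHubCallback

# 方式一:训练期间由回调定期(默认每个 epoch)把模型推送到 Hub
callback = PushToHubCallback(
    output_dir="bert-fine-tuned-cola",           # 上传前文件先保存到的本地目录
    tokenizer=tokenizer,                          # 分词器会随模型一起上传
    hub_model_id="bert-fine-tuned-cola",          # 模型在 Hub 上保存的名称
    # hub_model_id="my-org/bert-fine-tuned-cola"  # 上传到组织帐户时,在仓库名前加 "组织名/"
)
model.fit(train_set, epochs=2, callbacks=[callback])

# 方式二:训练结束后手动调用一次,确保推送的是最终版本
model.push_to_hub("bert-fine-tuned-cola", commit_message="End of training")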
81 00:03:15,240 --> 00:03:18,949 所以我们建议在训练结束后运行这个, So we recommend running this after the end of training, 82 00:03:18,949 --> 00:03:21,870 只是为了确保你有一条 commit 消息 just to make sure that you have a commit message 83 00:03:21,870 --> 00:03:24,060 保证这是训练结束时 to guarantee that this was the final version 84 00:03:24,060 --> 00:03:26,143 模型的最终版本。 of the model at the end of training. 85 00:03:26,143 --> 00:03:27,930 它只是为了确保,你懂的, And it just makes sure that, you know, 86 00:03:27,930 --> 00:03:30,480 你当前正在使用的是最终训练结束的模型 you're working with the definitive end-of-training model 87 00:03:30,480 --> 00:03:32,190 而不是意外使用了训练中途 and not accidentally using a checkpoint 88 00:03:32,190 --> 00:03:34,224 某个存储点的版本。 from somewhere along the way. 89 00:03:34,224 --> 00:03:37,173 所以我要运行这两个单元格。 So I'm going to run both of these cells. 90 00:03:39,299 --> 00:03:41,716 然后我要在这里剪视频, And then I'm going to cut the video here, 91 00:03:41,716 --> 00:03:43,080 只是因为训练需要几分钟。 just because training is going to take a couple of minutes. 92 00:03:43,080 --> 00:03:44,580 所以我会直接跳到最后, So I'll skip forward to the end of that, 93 00:03:44,580 --> 00:03:46,320 当模型全部上传后, when the models have all been uploaded, 94 00:03:46,320 --> 00:03:48,390 我会告诉你 and I'm gonna show you how you can 95 00:03:48,390 --> 00:03:50,010 如何访问 Hub 中的模型, access the models in the Hub, 96 00:03:50,010 --> 00:03:52,713 以及其他能够在那里利用模型做到的事情。 and the other things you can do with them from there. 97 00:03:55,440 --> 00:03:56,700 好的,我们回来了, Okay, we're back, 98 00:03:56,700 --> 00:03:59,160 我们的模型已上传。 and our model was uploaded. 99 00:03:59,160 --> 00:04:00,750 既通过 PushToHubCallback, Both by the PushToHubCallback 100 00:04:00,750 --> 00:04:04,251 也通过我们在训练后对 model.push_to_hub 的调用完成了上传。 and also by our call to model.push_to_hub after training. 101 00:04:04,251 --> 00:04:05,910 所以一切看起来都很好。 So everything's looking good. 102 00:04:05,910 --> 00:04:09,960 那么现在如果我们转到我在 HuggingFace 上的个人资料, So now if we drop over to my profile on HuggingFace, 103 00:04:09,960 --> 00:04:12,630 只需点击下拉菜单中的 profile 按钮 and you can get there just by clicking the profile button 104 00:04:12,630 --> 00:04:13,680 就可以进入。 in the dropdown. 105 00:04:13,680 --> 00:04:16,860 我们可以看到 bert-fine-tuned-cola 模型在这里, We can see that the bert-fine-tuned-cola model is here, 106 00:04:16,860 --> 00:04:18,369 并于 3 分钟前更新。 and was updated 3 minutes ago. 107 00:04:18,369 --> 00:04:20,520 所以它总是在你的列表的顶部, So it'll always be at the top of your list, 108 00:04:20,520 --> 00:04:23,340 因为它们是按最近更新时间排序的。 because they're sorted by how recently they were updated. 109 00:04:23,340 --> 00:04:25,740 我们可以立即开始查询我们的模型。 And we can start querying our model immediately. 110 00:04:30,564 --> 00:04:32,939 所以我们训练的数据集 So the dataset we were training on 111 00:04:32,939 --> 00:04:34,320 是 Glue CoLA 数据集, is the Glue CoLA dataset, 112 00:04:34,320 --> 00:04:36,210 CoLA 是 Corpus of Linguistic Acceptability(语言可接受性语料库) and CoLA is an acronym standing for 113 00:04:36,210 --> 00:04:39,420 的首字母缩写。 the Corpus of Linguistic Acceptability. 114 00:04:39,420 --> 00:04:42,480 所以这意味着我们训练这个模型来判断 So what that means is the model is being trained to decide 115 00:04:42,480 --> 00:04:46,350 一个句子在语法或语言上是否正确, if a sentence is grammatically or linguistically okay, 116 00:04:46,350 --> 00:04:48,171 还是说它存在问题。 or if there's a problem with it. 117 00:04:48,171 --> 00:04:52,890 例如,我们可以说,"This is a legitimate sentence." For example, we could say, "This is a legitimate sentence."
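如果不想在网页的推理窗口里输入,也可以在本地用 pipeline 做同样的查询,效果类似(用户名 / 仓库名仅为假设示例):

from transformers import pipeline

# 直接从 Hub 加载刚上传的模型做文本分类
classifier = pipeline("text-classification", model="my-username/bert-fine-tuned-cola")
print(classifier("This is a legitimate sentence."))
# 在下一步修正 id2label 之前,返回的标签只会是 LABEL_0 / LABEL_1 这类默认名称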
118 00:04:52,890 --> 00:04:54,180 并且希望它能判断出 And hopefully it realizes that 119 00:04:54,180 --> 00:04:56,080 这实际上是一个合理的句子。 this is in fact a legitimate sentence. 120 00:04:57,630 --> 00:05:00,240 所以当你第一次调用模型时, So it might take a couple of seconds for the model to load 121 00:05:00,240 --> 00:05:03,060 加载模型可能需要几秒钟。 when you call it for the first time. 122 00:05:03,060 --> 00:05:05,960 所以我可能会在这里从这个视频中剪掉几秒钟。 So I might cut a couple of seconds out of this video here. 123 00:05:07,860 --> 00:05:09,060 好的,我们回来了。 Okay, we're back. 124 00:05:09,060 --> 00:05:12,407 所以模型加载了,我们得到了一个输出结果, So the model loaded and we got an output, 125 00:05:12,407 --> 00:05:14,340 但这里有一个明显的问题。 but there's an obvious problem here. 126 00:05:14,340 --> 00:05:16,888 这些标签并没有真正告诉我们 So these labels aren't really telling us 127 00:05:16,888 --> 00:05:19,740 模型实际为这个输入的句子 what categories the model has actually assigned 128 00:05:19,740 --> 00:05:21,655 分配了哪些类别。 to this input sentence. 129 00:05:21,655 --> 00:05:23,520 所以如果我们想解决这个问题, So if we want to fix that, 130 00:05:23,520 --> 00:05:26,010 我们要确保模型配置 we want to make sure the model config 131 00:05:26,010 --> 00:05:28,980 针对每个标签类别都有正确的名称, has the correct names for each of the label classes, 132 00:05:28,980 --> 00:05:30,707 然后我们要上传该配置。 and then we want to upload that config. 133 00:05:30,707 --> 00:05:32,220 所以我们可以在这里实现。 So we can do that down here. 134 00:05:32,220 --> 00:05:34,050 要获取标签名称, To get the label names, 135 00:05:34,050 --> 00:05:36,547 我们可以从我们加载的数据集中得到它, we can get that from the dataset we loaded, 136 00:05:36,547 --> 00:05:39,627 也就是数据集的 features 属性。 from the features attribute it has. 137 00:05:39,627 --> 00:05:42,217 然后我们可以创建字典 And then we can create dictionaries 138 00:05:42,217 --> 00:05:44,865 “id2label” 和 “label2id”, "id2label" and "label2id", 139 00:05:44,865 --> 00:05:47,452 并将它们设置到模型配置中。 and just assign them to the model config. 140 00:05:47,452 --> 00:05:50,790 然后我们可以推送我们更新的配置, And then we can just push our updated config, 141 00:05:50,790 --> 00:05:54,690 这将覆盖 Hub 仓库中的现有配置。 and that'll override the existing config in the Hub repo. 142 00:05:54,690 --> 00:05:56,368 所以这已经完成了。 So that's just been done. 143 00:05:56,368 --> 00:05:58,320 那么现在,如果我们回到这里, So now, if we go back here, 144 00:05:58,320 --> 00:06:00,000 我要用一个稍微不同的句子 I'm going to use a slightly different sentence 145 00:06:00,000 --> 00:06:03,540 因为句子的输出有时会被缓存。 because the outputs for sentences are sometimes cached. 146 00:06:03,540 --> 00:06:06,030 所以,如果我们想生成新的结果 And so, if we want to generate new results 147 00:06:06,030 --> 00:06:07,590 我将使用一些稍微不同的东西。 I'm going to use something slightly different. 148 00:06:07,590 --> 00:06:09,783 那么,让我们尝试换一个不正确的句子。 So let's try an incorrect sentence. 149 00:06:10,830 --> 00:06:12,640 所以这不是有效的英语语法 So this is not valid English grammar 150 00:06:13,538 --> 00:06:15,030 希望模型能发现这一点。 and hopefully the model will see that. 151 00:06:15,030 --> 00:06:16,958 它会在这里重新加载, It's going to reload here, 152 00:06:16,958 --> 00:06:18,630 所以我会在这里剪掉几秒钟, so I'm going to cut a couple of seconds here, 153 00:06:18,630 --> 00:06:20,933 然后我们会看到模型会返回什么结果。 and then we'll see what the model is going to say. 154 00:06:22,860 --> 00:06:23,820 好的。 Okay. 155 00:06:23,820 --> 00:06:26,580 所以这个模型,它的置信度不是很高, So the model, it's confidence isn't very good, 156 00:06:26,580 --> 00:06:28,830 因为我们还没有真正优化 because of course we didn't really optimize 157 00:06:28,830 --> 00:06:30,630 我们的超参数。 our hyperparameters at all.
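这几步对应的代码大致如下(假设沿用前面示意中的 raw_datasets 和 model 变量,仓库名同样只是示例):

# 从数据集的 features 属性中取出标签名称
label_names = raw_datasets["train"].features["label"].names  # ["unacceptable", "acceptable"]

# 构建 id2label / label2id 两个字典并写入模型配置
model.config.id2label = {i: label for i, label in enumerate(label_names)}
model.config.label2id = {label: i for i, label in enumerate(label_names)}

# 只推送更新后的配置,覆盖 Hub 仓库中已有的 config.json
model.config.push_to_hub("bert-fine-tuned-cola")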
158 00:06:30,630 --> 00:06:32,190 但它返回的结果显示这句话 But it has decided that this sentence 159 00:06:32,190 --> 00:06:35,094 不可接受的程度大于可接受的程度。 is more likely to be unacceptable than acceptable. 160 00:06:35,094 --> 00:06:38,160 如果我们更努力地训练 Presumably if we tried a bit harder with training 161 00:06:38,160 --> 00:06:40,080 我们可以获得更低的验证集 loss, we could get a much lower validation loss, 162 00:06:40,080 --> 00:06:43,830 因此模型的预测会更准确。 and therefore the model's predictions would be more precise. 163 00:06:43,830 --> 00:06:46,260 但是让我们再试一次我们原来的句子。 But let's try our original sentence again. 164 00:06:46,260 --> 00:06:49,140 当然,因为缓存的问题, Of course, because of the caching issue, 165 00:06:49,140 --> 00:06:52,740 我们看到原来的答案没有改变。 we're seeing that the original answers are unchanged. 166 00:06:52,740 --> 00:06:55,196 所以让我们尝试一个不同的,有效的句子。 So let's try a different, valid sentence. 167 00:06:55,196 --> 00:06:58,767 所以让我们试试,“This is a valid English sentence”。 So let's try, "This is a valid English sentence". 168 00:07:00,150 --> 00:07:02,100 我们看到现在模型的结果是正确的 And we see that now the model correctly decides 169 00:07:02,100 --> 00:07:04,290 它的可接受度非常高, that it has a very high probability of being acceptable, 170 00:07:04,290 --> 00:07:06,900 并且被拒绝的可能性非常低。 and a very low probability of being unacceptable. 171 00:07:06,900 --> 00:07:09,930 所以你可以使用这个 inference API So you can use this inference API 172 00:07:09,930 --> 00:07:12,810 甚至可用于训练期间上传的存储点, even with the checkpoints that are uploaded during training, 173 00:07:12,810 --> 00:07:14,546 能够看到在每个训练的 epoch 时 so it can be very interesting to see how 174 00:07:14,546 --> 00:07:17,690 不同的样本输入所输出的预测结果 the model's predictions for sample inputs change 175 00:07:17,690 --> 00:07:20,579 也是非常有意思的。 with each epoch of training. 176 00:07:20,579 --> 00:07:23,370 另外,你也可以访问 Also, the model we've uploaded 177 00:07:23,370 --> 00:07:25,740 我们上传的模型 is going to be accessible to you and, 178 00:07:25,740 --> 00:07:28,046 如果公开分享,其它任何人都可以访问。 if it's shared publicly, to anyone else. 179 00:07:28,046 --> 00:07:29,788 所以如果你想加载那个模型, So if you want to load that model, 180 00:07:29,788 --> 00:07:32,500 你或其他任何人所需要做的 all you or anyone else needs to do 181 00:07:34,290 --> 00:07:37,440 就是将它加载到 pipeline, is just to load it in either a pipeline, 182 00:07:37,440 --> 00:07:40,925 或者你可以使用其它方式加载它,例如, or you can just load it with, for example, 183 00:07:40,925 --> 00:07:43,203 TFAutoModelForSequenceClassification。 TFAutoModelForSequenceClassification. 184 00:07:46,920 --> 00:07:49,989 然后对于名称,你只需传入 And then for the name you would just simply pass 185 00:07:49,989 --> 00:07:53,325 想要上传的 repo 的路径即可。 the path to the repo you want to upload. 186 00:07:53,325 --> 00:07:55,890 或者下载,不好意思。 Or to download, excuse me. 187 00:07:55,890 --> 00:07:58,710 所以如果我想再次使用这个模型, So if I want to use this model again, 188 00:07:58,710 --> 00:08:00,667 如果我想从 hub 加载它, if I want to load it from the hub, 189 00:08:00,667 --> 00:08:01,763 仅需运行这一行代码。 I just run this one line of code. 190 00:08:02,813 --> 00:08:03,773 该模型将被下载。 The model will be downloaded. 191 00:08:07,757 --> 00:08:10,080 而且,运气好的话,它会准备好 And, with any luck, it'll be ready to 192 00:08:10,080 --> 00:08:12,450 对不同的数据集进行微调,进行预测, fine-tune on a different dataset, make predictions with, 193 00:08:12,450 --> 00:08:14,340 或者做任何你想做的事情。 or do anything else you wanna do. 
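从 Hub 重新加载模型时,代码只有一两行(仓库名仅为假设示例;放进 pipeline 的用法见前面的示意):

from transformers import TFAutoModelForSequenceClassification

# 任何有权限访问该仓库的人都可以这样下载模型,继续微调或直接推理
model = TFAutoModelForSequenceClassification.from_pretrained("my-username/bert-fine-tuned-cola")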
194 00:08:14,340 --> 00:08:17,700 以上内容就是关于 So that was a quick overview of how, 195 00:08:17,700 --> 00:08:19,470 在你训练期间或者训练之后 after your training or during your training, 196 00:08:19,470 --> 00:08:21,420 如何将模型上传到 Hub, you can upload models to the Hub, 197 00:08:21,420 --> 00:08:22,440 可以添加存储点, you can checkpoint there, 198 00:08:22,440 --> 00:08:24,240 可以恢复训练, you can resume training from there, 199 00:08:24,240 --> 00:08:26,790 以及从所上传模型 and you can get inference results 200 00:08:26,790 --> 00:08:28,384 获得推理结果。 from the models you've uploaded. 201 00:08:28,384 --> 00:08:31,084 谢谢大家,希望在以后的视频再会。 So thank you, and I hope to see you in a future video. 202 00:08:32,852 --> 00:08:34,935 (嗖嗖) (swoosh)