pai-python-sdk/training/tensorboard/tensorboard.ipynb (261 lines of code) (raw):
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 使用TensorBoard可视化训练过程\n",
"\n",
"TensorBoard是一个用于追踪、可视化、分析模型训练过程和训练结果的工具,它提供了多种可视化功能,可以与PyTorch、TensorFlow、Keras、Huggingface transformers、ModelScope等机器学习框架一起使用,帮助用户了解模型的训练过程和性能。\n",
"\n",
"PAI提供了TensorBoard服务,支持用户在PAI创建TensorBoard应用,用于查看训练作业输出的TensorBoard日志。\n",
"\n",
"本文档将以不同的机器学习框架为示例,展示如何在PAI使用TensorBoard追踪和可视化模型训练过程。\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## 费用说明\n",
"\n",
"本示例将会使用以下云产品,并产生相应的费用账单:\n",
"\n",
"- PAI-DLC:运行训练任务,详细计费说明请参考[PAI-DLC计费说明](https://help.aliyun.com/zh/pai/product-overview/billing-of-dlc)\n",
"- OSS:存储训练任务输出的模型、训练代码、TensorBoard日志等,详细计费说明请参考[OSS计费概述](https://help.aliyun.com/zh/oss/product-overview/billing-overview)\n",
"\n",
"\n",
"> 通过参与云产品免费试用,使用**指定资源机型**提交训练作业或是部署推理服务,可以免费试用PAI产品,具体请参考[PAI免费试用](https://help.aliyun.com/zh/pai/product-overview/free-quota-for-new-users)。\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"## 安装和配置SDK\n",
"\n",
"我们需要首先安装PAI Python SDK以运行本示例。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"!python -m pip install --upgrade pai"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"SDK需要配置访问阿里云服务需要的AccessKey,以及当前使用的工作空间和OSS Bucket。在PAI SDK安装之后,通过在 **命令行终端** 中执行以下命令,按照引导配置密钥、工作空间等信息。\n",
"\n",
"\n",
"```shell\n",
"\n",
"# 以下命令,请在 命令行终端 中执行.\n",
"\n",
"python -m pai.toolkit.config\n",
"\n",
"```\n",
"\n",
"我们可以通过以下代码验证配置是否已生效。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pai\n",
"from pai.session import get_default_session\n",
"\n",
"print(pai.__version__)\n",
"\n",
"sess = get_default_session()\n",
"\n",
"# 获取配置的工作空间信息\n",
"assert sess.workspace_name is not None\n",
"print(sess.workspace_name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 提交训练任务"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"我们首先需要准备训练脚本,使用将PyTorch的TensorBoard utility记录TensorBoard日志。\n",
"\n",
"\n",
"> PyTorch提供的TensorBoard utilities的使用可以见文档: [torch.utils.tensorboard 文档](https://pytorch.org/docs/stable/tensorboard.html)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!mkdir -p src"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"镜像里需要先安装TensorBoard,可以在训练目录中准备 ``requirements.txt`` 指定需要按照的第三方库。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile src/requirements.txt\n",
"\n",
"tensorboard"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile src/run.py\n",
"\n",
"import os\n",
"\n",
"import torch\n",
"from torch.utils.tensorboard import SummaryWriter\n",
"\n",
"\n",
"# 通过环境变量获取TensorBoard输出路径,默认为 /ml/output/tensorboard/\n",
"tb_log_dir = os.environ.get(\"PAI_OUTPUT_TENSORBOARD\")\n",
"print(f\"TensorBoard log dir: {tb_log_dir}\")\n",
"writer = SummaryWriter(log_dir=tb_log_dir)\n",
"\n",
"def train_model(iter):\n",
"\n",
"\n",
" x = torch.arange(-5, 5, 0.1).view(-1, 1)\n",
" y = -5 * x + 0.1 * torch.randn(x.size())\n",
"\n",
" model = torch.nn.Linear(1, 1)\n",
" criterion = torch.nn.MSELoss()\n",
" optimizer = torch.optim.SGD(model.parameters(), lr = 0.1)\n",
"\n",
" for epoch in range(iter):\n",
" y1 = model(x)\n",
" loss = criterion(y1, y)\n",
" writer.add_scalar(\"Loss/train\", loss, epoch)\n",
" optimizer.zero_grad()\n",
" loss.backward()\n",
" optimizer.step()\n",
"\n",
"if __name__ == \"__main__\":\n",
" train_model(100)\n",
" writer.flush()\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pai.estimator import Estimator\n",
"from pai.image import retrieve\n",
"\n",
"\n",
"est = Estimator(\n",
" command=\"python run.py\",\n",
" source_dir=\"./src\",\n",
" image_uri=retrieve(\"PyTorch\", \"latest\").image_uri,\n",
" instance_type=\"ecs.c6.large\",\n",
")\n",
"\n",
"est.fit(wait=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 使用TensorBoard应用监控训练"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"在PAI启动一个TensorBoard应用,查看使用Estimator的训练作业写出的TensorBoard日志。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tb = est.tensorboard()\n",
"\n",
"print(tb.app_uri)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"使用完成之后,删除TensorBoard应用"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tb.delete()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}