pai-python-sdk/training/tensorboard/tensorboard.ipynb (261 lines of code) (raw):

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Visualizing the Training Process with TensorBoard\n", "\n", "TensorBoard is a tool for tracking, visualizing, and analyzing model training runs and their results. It provides a range of visualization features and works with machine learning frameworks such as PyTorch, TensorFlow, Keras, Hugging Face Transformers, and ModelScope, helping users understand how a model trains and how well it performs.\n", "\n", "PAI provides a TensorBoard service: users can create a TensorBoard application on PAI to view the TensorBoard logs written by training jobs.\n", "\n", "This notebook uses a PyTorch training job as an example to show how to track and visualize model training with TensorBoard on PAI.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Billing\n", "\n", "This example uses the following cloud products and incurs the corresponding charges:\n", "\n", "- PAI-DLC: runs the training job. For billing details, see [Billing of PAI-DLC](https://help.aliyun.com/zh/pai/product-overview/billing-of-dlc).\n", "- OSS: stores the models produced by the training job, the training code, the TensorBoard logs, and related files. For billing details, see [OSS billing overview](https://help.aliyun.com/zh/oss/product-overview/billing-overview).\n", "\n", "\n", "> By joining the free trial for cloud products and using the **designated instance types** to submit training jobs or deploy inference services, you can try PAI for free. For details, see [PAI free trial](https://help.aliyun.com/zh/pai/product-overview/free-quota-for-new-users).\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## Installing and Configuring the SDK\n", "\n", "First, install the PAI Python SDK, which is required to run this example." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "!python -m pip install --upgrade pai" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "The SDK needs an AccessKey for accessing Alibaba Cloud services, as well as the workspace and OSS bucket to use. After installing the PAI SDK, run the following command in a **terminal** and follow the prompts to configure the credentials, workspace, and other settings.\n", "\n", "\n", "```shell\n", "\n", "# Run the following command in a terminal.\n", "\n", "python -m pai.toolkit.config\n", "\n", "```\n", "\n", "The following code verifies that the configuration has taken effect." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pai\n", "from pai.session import get_default_session\n", "\n", "print(pai.__version__)\n", "\n", "sess = get_default_session()\n", "\n", "# Check that a workspace has been configured\n", "assert sess.workspace_name is not None\n", "print(sess.workspace_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Submitting the Training Job" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, prepare a training script that writes TensorBoard logs using PyTorch's TensorBoard utility.\n", "\n", "\n", "> For how to use the TensorBoard utilities provided by PyTorch, see the [torch.utils.tensorboard documentation](https://pytorch.org/docs/stable/tensorboard.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!mkdir -p src" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "TensorBoard must be installed in the image before the script runs. Add a ``requirements.txt`` file to the training directory to list the third-party libraries to install." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile src/requirements.txt\n", "\n", "tensorboard" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile src/run.py\n", "\n", "import os\n", "\n", "import torch\n", "from torch.utils.tensorboard import SummaryWriter\n", "\n", "\n", "# Read the TensorBoard output path from the environment variable; on PAI it defaults to /ml/output/tensorboard/\n", "tb_log_dir = os.environ.get(\"PAI_OUTPUT_TENSORBOARD\")\n", "print(f\"TensorBoard log dir: {tb_log_dir}\")\n", "writer = SummaryWriter(log_dir=tb_log_dir)\n", "\n", "def train_model(num_epochs):\n", "    # Synthetic linear data: y = -5x plus Gaussian noise\n", "    x = torch.arange(-5, 5, 0.1).view(-1, 1)\n", "    y = -5 * x + 0.1 * torch.randn(x.size())\n", "\n", "    model = torch.nn.Linear(1, 1)\n", "    criterion = torch.nn.MSELoss()\n", "    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)\n", "\n", "    for epoch in range(num_epochs):\n", "        y1 = model(x)\n", "        loss = criterion(y1, y)\n", "        # Log the training loss for this epoch\n", "        writer.add_scalar(\"Loss/train\", loss.item(), epoch)\n", "        optimizer.zero_grad()\n", "        loss.backward()\n", "        optimizer.step()\n", "\n", "if __name__ == \"__main__\":\n", "    train_model(100)\n", "    # Make sure all pending events are written to disk\n", "    writer.flush()\n", "    writer.close()\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from pai.estimator import Estimator\n", "from pai.image import retrieve\n", "\n", "\n", "est = Estimator(\n", "    command=\"python run.py\",\n", "    source_dir=\"./src\",\n", "    image_uri=retrieve(\"PyTorch\", \"latest\").image_uri,\n", "    instance_type=\"ecs.c6.large\",\n", ")\n", "\n", "# Submit the training job without waiting for it to finish\n", "est.fit(wait=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Monitoring Training with a TensorBoard Application" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Start a TensorBoard application on PAI to view the TensorBoard logs written by the training job submitted through the Estimator." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tb = est.tensorboard()\n", "\n", "# URL of the TensorBoard application\n", "print(tb.app_uri)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When you are done, delete the TensorBoard application." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tb.delete()" ] } ], "metadata": { "kernelspec": { "display_name": "base", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 2 }