pai-python-sdk/training/train_with_experiment/train_with_experiment.ipynb (393 lines of code) (raw):

{ "cells": [ { "cell_type": "markdown", "id": "c8340fe748443685", "metadata": {}, "source": [ " # 通过实验管理追踪和对比PAI-QuickStart模型训练任务\n", "\n", "模型训练通常是一个需要多次尝试和实验的过程,开发者需要通过配置使用不同的数据集、训练超参,或是不同的预训练模型进行迭代,监控训练是否收敛,比对多次训练任务的指标,从而选择出效果更好的模型,这通常依赖于使用实验管理工具来实现。\n", "\n", "[TensorBoard](https://www.tensorflow.org/tensorboard?hl=zh-cn)是一个常用的追踪和可视化工具,可以用于记录并展示模型训练过程中的损失和精度等标量数据。PAI提供TensorBoard服务,支持开发者在云上运行一个TensorBoard实例,监控训练。\n", "\n", "通过实验管理,开发者可以方便的在同一个TensorBoard实例上查看对比同一个实验下不同训练任务的指标,不再需要手动管理TensorBoard日志,\n", "\n", "本示例将以PAI-QuickStart提供的预训练模型的微调任务为例,演示如何通过PAI Python SDK使用PAI提供的实验能力,来组织和对比模型微调任务指标。\n", "\n", "\n", "## 费用说明\n", "\n", "本示例将会使用以下云产品,并产生相应的费用账单:\n", "\n", "- PAI-DLC:运行训练任务,详细计费说明请参考[PAI-DLC计费说明](https://help.aliyun.com/zh/pai/product-overview/billing-of-dlc)\n", "- OSS:存储训练任务输出的模型、训练代码、TensorBoard日志等,详细计费说明请参考[OSS计费概述](https://help.aliyun.com/zh/oss/product-overview/billing-overview)\n", "\n", "\n", "> 通过参与云产品免费试用,使用**指定资源机型**提交训练作业或是部署推理服务,可以免费试用PAI产品,具体请参考[PAI免费试用](https://help.aliyun.com/zh/pai/product-overview/free-quota-for-new-users)。\n", "\n" ] }, { "cell_type": "markdown", "id": "8e80489c7d269c66", "metadata": { "ExecuteTime": { "end_time": "2024-05-10T10:53:39.498381Z", "start_time": "2024-05-10T10:53:39.467495Z" } }, "source": [ "## 安装和配置SDK\n", "\n", "我们需要安装PAI Python SDK以运行本示例。" ] }, { "cell_type": "code", "execution_count": null, "id": "8064d95fb7663d96", "metadata": {}, "outputs": [], "source": [ "!python -m pip install --upgrade pai" ] }, { "cell_type": "markdown", "id": "9e92e2e988d951a", "metadata": {}, "source": [ "SDK需要配置访问阿里云服务需要的AccessKey,以及当前使用的工作空间和OSS Bucket。在PAI SDK安装之后,通过在 **命令行终端** 中执行以下命令,按照引导配置密钥、工作空间等信息。\n", "\n", "```shell\n", "\n", "# 以下命令,请在 命令行终端 中执行.\n", "\n", "python -m pai.toolkit.config\n", "\n", "```" ] }, { "cell_type": "markdown", "id": "7bc2facf0611217c", "metadata": {}, "source": [ "我们可以通过以下代码验证配置是否已生效。" ] }, { "cell_type": "code", "execution_count": null, "id": "70dacf7e9f406070", "metadata": {}, "outputs": [], "source": [ "import pai\n", "from pai.session import get_default_session\n", "\n", "print(pai.__version__)\n", "\n", "sess = get_default_session()\n", "\n", "# 获取配置的工作空间信息\n", "assert sess.workspace_name is not None\n", "print(sess.workspace_name)" ] }, { "cell_type": "markdown", "id": "cb8ce4373d3e9903", "metadata": {}, "source": [ "## 创建实验\n", "\n", "首先,我们需要创建一个实验。指定实验名称和输出路径。" ] }, { "cell_type": "code", "execution_count": null, "id": "3b5310a914fd9fa6", "metadata": {}, "outputs": [], "source": [ "from pai.experiment import Experiment\n", "\n", "# 指定实验名称,同一个工作空间中,实验名称必须是唯一的\n", "experiment_name = \"test_experiment3\"\n", "\n", "# 使用工作空间默认Bucket与实验名称组合作为实验的输出路径,如果需要指定其他路径,请修改。\n", "# 目前仅支持OSS,请确保拥有对应Bucket的读写权限。\n", "default_bucket_name = sess.oss_bucket_name\n", "endpoint = sess.oss_endpoint\n", "artifact_uri = f\"oss://{default_bucket_name}.{endpoint}/{experiment_name}/\"\n", "\n", "# 创建实验\n", "experiment = Experiment.create(name=experiment_name, artifact_uri=artifact_uri)\n", "\n", "# 查看实验ID\n", "print(experiment.experiment_id)" ] }, { "cell_type": "markdown", "id": "e0b7ef42c7a641bc", "metadata": {}, "source": [ "查看实验默认的TensorBoard日志存储路径。" ] }, { "cell_type": "code", "execution_count": null, "id": "e9ab341ec875b0c3", "metadata": {}, "outputs": [], "source": [ "print(experiment.tensorboard_data())" ] }, { "cell_type": "markdown", "id": "a4ce9a1a54d40663", "metadata": {}, "source": [ "## 提交训练任务到实验" ] }, { "cell_type": "markdown", "id": "781934b47fce338d", "metadata": {}, "source": [ "PAI-QuickStart提供了大量预训练模型,包括生成式AI、计算机视觉等多个方向及领域。我们可以通过PAI-Python-SDK获取模型列表并对其进行训练,详细的操作方式请参考:https://gallery.pai-ml.com/#/preview/paiPythonSDK/training/pretrained_model。\n", "\n", "在本示例中,我们将以`Bert`模型为例,展示如何使用实验聚合多个模型微调训练任务并对比其训练指标。" ] }, { "cell_type": "code", "execution_count": null, "id": "fdaaa75ab4a3f9d8", "metadata": {}, "outputs": [], "source": [ "from pai.model import RegisteredModel\n", "import json\n", "from pai.estimator import AlgorithmEstimator\n", "from pai.experiment import ExperimentConfig\n", "\n", "# 获取PAI模型仓库中名称为bert-base-uncased模型\n", "m = RegisteredModel(\n", " model_name=\"bert-base-uncased\",\n", " model_provider=\"pai\",\n", ")\n", "\n", "# 通过注册模型的配置,获取相应的预训练算法\n", "est: AlgorithmEstimator = m.get_estimator(\n", " # 指定训练机器的规格\n", " instance_type=\"ecs.gn7i-c8g1.2xlarge\",\n", " experiment_config=ExperimentConfig(\n", " experiment_id=experiment.experiment_id,\n", " ),\n", ")\n", "\n", "# 查看算法的超参定义\n", "print(json.dumps(est.hyperparameter_definitions, indent=4))\n", "\n", "# 查看算法默认的超参信息\n", "print(\"before\")\n", "print(est.hyperparameters)" ] }, { "cell_type": "markdown", "id": "852a312331eeabb6", "metadata": {}, "source": [ "修改算法的超参配置,提交训练任务。" ] }, { "cell_type": "code", "execution_count": null, "id": "78be8d8cd6a4c3df", "metadata": {}, "outputs": [], "source": [ "est.set_hyperparameters(max_epochs=3, learning_rate=1.1e-5, save_step=10)\n", "\n", "print(\"after\")\n", "print(est.hyperparameters)\n", "\n", "# 获取默认训练输入\n", "default_inputs = m.get_estimator_inputs()\n", "\n", "# 创建训练任务\n", "est.fit(inputs=default_inputs, wait=False)" ] }, { "cell_type": "markdown", "id": "401a7fecbf2b2b23", "metadata": {}, "source": [ "再一次修改超参数配置,提交新的训练任务" ] }, { "cell_type": "code", "execution_count": null, "id": "a5046db12dfbbce", "metadata": {}, "outputs": [], "source": [ "# 调整超参配置\n", "est.set_hyperparameters(max_epochs=4, learning_rate=1.2e-5)\n", "\n", "print(\"after\")\n", "print(est.hyperparameters)\n", "\n", "# 创建新的训练任务\n", "est.fit(inputs=default_inputs, wait=False)" ] }, { "cell_type": "markdown", "id": "c3fbfeaf973f8cdc", "metadata": {}, "source": [ "## 通过实验的TensorBoard对比训练指标" ] }, { "cell_type": "markdown", "id": "67bdba17310bb254", "metadata": {}, "source": [ "我们可以使用PAI TensorBoard服务,实时的查看实验中所有任务的TensorBoard日志。" ] }, { "cell_type": "code", "execution_count": null, "id": "99c4fe6769b7dd95", "metadata": {}, "outputs": [], "source": [ "# 启动实验的TensorBoard应用\n", "tensorboard = experiment.tensorboard()\n", "\n", "# 查看TensorBoard的应用URL\n", "print(tensorboard.app_uri)" ] }, { "cell_type": "markdown", "id": "daf7e6b75ae04d8f", "metadata": {}, "source": [ "使用完成之后,删除TensorBoard应用。" ] }, { "cell_type": "code", "execution_count": null, "id": "4891040e57d185e2", "metadata": {}, "outputs": [], "source": [ "tensorboard.delete()" ] }, { "cell_type": "markdown", "id": "f603c1ec6aaed86b", "metadata": {}, "source": [ "我们也可以使用本地拉起的TensorBoard服务来查看TensorBoard日志。注意,TensorBoard日志是随着任务的运行不断写出的,日志文件会不断更新,需要下载最新的日志文件才能查看到最新的数据。\n", "\n", "首先需要在本地安装安装TensorBoard" ] }, { "cell_type": "code", "execution_count": null, "id": "d4a355db36ffb93a", "metadata": {}, "outputs": [], "source": [ "!pip install tensorboard" ] }, { "cell_type": "markdown", "id": "fbc051bdc1eced75", "metadata": {}, "source": [ "下载TensorBoard日志文件" ] }, { "cell_type": "code", "execution_count": null, "id": "30b702d094c1b520", "metadata": {}, "outputs": [], "source": [ "from pai.common import oss_utils\n", "from pai.common.oss_utils import OssUriObj\n", "\n", "oss_uri = OssUriObj(experiment.tensorboard_data())\n", "store_dir = \"./tensorboard_logs\"\n", "\n", "oss_utils.download(\n", " oss_path=oss_uri.object_key,\n", " local_path=store_dir,\n", " bucket=sess.get_oss_bucket(oss_uri.bucket_name),\n", ")" ] }, { "cell_type": "markdown", "id": "3aadda9f3942313c", "metadata": {}, "source": [ "通过shell命令在本地拉起TensorBoard服务。" ] }, { "cell_type": "code", "execution_count": null, "id": "b23e680e4e26d9cc", "metadata": {}, "outputs": [], "source": [ "!tensorboard --logdir \"$target_path\"" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.19" } }, "nbformat": 4, "nbformat_minor": 5 }