pai-python-sdk/training/train_with_experiment/train_with_experiment.ipynb (393 lines of code) (raw):
{
"cells": [
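{
"cell_type": "markdown",
"id": "c8340fe748443685",
"metadata": {},
"source": [
"# Track and compare PAI-QuickStart model training jobs with experiment management\n",
"\n",
"Model training usually takes many attempts. Developers iterate over different datasets, training hyperparameters, or pretrained models, monitor whether training converges, and compare metrics across training jobs to select the best-performing model. This workflow typically relies on an experiment management tool.\n",
"\n",
"[TensorBoard](https://www.tensorflow.org/tensorboard?hl=zh-cn) is a widely used tracking and visualization tool that records and displays scalar data such as loss and accuracy during model training. PAI provides a TensorBoard service that lets developers run a TensorBoard instance in the cloud to monitor training.\n",
"\n",
"With experiment management, developers can view and compare the metrics of different training jobs under the same experiment in a single TensorBoard instance, without managing TensorBoard logs by hand.\n",
"\n",
"This example fine-tunes a pretrained model provided by PAI-QuickStart to show how to use PAI's experiment capability through the PAI Python SDK to organize and compare the metrics of fine-tuning jobs.\n",
"\n",
"\n",
"## Billing\n",
"\n",
"This example uses the following cloud products and incurs the corresponding charges:\n",
"\n",
"- PAI-DLC: runs the training jobs. For billing details, see [PAI-DLC billing](https://help.aliyun.com/zh/pai/product-overview/billing-of-dlc)\n",
"- OSS: stores the models output by training jobs, the training code, TensorBoard logs, and so on. For billing details, see [OSS billing overview](https://help.aliyun.com/zh/oss/product-overview/billing-overview)\n",
"\n",
"\n",
"> By joining the free trial for cloud products and using the **designated instance types** to submit training jobs or deploy inference services, you can try PAI for free. For details, see [PAI free trial](https://help.aliyun.com/zh/pai/product-overview/free-quota-for-new-users).\n",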
"\n"
]
},
{
"cell_type": "markdown",
"id": "8e80489c7d269c66",
"metadata": {
"ExecuteTime": {
"end_time": "2024-05-10T10:53:39.498381Z",
"start_time": "2024-05-10T10:53:39.467495Z"
}
},
"source": [
"## Install and configure the SDK\n",
"\n",
"Running this example requires the PAI Python SDK."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8064d95fb7663d96",
"metadata": {},
"outputs": [],
"source": [
"!python -m pip install --upgrade pai"
]
},
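{
"cell_type": "markdown",
"id": "9e92e2e988d951a",
"metadata": {},
"source": [
"The SDK needs an AccessKey for accessing Alibaba Cloud services, along with the workspace and OSS bucket to use. After installing the PAI SDK, run the following command in a **command-line terminal** and follow the prompts to configure the credentials, workspace, and other settings.\n",
"\n",
"```shell\n",
"\n",
"# Run the following command in a command-line terminal.\n",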
"\n",
"python -m pai.toolkit.config\n",
"\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "7bc2facf0611217c",
"metadata": {},
"source": [
"The following code verifies that the configuration has taken effect."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "70dacf7e9f406070",
"metadata": {},
"outputs": [],
"source": [
"import pai\n",
"from pai.session import get_default_session\n",
"\n",
"print(pai.__version__)\n",
"\n",
"sess = get_default_session()\n",
"\n",
"# Check that a workspace is configured and print its name\n",
"assert sess.workspace_name is not None\n",
"print(sess.workspace_name)"
]
},
{
"cell_type": "markdown",
"id": "cb8ce4373d3e9903",
"metadata": {},
"source": [
"## Create an experiment\n",
"\n",
"First, create an experiment by specifying its name and output path."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3b5310a914fd9fa6",
"metadata": {},
"outputs": [],
"source": [
"from pai.experiment import Experiment\n",
"\n",
"# Specify the experiment name; it must be unique within a workspace\n",
"experiment_name = \"test_experiment3\"\n",
"\n",
"# Use the workspace's default bucket combined with the experiment name as the output path; change this to use a different path.\n",
"# Only OSS is currently supported. Make sure you have read and write permissions on the bucket.\n",
"default_bucket_name = sess.oss_bucket_name\n",
"endpoint = sess.oss_endpoint\n",
"artifact_uri = f\"oss://{default_bucket_name}.{endpoint}/{experiment_name}/\"\n",
"\n",
"# Create the experiment\n",
"experiment = Experiment.create(name=experiment_name, artifact_uri=artifact_uri)\n",
"\n",
"# Print the experiment ID\n",
"print(experiment.experiment_id)"
]
},
{
"cell_type": "markdown",
"id": "e0b7ef42c7a641bc",
"metadata": {},
"source": [
"Check the experiment's default TensorBoard log storage path."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e9ab341ec875b0c3",
"metadata": {},
"outputs": [],
"source": [
"print(experiment.tensorboard_data())"
]
},
{
"cell_type": "markdown",
"id": "a4ce9a1a54d40663",
"metadata": {},
"source": [
"## Submit training jobs to the experiment"
]
},
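{
"cell_type": "markdown",
"id": "781934b47fce338d",
"metadata": {},
"source": [
"PAI-QuickStart provides a large number of pretrained models covering generative AI, computer vision, and other areas. The PAI Python SDK can list these models and train them. For details, see <https://gallery.pai-ml.com/#/preview/paiPythonSDK/training/pretrained_model>.\n",
"\n",
"This example uses the `Bert` model to show how to group several fine-tuning jobs under one experiment and compare their training metrics."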
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fdaaa75ab4a3f9d8",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"from pai.estimator import AlgorithmEstimator\n",
"from pai.experiment import ExperimentConfig\n",
"from pai.model import RegisteredModel\n",
"\n",
"# Get the model named bert-base-uncased from the PAI model registry\n",
"m = RegisteredModel(\n",
"    model_name=\"bert-base-uncased\",\n",
"    model_provider=\"pai\",\n",
")\n",
"\n",
"# Get the corresponding pretraining algorithm from the registered model's configuration\n",
"est: AlgorithmEstimator = m.get_estimator(\n",
"    # Specify the instance type used for training\n",
"    instance_type=\"ecs.gn7i-c8g1.2xlarge\",\n",
"    # Associate the jobs submitted by this estimator with the experiment\n",
"    experiment_config=ExperimentConfig(\n",
"        experiment_id=experiment.experiment_id,\n",
"    ),\n",
")\n",
"\n",
"# Print the algorithm's hyperparameter definitions\n",
"print(json.dumps(est.hyperparameter_definitions, indent=4))\n",
"\n",
"# Print the algorithm's default hyperparameters\n",
"print(\"before\")\n",
"print(est.hyperparameters)"
]
},
{
"cell_type": "markdown",
"id": "852a312331eeabb6",
"metadata": {},
"source": [
"Modify the algorithm's hyperparameters and submit a training job."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "78be8d8cd6a4c3df",
"metadata": {},
"outputs": [],
"source": [
"est.set_hyperparameters(max_epochs=3, learning_rate=1.1e-5, save_step=10)\n",
"\n",
"print(\"after\")\n",
"print(est.hyperparameters)\n",
"\n",
"# Get the default training inputs\n",
"default_inputs = m.get_estimator_inputs()\n",
"\n",
"# Submit the training job without waiting for it to finish\n",
"est.fit(inputs=default_inputs, wait=False)"
]
},
{
"cell_type": "markdown",
"id": "401a7fecbf2b2b23",
"metadata": {},
"source": [
"Change the hyperparameters again and submit a new training job."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a5046db12dfbbce",
"metadata": {},
"outputs": [],
"source": [
"# Adjust the hyperparameters\n",
"est.set_hyperparameters(max_epochs=4, learning_rate=1.2e-5)\n",
"\n",
"print(\"after\")\n",
"print(est.hyperparameters)\n",
"\n",
"# Submit a new training job without waiting for it to finish\n",
"est.fit(inputs=default_inputs, wait=False)"
]
},
{
"cell_type": "markdown",
"id": "c3fbfeaf973f8cdc",
"metadata": {},
"source": [
"## Compare training metrics with the experiment's TensorBoard"
]
},
{
"cell_type": "markdown",
"id": "67bdba17310bb254",
"metadata": {},
"source": [
"The PAI TensorBoard service shows the TensorBoard logs of all jobs in the experiment in real time."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "99c4fe6769b7dd95",
"metadata": {},
"outputs": [],
"source": [
"# Start the experiment's TensorBoard application\n",
"tensorboard = experiment.tensorboard()\n",
"\n",
"# Print the URL of the TensorBoard application\n",
"print(tensorboard.app_uri)"
]
},
{
"cell_type": "markdown",
"id": "daf7e6b75ae04d8f",
"metadata": {},
"source": [
"When you are done, delete the TensorBoard application."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4891040e57d185e2",
"metadata": {},
"outputs": [],
"source": [
"tensorboard.delete()"
]
},
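{
"cell_type": "markdown",
"id": "f603c1ec6aaed86b",
"metadata": {},
"source": [
"The TensorBoard logs can also be viewed with a TensorBoard service started locally. Note that the logs are written continuously while the jobs run, so the log files keep changing; download the latest files to see the latest data.\n",
"\n",
"First, install TensorBoard locally."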
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d4a355db36ffb93a",
"metadata": {},
"outputs": [],
"source": [
"!pip install tensorboard"
]
},
{
"cell_type": "markdown",
"id": "fbc051bdc1eced75",
"metadata": {},
"source": [
"Download the TensorBoard log files."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "30b702d094c1b520",
"metadata": {},
"outputs": [],
"source": [
"from pai.common import oss_utils\n",
"from pai.common.oss_utils import OssUriObj\n",
"\n",
"oss_uri = OssUriObj(experiment.tensorboard_data())\n",
"store_dir = \"./tensorboard_logs\"\n",
"\n",
"oss_utils.download(\n",
" oss_path=oss_uri.object_key,\n",
" local_path=store_dir,\n",
" bucket=sess.get_oss_bucket(oss_uri.bucket_name),\n",
")"
]
},
{
"cell_type": "markdown",
"id": "3aadda9f3942313c",
"metadata": {},
"source": [
"Start a local TensorBoard service with a shell command."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b23e680e4e26d9cc",
"metadata": {},
"outputs": [],
"source": [
"!tensorboard --logdir \"$store_dir\""
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.19"
}
},
"nbformat": 4,
"nbformat_minor": 5
}