{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Copyright 2022 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# http://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Fine-tuning T5 1.1 XL for abstractive summarization\n", "\n", "\n", "This notebook demonstrates how to fine-tune the T5 1.1 XL model for the abstractive summarization task using the [XSum dataset](https://www.tensorflow.org/datasets/catalog/xsum)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports and initialization" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Reload modules automatically before executing any code\n", "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import libraries\n", "\n", "Please refer to the [environment setup](../README.md) section in the README \n", "file to set up the development environment and install the required libraries \n", "before importing them." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import time\n", "from datetime import datetime\n", "import pandas as pd\n", "\n", "import utils\n", "\n", "# Import the Vertex AI SDK for Python\n", "from google.cloud import aiplatform as vertex_ai" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Configure environment settings\n", "\n", "\n", "- **`PROJECT_ID`:** The Google Cloud project ID\n", "- **`REGION`:** The [region](https://cloud.google.com/vertex-ai/docs/general/locations) \n", " to be used for Vertex AI operations throughout the rest of this notebook\n", "- **`BUCKET`:** The name of the Google Cloud Storage bucket that Vertex AI uses \n", " to stage the code and save any generated artifacts\n", "- **`TENSORBOARD_NAME`:** The name of the managed TensorBoard instance \n", " created during the environment setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Project definitions\n", "PROJECT_ID = '<YOUR PROJECT ID>' # Change to your project id.\n", "REGION = '<YOUR REGION>' # Change to your region.\n", "\n", "# Bucket definitions\n", "BUCKET = '<YOUR BUCKET NAME>' # Change to your bucket.\n", "\n", "# Tensorboard definitions\n", "TENSORBOARD_NAME = '<YOUR TENSORBOARD NAME>' # Change to your Tensorboard instance name" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get the Vertex AI TensorBoard ID from the instance's display name." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "TENSORBOARD_ID = ! 
gcloud ai tensorboards list --filter=\"displayName={TENSORBOARD_NAME}\" --format=\"value(name)\" --region={REGION} 2>/dev/null \n", "TENSORBOARD_ID = TENSORBOARD_ID[0]\n", "\n", "print(f\"TENSORBOARD_ID = {TENSORBOARD_ID}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Configure custom container image\n", "\n", "In this example, you use the base T5X custom training container." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "IMAGE_NAME = 't5x-base' \n", "IMAGE_URI = f'gcr.io/{PROJECT_ID}/{IMAGE_NAME}'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Validate that the image exists in Container Registry." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! gcloud container images describe $IMAGE_URI" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Configure experiment settings\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "EXPERIMENT_NAME = '<YOUR EXPERIMENT>' # Change to your experiment name\n", "\n", "EXPERIMENT_WORKSPACE = f'gs://{BUCKET}/experiments/{EXPERIMENT_NAME}'\n", "EXPERIMENT_RUNS = f'{EXPERIMENT_WORKSPACE}/runs'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Initialize Vertex AI SDK for Python\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vertex_ai.init(\n", " project=PROJECT_ID,\n", " location=REGION,\n", " staging_bucket=EXPERIMENT_WORKSPACE,\n", " experiment=EXPERIMENT_NAME\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Configure dataset location\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "TFDS_DATA_DIR = f'gs://{BUCKET}/datasets'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run fine-tuning job\n", "### Define the job's gin file\n", "\n", "This job is configured using the following Gin 
file." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "JOB_GIN_FILE = '../configs/finetune_t511_xl_xsum.gin'\n", "\n", "!cat {JOB_GIN_FILE}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Following the hyperparameter settings reported in a number of papers, including [PEGASUS: Pre-training with Extracted Gap-sentences for\n", "Abstractive Summarization](https://arxiv.org/pdf/1912.08777.pdf), the input length is set to 512 and the output length to 64. The batch size is set to 256.\n", "\n", "The job is configured to run evaluations on the `validation` split of XSum every 500 steps. Because evaluating on the full `validation` split would be compute-intensive and would significantly extend the elapsed time, evaluations are limited to 500 examples.\n", "\n", "By default, `t5x/t5x/configs/runs/finetune.gin` does not put any constraints on the length of input and target features during evaluations. Their dimensions are computed by looking for the longest sequences in the data split used for evaluation. In the preprocessed `validation` split of the XSum dataset, the longest sequence is 7730 tokens, so the feature dimension of input batches would be set to 7730. The 90th percentile is 1061. To mitigate potential out-of-memory errors, the maximum length of inputs for evaluation is constrained to 1024.\n", "\n", "This configuration has been tested on a v2-128 TPU slice using 8-way model parallelism and 16-way data parallelism. 
\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "GIN_FILES = [JOB_GIN_FILE] \n", "GIN_OVERWRITES = [\n", " 'USE_CACHED_TASKS=False'\n", " ]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Configure Vertex AI CustomJob" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "RUN_NAME = '<YOUR RUN NAME>' # Change to your run name for the custom job\n", "RUN_ID = f'{EXPERIMENT_NAME}-{RUN_NAME}-{datetime.now().strftime(\"%Y%m%d%H%M\")}'\n", "RUN_DIR = f'{EXPERIMENT_RUNS}/{RUN_ID}'\n", "RUN_MODE = 'train'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Log the variables defined so far to help with troubleshooting." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for key in [\n", " \"PROJECT_ID\", \"REGION\", \"BUCKET\", \"TENSORBOARD_NAME\", \"TENSORBOARD_ID\", \n", " \"IMAGE_NAME\", \"IMAGE_URI\", \n", " \"EXPERIMENT_NAME\", \"EXPERIMENT_WORKSPACE\", \"EXPERIMENT_RUNS\", \n", " \"TFDS_DATA_DIR\", \"GIN_FILES\", \"GIN_OVERWRITES\", \n", " \"RUN_NAME\", \"RUN_ID\", \"RUN_DIR\", \"RUN_MODE\"\n", "]:\n", " print(f\"{key}={eval(key)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Configure a Cloud TPU slice for the job. Double-check that your [region](https://cloud.google.com/vertex-ai/docs/general/locations#accelerators) supports the specified TPU topology." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "MACHINE_TYPE = 'cloud-tpu'\n", "ACCELERATOR_TYPE = 'TPU_V2'\n", "ACCELERATOR_COUNT = 128" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create the custom job spec" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "job = utils.create_t5x_custom_job(\n", " display_name=RUN_ID,\n", " machine_type=MACHINE_TYPE,\n", " accelerator_type=ACCELERATOR_TYPE,\n", " accelerator_count=ACCELERATOR_COUNT,\n", " image_uri=IMAGE_URI,\n", " run_mode=RUN_MODE,\n", " gin_files=GIN_FILES,\n", " model_dir=RUN_DIR,\n", " tfds_data_dir=TFDS_DATA_DIR,\n", " gin_overwrites=GIN_OVERWRITES\n", ")\n", "\n", "job.job_spec" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Submit the custom job to Vertex AI and track the experiment\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "utils.submit_and_track_t5x_vertex_job(\n", " custom_job=job,\n", " job_display_name=RUN_ID,\n", " run_name=RUN_ID,\n", " experiment_name=EXPERIMENT_NAME,\n", " execution_name=RUN_ID,\n", " tfds_data_dir=TFDS_DATA_DIR,\n", " model_dir=RUN_DIR,\n", " vertex_ai=vertex_ai,\n", " run_mode=RUN_MODE\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Monitor the job with Vertex AI TensorBoard\n", "\n", "**Run the command printed by the following cell in a terminal window to sync logs to Vertex \n", "AI TensorBoard**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cmd = f\"\"\"\n", "tb-gcp-uploader --tensorboard_resource_name {TENSORBOARD_ID} \\\n", "--logdir {EXPERIMENT_RUNS} \\\n", "--experiment_name {EXPERIMENT_NAME}\n", "\"\"\"\n", "\n", "print(cmd)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To access the TensorBoard instance for the experiment, open the URL printed by the following cell." ] }, { "cell_type": "code", "execution_count": null, 
"metadata": {}, "outputs": [], "source": [ "TENSORBOARD_URL = f\"https://{REGION}.tensorboard.googleusercontent.com/experiment/{TENSORBOARD_ID.replace('/', '+')}+experiments+{EXPERIMENT_NAME}/\"\n", "print(f\"TensorBoard URL for the experiment is located at {TENSORBOARD_URL}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, you can access the Vertex AI TensorBoard experiment from the [console](https://console.cloud.google.com/vertex-ai/experiments/)." ] } ], "metadata": { "environment": { "kernel": "python3", "name": "common-cpu.m93", "type": "gcloud", "uri": "gcr.io/deeplearning-platform-release/base-cpu:m93" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.12" }, "vscode": { "interpreter": { "hash": "802ea0518c7535cb908ec450b955e980eb8525a80af866d337675ca6fae56b98" } } }, "nbformat": 4, "nbformat_minor": 4 }