notebooks/finetune-t511-large-squad.ipynb (508 lines of code) (raw):

{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Copyright 2022 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# http://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Fine-tuning T5 1.1 large on the SQuAD dataset\n", "\n", "\n", "---\n", "\n", "## Objective\n", "\n", "This notebook demonstrates how to fine-tune the T5 1.1 large model on the question answering task using the [SQuAD dataset](https://www.tensorflow.org/datasets/catalog/squad)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Reload modules automatically before executing any code\n", "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import libraries\n", "\n", "Please refer to the [environment setup](../README.md) section in the README \n", "file to set up the development environment and install the required libraries \n", "before importing them." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import time\n", "from datetime import datetime\n", "import pandas as pd\n", "\n", "import utils\n", "\n", "# Import the Vertex AI SDK for Python\n", "from google.cloud import aiplatform as vertex_ai" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configure environment settings\n", "\n", "Based on the [environment setup](../README.md) done previously, configure the \n", "following environment settings:\n", "\n", "- **`PROJECT_ID`:** Configure the Google Cloud project ID\n", "- **`REGION`:** Configure the [region](https://cloud.google.com/vertex-ai/docs/general/locations) \n", " to be used for Vertex AI operations throughout the rest of this notebook\n", "- **`BUCKET`:** Configure the Cloud Storage bucket name to be used by Vertex AI for \n", " operations such as staging the code and saving generated artifacts\n", "- **`TENSORBOARD_NAME`:** Configure the managed TensorBoard instance name \n", " created during the environment setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Project definitions\n", "PROJECT_ID = '<YOUR PROJECT ID>' # Change to your project id.\n", "REGION = '<YOUR REGION>' # Change to your region.\n", "\n", "# Bucket definitions\n", "BUCKET = '<YOUR BUCKET NAME>' # Change to your bucket.\n", "\n", "# TensorBoard definitions\n", "TENSORBOARD_NAME = '<YOUR TENSORBOARD NAME>' # Change to your TensorBoard instance name" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get the Vertex AI TensorBoard ID based on the instance name." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "TENSORBOARD_ID = ! 
gcloud ai tensorboards list --filter=\"displayName={TENSORBOARD_NAME}\" --format=\"value(name)\" --region={REGION} 2>/dev/null \n", "TENSORBOARD_ID = TENSORBOARD_ID[0]\n", "\n", "print(f\"TENSORBOARD_ID = {TENSORBOARD_ID}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Configure custom container image\n", "\n", "In this example, you use the base T5X custom training container." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "IMAGE_NAME = 't5x-base' \n", "IMAGE_URI = f'gcr.io/{PROJECT_ID}/{IMAGE_NAME}'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Validate that the image exists in Container Registry." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! gcloud container images describe $IMAGE_URI" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configure experiment settings\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "EXPERIMENT_NAME = '<YOUR EXPERIMENT>' # Change to your experiment name\n", "\n", "EXPERIMENT_WORKSPACE = f'gs://{BUCKET}/experiments/{EXPERIMENT_NAME}'\n", "EXPERIMENT_RUNS = f'{EXPERIMENT_WORKSPACE}/runs'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Initialize Vertex AI SDK for Python\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vertex_ai.init(\n", " project=PROJECT_ID,\n", " location=REGION,\n", " staging_bucket=EXPERIMENT_WORKSPACE,\n", " experiment=EXPERIMENT_NAME\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configure dataset location\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "TFDS_DATA_DIR = f'gs://{BUCKET}/datasets'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configure T5X fine-tuning job \n", "\n", "This job is configured using the following Gin file." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "JOB_GIN_FILE = '../configs/finetune_t511_large_squad.gin'\n", "\n", "!cat {JOB_GIN_FILE}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This configuration has been tested on a v2-128 TPU slice using 8-way model parallelism and 16-way data parallelism. The batch size is set to 128. If you want to run it on a different slice topology, make sure to adjust the global batch size and the number of model-parallel partitions.\n", "\n", "The job uses the custom `squad` SeqIO Task.\n", "\n", "The default settings for fine-tuning do not set any constraints on the length of an input sequence when computing metrics defined in the SeqIO Task. This may lead to out-of-memory errors when using a dataset with long input sequences. To avoid these errors, the `task_feature_lengths` property for the inference evaluation dataset config is set to the same value as for the training and validation datasets." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "GIN_FILES = [JOB_GIN_FILE] \n", "GIN_OVERWRITES = [\n", " 'USE_CACHED_TASKS=False'\n", " ]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configure and run job on Vertex AI" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Configure Vertex AI CustomJob" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "RUN_NAME = '<YOUR RUN NAME>' # Change to your run name for the custom job\n", "RUN_ID = f'{EXPERIMENT_NAME}-{RUN_NAME}-{datetime.now().strftime(\"%Y%m%d%H%M\")}'\n", "RUN_DIR = f'{EXPERIMENT_RUNS}/{RUN_ID}'\n", "RUN_MODE = 'train'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Log the local variables defined so far to help with troubleshooting." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for key in [\n", " \"PROJECT_ID\", \"REGION\", \"BUCKET\", 
\"TENSORBOARD_NAME\", \"TENSORBOARD_ID\", \n", " \"IMAGE_NAME\", \"IMAGE_URI\", \n", " \"EXPERIMENT_NAME\", \"EXPERIMENT_WORKSPACE\", \"EXPERIMENT_RUNS\", \n", " \"TFDS_DATA_DIR\", \"GIN_FILES\", \"GIN_OVERWRITES\", \n", " \"RUN_NAME\", \"RUN_ID\", \"RUN_DIR\", \"RUN_MODE\"\n", "]:\n", " print(f\"{key}={eval(key)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Configure a Cloud TPU slice for the job. Double check if your [region](https://cloud.google.com/vertex-ai/docs/general/locations#accelerators) supports the specified TPU topology." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "MACHINE_TYPE = 'cloud-tpu'\n", "ACCELERATOR_TYPE = 'TPU_V2'\n", "ACCELERATOR_COUNT = 128" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create the custom job spec" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "job = utils.create_t5x_custom_job(\n", " display_name=RUN_ID,\n", " machine_type=MACHINE_TYPE,\n", " accelerator_type=ACCELERATOR_TYPE,\n", " accelerator_count=ACCELERATOR_COUNT,\n", " image_uri=IMAGE_URI,\n", " run_mode=RUN_MODE,\n", " gin_files=GIN_FILES,\n", " model_dir=RUN_DIR,\n", " tfds_data_dir=TFDS_DATA_DIR,\n", " gin_overwrites=GIN_OVERWRITES\n", ")\n", "\n", "job.job_spec" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Submit the custom job to Vertex AI and track the experiment\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "utils.submit_and_track_t5x_vertex_job(\n", " custom_job=job,\n", " job_display_name=RUN_ID,\n", " run_name=RUN_ID,\n", " experiment_name=EXPERIMENT_NAME,\n", " execution_name=RUN_ID,\n", " tfds_data_dir=TFDS_DATA_DIR,\n", " model_dir=RUN_DIR,\n", " run_model=RUN_MODE,\n", " vertex_ai=vertex_ai\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Monitor the job with Vertex AI TensorBoard\n", "\n", "Currently Vertex AI Training 
does not support native integration with Vertex AI \n", "TensorBoard for TPU-based training jobs. As a workaround, you can run the \n", "`tb-gcp-uploader` command-line utility to manually [upload \n", "TensorBoard logs](https://cloud.google.com/vertex-ai/docs/experiments/tensorboard-overview#uploading_logs) \n", "to Vertex AI TensorBoard. This lets you monitor the training \n", "in near real time, as the logs are streamed to Vertex AI TensorBoard \n", "while they are written to the Cloud Storage bucket.\n", "\n", "**Execute the following command from a terminal window to sync logs to Vertex \n", "AI TensorBoard**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cmd = f\"\"\"\n", "tb-gcp-uploader --tensorboard_resource_name {TENSORBOARD_ID} \\\n", "--logdir {EXPERIMENT_RUNS} \\\n", "--experiment_name {EXPERIMENT_NAME}\n", "\"\"\"\n", "\n", "print(cmd)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To access the TensorBoard instance for the experiment, click the URL below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "TENSORBOARD_URL = f\"https://{REGION}.tensorboard.googleusercontent.com/experiment/{TENSORBOARD_ID.replace('/', '+')}+experiments+{EXPERIMENT_NAME}/\"\n", "print(f\"TensorBoard URL for the experiment is located at {TENSORBOARD_URL}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, you can access the Vertex AI TensorBoard experiment from the [console](https://console.cloud.google.com/vertex-ai/experiments/)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Explore and log metrics" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Set path to read inference eval metrics\n", "GCS_VAL_DIR = os.path.join(RUN_DIR, 'inference_eval/')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "results = utils.parse_and_log_eval_metrics(\n", " summary_dir=GCS_VAL_DIR,\n", " run_name=RUN_ID,\n", " vertex_ai=vertex_ai\n", ")\n", "results" ] } ], "metadata": { "environment": { "kernel": "python3", "name": "common-cpu.m93", "type": "gcloud", "uri": "gcr.io/deeplearning-platform-release/base-cpu:m93" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" }, "vscode": { "interpreter": { "hash": "cb70e41daff3c8e885354b3a6acd41d74f1030e2e06d05a657a05712cd24b2a0" } } }, "nbformat": 4, "nbformat_minor": 4 }