quests/endtoendml/solutions/labs/5_train.ipynb

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "<h1>Training on Cloud AI Platform</h1>\n", "\n", "This notebook illustrates distributed training on Cloud AI Platform (formerly known as Cloud ML Engine)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!sudo chown -R jupyter:jupyter /home/jupyter/training-data-analyst" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Ensure the right version of TensorFlow is installed.\n", "!pip freeze | grep tensorflow==2.1" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# change these to try this notebook out\n", "BUCKET = 'cloud-training-demos-ml'\n", "PROJECT = 'cloud-training-demos'\n", "REGION = 'us-central1'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "os.environ['BUCKET'] = BUCKET\n", "os.environ['PROJECT'] = PROJECT\n", "os.environ['REGION'] = REGION\n", "os.environ['TFVERSION'] = '2.1'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "gcloud config set project $PROJECT\n", "gcloud config set compute/region $REGION" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "if ! gsutil ls | grep -q gs://${BUCKET}/babyweight/preproc; then\n", " gsutil mb -l ${REGION} gs://${BUCKET}\n", " # copy canonical set of preprocessed files if you didn't do the previous notebook\n", " gsutil -m cp -R gs://cloud-training-demos/babyweight gs://${BUCKET}\n", "fi" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "gsutil ls gs://${BUCKET}/babyweight/preproc/*-00000*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have the TensorFlow code working on a subset of the data, we can package the TensorFlow code up as a Python module and train it on Cloud AI Platform.\n", "<p>\n", "<h2> Train on Cloud AI Platform</h2>\n", "<p>\n", "Training on Cloud AI Platform requires:\n", "<ol>\n", "<li> Making the code a Python package\n", "<li> Using gcloud to submit the training code to Cloud AI Platform\n", "</ol>\n", "\n", "Ensure that the AI Platform API is enabled by going to this [link](https://console.developers.google.com/apis/library/ml.googleapis.com)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lab Task 1\n", "\n", "The following code edits babyweight/trainer/task.py. You should add the hyperparameters needed by your model as command-line arguments using the `parser` module. Look at how `batch_size` is passed to the model in the code below. Do this for the following hyperparameters (defaults in parentheses): `train_examples` (5000), `eval_steps` (None), `pattern` ('of')." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile babyweight/trainer/task.py\n", "import argparse\n", "import json\n", "import os\n", "\n", "from . import model\n", "\n", "import tensorflow.compat.v1 as tf\n", "tf.disable_v2_behavior()\n", "\n", "if __name__ == '__main__':\n", " parser = argparse.ArgumentParser()\n", " parser.add_argument(\n", " '--bucket',\n", " help = 'GCS path to data. 
We assume that data is in gs://BUCKET/babyweight/preproc/',\n", " required = True\n", " )\n", " parser.add_argument(\n", " '--output_dir',\n", " help = 'GCS location to write checkpoints and export models',\n", " required = True\n", " )\n", " parser.add_argument(\n", " '--batch_size',\n", " help = 'Number of examples to compute gradient over.',\n", " type = int,\n", " default = 512\n", " )\n", " parser.add_argument(\n", " '--job-dir',\n", " help = 'this model ignores this field, but it is required by gcloud',\n", " default = 'junk'\n", " )\n", " parser.add_argument(\n", " '--nnsize',\n", " help = 'Hidden layer sizes to use for DNN feature columns -- provide space-separated layers',\n", " nargs = '+',\n", " type = int,\n", " default=[128, 32, 4]\n", " )\n", " parser.add_argument(\n", " '--nembeds',\n", " help = 'Embedding size of a cross of n key real-valued parameters',\n", " type = int,\n", " default = 3\n", " )\n", "\n", " ## TODOs after this line\n", " ################################################################################\n", " \n", " ## TODO 1: add the new arguments here \n", "\n", " ## parse all arguments\n", " args = parser.parse_args()\n", " arguments = args.__dict__\n", "\n", " # unused args provided by service\n", " arguments.pop('job_dir', None)\n", " arguments.pop('job-dir', None)\n", "\n", " ## assign the arguments to the model variables\n", " output_dir = arguments.pop('output_dir')\n", " model.BUCKET = arguments.pop('bucket')\n", " model.BATCH_SIZE = arguments.pop('batch_size')\n", " model.TRAIN_STEPS = (arguments.pop('train_examples') * 100) / model.BATCH_SIZE\n", " model.EVAL_STEPS = arguments.pop('eval_steps')\n", " print (\"Will train for {} steps using batch_size={}\".format(model.TRAIN_STEPS, model.BATCH_SIZE))\n", " model.PATTERN = arguments.pop('pattern')\n", " model.NEMBEDS = arguments.pop('nembeds')\n", " model.NNSIZE = arguments.pop('nnsize')\n", " print (\"Will use DNN size of {}\".format(model.NNSIZE))\n", "\n", " # Append trial_id to path if we are doing hptuning\n", " # This code can be removed if you are not using hyperparameter tuning\n", " output_dir = os.path.join(\n", " output_dir,\n", " json.loads(\n", " os.environ.get('TF_CONFIG', '{}')\n", " ).get('task', {}).get('trial', '')\n", " )\n", "\n", " # Run the training job\n", " model.train_and_evaluate(output_dir)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lab Task 2\n", "\n", "Address all the TODOs in the code in the cell below, which writes the module to `babyweight/trainer/model.py`. This code is similar to the model training code we wrote in Lab 3.\n", "\n", "After addressing all TODOs, run the cell to write the code to the model.py file."
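, "\n", "For reference, here is a minimal sketch of the `tf.estimator.train_and_evaluate` pattern that the TODOs exercise. The values below are placeholders (not the lab answers), and `wide`, `deep`, `read_dataset`, and `serving_input_fn` refer to the helpers defined in the cell that follows:\n", "```python\n", "import tensorflow.compat.v1 as tf\n", "\n", "# RunConfig controls checkpointing (TODO 2a sets save_checkpoints_secs)\n", "run_config = tf.estimator.RunConfig(save_checkpoints_secs=300, keep_checkpoint_max=3)\n", "\n", "# The wide-and-deep regressor takes the feature columns and hidden layer sizes (TODO 2b)\n", "estimator = tf.estimator.DNNLinearCombinedRegressor(\n", "    model_dir='some_output_dir',\n", "    linear_feature_columns=wide,\n", "    dnn_feature_columns=deep,\n", "    dnn_hidden_units=[64, 16, 4],\n", "    config=run_config)\n", "\n", "# TrainSpec wires the training input_fn to a batch size and step budget (TODOs 2c and 2d)\n", "train_spec = tf.estimator.TrainSpec(\n", "    input_fn=read_dataset('train', tf.estimator.ModeKeys.TRAIN, 512),\n", "    max_steps=1000)\n", "\n", "# EvalSpec wires the evaluation input_fn and the exporter (TODO 2e sets steps)\n", "eval_spec = tf.estimator.EvalSpec(\n", "    input_fn=read_dataset('eval', tf.estimator.ModeKeys.EVAL, 2**15),\n", "    steps=None,\n", "    exporters=tf.estimator.LatestExporter('exporter', serving_input_fn))\n", "\n", "tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)\n", "```"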
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile babyweight/trainer/model.py\n", "import shutil\n", "import numpy as np\n", "\n", "import tensorflow.compat.v1 as tf\n", "tf.disable_v2_behavior()\n", "\n", "tf.logging.set_verbosity(tf.logging.INFO)\n", "\n", "BUCKET = None # set from task.py\n", "PATTERN = 'of' # gets all files\n", "\n", "# Determine CSV, label, and key columns\n", "CSV_COLUMNS = 'weight_pounds,is_male,mother_age,plurality,gestation_weeks,key'.split(',')\n", "LABEL_COLUMN = 'weight_pounds'\n", "KEY_COLUMN = 'key'\n", "\n", "# Set default values for each CSV column\n", "DEFAULTS = [[0.0], ['null'], [0.0], ['null'], [0.0], ['nokey']]\n", "\n", "# Define some hyperparameters\n", "TRAIN_STEPS = 10000\n", "EVAL_STEPS = None\n", "BATCH_SIZE = 512\n", "NEMBEDS = 3\n", "NNSIZE = [64, 16, 4]\n", "\n", "# Create an input function reading a file using the Dataset API\n", "# Then provide the results to the Estimator API\n", "def read_dataset(prefix, mode, batch_size):\n", " def _input_fn():\n", " def decode_csv(value_column):\n", " columns = tf.decode_csv(value_column, record_defaults=DEFAULTS)\n", " features = dict(zip(CSV_COLUMNS, columns))\n", " label = features.pop(LABEL_COLUMN)\n", " return features, label\n", " \n", " # Use prefix to create file path\n", " file_path = 'gs://{}/babyweight/preproc/{}*{}*'.format(BUCKET, prefix, PATTERN)\n", "\n", " # Create list of files that match pattern\n", " file_list = tf.gfile.Glob(file_path)\n", "\n", " # Create dataset from file list\n", " dataset = (tf.data.TextLineDataset(file_list) # Read text file\n", " .map(decode_csv)) # Transform each elem by applying decode_csv fn\n", " \n", " if mode == tf.estimator.ModeKeys.TRAIN:\n", " num_epochs = None # indefinitely\n", " dataset = dataset.shuffle(buffer_size = 10 * batch_size)\n", " else:\n", " num_epochs = 1 # end-of-input after this\n", " \n", " dataset = dataset.repeat(num_epochs).batch(batch_size)\n", " return dataset.make_one_shot_iterator().get_next()\n", " return _input_fn\n", "\n", "# Define feature columns\n", "def get_wide_deep():\n", " # Define column types\n", " is_male,mother_age,plurality,gestation_weeks = \\\n", " [\\\n", " tf.feature_column.categorical_column_with_vocabulary_list('is_male', \n", " ['True', 'False', 'Unknown']),\n", " tf.feature_column.numeric_column('mother_age'),\n", " tf.feature_column.categorical_column_with_vocabulary_list('plurality',\n", " ['Single(1)', 'Twins(2)', 'Triplets(3)',\n", " 'Quadruplets(4)', 'Quintuplets(5)','Multiple(2+)']),\n", " tf.feature_column.numeric_column('gestation_weeks')\n", " ]\n", "\n", " # Discretize\n", " age_buckets = tf.feature_column.bucketized_column(mother_age, \n", " boundaries=np.arange(15,45,1).tolist())\n", " gestation_buckets = tf.feature_column.bucketized_column(gestation_weeks, \n", " boundaries=np.arange(17,47,1).tolist())\n", " \n", " # Sparse columns are wide, have a linear relationship with the output\n", " wide = [is_male,\n", " plurality,\n", " age_buckets,\n", " gestation_buckets]\n", " \n", " # Feature cross all the wide columns and embed into a lower dimension\n", " crossed = tf.feature_column.crossed_column(wide, hash_bucket_size=20000)\n", " embed = tf.feature_column.embedding_column(crossed, NEMBEDS)\n", " \n", " # Continuous columns are deep, have a complex relationship with the output\n", " deep = [mother_age,\n", " gestation_weeks,\n", " embed]\n", " return wide, deep\n", "\n", "# Create serving input function to be able to serve predictions 
later using provided inputs\n", "def serving_input_fn():\n", " feature_placeholders = {\n", " 'is_male': tf.placeholder(tf.string, [None]),\n", " 'mother_age': tf.placeholder(tf.float32, [None]),\n", " 'plurality': tf.placeholder(tf.string, [None]),\n", " 'gestation_weeks': tf.placeholder(tf.float32, [None]),\n", " KEY_COLUMN: tf.placeholder_with_default(tf.constant(['nokey']), [None])\n", " }\n", " features = {\n", " key: tf.expand_dims(tensor, -1)\n", " for key, tensor in feature_placeholders.items()\n", " }\n", " return tf.estimator.export.ServingInputReceiver(features, feature_placeholders)\n", "\n", "# create metric for hyperparameter tuning\n", "def my_rmse(labels, predictions):\n", " pred_values = predictions['predictions']\n", " return {'rmse': tf.metrics.root_mean_squared_error(labels, pred_values)}\n", "\n", "def forward_features(estimator, key):\n", " def new_model_fn(features, labels, mode, config):\n", " spec = estimator.model_fn(features, labels, mode, config)\n", " predictions = spec.predictions\n", " predictions[key] = features[key]\n", " spec = spec._replace(predictions=predictions)\n", " return spec\n", " return tf.estimator.Estimator(model_fn=new_model_fn, model_dir=estimator.model_dir, config=estimator.config)\n", "\n", "## TODOs after this line\n", "################################################################################\n", "\n", "# Create estimator to train and evaluate\n", "def train_and_evaluate(output_dir):\n", " tf.summary.FileWriterCache.clear() # ensure filewriter cache is clear for TensorBoard events file\n", " wide, deep = get_wide_deep()\n", " EVAL_INTERVAL = 300 # seconds\n", "\n", " ## TODO 2a: set the save_checkpoints_secs to the EVAL_INTERVAL\n", " run_config = tf.estimator.RunConfig(save_checkpoints_secs = None,\n", " keep_checkpoint_max = 3)\n", " \n", " ## TODO 2b: change the dnn_hidden_units to NNSIZE\n", " estimator = tf.estimator.DNNLinearCombinedRegressor(\n", " model_dir = output_dir,\n", " linear_feature_columns = wide,\n", " dnn_feature_columns = deep,\n", " dnn_hidden_units = None,\n", " config = run_config)\n", " \n", " # illustrates how to add an extra metric\n", " estimator = tf.estimator.add_metrics(estimator, my_rmse)\n", " # for batch prediction, you need a key associated with each instance\n", " estimator = forward_features(estimator, KEY_COLUMN)\n", "\n", " ## TODO 2c: Set the third argument of read_dataset to BATCH_SIZE \n", " ## TODO 2d: and set max_steps to TRAIN_STEPS\n", " train_spec = tf.estimator.TrainSpec(\n", " input_fn = read_dataset('train', tf.estimator.ModeKeys.TRAIN, None),\n", " max_steps = None)\n", " \n", " exporter = tf.estimator.LatestExporter('exporter', serving_input_fn, exports_to_keep=None)\n", "\n", " ## TODO 2e: Lastly, set steps equal to EVAL_STEPS\n", " eval_spec = tf.estimator.EvalSpec(\n", " input_fn = read_dataset('eval', tf.estimator.ModeKeys.EVAL, 2**15), # no need to batch in eval\n", " steps = None,\n", " start_delay_secs = 60, # start evaluating after N seconds\n", " throttle_secs = EVAL_INTERVAL, # evaluate every N seconds\n", " exporters = exporter)\n", " tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lab Task 3\n", "\n", "After moving the code to a package, make sure it works standalone. (Note the --pattern and --train_examples lines so that I am not trying to boil the ocean on the small notebook VM. 
Change as appropriate for your model).\n", "<p>\n", "Even with smaller data, this might take <b>3-5 minutes</b> during which you won't see any output ..." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "echo \"bucket=${BUCKET}\"\n", "rm -rf babyweight_trained\n", "export PYTHONPATH=${PYTHONPATH}:${PWD}/babyweight\n", "python -m trainer.task \\\n", " --bucket=${BUCKET} \\\n", " --output_dir=babyweight_trained \\\n", " --job-dir=./tmp \\\n", " --pattern=\"00000-of-\" --train_examples=1 --eval_steps=1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lab Task 4\n", "\n", "The JSON below represents inputs to your prediction model. Write the inputs.json file with the next cell, then run the prediction locally to assess whether it produces predictions correctly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile inputs.json\n", "{\"key\": \"b1\", \"is_male\": \"True\", \"mother_age\": 26.0, \"plurality\": \"Single(1)\", \"gestation_weeks\": 39}\n", "{\"key\": \"g1\", \"is_male\": \"False\", \"mother_age\": 26.0, \"plurality\": \"Single(1)\", \"gestation_weeks\": 39}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "sudo find \"/usr/lib/google-cloud-sdk/lib/googlecloudsdk/command_lib/ml_engine\" -name '*.pyc' -delete" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "MODEL_LOCATION=$(ls -d $(pwd)/babyweight_trained/export/exporter/* | tail -1)\n", "echo $MODEL_LOCATION\n", "gcloud ai-platform local predict --model-dir=$MODEL_LOCATION --json-instances=inputs.json" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lab Task 5\n", "\n", "Once the code works in standalone mode, you can run it on Cloud AI Platform.\n", "Change the parameters to the model appropriately (for example, --train_examples may not be part of your model).\n", "\n", "Because this is on the entire dataset, it will take a while. The training run took about <b> 2 hours </b> for me. You can monitor the job from the GCP console in the Cloud AI Platform section." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "OUTDIR=gs://${BUCKET}/babyweight/trained_model\n", "JOBNAME=babyweight_$(date -u +%y%m%d_%H%M%S)\n", "echo $OUTDIR $REGION $JOBNAME\n", "gsutil -m rm -rf $OUTDIR\n", "gcloud ai-platform jobs submit training $JOBNAME \\\n", " --region=$REGION \\\n", " --module-name=trainer.task \\\n", " --package-path=$(pwd)/babyweight/trainer \\\n", " --job-dir=$OUTDIR \\\n", " --staging-bucket=gs://$BUCKET \\\n", " --scale-tier=STANDARD_1 \\\n", " --runtime-version=2.1 \\\n", " --python-version=3.7 \\\n", " -- \\\n", " --bucket=${BUCKET} \\\n", " --output_dir=${OUTDIR} \\\n", " --train_examples=20000" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When I ran it, I used train_examples=20000. When training finished, I filtered the Stackdriver logs on the word \"dict\" and saw that the last line was:\n", "<pre>\n", "Saving dict for global step 5714290: average_loss = 1.06473, global_step = 5714290, loss = 34882.4, rmse = 1.03186\n", "</pre>\n", "The final RMSE was 1.03 pounds."
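, "\n", "As a quick sanity check (assuming `average_loss` here is the mean squared error of the weight in pounds), the reported `rmse` should be roughly the square root of `average_loss`:\n", "```python\n", "import math\n", "\n", "# average_loss from the log line above\n", "average_loss = 1.06473\n", "print(math.sqrt(average_loss))  # ~1.0319, consistent with the reported rmse = 1.03186\n", "```"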
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<h2> Repeat training </h2>\n", "<p>\n", "This time with tuned parameters (note last line)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "OUTDIR=gs://${BUCKET}/babyweight/trained_model_tuned\n", "JOBNAME=babyweight_$(date -u +%y%m%d_%H%M%S)\n", "echo $OUTDIR $REGION $JOBNAME\n", "gsutil -m rm -rf $OUTDIR\n", "gcloud ai-platform jobs submit training $JOBNAME \\\n", " --region=$REGION \\\n", " --module-name=trainer.task \\\n", " --package-path=$(pwd)/babyweight/trainer \\\n", " --job-dir=$OUTDIR \\\n", " --staging-bucket=gs://$BUCKET \\\n", " --scale-tier=STANDARD_1 \\\n", " --runtime-version=2.1 \\\n", " --python-version=3.7 \\\n", " -- \\\n", " --bucket=${BUCKET} \\\n", " --output_dir=${OUTDIR} \\\n", " --train_examples=2000 --batch_size=35 --nembeds=16 --nnsize=281" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Copyright 2017 Google Inc. Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.3" } }, "nbformat": 4, "nbformat_minor": 2 }