courses/machine_learning/asl/05_review/5_train.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Training on Cloud AI Platform\n",
"\n",
"**Learning Objectives**\n",
"- Use CAIP to run a distributed training job\n",
"\n",
"## Introduction \n",
"After having testing our training pipeline both locally and in the cloud on a susbset of the data, we can submit another (much larger) training job to the cloud. It is also a good idea to run a hyperparameter tuning job to make sure we have optimized the hyperparameters of our model. \n",
"\n",
"This notebook illustrates how to do distributed training and hyperparameter tuning on Cloud AI Platform. \n",
"\n",
"To start, we'll set up our environment variables as before."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"PROJECT = \"cloud-training-demos\" # Replace with your PROJECT\n",
"BUCKET = \"cloud-training-bucket\" # Replace with your BUCKET\n",
"\n",
"REGION = \"us-central1\" # Choose an available region for Cloud CAIP\n",
"TFVERSION = \"1.14\" # TF version for CMLE to use"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"os.environ[\"BUCKET\"] = BUCKET\n",
"os.environ[\"PROJECT\"] = PROJECT\n",
"os.environ[\"REGION\"] = REGION\n",
"os.environ[\"TFVERSION\"] = TFVERSION"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"gcloud config set project $PROJECT\n",
"gcloud config set compute/region $REGION"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we'll look for the preprocessed data for the babyweight model and copy it over if it's not there. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"if ! gsutil ls -r gs://$BUCKET | grep -q gs://$BUCKET/babyweight/preproc; then\n",
" gsutil mb -l ${REGION} gs://${BUCKET}\n",
" # copy canonical set of preprocessed files if you didn't do previous notebook\n",
" gsutil -m cp -R gs://cloud-training-demos/babyweight gs://${BUCKET}\n",
"fi"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"gsutil ls gs://${BUCKET}/babyweight/preproc/*-00000*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the previous labs we developed our TensorFlow model and got it working on a subset of the data. Now we can package the TensorFlow code up as a Python module and train it on Cloud AI Platform.\n",
"\n",
"## Train on Cloud AI Platform\n",
"\n",
"Training on Cloud AI Platform requires two things:\n",
"- Configuring our code as a Python package\n",
"- Using gcloud to submit the training code to Cloud AI Platform\n",
"\n",
"### Move code into a Python package\n",
"\n",
"A Python package is simply a collection of one or more `.py` files along with an `__init__.py` file to identify the containing directory as a package. The `__init__.py` sometimes contains initialization code but for our purposes an empty file suffices.\n",
"\n",
"The bash command `touch` creates an empty file in the specified location, the directory `babyweight` should already exist."
]
},
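{
"cell_type": "markdown",
"metadata": {},
"source": [
"The package layout we are building toward looks like this -- the `__init__.py` is created by the `touch` command below, while `task.py` and `model.py` are written by `%%writefile` cells later in this notebook:\n",
"<pre>\n",
"babyweight/\n",
"    trainer/\n",
"        __init__.py\n",
"        task.py\n",
"        model.py\n",
"</pre>"
]
},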
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"touch babyweight/trainer/__init__.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We then use the `%%writefile` magic to write the contents of the cell below to a file called `task.py` in the `babyweight/trainer` folder."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile babyweight/trainer/task.py\n",
"import argparse\n",
"import json\n",
"import os\n",
"\n",
"from . import model\n",
"\n",
"import tensorflow as tf\n",
"\n",
"\n",
"if __name__ == \"__main__\":\n",
" parser = argparse.ArgumentParser()\n",
" parser.add_argument(\n",
" \"--bucket\",\n",
" help=\"GCS path to data. We assume that data is in \\\n",
" gs://BUCKET/babyweight/preproc/\",\n",
" required=True\n",
" )\n",
" parser.add_argument(\n",
" \"--output_dir\",\n",
" help=\"GCS location to write checkpoints and export models\",\n",
" required=True\n",
" )\n",
" parser.add_argument(\n",
" \"--batch_size\",\n",
" help=\"Number of examples to compute gradient over.\",\n",
" type=int,\n",
" default=512\n",
" )\n",
" parser.add_argument(\n",
" \"--job-dir\",\n",
" help=\"this model ignores this field, but it is required by gcloud\",\n",
" default=\"junk\"\n",
" )\n",
" parser.add_argument(\n",
" \"--nnsize\",\n",
" help=\"Hidden layer sizes to use for DNN feature columns -- provide \\\n",
" space-separated layers\",\n",
" nargs=\"+\",\n",
" type=int,\n",
" default=[128, 32, 4]\n",
" )\n",
" parser.add_argument(\n",
" \"--nembeds\",\n",
" help=\"Embedding size of a cross of n key real-valued parameters\",\n",
" type=int,\n",
" default=3\n",
" )\n",
" parser.add_argument(\n",
" \"--train_examples\",\n",
" help=\"Number of examples (in thousands) to run the training job over. \\\n",
" If this is more than actual \\\n",
" So specifying 1000 here when you have only 100k examples \\\n",
" makes this 10 epochs.\",\n",
" type=int,\n",
" default=5000\n",
" )\n",
" parser.add_argument(\n",
" \"--pattern\",\n",
" help=\"Specify a pattern that has to be in input files. \\\n",
" For example 00001-of \\\n",
" will process only one shard\",\n",
" default=\"of\"\n",
" )\n",
" parser.add_argument(\n",
" \"--eval_steps\",\n",
" help=\"Positive number of steps for which to evaluate model. \\\n",
" Default to None, which means to evaluate until \\\n",
" input_fn raises an end-of-input exception\",\n",
" type=int,\n",
" default=None\n",
" )\n",
"\n",
" # Parse arguments\n",
" args = parser.parse_args()\n",
" arguments = args.__dict__\n",
"\n",
" # Pop unnecessary args needed for gcloud\n",
" arguments.pop(\"job-dir\", None)\n",
"\n",
" # Assign the arguments to the model variables\n",
" output_dir = arguments.pop(\"output_dir\")\n",
" model.BUCKET = arguments.pop(\"bucket\")\n",
" model.BATCH_SIZE = arguments.pop(\"batch_size\")\n",
" model.TRAIN_STEPS = (\n",
" arguments.pop(\"train_examples\") * 1000) / model.BATCH_SIZE\n",
" model.EVAL_STEPS = arguments.pop(\"eval_steps\")\n",
" print (\"Will train for {} steps using batch_size={}\".format(\n",
" model.TRAIN_STEPS, model.BATCH_SIZE))\n",
" model.PATTERN = arguments.pop(\"pattern\")\n",
" model.NEMBEDS = arguments.pop(\"nembeds\")\n",
" model.NNSIZE = arguments.pop(\"nnsize\")\n",
" print (\"Will use DNN size of {}\".format(model.NNSIZE))\n",
"\n",
" # Append trial_id to path if we are doing hptuning\n",
" # This code can be removed if you are not using hyperparameter tuning\n",
" output_dir = os.path.join(\n",
" output_dir,\n",
" json.loads(\n",
" os.environ.get(\"TF_CONFIG\", \"{}\")\n",
" ).get(\"task\", {}).get(\"trial\", \"\")\n",
" )\n",
"\n",
" # Run the training job\n",
" model.train_and_evaluate(output_dir)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the same way we can write to the file `model.py` the model that we developed in the previous notebooks. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile babyweight/trainer/model.py\n",
"import shutil\n",
"\n",
"import numpy as np\n",
"import tensorflow as tf\n",
"\n",
"tf.logging.set_verbosity(tf.logging.INFO)\n",
"\n",
"BUCKET = None # set from task.py\n",
"PATTERN = \"of\"\n",
"\n",
"CSV_COLUMNS = [\n",
" \"weight_pounds\",\n",
" \"is_male\",\n",
" \"mother_age\",\n",
" \"plurality\",\n",
" \"gestation_weeks\",\n",
"]\n",
"LABEL_COLUMN = \"weight_pounds\"\n",
"DEFAULTS = [[0.0], [\"null\"], [0.0], [\"null\"], [0.0]]\n",
"\n",
"TRAIN_STEPS = 10000\n",
"EVAL_STEPS = None\n",
"BATCH_SIZE = 512\n",
"NEMBEDS = 3\n",
"NNSIZE = [64, 16, 4]\n",
"\n",
"\n",
"def read_dataset(filename_pattern, mode, batch_size):\n",
"\n",
" def _input_fn():\n",
"\n",
" def decode_csv(value_column):\n",
" columns = tf.decode_csv(\n",
" records=value_column,\n",
" record_defaults=DEFAULTS\n",
" )\n",
" features = dict(zip(CSV_COLUMNS, columns))\n",
" label = features.pop(LABEL_COLUMN)\n",
" return features, label\n",
"\n",
" file_path = \"gs://{}/babyweight/preproc/{}*{}*\".format(\n",
" BUCKET, filename_pattern, PATTERN)\n",
" file_list = tf.gfile.Glob(filename=file_path)\n",
"\n",
" dataset = (\n",
" tf.data.TextLineDataset(filenames=file_list).map(\n",
" map_func=decode_csv)\n",
" )\n",
"\n",
" if mode == tf.estimator.ModeKeys.TRAIN:\n",
" num_epochs = None # indefinitely\n",
" dataset = dataset.shuffle(buffer_size=10*batch_size)\n",
" else:\n",
" num_epochs = 1\n",
" dataset = dataset.repeat(count=num_epochs).batch(batch_size=batch_size)\n",
"\n",
" return dataset\n",
"\n",
" return _input_fn\n",
"\n",
"\n",
"def get_wide_deep():\n",
"\n",
" fc_is_male = tf.feature_column.categorical_column_with_vocabulary_list(\n",
" key=\"is_male\",\n",
" vocabulary_list=[\"True\", \"False\", \"Unknown\"]\n",
" )\n",
"\n",
" fc_plurality = tf.feature_column.categorical_column_with_vocabulary_list(\n",
" key=\"plurality\",\n",
" vocabulary_list=[\n",
" \"Single(1)\",\n",
" \"Twins(2)\",\n",
" \"Triplets(3)\",\n",
" \"Quadruplets(4)\",\n",
" \"Quintuplets(5)\",\n",
" \"Multiple(2+)\"\n",
" ]\n",
" )\n",
"\n",
" fc_mother_age = tf.feature_column.numeric_column(\"mother_age\")\n",
"\n",
" fc_gestation_weeks = tf.feature_column.numeric_column(\"gestation_weeks\")\n",
"\n",
" fc_age_buckets = tf.feature_column.bucketized_column(\n",
" source_column=fc_mother_age, \n",
" boundaries=np.arange(start=15, stop=45, step=1).tolist()\n",
" )\n",
"\n",
" fc_gestation_buckets = tf.feature_column.bucketized_column(\n",
" source_column=fc_gestation_weeks,\n",
" boundaries=np.arange(start=17, stop=47, step=1).tolist())\n",
"\n",
" wide = [\n",
" fc_is_male,\n",
" fc_plurality,\n",
" fc_age_buckets,\n",
" fc_gestation_buckets\n",
" ]\n",
"\n",
" # Feature cross all the wide columns and embed into a lower dimension\n",
" crossed = tf.feature_column.crossed_column(\n",
" keys=wide, hash_bucket_size=20000\n",
" )\n",
" fc_embed = tf.feature_column.embedding_column(\n",
" categorical_column=crossed,\n",
" dimension=3\n",
" )\n",
"\n",
" # Continuous columns are deep, have a complex relationship with the output\n",
" deep = [\n",
" fc_mother_age,\n",
" fc_gestation_weeks,\n",
" fc_embed\n",
" ]\n",
"\n",
" return wide, deep\n",
"\n",
"\n",
"def serving_input_fn():\n",
" feature_placeholders = {\n",
" \"is_male\": tf.placeholder(dtype=tf.string, shape=[None]),\n",
" \"mother_age\": tf.placeholder(dtype=tf.float32, shape=[None]),\n",
" \"plurality\": tf.placeholder(dtype=tf.string, shape=[None]),\n",
" \"gestation_weeks\": tf.placeholder(dtype=tf.float32, shape=[None])\n",
" }\n",
"\n",
" features = {\n",
" key: tf.expand_dims(input=tensor, axis=-1)\n",
" for key, tensor in feature_placeholders.items()\n",
" }\n",
"\n",
" return tf.estimator.export.ServingInputReceiver(\n",
" features=features, \n",
" receiver_tensors=feature_placeholders\n",
" )\n",
"\n",
"\n",
"def my_rmse(labels, predictions):\n",
" pred_values = predictions[\"predictions\"]\n",
" return {\n",
" \"rmse\": tf.metrics.root_mean_squared_error(\n",
" labels=labels,\n",
" predictions=pred_values\n",
" )\n",
" }\n",
"\n",
"\n",
"def train_and_evaluate(output_dir):\n",
" wide, deep = get_wide_deep()\n",
" EVAL_INTERVAL = 300 # seconds\n",
"\n",
" run_config = tf.estimator.RunConfig(\n",
" save_checkpoints_secs=EVAL_INTERVAL,\n",
" keep_checkpoint_max=3)\n",
"\n",
" estimator = tf.estimator.DNNLinearCombinedRegressor(\n",
" model_dir=output_dir,\n",
" linear_feature_columns=wide,\n",
" dnn_feature_columns=deep,\n",
" dnn_hidden_units=NNSIZE,\n",
" config=run_config)\n",
"\n",
" estimator = tf.contrib.estimator.add_metrics(estimator, my_rmse)\n",
"\n",
" train_spec = tf.estimator.TrainSpec(\n",
" input_fn=read_dataset(\n",
" \"train\", tf.estimator.ModeKeys.TRAIN, BATCH_SIZE),\n",
" max_steps=TRAIN_STEPS)\n",
"\n",
" exporter = tf.estimator.LatestExporter(\n",
" name=\"exporter\",\n",
" serving_input_receiver_fn=serving_input_fn,\n",
" exports_to_keep=None)\n",
"\n",
" eval_spec = tf.estimator.EvalSpec(\n",
" input_fn=read_dataset(\n",
" \"eval\", tf.estimator.ModeKeys.EVAL, 2**15),\n",
" steps=EVAL_STEPS,\n",
" start_delay_secs=60, # start evaluating after N seconds\n",
" throttle_secs=EVAL_INTERVAL, # evaluate every N seconds\n",
" exporters=exporter)\n",
"\n",
" tf.estimator.train_and_evaluate(\n",
" estimator=estimator,\n",
" train_spec=train_spec,\n",
" eval_spec=eval_spec\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train locally\n",
"\n",
"After moving the code to a package, make sure it works as a standalone. Note, we incorporated the `--pattern` and `--train_examples` flags so that we don't try to train on the entire dataset while we are developing our pipeline. Once we are sure that everything is working on a subset, we can change the pattern so that we can train on all the data. Even for this subset, this takes about *3 minutes* in which you won't see any output ..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"echo \"bucket=$BUCKET\"\n",
"rm -rf babyweight_trained\n",
"export PYTHONPATH=${PYTHONPATH}:${PWD}/babyweight\n",
"python -m trainer.task \\\n",
" --bucket=$BUCKET \\\n",
" --output_dir=babyweight_trained \\\n",
" --job-dir=./tmp \\\n",
" --pattern=\"00000-of-\"\\\n",
" --train_examples=1 \\\n",
" --eval_steps=1"
]
},
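{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (assuming the local run above completed without errors), the output directory should now contain checkpoints at the top level and a SavedModel written by the `LatestExporter` under `export/exporter/`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"# Checkpoints and event files are written at the top level of the output dir;\n",
"# the exported SavedModel lives under export/exporter/<timestamp>/\n",
"ls babyweight_trained\n",
"ls babyweight_trained/export/exporter"
]
},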
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Making predictions\n",
"\n",
"The JSON below represents an input into your prediction model. Write the input.json file below with the next cell, then run the prediction locally to assess whether it produces predictions correctly."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile inputs.json\n",
"{\"is_male\": \"True\", \"mother_age\": 26.0, \"plurality\": \"Single(1)\", \"gestation_weeks\": 39}\n",
"{\"is_male\": \"False\", \"mother_age\": 26.0, \"plurality\": \"Single(1)\", \"gestation_weeks\": 39}"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"MODEL_LOCATION=$(ls -d $(pwd)/babyweight_trained/export/exporter/* | tail -1)\n",
"echo $MODEL_LOCATION\n",
"gcloud ml-engine local predict --model-dir=$MODEL_LOCATION --json-instances=inputs.json"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Training on the Cloud with CAIP\n",
"\n",
"Once the code works in standalone mode, you can run it on Cloud AI Platform. Because this is on the entire dataset, it will take a while. The training run took about <b> an hour </b> for me. You can monitor the job from the GCP console in the Cloud AI Platform section."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"OUTDIR=gs://${BUCKET}/babyweight/trained_model\n",
"JOBNAME=babyweight_$(date -u +%y%m%d_%H%M%S)\n",
"echo $OUTDIR $REGION $JOBNAME\n",
"gsutil -m rm -rf $OUTDIR\n",
"gcloud ai-platform jobs submit training $JOBNAME \\\n",
" --region=$REGION \\\n",
" --module-name=trainer.task \\\n",
" --package-path=$(pwd)/babyweight/trainer \\\n",
" --job-dir=$OUTDIR \\\n",
" --staging-bucket=gs://$BUCKET \\\n",
" --scale-tier=STANDARD_1 \\\n",
" --runtime-version=$TFVERSION \\\n",
" -- \\\n",
" --bucket=${BUCKET} \\\n",
" --output_dir=${OUTDIR} \\\n",
" --train_examples=200000CAIP"
]
},
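{
"cell_type": "markdown",
"metadata": {},
"source": [
"Besides the GCP console, you can check on the submitted job from the command line. The sketch below lists recent jobs in the project; to tail the logs of the job submitted above, fill in the job name echoed by the previous cell and uncomment the last line."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"# List training jobs in this project; the job submitted above should be among them\n",
"gcloud ai-platform jobs list --limit=5\n",
"\n",
"# To follow the logs of a specific job, fill in its name and uncomment:\n",
"# gcloud ai-platform jobs stream-logs <JOBNAME>"
]
},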
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When I ran it, I used train_examples=2000000. When training finished, I filtered in the Stackdriver log on the word \"dict\" and saw that the last line was:\n",
"<pre>\n",
"Saving dict for global step 5714290: average_loss = 1.06473, global_step = 5714290, loss = 34882.4, rmse = 1.03186\n",
"</pre>\n",
"The final RMSE was 1.03 pounds."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2> Hyperparameter tuning </h2>\n",
"<p>\n",
"All of these are command-line parameters to my program. To do hyperparameter tuning, create hyperparam.xml and pass it as --configFile.\n",
"This step will take <b>1 hour</b> -- you can increase maxParallelTrials or reduce maxTrials to get it done faster. Since maxParallelTrials is the number of initial seeds to start searching from, you don't want it to be too large; otherwise, all you have is a random search.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile hyperparam.yaml\n",
"trainingInput:\n",
" scaleTier: STANDARD_1\n",
" hyperparameters:\n",
" hyperparameterMetricTag: rmse\n",
" goal: MINIMIZE\n",
" maxTrials: 20\n",
" maxParallelTrials: 5\n",
" enableTrialEarlyStopping: True\n",
" params:\n",
" - parameterName: batch_size\n",
" type: INTEGER\n",
" minValue: 8\n",
" maxValue: 512\n",
" scaleType: UNIT_LOG_SCALE\n",
" - parameterName: nembeds\n",
" type: INTEGER\n",
" minValue: 3\n",
" maxValue: 30\n",
" scaleType: UNIT_LINEAR_SCALE\n",
" - parameterName: nnsize\n",
" type: INTEGER\n",
" minValue: 64\n",
" maxValue: 512\n",
" scaleType: UNIT_LOG_SCALE"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"OUTDIR=gs://${BUCKET}/babyweight/hyperparam\n",
"JOBNAME=babyweight_$(date -u +%y%m%d_%H%M%S)\n",
"echo $OUTDIR $REGION $JOBNAME\n",
"gsutil -m rm -rf $OUTDIR\n",
"gcloud ai-platform jobs submit training $JOBNAME \\\n",
" --region=$REGION \\\n",
" --module-name=trainer.task \\\n",
" --package-path=$(pwd)/babyweight/trainer \\\n",
" --job-dir=$OUTDIR \\\n",
" --staging-bucket=gs://$BUCKET \\\n",
" --scale-tier=STANDARD_1 \\\n",
" --config=hyperparam.yaml \\\n",
" --runtime-version=$TFVERSION \\\n",
" -- \\\n",
" --bucket=${BUCKET} \\\n",
" --output_dir=${OUTDIR} \\\n",
" --eval_steps=10 \\\n",
" --train_examples=20000"
]
},
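{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once the tuning job finishes, the per-trial results (including the `rmse` metric we asked the service to minimize) can be inspected in the console, or from the command line as sketched below -- fill in the job name echoed by the submission cell above and uncomment the last line."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"# The trainingOutput section of the job description lists each trial with its\n",
"# hyperparameter values and final objective value (rmse).\n",
"# gcloud ai-platform jobs describe <JOBNAME>"
]
},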
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Repeat training \n",
"\n",
"Now that we've determined the optimal hyparameters, we'll retrain with these tuned parameters. Note the last line. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"OUTDIR=gs://${BUCKET}/babyweight/trained_model_tuned\n",
"JOBNAME=babyweight_$(date -u +%y%m%d_%H%M%S)\n",
"echo $OUTDIR $REGION $JOBNAME\n",
"gsutil -m rm -rf $OUTDIR\n",
"gcloud ai-platform jobs submit training $JOBNAME \\\n",
" --region=$REGION \\\n",
" --module-name=trainer.task \\\n",
" --package-path=$(pwd)/babyweight/trainer \\\n",
" --job-dir=$OUTDIR \\\n",
" --staging-bucket=gs://$BUCKET \\\n",
" --scale-tier=STANDARD_1 \\\n",
" --runtime-version=$TFVERSION \\\n",
" -- \\\n",
" --bucket=${BUCKET} \\\n",
" --output_dir=${OUTDIR} \\\n",
" --train_examples=20000 --batch_size=35 --nembeds=16 --nnsize=281"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright 2017 Google Inc. Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}