courses/machine_learning/deepdive/09_sequence/poetry.ipynb

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Text generation using tensor2tensor on Cloud ML Engine\n", "\n", "This notebook illustrates using the <a href=\"https://github.com/tensorflow/tensor2tensor\">tensor2tensor</a> library to do from-scratch, distributed training of a poetry model. Then, the trained model is used to complete new poems.\n", "\n", "<br/>\n", "\n", "### Install tensor2tensor, and specify Google Cloud Platform project and bucket" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Install the necessary packages. tensor2tensor will give us the Transformer model. Project Gutenberg gives us access to historical poems.\n", "\n", "\n", "<b>p.s.</b> Note that this notebook uses Python2 because Project Gutenberg relies on BSD-DB which was deprecated in Python 3 and removed from the standard library.\n", "tensor2tensor itself can be used on Python 3. It's just Project Gutenberg that has this issue." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "pip freeze | grep tensor" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "pip install tensor2tensor==1.13.1 tensorflow==1.13.1 tensorflow-serving-api==1.13 gutenberg \n", "pip install tensorflow_hub \n", "\n", "# install from sou\n", "#git clone https://github.com/tensorflow/tensor2tensor.git\n", "#cd tensor2tensor\n", "#yes | pip install --user -e ." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the following cell does not reflect the version of tensorflow and tensor2tensor that you just installed, click **\"Reset Session\"** on the notebook so that the Python environment picks up the new packages." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "pip freeze | grep tensor" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "PROJECT = 'cloud-training-demos' # REPLACE WITH YOUR PROJECT ID\n", "BUCKET = 'cloud-training-demos-ml' # REPLACE WITH YOUR BUCKET NAME\n", "REGION = 'us-central1' # REPLACE WITH YOUR BUCKET REGION e.g. us-central1\n", "\n", "# this is what this notebook is demonstrating\n", "PROBLEM= 'poetry_line_problem'\n", "\n", "# for bash\n", "os.environ['PROJECT'] = PROJECT\n", "os.environ['BUCKET'] = BUCKET\n", "os.environ['REGION'] = REGION\n", "os.environ['PROBLEM'] = PROBLEM\n", "\n", "#os.environ['PATH'] = os.environ['PATH'] + ':' + os.getcwd() + '/tensor2tensor/tensor2tensor/bin/'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "gcloud config set project $PROJECT\n", "gcloud config set compute/region $REGION" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Download data\n", "\n", "We will get some <a href=\"https://www.gutenberg.org/wiki/Poetry_(Bookshelf)\">poetry anthologies</a> from Project Gutenberg." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "rm -rf data/poetry\n", "mkdir -p data/poetry" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from gutenberg.acquire import load_etext\n", "from gutenberg.cleanup import strip_headers\n", "import re\n", "\n", "books = [\n", " # bookid, skip N lines\n", " (26715, 1000, 'Victorian songs'),\n", " (30235, 580, 'Baldwin collection'),\n", " (35402, 710, 'Swinburne collection'),\n", " (574, 15, 'Blake'),\n", " (1304, 172, 'Bulchevys collection'),\n", " (19221, 223, 'Palgrave-Pearse collection'),\n", " (15553, 522, 'Knowles collection') \n", "]\n", "\n", "with open('data/poetry/raw.txt', 'w') as ofp:\n", " lineno = 0\n", " for (id_nr, toskip, title) in books:\n", " startline = lineno\n", " text = strip_headers(load_etext(id_nr)).strip()\n", " lines = text.split('\\n')[toskip:]\n", " # any line that is all upper case is a title or author name\n", " # also don't want any lines with years (numbers)\n", " for line in lines:\n", " if (len(line) > 0 \n", " and line.upper() != line \n", " and not re.match('.*[0-9]+.*', line)\n", " and len(line) < 50\n", " ):\n", " cleaned = re.sub('[^a-z\\'\\-]+', ' ', line.strip().lower())\n", " ofp.write(cleaned)\n", " ofp.write('\\n')\n", " lineno = lineno + 1\n", " else:\n", " ofp.write('\\n')\n", " print('Wrote lines {} to {} from {}'.format(startline, lineno, title))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!wc -l data/poetry/*.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create training dataset\n", "\n", "We are going to train a machine learning model to write poetry given a starting point. We'll give it one line, and it is going to tell us the next line. So, naturally, we will train it on real poetry. Our feature will be a line of a poem and the label will be next line of that poem.\n", "<p>\n", "Our training dataset will consist of two files. The first file will consist of the input lines of poetry and the other file will consist of the corresponding output lines, one output line per input line." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open('data/poetry/raw.txt', 'r') as rawfp,\\\n", " open('data/poetry/input.txt', 'w') as infp,\\\n", " open('data/poetry/output.txt', 'w') as outfp:\n", " \n", " prev_line = ''\n", " for curr_line in rawfp:\n", " curr_line = curr_line.strip()\n", " # poems break at empty lines, so this ensures we train only\n", " # on lines of the same poem\n", " if len(prev_line) > 0 and len(curr_line) > 0: \n", " infp.write(prev_line + '\\n')\n", " outfp.write(curr_line + '\\n')\n", " prev_line = curr_line " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!head -5 data/poetry/*.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We do not need to generate the data beforehand -- instead, we can have Tensor2Tensor create the training dataset for us. So, in the code below, I will use only data/poetry/raw.txt -- obviously, this allows us to productionize our model better. Simply keep collecting raw data and generate the training/test data at the time of training." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set up problem\n", "The Problem in tensor2tensor is where you specify parameters like the size of your vocabulary and where to get the training data from." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "rm -rf poetry\n", "mkdir -p poetry/trainer" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile poetry/trainer/problem.py\n", "import os\n", "import tensorflow as tf\n", "from tensor2tensor.utils import registry\n", "from tensor2tensor.models import transformer\n", "from tensor2tensor.data_generators import problem\n", "from tensor2tensor.data_generators import text_encoder\n", "from tensor2tensor.data_generators import text_problems\n", "from tensor2tensor.data_generators import generator_utils\n", "\n", "tf.summary.FileWriterCache.clear() # ensure filewriter cache is clear for TensorBoard events file\n", "\n", "@registry.register_problem\n", "class PoetryLineProblem(text_problems.Text2TextProblem):\n", " \"\"\"Predict next line of poetry from the last line. From Gutenberg texts.\"\"\"\n", "\n", " @property\n", " def approx_vocab_size(self):\n", " return 2**13 # ~8k\n", "\n", " @property\n", " def is_generate_per_split(self):\n", " # generate_data will NOT shard the data into TRAIN and EVAL for us.\n", " return False\n", "\n", " @property\n", " def dataset_splits(self):\n", " \"\"\"Splits of data to produce and number of output shards for each.\"\"\"\n", " # 10% evaluation data\n", " return [{\n", " \"split\": problem.DatasetSplit.TRAIN,\n", " \"shards\": 90,\n", " }, {\n", " \"split\": problem.DatasetSplit.EVAL,\n", " \"shards\": 10,\n", " }]\n", "\n", " def generate_samples(self, data_dir, tmp_dir, dataset_split):\n", " with open('data/poetry/raw.txt', 'r') as rawfp:\n", " prev_line = ''\n", " for curr_line in rawfp:\n", " curr_line = curr_line.strip()\n", " # poems break at empty lines, so this ensures we train only\n", " # on lines of the same poem\n", " if len(prev_line) > 0 and len(curr_line) > 0: \n", " yield {\n", " \"inputs\": prev_line,\n", " \"targets\": curr_line\n", " }\n", " prev_line = curr_line \n", "\n", "\n", "# Smaller than the typical translate model, and with more regularization\n", "@registry.register_hparams\n", "def transformer_poetry():\n", " hparams = transformer.transformer_base()\n", " hparams.num_hidden_layers = 2\n", " hparams.hidden_size = 128\n", " hparams.filter_size = 512\n", " hparams.num_heads = 4\n", " hparams.attention_dropout = 0.6\n", " hparams.layer_prepostprocess_dropout = 0.6\n", " hparams.learning_rate = 0.05\n", " return hparams\n", "\n", "@registry.register_hparams\n", "def transformer_poetry_tpu():\n", " hparams = transformer_poetry()\n", " transformer.update_hparams_for_tpu(hparams)\n", " return hparams\n", "\n", "# hyperparameter tuning ranges\n", "@registry.register_ranged_hparams\n", "def transformer_poetry_range(rhp):\n", " rhp.set_float(\"learning_rate\", 0.05, 0.25, scale=rhp.LOG_SCALE)\n", " rhp.set_int(\"num_hidden_layers\", 2, 4)\n", " rhp.set_discrete(\"hidden_size\", [128, 256, 512])\n", " rhp.set_float(\"attention_dropout\", 0.4, 0.7)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile poetry/trainer/__init__.py\n", "from . 
import problem" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile poetry/setup.py\n", "from setuptools import find_packages\n", "from setuptools import setup\n", "\n", "REQUIRED_PACKAGES = [\n", " 'tensor2tensor'\n", "]\n", "\n", "setup(\n", " name='poetry',\n", " version='0.1',\n", " author = 'Google',\n", " author_email = 'training-feedback@cloud.google.com',\n", " install_requires=REQUIRED_PACKAGES,\n", " packages=find_packages(),\n", " include_package_data=True,\n", " description='Poetry Line Problem',\n", " requires=[]\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!touch poetry/__init__.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!find poetry" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generate training data \n", "\n", "Our problem (translation) requires the creation of text sequences from the training dataset. This is done using t2t-datagen and the Problem defined in the previous section.\n", "\n", "(Ignore any runtime warnings about np.float64. they are harmless)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "DATA_DIR=./t2t_data\n", "TMP_DIR=$DATA_DIR/tmp\n", "rm -rf $DATA_DIR $TMP_DIR\n", "mkdir -p $DATA_DIR $TMP_DIR\n", "# Generate data\n", "t2t-datagen \\\n", " --t2t_usr_dir=./poetry/trainer \\\n", " --problem=$PROBLEM \\\n", " --data_dir=$DATA_DIR \\\n", " --tmp_dir=$TMP_DIR" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check to see the files that were output. If you see a broken pipe error, please ignore." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!ls t2t_data | head" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Provide Cloud ML Engine access to data\n", "\n", "Copy the data to Google Cloud Storage, and then provide access to the data. `gsutil` throws an error when removing an empty bucket, so you may see an error the first time this code is run." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "DATA_DIR=./t2t_data\n", "gsutil -m rm -r gs://${BUCKET}/poetry/\n", "gsutil -m cp ${DATA_DIR}/${PROBLEM}* ${DATA_DIR}/vocab* gs://${BUCKET}/poetry/data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "PROJECT_ID=$PROJECT\n", "AUTH_TOKEN=$(gcloud auth print-access-token)\n", "SVC_ACCOUNT=$(curl -X GET -H \"Content-Type: application/json\" \\\n", " -H \"Authorization: Bearer $AUTH_TOKEN\" \\\n", " https://ml.googleapis.com/v1/projects/${PROJECT_ID}:getConfig \\\n", " | python -c \"import json; import sys; response = json.load(sys.stdin); \\\n", " print(response['serviceAccount'])\")\n", "\n", "echo \"Authorizing the Cloud ML Service account $SVC_ACCOUNT to access files in $BUCKET\"\n", "gsutil -m defacl ch -u $SVC_ACCOUNT:R gs://$BUCKET\n", "gsutil -m acl ch -u $SVC_ACCOUNT:R -r gs://$BUCKET # error message (if bucket is empty) can be ignored\n", "gsutil -m acl ch -u $SVC_ACCOUNT:W gs://$BUCKET" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train model locally on subset of data\n", "\n", "Let's run it locally on a subset of the data to make sure it works." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "BASE=gs://${BUCKET}/poetry/data\n", "OUTDIR=gs://${BUCKET}/poetry/subset\n", "gsutil -m rm -r $OUTDIR\n", "gsutil -m cp \\\n", " ${BASE}/${PROBLEM}-train-0008* \\\n", " ${BASE}/${PROBLEM}-dev-00000* \\\n", " ${BASE}/vocab* \\\n", " $OUTDIR" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: the following will work only if you are running Jupyter on a reasonably powerful machine. Don't be alarmed if your process is killed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "DATA_DIR=gs://${BUCKET}/poetry/subset\n", "OUTDIR=./trained_model\n", "rm -rf $OUTDIR\n", "t2t-trainer \\\n", " --data_dir=gs://${BUCKET}/poetry/subset \\\n", " --t2t_usr_dir=./poetry/trainer \\\n", " --problem=$PROBLEM \\\n", " --model=transformer \\\n", " --hparams_set=transformer_poetry \\\n", " --output_dir=$OUTDIR --job-dir=$OUTDIR --train_steps=10" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Option 1: Train model locally on full dataset (use if running on Notebook Instance with a GPU)\n", "\n", "You can train on the full dataset if you are on a Google Cloud Notebook Instance with a P100 or better GPU" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "LOCALGPU=\"--train_steps=7500 --worker_gpu=1 --hparams_set=transformer_poetry\"\n", "\n", "DATA_DIR=gs://${BUCKET}/poetry/data\n", "OUTDIR=gs://${BUCKET}/poetry/model\n", "rm -rf $OUTDIR\n", "t2t-trainer \\\n", " --data_dir=gs://${BUCKET}/poetry/subset \\\n", " --t2t_usr_dir=./poetry/trainer \\\n", " --problem=$PROBLEM \\\n", " --model=transformer \\\n", " --hparams_set=transformer_poetry \\\n", " --output_dir=$OUTDIR ${LOCALGPU}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Option 2: Train on Cloud ML Engine\n", "\n", "tensor2tensor has a convenient --cloud_mlengine option to kick off the training on the managed service.\n", "It uses the [Python API](https://cloud.google.com/ml-engine/docs/training-jobs) mentioned in the Cloud ML Engine docs, rather than requiring you to use gcloud to submit the job.\n", "<p>\n", "Note: your project needs P100 quota in the region.\n", "<p>\n", "The echo is because t2t-trainer asks you to confirm before submitting the job to the cloud. Ignore any error about \"broken pipe\".\n", "If you see a message similar to this:\n", "<pre>\n", " [... cloud_mlengine.py:392] Launched transformer_poetry_line_problem_t2t_20190323_000631. See console to track: https://console.cloud.google.com/mlengine/jobs/.\n", "</pre>\n", "then, this step has been successful." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "GPU=\"--train_steps=7500 --cloud_mlengine --worker_gpu=1 --hparams_set=transformer_poetry\"\n", "\n", "DATADIR=gs://${BUCKET}/poetry/data\n", "OUTDIR=gs://${BUCKET}/poetry/model\n", "JOBNAME=poetry_$(date -u +%y%m%d_%H%M%S)\n", "echo $OUTDIR $REGION $JOBNAME\n", "gsutil -m rm -rf $OUTDIR\n", "echo \"'Y'\" | t2t-trainer \\\n", " --data_dir=gs://${BUCKET}/poetry/subset \\\n", " --t2t_usr_dir=./poetry/trainer \\\n", " --problem=$PROBLEM \\\n", " --model=transformer \\\n", " --output_dir=$OUTDIR \\\n", " ${GPU}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "## CHANGE the job name (based on output above: You will see a line such as Launched transformer_poetry_line_problem_t2t_20190322_233159)\n", "gcloud ml-engine jobs describe transformer_poetry_line_problem_t2t_20190323_003001" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The job took about <b>25 minutes</b> for me and ended with these evaluation metrics:\n", "<pre>\n", "Saving dict for global step 8000: global_step = 8000, loss = 6.03338, metrics-poetry_line_problem/accuracy = 0.138544, metrics-poetry_line_problem/accuracy_per_sequence = 0.0, metrics-poetry_line_problem/accuracy_top5 = 0.232037, metrics-poetry_line_problem/approx_bleu_score = 0.00492648, metrics-poetry_line_problem/neg_log_perplexity = -6.68994, metrics-poetry_line_problem/rouge_2_fscore = 0.00256089, metrics-poetry_line_problem/rouge_L_fscore = 0.128194\n", "</pre>\n", "Notice that accuracy_per_sequence is 0 -- Considering that we are asking the NN to be rather creative, that doesn't surprise me. Why am I looking at accuracy_per_sequence and not the other metrics? This is because it is more appropriate for problem we are solving; metrics like Bleu score are better for translation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Option 3: Train on a directly-connected TPU\n", "\n", "If you are running on a VM connected directly to a Cloud TPU, you can run t2t-trainer directly. Unfortunately, you won't see any output from Jupyter while the program is running.\n", "\n", "Compare this command line to the one using GPU in the previous section." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "# use one of these\n", "TPU=\"--train_steps=7500 --use_tpu=True --cloud_tpu_name=laktpu --hparams_set=transformer_poetry_tpu\"\n", "\n", "DATADIR=gs://${BUCKET}/poetry/data\n", "OUTDIR=gs://${BUCKET}/poetry/model_tpu\n", "JOBNAME=poetry_$(date -u +%y%m%d_%H%M%S)\n", "echo $OUTDIR $REGION $JOBNAME\n", "gsutil -m rm -rf $OUTDIR\n", "echo \"'Y'\" | t2t-trainer \\\n", " --data_dir=gs://${BUCKET}/poetry/subset \\\n", " --t2t_usr_dir=./poetry/trainer \\\n", " --problem=$PROBLEM \\\n", " --model=transformer \\\n", " --output_dir=$OUTDIR \\\n", " ${TPU}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "gsutil ls gs://${BUCKET}/poetry/model_tpu" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The job took about <b>10 minutes</b> for me and ended with these evaluation metrics:\n", "<pre>\n", "Saving dict for global step 8000: global_step = 8000, loss = 6.03338, metrics-poetry_line_problem/accuracy = 0.138544, metrics-poetry_line_problem/accuracy_per_sequence = 0.0, metrics-poetry_line_problem/accuracy_top5 = 0.232037, metrics-poetry_line_problem/approx_bleu_score = 0.00492648, metrics-poetry_line_problem/neg_log_perplexity = -6.68994, metrics-poetry_line_problem/rouge_2_fscore = 0.00256089, metrics-poetry_line_problem/rouge_L_fscore = 0.128194\n", "</pre>\n", "Notice that accuracy_per_sequence is 0 -- Considering that we are asking the NN to be rather creative, that doesn't surprise me. Why am I looking at accuracy_per_sequence and not the other metrics? This is because it is more appropriate for problem we are solving; metrics like Bleu score are better for translation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Option 4: Training longer\n", "\n", "Let's train on 4 GPUs for 75,000 steps. Note the change in the last line of the job." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "\n", "XXX This takes 3 hours on 4 GPUs. Remove this line if you are sure you want to do this.\n", "\n", "DATADIR=gs://${BUCKET}/poetry/data\n", "OUTDIR=gs://${BUCKET}/poetry/model_full2\n", "JOBNAME=poetry_$(date -u +%y%m%d_%H%M%S)\n", "echo $OUTDIR $REGION $JOBNAME\n", "gsutil -m rm -rf $OUTDIR\n", "echo \"'Y'\" | t2t-trainer \\\n", " --data_dir=gs://${BUCKET}/poetry/subset \\\n", " --t2t_usr_dir=./poetry/trainer \\\n", " --problem=$PROBLEM \\\n", " --model=transformer \\\n", " --hparams_set=transformer_poetry \\\n", " --output_dir=$OUTDIR \\\n", " --train_steps=75000 --cloud_mlengine --worker_gpu=4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This job took <b>12 hours</b> for me and ended with these metrics:\n", "<pre>\n", "global_step = 76000, loss = 4.99763, metrics-poetry_line_problem/accuracy = 0.219792, metrics-poetry_line_problem/accuracy_per_sequence = 0.0192308, metrics-poetry_line_problem/accuracy_top5 = 0.37618, metrics-poetry_line_problem/approx_bleu_score = 0.017955, metrics-poetry_line_problem/neg_log_perplexity = -5.38725, metrics-poetry_line_problem/rouge_2_fscore = 0.0325563, metrics-poetry_line_problem/rouge_L_fscore = 0.210618\n", "</pre>\n", "At least the accuracy per sequence is no longer zero. It is now 0.0192308 ... 
note that we are using a relatively small dataset (12K lines) and this is *tiny* in the world of natural language problems.\n", "<p>\n", "In order that you have your expectations set correctly: a high-performing translation model needs 400-million lines of input and takes 1 whole day on a TPU pod!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Check trained model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "gsutil ls gs://${BUCKET}/poetry/model #_modeltpu" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Batch-predict\n", "\n", "How will our poetry model do when faced with Rumi's spiritual couplets?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile data/poetry/rumi.txt\n", "Where did the handsome beloved go?\n", "I wonder, where did that tall, shapely cypress tree go?\n", "He spread his light among us like a candle.\n", "Where did he go? So strange, where did he go without me?\n", "All day long my heart trembles like a leaf.\n", "All alone at midnight, where did that beloved go?\n", "Go to the road, and ask any passing traveler — \n", "That soul-stirring companion, where did he go?\n", "Go to the garden, and ask the gardener — \n", "That tall, shapely rose stem, where did he go?\n", "Go to the rooftop, and ask the watchman — \n", "That unique sultan, where did he go?\n", "Like a madman, I search in the meadows!\n", "That deer in the meadows, where did he go?\n", "My tearful eyes overflow like a river — \n", "That pearl in the vast sea, where did he go?\n", "All night long, I implore both moon and Venus — \n", "That lovely face, like a moon, where did he go?\n", "If he is mine, why is he with others?\n", "Since he’s not here, to what “there” did he go?\n", "If his heart and soul are joined with God,\n", "And he left this realm of earth and water, where did he go?\n", "Tell me clearly, Shams of Tabriz,\n", "Of whom it is said, “The sun never dies” — where did he go?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's write out the odd-numbered lines. We'll compare how close our model can get to the beauty of Rumi's second lines given his first." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "awk 'NR % 2 == 1' data/poetry/rumi.txt | tr '[:upper:]' '[:lower:]' | sed \"s/[^a-z\\'-\\ ]//g\" > data/poetry/rumi_leads.txt\n", "head -3 data/poetry/rumi_leads.txt" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "# same as the above training job ...\n", "TOPDIR=gs://${BUCKET}\n", "OUTDIR=${TOPDIR}/poetry/model #_tpu # or ${TOPDIR}/poetry/model_full\n", "DATADIR=${TOPDIR}/poetry/data\n", "MODEL=transformer\n", "HPARAMS=transformer_poetry #_tpu\n", "\n", "# the file with the input lines\n", "DECODE_FILE=data/poetry/rumi_leads.txt\n", "\n", "BEAM_SIZE=4\n", "ALPHA=0.6\n", "\n", "t2t-decoder \\\n", " --data_dir=$DATADIR \\\n", " --problem=$PROBLEM \\\n", " --model=$MODEL \\\n", " --hparams_set=$HPARAMS \\\n", " --output_dir=$OUTDIR \\\n", " --t2t_usr_dir=./poetry/trainer \\\n", " --decode_hparams=\"beam_size=$BEAM_SIZE,alpha=$ALPHA\" \\\n", " --decode_from_file=$DECODE_FILE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<b> Note </b> if you get an error about \"AttributeError: 'HParams' object has no attribute 'problems'\" please <b>Reset Session</b>, run the cell that defines the PROBLEM and run the above cell again." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash \n", "DECODE_FILE=data/poetry/rumi_leads.txt\n", "cat ${DECODE_FILE}.*.decodes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some of these are still phrases and not complete sentences. This indicates that we might need to train longer or better somehow. We need to diagnose the model ...\n", "<p>\n", " \n", "### Diagnosing training run\n", "\n", "<p>\n", "Let's diagnose the training run to see what we'd improve the next time around.\n", "(Note that this package may not be present on Jupyter -- `pip install pydatalab` if necessary)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Monitor training with TensorBoard\n", "\n", "To activate TensorBoard within the JupyterLab UI navigate to \"<b>File</b>\" - \"<b>New Launcher</b>\". Then double-click the 'Tensorboard' icon on the bottom row.\n", "\n", "TensorBoard 1 will appear in the new tab. Navigate through the three tabs to see the active TensorBoard. The 'Graphs' and 'Projector' tabs offer very interesting information including the ability to replay the tests.\n", "\n", "You may close the TensorBoard tab when you are finished exploring." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<table>\n", "<tr>\n", "<td><img src=\"diagrams/poetry_loss.png\"/></td>\n", "<td><img src=\"diagrams/poetry_acc.png\"/></td>\n", "Looking at the loss curve, it is clear that we are overfitting (note that the orange training curve is well below the blue eval curve). Both loss curves and the accuracy-per-sequence curve, which is our key evaluation measure, plateaus after 40k. (The red curve is a faster way of computing the evaluation metric, and can be ignored). So, how do we improve the model? Well, we need to reduce overfitting and make sure the eval metrics keep going down as long as the loss is also going down.\n", "<p>\n", "What we really need to do is to get more data, but if that's not an option, we could try to reduce the NN and increase the dropout regularization. We could also do hyperparameter tuning on the dropout and network sizes." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Hyperparameter tuning\n", "\n", "tensor2tensor also supports hyperparameter tuning on Cloud ML Engine. Note the addition of the autotune flags.\n", "<p>\n", "The `transformer_poetry_range` was registered in problem.py above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "\n", "XXX This takes about 15 hours and consumes about 420 ML units. Uncomment if you wish to proceed anyway\n", "\n", "DATADIR=gs://${BUCKET}/poetry/data\n", "OUTDIR=gs://${BUCKET}/poetry/model_hparam\n", "JOBNAME=poetry_$(date -u +%y%m%d_%H%M%S)\n", "echo $OUTDIR $REGION $JOBNAME\n", "gsutil -m rm -rf $OUTDIR\n", "echo \"'Y'\" | t2t-trainer \\\n", " --data_dir=gs://${BUCKET}/poetry/subset \\\n", " --t2t_usr_dir=./poetry/trainer \\\n", " --problem=$PROBLEM \\\n", " --model=transformer \\\n", " --hparams_set=transformer_poetry \\\n", " --output_dir=$OUTDIR \\\n", " --hparams_range=transformer_poetry_range \\\n", " --autotune_objective='metrics-poetry_line_problem/accuracy_per_sequence' \\\n", " --autotune_maximize \\\n", " --autotune_max_trials=4 \\\n", " --autotune_parallel_trials=4 \\\n", " --train_steps=7500 --cloud_mlengine --worker_gpu=4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When I ran the above job, it took about 15 hours and finished with these as the best parameters:\n", "<pre>\n", "{\n", " \"trialId\": \"37\",\n", " \"hyperparameters\": {\n", " \"hp_num_hidden_layers\": \"4\",\n", " \"hp_learning_rate\": \"0.026711152525921437\",\n", " \"hp_hidden_size\": \"512\",\n", " \"hp_attention_dropout\": \"0.60589466163419292\"\n", " },\n", " \"finalMetric\": {\n", " \"trainingStep\": \"8000\",\n", " \"objectiveValue\": 0.0276162791997\n", " }\n", "</pre>\n", "In other words, the accuracy per sequence achieved was 0.027 (as compared to 0.019 before hyperparameter tuning, so a <b>40% improvement!</b>) using 4 hidden layers, a learning rate of 0.0267, a hidden size of 512 and droput probability of 0.606. This is inspite of training for only 7500 steps instead of 75,000 steps ... we could train for 75k steps with these parameters, but I'll leave that as an exercise for you.\n", "<p>\n", "Instead, let's try predicting with this optimized model. Note the addition of the hp* flags in order to override the values hardcoded in the source code. (there is no need to specify learning rate and dropout because they are not used during inference). 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "# same as the above training job ...\n", "BEST_TRIAL=37 # CHANGE based on your job's best trialId\n", "TOPDIR=gs://${BUCKET}\n", "OUTDIR=${TOPDIR}/poetry/model_hparam/$BEST_TRIAL\n", "DATADIR=${TOPDIR}/poetry/data\n", "MODEL=transformer\n", "HPARAMS=transformer_poetry\n", "\n", "# the file with the input lines\n", "DECODE_FILE=data/poetry/rumi_leads.txt\n", "\n", "BEAM_SIZE=4\n", "ALPHA=0.6\n", "\n", "t2t-decoder \\\n", "  --data_dir=$DATADIR \\\n", "  --problem=$PROBLEM \\\n", "  --model=$MODEL \\\n", "  --hparams_set=$HPARAMS \\\n", "  --output_dir=$OUTDIR \\\n", "  --t2t_usr_dir=./poetry/trainer \\\n", "  --decode_hparams=\"beam_size=$BEAM_SIZE,alpha=$ALPHA\" \\\n", "  --decode_from_file=$DECODE_FILE \\\n", "  --hparams=\"num_hidden_layers=4,hidden_size=512\"" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "DECODE_FILE=data/poetry/rumi_leads.txt\n", "cat ${DECODE_FILE}.*.decodes" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Take the first three lines. I'm showing the first line of the couplet provided to the model, how the AI model we trained completes it, and how Rumi completes it:\n", "<p>\n", "INPUT: where did the handsome beloved go <br/>\n", "AI: where art thou worse to me than dead <br/>\n", "RUMI: I wonder, where did that tall, shapely cypress tree go?\n", "<p>\n", "INPUT: he spread his light among us like a candle <br/>\n", "AI: like the hurricane eclipse <br/>\n", "RUMI: Where did he go? So strange, where did he go without me? <br/>\n", "<p>\n", "INPUT: all day long my heart trembles like a leaf <br/>\n", "AI: and through their hollow aisles it plays <br/>\n", "RUMI: All alone at midnight, where did that beloved go?\n", "<p>\n", "Oh wow. The completed couplets are quite decent considering that:\n", "* We trained the model on classical English poetry, so feeding it Rumi is a bit out of left field.\n", "* Rumi, of course, has a context and thread running through his lines, while the AI (since it was fed only that one line) doesn't.\n", "\n", "<p>\n", "\"Spreading light like a hurricane eclipse\" is a metaphor I won't soon forget. And it was created by a machine learning model!" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Serving poetry\n", "\n", "How would you serve these predictions? There are two ways:\n", "<ol>\n", "<li> Use [Cloud ML Engine](https://cloud.google.com/ml-engine/docs/deploying-models) -- this is serverless and you don't have to manage any infrastructure.\n", "<li> Use [Kubeflow](https://github.com/kubeflow/kubeflow/blob/master/user_guide.md) on Google Kubernetes Engine -- this uses clusters but will also work on-prem on your own Kubernetes cluster.\n", "</ol>\n", "<p>\n", "In either case, you need to export the model first and have TensorFlow Serving serve it. The model, however, expects to see *encoded* (i.e., preprocessed) data. So, we'll do that encoding in the Python Flask application (on App Engine Flex) that serves the user interface."
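] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Conceptually, the serving path from the Flask application looks roughly like this once the model has been deployed below. This is only a sketch -- the function is hypothetical, and the real application (in the application/ directory) must encode the input line with the same vocabulary the model was trained with:\n", "```python\n", "from googleapiclient import discovery\n", "\n", "def predict_next_line(project, encoded_instance):\n", "  # encoded_instance: the poem line, already encoded the way the\n", "  # exported model's serving signature expects\n", "  service = discovery.build('ml', 'v1')\n", "  name = 'projects/{}/models/poetry/versions/v1'.format(project)\n", "  response = service.projects().predict(\n", "      name=name, body={'instances': [encoded_instance]}).execute()\n", "  return response['predictions']\n", "```"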
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "TOPDIR=gs://${BUCKET}\n", "OUTDIR=${TOPDIR}/poetry/model_full2\n", "DATADIR=${TOPDIR}/poetry/data\n", "MODEL=transformer\n", "HPARAMS=transformer_poetry\n", "BEAM_SIZE=4\n", "ALPHA=0.6\n", "\n", "t2t-exporter \\\n", " --model=$MODEL \\\n", " --hparams_set=$HPARAMS \\\n", " --problem=$PROBLEM \\\n", " --t2t_usr_dir=./poetry/trainer \\\n", " --decode_hparams=\"beam_size=$BEAM_SIZE,alpha=$ALPHA\" \\\n", " --data_dir=$DATADIR \\\n", " --output_dir=$OUTDIR" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "MODEL_LOCATION=$(gsutil ls gs://${BUCKET}/poetry/model_full2/export | tail -1)\n", "echo $MODEL_LOCATION\n", "saved_model_cli show --dir $MODEL_LOCATION --tag_set serve --signature_def serving_default" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Cloud ML Engine" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile mlengine.json\n", "description: Poetry service on ML Engine\n", "autoScaling:\n", " minNodes: 1 # We don't want this model to autoscale down to zero" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "MODEL_NAME=\"poetry\"\n", "MODEL_VERSION=\"v1\"\n", "MODEL_LOCATION=$(gsutil ls gs://${BUCKET}/poetry/model_full2/export | tail -1)\n", "echo \"Deleting and deploying $MODEL_NAME $MODEL_VERSION from $MODEL_LOCATION ... this will take a few minutes\"\n", "gcloud ml-engine versions delete ${MODEL_VERSION} --model ${MODEL_NAME}\n", "#gcloud ml-engine models delete ${MODEL_NAME}\n", "#gcloud ml-engine models create ${MODEL_NAME} --regions $REGION\n", "gcloud ml-engine versions create ${MODEL_VERSION} \\\n", " --model ${MODEL_NAME} --origin ${MODEL_LOCATION} --runtime-version=1.13 --config=mlengine.json" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Kubeflow\n", "\n", "Follow these instructions:\n", "* On the GCP console, launch a Google Kubernetes Engine (GKE) cluster named 'poetry' with 2 nodes, each of which is a n1-standard-2 (2 vCPUs, 7.5 GB memory) VM\n", "* On the GCP console, click on the Connect button for your cluster, and choose the CloudShell option\n", "* In CloudShell, run: \n", " ```\n", " git clone https://github.com/GoogleCloudPlatform/training-data-analyst`\n", " cd training-data-analyst/courses/machine_learning/deepdive/09_sequence\n", " ```\n", "* Look at [`./setup_kubeflow.sh`](setup_kubeflow.sh) and modify as appropriate." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### AppEngine\n", "\n", "What's deployed in Cloud ML Engine or Kubeflow is only the TensorFlow model. We still need a preprocessing service. That is done using AppEngine. Edit application/app.yaml appropriately." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cat application/app.yaml" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "cd application\n", "#gcloud app create # if this is your first app\n", "#gcloud app deploy --quiet --stop-previous-version app.yaml" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now visit https://mlpoetry-dot-cloud-training-demos.appspot.com and try out the prediction app!\n", "\n", "<img src=\"diagrams/poetry_app.png\" width=\"50%\"/>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the \\\"License\\\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \\\"AS IS\\\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.15" } }, "nbformat": 4, "nbformat_minor": 2 }