courses/machine_learning/deepdive/04_features/a

{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "deletable": true, "editable": true, "id": "4f3CKqFUqL2-", "slideshow": { "slide_type": "slide" } }, "source": [ "# Trying out features" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "**Learning Objectives:**\n", " * Improve the accuracy of a model by adding new features with the appropriate representation" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "The data is based on 1990 census data from California. The data is aggregated at the city block level, so features such as **total_rooms** and **population** reflect the total number of rooms in a block and the total number of people who live on that block, respectively." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "deletable": true, "editable": true, "id": "6TjLjL9IU80G" }, "source": [ "## Set Up\n", "In this first cell, we'll load the necessary libraries." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!sudo chown -R jupyter:jupyter /home/jupyter/training-data-analyst" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Ensure the right version of TensorFlow is installed.\n", "!pip freeze | grep tensorflow==2.5" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2.1.0\n" ] } ], "source": [ "import math\n", "import shutil\n", "import numpy as np\n", "import pandas as pd\n", "import tensorflow as tf\n", "\n", "print(tf.__version__)\n", "tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)\n", "pd.options.display.max_rows = 10\n", "pd.options.display.float_format = '{:.1f}'.format" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "deletable": true, "editable": true, "id": "ipRyUHjhU80Q" }, "source": [ 
"Next, we'll load our data set." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "df = pd.read_csv(\"https://storage.googleapis.com/ml_universities/california_housing_train.csv\", sep=\",\")" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "deletable": true, "editable": true, "id": "HzzlSs3PtTmt", "slideshow": { "slide_type": "-" } }, "source": [ "## Examine and split the data\n", "\n", "It's a good idea to get to know your data a little bit before you work with it.\n", "\n", "We'll print out a quick summary of a few useful statistics on each column.\n", "\n", "This will include things like mean, standard deviation, max, min, and various quantiles." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>longitude</th>\n", " <th>latitude</th>\n", " <th>housing_median_age</th>\n", " <th>total_rooms</th>\n", " <th>total_bedrooms</th>\n", " <th>population</th>\n", " <th>households</th>\n", " <th>median_income</th>\n", " <th>median_house_value</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>-114.3</td>\n", " <td>34.2</td>\n", " <td>15.0</td>\n", " <td>5612.0</td>\n", " <td>1283.0</td>\n", " <td>1015.0</td>\n", " <td>472.0</td>\n", " <td>1.5</td>\n", " <td>66900.0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>-114.5</td>\n", " <td>34.4</td>\n", " <td>19.0</td>\n", " <td>7650.0</td>\n", " <td>1901.0</td>\n", " 
<td>1129.0</td>\n", " <td>463.0</td>\n", " <td>1.8</td>\n", " <td>80100.0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>-114.6</td>\n", " <td>33.7</td>\n", " <td>17.0</td>\n", " <td>720.0</td>\n", " <td>174.0</td>\n", " <td>333.0</td>\n", " <td>117.0</td>\n", " <td>1.7</td>\n", " <td>85700.0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>-114.6</td>\n", " <td>33.6</td>\n", " <td>14.0</td>\n", " <td>1501.0</td>\n", " <td>337.0</td>\n", " <td>515.0</td>\n", " <td>226.0</td>\n", " <td>3.2</td>\n", " <td>73400.0</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>-114.6</td>\n", " <td>33.6</td>\n", " <td>20.0</td>\n", " <td>1454.0</td>\n", " <td>326.0</td>\n", " <td>624.0</td>\n", " <td>262.0</td>\n", " <td>1.9</td>\n", " <td>65500.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " longitude latitude housing_median_age total_rooms total_bedrooms \\\n", "0 -114.3 34.2 15.0 5612.0 1283.0 \n", "1 -114.5 34.4 19.0 7650.0 1901.0 \n", "2 -114.6 33.7 17.0 720.0 174.0 \n", "3 -114.6 33.6 14.0 1501.0 337.0 \n", "4 -114.6 33.6 20.0 1454.0 326.0 \n", "\n", " population households median_income median_house_value \n", "0 1015.0 472.0 1.5 66900.0 \n", "1 1129.0 463.0 1.8 80100.0 \n", "2 333.0 117.0 1.7 85700.0 \n", "3 515.0 226.0 3.2 73400.0 \n", "4 624.0 262.0 1.9 65500.0 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "cellView": "both", "colab": { "autoexec": { "startup": false, "wait_interval": 0 }, "test": { "output": "ignore", "timeout": 600 } }, "colab_type": "code", "collapsed": false, "deletable": true, "editable": true, "id": "gzb10yoVrydW", "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " 
}\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>longitude</th>\n", " <th>latitude</th>\n", " <th>housing_median_age</th>\n", " <th>total_rooms</th>\n", " <th>total_bedrooms</th>\n", " <th>population</th>\n", " <th>households</th>\n", " <th>median_income</th>\n", " <th>median_house_value</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>count</th>\n", " <td>17000.0</td>\n", " <td>17000.0</td>\n", " <td>17000.0</td>\n", " <td>17000.0</td>\n", " <td>17000.0</td>\n", " <td>17000.0</td>\n", " <td>17000.0</td>\n", " <td>17000.0</td>\n", " <td>17000.0</td>\n", " </tr>\n", " <tr>\n", " <th>mean</th>\n", " <td>-119.6</td>\n", " <td>35.6</td>\n", " <td>28.6</td>\n", " <td>2643.7</td>\n", " <td>539.4</td>\n", " <td>1429.6</td>\n", " <td>501.2</td>\n", " <td>3.9</td>\n", " <td>207300.9</td>\n", " </tr>\n", " <tr>\n", " <th>std</th>\n", " <td>2.0</td>\n", " <td>2.1</td>\n", " <td>12.6</td>\n", " <td>2179.9</td>\n", " <td>421.5</td>\n", " <td>1147.9</td>\n", " <td>384.5</td>\n", " <td>1.9</td>\n", " <td>115983.8</td>\n", " </tr>\n", " <tr>\n", " <th>min</th>\n", " <td>-124.3</td>\n", " <td>32.5</td>\n", " <td>1.0</td>\n", " <td>2.0</td>\n", " <td>1.0</td>\n", " <td>3.0</td>\n", " <td>1.0</td>\n", " <td>0.5</td>\n", " <td>14999.0</td>\n", " </tr>\n", " <tr>\n", " <th>25%</th>\n", " <td>-121.8</td>\n", " <td>33.9</td>\n", " <td>18.0</td>\n", " <td>1462.0</td>\n", " <td>297.0</td>\n", " <td>790.0</td>\n", " <td>282.0</td>\n", " <td>2.6</td>\n", " <td>119400.0</td>\n", " </tr>\n", " <tr>\n", " <th>50%</th>\n", " <td>-118.5</td>\n", " <td>34.2</td>\n", " <td>29.0</td>\n", " <td>2127.0</td>\n", " <td>434.0</td>\n", " <td>1167.0</td>\n", " <td>409.0</td>\n", " <td>3.5</td>\n", " <td>180400.0</td>\n", " </tr>\n", " <tr>\n", " <th>75%</th>\n", " <td>-118.0</td>\n", " <td>37.7</td>\n", " <td>37.0</td>\n", " 
<td>3151.2</td>\n", " <td>648.2</td>\n", " <td>1721.0</td>\n", " <td>605.2</td>\n", " <td>4.8</td>\n", " <td>265000.0</td>\n", " </tr>\n", " <tr>\n", " <th>max</th>\n", " <td>-114.3</td>\n", " <td>42.0</td>\n", " <td>52.0</td>\n", " <td>37937.0</td>\n", " <td>6445.0</td>\n", " <td>35682.0</td>\n", " <td>6082.0</td>\n", " <td>15.0</td>\n", " <td>500001.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " longitude latitude housing_median_age total_rooms total_bedrooms \\\n", "count 17000.0 17000.0 17000.0 17000.0 17000.0 \n", "mean -119.6 35.6 28.6 2643.7 539.4 \n", "std 2.0 2.1 12.6 2179.9 421.5 \n", "min -124.3 32.5 1.0 2.0 1.0 \n", "25% -121.8 33.9 18.0 1462.0 297.0 \n", "50% -118.5 34.2 29.0 2127.0 434.0 \n", "75% -118.0 37.7 37.0 3151.2 648.2 \n", "max -114.3 42.0 52.0 37937.0 6445.0 \n", "\n", " population households median_income median_house_value \n", "count 17000.0 17000.0 17000.0 17000.0 \n", "mean 1429.6 501.2 3.9 207300.9 \n", "std 1147.9 384.5 1.9 115983.8 \n", "min 3.0 1.0 0.5 14999.0 \n", "25% 790.0 282.0 2.6 119400.0 \n", "50% 1167.0 409.0 3.5 180400.0 \n", "75% 1721.0 605.2 4.8 265000.0 \n", "max 35682.0 6082.0 15.0 500001.0 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Now, split the data into two parts -- training and evaluation." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "np.random.seed(seed=1) # makes the result reproducible\n", "msk = np.random.rand(len(df)) < 0.8\n", "traindf = df[msk]\n", "evaldf = df[~msk]" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## Training and Evaluation\n", "\n", "In this exercise, we'll be trying to predict **median_house_value**. It will be our label (sometimes also called a target).\n", "\n", "We'll modify the feature columns and the input function to represent the features we want to use.\n", "\n", "We divide **total_rooms** by **households** to get **avg_rooms_per_house**, which we expect to positively correlate with **median_house_value**. \n", "\n", "We also divide **population** by **total_rooms** to get **avg_persons_per_room**, which we expect to negatively correlate with **median_house_value**." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "def add_more_features(df):\n", " df = df.copy() # work on a copy so the caller's DataFrame is not modified\n", " df['avg_rooms_per_house'] = df['total_rooms'] / df['households'] # expect positive correlation\n", " df['avg_persons_per_room'] = df['population'] / df['total_rooms'] # expect negative correlation\n", " return df" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "# Create pandas input function\n", "def make_input_fn(df, num_epochs):\n", " return tf.compat.v1.estimator.inputs.pandas_input_fn(\n", " x = add_more_features(df),\n", " y = df['median_house_value'] / 100000, # scale the label; we'll talk about why later in the course\n", " batch_size = 128,\n", " num_epochs = num_epochs,\n", " shuffle = True,\n", " queue_capacity = 1000,\n", " num_threads = 1\n", " )" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false, 
"deletable": true, "editable": true }, "outputs": [], "source": [ "# Define your feature columns\n", "def create_feature_cols():\n", " return [\n", " tf.feature_column.numeric_column('housing_median_age'),\n", " tf.feature_column.bucketized_column(tf.feature_column.numeric_column('latitude'), boundaries = np.arange(32.0, 42, 1).tolist()),\n", " tf.feature_column.numeric_column('avg_rooms_per_house'),\n", " tf.feature_column.numeric_column('avg_persons_per_room'),\n", " tf.feature_column.numeric_column('median_income')\n", " ]" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "# Create estimator train and evaluate function\n", "def train_and_evaluate(output_dir, num_train_steps):\n", " estimator = tf.compat.v1.estimator.LinearRegressor(model_dir = output_dir, feature_columns = create_feature_cols())\n", " train_spec = tf.estimator.TrainSpec(input_fn = make_input_fn(traindf, None), \n", " max_steps = num_train_steps)\n", " eval_spec = tf.estimator.EvalSpec(input_fn = make_input_fn(evaldf, 1), \n", " steps = None, \n", " start_delay_secs = 1, # start evaluating after N seconds, \n", " throttle_secs = 5) # evaluate every N seconds\n", " tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "OUTDIR = './trained_model'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "# Run the model\n", "shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time\n", "tf.compat.v1.summary.FileWriterCache.clear() \n", "train_and_evaluate(OUTDIR, 2000)" ] } ], "metadata": { "colab": { "default_view": {}, "name": "first_steps_with_tensor_flow.ipynb", "provenance": [], "version": "0.3.2", "views": {} }, "kernelspec": { 
"display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.3" } }, "nbformat": 4, "nbformat_minor": 2 }
