courses/machine_learning/deepdive2/structured/solutions/4a_sample_babyweight.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# LAB 4a: Creating a Sampled Dataset.\n",
"\n",
"## Learning Objectives\n",
"\n",
"1. Setup up the environment.\n",
"1. Sample the natality dataset to create train/eval/test sets.\n",
"1. Preprocess the data in Pandas dataframe.\n",
"\n",
"\n",
"## Introduction \n",
"In this notebook, we'll read data from BigQuery into our notebook to preprocess the data within a Pandas dataframe for a small, repeatable sample.\n",
"\n",
"We will set up the environment, sample the natality dataset to create train/eval/test splits, and preprocess the data in a Pandas dataframe."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "hJ7ByvoXzpVI"
},
"source": [
"## Set up environment variables and load necessary libraries"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "mC9K9Dpx1ztf"
},
"source": [
"Check that the Google BigQuery library is installed and if not, install it. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!sudo chown -R jupyter:jupyter /home/jupyter/training-data-analyst"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 609
},
"colab_type": "code",
"id": "RZUQtASG10xO",
"outputId": "5612d6b0-9730-476a-a28f-8fdc14f4ecde"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"google-cloud-bigquery==1.6.1\n"
]
}
],
"source": [
"%%bash\n",
"pip freeze | grep google-cloud-bigquery==1.6.1 || \\\n",
"pip install google-cloud-bigquery==1.6.1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Import necessary libraries."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from google.cloud import bigquery\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Set environment variables so that we can use them throughout the entire lab. We will be using our project name for our bucket, so you only need to change your project and region."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"export PROJECT=$(gcloud config list project --format \"value(core.project)\")\n",
"echo \"Your current GCP Project Name is: \"$PROJECT"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"PROJECT = \"cloud-training-demos\" # Replace with your PROJECT"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create ML datasets by sampling using BigQuery\n",
"\n",
"We'll begin by sampling the BigQuery data to create smaller datasets. Let's create a BigQuery client that we'll use throughout the lab."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"bq = bigquery.Client(project = PROJECT)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We need to figure out the right way to divide our hash values to get our desired splits. To do that we need to define some values to hash with in the modulo. Feel free to play around with these values to get the perfect combination."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"modulo_divisor = 100\n",
"train_percent = 80.0\n",
"eval_percent = 10.0\n",
"\n",
"train_buckets = int(modulo_divisor * train_percent / 100.0)\n",
"eval_buckets = int(modulo_divisor * eval_percent / 100.0)"
]
},
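{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (not part of the original lab), we can print the bucket ranges these values imply; with the defaults above, buckets 0-79 go to train, 80-89 to eval, and 90-99 to test."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sanity check: print the bucket ranges implied by the values above\n",
"print(\"train buckets: 0 to {}\".format(train_buckets - 1))\n",
"print(\"eval buckets: {} to {}\".format(train_buckets, train_buckets + eval_buckets - 1))\n",
"print(\"test buckets: {} to {}\".format(train_buckets + eval_buckets, modulo_divisor - 1))"
]
},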
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can make a series of queries to check if our bucketing values result in the correct sizes of each of our dataset splits and then adjust accordingly. Therefore, to make our code more compact and reusable, let's define a function to return the head of a dataframe produced from our queries up to a certain number of rows."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"def display_dataframe_head_from_query(query, count=10):\n",
" \"\"\"Displays count rows from dataframe head from query.\n",
" \n",
" Args:\n",
" query: str, query to be run on BigQuery, results stored in dataframe.\n",
" count: int, number of results from head of dataframe to display.\n",
" Returns:\n",
" Dataframe head with count number of results.\n",
" \"\"\"\n",
" df = bq.query(\n",
" query + \" LIMIT {limit}\".format(\n",
" limit=count)).to_dataframe()\n",
"\n",
" return df.head(count)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For our first query, we're going to use the original query above to get our label, features, and columns to combine into our hash which we will use to perform our repeatable splitting. There are only a limited number of years, months, days, and states in the dataset. Let's see what the hash values are. We will need to include all of these extra columns to hash on to get a fairly uniform spread of the data. Feel free to try less or more in the hash and see how it changes your results."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>weight_pounds</th>\n",
" <th>is_male</th>\n",
" <th>mother_age</th>\n",
" <th>plurality</th>\n",
" <th>gestation_weeks</th>\n",
" <th>year</th>\n",
" <th>month</th>\n",
" <th>date</th>\n",
" <th>state</th>\n",
" <th>mother_birth_state</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>7.568469</td>\n",
" <td>True</td>\n",
" <td>22</td>\n",
" <td>1</td>\n",
" <td>46</td>\n",
" <td>2001</td>\n",
" <td>7</td>\n",
" <td>5</td>\n",
" <td>CA</td>\n",
" <td>CA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>8.807467</td>\n",
" <td>True</td>\n",
" <td>39</td>\n",
" <td>1</td>\n",
" <td>42</td>\n",
" <td>2001</td>\n",
" <td>8</td>\n",
" <td>3</td>\n",
" <td>CA</td>\n",
" <td>Foreign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>8.313632</td>\n",
" <td>True</td>\n",
" <td>23</td>\n",
" <td>1</td>\n",
" <td>35</td>\n",
" <td>2001</td>\n",
" <td>10</td>\n",
" <td>7</td>\n",
" <td>IL</td>\n",
" <td>IL</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>8.000575</td>\n",
" <td>False</td>\n",
" <td>27</td>\n",
" <td>1</td>\n",
" <td>40</td>\n",
" <td>2001</td>\n",
" <td>6</td>\n",
" <td>7</td>\n",
" <td>IL</td>\n",
" <td>IL</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>6.563162</td>\n",
" <td>False</td>\n",
" <td>29</td>\n",
" <td>1</td>\n",
" <td>39</td>\n",
" <td>2001</td>\n",
" <td>11</td>\n",
" <td>7</td>\n",
" <td>KY</td>\n",
" <td>IN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>7.125340</td>\n",
" <td>False</td>\n",
" <td>34</td>\n",
" <td>1</td>\n",
" <td>40</td>\n",
" <td>2001</td>\n",
" <td>12</td>\n",
" <td>7</td>\n",
" <td>MD</td>\n",
" <td>MD</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>7.438397</td>\n",
" <td>False</td>\n",
" <td>31</td>\n",
" <td>1</td>\n",
" <td>38</td>\n",
" <td>2001</td>\n",
" <td>4</td>\n",
" <td>3</td>\n",
" <td>MA</td>\n",
" <td>Foreign</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>7.352416</td>\n",
" <td>True</td>\n",
" <td>30</td>\n",
" <td>1</td>\n",
" <td>37</td>\n",
" <td>2001</td>\n",
" <td>5</td>\n",
" <td>7</td>\n",
" <td>MI</td>\n",
" <td>MI</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>8.062305</td>\n",
" <td>True</td>\n",
" <td>16</td>\n",
" <td>1</td>\n",
" <td>40</td>\n",
" <td>2001</td>\n",
" <td>10</td>\n",
" <td>5</td>\n",
" <td>MN</td>\n",
" <td>MN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>7.251004</td>\n",
" <td>True</td>\n",
" <td>17</td>\n",
" <td>1</td>\n",
" <td>39</td>\n",
" <td>2001</td>\n",
" <td>2</td>\n",
" <td>5</td>\n",
" <td>MS</td>\n",
" <td>MS</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" weight_pounds is_male mother_age plurality gestation_weeks year \\\n",
"0 7.568469 True 22 1 46 2001 \n",
"1 8.807467 True 39 1 42 2001 \n",
"2 8.313632 True 23 1 35 2001 \n",
"3 8.000575 False 27 1 40 2001 \n",
"4 6.563162 False 29 1 39 2001 \n",
"5 7.125340 False 34 1 40 2001 \n",
"6 7.438397 False 31 1 38 2001 \n",
"7 7.352416 True 30 1 37 2001 \n",
"8 8.062305 True 16 1 40 2001 \n",
"9 7.251004 True 17 1 39 2001 \n",
"\n",
" month date state mother_birth_state \n",
"0 7 5 CA CA \n",
"1 8 3 CA Foreign \n",
"2 10 7 IL IL \n",
"3 6 7 IL IL \n",
"4 11 7 KY IN \n",
"5 12 7 MD MD \n",
"6 4 3 MA Foreign \n",
"7 5 7 MI MI \n",
"8 10 5 MN MN \n",
"9 2 5 MS MS "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Get label, features, and columns to hash and split into buckets\n",
"hash_cols_fixed_query = \"\"\"\n",
"SELECT\n",
" weight_pounds,\n",
" is_male,\n",
" mother_age,\n",
" plurality,\n",
" gestation_weeks,\n",
" year,\n",
" month,\n",
" CASE\n",
" WHEN day IS NULL THEN\n",
" CASE\n",
" WHEN wday IS NULL THEN 0\n",
" ELSE wday\n",
" END\n",
" ELSE day\n",
" END AS date,\n",
" IFNULL(state, \"Unknown\") AS state,\n",
" IFNULL(mother_birth_state, \"Unknown\") AS mother_birth_state\n",
"FROM\n",
" publicdata.samples.natality\n",
"WHERE\n",
" year > 2000\n",
" AND weight_pounds > 0\n",
" AND mother_age > 0\n",
" AND plurality > 0\n",
" AND gestation_weeks > 0\n",
"\"\"\"\n",
"\n",
"display_dataframe_head_from_query(hash_cols_fixed_query)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using `COALESCE` would provide the same result as the nested `CASE WHEN`. This is preferable when all we want is the first non-null instance. To be precise the `CASE WHEN` would become `COALESCE(wday, day, 0) AS date`. You can read more about it [here](https://cloud.google.com/bigquery/docs/reference/standard-sql/conditional_expressions)."
]
},
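{
"cell_type": "markdown",
"metadata": {},
"source": [
"To convince ourselves locally (not part of the original lab), we can mimic the first-non-null behavior of `COALESCE` with a small Python helper:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustration: COALESCE returns its first non-null argument,\n",
"# matching the nested CASE WHEN in the query above.\n",
"def coalesce(*args):\n",
"    return next((arg for arg in args if arg is not None), None)\n",
"\n",
"\n",
"print(coalesce(None, 3, 0))     # day is NULL, wday = 3 -> 3\n",
"print(coalesce(None, None, 0))  # both NULL -> 0\n",
"print(coalesce(15, 3, 0))       # day = 15 -> 15"
]
},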
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next query will combine our hash columns and will leave us just with our label, features, and our hash values."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>weight_pounds</th>\n",
" <th>is_male</th>\n",
" <th>mother_age</th>\n",
" <th>plurality</th>\n",
" <th>gestation_weeks</th>\n",
" <th>hash_values</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>7.109908</td>\n",
" <td>False</td>\n",
" <td>25</td>\n",
" <td>1</td>\n",
" <td>38</td>\n",
" <td>563561248331884029</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>7.588311</td>\n",
" <td>False</td>\n",
" <td>19</td>\n",
" <td>1</td>\n",
" <td>40</td>\n",
" <td>3487851893553562338</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>4.812691</td>\n",
" <td>True</td>\n",
" <td>35</td>\n",
" <td>1</td>\n",
" <td>33</td>\n",
" <td>2669304657201106008</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>7.251004</td>\n",
" <td>True</td>\n",
" <td>30</td>\n",
" <td>2</td>\n",
" <td>38</td>\n",
" <td>7076342771382320241</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>6.206013</td>\n",
" <td>False</td>\n",
" <td>21</td>\n",
" <td>1</td>\n",
" <td>36</td>\n",
" <td>8828960867056723893</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>6.062712</td>\n",
" <td>False</td>\n",
" <td>33</td>\n",
" <td>1</td>\n",
" <td>40</td>\n",
" <td>4280252324912833683</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>7.500126</td>\n",
" <td>False</td>\n",
" <td>19</td>\n",
" <td>1</td>\n",
" <td>39</td>\n",
" <td>6090508671071281093</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>7.687519</td>\n",
" <td>True</td>\n",
" <td>23</td>\n",
" <td>1</td>\n",
" <td>41</td>\n",
" <td>8708360030053768340</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>8.875811</td>\n",
" <td>True</td>\n",
" <td>24</td>\n",
" <td>1</td>\n",
" <td>40</td>\n",
" <td>8530116731648975419</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>7.387690</td>\n",
" <td>False</td>\n",
" <td>28</td>\n",
" <td>1</td>\n",
" <td>38</td>\n",
" <td>1776323475383399588</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" weight_pounds is_male mother_age plurality gestation_weeks \\\n",
"0 7.109908 False 25 1 38 \n",
"1 7.588311 False 19 1 40 \n",
"2 4.812691 True 35 1 33 \n",
"3 7.251004 True 30 2 38 \n",
"4 6.206013 False 21 1 36 \n",
"5 6.062712 False 33 1 40 \n",
"6 7.500126 False 19 1 39 \n",
"7 7.687519 True 23 1 41 \n",
"8 8.875811 True 24 1 40 \n",
"9 7.387690 False 28 1 38 \n",
"\n",
" hash_values \n",
"0 563561248331884029 \n",
"1 3487851893553562338 \n",
"2 2669304657201106008 \n",
"3 7076342771382320241 \n",
"4 8828960867056723893 \n",
"5 4280252324912833683 \n",
"6 6090508671071281093 \n",
"7 8708360030053768340 \n",
"8 8530116731648975419 \n",
"9 1776323475383399588 "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_query = \"\"\"\n",
"SELECT\n",
" weight_pounds,\n",
" is_male,\n",
" mother_age,\n",
" plurality,\n",
" gestation_weeks,\n",
" FARM_FINGERPRINT(\n",
" CONCAT(\n",
" CAST(year AS STRING),\n",
" CAST(month AS STRING),\n",
" CAST(date AS STRING),\n",
" CAST(state AS STRING),\n",
" CAST(mother_birth_state AS STRING)\n",
" )\n",
" ) AS hash_values\n",
"FROM\n",
" ({CTE_hash_cols_fixed})\n",
"\"\"\".format(CTE_hash_cols_fixed=hash_cols_fixed_query)\n",
"\n",
"display_dataframe_head_from_query(data_query)"
]
},
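{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside (not part of the original lab), the property we rely on here is that the hash is deterministic: the same concatenated string always produces the same fingerprint, so the split is repeatable across runs. We can illustrate the idea locally with a different hash function (SHA-256 here, not FarmHash):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import hashlib\n",
"\n",
"\n",
"def deterministic_bucket(s, modulo_divisor=100):\n",
"    # Any deterministic hash works for repeatable splitting; BigQuery\n",
"    # uses FARM_FINGERPRINT, we use SHA-256 just to illustrate.\n",
"    digest = hashlib.sha256(s.encode(\"utf-8\")).hexdigest()\n",
"    return int(digest, 16) % modulo_divisor\n",
"\n",
"\n",
"# The same input always lands in the same bucket\n",
"print(deterministic_bucket(\"200175CACA\"))\n",
"print(deterministic_bucket(\"200175CACA\"))"
]
},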
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next query is going to find the counts of each of the unique 657484 `hash_values`. This will be our first step at making actual hash buckets for our split via the `GROUP BY`."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>hash_values</th>\n",
" <th>num_records</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>6001926139587584124</td>\n",
" <td>19</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>6064126287360941757</td>\n",
" <td>758</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>6824828135709159935</td>\n",
" <td>72</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3363240092080644183</td>\n",
" <td>631</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2666158614438147859</td>\n",
" <td>964</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>2958542686973584093</td>\n",
" <td>363</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>8332670353336108110</td>\n",
" <td>47</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>1459116430691530322</td>\n",
" <td>52</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>8084544908979932787</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>2610866487448411172</td>\n",
" <td>23</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" hash_values num_records\n",
"0 6001926139587584124 19\n",
"1 6064126287360941757 758\n",
"2 6824828135709159935 72\n",
"3 3363240092080644183 631\n",
"4 2666158614438147859 964\n",
"5 2958542686973584093 363\n",
"6 8332670353336108110 47\n",
"7 1459116430691530322 52\n",
"8 8084544908979932787 7\n",
"9 2610866487448411172 23"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Get the counts of each of the unique hashs of our splitting column\n",
"first_bucketing_query = \"\"\"\n",
"SELECT\n",
" hash_values,\n",
" COUNT(*) AS num_records\n",
"FROM\n",
" ({CTE_data})\n",
"GROUP BY\n",
" hash_values\n",
"\"\"\".format(CTE_data=data_query)\n",
"\n",
"display_dataframe_head_from_query(first_bucketing_query)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The query below performs a second layer of bucketing where now for each of these bucket indices we count the number of records."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>bucket_index</th>\n",
" <th>num_records</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>17</td>\n",
" <td>222562</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>46</td>\n",
" <td>281627</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>7</td>\n",
" <td>270933</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>85</td>\n",
" <td>368045</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>40</td>\n",
" <td>333712</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>19</td>\n",
" <td>384793</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>77</td>\n",
" <td>401941</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>95</td>\n",
" <td>313544</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>81</td>\n",
" <td>233538</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>24</td>\n",
" <td>352559</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" bucket_index num_records\n",
"0 17 222562\n",
"1 46 281627\n",
"2 7 270933\n",
"3 85 368045\n",
"4 40 333712\n",
"5 19 384793\n",
"6 77 401941\n",
"7 95 313544\n",
"8 81 233538\n",
"9 24 352559"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Get the number of records in each of the hash buckets\n",
"second_bucketing_query = \"\"\"\n",
"SELECT\n",
" ABS(MOD(hash_values, {modulo_divisor})) AS bucket_index,\n",
" SUM(num_records) AS num_records\n",
"FROM\n",
" ({CTE_first_bucketing})\n",
"GROUP BY\n",
" ABS(MOD(hash_values, {modulo_divisor}))\n",
"\"\"\".format(\n",
" CTE_first_bucketing=first_bucketing_query, modulo_divisor=modulo_divisor)\n",
"\n",
"display_dataframe_head_from_query(second_bucketing_query)"
]
},
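{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `FARM_FINGERPRINT` returns a signed INT64, which is why the query wraps the modulo in `ABS`. As a quick local illustration (not part of the original lab), for a positive divisor `ABS(MOD(hash, m))` is equivalent to taking the modulo of the absolute value:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustration: BigQuery's MOD keeps the sign of the dividend,\n",
"# so ABS(MOD(h, m)) equals abs(h) % m for positive m.\n",
"def bucket_index(hash_value, modulo_divisor=100):\n",
"    return abs(hash_value) % modulo_divisor\n",
"\n",
"\n",
"print(bucket_index(563561248331884029))   # 29\n",
"print(bucket_index(-563561248331884029))  # also 29"
]
},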
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The number of records is hard for us to easily understand the split, so we will normalize the count into percentage of the data in each of the hash buckets in the next query."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>bucket_index</th>\n",
" <th>num_records</th>\n",
" <th>percent_records</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>4</td>\n",
" <td>398118</td>\n",
" <td>0.012060</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>92</td>\n",
" <td>336735</td>\n",
" <td>0.010201</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>70</td>\n",
" <td>285539</td>\n",
" <td>0.008650</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>78</td>\n",
" <td>326758</td>\n",
" <td>0.009898</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>16</td>\n",
" <td>172145</td>\n",
" <td>0.005215</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>94</td>\n",
" <td>431001</td>\n",
" <td>0.013056</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>5</td>\n",
" <td>449280</td>\n",
" <td>0.013610</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>62</td>\n",
" <td>426834</td>\n",
" <td>0.012930</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>30</td>\n",
" <td>333513</td>\n",
" <td>0.010103</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>34</td>\n",
" <td>379000</td>\n",
" <td>0.011481</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" bucket_index num_records percent_records\n",
"0 4 398118 0.012060\n",
"1 92 336735 0.010201\n",
"2 70 285539 0.008650\n",
"3 78 326758 0.009898\n",
"4 16 172145 0.005215\n",
"5 94 431001 0.013056\n",
"6 5 449280 0.013610\n",
"7 62 426834 0.012930\n",
"8 30 333513 0.010103\n",
"9 34 379000 0.011481"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Calculate the overall percentages\n",
"percentages_query = \"\"\"\n",
"SELECT\n",
" bucket_index,\n",
" num_records,\n",
" CAST(num_records AS FLOAT64) / (\n",
" SELECT\n",
" SUM(num_records)\n",
" FROM\n",
" ({CTE_second_bucketing})) AS percent_records\n",
"FROM\n",
" ({CTE_second_bucketing})\n",
"\"\"\".format(CTE_second_bucketing=second_bucketing_query)\n",
"\n",
"display_dataframe_head_from_query(percentages_query)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll now select the range of buckets to be used in training."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>bucket_index</th>\n",
" <th>num_records</th>\n",
" <th>percent_records</th>\n",
" <th>dataset_name</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>52</td>\n",
" <td>204972</td>\n",
" <td>0.006209</td>\n",
" <td>train</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>33</td>\n",
" <td>410226</td>\n",
" <td>0.012427</td>\n",
" <td>train</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>23</td>\n",
" <td>559019</td>\n",
" <td>0.016934</td>\n",
" <td>train</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>28</td>\n",
" <td>449682</td>\n",
" <td>0.013622</td>\n",
" <td>train</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>62</td>\n",
" <td>426834</td>\n",
" <td>0.012930</td>\n",
" <td>train</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>73</td>\n",
" <td>411771</td>\n",
" <td>0.012474</td>\n",
" <td>train</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>38</td>\n",
" <td>338150</td>\n",
" <td>0.010243</td>\n",
" <td>train</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>35</td>\n",
" <td>250505</td>\n",
" <td>0.007588</td>\n",
" <td>train</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>65</td>\n",
" <td>289303</td>\n",
" <td>0.008764</td>\n",
" <td>train</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>61</td>\n",
" <td>453904</td>\n",
" <td>0.013750</td>\n",
" <td>train</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" bucket_index num_records percent_records dataset_name\n",
"0 52 204972 0.006209 train\n",
"1 33 410226 0.012427 train\n",
"2 23 559019 0.016934 train\n",
"3 28 449682 0.013622 train\n",
"4 62 426834 0.012930 train\n",
"5 73 411771 0.012474 train\n",
"6 38 338150 0.010243 train\n",
"7 35 250505 0.007588 train\n",
"8 65 289303 0.008764 train\n",
"9 61 453904 0.013750 train"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Choose hash buckets for training and pull in their statistics\n",
"train_query = \"\"\"\n",
"SELECT\n",
" *,\n",
" \"train\" AS dataset_name\n",
"FROM\n",
" ({CTE_percentages})\n",
"WHERE\n",
" bucket_index >= 0\n",
" AND bucket_index < {train_buckets}\n",
"\"\"\".format(\n",
" CTE_percentages=percentages_query,\n",
" train_buckets=train_buckets)\n",
"\n",
"display_dataframe_head_from_query(train_query)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll do the same by selecting the range of buckets to be used for evaluation."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>bucket_index</th>\n",
" <th>num_records</th>\n",
" <th>percent_records</th>\n",
" <th>dataset_name</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>80</td>\n",
" <td>312489</td>\n",
" <td>0.009466</td>\n",
" <td>eval</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>83</td>\n",
" <td>411258</td>\n",
" <td>0.012458</td>\n",
" <td>eval</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>85</td>\n",
" <td>368045</td>\n",
" <td>0.011149</td>\n",
" <td>eval</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>82</td>\n",
" <td>468179</td>\n",
" <td>0.014182</td>\n",
" <td>eval</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>87</td>\n",
" <td>523881</td>\n",
" <td>0.015870</td>\n",
" <td>eval</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>88</td>\n",
" <td>423809</td>\n",
" <td>0.012838</td>\n",
" <td>eval</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>86</td>\n",
" <td>274489</td>\n",
" <td>0.008315</td>\n",
" <td>eval</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>89</td>\n",
" <td>256482</td>\n",
" <td>0.007770</td>\n",
" <td>eval</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>81</td>\n",
" <td>233538</td>\n",
" <td>0.007074</td>\n",
" <td>eval</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>84</td>\n",
" <td>341155</td>\n",
" <td>0.010334</td>\n",
" <td>eval</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" bucket_index num_records percent_records dataset_name\n",
"0 80 312489 0.009466 eval\n",
"1 83 411258 0.012458 eval\n",
"2 85 368045 0.011149 eval\n",
"3 82 468179 0.014182 eval\n",
"4 87 523881 0.015870 eval\n",
"5 88 423809 0.012838 eval\n",
"6 86 274489 0.008315 eval\n",
"7 89 256482 0.007770 eval\n",
"8 81 233538 0.007074 eval\n",
"9 84 341155 0.010334 eval"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Choose hash buckets for validation and pull in their statistics\n",
"eval_query = \"\"\"\n",
"SELECT\n",
" *,\n",
" \"eval\" AS dataset_name\n",
"FROM\n",
" ({CTE_percentages})\n",
"WHERE\n",
" bucket_index >= {train_buckets}\n",
" AND bucket_index < {cum_eval_buckets}\n",
"\"\"\".format(\n",
" CTE_percentages=percentages_query,\n",
" train_buckets=train_buckets,\n",
" cum_eval_buckets=train_buckets + eval_buckets)\n",
"\n",
"display_dataframe_head_from_query(eval_query)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lastly, we'll select the hash buckets to be used for the test split."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>bucket_index</th>\n",
" <th>num_records</th>\n",
" <th>percent_records</th>\n",
" <th>dataset_name</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>92</td>\n",
" <td>336735</td>\n",
" <td>0.010201</td>\n",
" <td>test</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>91</td>\n",
" <td>333267</td>\n",
" <td>0.010096</td>\n",
" <td>test</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>90</td>\n",
" <td>286465</td>\n",
" <td>0.008678</td>\n",
" <td>test</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>94</td>\n",
" <td>431001</td>\n",
" <td>0.013056</td>\n",
" <td>test</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>93</td>\n",
" <td>215710</td>\n",
" <td>0.006534</td>\n",
" <td>test</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>99</td>\n",
" <td>223334</td>\n",
" <td>0.006765</td>\n",
" <td>test</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>95</td>\n",
" <td>313544</td>\n",
" <td>0.009498</td>\n",
" <td>test</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>97</td>\n",
" <td>480790</td>\n",
" <td>0.014564</td>\n",
" <td>test</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>96</td>\n",
" <td>529357</td>\n",
" <td>0.016036</td>\n",
" <td>test</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>98</td>\n",
" <td>374697</td>\n",
" <td>0.011351</td>\n",
" <td>test</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" bucket_index num_records percent_records dataset_name\n",
"0 92 336735 0.010201 test\n",
"1 91 333267 0.010096 test\n",
"2 90 286465 0.008678 test\n",
"3 94 431001 0.013056 test\n",
"4 93 215710 0.006534 test\n",
"5 99 223334 0.006765 test\n",
"6 95 313544 0.009498 test\n",
"7 97 480790 0.014564 test\n",
"8 96 529357 0.016036 test\n",
"9 98 374697 0.011351 test"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Choose hash buckets for testing and pull in their statistics\n",
"test_query = \"\"\"\n",
"SELECT\n",
" *,\n",
" \"test\" AS dataset_name\n",
"FROM\n",
" ({CTE_percentages})\n",
"WHERE\n",
" bucket_index >= {cum_eval_buckets}\n",
" AND bucket_index < {modulo_divisor}\n",
"\"\"\".format(\n",
" CTE_percentages=percentages_query,\n",
" cum_eval_buckets=train_buckets + eval_buckets,\n",
" modulo_divisor=modulo_divisor)\n",
"\n",
"display_dataframe_head_from_query(test_query)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the below query, we'll `UNION ALL` all of the datasets together so that all three sets of hash buckets will be within one table. We added `dataset_id` so that we can sort on it in the query after."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>dataset_id</th>\n",
" <th>bucket_index</th>\n",
" <th>num_records</th>\n",
" <th>percent_records</th>\n",
" <th>dataset_name</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>85</td>\n",
" <td>368045</td>\n",
" <td>0.011149</td>\n",
" <td>eval</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>88</td>\n",
" <td>423809</td>\n",
" <td>0.012838</td>\n",
" <td>eval</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>89</td>\n",
" <td>256482</td>\n",
" <td>0.007770</td>\n",
" <td>eval</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>80</td>\n",
" <td>312489</td>\n",
" <td>0.009466</td>\n",
" <td>eval</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1</td>\n",
" <td>81</td>\n",
" <td>233538</td>\n",
" <td>0.007074</td>\n",
" <td>eval</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>1</td>\n",
" <td>83</td>\n",
" <td>411258</td>\n",
" <td>0.012458</td>\n",
" <td>eval</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>1</td>\n",
" <td>82</td>\n",
" <td>468179</td>\n",
" <td>0.014182</td>\n",
" <td>eval</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>1</td>\n",
" <td>84</td>\n",
" <td>341155</td>\n",
" <td>0.010334</td>\n",
" <td>eval</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>1</td>\n",
" <td>87</td>\n",
" <td>523881</td>\n",
" <td>0.015870</td>\n",
" <td>eval</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>1</td>\n",
" <td>86</td>\n",
" <td>274489</td>\n",
" <td>0.008315</td>\n",
" <td>eval</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" dataset_id bucket_index num_records percent_records dataset_name\n",
"0 1 85 368045 0.011149 eval\n",
"1 1 88 423809 0.012838 eval\n",
"2 1 89 256482 0.007770 eval\n",
"3 1 80 312489 0.009466 eval\n",
"4 1 81 233538 0.007074 eval\n",
"5 1 83 411258 0.012458 eval\n",
"6 1 82 468179 0.014182 eval\n",
"7 1 84 341155 0.010334 eval\n",
"8 1 87 523881 0.015870 eval\n",
"9 1 86 274489 0.008315 eval"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Union the training, validation, and testing dataset statistics\n",
"union_query = \"\"\"\n",
"SELECT\n",
" 0 AS dataset_id,\n",
" *\n",
"FROM\n",
" ({CTE_train})\n",
"UNION ALL\n",
"SELECT\n",
" 1 AS dataset_id,\n",
" *\n",
"FROM\n",
" ({CTE_eval})\n",
"UNION ALL\n",
"SELECT\n",
" 2 AS dataset_id,\n",
" *\n",
"FROM\n",
" ({CTE_test})\n",
"\"\"\".format(CTE_train=train_query, CTE_eval=eval_query, CTE_test=test_query)\n",
"\n",
"display_dataframe_head_from_query(union_query)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lastly, we'll show the final split between the train, eval, and test sets, with both the number of records and the percentage of the total data in each. The result is very close to the 80/10/10 split we were aiming for."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>dataset_id</th>\n",
" <th>dataset_name</th>\n",
" <th>num_records</th>\n",
" <th>percent_records</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>train</td>\n",
" <td>25873134</td>\n",
" <td>0.783765</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>eval</td>\n",
" <td>3613325</td>\n",
" <td>0.109457</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>test</td>\n",
" <td>3524900</td>\n",
" <td>0.106778</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" dataset_id dataset_name num_records percent_records\n",
"0 0 train 25873134 0.783765\n",
"1 1 eval 3613325 0.109457\n",
"2 2 test 3524900 0.106778"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Show final splitting and associated statistics\n",
"split_query = \"\"\"\n",
"SELECT\n",
" dataset_id,\n",
" dataset_name,\n",
" SUM(num_records) AS num_records,\n",
" SUM(percent_records) AS percent_records\n",
"FROM\n",
" ({CTE_union})\n",
"GROUP BY\n",
" dataset_id,\n",
" dataset_name\n",
"ORDER BY\n",
" dataset_id\n",
"\"\"\".format(CTE_union=union_query)\n",
"\n",
"display_dataframe_head_from_query(split_query)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we know our splitting values produce a good global split of the data, here's a way to take a well-distributed subsample of each global split while ensuring that the train/eval/test sets do not overlap."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"There are 7733 examples in the train dataset.\n",
"There are 1037 examples in the validation dataset.\n",
"There are 561 examples in the test dataset.\n"
]
}
],
"source": [
"# every_n allows us to subsample from each of the hash values\n",
"# This helps us get approximately the record counts we want\n",
"every_n = 1000\n",
"\n",
"splitting_string = \"ABS(MOD(hash_values, {0} * {1}))\".format(every_n, modulo_divisor)\n",
"\n",
"def create_data_split_sample_df(query_string, splitting_string, lo, up):\n",
" \"\"\"Creates a dataframe with a sample of a data split.\n",
"\n",
" Args:\n",
" query_string: str, query to run to generate splits.\n",
" splitting_string: str, modulo string to split by.\n",
" lo: float, lower bound for bucket filtering for split.\n",
" up: float, upper bound for bucket filtering for split.\n",
" Returns:\n",
" Dataframe containing data split sample.\n",
" \"\"\"\n",
" query = \"SELECT * FROM ({0}) WHERE {1} >= {2} and {1} < {3}\".format(\n",
" query_string, splitting_string, int(lo), int(up))\n",
"\n",
" df = bq.query(query).to_dataframe()\n",
"\n",
" return df\n",
"\n",
"train_df = create_data_split_sample_df(\n",
" data_query, splitting_string,\n",
" lo=0, up=train_percent)\n",
"\n",
"eval_df = create_data_split_sample_df(\n",
" data_query, splitting_string,\n",
" lo=train_percent, up=train_percent + eval_percent)\n",
"\n",
"test_df = create_data_split_sample_df(\n",
" data_query, splitting_string,\n",
" lo=train_percent + eval_percent, up=modulo_divisor)\n",
"\n",
"print(\"There are {} examples in the train dataset.\".format(len(train_df)))\n",
"print(\"There are {} examples in the validation dataset.\".format(len(eval_df)))\n",
"print(\"There are {} examples in the test dataset.\".format(len(test_df)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preprocess data using Pandas\n",
"\n",
"We'll perform a few preprocessing steps on our dataset. To simulate the lack of an ultrasound, we'll duplicate some rows and set the `is_male` field to `Unknown`; in those duplicated rows, if there is more than one child we'll also change the `plurality` to `Multiple(2+)`. While we're at it, we'll convert the `plurality` column from an integer to a string. We'll perform these operations below. \n",
"\n",
"Let's start by examining the training dataset as is."
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>weight_pounds</th>\n",
" <th>is_male</th>\n",
" <th>mother_age</th>\n",
" <th>plurality</th>\n",
" <th>gestation_weeks</th>\n",
" <th>hash_values</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>9.499719</td>\n",
" <td>True</td>\n",
" <td>30</td>\n",
" <td>1</td>\n",
" <td>40</td>\n",
" <td>505732274561700014</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>6.027438</td>\n",
" <td>True</td>\n",
" <td>26</td>\n",
" <td>1</td>\n",
" <td>36</td>\n",
" <td>1409348435509100014</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>6.124442</td>\n",
" <td>True</td>\n",
" <td>34</td>\n",
" <td>2</td>\n",
" <td>37</td>\n",
" <td>2620860165093800008</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td>9.001474</td>\n",
" <td>True</td>\n",
" <td>28</td>\n",
" <td>1</td>\n",
" <td>35</td>\n",
" <td>1409348435509100014</td>\n",
" </tr>\n",
" <tr>\n",
" <td>4</td>\n",
" <td>7.070225</td>\n",
" <td>False</td>\n",
" <td>23</td>\n",
" <td>1</td>\n",
" <td>40</td>\n",
" <td>4659354114038800077</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" weight_pounds is_male mother_age plurality gestation_weeks \\\n",
"0 9.499719 True 30 1 40 \n",
"1 6.027438 True 26 1 36 \n",
"2 6.124442 True 34 2 37 \n",
"3 9.001474 True 28 1 35 \n",
"4 7.070225 False 23 1 40 \n",
"\n",
" hash_values \n",
"0 505732274561700014 \n",
"1 1409348435509100014 \n",
"2 2620860165093800008 \n",
"3 1409348435509100014 \n",
"4 4659354114038800077 "
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Also, check whether any of the important numeric fields are missing in some rows: the `count` row in the Pandas summary excludes missing values, so a column with missing data shows a count lower than the total number of rows."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>weight_pounds</th>\n",
" <th>mother_age</th>\n",
" <th>plurality</th>\n",
" <th>gestation_weeks</th>\n",
" <th>hash_values</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>count</td>\n",
" <td>7733.000000</td>\n",
" <td>7733.000000</td>\n",
" <td>7733.000000</td>\n",
" <td>7733.000000</td>\n",
" <td>7.733000e+03</td>\n",
" </tr>\n",
" <tr>\n",
" <td>mean</td>\n",
" <td>7.264415</td>\n",
" <td>28.213371</td>\n",
" <td>1.035691</td>\n",
" <td>38.691064</td>\n",
" <td>4.983286e+18</td>\n",
" </tr>\n",
" <tr>\n",
" <td>std</td>\n",
" <td>1.303220</td>\n",
" <td>6.134232</td>\n",
" <td>0.201568</td>\n",
" <td>2.531921</td>\n",
" <td>2.551244e+18</td>\n",
" </tr>\n",
" <tr>\n",
" <td>min</td>\n",
" <td>0.562179</td>\n",
" <td>13.000000</td>\n",
" <td>1.000000</td>\n",
" <td>18.000000</td>\n",
" <td>5.826385e+15</td>\n",
" </tr>\n",
" <tr>\n",
" <td>25%</td>\n",
" <td>6.624891</td>\n",
" <td>23.000000</td>\n",
" <td>1.000000</td>\n",
" <td>38.000000</td>\n",
" <td>3.153609e+18</td>\n",
" </tr>\n",
" <tr>\n",
" <td>50%</td>\n",
" <td>7.345803</td>\n",
" <td>28.000000</td>\n",
" <td>1.000000</td>\n",
" <td>39.000000</td>\n",
" <td>4.896699e+18</td>\n",
" </tr>\n",
" <tr>\n",
" <td>75%</td>\n",
" <td>8.062305</td>\n",
" <td>33.000000</td>\n",
" <td>1.000000</td>\n",
" <td>40.000000</td>\n",
" <td>6.784884e+18</td>\n",
" </tr>\n",
" <tr>\n",
" <td>max</td>\n",
" <td>11.563246</td>\n",
" <td>48.000000</td>\n",
" <td>4.000000</td>\n",
" <td>47.000000</td>\n",
" <td>9.210618e+18</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" weight_pounds mother_age plurality gestation_weeks hash_values\n",
"count 7733.000000 7733.000000 7733.000000 7733.000000 7.733000e+03\n",
"mean 7.264415 28.213371 1.035691 38.691064 4.983286e+18\n",
"std 1.303220 6.134232 0.201568 2.531921 2.551244e+18\n",
"min 0.562179 13.000000 1.000000 18.000000 5.826385e+15\n",
"25% 6.624891 23.000000 1.000000 38.000000 3.153609e+18\n",
"50% 7.345803 28.000000 1.000000 39.000000 4.896699e+18\n",
"75% 8.062305 33.000000 1.000000 40.000000 6.784884e+18\n",
"max 11.563246 48.000000 4.000000 47.000000 9.210618e+18"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_df.describe()"
]
},
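{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional, more direct check (an addition, not part of the original lab), `isnull().sum()` reports how many entries in each column are missing, without having to compare `count` against the number of rows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: count missing values in each column of the training sample\n",
"train_df.isnull().sum()"
]
},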
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is always crucial to clean raw data before using it in machine learning, so we add a preprocessing step. We'll define a `preprocess` function below. Note that the mother's age is an input to our model, so users will have to provide it; otherwise, our service won't work. The features we use for our model were chosen because they are good predictors and because they are easy enough to collect."
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"def preprocess(df):\n",
" \"\"\" Preprocess pandas dataframe for augmented babyweight data.\n",
" \n",
" Args:\n",
" df: Dataframe containing raw babyweight data.\n",
" Returns:\n",
" Pandas dataframe containing preprocessed raw babyweight data as well\n",
" as simulated no ultrasound data masking some of the original data.\n",
" \"\"\"\n",
" # Clean up raw data\n",
" # Filter out what we don't want to use for training\n",
" df = df[df.weight_pounds > 0]\n",
" df = df[df.mother_age > 0]\n",
" df = df[df.gestation_weeks > 0]\n",
" df = df[df.plurality > 0]\n",
"\n",
" # Modify plurality field to be a string\n",
" twins_etc = dict(zip([1,2,3,4,5],\n",
" [\"Single(1)\",\n",
" \"Twins(2)\",\n",
" \"Triplets(3)\",\n",
" \"Quadruplets(4)\",\n",
" \"Quintuplets(5)\"]))\n",
" df[\"plurality\"].replace(twins_etc, inplace=True)\n",
"\n",
" # Clone data and mask certain columns to simulate lack of ultrasound\n",
" no_ultrasound = df.copy(deep=True)\n",
"\n",
" # Modify is_male\n",
" no_ultrasound[\"is_male\"] = \"Unknown\"\n",
" \n",
" # Modify plurality\n",
" condition = no_ultrasound[\"plurality\"] != \"Single(1)\"\n",
" no_ultrasound.loc[condition, \"plurality\"] = \"Multiple(2+)\"\n",
"\n",
" # Concatenate both datasets together and shuffle\n",
" return pd.concat(\n",
" [df, no_ultrasound]).sample(frac=1).reset_index(drop=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's preprocess the train/eval/test sets and look at a small sample of the training data after our preprocessing:"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"train_df = preprocess(train_df)\n",
"eval_df = preprocess(eval_df)\n",
"test_df = preprocess(test_df)"
]
},
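{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (an optional addition, not part of the original lab), we can verify that preprocessing roughly doubled each split — every surviving row was cloned into a simulated no-ultrasound row — and that `is_male` now includes the `Unknown` value:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: each split should roughly double in size after\n",
"# cloning, and is_male should now contain the simulated \"Unknown\" value\n",
"for name, df in [(\"train\", train_df), (\"eval\", eval_df), (\"test\", test_df)]:\n",
"    print(\"{} rows: {}\".format(name, len(df)))\n",
"print(train_df[\"is_male\"].unique())"
]
},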
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>weight_pounds</th>\n",
" <th>is_male</th>\n",
" <th>mother_age</th>\n",
" <th>plurality</th>\n",
" <th>gestation_weeks</th>\n",
" <th>hash_values</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>7.874912</td>\n",
" <td>Unknown</td>\n",
" <td>38</td>\n",
" <td>Single(1)</td>\n",
" <td>38</td>\n",
" <td>8717259940738900003</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>8.999270</td>\n",
" <td>Unknown</td>\n",
" <td>31</td>\n",
" <td>Single(1)</td>\n",
" <td>45</td>\n",
" <td>6781866293108400060</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>7.251004</td>\n",
" <td>True</td>\n",
" <td>24</td>\n",
" <td>Single(1)</td>\n",
" <td>40</td>\n",
" <td>1696737464106800060</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td>8.562754</td>\n",
" <td>True</td>\n",
" <td>43</td>\n",
" <td>Single(1)</td>\n",
" <td>39</td>\n",
" <td>4614303140002600076</td>\n",
" </tr>\n",
" <tr>\n",
" <td>4</td>\n",
" <td>6.194990</td>\n",
" <td>True</td>\n",
" <td>23</td>\n",
" <td>Single(1)</td>\n",
" <td>41</td>\n",
" <td>780565305641800050</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" weight_pounds is_male mother_age plurality gestation_weeks \\\n",
"0 7.874912 Unknown 38 Single(1) 38 \n",
"1 8.999270 Unknown 31 Single(1) 45 \n",
"2 7.251004 True 24 Single(1) 40 \n",
"3 8.562754 True 43 Single(1) 39 \n",
"4 6.194990 True 23 Single(1) 41 \n",
"\n",
" hash_values \n",
"0 8717259940738900003 \n",
"1 6781866293108400060 \n",
"2 1696737464106800060 \n",
"3 4614303140002600076 \n",
"4 780565305641800050 "
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>weight_pounds</th>\n",
" <th>is_male</th>\n",
" <th>mother_age</th>\n",
" <th>plurality</th>\n",
" <th>gestation_weeks</th>\n",
" <th>hash_values</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>15461</td>\n",
" <td>7.251004</td>\n",
" <td>True</td>\n",
" <td>32</td>\n",
" <td>Single(1)</td>\n",
" <td>39</td>\n",
" <td>8655151740159000017</td>\n",
" </tr>\n",
" <tr>\n",
" <td>15462</td>\n",
" <td>8.811877</td>\n",
" <td>True</td>\n",
" <td>30</td>\n",
" <td>Single(1)</td>\n",
" <td>39</td>\n",
" <td>845203792559000058</td>\n",
" </tr>\n",
" <tr>\n",
" <td>15463</td>\n",
" <td>7.248799</td>\n",
" <td>True</td>\n",
" <td>26</td>\n",
" <td>Single(1)</td>\n",
" <td>40</td>\n",
" <td>1409348435509100014</td>\n",
" </tr>\n",
" <tr>\n",
" <td>15464</td>\n",
" <td>7.625790</td>\n",
" <td>Unknown</td>\n",
" <td>22</td>\n",
" <td>Single(1)</td>\n",
" <td>40</td>\n",
" <td>2875790318525700041</td>\n",
" </tr>\n",
" <tr>\n",
" <td>15465</td>\n",
" <td>6.499227</td>\n",
" <td>Unknown</td>\n",
" <td>22</td>\n",
" <td>Single(1)</td>\n",
" <td>38</td>\n",
" <td>8720767384765100051</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" weight_pounds is_male mother_age plurality gestation_weeks \\\n",
"15461 7.251004 True 32 Single(1) 39 \n",
"15462 8.811877 True 30 Single(1) 39 \n",
"15463 7.248799 True 26 Single(1) 40 \n",
"15464 7.625790 Unknown 22 Single(1) 40 \n",
"15465 6.499227 Unknown 22 Single(1) 38 \n",
"\n",
" hash_values \n",
"15461 8655151740159000017 \n",
"15462 845203792559000058 \n",
"15463 1409348435509100014 \n",
"15464 2875790318525700041 \n",
"15465 8720767384765100051 "
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_df.tail()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look again at a summary of the dataset. Note that `describe` only summarizes numeric columns, so the now-string `plurality` and `is_male` columns do not show up."
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>weight_pounds</th>\n",
" <th>mother_age</th>\n",
" <th>gestation_weeks</th>\n",
" <th>hash_values</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>count</td>\n",
" <td>15466.000000</td>\n",
" <td>15466.000000</td>\n",
" <td>15466.000000</td>\n",
" <td>1.546600e+04</td>\n",
" </tr>\n",
" <tr>\n",
" <td>mean</td>\n",
" <td>7.264415</td>\n",
" <td>28.213371</td>\n",
" <td>38.691064</td>\n",
" <td>4.983286e+18</td>\n",
" </tr>\n",
" <tr>\n",
" <td>std</td>\n",
" <td>1.303178</td>\n",
" <td>6.134034</td>\n",
" <td>2.531839</td>\n",
" <td>2.551162e+18</td>\n",
" </tr>\n",
" <tr>\n",
" <td>min</td>\n",
" <td>0.562179</td>\n",
" <td>13.000000</td>\n",
" <td>18.000000</td>\n",
" <td>5.826385e+15</td>\n",
" </tr>\n",
" <tr>\n",
" <td>25%</td>\n",
" <td>6.624891</td>\n",
" <td>23.000000</td>\n",
" <td>38.000000</td>\n",
" <td>3.153609e+18</td>\n",
" </tr>\n",
" <tr>\n",
" <td>50%</td>\n",
" <td>7.345803</td>\n",
" <td>28.000000</td>\n",
" <td>39.000000</td>\n",
" <td>4.896699e+18</td>\n",
" </tr>\n",
" <tr>\n",
" <td>75%</td>\n",
" <td>8.062305</td>\n",
" <td>33.000000</td>\n",
" <td>40.000000</td>\n",
" <td>6.784884e+18</td>\n",
" </tr>\n",
" <tr>\n",
" <td>max</td>\n",
" <td>11.563246</td>\n",
" <td>48.000000</td>\n",
" <td>47.000000</td>\n",
" <td>9.210618e+18</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" weight_pounds mother_age gestation_weeks hash_values\n",
"count 15466.000000 15466.000000 15466.000000 1.546600e+04\n",
"mean 7.264415 28.213371 38.691064 4.983286e+18\n",
"std 1.303178 6.134034 2.531839 2.551162e+18\n",
"min 0.562179 13.000000 18.000000 5.826385e+15\n",
"25% 6.624891 23.000000 38.000000 3.153609e+18\n",
"50% 7.345803 28.000000 39.000000 4.896699e+18\n",
"75% 8.062305 33.000000 40.000000 6.784884e+18\n",
"max 11.563246 48.000000 47.000000 9.210618e+18"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_df.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Write to .csv files \n",
"\n",
"In the final versions, we want to read from files, not Pandas dataframes, so we write the dataframes out as CSV files. Using CSV files gives us the advantage of shuffling during read. This is important for distributed training: some workers might be slower than others, and shuffling helps prevent the same data from always being assigned to the slow workers."
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": [
"# Define columns\n",
"columns = [\"weight_pounds\",\n",
" \"is_male\",\n",
" \"mother_age\",\n",
" \"plurality\",\n",
" \"gestation_weeks\"]\n",
"\n",
"# Write out CSV files\n",
"train_df.to_csv(\n",
" path_or_buf=\"train.csv\", columns=columns, header=False, index=False)\n",
"eval_df.to_csv(\n",
" path_or_buf=\"eval.csv\", columns=columns, header=False, index=False)\n",
"test_df.to_csv(\n",
" path_or_buf=\"test.csv\", columns=columns, header=False, index=False)"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 2074 eval.csv\n",
" 1122 test.csv\n",
" 15466 train.csv\n",
" 18662 total\n"
]
}
],
"source": [
"%%bash\n",
"wc -l *.csv"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"==> eval.csv <==\n",
"8.62448368944,Unknown,31,Single(1),42\n",
"6.9996768185,Unknown,32,Single(1),39\n",
"6.6248909731,False,30,Single(1),38\n",
"8.3114272774,False,19,Single(1),41\n",
"8.313631900019999,True,32,Single(1),37\n",
"7.06140625186,Unknown,34,Single(1),41\n",
"7.62578964258,Unknown,34,Single(1),39\n",
"7.3744626639,Unknown,20,Single(1),39\n",
"1.93786328298,False,32,Triplets(3),28\n",
"8.99926953484,True,34,Single(1),39\n",
"\n",
"==> test.csv <==\n",
"7.3744626639,Unknown,25,Single(1),44\n",
"6.93794738514,Unknown,24,Single(1),40\n",
"6.87621795178,True,30,Single(1),39\n",
"6.87621795178,Unknown,29,Single(1),39\n",
"7.0327461578,Unknown,36,Single(1),38\n",
"9.31232594688,False,25,Single(1),39\n",
"7.936641432,True,23,Single(1),37\n",
"4.7840310854,Unknown,34,Multiple(2+),38\n",
"7.31273323054,True,23,Single(1),39\n",
"8.24969784404,False,32,Single(1),39\n",
"\n",
"==> train.csv <==\n",
"7.87491199864,Unknown,38,Single(1),38\n",
"8.99926953484,Unknown,31,Single(1),45\n",
"7.25100379718,True,24,Single(1),40\n",
"8.56275425608,True,43,Single(1),39\n",
"6.1949895622,True,23,Single(1),41\n",
"9.0609989682,Unknown,24,Single(1),38\n",
"7.5618555866,True,26,Single(1),41\n",
"7.30611936268,False,31,Single(1),41\n",
"9.6672701887,True,29,Single(1),40\n",
"6.4992274837599995,True,22,Single(1),39\n"
]
}
],
"source": [
"%%bash\n",
"head *.csv"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"==> eval.csv <==\n",
"7.43839671988,False,25,Single(1),37\n",
"7.06140625186,True,34,Single(1),41\n",
"7.43619209726,True,36,Single(1),40\n",
"3.56267015392,True,35,Twins(2),31\n",
"8.811876612139999,False,27,Single(1),36\n",
"8.0689187892,Unknown,36,Single(1),40\n",
"8.7633749145,Unknown,34,Single(1),39\n",
"7.43839671988,True,43,Single(1),40\n",
"4.62529825676,Unknown,38,Multiple(2+),35\n",
"6.1839664491,Unknown,20,Single(1),38\n",
"\n",
"==> test.csv <==\n",
"6.37576861704,Unknown,21,Single(1),39\n",
"7.5618555866,True,22,Single(1),39\n",
"8.99926953484,Unknown,28,Single(1),42\n",
"7.82420567838,Unknown,24,Single(1),39\n",
"9.25059651352,True,26,Single(1),40\n",
"8.62448368944,Unknown,28,Single(1),39\n",
"5.2580249487,False,18,Single(1),38\n",
"7.87491199864,True,25,Single(1),37\n",
"5.81138522632,Unknown,41,Single(1),36\n",
"6.93794738514,True,24,Single(1),40\n",
"\n",
"==> train.csv <==\n",
"7.81318256528,True,18,Single(1),43\n",
"7.31273323054,False,35,Single(1),34\n",
"6.75055446244,Unknown,37,Single(1),39\n",
"7.43839671988,True,32,Single(1),39\n",
"6.9666074791999995,True,20,Single(1),38\n",
"7.25100379718,True,32,Single(1),39\n",
"8.811876612139999,True,30,Single(1),39\n",
"7.24879917456,True,26,Single(1),40\n",
"7.62578964258,Unknown,22,Single(1),40\n",
"6.4992274837599995,Unknown,22,Single(1),38\n"
]
}
],
"source": [
"%%bash\n",
"tail *.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Lab Summary: \n",
"In this lab, we set up the environment, sampled the natality dataset to create train/eval/test splits, and preprocessed the data in a Pandas dataframe."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright 2022 Google Inc. Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}