notebooks/regression/decision_tree_regression/housing_prices_prediction.ipynb (623 lines of code) (raw):
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "9ebc1564",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"# Copyright 2023 Google LLC\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
]
},
{
"cell_type": "markdown",
"id": "6cf651d6",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"# Housing prices prediction from real estate assessments (Decision Tree Regression)"
]
},
{
"cell_type": "markdown",
"id": "d88ad219-150b-4702-8050-9a7d53f4dfdc",
"metadata": {},
"source": [
"<table align=\"left\">\n",
"\n",
"<a href=\"https://github.com/GoogleCloudPlatform/ai-ml-recipes/blob/main/notebooks/regression/decision_tree_regression/housing_prices_prediction.ipynb\">\n",
"<img src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" alt=\"GitHub logo\">\n",
"View on GitHub\n",
"</a>\n",
"</td>\n",
"<td>\n",
"<a href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/ai-ml-recipes/main/notebooks/regression/decision_tree_regression/housing_prices_prediction.ipynb\">\n",
"<img src=\"https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32\" alt=\"Vertex AI logo\">\n",
"Open in Vertex AI Workbench\n",
"</a>\n",
"</td>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"id": "26448294-60b5-4e2f-825d-a4562fbd545f",
"metadata": {},
"source": [
"## Overview"
]
},
{
"cell_type": "markdown",
"id": "ceb0c35d-4d2c-4841-8e39-b727ec480687",
"metadata": {
"tags": []
},
"source": [
"This notebook shows how to predict housing prices based on location and other characteristics using Decision Tree Regression.\n",
"\n",
"#### **Steps**\n",
"Using Spark, \n",
"1) It reads the table [State of Connecticut - Real Estate Sales](https://catalog.data.gov/dataset/real-estate-sales-2001-2018) from [gs://dataproc-metastore-public-binaries/real_estate_sales](https://console.cloud.google.com/storage/browser/dataproc-metastore-public-binaries/real_estate_sales). \n",
"2) It parses process the dataset to choose features and train the ML model (fits the decision tree regression model) to predict a target value. \n",
" **Features**: list_year, property_type, residential_type, longitude, latitude \n",
" **Target**: assessed_value \n",
"3) It evaluates and plot the results.\n",
"\n",
"#### **Details of the dataset**\n",
"- The dataset contains listing of real estate sales with a sales price of $2,000 or greater that occur between October 1 and September 30 of each year (2001 to 2020). \n",
"- The dataset contains data for some towns from the State of Connecticut, like:\n",
" - Danbury, New Milford, New Haven, Norwalk, Hartford, East Haven, Montville, Bridgeport, Southington, Vernon, Wolcott, ...\n",
"- For each sale record, the file includes information such as town, property address, location, date of sale, property type (residential, apartment, commercial, industrial or vacant land), sales price, and property assessment."
]
},
{
"cell_type": "markdown",
"id": "709edcbd-6dd7-4df1-8068-f75b907ee9c2",
"metadata": {
"tags": []
},
"source": [
"### Setup"
]
},
{
"cell_type": "markdown",
"id": "af2a8444-80ec-48b1-8cd4-a058844b604a",
"metadata": {
"tags": []
},
"source": [
"#### Identity and Access Management (IAM)"
]
},
{
"cell_type": "markdown",
"id": "0e225825-fa1e-48b2-8045-c6791865ae22",
"metadata": {},
"source": [
"Make sure the service account running this notebook has the required permissions:\n",
"\n",
"- **Run the notebook**\n",
" - AI Platform Notebooks Service Agent\n",
" - Notebooks Admin\n",
" - Vertex AI Administrator\n",
"- **Read files from bucket**\n",
" - Storage Object Viewer\n",
"- **Run Dataproc jobs**\n",
" - Dataproc Service Agent\n",
" - Dataproc Worker"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4497dba8-0bdd-4314-a328-01e0bdc0c7be",
"metadata": {},
"outputs": [],
"source": [
"#### Import dependencies\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"\n",
"from pyspark.sql.types import IntegerType, FloatType\n",
"from pyspark.sql.functions import round, desc, corr, col\n",
"\n",
"from pyspark.ml import Pipeline\n",
"from pyspark.ml.evaluation import RegressionEvaluator\n",
"from pyspark.ml.regression import DecisionTreeRegressor\n",
"from pyspark.ml.feature import Bucketizer, StringIndexer, VectorAssembler\n",
"from pyspark.ml.linalg import Vectors"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2d49eedf-6ef6-40f4-95fc-b0f9d00d4fe7",
"metadata": {},
"outputs": [],
"source": [
"from pyspark.sql import SparkSession"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f37f06c4-6687-43bd-90b6-b57a58816f51",
"metadata": {},
"outputs": [],
"source": [
"spark = SparkSession.builder \\\n",
" .appName(\"Housing prices prediction with Decision Tree Regression\") \\\n",
" .enableHiveSupport() \\\n",
" .getOrCreate()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c4632574-01e1-49ae-8477-d5d92c62654d",
"metadata": {},
"outputs": [],
"source": [
"raw_dataset = spark.read.parquet(\"gs://dataproc-metastore-public-binaries/real_estate_sales/\")"
]
},
{
"cell_type": "markdown",
"id": "7b81c456-31b2-4343-8d59-bbcead8ae282",
"metadata": {
"tags": []
},
"source": [
"### Pre-process dataset / filter values"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d82a8bb8-2504-48ad-abd3-68af72778b7a",
"metadata": {},
"outputs": [],
"source": [
"#### Filters\n",
"filters = [\n",
"\"residential_type IS NOT NULL AND residential_type != 'Unknown'\",\n",
"\"property_type IS NOT NULL AND property_type != 'Unknown'\",\n",
"\"assessed_value IS NOT NULL\",\n",
"\"longitude IS NOT NULL\",\n",
"\"latitude IS NOT NULL\",\n",
"\"sale_amount IS NOT NULL\"\n",
"]\n",
"filters = \" AND \".join(filters)\n",
"filtered_dataset = raw_dataset.filter(filters)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1bf789a7-0281-490e-9f73-98cafeac8056",
"metadata": {},
"outputs": [],
"source": [
"# Towns\n",
"filtered_dataset.groupBy(\"town\").count().orderBy(desc(\"count\")).limit(100).show(10,100)\n",
"print(f'Number of distinct towns: {filtered_dataset.select(\"town\").distinct().count()}')\n",
"print(f'Number of towns: {filtered_dataset.select(\"town\").count()}')"
]
},
{
"cell_type": "markdown",
"id": "d7a8d08b-fba6-4dab-bcee-4e96b200a0f9",
"metadata": {
"tags": []
},
"source": [
"### Select town"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b3566a3a-439f-4ec5-9b19-78bf2f6a31bc",
"metadata": {},
"outputs": [],
"source": [
"SELECTED_TOWN = \"Waterbury\"\n",
"pre_processed_dataset = filtered_dataset.filter(f\"town == '{SELECTED_TOWN}'\")\n",
"pre_processed_dataset.show(5,20)"
]
},
{
"cell_type": "markdown",
"id": "f05e1600-69b9-44de-8b5c-66dcf80155df",
"metadata": {},
"source": [
"|serial_number|list_year|date_recorded| town| address|assessed_value|sale_amount|sales_ratio|property_type|residential_type|non_use_code|assessor_remarks|opm_remarks|longitude|latitude|\n",
"|-------------|---------|-------------|---------|----------------|--------------|-----------|-----------|-------------|----------------|------------|----------------|-----------|---------|--------|\n",
"| 201374| 2020| 2021-04-06|Waterbury| 88 ELLSMERE AVE| 62560.0| 210000.0| 0.2979| Residential| Single Family| Unknown| Unknown| Unknown|-72.99548|41.54391|\n",
"| 200990| 2020| 2021-01-27|Waterbury| 25 MACARTHUR DR| 50110.0| 105000.0| 0.4772| Residential| Single Family| Unknown| Unknown| Unknown|-73.03644|41.57748|\n",
"| 201065| 2020| 2021-02-09|Waterbury| 38 KELLOGG ST| 34240.0| 97900.0| 0.3497| Residential| Single Family| Unknown| Unknown| Unknown|-73.04534|41.56237|\n",
"| 201242| 2020| 2021-03-10|Waterbury|200 SOUTHWEST RD| 164390.0| 310000.0| 0.5302| Residential| Single Family| Unknown| Unknown| Unknown|-73.07834|41.53038|\n",
"| 200407| 2020| 2020-11-12|Waterbury| 30 KENMORE AVE| 108210.0| 225000.0| 0.4809| Residential| Single Family| Unknown| Unknown| Unknown|-73.08336|41.57112|"
]
},
{
"cell_type": "markdown",
"id": "db224407-fae7-4f2f-8828-27e8439b37e4",
"metadata": {
"tags": []
},
"source": [
"### Exploratory Data Analysis"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fde42868-baaf-4fe9-80c1-c4fc807c7e98",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Count, Mean, Min, Max of numeric columns\n",
"numeric_columns = [\"list_year\",\"assessed_value\",\"sale_amount\"]\n",
"pre_processed_dataset.select(numeric_columns).describe().select('summary', *[round(c, 2).alias(c) for c in numeric_columns]).show()"
]
},
{
"cell_type": "markdown",
"id": "2f7f8943-b52e-4d9e-9c2e-b79c2345025d",
"metadata": {},
"source": [
"|summary|list_year|assessed_value|sale_amount|\n",
"|-------|---------|--------------|-----------|\n",
"| count| 3836.0| 3836.0| 3836.0|\n",
"| mean| 2014.19| 86038.21| 120607.07|\n",
"| stddev| 4.89| 61597.34| 125087.8|\n",
"| min| 2006.0| 3500.0| 2000.0|\n",
"| max| 2020.0| 1889750.0| 5091000.0|"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e7dc12f0-ab19-42a4-af6c-3faef821f84b",
"metadata": {},
"outputs": [],
"source": [
"# Towns\n",
"pre_processed_dataset.groupBy(\"town\").count().orderBy(desc(\"count\")).limit(100).show(10,100)\n",
"print(f'Number of distinct towns: {pre_processed_dataset.select(\"town\").distinct().count()}')\n",
"print(f'Number of towns: {pre_processed_dataset.select(\"town\").count()}')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0105da30-4944-40bc-b56c-60ea648a29c7",
"metadata": {},
"outputs": [],
"source": [
"# Property Type\n",
"pre_processed_dataset.groupBy(\"property_type\").count().orderBy(desc(\"count\")).limit(12).show(12,100)\n",
"print(f'Number of property type: {pre_processed_dataset.select(\"property_type\").count()}')\n",
"print(f'Number of distinct property type: {pre_processed_dataset.select(\"property_type\").distinct().count()}')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2583e5e3-e0d0-4ff7-8019-39a0eeb6a9c4",
"metadata": {},
"outputs": [],
"source": [
"# Residential Type\n",
"pre_processed_dataset.groupBy(\"residential_type\").count().orderBy(desc(\"count\")).limit(12).show(12,100)\n",
"print(f'Number of residential type: {pre_processed_dataset.select(\"residential_type\").count()}')\n",
"print(f'Number of distinct residential type: {pre_processed_dataset.select(\"residential_type\").distinct().count()}')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "604b6095-ce30-4358-82dd-bf2e0f781ee3",
"metadata": {},
"outputs": [],
"source": [
"# Years\n",
"pre_processed_dataset.groupBy(\"list_year\").count().orderBy(desc(\"count\")).limit(10).show(10,100)"
]
},
{
"cell_type": "markdown",
"id": "79916aa6-0628-4dc2-92c0-dea658c07ba9",
"metadata": {
"tags": []
},
"source": [
"### Process dataset to create features"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4f643a32-5ad9-4c37-8e52-4e3303fce782",
"metadata": {},
"outputs": [],
"source": [
"target_column = 'assessed_value'\n",
"columns = [target_column,'list_year','property_type','residential_type', 'longitude', 'latitude']\n",
"categorical_columns = ['list_year','property_type','residential_type']\n",
"feature_columns = ['indexed_list_year','indexed_property_type','indexed_residential_type', 'longitude', 'latitude']\n",
"\n",
"#### Select only relevant columns\n",
"sub_dataset = pre_processed_dataset.select(*columns)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "80e52f57-05d7-4838-9ba4-fbe0565842ba",
"metadata": {},
"outputs": [],
"source": [
"sub_dataset.printSchema()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e414bd29-03a5-40dd-8771-eadc59f2b1e7",
"metadata": {},
"outputs": [],
"source": [
"sub_dataset.show(5,20)"
]
},
{
"cell_type": "markdown",
"id": "b9de2bc5-ebe9-46da-9a7d-22c98f44babb",
"metadata": {},
"source": [
"|assessed_value|list_year|property_type|residential_type|longitude|latitude|\n",
"|--------------|---------|-------------|----------------|---------|--------|\n",
"| 62560.0| 2020| Residential| Single Family|-72.99548|41.54391|\n",
"| 50110.0| 2020| Residential| Single Family|-73.03644|41.57748|\n",
"| 34240.0| 2020| Residential| Single Family|-73.04534|41.56237|\n",
"| 164390.0| 2020| Residential| Single Family|-73.07834|41.53038|\n",
"| 108210.0| 2020| Residential| Single Family|-73.08336|41.57112|\n",
"| 49840.0| 2020| Residential| Condo|-72.97195|41.55385|"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c4a8046b-8478-4fa8-bd99-23e44daad694",
"metadata": {},
"outputs": [],
"source": [
"#### Index categorical columns\n",
"indexers = [StringIndexer(inputCol=column, outputCol=\"indexed_\" + column).fit(sub_dataset) for column in categorical_columns]\n",
"pipeline = Pipeline(stages=indexers)\n",
"indexed_dataset = pipeline.fit(sub_dataset).transform(sub_dataset)\n",
"indexed_dataset = indexed_dataset.drop(*categorical_columns)\n",
"indexed_dataset.show(5,20)"
]
},
{
"cell_type": "markdown",
"id": "c96398b0-79a9-4e05-aba9-5eddf7fafd2e",
"metadata": {},
"source": [
"|assessed_value|longitude|latitude|indexed_list_year|indexed_property_type|indexed_residential_type|\n",
"|--------------|---------|--------|-----------------|---------------------|------------------------|\n",
"| 62560.0|-72.99548|41.54391| 0.0| 1.0| 0.0|\n",
"| 50110.0|-73.03644|41.57748| 0.0| 1.0| 0.0|\n",
"| 34240.0|-73.04534|41.56237| 0.0| 1.0| 0.0|\n",
"| 164390.0|-73.07834|41.53038| 0.0| 1.0| 0.0|\n",
"| 108210.0|-73.08336|41.57112| 0.0| 1.0| 0.0|"
]
},
{
"cell_type": "markdown",
"id": "3463e7e4-5059-4b48-a8be-a7b72471e0d0",
"metadata": {
"tags": []
},
"source": [
"### Transform features to LIBSVM format"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3c368ca7-647c-442d-b98b-331704814af6",
"metadata": {},
"outputs": [],
"source": [
"assembler = VectorAssembler(inputCols=feature_columns, outputCol=\"features\")\n",
"dataset = assembler.transform(indexed_dataset)\n",
"\n",
"dataset = dataset.select(target_column,'features')\n",
"dataset = dataset.withColumnRenamed(target_column,'label')\n",
"\n",
"dataset.show(5,100)"
]
},
{
"cell_type": "markdown",
"id": "9f06a6f9-20b9-475b-ab27-6fa2f9d2c2cb",
"metadata": {},
"source": [
"| label| features|\n",
"|--------|--------------------------------------------------|\n",
"| 62560.0| [0.0,1.0,0.0,-72.9954833984375,41.54391098022461]|\n",
"| 50110.0|[0.0,1.0,0.0,-73.03643798828125,41.57748031616211]|\n",
"| 34240.0|[0.0,1.0,0.0,-73.04534149169922,41.56237030029297]|\n",
"|164390.0|[0.0,1.0,0.0,-73.07833862304688,41.53038024902344]|\n",
"|108210.0|[0.0,1.0,0.0,-73.08335876464844,41.57112121582031]|"
]
},
{
"cell_type": "markdown",
"id": "18f2fa39-94b5-4b64-bd1e-d566aeced15a",
"metadata": {
"tags": []
},
"source": [
"### Train/Fit the model"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a9422291-4215-4c9c-8375-5fa6a34cca3b",
"metadata": {},
"outputs": [],
"source": [
"(trainingData, testData) = dataset.randomSplit([0.8, 0.2])\n",
"\n",
"dt = DecisionTreeRegressor(featuresCol=\"features\", maxDepth = 10, maxBins = 12)\n",
"\n",
"pipeline = Pipeline(stages=[dt])\n",
"\n",
"# Train model\n",
"model = pipeline.fit(trainingData)"
]
},
{
"cell_type": "markdown",
"id": "7f4b94b0-855f-4ae4-8a50-80a812cd2f09",
"metadata": {
"tags": []
},
"source": [
"### Evaluate the model"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "50e864dc-a503-43e9-b05d-716260825183",
"metadata": {},
"outputs": [],
"source": [
"# Make predictions.\n",
"predictions = model.transform(testData)\n",
"\n",
"# Select example rows to display.\n",
"predictions.select(\"prediction\", \"label\", \"features\").show(10)\n",
"\n",
"evaluator = RegressionEvaluator(labelCol=\"label\", predictionCol=\"prediction\", metricName=\"rmse\")\n",
"rmse = evaluator.evaluate(predictions)\n",
"print(\"Root Mean Squared Error (RMSE) on test data = %g\" % rmse)"
]
},
{
"cell_type": "markdown",
"id": "ed15fc83-0097-460e-a6f7-563fdcd2b91c",
"metadata": {},
"source": [
"|prediction| label| features|\n",
"|----------|-------|--------------------|\n",
"| 28001.6|21700.0|[1.0,2.0,1.0,-72....|\n",
"| 27911.1|21820.0|[0.0,1.0,1.0,-72....|\n",
"| 27911.1|22640.0|[0.0,1.0,1.0,-73....|\n",
"| 97573.3|22850.0|[4.0,2.0,1.0,-72....|\n",
"| 27911.1|24170.0|[0.0,1.0,1.0,-73....|"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1ab0c11d-be3d-4422-9c08-ac3fcc81f587",
"metadata": {},
"outputs": [],
"source": [
"#### Count, Mean, Min, Max of predictions\n",
"predictions.select([\"prediction\", \"label\"]).describe().select('summary', *[round(c, 2).alias(c) for c in [\"prediction\", \"label\"]]).show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "658f91cf-528e-44e5-b922-573e394e3551",
"metadata": {},
"outputs": [],
"source": [
"#### Model results\n",
"treeModel = model.stages[0]\n",
"print(treeModel)\n",
"print(treeModel.featureImportances)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f9fff284-81a0-40c5-adfd-f06fed688bc9",
"metadata": {},
"outputs": [],
"source": [
"#### Plot predictions against target\n",
"x = range(0, predictions.count())\n",
"y_pred=predictions.select(\"prediction\").collect()\n",
"y_target=predictions.select(\"label\").collect()\n",
" \n",
"plt.plot(x, y_target, label=\"label\")\n",
"plt.plot(x, y_pred, label=\"prediction\")\n",
"plt.title(\"Test and predicted data\")\n",
"\n",
"plt.xlabel('x axis')\n",
"plt.ylabel('y axis')\n",
"\n",
"plt.legend(loc='best',fancybox=True, shadow=False)\n",
"plt.show() "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 on cluster-dpms (Remote)",
"language": "python",
"name": "7a42ae6b283a613cfabce6f7-python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.8"
}
},
"nbformat": 4,
"nbformat_minor": 5
}