courses/ai-for-time-series/notebooks/01-explore.ipynb (439 lines of code) (raw):

{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "ur8xi4C7S06n" }, "outputs": [], "source": [ "# Copyright 2021 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "tvgnzT1CKxrO" }, "source": [ "# Overview\n", "\n", "In this notebook, you will learn how to load, explore, visualize, and pre-process a time-series dataset. The output of this notebook is a processed dataset that will be used in following notebooks to build a machine learning model.\n", "\n", "### Dataset\n", "\n", "[CTA - Ridership - Daily Boarding Totals](https://data.cityofchicago.org/Transportation/CTA-Ridership-Daily-Boarding-Totals/6iiy-9s97): This dataset shows systemwide boardings for both bus and rail services provided by Chicago Transit Authority, dating back to 2001.\n", "\n", "### Objective\n", "\n", "The goal is to forecast future transit ridership in the City of Chicago, based on previous ridership." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "i7EUnXsZhAGF" }, "source": [ "## Install packages and dependencies" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Restarting the kernel may be required to use new packages." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "wyy5Lbnzg5fi" }, "outputs": [], "source": [ "%pip install -U statsmodels scikit-learn --user" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** To restart the Kernel, navigate to Kernel > Restart Kernel... on the Jupyter menu." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "XoEqT2Y4DJmf" }, "source": [ "### Import libraries and define constants" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "pRUOFELefqf1" }, "outputs": [], "source": [ "from pandas.plotting import register_matplotlib_converters\n", "from statsmodels.graphics.tsaplots import plot_acf\n", "from statsmodels.tsa.seasonal import seasonal_decompose\n", "from statsmodels.tsa.stattools import grangercausalitytests\n", "\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "import seaborn as sns" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "oM1iC_MfAts1" }, "outputs": [], "source": [ "# Enter your project and region. 
Then run the cell to make sure the\n", "# Cloud SDK uses the right project for all the commands in this notebook.\n", "\n", "PROJECT = 'your-project-name' # REPLACE WITH YOUR PROJECT NAME\n", "REGION = 'your-region' # REPLACE WITH YOUR LAB REGION\n", "\n", "# Don't change the following command; it checks that you changed the project name above.\n", "assert PROJECT != 'your-project-name', \"Don't forget to change the project variables!\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "target = 'total_rides' # The variable you are predicting\n", "target_description = 'Total Rides' # A description of the target variable\n", "features = {'day_type': 'Day Type'} # Weekday = W, Saturday = A, Sunday/Holiday = U\n", "ts_col = 'service_date' # The name of the column with the date field\n", "\n", "raw_data_file = 'https://data.cityofchicago.org/api/views/6iiy-9s97/rows.csv?accessType=DOWNLOAD'\n", "processed_file = 'cta_ridership.csv' # Which file to save the results to" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import CSV file\n", "\n", "df = pd.read_csv(raw_data_file, index_col=[ts_col], parse_dates=[ts_col])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Keep only data prior to 2020\n", "\n", "df = df[df.index < '2020-01-01']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Drop duplicates\n", "\n", "df = df.drop_duplicates()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sort by date\n", "\n", "df = df.sort_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Explore data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Print the first 5 rows\n", "\n", "df.head()" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TODO 1: Analyze the patterns\n", "\n", "* Is ridership changing much over time?\n", "* Is there a difference in ridership between weekdays and weekends?\n", "* Is the mix of bus vs. rail ridership changing over time?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Initialize plotting\n", "\n", "register_matplotlib_converters() # Addresses a warning\n", "sns.set(rc={'figure.figsize':(16,4)})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Explore total rides over time\n", "\n", "sns.lineplot(data=df, x=df.index, y=df[target]).set_title('Total Rides')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Explore rides by day type: Weekday (W), Saturday (A), Sunday/Holiday (U)\n", "\n", "sns.lineplot(data=df, x=df.index, y=df[target], hue=df['day_type']).set_title('Total Rides by Day Type')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Explore rides by transportation type\n", "\n", "sns.lineplot(data=df[['bus','rail_boardings']]).set_title('Total Rides by Transportation Type')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TODO 2: Review summary statistics\n", "\n", "* How many records are in the dataset?\n", "* What is the average number of rides per day?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[target].describe().apply(lambda x: round(x))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TODO 3: Explore seasonality\n", "\n", "* Is there much difference between months?\n", "* Can you extract the trend and seasonal pattern from the data?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Show the distribution of values for each day of the week in a boxplot:\n", "# 25th percentile, median, 75th percentile, whiskers (up to 1.5x IQR), and outliers\n", "\n", "daysofweek = df.index.to_series().dt.dayofweek # Monday = 0, Sunday = 6\n", "\n", "ax = sns.boxplot(x=daysofweek, y=df[target])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Show the distribution of values for each month in a boxplot\n", "\n", "months = df.index.to_series().dt.month\n", "\n", "ax = sns.boxplot(x=months, y=df[target])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Decompose the data into trend, seasonal, and residual components,\n", "# using a yearly period (365 daily observations)\n", "\n", "result = seasonal_decompose(df[target], period=365)\n", "fig = result.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Auto-correlation\n", "\n", "Next, create an auto-correlation plot to show how correlated the time series is with lagged copies of itself. Each point along the x-axis shows the correlation at a given lag. The shaded area indicates the confidence interval; points outside it are statistically significant.\n", "\n", "Note that the correlation gradually decreases as the lag increases, but reflects weekly seasonality (e.g. `t-7` and `t-14` stand out)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = plot_acf(df[target])\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Export data\n", "\n", "This will generate a CSV file, which you will use in the next labs of this quest.\n", "Inspect the CSV file to see what the data looks like." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[[target]].to_csv(processed_file, index=True, index_label=ts_col)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You've successfully completed the exploration and visualization lab.\n", "You've learned how to:\n", "* Load and clean a time-series dataset\n", "* Visualize data\n", "* Decompose time series into trend and seasonal components" ] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "ai_platform_notebooks_template.ipynb", "provenance": [], "toc_visible": true }, "environment": { "kernel": "python3", "name": "tf2-gpu.2-6.m82", "type": "gcloud", "uri": "gcr.io/deeplearning-platform-release/tf2-gpu.2-6:m82" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }