courses/ai-for-time-series/notebooks/01-explore.ipynb (439 lines of code) (raw):
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "ur8xi4C7S06n"
},
"outputs": [],
"source": [
"# Copyright 2021 Google LLC\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "tvgnzT1CKxrO"
},
"source": [
"# Overview\n",
"\n",
"In this notebook, you will learn how to load, explore, visualize, and pre-process a time-series dataset. The output of this notebook is a processed dataset that will be used in following notebooks to build a machine learning model.\n",
"\n",
"### Dataset\n",
"\n",
"[CTA - Ridership - Daily Boarding Totals](https://data.cityofchicago.org/Transportation/CTA-Ridership-Daily-Boarding-Totals/6iiy-9s97): This dataset shows systemwide boardings for both bus and rail services provided by Chicago Transit Authority, dating back to 2001.\n",
"\n",
"### Objective\n",
"\n",
"The goal is to forecast future transit ridership in the City of Chicago, based on previous ridership."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "i7EUnXsZhAGF"
},
"source": [
"## Install packages and dependencies"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Restarting the kernel may be required to use new packages."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "wyy5Lbnzg5fi"
},
"outputs": [],
"source": [
"%pip install -U statsmodels scikit-learn --user"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note:** To restart the Kernel, navigate to Kernel > Restart Kernel... on the Jupyter menu."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "XoEqT2Y4DJmf"
},
"source": [
"### Import libraries and define constants"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "pRUOFELefqf1"
},
"outputs": [],
"source": [
"from pandas.plotting import register_matplotlib_converters\n",
"from statsmodels.graphics.tsaplots import plot_acf\n",
"from statsmodels.tsa.seasonal import seasonal_decompose\n",
"from statsmodels.tsa.stattools import grangercausalitytests\n",
"\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"import seaborn as sns"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "oM1iC_MfAts1"
},
"outputs": [],
"source": [
"# Enter your project and region. Then run the cell to make sure the\n",
"# Cloud SDK uses the right project for all the commands in this notebook.\n",
"\n",
"PROJECT = 'your-project-name' # REPLACE WITH YOUR PROJECT NAME \n",
"REGION = 'your-region' # REPLACE WITH YOUR LAB REGION\n",
"\n",
"#Don't change the following command - this is to check if you have changed the project name above.\n",
"assert PROJECT != 'your-project-name', 'Don''t forget to change the project variables!'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"target = 'total_rides' # The variable you are predicting\n",
"target_description = 'Total Rides' # A description of the target variable\n",
"features = {'day_type': 'Day Type'} # Weekday = W, Saturday = A, Sunday/Holiday = U\n",
"ts_col = 'service_date' # The name of the column with the date field\n",
"\n",
"raw_data_file = 'https://data.cityofchicago.org/api/views/6iiy-9s97/rows.csv?accessType=DOWNLOAD'\n",
"processed_file = 'cta_ridership.csv' # Which file to save the results to"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import CSV file\n",
"\n",
"df = pd.read_csv(raw_data_file, index_col=[ts_col], parse_dates=[ts_col])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Model data prior to 2020 \n",
"\n",
"df = df[df.index < '2020-01-01']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Drop duplicates\n",
"\n",
"df = df.drop_duplicates()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sort by date\n",
"\n",
"df = df.sort_index()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Explore data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Print the top 5 rows\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### TODO 1: Analyze the patterns\n",
"\n",
"* Is ridership changing much over time?\n",
"* Is there a difference in ridership between the weekday and weekends?\n",
"* Is the mix of bus vs rail ridership changing over time?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Initialize plotting\n",
"\n",
"register_matplotlib_converters() # Addresses a warning\n",
"sns.set(rc={'figure.figsize':(16,4)})"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Explore total rides over time\n",
"\n",
"sns.lineplot(data=df, x=df.index, y=df[target]).set_title('Total Rides')\n",
"fig = plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Explore rides by day type: Weekday (W), Saturday (A), Sunday/Holiday (U)\n",
"\n",
"sns.lineplot(data=df, x=df.index, y=df[target], hue=df['day_type']).set_title('Total Rides by Day Type')\n",
"fig = plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Explore rides by transportation type\n",
"\n",
"sns.lineplot(data=df[['bus','rail_boardings']]).set_title('Total Rides by Transportation Type')\n",
"fig = plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### TODO 2: Review summary statistics\n",
"\n",
"* How many records are in the dataset?\n",
"* What is the average # of riders per day?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df[target].describe().apply(lambda x: round(x))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### TODO 3: Explore seasonality\n",
"\n",
"* Is there much difference between months?\n",
"* Can you extract the trend and seasonal pattern from the data?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Show the distribution of values for each day of the week in a boxplot:\n",
"# Min, 25th percentile, median, 75th percentile, max \n",
"\n",
"daysofweek = df.index.to_series().dt.dayofweek\n",
"\n",
"fig = sns.boxplot(x=daysofweek, y=df[target])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Show the distribution of values for each month in a boxplot:\n",
"\n",
"months = df.index.to_series().dt.month\n",
"\n",
"fig = sns.boxplot(x=months, y=df[target])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Decompose the data into trend and seasonal components\n",
"\n",
"result = seasonal_decompose(df[target], period=365)\n",
"fig = result.plot()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Auto-correlation\n",
"\n",
"Next, we will create an auto-correlation plot, to show how correlated a time-series is with itself. Each point on the x-axis indicates the correlation at a given lag. The shaded area indicates the confidence interval.\n",
"\n",
"Note that the correlation gradually decreases over time, but reflects weekly seasonality (e.g. `t-7` and `t-14` stand out)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plot_acf(df[target])\n",
"\n",
"fig = plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Export data\n",
"\n",
"This will generate a CSV file, which you will use in the next labs of this quest.\n",
"Inspect the CSV file to see what the data looks like."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df[[target]].to_csv(processed_file, index=True, index_label=ts_col)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You've successfully completed the exploration and visualization lab.\n",
"You've learned how to:\n",
"* Create a query that groups data into a time series\n",
"* Visualize data\n",
"* Decompose time series into trend and seasonal components"
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [],
"name": "ai_platform_notebooks_template.ipynb",
"provenance": [],
"toc_visible": true
},
"environment": {
"kernel": "python3",
"name": "tf2-gpu.2-6.m82",
"type": "gcloud",
"uri": "gcr.io/deeplearning-platform-release/tf2-gpu.2-6:m82"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}