courses/ai-for-time-series/notebooks/02-model.ipynb

{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "ur8xi4C7S06n" }, "outputs": [], "source": [ "# Copyright 2020 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "tvgnzT1CKxrO" }, "source": [ "# Overview\n", "\n", "This notebook demonstrates how to sequence data for a time-series problem, and then how to build deep learning and statistical models.\n", "\n", "### Dataset\n", "\n", "[CTA - Ridership - Daily Boarding Totals](https://data.cityofchicago.org/Transportation/CTA-Ridership-Daily-Boarding-Totals/6iiy-9s97): This dataset shows systemwide boardings for both bus and rail services provided by Chicago Transit Authority, dating back to 2001.\n", "\n", "### Objective\n", "\n", "The goal is to forecast future transit ridership in the City of Chicago, based on previous ridership." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "i7EUnXsZhAGF" }, "source": [ "## Install packages and dependencies" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Restarting the kernel may be required to use new packages." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "wyy5Lbnzg5fi" }, "outputs": [], "source": [ "%pip install -U statsmodels scikit-learn --user" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** To restart the Kernel, navigate to Kernel > Restart Kernel... on the Jupyter menu." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "XoEqT2Y4DJmf" }, "source": [ "### Import libraries and define constants" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "pRUOFELefqf1" }, "outputs": [], "source": [ "import os\n", "\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns\n", "import warnings\n", "\n", "from google.cloud import storage\n", "from pandas.plotting import register_matplotlib_converters\n", "from sklearn.metrics import r2_score, mean_absolute_error, mean_absolute_percentage_error, mean_squared_error\n", "from sklearn.preprocessing import StandardScaler\n", "from statsmodels.tsa.holtwinters import ExponentialSmoothing\n", "from statsmodels.tools.sm_exceptions import ConvergenceWarning\n", "from tensorflow.keras import Sequential\n", "from tensorflow.keras.callbacks import EarlyStopping\n", "from tensorflow.keras.layers import Conv1D, Dense, Dropout, Flatten, LSTM, MaxPooling1D\n", "\n", "register_matplotlib_converters() # Address warning" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "oM1iC_MfAts1" }, "outputs": [], "source": [ "# Enter your project, region, and a bucket name. 
Then run the cell to make sure the\n", "# Cloud SDK uses the right project for all the commands in this notebook.\n", "\n", "PROJECT = 'your-project-name' # REPLACE WITH YOUR PROJECT ID\n", "BUCKET = 'your-bucket-name' # REPLACE WITH A UNIQUE BUCKET NAME e.g. your PROJECT NAME\n", "REGION = 'us-central1' # REPLACE WITH YOUR BUCKET REGION e.g. us-central1\n", "BUCKET_URI = 'gs://' + BUCKET\n", "\n", "# Don't change the following line - it checks that you have changed the project name above.\n", "assert PROJECT != 'your-project-name', 'Do not forget to change the project variables!'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Dataset parameters\n", "\n", "target_col = 'total_rides' # The variable you are predicting\n", "ts_col = 'service_date' # The name of the column with the date field" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Model parameters\n", "\n", "freq = 'D' # Daily frequency\n", "n_input_steps = 30 # Lookback window (number of past days used as input)\n", "n_output_steps = 7 # How many steps to predict forward\n", "n_seasons = 7 # Weekly periodicity (7 days)\n", "\n", "train_split = 0.8 # Fraction of the data used for training (the rest is the test set)\n", "epochs = 1000 # How many passes through the data (early stopping will usually end training before this)\n", "patience = 5 # Stop training if the validation loss does not improve for this many epochs" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "zgPO1eR3CYjk" }, "source": [ "### Create a Cloud Storage bucket\n", "\n", "**The following steps are required, regardless of your notebook environment.**\n", "\n", "When you submit a training job using the Cloud SDK, you upload a Python package\n", "containing your training code to a Cloud Storage bucket. AI Platform runs\n", "the code from this package. In this tutorial, AI Platform also saves the\n", "trained model that results from your job in the same bucket. You can then\n", "create an AI Platform model version based on this output in order to serve\n", "online predictions." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "both", "colab": {}, "colab_type": "code", "id": "MzGDU7TWdts_" }, "outputs": [], "source": [ "storage_client = storage.Client()\n", "try:\n", "    bucket = storage_client.get_bucket(BUCKET)\n", "    print('Bucket already exists, so it will not be recreated.')\n", "except Exception:\n", "    bucket = storage_client.create_bucket(BUCKET)\n", "    print('Created bucket: ' + BUCKET)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "processed_file = 'cta_ridership.csv' # Name of the processed data file created in the previous lab\n", "\n", "if os.path.exists(processed_file):\n", "    input_file = processed_file # File created in the previous lab\n", "else:\n", "    input_file = f'data/{processed_file}'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Read data\n", "\n", "df = pd.read_csv(input_file, index_col=ts_col, parse_dates=True)\n", "df.index.freq = freq\n", "\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define some characteristics of the data that will be used later\n", "n_features = len(df.columns)\n", "\n", "# Index of target column. 
Used later when creating dataframes.\n", "target_col_num = df.columns.get_loc(target_col)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Split data chronologically into train and test sets\n", "size = int(len(df) * train_split)\n", "df_train, df_test = df[0:size].copy(deep=True), df[size:len(df)].copy(deep=True)\n", "\n", "df_train.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TODO 1: Remove outliers\n", "\n", "* Sometimes, you can improve the accuracy of the model by removing outliers.\n", "* In this lab, you'll simply remove some extremely high values.\n", "* You can also apply techniques such as smoothing or resampling the frequency to reduce the impact of outliers." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_train" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Look at the highest peak. At what level could you set a threshold that would clip it off?\n", "\n", "_ = sns.lineplot(data=df_train[target_col])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Hint: here are the top 5 values\n", "\n", "df[target_col].sort_values(ascending=False).head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO: Update the threshold below to remove the outliers\n", "\n", "threshold = -1 # Set this just below the level where you are seeing peaks. Any values above it will be flagged.\n", "assert threshold != -1, 'Set the threshold to the minimum that will eliminate outlier(s)'\n", "\n", "# Set any values above the threshold to NaN (not a number)\n", "df_train.loc[df_train[target_col] > threshold, target_col] = np.nan\n", "\n", "# Interpolate the missing values (e.g. [3, NaN, 5] becomes [3, 4, 5])\n", "df_train = df_train.interpolate()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Review the updated chart to see if outliers still exist\n", "# NOTE: If you set the threshold too low, rerun the notebook from the train/test split cell to restore df_train, then try again with a higher threshold\n", "\n", "_ = sns.lineplot(data=df_train[target_col])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Utility functions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You do **not** need to review the `Utility functions` section in detail.\n", "\n", "Functions for scaling variables, creating time-series sequences, and so on are provided here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Scale the inputs and outputs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# For neural networks to converge more quickly, it is helpful to scale the values.\n", "# For example, each feature might be transformed to have a mean of 0 and std. dev. 
of 1.\n", "#\n", "# You are working with a mix of features, input timesteps, output horizon, etc.\n", "# which don't work out-of-the-box with common scaling utilities.\n", "# So, here are a couple wrappers to handle scaling and inverting the scaling.\n", "\n", "feature_scaler = StandardScaler()\n", "target_scaler = StandardScaler()\n", "\n", "def scale(df, \n", " fit=True, \n", " target_col=target_col,\n", " feature_scaler=feature_scaler,\n", " target_scaler=target_scaler):\n", " \"\"\"\n", " Scale the input features, using a separate scaler for the target.\n", " \n", " Parameters: \n", " df (pd.DataFrame): Input dataframe\n", " fit (bool): Whether to fit the scaler to the data (only apply to training data)\n", " target_col (pd.Series): The column that is being predicted\n", " feature_scaler (StandardScaler): Scaler used for features\n", " target_scaler (StandardScaler): Scaler used for target\n", " \n", " Returns: \n", " df_scaled (pd.DataFrame): Scaled dataframe \n", " \"\"\" \n", " \n", " target = df[target_col].values.reshape(-1, 1)\n", " if fit:\n", " target_scaler.fit(target)\n", " target_scaled = target_scaler.transform(target)\n", " \n", " # Select all columns other than target to be features\n", " features = df.loc[:, df.columns != target_col].values\n", " \n", " if features.shape[1]: # If there are any features\n", " if fit:\n", " feature_scaler.fit(features)\n", " features_scaled = feature_scaler.transform(features)\n", " \n", " # Combine target and features into one data frame\n", " df_scaled = pd.DataFrame(features_scaled)\n", " target_col_num = df.columns.get_loc(target_col)\n", " df_scaled.insert(target_col_num, target_col, target_scaled)\n", " df_scaled.columns = df.columns \n", " \n", " else: # If only target column (no additional features)\n", " df_scaled = pd.DataFrame(target_scaled, columns=df.columns)\n", " \n", " return df_scaled\n", "\n", "def inverse_scale(data, target_scaler=target_scaler):\n", " \"\"\"\n", " Transform the scaled values of the target back into their original form.\n", " The features are left alone, as we're assuming that the output of the model only includes the target.\n", " \n", " Parameters: \n", " data (np.array): Input array\n", " target_scaler (StandardScaler): Scaler used for target\n", " \n", " Returns: \n", " data_scaled (np.array): Scaled array \n", " \"\"\" \n", " \n", " df = pd.DataFrame()\n", " data_scaled = np.empty([data.shape[1], data.shape[0]])\n", " for i in range(data.shape[1]):\n", " data_scaled[i] = target_scaler.inverse_transform([data[:,i]])\n", " return data_scaled.transpose()\n", "\n", "df_train_scaled=scale(df_train)\n", "df_test_scaled=scale(df_test, False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create sequences of time-series data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def reframe(data, n_input_steps = n_input_steps, n_output_steps = n_output_steps, target_col = target_col):\n", "\n", " target_col_num = data.columns.get_loc(target_col) \n", " \n", " # Iterate through data and create sequences of features and outputs\n", " df = pd.DataFrame(data)\n", " cols=list()\n", " for i in range(n_input_steps, 0, -1):\n", " cols.append(df.shift(i))\n", " for i in range(0, n_output_steps):\n", " cols.append(df.shift(-i))\n", " \n", " # Concatenate values and remove any missing values\n", " df = pd.concat(cols, axis=1)\n", " df.dropna(inplace=True)\n", " \n", " # Split the data into feature and target variables\n", " n_feature_cols = n_input_steps * 
n_features\n", " features = df.iloc[:,0:n_feature_cols]\n", " target_cols = [i for i in range(n_feature_cols + target_col_num, n_feature_cols + n_output_steps * n_features, n_features)]\n", " targets = df.iloc[:,target_cols]\n", "\n", " return (features, targets)\n", "\n", "X_train_reframed, y_train_reframed = reframe(df_train_scaled)\n", "X_test_reframed, y_test_reframed = reframe(df_test_scaled)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluate results" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def print_stats(timestep, y_true, y_pred, target_col, chart=True, table=False, dec=3):\n", " '''\n", " Helper function to print overall summary statistics and stats for each time step\n", " '''\n", " \n", " # Print summary statistics\n", " print('=== t+' + str(timestep) + ' ===')\n", " print('R^2: ' + str(np.round(r2_score(y_true, y_pred), dec)))\n", " print('MAPE: ' + str(np.round(mean_absolute_percentage_error(y_true, y_pred), dec)))\n", " print('MAE: ' + str(np.round(mean_absolute_error(y_true, y_pred), dec)))\n", " print('')\n", "\n", " df_y_true = pd.DataFrame(y_true)\n", " df_y_true[target_col + '_pred'] = np.round(y_pred, dec)\n", " \n", " # Show plot of actuals vs predictions and a sample of values\n", " if table:\n", " print(str(df_y_true.head(5)) + '\\n')\n", " print(str(df_y_true.tail(5)) + '\\n')\n", " if chart:\n", " sns.lineplot(data=df_y_true[[target_col, target_col+'_pred']])\n", " plt.show()\n", " \n", "def evaluate(y_pred,\n", " exclude_timesteps=n_input_steps,\n", " y_true=df_test,\n", " target_col=target_col):\n", " '''\n", " Helper function to transform predictions to match size and indices of actuals.\n", " \n", " For example, n_timesteps from the test data will be required to make a prediction,\n", " so the number of predictions will be fewer than the number of test samples.\n", " \n", " Parameters:\n", " y_pred (np.array): Predictions\n", " exclude_timesteps (int): Number of leading timesteps to trim from the dataset\n", " y_true (pd.DataFrame): Actuals\n", " '''\n", " \n", " # Number of outputs (future timesteps)\n", " outputs = y_pred.shape[1]\n", " \n", " target_col_num = df.columns.get_loc(target_col)\n", " \n", " # Lists of actual and predicted values for each time step\n", " # For example, y_true_eval[2] will contain actual values 3 time steps out\n", " # These specific lists enable computing the accuracy for specific time steps\n", " y_true_eval, y_pred_eval = list(), list()\n", "\n", " # Actual and predicted values combined across all time steps (to compute overall accuracy metrics)\n", " y_true_all, y_pred_all = np.array([]), np.array([])\n", " \n", " # Append entries to lists for each output timestep\n", " for t in range(outputs):\n", " if exclude_timesteps:\n", " y_true_eval.append(y_true[exclude_timesteps+t:len(y_true)-outputs+t+1].copy())\n", " y_pred_eval.append(y_pred[:,t]) \n", " else:\n", " y_true_eval.append(y_true[t:].copy())\n", " y_pred_eval.append(y_pred[:-1*t-1,t])\n", " # Append the output values to the combined lists\n", " y_true_all = np.concatenate([y_true_all, y_true_eval[t].values[:,target_col_num]], axis=0)\n", " y_pred_all = np.concatenate([y_pred_all, y_pred_eval[t]], axis=0)\n", "\n", " # Print aggregate statistics across all time steps (only if predicting multiple time steps)\n", " if outputs > 1:\n", " print_stats('(1-' + str(outputs) + ')', y_true_all, y_pred_all, target_col, False)\n", "\n", " # Print stats for each future time step\n", " for t in 
range(outputs): \n", " print_stats(t+1, y_true_eval[t][target_col], y_pred_eval[t], target_col, True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ML Models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section, you will build models using popular neural network architectures for time-series data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Long Short Term Memory (LSTM)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Reshape test data to match model inputs and outputs\n", "\n", "X_train = X_train_reframed.values.reshape(-1, n_input_steps, n_features)\n", "X_test = X_test_reframed.values.reshape(-1, n_input_steps, n_features)\n", "y_train = y_train_reframed.values.reshape(-1, n_output_steps, 1)\n", "y_test = y_test_reframed.values.reshape(-1, n_output_steps, 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### TODO 2: Update the LSTM architecture\n", "\n", "Try increasing and decreasing the number of LSTM units and see if you notice any accuracy improvements.\n", "\n", "You can use hyper-parameter tuning to search for optimal values, but that's outside the scope of this lab." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Try increasing and decreasing the number of LSTM units and see if you notice any accuracy improvements.\n", "# Run the next cell to evaluate the results in more detail.\n", "\n", "model = Sequential([\n", " LSTM(64, input_shape=[n_input_steps, n_features]),\n", " Dense(n_output_steps)])\n", "\n", "model.compile(optimizer='adam', loss='mae')\n", "\n", "early_stopping = EarlyStopping(monitor='val_loss', patience=patience)\n", "_ = model.fit(x=X_train, y=y_train, validation_data=(X_test, y_test), epochs=epochs, callbacks=[early_stopping])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model.save('./lstm_export/')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Predict the results, and then reverse the transformation that scaled all values to a mean of 0 and std. dev. of 1\n", "preds = model.predict(X_test)\n", "y_pred_lstm = inverse_scale(preds)\n", "\n", "# Evaluate the overall results and for each time step\n", "evaluate(y_pred_lstm)\n", "\n", "# The plot will show the R^2 value (0 lowest -> 1 highest) and the MAE (mean absolute error) for the entire prediction window.\n", "# It will also show individual plots for 1 day out, 2 days out, etc. comparing the actual vs the predicted value." 
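, "\n", "\n", "# Extra (not part of the original lab): a minimal sketch of forecasting beyond the dataset.\n", "# It takes the most recent n_input_steps days of scaled data, predicts the next n_output_steps days,\n", "# and converts the predictions back to rider counts. It assumes the LSTM `model`, `df_test_scaled`,\n", "# and `target_scaler` objects defined above.\n", "last_window = df_test_scaled.values[-n_input_steps:].reshape(1, n_input_steps, n_features)\n", "future_scaled = model.predict(last_window)\n", "future_rides = target_scaler.inverse_transform(future_scaled).ravel()\n", "print('Forecast for the next', n_output_steps, 'days:', np.round(future_rides))"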
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Convolutional Neural Network (CNN)\n", "\n", "#### TODO 3: Update the CNN architecture\n", "\n", "Try adjusting the # of filters (pattern types) and kernel size (size of the sliding window)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from tensorflow.keras.layers import AveragePooling1D\n", "\n", "# TODO: Try adjusting the # of filters (pattern types) and kernel size (size of the sliding window)\n", "model = Sequential([\n", " Conv1D(filters=32, kernel_size=3, input_shape=[n_input_steps, n_features]),\n", " Flatten(),\n", " Dense(n_output_steps)])\n", "\n", "model.compile(optimizer='adam', loss='mae')\n", "\n", "early_stopping = EarlyStopping(monitor='val_loss', patience=5)\n", "_ = model.fit(x=X_train, y=y_train, validation_data=(X_test, y_test), epochs=epochs, callbacks=[early_stopping])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model.save('./cnn_export/')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "preds = model.predict(X_test)\n", "y_pred_cnn = inverse_scale(preds)\n", "\n", "evaluate(y_pred_cnn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Naïve Models\n", "\n", "So-called \"naïve models\" can be surprisingly hard to beat. These can serve as a useful benchmark for your model's performance." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Random Walk\n", "\n", "Assume that future value(s) will be the same as the most recent value." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from statsmodels.tsa.arima.model import ARIMA\n", "\n", "hist = df_train[target_col].copy() # Predict based on historical data. Start with the training data\n", "hist.index.freq = pd.infer_freq(hist.index) # To avoid warnings, explicitly specify the dataframe frequency\n", "n_pred = len(df_test) + 1 # Number of predictions: 1 on the training set; and then 1 for each additional \n", "y_pred_rw = np.empty([n_pred,n_output_steps]) # Create an array to hold predictions, with a number of predictions equal to the test set size, each containing the # of time steps you are predicting forward.\n", "\n", "for t in range(n_pred):\n", " mod = ARIMA(hist, order=(0, 1, 0))\n", " res = mod.fit()\n", " pred = res.forecast(n_output_steps)\n", " y_pred_rw[t] = pred.values\n", " if t < n_pred - 1:\n", " hist.loc[df_test.iloc[t].name] = df_test[target_col][t] # Append the latest test data row to the history, for fitting the next model\n", " hist.index.freq = pd.infer_freq(hist.index)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "evaluate(y_pred_rw, 0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Seasonal Naïve\n", "\n", "Similar to random walk, but instead of using the previous value, you'll use the value from the previous seasonal period. For example, if you're predicting July's forecast, you'll use last July's value, rather than June's value." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# You will use a walk-forward approach, in which a model is fit on all historical data available.\n", "# As you progress through the test set to evaluate the model, you will be creating new models for each row in the test set.\n", "# Each new model will be fit on not only the training data, but on prior test data.\n", "\n", "from statsmodels.tsa.statespace.sarimax import SARIMAX\n", "\n", "hist = df_train[target_col].copy() # Predict based on historical data. Start with the training data\n", "hist.index.freq = pd.infer_freq(hist.index) # To avoid warnings, explicitly specify the dataframe frequency\n", "n_pred = len(df_test) + 1 # Number of predictions: 1 on the training set; and then 1 for each additional \n", "y_pred_sn = np.empty([n_pred,n_output_steps]) # Create an array to hold predictions, with a number of predictions equal to the test set size, each containing the # of time steps you are predicting forward.\n", "\n", "for t in range(n_pred):\n", " mod = SARIMAX(hist, order=(0, 0, 0), seasonal_order=(0, 1, 0, n_seasons))\n", " res = mod.fit(disp=False)\n", " pred = res.forecast(n_output_steps)\n", " y_pred_sn[t] = pred.values\n", " if t < n_pred - 1:\n", " hist.loc[df_test.iloc[t].name] = df_test[target_col][t] # Append the latest test data row to the history, for fitting the next model\n", " hist.index.freq = pd.infer_freq(hist.index)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "evaluate(y_pred_sn, 0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Statistical Models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You will next implement a popular statistical method for time-series analysis, *exponential smoothing*. Exponential smoothing estimates future data by weighting recent observations more heavily. The [Holt-Winters exponential smoothing](https://en.wikipedia.org/wiki/Exponential_smoothing) method used here uses a \"triple\" exponential smoothing approach that also considers trend and seasonality.\n", "\n", "You can also ensemble classical and machine learning methods for a potentially even more accurate result." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exponential Smoothing" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with warnings.catch_warnings():\n", " warnings.simplefilter(\"ignore\", category=ConvergenceWarning)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# You will use a walk-forward approach, in which a model is fit on all historical data available.\n", "# As you progress through the test set to evaluate the model, you will be creating new models for each row in the test set.\n", "# Each new model will be fit on not only the training data, but on prior test data.\n", "\n", "hist = df_train[target_col].copy() # Predict based on historical data. 
 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exponential Smoothing" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Suppress convergence warnings from statsmodels for the remainder of the notebook\n", "warnings.simplefilter('ignore', category=ConvergenceWarning)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# You will use a walk-forward approach, in which a model is fit on all of the historical data available.\n", "# As you progress through the test set to evaluate the model, you will create a new model for each row in the test set.\n", "# Each new model is fit not only on the training data, but also on the prior test data.\n", "\n", "hist = df_train[target_col].copy() # Predict based on historical data. Start with the training data\n", "hist.index.freq = pd.infer_freq(hist.index) # To avoid warnings, explicitly specify the series frequency\n", "n_pred = len(df_test) + 1 # Number of predictions: 1 from the end of the training set, then 1 more for each test row appended to the history\n", "y_pred_es = np.empty([n_pred, n_output_steps]) # Array to hold the forecasts: one row per prediction, each with n_output_steps values\n", "\n", "for t in range(n_pred):\n", "    mod = ExponentialSmoothing(hist, seasonal_periods=n_seasons, trend='add', seasonal='add', damped_trend=True, use_boxcox=False, initialization_method='heuristic')\n", "    res = mod.fit(method='L-BFGS-B') # Use a different minimizer to avoid convergence warnings\n", "    pred = res.forecast(n_output_steps)\n", "    y_pred_es[t] = pred.values\n", "    if t < n_pred - 1:\n", "        hist.loc[df_test.iloc[t].name] = df_test[target_col].iloc[t] # Append the latest test data row to the history, for fitting the next model\n", "        hist.index.freq = pd.infer_freq(hist.index)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "evaluate(y_pred_es, 0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ensemble ML and Statistical Models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the performance of the ML and statistical models is similar, ensembling them can be helpful, because their approaches are quite different." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Start by adjusting the sizes of the prediction arrays to match.\n", "# Some methods predict the initial timesteps of the test set.\n", "# Others start after the first sequence length.\n", "# So, you will remove the test data that doesn't exist in both sets.\n", "\n", "def trunc(df, n_input_steps=n_input_steps, n_output_steps=n_output_steps):\n", "    return df[n_input_steps: -n_output_steps]\n", "\n", "y_pred_es_trunc = trunc(y_pred_es)\n", "y_true_trunc = trunc(df_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "models = [y_pred_lstm, y_pred_cnn, y_pred_es_trunc]\n", "weights = [2, 1, 1]\n", "\n", "y_pred_ensemble = np.average(np.array(models), axis=0, weights=weights)\n", "\n", "evaluate(y_pred_ensemble, 0, y_true_trunc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Great job! You've now completed the modeling portion of this workshop. You've covered:\n", "* Removing outliers from the data\n", "* Multi-step forecasting\n", "* Neural network architectures for time-series forecasting: LSTM and CNN\n", "* Statistical models, including Holt-Winters exponential smoothing\n", "* Ensembling models" ] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "ai_platform_notebooks_template.ipynb", "provenance": [], "toc_visible": true }, "environment": { "kernel": "python3", "name": "tf2-gpu.2-6.m82", "type": "gcloud", "uri": "gcr.io/deeplearning-platform-release/tf2-gpu.2-6:m82" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }