sdk/python/jobs/automl-standalone-jobs/automl-forecasting-forecast-function/auto-ml-forecasting-function-gap-local-inference.ipynb

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "This notebook demonstrates the full interface of the `forecast()` function. \n", "\n", "The best known and most frequent usage of `forecast` enables forecasting on test sets that immediately follows training data. \n", "\n", "However, in many use cases it is necessary to continue using the model for some time before retraining it. This happens especially in **high frequency forecasting** when forecasts need to be made more frequently than the model can be retrained. Examples are in Internet of Things and predictive cloud resource scaling.\n", "\n", "Here we show how to use the `forecast()` function when a time gap exists between training data and prediction period.\n", "\n", "Terminology:\n", "* forecast origin: the last period when the target value is known\n", "* forecast periods(s): the period(s) for which the value of the target is desired.\n", "* lookback: how many past periods (before forecast origin) the model function depends on. The larger of number of lags and length of rolling window.\n", "* prediction context: `lookback` periods immediately preceding the forecast origin" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. **Model** \n", " We will need the MLflow model, which is downloaded at the end of the training notebook. Follow any training notebook to get the model. The MLflow model is usually downloaded to the folder: `./artifact_downloads/outputs/mlflow-model`.\n", "\n", "2. **Environment** \n", " We will need the environment to load the model. Please run the following commands to create the environment (the conda file is usually downloaded to: `./artifact_downloads/outputs/mlflow-model/conda.yaml`):\n", " - `conda env create --file <path_to_conda_yaml>`\n", " - `conda activate project_environment`\n", "\n", "3. **Register environment as kernel** \n", " - Please run the following command to register the environment as a kernel: \n", " ```bash\n", " python -m ipykernel install --user --name project_environment --display-name \"model-inference\"\n", " ```\n", " - Refresh the kernel and then select the newly created kernel named `model-inference` from the kernel dropdown.\n", " \n", " Now we are good to run this notebook in the newly created kernel.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "TIME_COLUMN_NAME = \"date\"\n", "TIME_SERIES_ID_COLUMN_NAME = \"time_series_id\"\n", "TARGET_COLUMN_NAME = \"y\"\n", "lags = [1, 2, 3]\n", "forecast_horizon = 6" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Local inferencing from model pickle\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Please ensure that the training artifacts are downloaded. 
{ "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "import mlflow.pyfunc\n", "import mlflow.sklearn\n", "import pandas as pd\n", "\n", "fitted_model = mlflow.sklearn.load_model(mlflow_dir)\n", "df_train = pd.read_parquet(\n", " \"./data/training-mltable-folder/df_train.parquet\"\n", ") # We stored the training and test data during training\n", "df_test = pd.read_parquet(\"./data/testing-mltable-folder/df_test.parquet\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "df_train[df_train[\"time_series_id\"] == \"ts1\"].tail(2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "df_test[df_test[\"time_series_id\"] == \"ts1\"].head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Forecasting from the trained model\n", "\n", "In this section we review the forecast interface for two main scenarios: forecasting right after the training data, and the more complex interface for forecasting when there is a gap (in the time sense) between the training and testing data.\n", "\n", "## X_train is directly followed by X_test\n", "Let's first consider the case when the prediction period immediately follows the training data. This is typical in scenarios where we have time to retrain the model every time we wish to forecast. Forecasts made at a daily or slower cadence typically fall into this category. Retraining the model each time benefits accuracy because the most recent data is often the most informative.\n", "\n", "\n", "<img src=\"./images/forecast_function_at_train.png\" alt=\"Description\" width=\"50%\">\n", "\n", "We use X_test as a forecast request to generate the predictions." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "X_test = df_test.copy()\n", "y_test = X_test.pop(TARGET_COLUMN_NAME).values.astype(float)\n", "\n", "y_pred_no_gap, xy_nogap = fitted_model.forecast(X_test)" ] },
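{ "cell_type": "markdown", "metadata": {}, "source": [ "As an optional check (an addition for this walkthrough, assuming `y_test` holds the actual target values for the test period), we can compare the point forecasts against the actuals:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "import numpy as np\n", "\n", "# Optional check, not part of the original workflow: the mean absolute error\n", "# of the no-gap forecast against the known actuals.\n", "mae_no_gap = np.mean(np.abs(y_test - y_pred_no_gap))\n", "print(f\"MAE on the test set: {mae_no_gap:.4f}\")" ] },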
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Specify which quantiles you would like\n", "fitted_model.quantiles = [0.01, 0.5, 0.95]\n", "\n", "# use forecast_quantiles function, not the forecast() one\n", "y_pred_quantiles = fitted_model.forecast_quantiles(X_test)\n", "\n", "# quantile forecasts returned in a Dataframe along with the time and time series id columns\n", "y_pred_quantiles" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Forecasting away from training data\n", "Suppose we trained a model, some time passed, and now we want to apply the model without re-training. If the model \"looks back\" -- uses previous values of the target -- then we somehow need to provide those values to the model.\n", "\n", "<img src=\"./images/forecast_function_away_from_train.png\" alt=\"Description\" width=\"50%\">\n", "\n", "The notion of forecast origin comes into play: **the forecast origin is the last period for which we have seen the target value.** This applies per time-series, so each time-series can have a different forecast origin.\n", "\n", "The part of data before the forecast origin is the **prediction context**. To provide the context values the model needs when it looks back, we pass definite values in y_test (aligned with corresponding times in X_test)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# generate the same kind of test data we trained on, but now make the train set much longer, so that the test set will be in the future\n", "from helper import get_timeseries, make_forecasting_query\n", "\n", "X_context, y_context, X_away, y_away = get_timeseries(\n", " train_len=42, # train data was 30 steps long\n", " test_len=4,\n", " time_column_name=TIME_COLUMN_NAME,\n", " target_column_name=TARGET_COLUMN_NAME,\n", " time_series_id_column_name=TIME_SERIES_ID_COLUMN_NAME,\n", " time_series_number=2,\n", ")\n", "\n", "print(\"End of the data we trained on:\")\n", "print(df_train.groupby(TIME_SERIES_ID_COLUMN_NAME)[TIME_COLUMN_NAME].max())\n", "\n", "print(\"\\nStart of the data we want to predict on:\")\n", "print(X_away.groupby(TIME_SERIES_ID_COLUMN_NAME)[TIME_COLUMN_NAME].min())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is a gap of 12 hours between end of training and beginning of X_away. (It looks like 13 because all timestamps point to the start of the one hour periods.) Using only X_away will fail without adding context data for the model to consume" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "try:\n", " y_pred_away, xy_away = fitted_model.forecast(X_away)\n", " xy_away\n", "except Exception as e:\n", " print(e)" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "How should we read that eror message? The forecast origin is at the last time the model saw an actual value of y (the target). That was at the end of the training data! The model is attempting to forecast from the end of training data. But the requested forecast periods are past the forecast horizon. We need to provide a define y value to establish the forecast origin.\n", "\n", "We will use the helper function to take the required amount of context from the data preceding the testing data. It's definition is intentionally simplified to keep the idea in the clear." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see where the context data ends - it ends, by construction, just before the testing data starts." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "print(\n", " X_context.groupby(TIME_SERIES_ID_COLUMN_NAME)[TIME_COLUMN_NAME].agg(\n", " [\"min\", \"max\", \"count\"]\n", " )\n", ")\n", "print(\n", " X_away.groupby(TIME_SERIES_ID_COLUMN_NAME)[TIME_COLUMN_NAME].agg(\n", " [\"min\", \"max\", \"count\"]\n", " )\n", ")\n", "X_context.tail(5)" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "How should we read that eror message? The forecast origin is at the last time the model saw an actual value of y (the target). That was at the end of the training data! The model is attempting to forecast from the end of training data. But the requested forecast periods are past the forecast horizon. We need to provide a define y value to establish the forecast origin.\n", "\n", "We will use this helper function to take the required amount of context from the data preceding the testing data. It's definition is intentionally simplified to keep the idea in the clear." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Since the length of the lookback is 3, we need to add 3 periods from the context to the request so that the model has the data it needs\n", "\n", "# Put the X and y back together for a while. They like each other and it makes them happy.\n", "X_context[TARGET_COLUMN_NAME] = y_context\n", "X_away[TARGET_COLUMN_NAME] = y_away\n", "fulldata = pd.concat([X_context, X_away])\n", "\n", "# Forecast origin is the last point of data, which is one 1-hr period before test\n", "forecast_origin = X_away[TIME_COLUMN_NAME].min() - pd.DateOffset(hours=1)\n", "# it is indeed the last point of the context\n", "assert forecast_origin == X_context[TIME_COLUMN_NAME].max()\n", "print(\"Forecast origin: \" + str(forecast_origin))\n", "\n", "# The model uses lags and rolling windows to look back in time\n", "n_lookback_periods = max(\n", " lags\n", ") # n_lookback_periods = max(max(lags), forecast_horizon) # If target_rolling_window_size is used\n", "lookback = pd.DateOffset(hours=n_lookback_periods)\n", "horizon = pd.DateOffset(hours=forecast_horizon)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# now make the forecast query from context (refer to figure)\n", "X_pred, y_pred = make_forecasting_query(\n", " fulldata, TIME_COLUMN_NAME, TARGET_COLUMN_NAME, forecast_origin, horizon, lookback\n", ")\n", "\n", "# show the forecast request aligned\n", "X_show = X_pred.copy()\n", "X_show[TARGET_COLUMN_NAME] = y_pred\n", "X_show[X_show[\"time_series_id\"] == \"ts0\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "X_pred[\n", " \"data_type\"\n", "] = \"unknown\" # Our trining had an additional column called data_type, hence, adding it" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Now everything should work\n", "y_pred_away, xy_away = fitted_model.forecast(X_pred, y_pred)\n", "\n", "# show the forecast aligned without the generated features\n", "X_show = xy_away.reset_index()\n", "X_show[\n", " [\"date\", \"time_series_id\", \"ext_predictor\", \"_automl_target_col\"]\n", "] # prediction is in _automl_target_col" ] }, { "cell_type": 
"code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### Let us look at the tail of training data and the head of the test data for one grain" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "df_train[df_train[\"time_series_id\"] == \"ts1\"].tail(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If there is a gap between the train and the test data, and the test data uses lags/ rolling forecasts, we need to append the context data such that the test data has access to the lags\n", "In the above case, train_data ends at 2000-01-02 05:00:00" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "X_show[X_show[\"time_series_id\"] == \"ts1\"][\n", " [\"date\", \"time_series_id\", \"ext_predictor\", \"_automl_target_col\"]\n", "]" ] } ], "metadata": { "authors": [ { "name": "jialiu" } ], "category": "tutorial", "compute": [ "Remote" ], "datasets": [ "None" ], "deployment": [ "None" ], "exclude_from_index": false, "framework": [ "Azure ML AutoML" ], "friendly_name": "Forecasting away from training data", "index_order": 3, "kernelspec": { "display_name": "model-inference", "language": "python", "name": "project_environment" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.19" }, "microsoft": { "ms_spell_check": { "ms_spell_check_language": "en" } }, "nteract": { "version": "nteract-front-end@1.0.0" }, "tags": [ "Forecasting", "Confidence Intervals" ], "task": "Forecasting", "vscode": { "interpreter": { "hash": "6bd77c88278e012ef31757c15997a7bea8c943977c43d6909403c00ae11d43ca" } } }, "nbformat": 4, "nbformat_minor": 4 }