sdk/python/jobs/automl-standalone-jobs/automl-forecasting-forecast-function/auto-ml-forecasting-function-gap-training.ipynb

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "This notebook demonstrates the full interface of the `forecast()` function.\n", "\n", "The best-known and most frequent usage of `forecast` enables forecasting on test sets that immediately follow the training data.\n", "\n", "However, in many use cases it is necessary to continue using the model for some time before retraining it. This happens especially in **high-frequency forecasting**, when forecasts need to be made more frequently than the model can be retrained. Examples include the Internet of Things and predictive cloud resource scaling.\n", "\n", "Here we show how to use the `forecast()` function when a time gap exists between the training data and the prediction period.\n", "\n", "Terminology:\n", "* forecast origin: the last period for which the target value is known\n", "* forecast period(s): the period(s) for which the value of the target is desired\n", "* lookback: how many past periods (before the forecast origin) the model depends on; the larger of the maximum lag and the length of the rolling window\n", "* prediction context: the `lookback` periods immediately preceding the forecast origin" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Please make sure you have followed the [configuration notebook](https://github.com/Azure/MachineLearningNotebooks/blob/master/configuration.ipynb) so that your ML workspace information is saved in the config file."
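To make the terminology concrete, the following is a minimal sketch in plain pandas (the lag and rolling-window values are illustrative assumptions for this sketch, not settings read from the notebook) of how the lookback and the prediction context relate to the forecast origin:

```python
import pandas as pd

# Illustrative settings (assumptions for this sketch, not notebook values).
target_lags = [1, 2, 3]  # model uses lags t-1, t-2, t-3
rolling_window_size = 6  # rolling aggregates over the last 6 periods

# The lookback is the larger of the maximum lag and the rolling window length.
lookback = max(max(target_lags), rolling_window_size)

# The prediction context is the `lookback` periods ending at the forecast
# origin (the last period with a known target), here with hourly frequency.
forecast_origin = pd.Timestamp("2021-01-02 00:00")
context_index = pd.date_range(end=forecast_origin, periods=lookback, freq="H")
print(lookback, context_index[0], context_index[-1])
```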
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [] }, "outputs": [], "source": [ "import os\n", "import pandas as pd\n", "import numpy as np\n", "import logging\n", "import warnings\n", "\n", "# Import required libraries\n", "from azure.identity import DefaultAzureCredential\n", "from azure.ai.ml import MLClient\n", "\n", "from azure.ai.ml.constants import AssetTypes, InputOutputModes\n", "from azure.ai.ml import automl\n", "from azure.ai.ml import Input\n", "\n", "# Squash warning messages for cleaner output in the notebook\n", "warnings.showwarning = lambda *args, **kwargs: None\n", "\n", "np.set_printoptions(precision=4, suppress=True, linewidth=120)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "tags": [] }, "outputs": [], "source": [ "credential = DefaultAzureCredential()\n", "ml_client = None\n", "try:\n", "    ml_client = MLClient.from_config(credential)\n", "except Exception as ex:\n", "    print(ex)\n", "    # Fall back to manually supplied workspace details if no config file is found.\n", "    subscription_id = \"<SUBSCRIPTION_ID>\"\n", "    resource_group = \"<RESOURCE_GROUP>\"\n", "    workspace = \"<AML_WORKSPACE_NAME>\"\n", "    ml_client = MLClient(credential, subscription_id, resource_group, workspace)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "workspace = ml_client.workspaces.get(name=ml_client.workspace_name)\n", "\n", "output = {}\n", "output[\"Workspace\"] = ml_client.workspace_name\n", "output[\"Subscription ID\"] = ml_client.subscription_id\n", "output[\"Resource Group\"] = workspace.resource_group\n", "output[\"Location\"] = workspace.location\n", "pd.set_option(\"display.max_colwidth\", None)\n", "outputDf = pd.DataFrame(data=output, index=[\"\"])\n", "outputDf.T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data\n", "For demonstration purposes, we will generate the data artificially and use it for forecasting."
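The notebook's `helper.get_timeseries` is not shown here; the following is a minimal, self-contained sketch of what such a generator might look like (the trend-plus-noise model, start date, and per-series offset are assumptions, only the column names match the notebook's constants):

```python
import numpy as np
import pandas as pd

# A minimal sketch in the spirit of the notebook's `helper.get_timeseries`;
# the data-generating process is an assumption, not the helper's actual code.
def make_timeseries(train_len, test_len, n_series, freq="H", seed=42):
    rng = np.random.default_rng(seed)
    total = train_len + test_len
    trains, tests = [], []
    for sid in range(n_series):
        dates = pd.date_range("2000-01-01", periods=total, freq=freq)
        # Linear trend with a per-series offset plus Gaussian noise.
        y = 10 * sid + 0.5 * np.arange(total) + rng.normal(0, 1, total)
        df = pd.DataFrame({"date": dates, "time_series_id": f"ts{sid}", "y": y})
        trains.append(df.iloc[:train_len])
        tests.append(df.iloc[train_len:])
    return pd.concat(trains, ignore_index=True), pd.concat(tests, ignore_index=True)

train_df, test_df = make_timeseries(train_len=30, test_len=6, n_series=2)
print(train_df.shape, test_df.shape)  # (60, 3) (12, 3)
```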
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "tags": [] }, "outputs": [], "source": [ "TIME_COLUMN_NAME = \"date\"\n", "TIME_SERIES_ID_COLUMN_NAME = \"time_series_id\"\n", "TARGET_COLUMN_NAME = \"y\"\n", "lags = [1, 2, 3]\n", "forecast_horizon = 6" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Synthetically generate the data to train the model\n", "n_train_periods = 30\n", "n_test_periods = forecast_horizon\n", "\n", "from helper import get_timeseries\n", "\n", "X_train, y_train, X_test, y_test = get_timeseries(\n", " train_len=n_train_periods,\n", " test_len=n_test_periods,\n", " time_column_name=TIME_COLUMN_NAME,\n", " target_column_name=TARGET_COLUMN_NAME,\n", " time_series_id_column_name=TIME_SERIES_ID_COLUMN_NAME,\n", " time_series_number=2,\n", ")\n", "print(X_train.shape, \" \", X_test.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see what the training data looks like." 
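The gap scenario this notebook targets can be previewed with plain pandas: when forecasting away from the training data, the model is given a frame whose first `lookback` rows carry known target values (the prediction context) and whose remaining rows carry `NaN` targets for the periods to predict. This is only a shape illustration under assumed values, not the actual `forecast()` API call:

```python
import numpy as np
import pandas as pd

# Shape illustration only: a prediction frame for one series with a context of
# known targets followed by NaN targets for the horizon to be forecast.
lookback, horizon = 3, 6
dates = pd.date_range("2000-01-10 00:00", periods=lookback + horizon, freq="H")
y_context = np.random.default_rng(0).normal(size=lookback)  # known values
y = np.concatenate([y_context, np.full(horizon, np.nan)])  # NaN = to predict
query = pd.DataFrame({"date": dates, "time_series_id": "ts0", "y": y})
print(query["y"].isna().sum())  # 6 periods to forecast
```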
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Plot the example time series\n", "import matplotlib.pyplot as plt\n", "\n", "whole_data = X_train.copy()\n", "target_label = \"y\"\n", "whole_data[target_label] = y_train\n", "plt.figure(figsize=(10, 6))\n", "for g in whole_data.groupby(\"time_series_id\"):\n", " plt.plot(g[1][\"date\"].values, g[1][\"y\"].values, label=g[0])\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us look at the train and test portions of the synthetic data." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# Create a copy of the X_train and X_test DataFrames and add the corresponding target values\n", "df_train = X_train.copy()\n", "df_train[TARGET_COLUMN_NAME] = y_train\n", "df_test = X_test.copy()\n", "df_test[TARGET_COLUMN_NAME] = y_test" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# For visualization of the time series\n", "df_train[\"data_type\"] = \"Training\" # Add a column to label training data\n", "df_test[\"data_type\"] = \"Testing\" # Add a column to label testing data\n", "\n", "# Concatenate the training and testing DataFrames\n", "df_plot = pd.concat([df_train, df_test])\n", "\n", "# Create a figure and axis\n", "plt.figure(figsize=(10, 6))\n", "ax = plt.gca() # Get current axis\n", "\n", "# Group by both 'data_type' and 'time_series_id'\n", "for (data_type, time_series_id), df in df_plot.groupby([\"data_type\", \"time_series_id\"]):\n", " df.plot(\n", " x=\"date\",\n", " y=TARGET_COLUMN_NAME,\n", " label=f\"{data_type} - {time_series_id}\",\n", " ax=ax,\n", " legend=False,\n", " )\n", "\n", "# Customize the plot\n", "plt.xlabel(\"Date\")\n", "plt.ylabel(\"Value\")\n", "plt.title(\"Train and Test Data\")\n", "\n", "# Manually create the legend after plotting\n", "plt.legend(title=\"Data Type and Time Series ID\")\n",
"plt.show()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "tags": [] }, "outputs": [], "source": [ "import mltable\n", "import os\n", "\n", "\n", "def create_ml_table(data_frame, file_name, output_folder):\n", " os.makedirs(output_folder, exist_ok=True)\n", " data_path = os.path.join(output_folder, file_name)\n", " data_frame.to_parquet(data_path, index=False)\n", " paths = [{\"file\": data_path}]\n", " ml_table = mltable.from_parquet_files(paths)\n", " ml_table.save(output_folder)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "tags": [] }, "outputs": [], "source": [ "os.makedirs(\"data\", exist_ok=True)\n", "create_ml_table(\n", " df_train,\n", " \"df_train.parquet\",\n", " \"./data/training-mltable-folder\",\n", ")\n", "\n", "# Training MLTable defined locally, with local data to be uploaded\n", "my_training_data_input = Input(\n", " type=AssetTypes.MLTABLE, path=\"./data/training-mltable-folder\"\n", ")\n", "\n", "# Test data\n", "os.makedirs(\"data\", exist_ok=True)\n", "create_ml_table(\n", " X_test, # df_test,\n", " \"X_test.parquet\",\n", " \"./data/testing-mltable-folder\",\n", ")\n", "\n", "create_ml_table(\n", " df_test,\n", " \"df_test.parquet\",\n", " \"./data/testing-mltable-folder\",\n", ")\n", "\n", "my_test_data_input = Input(\n", " type=AssetTypes.URI_FOLDER,\n", " path=\"./data/testing-mltable-folder\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Compute" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "from azure.core.exceptions import ResourceNotFoundError\n", "from azure.ai.ml.entities import AmlCompute\n", "\n", "cluster_name = \"forecast-function\"\n", "\n", "try:\n", " # Retrieve an already attached Azure Machine Learning Compute.\n", " compute = ml_client.compute.get(cluster_name)\n", " print(\"Found existing cluster, using it.\")\n", "except ResourceNotFoundError:\n",
" compute = AmlCompute(\n", " name=cluster_name,\n", " size=\"STANDARD_DS12_V2\",\n", " type=\"amlcompute\",\n", " min_instances=0,\n", " max_instances=4,\n", " idle_time_before_scale_down=120,\n", " )\n", " poller = ml_client.begin_create_or_update(compute)\n", " poller.wait()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "TARGET_COLUMN_NAME, TIME_COLUMN_NAME, TIME_SERIES_ID_COLUMN_NAME" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "tags": [] }, "outputs": [], "source": [ "# General job parameters\n", "timeout_minutes = 15\n", "trial_timeout_minutes = 5\n", "exp_name = \"forecast-function-exp-no-target-rolling\"\n", "\n", "# Create the AutoML forecasting job with the related factory function.\n", "forecasting_job = automl.forecasting(\n", " compute=cluster_name,\n", " experiment_name=exp_name,\n", " training_data=my_training_data_input,\n", " target_column_name=TARGET_COLUMN_NAME,\n", " primary_metric=\"NormalizedRootMeanSquaredError\",\n", " n_cross_validations=3,\n", ")\n", "\n", "# Limits are all optional\n", "forecasting_job.set_limits(\n", " timeout_minutes=timeout_minutes,\n", " trial_timeout_minutes=trial_timeout_minutes,\n", " enable_early_termination=True,\n", ")\n", "\n", "# Specialized properties for Time Series Forecasting training\n", "forecasting_job.set_forecast_settings(\n", " time_column_name=TIME_COLUMN_NAME,\n", " forecast_horizon=forecast_horizon,\n", " # target_rolling_window_size=forecast_horizon,\n", " time_series_id_column_names=[TIME_SERIES_ID_COLUMN_NAME],\n", " target_lags=lags,\n", " frequency=\"H\",\n", " cv_step_size=3,\n", ")\n", "\n", "# Training properties are optional\n", "forecasting_job.set_training(blocked_training_algorithms=[\"ExtremeRandomTrees\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Submit
training job\n", "returned_job = ml_client.jobs.create_or_update(\n", " forecasting_job\n", ") # submit the job to the backend\n", "\n", "print(f\"Created job: {returned_job}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Wait until AutoML training runs are finished\n", "ml_client.jobs.stream(returned_job.name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Retrieve the Best Trial" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "import mlflow\n", "\n", "MLFLOW_TRACKING_URI = ml_client.workspaces.get(\n", " name=ml_client.workspace_name\n", ").mlflow_tracking_uri\n", "print(MLFLOW_TRACKING_URI)\n", "\n", "# Set the MLFLOW TRACKING URI\n", "mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)\n", "print(\"\\nCurrent tracking uri: {}\".format(mlflow.get_tracking_uri()))" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "tags": [] }, "outputs": [], "source": [ "from mlflow.tracking.client import MlflowClient\n", "\n", "# Initialize MLFlow client\n", "mlflow_client = MlflowClient()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "job_name = returned_job.name\n", "\n", "# Example of providing a specific job name/ID instead:\n", "# job_name = \"yellow_camera_1n84g0vcwp\"\n", "\n", "# Get the parent run\n", "mlflow_parent_run = mlflow_client.get_run(job_name)\n", "\n", "print(\"Parent Run: \")\n", "print(mlflow_parent_run)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Get the best model's child run\n", "best_child_run_id = mlflow_parent_run.data.tags[\"automl_best_child_run_id\"]\n", "print(\"Found best child run id: \", best_child_run_id)\n", "\n", "best_run = mlflow_client.get_run(best_child_run_id)\n", "\n", "print(\"Best child run: \")\n", "print(best_run)" ] }, { "cell_type": "code",
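AutoML records the winning trial in the parent run's `automl_best_child_run_id` tag; conceptually this is equivalent to picking the child run with the best primary metric. A minimal sketch with stand-in run records (only the tag's purpose and the metric name come from the notebook; the run IDs and values are invented for illustration):

```python
# Stand-in child-run records; in the notebook these come from MLflow.
child_runs = {
    "run_1": {"normalized_root_mean_squared_error": 0.12},
    "run_2": {"normalized_root_mean_squared_error": 0.08},
    "run_3": {"normalized_root_mean_squared_error": 0.15},
}

# Lower is better for NormalizedRootMeanSquaredError, so take the minimum.
best_child_run_id = min(
    child_runs, key=lambda r: child_runs[r]["normalized_root_mean_squared_error"]
)
print(best_child_run_id)  # run_2
```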
"execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Print parent run tags. 'automl_best_child_run_id' tag should be there.\n", "print(mlflow_parent_run.data.tags)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "pd.DataFrame(best_run.data.metrics, index=[0]).T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The metrics table above summarizes the best trial. Next, we download the best run's artifacts, including the trained model, for local inspection." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Artifact Download" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Create local folder\n", "import os\n", "\n", "local_dir = \"./artifact_downloads\"\n", "if not os.path.exists(local_dir):\n", " os.mkdir(local_dir)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Download run's artifacts/outputs\n", "local_path = mlflow_client.download_artifacts(\n", " best_run.info.run_id, \"outputs\", local_dir\n", ")\n", "print(\"Artifacts downloaded in: {}\".format(local_path))\n", "print(\"Artifacts: {}\".format(os.listdir(local_path)))" ] } ], "metadata": { "authors": [ { "name": "jialiu" } ], "category": "tutorial", "compute": [ "Remote" ], "datasets": [ "None" ], "deployment": [ "None" ], "exclude_from_index": false, "framework": [ "Azure ML AutoML" ], "friendly_name": "Forecasting away from training data", "index_order": 3, "kernelspec": { "display_name": "sdkv2-test1", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.20" }, "microsoft": { "ms_spell_check": { "ms_spell_check_language":
"en" } }, "nteract": { "version": "nteract-front-end@1.0.0" }, "tags": [ "Forecasting", "Confidence Intervals" ], "task": "Forecasting" }, "nbformat": 4, "nbformat_minor": 4 }