## Introduction
This notebook demonstrates the full interface of the `forecast()` function. 

The best-known and most frequent use of `forecast()` is forecasting on a test set that immediately follows the training data.

However, in many use cases the model must stay in use for some time before it is retrained. This happens especially in **high frequency forecasting**, where forecasts are needed more often than the model can be retrained. Examples include Internet of Things and predictive cloud resource scaling.

Here we show how to use the `forecast()` function when a time gap exists between the training data and the prediction period.

Terminology:
* forecast origin: the last period when the target value is known
* forecast period(s): the period(s) for which the value of the target is desired.
* lookback: how many past periods (before the forecast origin) the model function depends on. It is the larger of the number of lags and the length of the rolling window.
* prediction context: `lookback` periods immediately preceding the forecast origin
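
To make these terms concrete, here is a minimal sketch on a toy hourly series. The names `target_lags`, `rolling_window` and `history`, and all the values, are illustrative assumptions and not part of the AutoML API. The prediction context is taken here to be the last `lookback` known periods, ending at the forecast origin itself.

```python
import pandas as pd

# Illustrative settings (assumed values, not taken from a trained model)
target_lags = [1, 2, 3]
rolling_window = 5

# lookback: the larger of the maximum lag and the rolling window length
lookback = max(max(target_lags), rolling_window)

# Toy hourly history in which every target value is known
step = pd.Timedelta(hours=1)
history = pd.DataFrame(
    {
        "date": pd.date_range("2000-01-01", periods=10, freq=step),
        "y": range(10),
    }
)

# forecast origin: the last period for which the target is known
forecast_origin = history["date"].max()

# prediction context: the last `lookback` known periods
prediction_context = history.tail(lookback)

# forecast periods: here, the two periods right after the origin
forecast_periods = pd.date_range(forecast_origin + step, periods=2, freq=step)

print("lookback:", lookback)
print("forecast origin:", forecast_origin)
print("forecast periods:", list(forecast_periods))
print(prediction_context)
```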

## Setup

Please make sure you have followed the [configuration notebook](https://github.com/Azure/MachineLearningNotebooks/blob/master/configuration.ipynb) so that your ML workspace information is saved in the config file.

In [1]:
import os
import pandas as pd
import numpy as np
import logging
import warnings

# Import required libraries
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient

from azure.ai.ml.constants import AssetTypes, InputOutputModes
from azure.ai.ml import automl
from azure.ai.ml import Input

# Squash warning messages for cleaner output in the notebook
warnings.showwarning = lambda *args, **kwargs: None

np.set_printoptions(precision=4, suppress=True, linewidth=120)

In [4]:
credential = DefaultAzureCredential()
ml_client = None
try:
    ml_client = MLClient.from_config(credential)
except Exception as ex:
    print(ex)
    # Fall back to explicit workspace details only if the config file could not be loaded
    subscription_id = "<SUBSCRIPTION_ID>"
    resource_group = "<RESOURCE_GROUP>"
    workspace = "<AML_WORKSPACE_NAME>"
    ml_client = MLClient(credential, subscription_id, resource_group, workspace)

In [None]:
workspace = ml_client.workspaces.get(name=ml_client.workspace_name)

output = {}
output["Workspace"] = ml_client.workspace_name
output["Subscription ID"] = ml_client.subscription_id
output["Resource Group"] = workspace.resource_group
output["Location"] = workspace.location
pd.set_option("display.max_colwidth", None)
outputDf = pd.DataFrame(data=output, index=[""])
outputDf.T

## Data
For demonstration purposes, we generate the data synthetically and use it for forecasting.

In [6]:
TIME_COLUMN_NAME = "date"
TIME_SERIES_ID_COLUMN_NAME = "time_series_id"
TARGET_COLUMN_NAME = "y"
lags = [1, 2, 3]
forecast_horizon = 6
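
The `lags` list asks for the target shifted by 1, 2 and 3 periods as features. The sketch below shows that idea with a plain pandas `groupby(...).shift(...)` on a toy frame. AutoML builds its own lag features internally, so this only illustrates the concept; the frame `toy` and the `y_lag*` column names are made up for this example.

```python
import pandas as pd

lags = [1, 2, 3]

# Two short toy series (hypothetical data)
toy = pd.DataFrame(
    {
        "time_series_id": ["a"] * 5 + ["b"] * 5,
        "date": list(pd.date_range("2000-01-01", periods=5, freq="D")) * 2,
        "y": [0, 1, 2, 3, 4, 10, 11, 12, 13, 14],
    }
)

# Shift within each series so values never leak across series boundaries
for lag in lags:
    toy[f"y_lag{lag}"] = toy.groupby("time_series_id")["y"].shift(lag)

# The first max(lags) rows of each series have incomplete lag features
complete = toy.dropna()
print(complete)
```

Only rows with at least `max(lags)` earlier observations in the same series have every lag filled in, which is why a model that uses lags needs that much history before the forecast origin.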

In [None]:
# Synthetically generate the data to train the model
n_train_periods = 30
n_test_periods = forecast_horizon

from helper import get_timeseries

X_train, y_train, X_test, y_test = get_timeseries(
    train_len=n_train_periods,
    test_len=n_test_periods,
    time_column_name=TIME_COLUMN_NAME,
    target_column_name=TARGET_COLUMN_NAME,
    time_series_id_column_name=TIME_SERIES_ID_COLUMN_NAME,
    time_series_number=2,
)
print(X_train.shape, " ", X_test.shape)
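
The `helper` module is not shown in this notebook. For readers who want to experiment without it, the sketch below is a hypothetical stand-in that returns data in the same four-part shape (`X_train, y_train, X_test, y_test`). It is not the implementation of `get_timeseries`; the function name, the linear trend and the noise model are all assumptions.

```python
import numpy as np
import pandas as pd


def make_toy_timeseries(train_len, test_len, n_series=2, seed=0):
    """Hypothetical stand-in generator (not helper.get_timeseries)."""
    rng = np.random.default_rng(seed)
    total = train_len + test_len
    step = pd.Timedelta(hours=1)
    frames = []
    for i in range(n_series):
        frames.append(
            pd.DataFrame(
                {
                    "date": pd.date_range("2000-01-01", periods=total, freq=step),
                    "time_series_id": f"ts{i}",
                    # Linear trend plus noise, offset per series
                    "y": np.arange(total) + 5 * i + rng.normal(size=total),
                }
            )
        )
    # Split each series chronologically: the first train_len rows are training data
    train = pd.concat([f.iloc[:train_len] for f in frames], ignore_index=True)
    test = pd.concat([f.iloc[train_len:] for f in frames], ignore_index=True)
    return (
        train.drop(columns="y"),
        train["y"].to_numpy(),
        test.drop(columns="y"),
        test["y"].to_numpy(),
    )


X_tr, y_tr, X_te, y_te = make_toy_timeseries(train_len=30, test_len=6)
print(X_tr.shape, " ", X_te.shape)
```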

Let's see what the training data looks like.

In [None]:
# Plot the example time series
import matplotlib.pyplot as plt

whole_data = X_train.copy()
target_label = "y"
whole_data[target_label] = y_train
plt.figure(figsize=(10, 6))
for g in whole_data.groupby("time_series_id"):
    plt.plot(g[1]["date"].values, g[1]["y"].values, label=g[0])
plt.legend()
plt.show()

Let us look at the training and test portions of the synthetic data.

In [9]:
# Create a copy of the X_train and X_test DataFrames and add the corresponding target values
df_train = X_train.copy()
df_train[TARGET_COLUMN_NAME] = y_train
df_test = X_test.copy()
df_test[TARGET_COLUMN_NAME] = y_test

In [None]:
# For visualization of the time series
df_train["data_type"] = "Training"  # Add a column to label training data
df_test["data_type"] = "Testing"  # Add a column to label testing data

# Concatenate the training and testing DataFrames
df_plot = pd.concat([df_train, df_test])

# Create a figure and axis
plt.figure(figsize=(10, 6))
ax = plt.gca()  # Get current axis

# Group by both 'data_type' and 'time_series_id'
for (data_type, time_series_id), df in df_plot.groupby(["data_type", "time_series_id"]):
    df.plot(
        x="date",
        y=TARGET_COLUMN_NAME,
        label=f"{data_type} - {time_series_id}",
        ax=ax,
        legend=False,
    )

# Customize the plot
plt.xlabel("Date")
plt.ylabel("Value")
plt.title("Train and Test Data")

# Manually create the legend after plotting
plt.legend(title="Data Type and Time Series ID")
plt.show()

In [11]:
import mltable
import os


def create_ml_table(data_frame, file_name, output_folder):
    os.makedirs(output_folder, exist_ok=True)
    data_path = os.path.join(output_folder, file_name)
    data_frame.to_parquet(data_path, index=False)
    paths = [{"file": data_path}]
    ml_table = mltable.from_parquet_files(paths)
    ml_table.save(output_folder)

In [12]:
os.makedirs("data", exist_ok=True)
create_ml_table(
    df_train,
    "df_train.parquet",
    "./data/training-mltable-folder",
)

# Training MLTable defined locally, with local data to be uploaded
my_training_data_input = Input(
    type=AssetTypes.MLTABLE, path="./data/training-mltable-folder"
)

my_training_data_input.__dict__

# Test data
os.makedirs("data", exist_ok=True)
create_ml_table(
    X_test,  # df_test,
    "X_test.parquet",
    "./data/testing-mltable-folder",
)

create_ml_table(
    df_test,
    "df_test.parquet",
    "./data/testing-mltable-folder",
)

my_test_data_input = Input(
    type=AssetTypes.URI_FOLDER,
    path="./data/testing-mltable-folder",
)

### Compute

In [None]:
from azure.core.exceptions import ResourceNotFoundError
from azure.ai.ml.entities import AmlCompute

cluster_name = "forecast-function"

try:
    # Retrieve an already attached Azure Machine Learning Compute.
    compute = ml_client.compute.get(cluster_name)
    print("Found existing cluster, use it.")
except ResourceNotFoundError as e:
    compute = AmlCompute(
        name=cluster_name,
        size="STANDARD_DS12_V2",
        type="amlcompute",
        min_instances=0,
        max_instances=4,
        idle_time_before_scale_down=120,
    )
    poller = ml_client.begin_create_or_update(compute)
    poller.wait()

In [None]:
TARGET_COLUMN_NAME, TIME_COLUMN_NAME, TIME_SERIES_ID_COLUMN_NAME

In [15]:
# General job parameters
timeout_minutes = 15
trial_timeout_minutes = 5
exp_name = "forecast-function-exp-no-target-rolling"
# Create the AutoML forecasting job with the related factory-function.

forecasting_job = automl.forecasting(
    compute=cluster_name,
    experiment_name=exp_name,
    training_data=my_training_data_input,
    target_column_name=TARGET_COLUMN_NAME,
    primary_metric="NormalizedRootMeanSquaredError",
    n_cross_validations=3,
)

# Limits are all optional
forecasting_job.set_limits(
    timeout_minutes=timeout_minutes,
    trial_timeout_minutes=trial_timeout_minutes,
    enable_early_termination=True,
)

# Specialized properties for Time Series Forecasting training
forecasting_job.set_forecast_settings(
    time_column_name=TIME_COLUMN_NAME,
    forecast_horizon=forecast_horizon,
    # target_rolling_window_size=forecast_horizon,
    time_series_id_column_names=[TIME_SERIES_ID_COLUMN_NAME],
    target_lags=lags,
    frequency="H",
    cv_step_size=3,
)

# Training properties are optional
forecasting_job.set_training(blocked_training_algorithms=["ExtremeRandomTrees"])
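
One way to picture `n_cross_validations=3` together with `cv_step_size=3`: each fold holds out `forecast_horizon` periods for validation, and the origins of successive folds are `cv_step_size` periods apart. The sketch below lays out such folds by index for a single series of 30 training periods. It illustrates rolling-origin validation under those assumptions and is not AutoML's internal splitting code; `rolling_origin_folds` is a hypothetical helper.

```python
def rolling_origin_folds(n_periods, horizon, n_folds, step):
    """Return (train_end, valid_start, valid_end) index triples, latest fold first.

    Indices are half-open: training uses [0, train_end) and validation
    uses [valid_start, valid_end).
    """
    folds = []
    for k in range(n_folds):
        valid_end = n_periods - k * step
        valid_start = valid_end - horizon
        if valid_start <= 0:
            raise ValueError("Not enough periods for the requested folds")
        # Training data ends exactly where validation begins
        folds.append((valid_start, valid_start, valid_end))
    return folds


for train_end, valid_start, valid_end in rolling_origin_folds(
    n_periods=30, horizon=6, n_folds=3, step=3
):
    print(f"train [0, {train_end})  validate [{valid_start}, {valid_end})")
```

Because the step (3) is smaller than the horizon (6), the validation windows of neighbouring folds overlap in this layout.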

In [None]:
# Submit training job
returned_job = ml_client.jobs.create_or_update(
    forecasting_job
)  # submit the job to the backend

print(f"Created job: {returned_job}")

In [None]:
# Wait until AutoML training runs are finished
ml_client.jobs.stream(returned_job.name)

### Retrieve the Best Trial

In [None]:
import mlflow

MLFLOW_TRACKING_URI = ml_client.workspaces.get(
    name=ml_client.workspace_name
).mlflow_tracking_uri
print(MLFLOW_TRACKING_URI)

# Set the MLFLOW TRACKING URI
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
print("\nCurrent tracking uri: {}".format(mlflow.get_tracking_uri()))

In [17]:
from mlflow.tracking.client import MlflowClient

# Initialize MLFlow client
mlflow_client = MlflowClient()

In [None]:
# Use the job submitted above
job_name = returned_job.name

# Example of providing a specific job name/ID instead
# job_name = "yellow_camera_1n84g0vcwp"

# Get the parent run
mlflow_parent_run = mlflow_client.get_run(job_name)

print("Parent Run: ")
print(mlflow_parent_run)

In [None]:
# Get the best model's child run
best_child_run_id = mlflow_parent_run.data.tags["automl_best_child_run_id"]
print("Found best child run id: ", best_child_run_id)

best_run = mlflow_client.get_run(best_child_run_id)

print("Best child run: ")
print(best_run)

In [None]:
# Print parent run tags. 'automl_best_child_run_id' tag should be there.
print(mlflow_parent_run.data.tags)

In [None]:
pd.DataFrame(best_run.data.metrics, index=[0]).T
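
The primary metric for this job is `NormalizedRootMeanSquaredError`. As a rough guide to reading it, the sketch below computes RMSE divided by the range of the actual values, a common way to normalize this metric. The exact normalization AutoML applies (for example, per time series) may differ, and `normalized_rmse` is a hypothetical helper written for this illustration.

```python
import numpy as np


def normalized_rmse(y_true, y_pred):
    """RMSE divided by the range (max - min) of the actual values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    value_range = y_true.max() - y_true.min()
    if value_range == 0:
        raise ValueError("Actual values are constant; the range is zero")
    return rmse / value_range


# Toy values (hypothetical): errors of -1, 0 and +1 on a range of 10
print(normalized_rmse([0.0, 5.0, 10.0], [1.0, 5.0, 9.0]))
```

Lower values are better, and dividing by the range makes the number comparable across targets with different scales.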

The model selection and training process ran as part of the job submitted above. `ml_client.jobs.stream()` streams the job's logs and blocks until the job finishes, so execution up to this point is synchronous.

## Artifact Download

In [22]:
# Create local folder
import os

local_dir = "./artifact_downloads"
os.makedirs(local_dir, exist_ok=True)

In [None]:
# Download run's artifacts/outputs
local_path = mlflow_client.download_artifacts(
    best_run.info.run_id, "outputs", local_dir
)
print("Artifacts downloaded in: {}".format(local_path))
print("Artifacts: {}".format(os.listdir(local_path)))