Chapter 4, Part 3: Baseline and ARIMA Models

Machine Learning

Author
Affiliation

F.San Segundo & N.Rodríguez

Based on notes by Professors A.Muñoz, J.Portela & J.Pizarroso

Published

March 2025


Session Setup

Libraries

Let us begin by loading the libraries we will use.

Hide the code
### Load necessary modules -------------------------------

# %config InlineBackend.figure_format = 'png' # ‘png’, ‘retina’, ‘jpeg’, ‘svg’, ‘pdf’

import numpy as np 
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context('notebook')


%matplotlib inline

# Machine Learning general libraries

from sklearn.metrics import mean_squared_error, mean_absolute_error

# statsmodels
import statsmodels.stats.api as sms
import statsmodels.api as sm
from statsmodels.formula.api import ols


# Time Series related imports

import datetime

from sklearn.model_selection import TimeSeriesSplit


# pmdarima
import pmdarima as pmd

# sktime 
from sktime.datasets import load_airline
from sktime.utils.plotting import plot_series
from sktime.forecasting.base import ForecastingHorizon
from sktime.forecasting.model_selection import temporal_train_test_split
from sktime.split import ExpandingWindowSplitter
from sktime.transformations.series.boxcox import BoxCoxTransformer
from sktime.forecasting.naive import NaiveForecaster
from sktime.forecasting.trend import TrendForecaster
from sktime.forecasting.trend import STLForecaster
from sktime.performance_metrics.forecasting import mean_absolute_error  # note: shadows the sklearn function of the same name imported above
from sktime.forecasting.model_evaluation import evaluate

import statsmodels.tsa as tsa
from statsmodels.tsa.seasonal import seasonal_decompose, STL
from statsmodels.tsa.stattools import adfuller, kpss
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.tsa.arima.model import ARIMA as smARIMA

# datasets
import rdatasets
import yfinance as yf

# tqdm 
from tqdm import tqdm

from fc_4_3_utils.utils import *

Train / Test splits and validation in time series forecasting

Temporal train-test splits

In time series forecasting, we need to be careful when splitting the data into training and testing sets. We cannot use random splits, because we need to preserve the temporal order of the data and we must avoid using future unknown data to predict past data.

Therefore we need to use time-aware splits, where the training set contains the data up to a certain point in time, and the testing set contains the data after that point. The proportion of the split is usually 80/20 or 70/30 of the time span of the series, as in other Machine Learning problems. The split is often performed directly by slicing the data into two parts, but some libraries (like sktime) also provide functions to perform it.

Highly recommended reading and video watching: Hyndman and Athanasopoulos (2021), Section 5.8

We will use the Canadian gas production dataset to illustrate these splitting methods.

Hide the code
can_gas = pd.read_csv('../4_1_Introduction_to_Forecasting/tsdata/fpp3/Canadian_gas.csv')

# Set 'Month' as index and set frequency as Month end, required by sktime
can_gas["Month"] = pd.to_datetime(can_gas["Month"], format="%Y %b")
can_gas["Month"] = can_gas["Month"] + pd.offsets.MonthEnd(0)
can_gas.set_index("Month", inplace=True)
can_gas.index.freq = 'ME'  # 'M' is deprecated in recent pandas versions in favour of 'ME'

can_gas.head()
Volume
Month
1960-01-31 1.4306
1960-02-29 1.3059
1960-03-31 1.4022
1960-04-30 1.1699
1960-05-31 1.1161

Let us first illustrate the sktime basic split function.

Hide the code
y = can_gas[["Volume"]]
y
Volume
Month
1960-01-31 1.4306
1960-02-29 1.3059
1960-03-31 1.4022
1960-04-30 1.1699
1960-05-31 1.1161
... ...
2004-10-31 17.8268
2004-11-30 17.8322
2004-12-31 19.4526
2005-01-31 19.5284
2005-02-28 16.9441

542 rows × 1 columns

Hide the code
y_train, y_test = temporal_train_test_split(y, test_size=0.2)
y_train.tail()
Volume
Month
1995-09-30 15.0938
1995-10-31 16.2678
1995-11-30 16.1945
1995-12-31 16.9076
1996-01-31 17.1467
Hide the code
fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(y_train, label='train')
ax.plot(y_test, label='test')
ax.legend(["train", "test"])
plt.show();plt.close()

We can see the last position/date of the training set and the first position/date of the testing set as follows:

Hide the code
y_train.index[-1], y_test.index[0]
(Timestamp('1996-01-31 00:00:00'), Timestamp('1996-02-29 00:00:00'))

And the sizes of the training and testing sets as follows:

Hide the code
y_train.shape, y_test.shape
((433, 1), (109, 1))

Similarly, we can do a manual train/test 80/20 split as follows:

Hide the code
train_end = int(np.floor(len(y) * 0.8))

And now the manual split is simply done by slicing the data as follows:

Hide the code
y_train = y.iloc[:train_end]
y_test = y.iloc[train_end:]

The Using darts appendix below shows how to perform the same analysis using the darts library.

What about validation sets?

Highly recommended reading and video watching: Hyndman and Athanasopoulos (2021), Section 5.10. We will use some figures from that section below.

We know by now that we should use a validation set to assess the performance of our models and to fine-tune the hyperparameters of our models. In time series forecasting, we should use a validation set that is also time-aware. That is, any validation data should be in the future of the training data. And if we plan to use cross-validation, the same time-awareness extends to the folds of the cross-validation.

There are several different approaches to cross-validation in time series forecasting. The expanding window approach is the most common one. In this approach, the training set starts with an initial window and grows with each fold by adding some points at the end of the training set. The number of points added is called the step. The test set always consists of a fixed number of points following the training set. That fixed number of points is called the forecasting horizon (usually represented as fh in many forecasting libraries). Expanding windows with fh=1 are illustrated in the figure below, from Hyndman and Athanasopoulos (2021), Section 5.10. The blue dots represent the training set, and the orange dot represents the test set.

The case for fh=4 is illustrated in the figure below, also from Hyndman and Athanasopoulos (2021), Section 5.10.

The expanding window is also often called rolling forecasting origin.

Expanding windows with sktime

The ExpandingWindowSplitter class in the sktime library can be used to create expanding windows for time series forecasting. We illustrate how it works below with a toy example of a few time points identified by their index.

Hide the code
time_points = np.arange(12)
time_points
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

Let us suppose that we want to use:
  • a minimum training set size of 4 points,
  • a forecasting horizon of 2 points,
  • an increase of the training set size of 2 points at each step (we often use large steps for computational reasons).

This is accomplished by the following code:

Hide the code
cv_splitter = ExpandingWindowSplitter(initial_window=4, fh=[1, 2], step_length=2)
folds = list(cv_splitter.split(time_points))
folds
[(array([0, 1, 2, 3]), array([4, 5])),
 (array([0, 1, 2, 3, 4, 5]), array([6, 7])),
 (array([0, 1, 2, 3, 4, 5, 6, 7]), array([8, 9])),
 (array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([10, 11]))]
Exercise 001
  • Change the forecasting horizon parameter to fh = [3] and run the code again. What happens?
  • Also play with different values of the step parameter to see what happens.

Let us see how to apply this to the training data of the Canadian gas production time series. We will make sure that our training data series is at least, say, 48 months (four years) long and we will use a forecasting horizon of 6 months with a step of 1 month.

Hide the code
min_train_size = 48
fh = range(1, 7) # forecasting horizon. We need to exclude 0 and include 6
step = 1

y_train_cv_splitter = ExpandingWindowSplitter(initial_window=min_train_size, 
                                              fh=fh, 
                                              step_length=step)
can_gas_fold_idx = list(y_train_cv_splitter.split(y_train))

The splitter provides the indices of the training and testing sets as follows (here we see the first two folds):

Hide the code
print("Number of folds:", len(can_gas_fold_idx))
can_gas_fold_idx[0], can_gas_fold_idx[1]
Number of folds: 380
((array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
         17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
         34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47]),
  array([48, 49, 50, 51, 52, 53])),
 (array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
         17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
         34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48]),
  array([49, 50, 51, 52, 53, 54])))

And we can use these indices to slice the data as follows:

Hide the code
y_train_folds = [y_train.iloc[train_idx] for train_idx, _ in can_gas_fold_idx]
y_val_folds = [y_train.iloc[val_idx] for _, val_idx in can_gas_fold_idx]

Let us plot two of the folds created by the expanding windows to see what they look like. One of the earlier ones appears on the left and one of the later ones on the right.

Hide the code
fig, ax = plt.subplots(figsize=(15, 4), ncols=2)
y_train.plot(ax=ax[0], linestyle="dotted")
y_train_folds[75].plot(label='train', ax=ax[0], linewidth=4)
y_val_folds[75].plot(label='val', ax=ax[0], linewidth=4)
ax[0].legend(["complete training", "fold training data", "fold validation data"])
y_train.plot(ax=ax[1], linestyle="dotted")
y_train_folds[320].plot(label='train', ax=ax[1], linewidth=4)
y_val_folds[320].plot(label='val', ax=ax[1], linewidth=4)
ax[1].legend(["complete training", "fold training data", "fold validation data"])
plt.show();plt.close()

Expanding windows with scikit-learn

The TimeSeriesSplit class in the sklearn library can also be used to create expanding windows for time series forecasting. But selecting the right parameters for the split is not as straightforward as with the ExpandingWindowSplitter class in the sktime library. For example, you cannot directly choose an initial minimum training set size, and the forecasting horizon is only defined indirectly. So we do not recommend using it for time series forecasting.
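As a short illustration (a sketch not used elsewhere in these notes), here is how TimeSeriesSplit would split the toy index from the previous section; note that the size of the initial training window can only be controlled indirectly through n_splits and test_size.

Hide the code
# Sketch: sklearn's TimeSeriesSplit on the toy index defined above.
# The initial training window follows from n_splits and test_size; it cannot be set directly.
sk_splitter = TimeSeriesSplit(n_splits=4, test_size=2)
for train_idx, test_idx in sk_splitter.split(time_points):
    print(train_idx, test_idx)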

Sliding windows

Other cross-validation strategies are possible, like the sliding window approach. In this approach, the training fold length is fixed while it slides over the time series. The following schematic figure from the sktime website illustrates how the sliding window approach works. Here the asterisks represent the training fold, and the x’s represent the test fold.

|-----------------------|
| * * * * * x x x - - - |
| - * * * * * x x x - - |
| - - * * * * * x x x - |
| - - - * * * * * x x x |

This cross validation strategy is implemented in sktime by the SlidingWindowSplitter class. It works similarly to the ExpandingWindowSplitter class, so we just refer you to the documentation for more details.
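For instance, a minimal sketch (reusing the toy time_points index from above, with illustrative parameter values) could look like this:

Hide the code
from sktime.split import SlidingWindowSplitter

# Sketch: fixed training window of 4 points, horizon of 2 points, sliding 2 points per fold
sliding_cv = SlidingWindowSplitter(window_length=4, fh=[1, 2], step_length=2)
list(sliding_cv.split(time_points))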


Forecasting with naive baseline models

Baseline models

We are now ready to start introducing time series forecasting models. We will start with the simplest models, the so-called baseline models. These models are used to provide a point of comparison for more advanced models, like the ARIMA models we will introduce below.

Some baseline models use simple time series statistics or values to make predictions, like the mean of the time series or the last observed value. Others are based on the decomposition of the time series into its components: trend, seasonality, and residuals (keep in mind, however, that some decomposition methods are anything but simple, and in that case the name baseline model is a bit of a stretch).

We will illustrate some of these models by applying them to the Canadian gas production dataset. Remember that we have already split the data into training and testing sets, and that we have also created a cross-validation scheme using expanding windows.

Naive baseline models with sktime

The so-called naive baseline models are the simplest baseline models. They are all based on the idea that future behavior is likely to be similar to the patterns observed in the past. The difference between the models is how they define the past and how that past information is used to make predictions.

For example, we can simply decide to use the last observed value as the prediction for the future. This is the so-called last observed naive model. Or we could use the historical mean of the time series as the prediction (the so-called mean naive model). The last seasonal naive model uses the last observed value from the same season of the preceding seasonal period as the prediction for the future. But the mean of the values observed for that season over past periods can also be used as the prediction for the future.

The NaiveForecaster class in sktime implements several of these naive models through its strategy parameter, combined with the seasonal period parameter sp for the seasonal variants:
  • strategy="last" implements the last observed naive model,
  • strategy="mean" implements the mean naive model,
  • strategy="last" together with sp > 1 implements the last seasonal naive model,
  • strategy="mean" together with sp > 1 implements the seasonal mean naive model.

Check the documentation for more details and other options. Keep in mind that when using seasonal naive models, you need to specify the seasonal period of the time series. Be careful: choosing the wrong seasonal period can lead to very poor predictions! Always think about the sampling frequency of the data and the compatible seasonal periods.

Let us apply some of these naive models to the Canadian gas production dataset.

Hide the code
last_obs_model = NaiveForecaster(strategy="last")
mean_model = NaiveForecaster(strategy="mean")
seasonal_last_model = NaiveForecaster(strategy="last", sp=12)
seasonal_mean_model = NaiveForecaster(strategy="mean", sp=12)
drift_model = NaiveForecaster(strategy="drift")



modelDict = {"last_obs": {"model": last_obs_model},
             "mean": {"model": mean_model},
             "seasonal_last": {"model": seasonal_last_model},
             "seasonal_mean": {"model": seasonal_mean_model},
             "drift": {"model": drift_model}}

# naiveList = [last_obs_model, mean_model, seasonal_last_model, seasonal_mean_model, drift_model]

Now let us obtain the predictions for all these models and for each one of the cross validation folds defined above with the expanding windows. We will store the model predictions in lists.

Hide the code
last_obs_val_pred = []
mean_val_pred = []
seasonal_last_val_pred = []
seasonal_mean_val_pred = []
drift_val_pred = []
for tr_fold, val_fold in zip(y_train_folds, y_val_folds):
    fh = ForecastingHorizon(val_fold.index, is_relative=False, freq="ME")
    last_obs_model.fit(tr_fold)
    last_obs_val_pred.append(last_obs_model.predict(fh = fh))
    mean_model.fit(tr_fold)
    mean_val_pred.append(mean_model.predict(fh = fh))
    seasonal_last_model.fit(tr_fold)
    seasonal_last_val_pred.append(seasonal_last_model.predict(fh = fh))
    seasonal_mean_model.fit(tr_fold)
    seasonal_mean_val_pred.append(seasonal_mean_model.predict(fh = fh))
    drift_model.fit(tr_fold)
    drift_val_pred.append(drift_model.predict(fh = fh)) 

for model in modelDict.keys():
    modelDict[model]["val_pred"] = eval(model + "_val_pred")

Let us visualize the predictions of all the models for a single fold. You can play with the fold_num parameter to see the predictions for different folds. The extend_idx parameter is used in the visualization to show a portion of the time series after the validation set.

Hide the code
fold_num = 53
Hide the code
extend_idx = y_train_folds[fold_num + 20].index

fig, ax = plt.subplots(figsize=(15, 4))
y.loc[extend_idx].plot(ax=ax, linestyle="dotted")
y_train_folds[fold_num].plot(label='train', ax=ax, linewidth=4)
y_val_folds[fold_num].plot(label='val', ax=ax, linewidth=4)
for model in [last_obs_val_pred, 
              mean_val_pred, 
              seasonal_last_val_pred, 
              seasonal_mean_val_pred, 
              drift_val_pred]:
    model[fold_num].plot(ax=ax, linestyle="--")

ax.legend(["complete training (cropped for clarity)", 
           "fold training data", 
           "fold validation data", 
           "last_obs_val_pred", 
           "mean_val_pred", 
           "seasonal_last_val_pred", 
           "seasonal_mean_val_pred",
           "drift_val_pred"], fontsize='small')
plt.show();plt.close()  

Below we zoom in to the validation set to see the predictions in more detail.

Hide the code
extend_idx = y_train_folds[fold_num + 20].index


fig, ax = plt.subplots(figsize=(15, 4))
y.loc[extend_idx][-15:-5].plot(ax=ax, linestyle="dotted")
y_train_folds[fold_num][-25:].plot(label='train', ax=ax, linewidth=4)
y_val_folds[fold_num][-25:].plot(label='val', ax=ax, linewidth=4)
for model in [last_obs_val_pred, 
              mean_val_pred, 
              seasonal_last_val_pred, 
              seasonal_mean_val_pred, 
              drift_val_pred]:
    model[fold_num][-25:].plot(ax=ax, linestyle="--")

ax.legend(["complete training (cropped for clarity)", 
           "fold training data", 
           "fold validation data", 
           "last_obs_val_pred", 
           "mean_val_pred", 
           "seasonal_last_val_pred", 
           "seasonal_mean_val_pred",
           "drift_val_pred"], fontsize='xx-small')
plt.show();plt.close()  

As you can see, the naive models are very simple and in examples like this they fail to capture the complexity of the time series dynamics. The closest one to the real data seems to be the seasonal_last naive model, but it is still quite different from the real data. As we said, they are useful as a point of comparison for more advanced models.

Combining the predictions of naive models (ensemble)

It is possible to combine the predictions of some naive models to create an ensemble model that is often more accurate than the individual naive models. This is done by averaging the predictions of the naive models. We have already met this idea in the context of ensemble learning for Regression and Classification problems.

Hide the code
ensemble_model_val_pred = [(dr + seas + last)/3 for dr, seas, last in zip(drift_val_pred, seasonal_last_val_pred, last_obs_val_pred)]

model_name = "ensemble"
modelDict[model_name] = {"model": None, "val_pred": eval(f"{model_name}_model_val_pred")}

Let us plot the prediction of the ensemble model for the same fold as above. We encourage you to play with the fold_num parameter to see the predictions for different folds.

Hide the code
extend_idx = y_train_folds[fold_num + 20].index

fig, ax = plt.subplots(figsize=(15, 4))
y.loc[extend_idx].plot(ax=ax, linestyle="dotted")
y_train_folds[fold_num].plot(label='train', ax=ax, linewidth=4)
y_val_folds[fold_num].plot(label='val', ax=ax, linewidth=4)
ensemble_model_val_pred[fold_num].plot(ax=ax, linestyle="--")

ax.legend(["complete training (cropped for clarity)", 
           "fold training data", 
           "fold validation data",
           "ensemble_val_pred"], 
           fontsize='small')
plt.show();plt.close()  

The ensemble model seems to be a bit better than the individual naive models. How can we turn this into a more formal comparison? In order to do that, we need to calculate some error metrics.

Metrics for time series forecasting. Validation and test metrics.

Since forecasting time series is essentially a regression problem, we can use the same metrics we used for regression problems. The most common metrics are:
  • Mean Absolute Error (MAE)
  • Mean Squared Error (MSE)
  • Root Mean Squared Error (RMSE)
  • Mean Absolute Percentage Error (MAPE)
  • Symmetric Mean Absolute Percentage Error (sMAPE)

We recommend reading Section 5.8 of Hyndman and Athanasopoulos (2021) (start from Forecast errors) for a more detailed explanation of these metrics. The first three measures are, as we know, scale dependent. But when we use them to compare different models for the same time series, the scale dependency is not a problem. These measures are implemented in the sklearn library, so we can use them directly.
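For reference, here is a small numpy sketch (illustrative helpers, not used in the rest of these notes) of how the percentage-based and squared-error metrics can be computed by hand; MAE and MSE come directly from the libraries imported above.

Hide the code
# Illustrative helper functions for the forecast error metrics (sketch, assuming array-like inputs)
def rmse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred):
    # Mean Absolute Percentage Error (in %); not defined when y_true contains zeros
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return 100 * np.mean(np.abs((y_true - y_pred) / y_true))

def smape(y_true, y_pred):
    # Symmetric MAPE (in %), using the usual (|y| + |y_hat|) / 2 denominator
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return 100 * np.mean(2 * np.abs(y_pred - y_true) / (np.abs(y_true) + np.abs(y_pred)))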

To compute the validation metrics, we need to calculate the error metrics for each fold of the cross-validation. We can then average the error metrics to get a single value for each metric. For the test set metrics things are simpler, as we just need to calculate the error metrics once on the test set. In the code cell below we use the model dictionary we have built to calculate the error metrics for each model and each fold, average them across folds for each model, and also calculate the error metrics for the test set.

Hide the code
for model_name in modelDict.keys():
    model = modelDict[model_name]['model']
    model_val_preds = modelDict[model_name]["val_pred"]
    val_maes = [mean_absolute_error(y_val_fold, model_val_pred) for y_val_fold, model_val_pred in 
                zip(y_val_folds, model_val_preds)]
    modelDict[model_name]["val_maes"] = val_maes
    modelDict[model_name]["average_val_maes"] = np.array(val_maes).mean()

    test_fh = ForecastingHorizon(y_test.index, is_relative=False, freq="ME")
    if model_name != "ensemble":
        test_pred = model.predict(fh = test_fh)
        modelDict[model_name]["test_pred"] = test_pred
        modelDict[model_name]["test_mae"] = mean_absolute_error(y_test, test_pred)

The case of the ensemble model is special, because we do not have a predict method for it. We need to calculate the ensemble prediction and error metrics manually, as we do in the code cell below.

Hide the code
zipped_preds = zip(modelDict["drift"]["test_pred"].values.ravel(),
                   modelDict["seasonal_last"]["test_pred"].values.ravel(),
                   modelDict["last_obs"]["test_pred"].values.ravel())
                   
ensemble_test_pred = [(dr + seas + last)/3 for dr, seas, last in zipped_preds]

modelDict['ensemble']['test_pred'] = ensemble_test_pred
modelDict['ensemble']["test_mae"] = mean_absolute_error(y_test, ensemble_test_pred)

We store the error metrics in a dataframe to sort and display them for easy comparison.

Hide the code
model_maes_df = pd.DataFrame({'model_name':modelDict.keys(),
                              'val_avrg_mae': [modelDict[model_name]['average_val_maes'] for model_name in modelDict.keys()],
                              'test_mae': [modelDict[model_name]['test_mae'] for model_name in modelDict.keys()]})

model_maes_df
model_maes_df.sort_values(by='val_avrg_mae', inplace=True)
model_maes_df
model_name val_avrg_mae test_mae
2 seasonal_last 0.555838 1.888272
5 ensemble 0.747182 1.382830
0 last_obs 1.051184 2.124965
4 drift 1.061138 0.708353
1 mean 3.486469 9.750811
3 seasonal_mean 3.537769 9.749002

The table indicates that the seasonal last naive model is the best of the naive models in cross-validation, closely followed by the ensemble model. The test set metrics, however, select the drift method as the best one. Keep in mind that the more robust way to select the best model is to use the cross-validation metrics, and that ensembles tend to provide more robust results than individual models. So in this case we would select the seasonal last or the ensemble model as the best one. The situation with the drift model is peculiar: it is the only model that obtains a better (lower) error on the test set than in cross-validation. We will now visualize the predictions of all these models for the test set in order to understand why this may be the case. We will plot the real data, the predictions of the naive models and those of the ensemble model.

Hide the code
test_pred_df = pd.DataFrame({'y_test': y_test.values.ravel(),
                                'drift_pred': modelDict['drift']['test_pred'].values.ravel(),
                                'seasonal_last_pred': modelDict['seasonal_last']['test_pred'].values.ravel(),
                                'last_obs_pred': modelDict['last_obs']['test_pred'].values.ravel(),
                                'ensemble_pred': modelDict['ensemble']['test_pred']})

test_pred_df.index = y_test.index
len(y_test)

fig, ax = plt.subplots(figsize=(15, 4))
y_train.iloc[-90:].plot(ax=ax, linestyle="dotted")
y_test.plot(label='test', ax=ax, linewidth=4)
for model in ['drift', 'seasonal_last', 'last_obs', 'ensemble']:
    test_pred_df[model + '_pred'].plot(ax=ax, linestyle="--")
# test_pred_df['drift_pred'].plot(ax=ax, linewidth=4, label='naive_pred')
ax.legend(["training (cropped for clarity)", 
           "test data"] + ['drift', 'seasonal_last', 'last_obs', 'ensemble'], 
           fontsize='small')
plt.show();plt.close()  

The plot suggests that in this series the trend component is the most important one, and that the only naive model that partially captures it is the drift model. The rest of the naive models do not really pay attention to the trend component. The ensemble model is a bit better than the other naive models, because the drift model is part of the ensemble, but the other ingredients are in fact dragging it down.

Decomposition based models

Time series dynamics and decomposition

In many cases, when the dynamics of a time series are dominated by a strong component, either trend or seasonality, the naive models that pay attention to that component are the best ones. But when both components are important, naive models struggle to capture the complexity of the time series. Some more advanced models try to combine good predictions of these components.

We recommend reading (all of) Chapter 3 of Hyndman and Athanasopoulos (2021), with a special focus on Sections 3.4 to 3.7. The discussion included there of the methods used by official statistical agencies to decompose time series is relevant, as these methods are often required by regulation to produce official statistics (think of deseasonalized unemployment rates, for example).

Here we will show how to use both a classical additive decomposition and the more advanced STL method to decompose a time series and obtain forecasts from its components. Later in the course we will introduce the Prophet model, released by Facebook's Core Data Science team in 2017, a comparatively recent model that uses a similar decomposition approach. Its main advantages are its simplicity and its ability to capture the complexity of time series with multiple seasonalities. Think, for example, of the electricity demand time series, which has a daily seasonality, a weekly seasonality and a yearly seasonality. Classical decomposition methods are essentially unable to capture all these components, but Prophet was designed to do so. More recently the forecasting community has turned to neural network models, and there is a NeuralProphet model that is a neural network version of Prophet. You can read more about this in this Medium post by the authors of the Prophet model.

Using the classical additive decomposition method

The code below shows how to use this classical decomposition method to forecast the Canadian gas production time series and evaluate its performance with cross validation. The basic ideas are similar to what we have done above, but we have to be careful because the trend component of the classical decomposition method contains missing values at the beginning and at the end of the time series. The (admittedly simplistic) solution we apply is to fill these missing values by using the drift naive model.

Note: sktime includes a TrendForecaster model class to forecast the trend component (using linear regression), but combining its results with the seasonal component can be tricky, so we will use the method described above instead.

Hide the code
decomp = seasonal_decompose(y_train, model="additive", period=12)  
trend, seasonal, resid = decomp.trend.dropna(), decomp.seasonal, decomp.resid
# missing_length = (y_train.shape[0] - trend.shape[0]) //  2

The following two cells show that the trend component is missing the last values of the train period (silently, as time gaps).

Hide the code
trend.tail(8).index
DatetimeIndex(['1994-12-31', '1995-01-31', '1995-02-28', '1995-03-31',
               '1995-04-30', '1995-05-31', '1995-06-30', '1995-07-31'],
              dtype='datetime64[ns]', name='Month', freq='ME')
Hide the code
y_train.tail(8).index
DatetimeIndex(['1995-06-30', '1995-07-31', '1995-08-31', '1995-09-30',
               '1995-10-31', '1995-11-30', '1995-12-31', '1996-01-31'],
              dtype='datetime64[ns]', name='Month', freq='ME')

We will therefore use a drift naive model to predict both the missing values at the end of the train period and the test period (these form the future_dates forecasting horizon below). Then we will use the seasonal naive model to predict the seasonal component of the time series. We will then add the two components to get the final prediction.

Hide the code
future_dates = pd.date_range(start=trend.index[-1], end=y_test.index[-1], freq="ME")
fh_future_dates = ForecastingHorizon(future_dates, is_relative=False)
fh_future_dates
ForecastingHorizon(['1995-07-31', '1995-08-31', '1995-09-30', '1995-10-31',
               '1995-11-30', '1995-12-31', '1996-01-31', '1996-02-29',
               '1996-03-31', '1996-04-30',
               ...
               '2004-05-31', '2004-06-30', '2004-07-31', '2004-08-31',
               '2004-09-30', '2004-10-31', '2004-11-30', '2004-12-31',
               '2005-01-31', '2005-02-28'],
              dtype='datetime64[ns]', length=116, freq='ME', is_relative=False)

This is the naive drift model prediction:

Hide the code
trend_forecaster = NaiveForecaster(strategy="drift")  # Drift model for trend
trend_forecaster.fit(trend)
trend_pred_future_values = trend_forecaster.predict(fh_future_dates)

Let us visualize this prediction

Hide the code
y_train.tail(30).plot(label="Train", color="black")
trend.tail(30).plot(label="Trend", color="blue")
trend_pred_future_values.plot(label="Trend Forecast", color="red", linestyle="--")
plt.axvline(y_train.index[-1], color="blue", linestyle="--", label="Train/Test Split")
plt.legend()
plt.show();plt.close()

Note the overlap with the real data at the end of the train period. Note also that the trend forecast we are using has a local nature: it pays attention to the trend of the last part of the train period.

We now select the part of this trend prediction corresponding to the test set that we will use for the combined forecast.

Hide the code
trend_pred_test = trend_pred_future_values.loc[y_test.index]

Now we fit a naive forecast model to the seasonal component of the time series. We will use the seasonal naive model to predict the seasonal component of the time series, as it simply extends the last seasonal period to the test set.

Hide the code
fh_seasonal = ForecastingHorizon(y_test.index, is_relative=False, freq="ME")
seasonal_forecaster = NaiveForecaster(strategy="last", sp=12)  # Last observed seasonality
seasonal_forecaster.fit(y = seasonal, fh=fh_seasonal)
seasonal_pred_test = seasonal_forecaster.predict(fh_seasonal)

Now we can simply add both components for the final test forecast

Hide the code
test_pred = trend_pred_test + seasonal_pred_test

Let us plot the resulting prediction for the test set.

Hide the code
plt.figure(figsize=(10, 5))
plt.plot(y.tail(150), label="Train data", color="black")
plt.plot(y_test, label="Test data", color="blue")
plt.plot(test_pred, label="Classical decomposition prediction", linestyle="dashed", color="red")
plt.axvline(y_train.index[-1], color="orange", linestyle="--", label="Train/Test Split")
plt.legend()
plt.title("Forecasting Canadian Gas Production by extending the classical additive decomposition")
plt.xlabel("Year")
plt.ylabel("Volume")
plt.show();plt.close()

The forecast looks much better than the naive models, but it clearly deteriorates as we move further into the future. The main reason in this case seems to be a change in the trend component of the time series. The seasonal component is also changing, but the trend component appears to be the one that deviates the most from the prediction of this decomposition model.

Let us store the information for this model in our dictionary. We will not, however, compute the cross-validation metrics for this model; instead, we proceed with another, more advanced decomposition model.

Hide the code
model_name = "clsc_add_dcmp"
modelDict[model_name] = {"model": None, "val_maes": None, "average_val_maes": None}
modelDict[model_name]["test_pred"] = test_pred
modelDict[model_name]["test_mae"] = mean_absolute_error(y_test, test_pred)
modelDict[model_name]["test_mae"]
0.7207625571296102

Using STL decomposition

The STL method is a more advanced decomposition method that is able to capture more complex time series dynamics. Its more recent versions can decompose time series in which several seasonalities are present. See (Section 3.6 of Hyndman and Athanasopoulos 2021) and the references therein for more details. The code below shows how to use this method to forecast the Canadian gas production time series and evaluate its performance with cross-validation. We use the implementation of the STL method in the sktime library, because of its similarity to the sklearn interface we are already familiar with. Essentially all we need to provide to the method is the seasonal period of the time series.

Hide the code
model_name = "stl"

stl_model = STLForecaster(sp=12)
stl_model.fit(y_train);

With this we are ready to make predictions for both the test set and the cross validation folds. We will store the predictions in the dictionary we have been using.

Hide the code
test_pred = stl_model.predict(fh=test_fh)
modelDict[model_name] = {"model": stl_model, "test_pred": test_pred}


stl_val_pred = []
for tr_fold, val_fold in zip(y_train_folds, y_val_folds):
    fh = ForecastingHorizon(val_fold.index, is_relative=False, freq="ME")
    stl_model.fit(tr_fold)
    stl_val_pred.append(stl_model.predict(fh = fh))

modelDict[model_name]["val_pred"] = eval(model_name + "_val_pred")

modelDict[model_name]["test_mae"] = mean_absolute_error(y_test, modelDict[model_name]["test_pred"])

val_maes = [mean_absolute_error(y_val_fold, model_pred) for y_val_fold, model_pred in 
            zip(y_val_folds, modelDict[model_name]["val_pred"])]

modelDict[model_name]["val_maes"] = val_maes
modelDict[model_name]["average_val_maes"] = np.array(val_maes).mean()

The performance measures for the STL model are the best we have seen so far.

Hide the code
print(f"Average validation MAE for {model_name}:", modelDict[model_name]["average_val_maes"])
print(f"Test MAE for {model_name}:", modelDict[model_name]["test_mae"])
Average validation MAE for stl: 0.3163585394569614
Test MAE for stl: 0.6423952710618968

And the following plot shows that the predictions for the test set are indeed quite good, even though they are similarly affected by the change in the trend component of the time series during the last seasonal periods. But before that final part, the predictions of the STL model are in general better than those of the additive decomposition model (and the rest of the naive models).

Hide the code
fig, ax = plt.subplots(figsize=(15, 4))
ax.plot(y.tail(150), label="Actual Data", color="black")
modelDict["clsc_add_dcmp"]["test_pred"].plot(label="Classical decomposition test prediction", linestyle="dashed", color="red", ax=ax)
modelDict["stl"]["test_pred"].plot(label="STL test prediction", linestyle="dashed", color="blue", ax=ax)
plt.axvline(y_train.index[-1], color="blue", linestyle="--", label="Train/Test Split")
plt.legend(["Actual Data", "Classical decomposition test prediction", "STL test prediction"])
plt.title("Forecasting Canadian Gas Production with Classical Decomposition and STL")
plt.xlabel("Year")
plt.ylabel("Volume")
plt.show()

Let us update the model performance dataframe with the metrics of the classical decomposition and the STL decomposition models. Note that we have not computed the cross validation metrics for the classical decomposition model, but there are reasons to believe that the STL model is better.

Hide the code
model_maes_df = pd.DataFrame({'model_name':modelDict.keys(),
                              'val_avrg_mae': [modelDict[model_name]['average_val_maes'] for model_name in modelDict.keys()],
                              'test_mae': [modelDict[model_name]['test_mae'] for model_name in modelDict.keys()]})

model_maes_df
model_maes_df.sort_values(by='val_avrg_mae', inplace=True)
model_maes_df
model_name val_avrg_mae test_mae
7 stl 0.316359 0.642395
2 seasonal_last 0.555838 1.888272
5 ensemble 0.747182 1.382830
0 last_obs 1.051184 2.124965
4 drift 1.061138 0.708353
1 mean 3.486469 9.750811
3 seasonal_mean 3.537769 9.749002
6 clsc_add_dcmp NaN 0.720763

Autoregressive (AR) Processes

Introduction

In the following sections we will gradually introduce the ARIMA model family, the most important family of classical time series forecasting models. ARIMA models combine the ideas of autoregressive (AR) processes, moving average (MA) processes, and differencing (the I stands for integrated, the inverse operation of differencing) to create a powerful model that can capture a wide range of time series dynamics. We will first see what AR (autoregressive) and MA (moving average) processes are, before introducing the ARIMA models.

Later we will extend these methods to seasonal ARIMA models, usually called SARIMA models, which will enable us to capture the seasonal dynamics of the time series.

AR Process Definition in the Non Seasonal (or Regular) Case

A stochastic process \((Y_t)\) is called autoregressive if it satisfies the equation:

\[y_{t} = \phi_{1}y_{t-1} + \phi_{2}y_{t-2} + \dots + \phi_{p}y_{t-p} + \varepsilon_{t},\]

where:
  • \((\varepsilon_{t})\) is Gaussian white noise,
  • \(\phi_1, \ldots, \phi_{p}\) are constants, called the coefficients of the process,
  • the number \(p\) is the order of the process, which is then called an \(AR(p)\) process.

Any time series which is a realization of an autoregressive process is called an autoregressive time series.

An autoregressive process can be considered as analogous to a multiple linear regression model. But here the prediction of \(y_t\) uses recent lags of the time series as input variables. We regress the time series onto its recent past, and that is why we call it autoregressive.

Backshift Operator and Characteristic Polynomial

Using the backshift operator \(B(y_t) = y_{t - 1}\) we can write the autoregressive process as:

\[\Phi(B) y_t = (1-\phi_1B - \cdots - \phi_p B^p)y_{t} = \varepsilon_{t}\]

where \[\Phi(B) = (1-\phi_1B - \cdots - \phi_p B^p)\] is the characteristic polynomial of the process.

\(AR(1)\) and \(AR(2)\) Processes

We are particularly interested in the lower order processes. A process of type \(AR(1)\) is: \[y_{t} = \phi_{1}y_{t-1} + \varepsilon_{t},\] or in characteristic polynomial form: \[(1 - \phi_{1} B)y_{t} = \varepsilon_{t}\] Note in particular that a random walk is an autoregressive process of order 1. But it is not stationary! We will return to the subject of stationarity shortly.

A second order \(AR(2)\) process is: \[y_{t} = \phi_{1}y_{t-1} + \phi_{2}y_{t-2} + \varepsilon_{t},\] or \[(1 - \phi_{1}B - \phi_{2} B^2 )y_{t} = \varepsilon_{t}\]

Stationarity for AR Processes

The AR process is stationary if the roots of the characteristic polynomial \[\Phi(B) = (1-\phi_1B - \cdots - \phi_p B^p)\] are all outside the unit circle of the complex plane.


Simulating Non Seasonal AR Processes with Python

We will use the generate_sample method of the ArmaProcess object in statsmodels to simulate autoregressive time series. But first we will do it using a for loop to get a better understanding of the nature of these time series. Let us begin by simulating an \(AR(2)\) process with equation (example from (Peixeiro 2022), section 5.3.1): \[y_{t} = (1/3)y_{t-1} + (1/2)y_{t-2} + \varepsilon_{t},\] Our first step is to generate a white noise time series. We will use a function to do this and ensure reproducibility while keeping control of the process.

Hide the code
n = 100

def noise_gen(size, seed=2024):
    rng = np.random.default_rng(seed)
    return rng.normal(size = size)

W = noise_gen(size = n)

Next we define the coefficients of the model using two arrays, as required by ArmaProcess. Note in particular that the sign of the coefficients is reversed, as they appear in the characteristic polynomial: \[ (1 - (1/3)B - (1/2)B^2)y_{t} = \varepsilon_{t}\]

Hide the code
ar_cff = np.array([1, -1/3, -1/2])
ma_cff = np.array([1,  0,    0]) 

And now let us generate the series in a for loop. Note the special case of the first two terms.

Hide the code
AR = np.zeros(n)

AR[0] = W[0]
AR[1] = -ar_cff[1] * AR[0] + W[1]

for i in range(2, n):
    AR[i] = -ar_cff[1] * AR[i - 1] - ar_cff[2] * AR[i - 2] + W[i]

AR_df = pd.DataFrame({"AR":AR})
AR_df.head()
AR
0 1.028857
1 1.984872
2 2.322772
3 0.793514
4 0.033091

The same result can be obtained also with generate_sample:

Hide the code
AR = ArmaProcess(ar_cff, ma_cff).generate_sample(nsample=n, distrvs=noise_gen)
AR_df = pd.DataFrame({"AR":AR})
AR_df.head()
AR
0 1.028857
1 1.984872
2 2.322772
3 0.793514
4 0.033091

Let us plot the time series and its ACF:

Hide the code
df_ts = AR_df
var = "AR"
title="Autoregressive Process (order 2)"

fig, axs = plt.subplots(1, 2, figsize=(16,4), sharex=False, sharey=False)

ax0 = axs[0]
df_ts[var].plot(color="blue", ax=ax0)
pd.DataFrame({"t":range(df_ts.shape[0]), var:df_ts[var]}).plot(color="blue", kind="scatter", x="t", y=var, ax=ax0)
ax0.set_title(title)
ax0.grid(visible=True, which='both', axis='x')
ax0.grid(visible=False, which='major', axis='y')

ax1 = axs[1]
sm.graphics.tsa.plot_acf(df_ts[var].dropna(), ax=ax1, lags=25, zero=False, title='ACF')
ax1.set(ylim=(-1,1), xlabel='Lag', ylabel='acf')

plt.tight_layout()
plt.show();plt.close()

ACF and Dynamics for AR Models

This ACF is, as we will see, typical of a stationary AR process. Note that the first ACF values decay towards zero in an oscillatory pattern, but they do not suddenly vanish. In fact, we get three coefficients that are significantly different from zero. This observation will be important below in the context of identifying the right modeling strategy for our data. The pattern of the ACF looks more like a slow fading to zero. Observe also the interesting dynamics in the time plot. Both plots indicate that this is certainly not noise. To visualize more examples of these dynamics, let us generate some more realizations of AR(2) processes with the same characteristic polynomial, where the only change is the noise term.

Hide the code
k = 5

AR = np.zeros((n, k))

for j in range(k):
    W = noise_gen(size = n, seed=j)
    
    AR[0, j] = W[0]
    AR[1, j] = -ar_cff[1] * AR[0, j] + W[1]

    for i in range(2, n):
        AR[i, j] = -ar_cff[1] * AR[i - 1, j] - ar_cff[2] * AR[i - 2, j] + W[i]

AR_df2 = pd.DataFrame(AR, columns=["AR"+str(i) for i in range(k)])
AR_df2.head()
AR0 AR1 AR2 AR3 AR4
0 0.125730 0.345584 0.189053 2.040919 -0.651791
1 -0.090195 0.936813 -0.459731 -1.875359 -0.391981
2 0.673223 0.815500 -0.471780 0.813439 1.207168
3 0.284210 -0.562917 -2.828593 -1.234303 0.865547
4 -0.104321 1.125467 0.620953 -0.457364 -0.749298
Hide the code
fig, ax = plt.subplots()
AR_df2.plot(ax=ax, zorder=-10)
if (k > 10):
    ax.get_legend().remove()

Again, note that the dynamics of these series are quite complex, even though the coefficients are the same for all of them. This gives \(AR\) models a lot of flexibility. Besides, their interpretation, namely that the main drivers of a process are its most recent past values, is intuitively appealing as a model for many processes we observe in nature or in human activities.

Stationarity in AR Processes

Not all AR processes are stationary: as we have said, a random walk is an example of an AR(1) process, but we know it is not stationary. Recall that the characteristic polynomial of an autoregressive \(AR(p)\) process is \[\Phi(B) = (1-\phi_1B - \cdots - \phi_p B^p)\] This polynomial provides a very simple characterization of stationarity for these processes:

An autoregressive AR(p) process is stationary if and only if the roots of the characteristic polynomial are outside the unit circle. \[\Phi(B) = 0 \Rightarrow |B| > 1\]

Exercise 001

The roots of the AR(2) process that we have used as an example can be obtained here. Is it stationary?
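One possible way to compute those roots in Python is sketched below, using the ar_cff and ma_cff coefficient arrays defined earlier for this process (remember that the modulus of every root has to be larger than 1 for stationarity).

Hide the code
# Sketch: roots of 1 - (1/3)B - (1/2)B^2, with coefficients given in increasing powers of B
roots = np.polynomial.polynomial.polyroots([1, -1/3, -1/2])
print(roots, np.abs(roots))

# Alternatively, statsmodels' ArmaProcess exposes the AR roots and a stationarity flag
process = ArmaProcess(ar=ar_cff, ma=ma_cff)
print(process.arroots, process.isstationary)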


Moving Average (MA) Processes

Moving Average Processes Definition in the Non Seasonal Case

A stochastic process \((Y_t)\) is called a moving average process if it satisfies the equation:

\[y_{t} = \varepsilon_t + \theta_{1}\varepsilon_{t-1} + \theta_{2}\varepsilon_{t-2} + \cdots + \theta_{q}\varepsilon_{t-q},\]

where:
  • \((\varepsilon_{t})\) is Gaussian white noise,
  • \(\theta_1, \ldots, \theta_{q}\) are constants, called the coefficients of the process,
  • the number \(q\) is the order of the process, which is then called an MA(q) process.

Any time series which is a realization of a moving average process is called a moving average time series.


Characteristic Polynomial and Invertibility for MA Processes

The characteristic polynomial of the MA(q) process is: \[ \Theta(B) = 1 + \theta_{1}B + \theta_{2}B^2 + \cdots + \theta_{q}B^q \] so that the MA process can be written as: \[ y_t = \Theta(B) \varepsilon_t \]

Suppose that we have an MA(1) process with \(|\theta_1| < 1\) \[ y_t = (1 - \theta_1\,B)\varepsilon_t \] then the operator \((1 - \theta_1\,B)\) is invertible, meaning that \[ \dfrac{1}{1 - \theta_1\,B} = 1 + \theta_1\, B + \theta_1^2\, B^2 + \theta_1^3\, B^3+\dots \] In particular this means that an invertible \(MA(1)\) process can be considered as an infinite autoregressive process \(\text{AR}(\infty)\): \[ (1 + \theta_1\, B + \theta_1^2\, B^2 + \theta_1^3\, B^3+\dots)y_t = \varepsilon_t \] The same ideas apply to general order MA(q) processes, provided they are invertible, that is, if the roots of the characteristic polynomial \(\Theta(B)\) are all outside the unit circle of the complex plane. Invertibility implies that the \(MA(q)\) process is equivalent to a certain \(\text{AR}(\infty)\) process. This in turn implies that the MA(q) model possesses some nice mathematical properties (see e.g. the discussion and the video in Section 9.4 of Hyndman and Athanasopoulos (2021)).
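A quick numerical check of this expansion is sketched below; the value \(\theta_1 = 0.5\) is an illustrative choice, not one used elsewhere in these notes.

Hide the code
# Sketch: check the AR(infinity) expansion of an invertible MA(1) numerically
theta1 = 0.5  # illustrative value with |theta1| < 1

# MA polynomial 1 - theta1*B, coefficients given in increasing powers of B
ma1 = ArmaProcess(ar=[1], ma=[1, -theta1])
print(ma1.isinvertible)  # True: the root of 1 - 0.5 B is B = 2, outside the unit circle

# First coefficients of the equivalent AR(infinity) polynomial 1 / (1 - theta1*B) ...
print(ma1.arma2ar(lags=6))
# ... which should match the geometric expansion 1, theta1, theta1^2, ...
print(np.array([theta1 ** k for k in range(6)]))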


Simulating Non Seasonal MA(q) Processes with Python

The simulation follows the same structure that we saw in the autoregressive case. Let us simulate the MA(2) process defined by:

\[y_t = \varepsilon_{t} + 0.4\,\varepsilon_{t-1} - 0.3\,\varepsilon_{t-2}\]

Again we need to think of the characteristic polynomial form in order to get the coefficients right in Python:

\[y_t = (1 + 0.4\,B - 0.3\, B^2)\,\varepsilon_{t}\]

Hide the code
ma_cff = np.array([1, 0.4, -0.3])
ar_cff = np.array([1,  0,    0]) 

MA = ArmaProcess(ar_cff, ma_cff).generate_sample(nsample=n, distrvs=noise_gen)

MA_df = pd.DataFrame({"MA":MA})

MA_df.head()
MA
0 1.028857
1 2.053463
2 1.494830
3 -1.007068
4 -2.126088

Let us plot the time series and its ACF:

Hide the code
df_ts = MA_df
var = "MA"
title="Moving Average Process (order 2)"

fig, axs = plt.subplots(1, 2, figsize=(16,4), sharex=False, sharey=False)

ax0 = axs[0]
df_ts[var].plot(color="blue", ax=ax0)
pd.DataFrame({"t":range(df_ts.shape[0]), var:df_ts[var]}).plot(color="blue", kind="scatter", x="t", y=var, ax=ax0)
ax0.set_title(title)
ax0.grid(visible=True, which='both', axis='x')
ax0.grid(visible=False, which='major', axis='y')

ax1 = axs[1]
sm.graphics.tsa.plot_acf(df_ts[var].dropna(), ax=ax1, lags=25, zero=False, title='ACF')
ax1.set(ylim=(-1,1), xlabel='Lag', ylabel='acf')

plt.tight_layout()
plt.show();plt.close()

Exercise 002

Can you generate this exact time series using a for loop instead of generate_sample?


Identifying a Non Seasonal MA(q) or AR(p) with the ACF

The above example illustrates the typical behavior of the ACF of a moving average MA(q) time series: the autocorrelation coefficients become abruptly non-significant after lag q.

However, we saw that the ACF of the AR(p) example was different, and there the number of significant coefficients (meaning significantly different from 0) is not an indicator of the order of the process. In order to identify that order we will use a different plot, called the PACF, for partial autocorrelation function. This is a modified version of the ACF. If the k-th ACF coefficient examines the correlation between \(y_t\) and \(y_{t - k}\), its analogue in the PACF examines the coefficient of \(y_{t-k}\) in a regression model for \(y_t\) that uses all the lags \[y_{t-1}, y_{t-2},\ldots, y_{t-k}\] as predictors. This means that we are examining the relation between \(y_{t}\) and \(y_{t - k}\) while controlling for the effect of the intermediate lags. That is why these are called partial autocorrelation coefficients.

The way to use the PACF to identify the order of an autoregressive model is simple. In the PACF of an autoregressive AR(p) time series: the partial autocorrelation coefficients become abruptly non-significant after lag p.

In Python we can plot the PACF with

Hide the code
df_ts = AR_df
var = "AR"
title="AR(2) Process"

fig, axs = plt.subplots(3, 1, figsize=(8, 9), sharex=False, sharey=False)

ax0 = axs[0]
df_ts[var].plot(color="blue", ax=ax0)
pd.DataFrame({"t":range(df_ts.shape[0]), var:df_ts[var]}).plot(color="blue", kind="scatter", x="t", y=var, ax=ax0)
ax0.set_title(title)
ax0.grid(visible=True, which='both', axis='x')
ax0.grid(visible=False, which='major', axis='y')

ax1 = axs[1]
sm.graphics.tsa.plot_acf(df_ts[var].dropna(), ax=ax1, lags=25, zero=False, title='ACF')
ax1.set(ylim=(-1,1), xlabel='Lag', ylabel='acf')


ax2 = axs[2]
sm.graphics.tsa.plot_pacf(df_ts[var].dropna(), ax=ax2, lags=25, zero=False, title='PACF')
ax2.set(ylim=(-1,1), xlabel='Lag', ylabel='pacf')

plt.tight_layout()
plt.show();plt.close()
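The description of the partial autocorrelations given above can be checked numerically. The following sketch (an illustrative helper, not used later) recovers the PACF coefficient at a given lag as the last coefficient of an explicit lag regression and compares it with the pacf function from statsmodels using its regression-based method; the two values should be very close.

Hide the code
# Sketch: PACF at lag k via an explicit regression of y_t on its first k lags (with intercept)
def pacf_by_regression(y, k):
    y = pd.Series(np.asarray(y, dtype=float))
    lagged = pd.concat([y.shift(i) for i in range(1, k + 1)], axis=1).dropna()
    X = np.column_stack([np.ones(len(lagged)), lagged.values])
    beta, *_ = np.linalg.lstsq(X, y.iloc[k:].values, rcond=None)
    return beta[-1]  # coefficient of y_{t-k}, controlling for the intermediate lags

k = 2
print(pacf_by_regression(AR_df["AR"], k))
print(pacf(AR_df["AR"], nlags=k, method="ols")[k])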


The PACF for Moving Average Processes

What does the PACF of an MA(q) process look like? Let us look at an example (below we use an MA(1) process with a coefficient close to the invertibility boundary, so that the pattern is clearly visible).

Hide the code
MA = ArmaProcess([1, 0], [1, -0.999]).generate_sample(nsample=n)

MA_df = pd.DataFrame({"MA":MA})

MA_df.head()
MA
0 -0.513720
1 -0.619581
2 1.707659
3 -0.650540
4 1.137666
Hide the code
df_ts = MA_df
var = "MA"
title="MA(1) Process"

fig, axs = plt.subplots(3, 1, figsize=(8, 9), sharex=False, sharey=False)

ax0 = axs[0]
df_ts[var].plot(color="blue", ax=ax0)
pd.DataFrame({"t":range(df_ts.shape[0]), var:df_ts[var]}).plot(color="blue", kind="scatter", x="t", y=var, ax=ax0)
ax0.set_title(title)
ax0.grid(visible=True, which='both', axis='x')
ax0.grid(visible=False, which='major', axis='y')

ax1 = axs[1]
sm.graphics.tsa.plot_acf(df_ts[var].dropna(), ax=ax1, lags=25, zero=False, title='ACF')
ax1.set(ylim=(-1,1), xlabel='Lag', ylabel='acf')


ax2 = axs[2]
sm.graphics.tsa.plot_pacf(df_ts[var].dropna(), ax=ax2, lags=25, zero=False, title='PACF')
ax2.set(ylim=(-1,1), xlabel='Lag', ylabel='pacf')

plt.tight_layout()
plt.show();plt.close()

In this case the situation is reversed: the relevant information for a pure MA process is in the ACF, while the PACF can exhibit a more complex pattern of decay.


ARMA Processes

Definition of the ARMA(p, q) Process

A stochastic process is an ARMA(p, q) process if it satisfies the equation \[ y_{t} = \phi_{1}y_{t-1} + \cdots + \phi_{p}y_{t-p} + \theta_{1}\varepsilon_{t-1} + \cdots + \theta_{q}\varepsilon_{t-q} + \varepsilon_{t} \] where, as usual, \(\varepsilon_{t}\) is white noise, and \((p, q)\) are jointly called the order of the process.

The expression of the ARMA(p, q) process in terms of characteristic polynomials is: \[ \Phi(B)\,y_t = \Theta(B)\varepsilon_t \] And the process is stationary and invertible if the roots of both \(\Phi(B)\) and \(\Theta(B)\) are outside the complex unit circle.

Experimenting with the Identification of ARMA(p, q) Processes with the ACF and PACF

In the next code cells we will first load an external script with a couple of auxiliary functions: generate_custom_ARMA and plot_acf_pacf. The first one allows you to choose the (p, q) values and generate a time series corresponding to a random ARMA(p, q) process (guaranteed to be stationary and invertible). Then, using the second function, you can plot the ACF and PACF of that time series and see if and how they relate to p and q, using the ideas from the previous paragraphs.

The function generate_custom_ARMA will not generate ARMA(p, q) time series of arbitrary order; it is in fact limited to values \(p, q\) such that \(p + q \leq 2\), as these are the values for the models that we are mostly interested in.

The Identification is Often a Trial and Error Situation

Be warned that it is often not easy to guess the right values of p and q from the ACF and PACF plots. As this identification is part of the modeling process we are going to describe, we are therefore often led to consider several alternative models depending on our choice of p and q. The model fit diagnosis and performance evaluation measures will then be the key to select a final model.

Hide the code
# %run -i "./4_3_auxiliary_ARMA.py"
Hide the code
Y_df, p, q, ar_cff, ma_cff, rts_p, rts_q = generate_custom_ARMA(p=0, q=2, size=1200)

plot_acf_pacf(df_ts= Y_df, var="Y", title="ARMA Process" )

Exercise 003. IMPORTANT!

After experimenting for a while with known values of p and q, you can run the generate_custom_ARMA without selecting p, q values. Those values will then be selected at random:

    Y_df, p, q, ar_cff, ma_cff, rts_p, rts_q = generate_custom_ARMA(size=1200)  

After running this code (and without cheating by looking at the values of p, q) plot the ACF and PACF. Make an educated guess!

Then run

    print(p, q)
    print(ar_cff, ma_cff)
    print(np.absolute(rts_p), np.absolute(rts_q))

to see the p, q values, the coefficients and the modulus for each root of both characteristic polynomials.


ARIMA Models

ARIMA Meaning

If our series looks stationary, we can try to find an ARMA(p, q) process/model that fits our time series (and verifies the stationarity and invertibility conditions). This strategy often succeeds, as the ARMA family of processes is flexible enough to capture the dynamics of many time series coming from natural phenomena or human related activities. That still holds even if we keep to low order ARMA(p, q) processes, say with \(p, q < 5\).

But what if our series is not stationary? Then we can try differencing it until it becomes stationary. This is the idea behind the ARIMA model. Here we will introduce the ARIMA model for the case of a non-seasonal time series. The seasonal case will be introduced in coming sessions.
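As a small sketch of this idea (an illustrative helper, assuming a univariate pandas Series y; later sections follow a more careful, step-by-step procedure):

Hide the code
# Sketch: keep differencing while the ADF test cannot reject the unit-root (non-stationary) null
def difference_until_stationary(y, alpha=0.05, max_d=2):
    d = 0
    # adfuller returns the test statistic first and the p-value second
    while d < max_d and adfuller(y.dropna())[1] > alpha:
        y = y.diff()
        d += 1
    return y.dropna(), d

In practice, utilities such as pmdarima's ndiffs perform a similar computation to suggest the number of regular differences d.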

Fitting an ARIMA Model with the Box-Jenkins Methodology

The above idea was organized into a methodology by Box and Jenkins in the 1970s. The methodology, described in their book (Box et al. 2015), is based on the following steps:

Step 1: Data preparation

1.1 Divide the data in Training and Test.
1.2 Transform data to stabilize variance if needed (e.g. using a Box-Cox transformation).
1.3 Difference data to obtain stationary series if needed.

Step 2: Model selection

2.1 Examine data, ACF and PACF to identify a potential model or models.

Step 3: Estimation / Fit

3.1 Estimate parameters in potential models: fit the model.

Step 4: Model Diagnosis

4.1 Examine the coefficients and residuals.

Step 5: Forecasting and Model Comparison

5.1 Obtain the model forecasts for the test set.
5.2 Use those forecasts to measure the model performance and compare candidate models.
5.3 If transformations were applied to stabilize variance or if differencing was required, make sure to transform back the forecasts.
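
Schematically, and only as a preview, the workflow could look like the sketch below. This is an illustration assuming a univariate pandas Series y with a regular DatetimeIndex and an order (1, 1, 1) chosen arbitrarily; in practice the order must come from Step 2. The rest of this section applies the methodology carefully, step by step, to a real series.

Hide the code
# Sketch of the Box-Jenkins workflow; the order (1, 1, 1) is an assumption for illustration only
y_tr, y_te = temporal_train_test_split(y, test_size=0.2)            # Step 1.1

boxcox = BoxCoxTransformer()                                         # Step 1.2 (only if needed)
y_tr_bc = boxcox.fit_transform(y_tr)

fit = smARIMA(y_tr_bc, order=(1, 1, 1)).fit()                        # Step 3 (d=1 performs the differencing of Step 1.3)
print(fit.summary())                                                 # Step 4: coefficients and residuals

y_pred = boxcox.inverse_transform(fit.forecast(steps=len(y_te)))     # Steps 5.1 and 5.3
print(mean_absolute_error(y_te, y_pred))                             # Step 5.2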


ARIMA Model Examples

In the following code cells we will show how to apply the Box-Jenkins methodology to fit an ARIMA model to a time series. We will use the ARIMA class from the statsmodels library.

The Arima_examples.csv file contains five daily time series Y1, ..., Y5 that can be used as examples to illustrate the methodology. Here we will use Y2 as an example, leaving the other series for you to experiment with. The EDA can be kept to a minimum: these series do not contain missing values or time gaps.

Hide the code
df_ts = pd.read_csv('Arima_examples.csv', sep=";")
df_ts.head()
date Y1 Y2 Y3 Y4 Y5
0 1995-01-01 -0.587047 0.993847 0.983347 0.987 1.022369
1 1995-01-02 -1.427480 0.985104 0.950711 0.969 1.071212
2 1995-01-03 0.059141 1.000622 0.932761 0.958 1.140348
3 1995-01-04 0.653344 1.006893 0.940219 0.999 1.228530
4 1995-01-05 -1.788129 0.981376 0.941066 0.952 1.337822

Let us check the structure of the dataset using info.

Hide the code
df_ts.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1096 entries, 0 to 1095
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   date    1096 non-null   object 
 1   Y1      1096 non-null   float64
 2   Y2      1096 non-null   float64
 3   Y3      1096 non-null   float64
 4   Y4      1096 non-null   float64
 5   Y5      1096 non-null   float64
dtypes: float64(5), object(1)
memory usage: 51.5+ KB
Hide the code
df_ts["date"] = pd.to_datetime(df_ts["date"], format="%Y-%m-%d")
df_ts.set_index("date", inplace=True)
df_ts.index.freq = 'D'
df_ts.head()
Y1 Y2 Y3 Y4 Y5
date
1995-01-01 -0.587047 0.993847 0.983347 0.987 1.022369
1995-01-02 -1.427480 0.985104 0.950711 0.969 1.071212
1995-01-03 0.059141 1.000622 0.932761 0.958 1.140348
1995-01-04 0.653344 1.006893 0.940219 0.999 1.228530
1995-01-05 -1.788129 0.981376 0.941066 0.952 1.337822

Let us select one of the time series (and drop the rest).

Hide the code
Y = "Y2"
df_ts = df_ts[[Y]].copy()

We use the plot_acf_pacf function to plot the series along with its ACF and PACF.

Hide the code
plot_acf_pacf(df_ts=df_ts, var=Y)

We can clearly see that the series is not stationary; both the time plot and the ACF make this clear. So we know that we will have to apply differencing below. But first let us split the data into training and test sets and deal with the variance. We make explicit copies here to avoid warnings below about train and test being views of the original data rather than independent copies.

We will use the last 8 weeks (roughly two months) of the series as the test set. This is common practice in time series forecasting: instead of using proportions such as 80% of the data for training and 20% for testing, we often use some natural temporal division of the data.

Hide the code
y_train, y_test = temporal_train_test_split(df_ts, test_size= 7 * 8)
y_train = y_train.copy()
y_test = y_test.copy()
y_train.tail()
Y2
date
1997-11-01 0.927600
1997-11-02 0.931670
1997-11-03 0.951108
1997-11-04 0.940791
1997-11-05 0.920038
Hide the code
y_test.head(12)
Y2
date
1997-11-06 0.944234
1997-11-07 0.955646
1997-11-08 0.957338
1997-11-09 0.949454
1997-11-10 0.969647
1997-11-11 0.962780
1997-11-12 0.960357
1997-11-13 0.978204
1997-11-14 0.957904
1997-11-15 0.966769
1997-11-16 0.973134
1997-11-17 0.963750
Hide the code
fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(y_train, label='train')
ax.plot(y_test, label='test')
ax.legend(["train", "test"])
plt.show();plt.close()

Stabilizing the Variance. Box-Cox Transformations.

The plot of the training set suggests that the variance of the series varies with time (the amplitude of the oscillations increases in the later years). We are therefore inclined to apply a Box-Cox transformation to the series. Recall that we can formally check for homogeneous variance with a Breusch-Pagan test, as we do next.

Also, before proceeding further, note that seasonality does not appear to be present in the series. We will therefore only consider a regular ARIMA model; we will deal with the seasonal case in the next session.

This is the code to run the Breusch-Pagan test. We make a copy of the series to set up the required linear regression.

Hide the code
y_bxcx = y_train.reset_index(drop=True).reset_index()
y_bxcx.columns = ['time', 'value']
y_bxcx
formula = 'value ~ time'
olsr = ols(formula, y_bxcx).fit()

_, breusch_pagan_p_value, _, _ = sms.het_breuschpagan(olsr.resid, olsr.model.exog)
breusch_pagan_p_value 
6.667264111720864e-23

The p-value is very low, so we can reject the null hypothesis of homoscedasticity. We will therefore apply the Box-Cox transformation to the series. If you decide not to use it, you can simply set box_cox = False in the next cell.

Hide the code
box_cox = True

if box_cox:
    y_train_original = y_train.copy()
    bxcx_transformer = BoxCoxTransformer()
    transformed_y_train = bxcx_transformer.fit_transform(y_train[[Y]])
    y_train[Y + "_bxcx"] = transformed_y_train
    # y_train.insert(-1, Y + "_bxcx", transformed_y_train)
    
    Y_original = Y
    Y = Y + "_bxcx"
    print(f"lambda = {bxcx_transformer.lambda_:.3f}")
    plot_acf_pacf(df_ts=y_train, var=Y)
lambda = -2.673

The time plot shows that the variance is now more stable. But it clearly still shows a trend and the ACF confirms that the series is not stationary. We will therefore apply regular differencing to the series.

Before moving on to that, recall that when we apply the Box-Cox transformation to the test set we will be using the lambda value learned from the training set (with transform instead of fit_transform). But instead of doing that manually we are going to use sktime to structure the process in a pipeline-like way, similar to the pipelines we have used with sklearn.
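For reference, the manual version would look like the following sketch (assuming the bxcx_transformer fitted on the training set above; we never call fit_transform on the test set, as that would re-estimate lambda and leak information from the test period).

Hide the code
# Sketch of the manual alternative: reuse the lambda estimated on training.
if box_cox:
    y_test_bxcx = bxcx_transformer.transform(y_test[[Y_original]])
    print(f"lambda reused from training: {bxcx_transformer.lambda_:.3f}")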


Deciding if Differencing is Required

From our EDA we already know that differencing is required, but for the sake of completeness we will include the result of the ADF test before and after differencing. Recall that the null hypothesis of the ADF test is that the series is non-stationary. Therefore, if the p-value is small, we reject the null hypothesis and conclude that the series is stationary. We will also use ndiffs from the pmdarima library to determine the number of differences required to make the series stationary.

The p-value before differencing is high, clearly suggesting the need for differencing.

Hide the code
ADF_pvalue = adfuller(y_train[Y])[1]
ADF_pvalue
0.5136472789500353

The result of ndiffs is 1, which is consistent with the ADF test. We will therefore apply first order differencing to the series.

Hide the code
ndiffs, nsdiffs = pmd.arima.ndiffs, pmd.arima.nsdiffs

d = ndiffs(y_train[Y])

if (d > 0):
    # Note: with d = 1 this is a first difference; for d > 1 the difference
    # would have to be applied d times (diff(periods=d) is a lag-d difference).
    y_train[Y + "_diff"] = y_train[Y].diff(periods=d)
    Ydiff = Y + "_diff"

print(f"Number of applied regular differences = {d}")    
    
    
Number of applied regular differences = 1

The p-value after differencing is very low, confirming that the series is now stationary. Note that we use dropna to remove the first value of the series, which after differencing becomes missing and could cause problems in the ADF computation.

Hide the code
ADF_pvalue = adfuller(y_train[Ydiff].dropna())[1]
print(f"The ADF p-value is {ADF_pvalue}")
The ADF p-value is 5.392828155316003e-26

The goal of this step is not to actually transform the series and keep working with the differenced version. What we want is the value of d, so that we can provide it to the ARIMA model, which will take care of the differencing internally.


Selecting the ARMA(p, q) Component Candidate

After differencing we check the ACF and PACF again, both to confirm stationarity and, importantly, to identify the candidate values of p and q that define the structure of the ARMA part. These values, together with the value of d that we have just obtained, will be used to fit an ARIMA(p, d, q) model to the series (this is the standard naming convention).

Hide the code
plot_acf_pacf(df_ts=y_train, var=Ydiff, lags=25, plot_points=False)

The above diagram suggests a mix of AR and MA components. We always try to keep the orders low, following the parsimony principle, but we must pay attention to the (significantly) non-zero coefficients. In particular, the second spike in both the ACF and the PACF suggests p = 2 and q = 2. We will therefore fit an ARIMA(2, 1, 2) model to the series.

Hide the code
p = 2
q = 2

Fitting an ARIMA Model the statsmodels way

Note: this was the only method we used last year. We keep it here because the sktime alternative is still experimental for us, and also because it is probably the most common way to fit ARIMA models in Python.

Now we pass those values to the ARIMA class from the statsmodels library (we call it smARIMA because we will soon use another version) and we use the fit method to estimate the parameters of the model. Note that, unlike in sklearn, the data is not passed to the fit method but to the endog parameter of the ARIMA object. Note also that we pass the non-differenced series, as the ARIMA object takes care of the differencing.

Hide the code
arima_model = smARIMA(endog = y_train[Y], order=(p, 1, q), trend="n").fit() 
model_name = "ARIMA_" + str(p) + "_" + str(d) + "_" + str(q)
ARIMA Model Diagnosis. Step 1: significance of the coefficients (and possible refitting).

Next we check whether the coefficients of this model are statistically significant.

Hide the code
print(arima_model.summary())
                               SARIMAX Results                                
==============================================================================
Dep. Variable:                Y2_bxcx   No. Observations:                 1040
Model:                 ARIMA(2, 1, 2)   Log Likelihood                3307.808
Date:                Tue, 25 Mar 2025   AIC                          -6605.616
Time:                        11:45:38   BIC                          -6580.886
Sample:                    01-01-1995   HQIC                         -6596.234
                         - 11-05-1997                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1         -0.4689      0.054     -8.701      0.000      -0.574      -0.363
ar.L2         -0.0243      0.055     -0.440      0.660      -0.133       0.084
ma.L1          0.2489      0.047      5.321      0.000       0.157       0.341
ma.L2         -0.5312      0.048    -11.128      0.000      -0.625      -0.438
sigma2         0.0001   4.25e-06     23.609      0.000     9.2e-05       0.000
===================================================================================
Ljung-Box (L1) (Q):                   0.01   Jarque-Bera (JB):                 3.03
Prob(Q):                              0.94   Prob(JB):                         0.22
Heteroskedasticity (H):               0.79   Skew:                             0.10
Prob(H) (two-sided):                  0.03   Kurtosis:                         3.17
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

You can see in the table above that the coefficient for the second autoregressive term is not significant. We will therefore refit the model without this term.

Hide the code
p = 1
arima_model = smARIMA(endog = y_train[Y], order=(p, 1, q), trend="n").fit() 
model_name = "ARIMA_" + str(p) + "_" + str(d) + "_" + str(q)
print(arima_model.summary())
                               SARIMAX Results                                
==============================================================================
Dep. Variable:                Y2_bxcx   No. Observations:                 1040
Model:                 ARIMA(1, 1, 2)   Log Likelihood                3307.491
Date:                Tue, 25 Mar 2025   AIC                          -6606.981
Time:                        11:45:38   BIC                          -6587.197
Sample:                    01-01-1995   HQIC                         -6599.476
                         - 11-05-1997                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1         -0.4506      0.046     -9.887      0.000      -0.540      -0.361
ma.L1          0.2340      0.040      5.820      0.000       0.155       0.313
ma.L2         -0.5505      0.025    -21.960      0.000      -0.600      -0.501
sigma2         0.0001   4.26e-06     23.595      0.000    9.21e-05       0.000
===================================================================================
Ljung-Box (L1) (Q):                   0.03   Jarque-Bera (JB):                 2.95
Prob(Q):                              0.87   Prob(JB):                         0.23
Heteroskedasticity (H):               0.79   Skew:                             0.10
Prob(H) (two-sided):                  0.03   Kurtosis:                         3.17
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Now all the coefficients of the model are significant.

ARIMA Model Diagnosis. Step 2: Residuals analysis.

Now we inspect the model’s residuals to see if they qualify as Gaussian white noise. We will draw some diagnostic plots and apply the Ljung-Box test to the residuals.

First we extract the residuals:

Hide the code
residuals = arima_model.resid[d:]
resid_df = pd.DataFrame({'residuals':residuals})

The residual plots below show:

  1. The residuals' ACF shows no significant coefficients. This is a good sign that the residuals are white noise.
  2. The residuals' histogram shows a distribution that is close to normal. This is another good sign.
  3. The Q-Q plot shows that the residuals follow the normal distribution closely.
  4. The time plot of the standardized residuals does not show any discernible pattern, so they can be considered white noise.
Hide the code
arima_model.plot_diagnostics(figsize=(10, 8), acf_kwargs={'zero':False});
plt.show();plt.close()

Our final residual diagnostic tool is the computation of the Ljung-Box statistic using the first lags of the residuals. All the p-values are over the significance level, so we do not reject the null hypothesis (that the residuals are white noise).

Hide the code
sm.stats.acorr_ljungbox(residuals, np.arange(1, 10, 1))#.sort_values(by="lb_pvalue", ascending=False)
lb_stat lb_pvalue
1 0.029513 0.863600
2 0.033476 0.983401
3 0.526751 0.912976
4 1.097646 0.894645
5 1.652520 0.894826
6 2.905425 0.820620
7 3.283285 0.857618
8 5.665153 0.684682
9 5.731089 0.766501

Fitting an ARIMA Model the sktime way

Now we will essentially repeat the above steps using the ARIMA class from the sktime library (called skARIMA here for clarity). The main difference is that we will use a pipeline to structure the process. In this case the pipeline will only include the Box-Cox transformation and the model fitting. But even with such a simple pipeline we will see the big advantages of this approach.

We define the pipeline as a TransformedTargetForecaster object, a pipeline-like object that contains a description of the modeling process as a list of steps. In this case we use a boolean variable box_cox to decide whether we want to apply the transformation or not and change the pipeline accordingly. The pipeline itself is called skARIMA_forecaster. Note that we are reusing the results of our previous work: we have already decided whether to apply the Box-Cox transformation, and we have already chosen the (p, d, q) structure of the ARIMA model.

Hide the code
from sktime.forecasting.compose import TransformedTargetForecaster
from sktime.forecasting.arima import ARIMA as skARIMA

if not box_cox:
    print("Not using Box-Cox transformation")
    bxcx_trnsf = BoxCoxTransformer(method = "fixed", lambda_fixed=1)
else:
    print("Using Box-Cox transformation")
    bxcx_trnsf = BoxCoxTransformer()


skARIMA_forecaster = TransformedTargetForecaster(steps=[
    ("boxcox", bxcx_trnsf),
    ("model", skARIMA(order=(p,d,q), with_intercept=False))
])
Using Box-Cox transformation

Now we can use the fit method of the pipeline to fit the model to the training set. Some remarks:

  • Since this includes BoxCox as part of the pipeline, we use the column of y_train containing the untransformed (pre-BoxCox) values.
  • Note that the syntax is more sklearn-like, as we pass the training set to the fit method of the pipeline.
Hide the code
skARIMA_forecaster.fit(y_train[Y_original])
TransformedTargetForecaster(steps=[('boxcox', BoxCoxTransformer()),
                                   ('model',
                                    ARIMA(order=(1, 1, 2),
                                          with_intercept=False))])

Once the model has been fitted, we can access the fitted forecaster (the forecaster_ attribute) to see a summary almost identical to the one we get from the statsmodels ARIMA model.

Hide the code
skARIMA_forecaster.forecaster_.summary()
SARIMAX Results
Dep. Variable: y No. Observations: 1040
Model: SARIMAX(1, 1, 2) Log Likelihood 3307.491
Date: Tue, 25 Mar 2025 AIC -6606.981
Time: 11:45:39 BIC -6587.197
Sample: 01-01-1995 HQIC -6599.476
- 11-05-1997
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
ar.L1 -0.4506 0.046 -9.887 0.000 -0.540 -0.361
ma.L1 0.2340 0.040 5.820 0.000 0.155 0.313
ma.L2 -0.5505 0.025 -21.960 0.000 -0.600 -0.501
sigma2 0.0001 4.26e-06 23.595 0.000 9.21e-05 0.000
Ljung-Box (L1) (Q): 0.03 Jarque-Bera (JB): 2.95
Prob(Q): 0.87 Prob(JB): 0.23
Heteroskedasticity (H): 0.79 Skew: 0.10
Prob(H) (two-sided): 0.03 Kurtosis: 3.17


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Extracting the residuals is a bit trickier, because the sktime ARIMA implementation essentially wraps the one from pmdarima but does not expose all the methods of the underlying pmdarima objects. We have to use sktime's own predict_residuals method to obtain the residuals.

Hide the code
residuals = skARIMA_forecaster.predict_residuals(y_train[Y_original]).dropna()
resid_df = pd.DataFrame({'residuals':residuals})

Similarly, the diagnostic plots for the residuals are a bit more complicated, but not much.

Hide the code
plt.figure(figsize=(12, 8))

plt.subplot(221)
plt.plot(residuals)
plt.title("Residuals")

plt.subplot(222)
sns.histplot(residuals, kde=True)
plt.title("Histogram and density plot of residuals")

plt.subplot(223)
sm.qqplot(residuals, line='s', ax=plt.gca())
plt.title("Q-Q Plot for residuals")

plt.subplot(224)
sm.graphics.tsa.plot_acf(residuals, lags=40, ax=plt.gca(), zero=False)
plt.title("ACF of Residuals")

plt.tight_layout()
plt.show()

The model is the same, so we already know that these residuals behave like white noise. The Ljung-Box test is obtained just as before, since it only depends on the residuals.

Hide the code
sm.stats.acorr_ljungbox(residuals, np.arange(1, 10, 1))#.sort_values(by="lb_pvalue", ascending=False)
lb_stat lb_pvalue
1 0.018203 0.892678
2 0.027812 0.986190
3 0.482139 0.922797
4 1.295080 0.862210
5 1.838202 0.871049
6 3.069606 0.800061
7 3.118964 0.873783
8 5.511283 0.701790
9 5.511552 0.787632

Forecasting performance of ARIMA models

Cross validation scores with ARIMA Models

The major difference between the two libraries is in the prediction step. We will see that in this case the sktime implementation is easier to use.

We will use temporal cross validation to evaluate the model. We do this with the ExpandingWindowSplitter, which we already met with the naive models. We will use a forecasting horizon of length 1 (one-step-ahead forecasting), and we will keep the first half of the training set as the initial window, to ensure that the model has enough data to make the first forecast.

Hide the code
int(np.ceil(y_train.shape[0] * 0.5)), len(y_test.index)
(520, 56)
Hide the code
validation_fh = ForecastingHorizon(1, is_relative=True, freq="D")
# validation_fh = ForecastingHorizon(range(1, len(y_test.index) + 1), is_relative=True, freq="D")
validation_fh
ForecastingHorizon([1], dtype='int64', is_relative=True)

This defines our cross validation strategy:

Hide the code
cv = ExpandingWindowSplitter(
    initial_window=int(np.ceil(y_train.shape[0] * 0.5)),  
    fh=validation_fh
)
Hide the code
crossval_splits = list(cv.split(y_train))
print(f"Length of the training part of the first cross-validation split: {crossval_splits[0][0].shape[0]}")
print(f"Length of the validation part of the first cross-validation split: {crossval_splits[0][1].shape[0]}")
Length of the training part of the first cross-validation split: 520
Length of the validation part of the first cross-validation split: 1

Now we are ready to get the validation predictions and scores. For the naive models we did this using a for loop, but here we can use the evaluate function of sktime, which, as described in the documentation, is a performance benchmarking utility. It has the virtue of sparing us from explicitly writing the looping code.

A very important parameter of evaluate is strategy. Using refit as the strategy means that the model is retrained at each step of the cross validation. This is what we called a rolling forecasting origin strategy.
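For intuition, here is a rough sketch of what evaluate does for us with strategy="refit" (a simplified, hypothetical version of its internal loop, using the objects defined above):

Hide the code
# Simplified sketch of evaluate(..., strategy="refit"): for every split the
# forecaster is cloned, refitted on the expanding training window, and its
# one-step-ahead prediction is scored.
fold_scores = []
for train_idx, val_idx in cv.split(y_train[Y_original]):
    y_fold_train = y_train[Y_original].iloc[train_idx]
    y_fold_val = y_train[Y_original].iloc[val_idx]
    forecaster = skARIMA_forecaster.clone()     # fresh copy, so a full refit
    forecaster.fit(y_fold_train)
    y_fold_pred = forecaster.predict(fh=validation_fh)
    fold_scores.append(mean_absolute_error(y_fold_val, y_fold_pred))
# np.mean(fold_scores) should be close to the evaluate result obtained below.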

Note: running this code took around two minutes on my laptop. If computation time is a concern, you can increase the step_length parameter of the ExpandingWindowSplitter to reduce the number of steps. You may also get some warnings about convergence failures for certain folds.

Hide the code
cv_results = evaluate(
    forecaster=skARIMA_forecaster,
    y=y_train[Y_original],
    cv=cv,
    strategy="refit",  # refit model on each fold
    scoring=mean_absolute_error,
    return_data=True
)
/Users/fernando/miniconda3/envs/MLMIC25fc/lib/python3.10/site-packages/statsmodels/base/model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
(the same ConvergenceWarning is repeated for several other folds)

The results of cross validation are stored in the cv_results object. This is a pandas dataframe that contains the predictions and the actual values for each fold, as well as the metrics for each fold.

Hide the code
cv_results.head()
test__DynamicForecastingErrorMetric fit_time pred_time len_train_window cutoff y_train y_test y_pred
0 0.016129 0.223771 0.006479 520 1996-06-03 date 1995-01-01 0.993847 1995-01-02 0.98... date 1996-06-04 1.076537 Freq: D, Name: Y2,... 1996-06-04 1.092665 Name: Y2, dtype: float64
1 0.002245 0.119813 0.007738 521 1996-06-04 date 1995-01-01 0.993847 1995-01-02 0.98... date 1996-06-05 1.081851 Freq: D, Name: Y2,... 1996-06-05 1.079606 Name: Y2, dtype: float64
2 0.002024 0.140400 0.013789 522 1996-06-05 date 1995-01-01 0.993847 1995-01-02 0.98... date 1996-06-06 1.090313 Freq: D, Name: Y2,... 1996-06-06 1.088289 Name: Y2, dtype: float64
3 0.013451 0.152167 0.009819 523 1996-06-06 date 1995-01-01 0.993847 1995-01-02 0.98... date 1996-06-07 1.098762 Freq: D, Name: Y2,... 1996-06-07 1.085311 Name: Y2, dtype: float64
4 0.009020 0.160006 0.016635 524 1996-06-07 date 1995-01-01 0.993847 1995-01-02 0.98... date 1996-06-08 1.105986 Freq: D, Name: Y2,... 1996-06-08 1.096966 Name: Y2, dtype: float64

Be careful: if the forecasting horizon is more than one time step, each element of the predictions column in fact contains a time series, namely the predictions for all the steps of the horizon. For example, the first row contains arrays for the first training fold, the corresponding validation fold, and the predictions for that first validation fold. Their shapes are:

Hide the code
fold = 0
cv_results['len_train_window'][fold], cv_results['y_test'][fold].shape, cv_results['y_pred'][fold].shape
(520, (1,), (1,))
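If we had used a multi-step horizon instead, each y_pred entry would itself be a whole forecast series. For example (a sketch, not evaluated here):

Hide the code
# Sketch: a seven-step-ahead horizon instead of the one-step horizon above.
multi_fh = ForecastingHorizon(np.arange(1, 8), is_relative=True, freq="D")
cv_multi = ExpandingWindowSplitter(
    initial_window=int(np.ceil(y_train.shape[0] * 0.5)),
    fh=multi_fh,
)
# With evaluate(..., cv=cv_multi, return_data=True), every element of the
# y_pred column would be a series of 7 forecasts rather than a single value.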

The first column contains the validation scores for each fold (using MAE as metric in this case). We can summarize it by averaging:

Hide the code
cv_results.iloc[:, 0].mean()
0.008405342278120153

And we can also visualize the values of the scores across the different folds (as the window expands).

Hide the code
cv_results.iloc[:, 0].plot()
plt.xlabel("Fold")
plt.ylabel("MAE")
plt.show();plt.close()  

Test score for ARIMA Models with the rolling forecasting origin strategy

Getting the test-set score is just like getting the score of one additional fold. We use evaluate with the strategy parameter set to refit and the cv parameter set to an ExpandingWindowSplitter whose initial window is the whole training set and whose step length is 1.

Hide the code
# window_length = y_train.shape[0] 

# test_fh = [1]
test_fh = validation_fh

step_length = 1

# Create a splitter that walks through the last 100 points
cv_test = ExpandingWindowSplitter(
    initial_window=y_train.shape[0], 
    step_length=step_length,
    fh=test_fh
)



# Evaluate using refitting at each step
test_results = evaluate(
    forecaster=skARIMA_forecaster,
    y=df_ts[Y_original],
    cv=cv_test,
    strategy="refit",
    scoring=mean_absolute_error,
    return_data=True
)
/Users/fernando/miniconda3/envs/MLMIC25fc/lib/python3.10/site-packages/statsmodels/base/model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
(the same ConvergenceWarning is repeated for several other folds)

The results are returned in a pandas dataframe like the one for cross validation.

Hide the code
test_results.head()
test__DynamicForecastingErrorMetric fit_time pred_time len_train_window cutoff y_train y_test y_pred
0 0.013276 0.177430 0.009469 1040 1997-11-05 date 1995-01-01 0.993847 1995-01-02 0.98... date 1997-11-06 0.944234 Freq: D, Name: Y2,... 1997-11-06 0.930958 Name: Y2, dtype: float64
1 0.013905 0.226878 0.005811 1041 1997-11-06 date 1995-01-01 0.993847 1995-01-02 0.98... date 1997-11-07 0.955646 Freq: D, Name: Y2,... 1997-11-07 0.941741 Name: Y2, dtype: float64
2 0.011053 0.223991 0.005926 1042 1997-11-07 date 1995-01-01 0.993847 1995-01-02 0.98... date 1997-11-08 0.957338 Freq: D, Name: Y2,... 1997-11-08 0.946285 Name: Y2, dtype: float64
3 0.002280 0.193312 0.005919 1043 1997-11-08 date 1995-01-01 0.993847 1995-01-02 0.98... date 1997-11-09 0.949454 Freq: D, Name: Y2,... 1997-11-09 0.951734 Name: Y2, dtype: float64
4 0.023035 0.291921 0.010263 1044 1997-11-09 date 1995-01-01 0.993847 1995-01-02 0.98... date 1997-11-10 0.969647 Freq: D, Name: Y2,... 1997-11-10 0.946613 Name: Y2, dtype: float64

The average test score (MAE) is in line with what we got from validation:

Hide the code
test_results.iloc[:, 0].mean()
0.008843598177944362
Understanding the difference between rolling forecasting origin and using predict with the test set

Based on our experience with sklearn regression models, you might think that to get the test score we could simply use this code:

Hide the code
y_test_fh = ForecastingHorizon(y_test.index, is_relative=False, freq="D")
y_test_pred = skARIMA_forecaster.predict(fh = y_test_fh)
mean_absolute_error(y_test[Y_original], y_test_pred)
0.03540233165594314

How is this different (and why is it so comparatively bad)? How are those predictions from predict obtained? The key concept is the strategy parameter of evaluate. The refit strategy means that the model is retrained at each step of the cross validation; this is what we called a rolling forecasting origin strategy. With strategy = no-update_params, on the other hand, the model is not refitted. This is similar to what happens when we call predict with the test set: a single fit of the model on the whole training data is used to predict the entire test set. This of course implies that the quality of the predictions deteriorates very quickly as we move further into the future; in fact, ARIMA forecasts without refitting (without a rolling origin) revert to the mean of the series after a few steps. The following plot illustrates this for our example.

Hide the code
fig, ax = plt.subplots(figsize=(12, 6))
pd.DataFrame({"skpred":[(test_results.iloc[idx, -1].values)[0] for idx in test_results.index]}, index = y_test.index).plot(ax = ax, color="red", label="sktime ARIMA")
y_train[Y_original].tail(50).plot(ax = ax, color="black", label="Train")
y_test.plot(ax = ax, color="blue", label="Test (rolling forecast with refit)")
y_test_pred.plot(ax = ax, color="green", label="Test with predict")
plt.legend(["sktime ARIMA", "Train", "Test (rolling forecast with refit)", "Test (with a single call of predict)"])
plt.show();plt.close()
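As a side note, the no-refit behaviour can also be requested through evaluate itself by switching the strategy (a sketch under the same setup as above; note that this still advances the forecasting origin fold by fold, it only skips re-estimating the parameters):

Hide the code
# Sketch: fit once on the initial window and keep the parameters fixed
# afterwards (strategy="no-update_params"). Uses the cv_test splitter and
# the forecaster defined in the previous cells.
test_results_nofit = evaluate(
    forecaster=skARIMA_forecaster,
    y=df_ts[Y_original],
    cv=cv_test,
    strategy="no-update_params",
    scoring=mean_absolute_error,
    return_data=True,
)
# test_results_nofit.iloc[:, 0].mean() gives the corresponding average MAE,
# which can be compared with the refit score obtained above.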

Forecasting and Model Performance with statsmodels ARIMA

We said before that the sktime implementation offers the advantage of simplicity over the statsmodels implementation. In this section we will show how to use statsmodels to make the same predictions as we obtained with the rolling forecasting origin strategy.

Note: This is our code from last year; we include it here for completeness and as a fallback in case of sktime problems.

We begin by creating a dataframe to store the predictions. Initially it only contains the original time series.

Hide the code
df_ts_pred = df_ts.copy() 
df_ts_pred.head()
Y2
date
1995-01-01 0.993847
1995-01-02 0.985104
1995-01-03 1.000622
1995-01-04 1.006893
1995-01-05 0.981376

Next we apply the Box-Cox transformation to the series and store the transformed values in a new column. Note that we are using the lambda value obtained from the training set (calling transform on the already fitted transformer).

Hide the code
print("The value of lambda used for Box-Cox transformation is ", bxcx_transformer.lambda_)
if box_cox:
    df_ts_pred[Y] = bxcx_transformer.transform(df_ts[Y_original])

# df_ts_pred[Y_original + "_pred"] = np.nan

df_ts_length = df_ts.shape[0]

df_ts_pred.head()
The value of lambda used for Box-Cox transformation is  -2.6727376137143977
Y2 Y2_bxcx
date
1995-01-01 0.993847 -0.006223
1995-01-02 0.985104 -0.015313
1995-01-03 1.000622 0.000621
1995-01-04 1.006893 0.006806
1995-01-05 0.981376 -0.019280

We will now explicitly loop through the values of the test set (this is the rolling origin). At each step we will refit the model with the time series values up to that point and make the prediction for the next step. We will store the predictions in a list that will then be used to fill the corresponding column of the dataframe (for the rows corresponding to the test set).

The values of p, d, q are already known from the previous steps. We will use them to fit the model to the expanding windows.

Hide the code
Y_pred = []
window = p
train_end = len(y_train[Y])
train_end, df_ts_length
(1040, 1096)

This is the prediction loop. We use the get_prediction method of the fitted model, which returns a probabilistic forecast; we are only interested in the point forecasts, stored in its predicted_mean attribute.

Hide the code
for i in range(train_end, df_ts_length, window):
    rolling_model = smARIMA(df_ts_pred[Y][:i], order=(p, d, q), trend="n").fit()
    predictions = rolling_model.get_prediction(0, i + window - 1)
    window_pred = predictions.predicted_mean.iloc[-window:]
    Y_pred.extend(window_pred)
/Users/fernando/miniconda3/envs/MLMIC25fc/lib/python3.10/site-packages/statsmodels/base/model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
(the same ConvergenceWarning is repeated for several other steps)

Now we add these predictions to the dataframe, in a column named after this model.

Hide the code
model_name
model_name = "sm_" + model_name 
df_ts_pred[model_name] = [0] * y_train.shape[0] + Y_pred

We are not done yet! In this manual approach we have to reverse the Box-Cox transformation to get the actual predictions. We will store these in a new column of the dataframe.

Hide the code
df_ts_pred[model_name+"_unbxcx"] = np.nan
df_ts_pred.iloc[(train_end):, -1] = bxcx_transformer.inverse_transform(df_ts_pred[model_name][(train_end):])
df_ts_pred.tail(10)
Y2 Y2_bxcx sm_ARIMA_1_1_2 sm_ARIMA_1_1_2_unbxcx
date
1997-12-22 0.978290 -0.022605 -0.022959 0.977964
1997-12-23 0.965847 -0.036415 -0.019498 0.981172
1997-12-24 0.969644 -0.032132 -0.034488 0.967549
1997-12-25 0.964850 -0.037550 -0.024409 0.976631
1997-12-26 0.976064 -0.025029 -0.039702 0.962969
1997-12-27 0.976755 -0.024275 -0.020158 0.980558
1997-12-28 0.982892 -0.017660 -0.033614 0.968323
1997-12-29 0.976322 -0.024747 -0.014472 0.985901
1997-12-30 0.962605 -0.040121 -0.031428 0.970273
1997-12-31 0.981817 -0.018808 -0.029345 0.972144

Finally we can plot the predictions and the actual values.

Hide the code
fig, ax = plt.subplots(figsize=(12, 6))

y_train[Y_original].tail(50).plot(ax = ax, color="black", label="Train")
y_test.plot(ax = ax, color="red", label="Test")

y_test_pred.plot(ax = ax, color="green", label="Single call of predict (sktime ARIMA)")

df_ts_pred[model_name+"_unbxcx"][(train_end-1):].plot(ax = ax, label="statsmodels (Box-Cox)", linewidth=6, color="blue", alpha=0.5)

pd.DataFrame({"skpred":[(test_results.iloc[idx, -1].values)[0] for idx in test_results.index]}, index = y_test.index).plot(ax = ax, color="orange", label="sktime ARIMA")

plt.axvline(y_train.index[-1], color="magenta", linestyle="--", label="Train/Test Split")

plt.legend(["Train set", "Test set (actual values)", 
            "Single call of predict (sktime ARIMA)", 
            "statsmodels loop", "sktime ARIMA (rolling forecast with refit)"],
            fontsize='x-small')
plt.show();plt.close()

Note that the predictions of sktime with the rolling origin and those from statsmodels are the same (the model is the same, and the predictions are made in the same way). But the framework provided by sktime automates the process, while statsmodels requires quite a lot of manual intervention.

Note also how quickly the prediction deteriorates when we use the model without refitting, with predict used for the whole test forecasting horizon.


In the Next Sessions

Our plan for the next session is to extend these methods to time series with a seasonal component (like the Canadian gas production series) using SARIMA models. Then we will add additional time series as exogenous predictors (SARIMAX models). We will also discuss auto-ARIMA strategies.

References

Box, George E. P., Gwilym M. Jenkins, Gregory C. Reinsel, and Greta M. Ljung. 2015. Time Series Analysis: Forecasting and Control. 5th ed. Hoboken, NJ: John Wiley & Sons. https://www.wiley.com/en-us/Time+Series+Analysis%3A+Forecasting+and+Control%2C+5th+Edition-p-9781118675021.
Hyndman, R. J., and G. Athanasopoulos. 2021. Forecasting: Principles and Practice. OTexts. https://otexts.com/fpp3/.
Peixeiro, M. 2022. Time Series Forecasting in Python. Manning. https://www.manning.com/books/time-series-forecasting-in-python-book.