Based on notes by Professors A.Muñoz, J.Portela & J.Pizarroso
Published
March 2025
Session Setup
Libraries
Let us begin by loading the libraries we will use.
### Load necessary modules -------------------------------
# %config InlineBackend.figure_format = 'png'  # 'png', 'retina', 'jpeg', 'svg', 'pdf'
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context('notebook')
%matplotlib inline

# Machine Learning general libraries
from sklearn.metrics import mean_squared_error, mean_absolute_error

# statsmodels
import statsmodels.stats.api as sms
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Time Series related imports
import datetime
from sklearn.model_selection import TimeSeriesSplit

# pmdarima
import pmdarima as pmd

# sktime
from sktime.datasets import load_airline
from sktime.utils.plotting import plot_series
from sktime.forecasting.base import ForecastingHorizon
from sktime.forecasting.model_selection import temporal_train_test_split
from sktime.split import ExpandingWindowSplitter
from sktime.transformations.series.boxcox import BoxCoxTransformer
from sktime.forecasting.naive import NaiveForecaster
from sktime.forecasting.trend import TrendForecaster
from sktime.forecasting.trend import STLForecaster
from sktime.performance_metrics.forecasting import mean_absolute_error
from sktime.forecasting.model_evaluation import evaluate

import statsmodels.tsa as tsa
from statsmodels.tsa.seasonal import seasonal_decompose, STL
from statsmodels.tsa.stattools import adfuller, kpss
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.tsa.arima.model import ARIMA as smARIMA

# datasets
import rdatasets
import yfinance as yf

# tqdm
from tqdm import tqdm

from fc_4_3_utils.utils import *
Train / Test splits and validation in time series forecasting
Temporal train-test splits
In time series forecasting, we need to be careful when splitting the data into training and testing sets. We cannot use random splits, because we need to preserve the temporal order of the data and we must avoid using future unknown data to predict past data.
Therefore we need to use time-aware splits, where the training set contains the data up to a certain point in time, and the testing set contains the data after that point. The proportion of the split is usually 80/20 or 70/30 of the time span of the series as in other Machine Learning problems. The split is often performed directly by splitting the data into two parts, but some libraries (like sktime) have functions to perform this split.
We will use the Canadian gas production dataset to illustrate these splitting methods.
can_gas = pd.read_csv('../4_1_Introduction_to_Forecasting/tsdata/fpp3/Canadian_gas.csv')
# Set 'Month' as index and set frequency as Month end, required by sktime
can_gas["Month"] = pd.to_datetime(can_gas["Month"], format="%Y %b")
can_gas["Month"] = can_gas["Month"] + pd.offsets.MonthEnd(0)
can_gas.set_index("Month", inplace=True)
can_gas.index.freq = 'M'
can_gas.head()
/var/folders/v_/pl7__6kj0nx_lfsnm4_g91q80000gn/T/ipykernel_93245/490545430.py:7: FutureWarning: 'M' is deprecated and will be removed in a future version, please use 'ME' instead.
can_gas.index.freq = 'M'
Month         Volume
1960-01-31    1.4306
1960-02-29    1.3059
1960-03-31    1.4022
1960-04-30    1.1699
1960-05-31    1.1161
Let us first illustrate the sktime basic split function.
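The cell performing this split is not reproduced above, so here is a minimal sketch of what it could look like (the 0.2 test proportion and the names y, y_train and y_test are our assumptions, chosen to be consistent with the code used later):

y = can_gas["Volume"]
# keep the last 20% of the time span as the test set (assumed proportion)
y_train, y_test = temporal_train_test_split(y, test_size=0.2)
plot_series(y_train, y_test, labels=["y_train", "y_test"])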
We know by now that we should use a validation set to assess the performance of our models and to fine-tune the hyperparameters of our models. In time series forecasting, we should use a validation set that is also time-aware. That is, any validation data should be in the future of the training data. And if we plan to use cross-validation, the same time-awareness extends to the folds of the cross-validation.
There are several different approaches to cross validation in time series forecasting. The expanding window approach is the most common one. In this approach, the training set starts with an initial window and grows with each fold by adding some points at the end of the training set. The number of points added is called the step. The test set always consists of a fixed number of points following the training set. That fixed number of points is called the forecasting horizon (usually represented as fh in many forecasting libraries). Expanding windows with fh=1 are illustrated in the figure below, from Hyndman and Athanasopoulos (2021), Section 5.10. The blue dots represent the training set and the orange dot represents the test set.
The case for fh=4 is illustrated in the figure below, also from Hyndman and Athanasopoulos (2021), Section 5.10.
The expanding window is also often called rolling forecasting origin.
Rolling windows with sktime
The ExpandingWindowSplitter class in the sktime library can be used to create expanding windows for time series forecasting. We illustrate how it works below with a toy example of a few time points identified by their index.
time_points = np.arange(12)
time_points
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
Let us suppose that we want to use: + a minimum training set size of 4 points, + a forecasting horizon of 2 points, + and we want to increase the size of the training set by 2 points at each step (we often use large steps for computational reasons).
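A quick sketch of that toy setup could look as follows (toy_splitter is our own name, and we wrap the integer positions in a pd.Index before passing them to the splitter):

toy_splitter = ExpandingWindowSplitter(initial_window=4, fh=[1, 2], step_length=2)
for train_idx, test_idx in toy_splitter.split(pd.Index(time_points)):
    print("train:", train_idx, "test:", test_idx)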
Change the forecasting horizon parameter to fh = [3] and run the code again. What happens?
Also play with different values of the step parameter to see what happens.
Let us see how to apply this to the training data of the Canadian gas production time series. We will make sure that the training part of each fold is at least 48 months (four years) long, and we will use a forecasting horizon of 6 months with a step of 1 month.
min_train_size = 48
fh = range(1, 7)  # forecasting horizon. We need to exclude 0 and include 6
step = 1
y_train_cv_splitter = ExpandingWindowSplitter(initial_window=min_train_size, fh=fh, step_length=step)
can_gas_fold_idx = list(y_train_cv_splitter.split(y_train))
The splitter provides the indices of the training and testing sets as follows (here we see the first two folds):
print("Number of folds:", len(can_gas_fold_idx))
can_gas_fold_idx[0], can_gas_fold_idx[1]
And we can use these indices to slice the data as follows:
y_train_folds = [y_train.iloc[train_idx] for train_idx, _ in can_gas_fold_idx]
y_val_folds = [y_train.iloc[val_idx] for _, val_idx in can_gas_fold_idx]
Let us plot two of the folds created by the expanding windows to see what they look like. One of the first ones appears on the left and one of the last ones appears on the right.
The TimeSeriesSplit class in the sklearn library can also be used to create expanding windows for time series forecasting. But selecting the right parameters for the split is not as straightforward as with the ExpandingWindowSplitter class in the sktime library. For example, you cannot directly choose an initial minimum training set size, the forecasting horizon is only indirectly defined, and so on. So we do not recommend using it for time series forecasting.
Sliding windows
Other cross-validation strategies are possible, like the sliding window approach. In this approach, the training fold length is fixed while it slides over the time series. The following schematic figure from the sktime website illustrates how the sliding window approach works. Here the asterisks represent the training fold, and the x’s represent the test fold.
|-----------------------|
| * * * * * x x x - - - |
| - * * * * * x x x - - |
| - - * * * * * x x x - |
| - - - * * * * * x x x |
This cross validation strategy is implemented in sktime by the SlidingWindowSplitter class. It works similarly to the ExpandingWindowSplitter class, so we just refer you to the documentation for more details.
Forecasting with naive baseline models
Baseline models
We are now ready to start introducing time series forecasting models. We will start with the simplest ones, the so-called baseline models. These models are used to provide a point of comparison for more advanced models, like the ARIMA models we will introduce below.
Some baseline models use simple time series statistics or values to make predictions, like the mean of the time series or the last observed value. Others are based on the decomposition of the time series into its components: trend, seasonality, and residuals (keep in mind, however, that some decomposition methods are anything but simple, and in that case the name baseline model is a bit of a stretch).
We will illustrate some of these models by applying them to the Canadian gas production dataset. Remember that we have already split the data into training and testing sets, and that we have also created a cross-validation scheme using expanding windows.
Naive baseline models with sktime
The so called naive baseline models are the simplest baseline models. They are all based on the idea that the future behavior is likely to be similar to the patterns observed in the past. The difference between the models is how they define the past and how that past information is used to make predictions.
For example, we can simply decide to use the last observed value as the prediction for the future. This is the so-called last observed naive model. Or we could use the historical mean of the time series as prediction (the so called ‘mean’ naive model). The last seasonal naive model uses the last observed value from the same season of the preceding seasonal period as the prediction for the future. But the mean of that seasonal period can also be used as the prediction for the future.
The NaiveForecaster class in sktime implements several of these naive models by selecting different values for the strategy parameter: + last implements the last observed naive model, + mean implements the mean naive model, + seasonal_last implements the last seasonal naive model, + seasonal_mean implements the last seasonal mean naive model.
Check the documentation for more details and other options. Keep in mind that when using seasonal naive models, you need to specify the seasonal period of the time series. Be careful: choosing the wrong seasonal period can lead to very poor predictions! Always think about the sampling frequency of the data and the compatible seasonal periods.
Let us apply some of these naive models to the Canadian gas production dataset.
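The fitting cell is not reproduced here; a minimal sketch of it (the forecaster names are ours, and we obtain the seasonal variants through the sp argument of NaiveForecaster) could be:

fh_val = list(range(1, 7))                                   # same 6-month horizon as the splitter above
last_obs_forecaster = NaiveForecaster(strategy="last")
mean_forecaster = NaiveForecaster(strategy="mean")
seasonal_last_forecaster = NaiveForecaster(strategy="last", sp=12)   # seasonal naive, monthly period
drift_forecaster = NaiveForecaster(strategy="drift")

# example: fit on the first training fold and forecast its validation window
seasonal_last_forecaster.fit(y_train_folds[0])
seasonal_last_forecaster.predict(fh=fh_val)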
Now let us obtain the predictions for all these models and for each one of the cross validation folds defined above with the expanding windows. We will store the model predictions in lists.
Let us visualize the predictions of all the models for a single fold. You can play with the fold_num parameter to see the predictions for different folds. The extend_idx parameter is used in the visualization to show a portion of the time series after the validation set.
fold_num = 53
extend_idx = y_train_folds[fold_num + 20].index
fig, ax = plt.subplots(figsize=(15, 4))
y.loc[extend_idx].plot(ax=ax, linestyle="dotted")
y_train_folds[fold_num].plot(label='train', ax=ax, linewidth=4)
y_val_folds[fold_num].plot(label='val', ax=ax, linewidth=4)
for model in [last_obs_val_pred, mean_val_pred, seasonal_last_val_pred, seasonal_mean_val_pred, drift_val_pred]:
    model[fold_num].plot(ax=ax, linestyle="--")
ax.legend(["complete training (cropped for clarity)", "fold training data", "fold validation data",
           "last_obs_val_pred", "mean_val_pred", "seasonal_last_val_pred", "seasonal_mean_val_pred",
           "drift_val_pred"], fontsize='small')
plt.show()
plt.close()
Below we zoom in to the validation set to see the predictions in more detail.
extend_idx = y_train_folds[fold_num + 20].index
fig, ax = plt.subplots(figsize=(15, 4))
y.loc[extend_idx][-15:-5].plot(ax=ax, linestyle="dotted")
y_train_folds[fold_num][-25:].plot(label='train', ax=ax, linewidth=4)
y_val_folds[fold_num][-25:].plot(label='val', ax=ax, linewidth=4)
for model in [last_obs_val_pred, mean_val_pred, seasonal_last_val_pred, seasonal_mean_val_pred, drift_val_pred]:
    model[fold_num][-25:].plot(ax=ax, linestyle="--")
ax.legend(["complete training (cropped for clarity)", "fold training data", "fold validation data",
           "last_obs_val_pred", "mean_val_pred", "seasonal_last_val_pred", "seasonal_mean_val_pred",
           "drift_val_pred"], fontsize='xx-small')
plt.show()
plt.close()
As you can see, the naive models are very simple and in examples like this they fail to capture the complexity of the time series dynamics. The closest one to the real data seems to be the seasonal_last naive model, but it is still quite different from the real data. As we said, they are useful as a point of comparison for more advanced models.
Combining the predictions of naive models (ensemble)
It is possible to combine the predictions of some naive models to create an ensemble model that is often more accurate than the individual naive models. This is done by averaging the predictions of the naive models. We have already met this idea in the context of ensemble learning for Regression and Classification problems.
ensemble_model_val_pred = [(dr + seas + last) / 3
                           for dr, seas, last in zip(drift_val_pred, seasonal_last_val_pred, last_obs_val_pred)]
model_name = "ensemble"
modelDict[model_name] = {"model": None, "val_pred": eval(f"{model_name}_model_val_pred")}
Let us plot the prediction of the ensemble model for the same fold as above. We encourage you to play with the fold_num parameter to see the predictions for different folds.
extend_idx = y_train_folds[fold_num + 20].index
fig, ax = plt.subplots(figsize=(15, 4))
y.loc[extend_idx].plot(ax=ax, linestyle="dotted")
y_train_folds[fold_num].plot(label='train', ax=ax, linewidth=4)
y_val_folds[fold_num].plot(label='val', ax=ax, linewidth=4)
ensemble_model_val_pred[fold_num].plot(ax=ax, linestyle="--")
ax.legend(["complete training (cropped for clarity)", "fold training data", "fold validation data",
           "ensemble_val_pred"], fontsize='small')
plt.show()
plt.close()
The ensemble model seems to be a bit better than the individual naive models. How can we turn this into a more formal comparison? In order to do that, we need to calculate some error metrics.
Metrics for time series forecasting. Validation and test metrics.
Since forecasting time series is essentially a regression problem, we can use the same metrics we used for regression problems. The most common metrics are: + Mean Absolute Error (MAE) + Mean Squared Error (MSE) + Root Mean Squared Error (RMSE) + Mean Absolute Percentage Error (MAPE) + Symmetric Mean Absolute Percentage Error (sMAPE)
We recommend reading Section 5.8 of Hyndman and Athanasopoulos (2021) (start from Forecast errors) for a more detailed explanation of these metrics. The first three measures are, as we know, scale dependent. But when we use them to compare different models for the same time series, the scale dependency is not a problem. These measures are implemented in the sklearn library, so we can use them directly.
To compute the validation metrics, we need to calculate the error metrics for each fold of the cross-validation. We can then average the error metrics to get a single value for each metric. For the test set metric things are simple, as we just need to calculate the error metrics for the test set. In the cell code below we use the model dictionary we have built to calculate the error metrics for each model and each fold. We then average the error metrics for each model and each fold. We also calculate the error metrics for the test set.
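That computation is done in a cell that is not reproduced here. A sketch of the idea, assuming each modelDict entry stores a forecaster under "model" and the per-fold predictions under "val_pred", could look like this:

for name, entry in modelDict.items():
    # validation: one MAE per fold, then the average over folds
    fold_maes = [mean_absolute_error(y_val, y_pred)
                 for y_val, y_pred in zip(y_val_folds, entry["val_pred"])]
    entry["average_val_maes"] = np.mean(fold_maes)
    # test: refit on the full training set and predict the test horizon
    if entry["model"] is not None:
        entry["model"].fit(y_train)
        entry["test_pred"] = entry["model"].predict(ForecastingHorizon(y_test.index, is_relative=False))
        entry["test_mae"] = mean_absolute_error(y_test, entry["test_pred"])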
The case of the ensemble model is special, because we do not have a predict method for it. We need to calculate the ensemble prediction and error metrics manually. We will do that in the cell code below.
zipped_preds = zip(modelDict["drift"]["test_pred"].values.ravel(),
                   modelDict["seasonal_last"]["test_pred"].values.ravel(),
                   modelDict["last_obs"]["test_pred"].values.ravel())
ensemble_test_pred = [(dr + seas + last) / 3 for dr, seas, last in zipped_preds]
modelDict['ensemble']['test_pred'] = ensemble_test_pred
modelDict['ensemble']["test_mae"] = mean_absolute_error(y_test, ensemble_test_pred)
We store the error metrics in a dataframe to sort and display them for easy comparison.
model_maes_df = pd.DataFrame({
    'model_name': modelDict.keys(),
    'val_avrg_mae': [modelDict[model_name]['average_val_maes'] for model_name in modelDict.keys()],
    'test_mae': [modelDict[model_name]['test_mae'] for model_name in modelDict.keys()]
})
model_maes_df.sort_values(by='val_avrg_mae', inplace=True)
model_maes_df
   model_name     val_avrg_mae   test_mae
2  seasonal_last  0.555838       1.888272
5  ensemble       0.747182       1.382830
0  last_obs       1.051184       2.124965
4  drift          1.061138       0.708353
1  mean           3.486469       9.750811
3  seasonal_mean  3.537769       9.749002
The table seems to indicate that the seasonal last naive model is the best of the naive models in cross validation, closely followed by the ensemble model. The test set metrics, however, select the drift method as the best one. Keep in mind that the more robust way to select the best model is to use the cross-validation metrics, and that ensembles tend to provide more robust results than individual models. So in this case I would select the seasonal last or the ensemble model as the best one. The situation of the drift model is peculiar: it is the only model that scores better on the test set than in cross-validation. We will now visualize the predictions of all these models for the test set in order to understand why this may be the case. We will plot the real data, the predictions of the naive models and those of the ensemble model.
The plot suggests that in this series the trend component is the most important one, and that the only naive model that partially captures it is the drift model. The rest of the naive models do not really pay attention to the trend component. The ensemble model is a bit better than the other naive models, because the drift model is part of the ensemble, but the other ingredient is in fact making it worse than it should be.
Decomposition based models
Time series dynamics and decomposition
In many cases when the dynamic of a time series is dominated by a strong component, either trend or seasonality, the naive models that pay attention to that component are the best ones. But when both components are important, naive models struggle to capture the complexity of the time series. Some more advanced models try to combine a good prediction of these components.
We recommend reading (all of) Chapter 3 of Hyndman and Athanasopoulos (2021), with a special focus on Sections 3.4 to 3.7. The discussion included there of the methods used by official statistical agencies to decompose time series is relevant, as these methods are often required by regulation to produce official statistics (think of deseasonalized unemployment rates, for example).
Here we will show how to use both a classical additive decomposition and the more advanced STL method to decompose a time series and get forecasts from its components. Later in the course we will introduce the Prophet model released by Facebook's Core Data Science team in 2017, which is a comparatively more recent model that uses a similar decomposition approach. Its main advantages are its simplicity and its ability to capture the complexity of time series with multiple seasonalities. Think, for example, of the electricity demand time series, which has a daily seasonality, a weekly seasonality and a yearly seasonality. Classical decomposition methods are essentially unable to capture all these components, but Prophet was designed to do so. More recently the forecasting community has turned to neural network models, and there is a NeuralProphet model that is a neural network version of Prophet. You can read more about this in this Medium post by the authors of the Prophet model.
Using the classical additive decomposition method
The code below shows how to use this classical decomposition method to forecast the Canadian gas production time series and evaluate its performance with cross validation. The basic ideas are similar to what we have done above, but we have to be careful because the trend component of the classical decomposition method contains missing values at the beginning and at the end of the time series. The (admittedly simplistic) solution we apply is to fill these missing values by using the drift naive model.
Note: sktime includes a TrendForecaster model class to forecast the trend component (using linear regression), but combining its results with the seasonal component can be tricky, so we will use the method described above instead.
We will therefore use a drift naive model to predict both the missing values at the end of the train period and the test period (these form the future_dates forecasting horizon below). Then we will use the seasonal naive model to predict the seasonal component of the time series. We will then add the two components to get the final prediction.
trend_forecaster = NaiveForecaster(strategy="drift") # Drift model for trendtrend_forecaster.fit(trend)trend_pred_future_values = trend_forecaster.predict(fh_future_dates)
Note the overlap with the real data at the end of the train period. Note also that the trend forecast we are using has a local nature, it pays attention to the trend of the last part of the train period.
We now select the part of this trend prediction corresponding to the test set that we will use for the combined forecast.
Now we fit a naive forecast model to the seasonal component of the time series. We will use the seasonal naive model to predict the seasonal component of the time series, as it simply extends the last seasonal period to the test set.
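The corresponding cell is not shown; a minimal sketch, assuming seasonal holds the seasonal component of the training decomposition and that we only need the test dates for the combined forecast, could be:

# seasonal naive: repeat the last observed seasonal cycle (sp=12 for monthly data)
seasonal_forecaster = NaiveForecaster(strategy="last", sp=12)
seasonal_forecaster.fit(seasonal)
seasonal_pred_test = seasonal_forecaster.predict(ForecastingHorizon(y_test.index, is_relative=False))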
Now we can simply add both components for the final test forecast
test_pred = trend_pred_test + seasonal_pred_test
Let us plot the resulting prediction for the test set.
plt.figure(figsize=(10, 5))
plt.plot(y.tail(150), label="Train data", color="black")
plt.plot(y_test, label="Test data", color="blue")
plt.plot(test_pred, label="Classical decomposition prediction", linestyle="dashed", color="red")
plt.axvline(y_train.index[-1], color="orange", linestyle="--", label="Train/Test Split")
plt.legend()
plt.title("Forecasting Canadian Gas Production by extending the classical additive decomposition")
plt.xlabel("Year")
plt.ylabel("Volume")
plt.show()
plt.close()
The forecast looks much better than the naive models, but it clearly deteriorates as we move further into the future. The main reason in this case seems to be a change in the trend component of the time series. The seasonal component is also changing, but the trend component appears to be the one that deviates the most from the prediction of this decomposition model.
Let us store the information for this model in our dictionary. We will not however compute the cross validation metrics for this model, instead we proceed with another more advanced decomposition model.
The STL method is a more advanced decomposition method that is able to capture more complex time series dynamics. Its more recent versions include time series decomposition when several seasonalities are present. See (Section 3.6 of Hyndman and Athanasopoulos 2021) and the references therein for more details. The code below shows how to use this method to forecast the Canadian gas production time series and evaluate its performance with cross validation. We use the implementation of the STL method in the sktime library, because of its similarity to the sklearn library we are already familiar with. Essentially all we need to provide to the method is the seasonal period of the time series.
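A minimal sketch of the fitting step (stl_forecaster and stl_test_pred are our own names; sp=12 is the monthly seasonal period):

stl_forecaster = STLForecaster(sp=12)
stl_forecaster.fit(y_train)
stl_test_pred = stl_forecaster.predict(ForecastingHorizon(y_test.index, is_relative=False))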
With this we are ready to make predictions for both the test set and the cross validation folds. We will store the predictions in the dictionary we have been using.
The performance measures for the STL model are the best we have seen so far.
print(f"Average validation MAE for {model_name}:", modelDict[model_name]["average_val_maes"])
print(f"Test MAE for {model_name}:", modelDict[model_name]["test_mae"])
Average validation MAE for stl: 0.3163585394569614
Test MAE for stl: 0.6423952710618968
And the following plot shows that the predictions for the test set are indeed quite good, even though they are similarly affected by the change in the trend component of the time series for the last seasonal periods. But before that final part, the predictions of the STL are better in general than the ones of the additive decomposition model (and the rest of the naive models).
fig, ax = plt.subplots(figsize=(15, 4))
ax.plot(y.tail(150), label="Actual Data", color="black")
modelDict["clsc_add_dcmp"]["test_pred"].plot(label="Classical decomposition test prediction", linestyle="dashed", color="red", ax=ax)
modelDict["stl"]["test_pred"].plot(label="STL test prediction", linestyle="dashed", color="blue", ax=ax)
plt.axvline(y_train.index[-1], color="blue", linestyle="--", label="Train/Test Split")
plt.legend(["Actual Data", "Classical decomposition test prediction", "STL test prediction"])
plt.title("Forecasting Canadian Gas Production with Classical Decomposition and STL")
plt.xlabel("Year")
plt.ylabel("Volume")
plt.show()
/Users/fernando/miniconda3/envs/MLMIC25fc/lib/python3.10/site-packages/pandas/plotting/_matplotlib/core.py:981: UserWarning: This axis already has a converter set and is updating to a potentially incompatible converter
return ax.plot(*args, **kwds)
Let us update the model performance dataframe with the metrics of the classical decomposition and the STL decomposition models. Note that we have not computed the cross validation metrics for the classical decomposition model, but there are reasons to believe that the STL model is better.
model_maes_df = pd.DataFrame({
    'model_name': modelDict.keys(),
    'val_avrg_mae': [modelDict[model_name]['average_val_maes'] for model_name in modelDict.keys()],
    'test_mae': [modelDict[model_name]['test_mae'] for model_name in modelDict.keys()]
})
model_maes_df.sort_values(by='val_avrg_mae', inplace=True)
model_maes_df
   model_name     val_avrg_mae   test_mae
7  stl            0.316359       0.642395
2  seasonal_last  0.555838       1.888272
5  ensemble       0.747182       1.382830
0  last_obs       1.051184       2.124965
4  drift          1.061138       0.708353
1  mean           3.486469       9.750811
3  seasonal_mean  3.537769       9.749002
6  clsc_add_dcmp  NaN            0.720763
Autoregressive (AR) Processes
Introduction
In the following sections we will gradually introduce the ARIMA model family, which is the most important of the classical time series forecasting models. The ARIMA models combine the ideas of autoregressive (AR) processes, moving average (MA) processes, and differencing (the I stands for integrated: integration, understood as summation, is the inverse of differencing) to create a powerful model that can capture a wide range of time series dynamics. We will first see what AR (autoregressive) and MA (moving average) processes are, before introducing the ARIMA models.
Later we will extend these methods to seasonal ARIMA models, usually called SARIMA models, which will enable us to capture the seasonal dynamics of the time series.
AR Process Definition in the Non Seasonal (or Regular) Case
A stochastic process \((Y_t)\) is called autoregressive if it satisfies the equation: \[y_{t} = \phi_{1}y_{t-1} + \phi_{2}y_{t-2} + \cdots + \phi_{p}y_{t-p} + \varepsilon_{t}\]
where: + \((\varepsilon_{t})\) is gaussian white noise + \(\phi_1, \ldots, \phi_{p}\) are constants, called the coefficients of the process. + The number \(p\) is the order of the process, which is then called an \(AR(p)\) process.
Any time series which is a realization of an autoregressive process is called an autoregressive time series.
An autoregressive process can be considered as analogous to a multiple linear regression model. But here the prediction of \(y_t\) uses recent lags of the time series as input variables. We regress the time series onto its recent past, and that is why we call it autoregressive.
Backshift Operator and Characteristic Polynomial
Using the backshift operator \(B(y_t) = y_{t - 1}\) we can write the autoregressive process as: \[\Phi(B)\, y_{t} = \varepsilon_{t}\]
where \[\Phi(B) = (1-\phi_1B - \cdots - \phi_p B^p)\] is the characteristic polynomial of the process.
\(AR(1)\) and \(AR(2)\) Processes
We are particularly interested in the lower order processes. A process of type \(AR(1)\) is: \[y_{t} = \phi_{1}y_{t-1} + \varepsilon_{t},\] or in characteristic polynomial form: \[(1 - \phi_{1} B)y_{t} = \varepsilon_{t}\] Note in particular that a random walk is an autoregressive process of order 1. But it is not stationary! We will return to the subject of stationarity shortly.
A second order \(AR(2)\) process is: \[y_{t} = \phi_{1}y_{t-1} + \phi_{2}y_{t-2} + \varepsilon_{t},\] or \[(1 - \phi_{1}B - \phi_{2} B^2 )y_{t} = \varepsilon_{t}\]
Stationarity for AR Processes
The AR process is stationary if the roots of the characteristic polynomial \[\Phi(B) = (1-\phi_1B - \cdots - \phi_p B^p)\] are all outside the unit circle of the complex plane.
Simulating a Non Seasonal AR Process with Python
We will use the generate_sample method of the ArmaProcess object in statsmodels to simulate autoregressive time series. But first we will do it using a for loop to get a better understanding of the nature of these time series. Let us begin by simulating an \(AR(2)\) process with equation (example from (Peixeiro 2022), section 5.3.1): \[y_{t} = (1/3)y_{t-1} + (1/2)y_{t-2} + \varepsilon_{t},\] Our first step is to generate a white noise time series. We will use a function to do this and ensure reproducibility while keeping control of the process.
Next we define the coefficients of the model using two arrays, as required by ArmaProcess. Note in particular that the sign of the coefficients is reversed, as they appear in the characteristic polynomial: \[ (1 - (1/3)B - (1/2)B^2)y_{t} = \varepsilon_{t}\]
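A sketch of that setup (n = 400 is an assumed sample size; the other names are ours, chosen to be consistent with the loop below):

ar_cff = np.array([1, -1/3, -1/2])   # coefficients of 1 - (1/3)B - (1/2)B^2, signs as in the polynomial
ma_cff = np.array([1])               # trivial MA part
n = 400                              # assumed sample size
AR2_process = ArmaProcess(ar_cff, ma_cff)
AR2_sample = AR2_process.generate_sample(nsample=n)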
This ACF is, as we will see, typical of a stationary AR process. Note that the first ACF values decay to zero in an oscillatory pattern, but they do not suddenly vanish. In fact, we get three coefficients significantly different from zero. This observation will be important below in the context of identifying the right modeling strategy for our data. The pattern of the ACF looks more like a slow fading to zero. Observe also the interesting dynamics in the time plot. Both plots indicate that this is certainly not noise. To visualize more examples of these dynamics, let us generate some more realizations of AR(2) processes with the same characteristic polynomial, where the only change is the noise term.
k = 5
AR = np.zeros((n, k))
for j in range(k):
    W = noise_gen(size=n, seed=j)
    AR[0, j] = W[0]
    AR[1, j] = -ar_cff[1] * AR[0, j] + W[1]
    for i in range(2, n):
        AR[i, j] = -ar_cff[1] * AR[i - 1, j] - ar_cff[2] * AR[i - 2, j] + W[i]
AR_df2 = pd.DataFrame(AR, columns=["AR" + str(i) for i in range(k)])
AR_df2.head()
Again, note that the dynamics of these series are quite complex, even though the coefficients are the same for all of them. This provides \(AR\) models with a lot of flexibility. Besides, their interpretation, saying that the main drivers of a process are its closest past values, is intuitively appealing as a model for many processes we observe in nature or in human activities.
Stationarity in AR Processes
Not all AR processes are stationary: as we have said, a random walk is an example of an AR(1) process, but we know it is not stationary. Recall that the characteristic polynomial of an autoregressive \(AR(p)\) process is \[\Phi(B) = (1-\phi_1B - \cdots - \phi_p B^p)\] This polynomial provides a very simple characterization of stationarity for these processes:
An autoregressive AR(p) process is stationary if and only if the roots of the characteristic polynomial are outside the unit circle.\[\Phi(B) = 0 \Rightarrow |B| > 1\]
Exercise 001
The roots of the AR(2) process that we have used as an example can be obtained as shown below. Is it stationary?
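For instance, one way to compute them with numpy (np.roots expects the coefficients ordered from the highest degree term):

# characteristic polynomial: 1 - (1/3)B - (1/2)B^2
roots = np.roots([-1/2, -1/3, 1])
print("roots:", roots)
print("moduli:", np.abs(roots))   # stationary if every modulus is greater than 1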
Moving Average (MA) Processes
Moving Average Processes Definition in the Non Seasonal Case
A stochastic process \((Y_t)\) is called a moving average process if it satisfies the equation: \[y_{t} = \varepsilon_{t} + \theta_{1}\varepsilon_{t-1} + \theta_{2}\varepsilon_{t-2} + \cdots + \theta_{q}\varepsilon_{t-q}\]
where: + \((\varepsilon_{t})\) is gaussian white noise + \(\theta_1, \ldots, \theta_{q}\) are constants, called the coefficients of the process. + The number \(q\) is the order of the process, which is then called an MA(q) process.
Any time series which is a realization of a moving average process is called a moving average time series.
Characteristic Polynomial and Invertibility for MA Processes
The characteristic polynomial of the MA(q) process is: \[
\Theta(B) = 1 + \theta_{1}B + \theta_{2}B^2 + \cdots + \theta_{q}B^q
\] so that the MA process can be written as: \[
y_t = \Theta(B) \varepsilon_t
\]
Suppose that we have a MA(1) process with \(|\theta_1| < 1\)\[
y_t = (1 - \theta_1\,B)\varepsilon_t
\] then the operator \((1 - \theta_1\,B)\) is invertible, meaning that \[
\dfrac{1}{1 - \theta_1\,B} = 1 + \theta_1\, B + \theta_1^2\, B^2 + \theta_1^3\, B^3+\dots
\] In particular this means that an invertible \(MA(1)\) process can be considered as an infinite autoregressive process \(\text{AR}(\infty)\): \[
(1 + \theta_1\, B + \theta_1^2\, B^2 + \theta_1^3\, B^3+\dots)y_t = \varepsilon_t
\] The same ideas apply to general order MA(q) processes, provided they are invertible, that is, if the roots of the characteristic polynomial \(\Theta(B)\) are all outside the unit circle of the complex plane. Invertibility implies that the \(MA(q)\) process is equivalent to a certain \(\text{AR}(\infty)\) process. This in turn implies that the MA(q) model possesses some nice mathematical properties (see e.g. the discussion and the video in Section 9.4 of Hyndman and Athanasopoulos (2021))
Simulating a Non Seasonal MA(q) Process with Python
The simulation follows the same structure that we saw in the autoregressive case. Let us simulate the MA(2) process defined by:
Can you generate this exact time series using a for loop instead of generate_sample?
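As a hint, here is a sketch of the loop with placeholder coefficients (theta1 and theta2 are assumptions, not the values of the example above; the loop reproduces generate_sample exactly only if both use the same noise draws):

theta1, theta2 = 0.9, 0.3            # placeholder coefficients
W = noise_gen(size=n, seed=0)        # same white noise generator used above
MA = np.zeros(n)
MA[0] = W[0]
MA[1] = W[1] + theta1 * W[0]
for i in range(2, n):
    MA[i] = W[i] + theta1 * W[i - 1] + theta2 * W[i - 2]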
Identifying a Non Seasonal MA(q) or AR(p) with the ACF
The above example illustrates the typical behavior of the ACF of a moving average MA(q) time series: the autocorrelation coefficients become abruptly non-significant after lag q.
However, we saw that the ACF of the AR(p) example was different, and there the number of significant coefficients (meaning significantly different from 0) is not an indicator of the order of the process. In order to identify that order we will use a different plot, called the PACF, for partial autocorrelation function. This is a modified version of the ACF. If the k-th ACF coefficient examines the correlation between \(y_t\) and \(y_{t - k}\), its analogue in the PACF examines the coefficient of \(y_{t-k}\) in a regression model for \(y_t\) that uses all the lags \[y_{t-1}, y_{t-2},\ldots, y_{t-k}\] as predictors. This means that we are examining the relation between \(y_{t}\) and \(y_{t - k}\) while controlling for the effect of the intermediate lags. That is why these are called partial autocorrelation coefficients.
The way to use the PACF to identify the order of an autoregressive model is simple. In the PACF of an autoregressive AR(p) time series: the partial autocorrelation coefficients become abruptly non-significant after lag p.
In this case the situation is reversed: the relevant information for a pure MA process is in the ACF, while the PACF can exhibit a more complex pattern of decay.
ARMA Processes
Definition of the ARMA(p, q) Process
A stochastic process is an ARMA(p, q) process if it satisfies the equation \[
y_{t} = \phi_{1}y_{t-1} + \cdots + \phi_{p}y_{t-p} + \theta_{1}\varepsilon_{t-1} + \cdots + \theta_{q}\varepsilon_{t-q} + \varepsilon_{t}
\] where, as usual, \(\varepsilon_{t}\) is white noise, and \((p, q)\) are jointly called the order of the process.
The expression of the ARMA(p, q) process in terms of characteristic polynomials is: \[
\Phi(B)\,y_t = \Theta(B)\varepsilon_t
\] And the process is stationary and invertible if the roots of both \(\Phi(B)\) and \(\Theta(B)\) are outside the complex unit circle.
Experimenting with the Identification of ARMA(p, q) Processes with the ACF and PACF
In the next code cells first we will load an external script with a couple of auxiliary functions: generate_custom_ARMA and plot_acf_pacf. The first one will allow you to choose the (p, q) values and generate a time series corresponding to a random ARMA(p, q) process (guaranteed to be stationary and invertible). Then using the second function you can plot the ACF and PACF of that time series and see if and how it relates to p and q, using the ideas from the previous paragraphs.
The function generate_custom_ARMA will not generate arbitrary order ARMA(p, q) time series, it is in fact limited to values \(p, q\) such that \(p + q \leq 2\) as these are the values for the models that we are mostly interested in.
The Identification is Often a Trial and Error Situation
Be warned that it is often not easy to guess the right values of p and q from the ACF and PACF plots. As this identification is part of the modeling process we are going to describe, we are therefore often led to consider several alternative models depending on our choice of p and q. The model fit diagnosis and performance evaluation measures will then be the key to select a final model.
After experimenting for a while with known values of p and q, you can run the generate_custom_ARMA without selecting p, q values. Those values will then be selected at random:
Y_df, p, q, ar_cff, ma_cff, rts_p, rts_q = generate_custom_ARMA(size=1200)
After running this code (and without cheating looking at the values of p, q) plot the ACF and PACF. Make an educated guess!
Then inspect the returned values to see the p, q values, the coefficients and the modulus of each root of both characteristic polynomials.
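For instance, a quick check using the values returned by generate_custom_ARMA above could be:

print(f"p = {p}, q = {q}")
print("AR coefficients:", ar_cff)
print("MA coefficients:", ma_cff)
print("AR root moduli:", np.abs(rts_p))
print("MA root moduli:", np.abs(rts_q))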
ARIMA Models
ARIMA Meaning
If our series looks stationary, we can try to find an ARMA(p, q) process/model that fits our time series (and verifies the stationarity and invertibility conditions). This strategy often succeeds, as the ARMA family of processes is flexible enough to capture the dynamics of many time series coming from natural phenomena or human related activities. That still holds even if we keep to low order ARMA(p, q) processes, say with \(p, q < 5\).
But what if our series is not stationary? Then we can try to difference it until it becomes stationary. This is the idea behind the ARIMA model. Here we will introduce the ARIMA model for the case of a non seasonal time series. The seasonal case will be introduced in coming sessions.
Fitting an ARIMA Model with the Box-Jenkins Methodology
The above idea was organized into a methodology by Box and Jenkins in the 1970s. The methodology, described in their book (Box et al. 2015), is based on the following steps:
Step 1: Data preparation
1.1 Divide the data into Training and Test.
1.2 Transform data to stabilize variance if needed (e.g. using BoxCox transformation).
1.3 Difference data to obtain stationary series if needed.
Step 2: Model selection
2.1 Examine data, ACF and PACF to identify a potential model or models.
Step 3: Estimation / Fit
3.1 Estimate parameters in potential models: fit the model.
Step 4: Model Diagnosis
4.1 Examine the coefficients and residuals.
Step 5: Forecasting and Model Comparison
5.1 Obtain the model forecasts for the test set.
5.2 Use those forecasts to measure the model performance and compare candidate models.
5.3 If transformations were applied to stabilize variance or if differencing was required, make sure to transform back the forecasts.
ARIMA Model Examples
In the following code cells we will show how to apply the Box-Jenkins methodology to fit an ARIMA model to a time series. We will use the ARIMA class from the statsmodels library.
The Arima_examples.csv file contains five daily time series Y1, ..., Y5 that can be used as examples to illustrate the methodology. Here we will use Y2 as an example, leaving the other series for you to experiment with. The EDA can be kept to a minimum: these series do not contain missing values or time gaps.
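The loading cell is not reproduced here; a sketch of it (the file path is an assumption, while the date column name and the daily frequency match the outputs shown later):

df_ts = pd.read_csv("tsdata/Arima_examples.csv",      # path is an assumption
                    parse_dates=["date"], index_col="date")
df_ts.index.freq = "D"                                # daily data
df_ts.head()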
Let us select one of the time series (and drop the rest).
Y = "Y2"
df_ts = df_ts[[Y]].copy()
We use the plot_acf_pacf function to plot the series along with its ACF and PACF.
plot_acf_pacf(df_ts=df_ts, var=Y)
We can clearly see that the series is not stationary. Both the time plot and the ACF make this clear. So we know that we will have to apply differencing below. But first let us split the data into training and test sets and deal with the variance. We make copies here to avoid warnings below about train and test being views of the original data (copies by reference instead of copies by value).
We will use the last 8 weeks (roughly two months) of the series as the test set. This is common practice in time series forecasting, where instead of using proportions such as 80% of the data for training and 20% for testing, we use some natural temporal division of the data.
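A minimal sketch of this split (test_len and the copies are our own choices):

test_len = 8 * 7                        # last 8 weeks of daily data
y_train = df_ts.iloc[:-test_len].copy()
y_test = df_ts.iloc[-test_len:].copy()
y_train.shape, y_test.shape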
Stabilizing the Variance. Box-Cox Transformations.
The plot of the training set suggests that the variance of the series is varying with time (the amplitude of the oscillations for the last years is increasing). We are therefore inclined to apply a Box-Cox transformation to the series. Recall that we can formally check homogeneous variance with a Breusch-Pagan test, as we do next.
Also, before proceeding further, note that seasonality does not appear to be present in the series. We will therefore only consider a regular ARIMA model. We will deal with the seasonal case in the next session.
This is the code to make the Breusch-Pagan test. We make a copy of the series to apply the required linear regression.
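The cell itself is not shown here; a sketch of the idea, regressing the series on a time index and testing the residuals for heteroscedasticity (aux and bp_model are our own names), could be:

aux = y_train[[Y]].copy()
aux["t"] = np.arange(len(aux))                     # time index as regressor
bp_model = ols(f"{Y} ~ t", data=aux).fit()
bp_stat, bp_pvalue, _, _ = sms.het_breuschpagan(bp_model.resid, bp_model.model.exog)
print(f"Breusch-Pagan p-value: {bp_pvalue}")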
The p-value is very low, so we can reject the null hypothesis of homoscedasticity. We will therefore apply the Box-Cox transformation to the series. In case you decide not to use it you can simply set box_cox = False in the next cell.
box_cox = True
if box_cox:
    y_train_original = y_train.copy()
    bxcx_transformer = BoxCoxTransformer()
    transformed_y_train = bxcx_transformer.fit_transform(y_train[[Y]])
    y_train[Y + "_bxcx"] = transformed_y_train
    # y_train.insert(-1, Y + "_bxcx", transformed_y_train)
    Y_original = Y
    Y = Y + "_bxcx"
    print(f"lambda = {bxcx_transformer.lambda_:.3f}")
    plot_acf_pacf(df_ts=y_train, var=Y)
lambda = -2.673
The time plot shows that the variance is now more stable. But it clearly still shows a trend and the ACF confirms that the series is not stationary. We will therefore apply regular differencing to the series.
Before moving on to that recall that when we apply the Box-Cox transformation to the test set we will be using the lambda value from training (with transform instead of fit_transform). But instead of doing that manually we are going to use sktime to structure the process in a pipeline-ish way, similar to those that we have used with sklearn.
Deciding if Differencing is Required
From our EDA we already know that differencing is required, but for the sake of completeness we will include the result of the ADF test before and after differencing. Recall that the ADF Test uses as null hypothesis that the series is non stationary. Therefore if the p-value is small we (reject the null hypothesis and) conclude that the series is stationary. We will also use ndiffs from the pmdarima library to determine the number of differences required to make the series stationary.
The p-value before differencing is high, clearly suggesting the need for differencing.
ADF_pvalue = adfuller(y_train[Y])[1]
ADF_pvalue
0.5136472789500353
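The ndiffs call itself is not shown above; a minimal sketch of it (the test="adf" choice is our assumption):

from pmdarima.arima import ndiffs

ndiffs(y_train[Y], test="adf")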
The result of ndiffs is 1, which is consistent with the ADF test. We will therefore apply first order differencing to the series.
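The differencing cell is not reproduced; a sketch of it (Ydiff is our own name for the new column):

Ydiff = Y + "_diff"
y_train[Ydiff] = y_train[Y].diff()
plot_acf_pacf(df_ts=y_train.dropna(), var=Ydiff)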
The p-value after differencing is very low, confirming that the series is now stationary. Note that we use dropna to remove the first value of the series, which after differencing becomes missing and could cause problems in the ADF computation.
ADF_pvalue = adfuller(y_train[Ydiff].dropna())[1]
print(f"The ADF p-value is {ADF_pvalue}")
The ADF p-value is 5.392828155316003e-26
The goal of this step is not to actually transform the series and keep working with the differenced version. What we want is the value of d, so that we can provide it to the ARIMA model.
Selecting the ARMA(p, q) Component Candidate
After differencing we check again the ACF and PACF to confirm stationarity and importantly to identify the potential values of p and q that define the structure of the ARMA model. These values, together with the value of d that we have just obtained, will be used to fit a model to the series of type ARIMA(p, d, q) (this is the common convention for the name).
The above diagram suggests a mix of AR and MA components. We always try to keep the orders low to apply the parsimony principle. But we must pay attention to the (significantly) non-zero coefficients. In particular the second spike in the ACF and PACF suggests p = 2 and q = 2. We will therefore fit an ARIMA(2, 1, 2) model to the series.
p = 2
q = 2
Fitting an ARIMA Model the statsmodels way
Note: this was our only method for the previous year. We keep it here because the alternative is still experimental for us, and also because it is probably the most common way to fit ARIMA models in Python.
Now we pass those values to the ARIMA function from the statsmodels library (we call it smARIMA because we will soon use another version) and we use the fit method to estimate the parameters of the model. Note that, contrary to sklearn, the data is not passed to the fit method, but to the endog parameter of the ARIMA object. Note also that we pass the non-differenced series to the fit method, as the ARIMA object will take care of the differencing.
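A sketch of that fit (smARIMA_model and smARIMA_fit are our own names, and d = 1 comes from the differencing analysis above):

d = 1
smARIMA_model = smARIMA(endog=y_train[Y], order=(p, d, q))
smARIMA_fit = smARIMA_model.fit()
print(smARIMA_fit.summary())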
You can see in the table above that the coefficient for the second autoregressive term is not significant. We will therefore refit the model without this term.
Now all the coefficients of the model are significant.
ARIMA Model Diagnosis. Step 2: Residuals analysis.
Now we inspect the model’s residuals to see if they qualify as Gaussian White Noise. We will draw some diagnostic plots and apply the Ljung-Box test to the residuals.
Our final residual diagnostic tool is the computation of the Ljung-Box statistic using the first lags of the residuals. All the p-values are over the significance level, so we do not reject the null hypothesis (that the residuals are white noise).
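A sketch of that computation (arima_residuals is our own name, assumed to hold the residuals of the refitted model):

from statsmodels.stats.diagnostic import acorr_ljungbox

arima_residuals = smARIMA_fit.resid
acorr_ljungbox(arima_residuals, lags=range(1, 11))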
Now we will essentially repeat the above steps using the ARIMA class from the sktime library (called skARIMA here for clarity). The main difference is that we will use a pipeline to structure the process. In this case the pipeline will only include the Box-Cox transformation and the model fitting. But even with such a simple pipeline we will see the big advantages of this approach.
We define the pipeline as a TransformedTargetForecaster object, a pipeline-like object that contains a description of the modeling process as list of steps. In this case we use a boolean variable box_cox to decide if we want to apply the transformation or not and change the pipeline accordingly. The pipeline itself is called skARIMA_forecaster. Note that we are using the results of our previous work in that we decide whether we want to apply the Box-Cox transformation or not, and we have already decided the (p, d, q) structure for the ARIMA model.
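A sketch of such a pipeline, assuming sktime's ARIMA wrapper and the (1, 1, 2) order we ended up with above:

from sktime.forecasting.compose import TransformedTargetForecaster
from sktime.forecasting.arima import ARIMA as skARIMA

steps = []
if box_cox:
    steps.append(("boxcox", BoxCoxTransformer()))
steps.append(("skARIMA", skARIMA(order=(1, 1, 2))))     # (p, d, q) chosen above

skARIMA_forecaster = TransformedTargetForecaster(steps=steps)
skARIMA_forecaster.fit(y_train[Y_original])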
Once the model has been fitted, we can use the forecaster step to see a summary almost identical to the one we get from the statsmodels ARIMA model.
skARIMA_forecaster.forecaster_.summary()
                               SARIMAX Results
==============================================================================
Dep. Variable:                      y   No. Observations:                 1040
Model:               SARIMAX(1, 1, 2)   Log Likelihood                3307.491
Date:                Tue, 25 Mar 2025   AIC                          -6606.981
Time:                        11:45:39   BIC                          -6587.197
Sample:                    01-01-1995   HQIC                         -6599.476
                         - 11-05-1997
Covariance Type:                  opg
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1         -0.4506      0.046     -9.887      0.000      -0.540      -0.361
ma.L1          0.2340      0.040      5.820      0.000       0.155       0.313
ma.L2         -0.5505      0.025    -21.960      0.000      -0.600      -0.501
sigma2         0.0001   4.26e-06     23.595      0.000    9.21e-05       0.000
==============================================================================
Ljung-Box (L1) (Q):              0.03   Jarque-Bera (JB):                2.95
Prob(Q):                         0.87   Prob(JB):                        0.23
Heteroskedasticity (H):          0.79   Skew:                            0.10
Prob(H) (two-sided):             0.03   Kurtosis:                        3.17
==============================================================================

Warnings: [1] Covariance matrix calculated using the outer product of gradients (complex-step).
Extracting the residuals is a bit trickier, because the ARIMA implementation we are using is essentially the one from pmdarima, but it does not expose all the methods of those pmdarima objects. We have to use its own predict_residuals method to get the residuals.
Similarly, the diagnostic plots for the residuals are a bit more complicated, but not much.
plt.figure(figsize=(12, 8))
plt.subplot(221)
plt.plot(residuals)
plt.title("Residuals")
plt.subplot(222)
sns.histplot(residuals, kde=True)
plt.title("Histogram and density plot of residuals")
plt.subplot(223)
sm.qqplot(residuals, line='s', ax=plt.gca())
plt.title("Q-Q Plot for residuals")
plt.subplot(224)
sm.graphics.tsa.plot_acf(residuals, lags=40, ax=plt.gca(), zero=False)
plt.title("ACF of Residuals")
plt.tight_layout()
plt.show()
The model is the same so we have already seen that these residuals behave like white noise. The Ljung-Box test is obtained just like before, as it only depends on the residuals.
The major difference between the two libraries is in the prediction step. We will see that in this case the sktime implementation is easier to use.
We will use temporal cross validation to evaluate the model. The way we do this is with the ExpandingWindowSplitter, that we already met with the naive models. We will use a forecasting horizon of length 1 (one-step-ahead forecasting). We will keep the first half of the training set as initial window, to ensure that the model has enough data to make the first forecast.
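A sketch of the splitter definition, consistent with the fold sizes printed below (validation_fh and cv are the names used in the following cells):

validation_fh = [1]                                # one-step-ahead forecasts
cv = ExpandingWindowSplitter(
    initial_window=y_train.shape[0] // 2,          # first half of the training set
    step_length=1,
    fh=validation_fh
)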
crossval_splits = list(cv.split(y_train))
print(f"Length of the training part of the first cross-validation split: {crossval_splits[0][0].shape[0]}")
print(f"Length of the validation part of the first cross-validation split: {crossval_splits[0][1].shape[0]}")
Length of the training part of the first cross-validation split: 520
Length of the validation part of the first cross-validation split: 1
Now we are ready to get the validation predictions and scores. For the naive models we did this using a for loop, but here we can use the evaluate function of sktime, which as described in the documentation is a performance benchmarking utility. It has the virtue of sparing us from explicitly writing the looping code.
A very important parameter of evaluate is the strategy. Using refit as strategy means that the model will be retrained at each step of the cross validation. This is what we called a rolling forecasting origin strategy.
Note: running this code took around two minutes on my laptop. If computation time is a concern you can increase the step_length parameter of the ExpandingWindowSplitter to reduce the number of steps. You may also get some warnings about convergence failures for certain folds.
cv_results = evaluate(
    forecaster=skARIMA_forecaster,
    y=y_train[Y_original],
    cv=cv,
    strategy="refit",  # refit model on each fold
    scoring=mean_absolute_error,
    return_data=True
)
/Users/fernando/miniconda3/envs/MLMIC25fc/lib/python3.10/site-packages/statsmodels/base/model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
warnings.warn("Maximum Likelihood optimization failed to "
(the same ConvergenceWarning is repeated for several other folds; duplicate lines omitted)
The results of cross validation are stored in the cv_results object. This is a pandas dataframe that contains the predictions and the actual values for each fold, as well as the metrics for each fold.
cv_results.head()
   test__DynamicForecastingErrorMetric  fit_time  pred_time  len_train_window      cutoff                y_test                y_pred
0                              0.016129  0.223771   0.006479               520  1996-06-03  1996-06-04  1.076537  1996-06-04  1.092665
1                              0.002245  0.119813   0.007738               521  1996-06-04  1996-06-05  1.081851  1996-06-05  1.079606
2                              0.002024  0.140400   0.013789               522  1996-06-05  1996-06-06  1.090313  1996-06-06  1.088289
3                              0.013451  0.152167   0.009819               523  1996-06-06  1996-06-07  1.098762  1996-06-07  1.085311
4                              0.009020  0.160006   0.016635               524  1996-06-07  1996-06-08  1.105986  1996-06-08  1.096966

(the y_train column, which holds the full training fold of each row, is omitted here for readability)
Be careful, if the forecasting horizon is more than one time step, each element of the predictions column contains in fact a time series: the predictions for all the steps of the horizon. For example, the first row contains arrays for the first training fold, the corresponding validation fold, and the predictions for that first validation fold. Their shapes are:
Test score for ARIMA Models with the rolling forecasting origin strategy
Getting the score on the test set is essentially like scoring one additional fold. We use evaluate with the strategy parameter set to refit and the cv parameter set to an ExpandingWindowSplitter whose initial window is the whole training set and whose step length is 1.
Hide the code
# window_length = y_train.shape[0]
# test_fh = [1]
test_fh = validation_fh
step_length = 1

# Create a splitter that walks through the last 100 points
cv_test = ExpandingWindowSplitter(
    initial_window=y_train.shape[0],
    step_length=step_length,
    fh=test_fh
)

# Evaluate using refitting at each step
test_results = evaluate(
    forecaster=skARIMA_forecaster,
    y=df_ts[Y_original],
    cv=cv_test,
    strategy="refit",
    scoring=mean_absolute_error,
    return_data=True
)
/Users/fernando/miniconda3/envs/MLMIC25fc/lib/python3.10/site-packages/statsmodels/base/model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
warnings.warn("Maximum Likelihood optimization failed to "
The results are returned in a pandas DataFrame like the one obtained for cross-validation.
Hide the code
test_results.head()
| | test__DynamicForecastingErrorMetric | fit_time | pred_time | len_train_window | cutoff | y_train | y_test | y_pred |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.013276 | 0.177430 | 0.009469 | 1040 | 1997-11-05 | date 1995-01-01 0.993847 1995-01-02 0.98... | date 1997-11-06 0.944234 Freq: D, Name: Y2,... | 1997-11-06 0.930958 Name: Y2, dtype: float64 |
| 1 | 0.013905 | 0.226878 | 0.005811 | 1041 | 1997-11-06 | date 1995-01-01 0.993847 1995-01-02 0.98... | date 1997-11-07 0.955646 Freq: D, Name: Y2,... | 1997-11-07 0.941741 Name: Y2, dtype: float64 |
| 2 | 0.011053 | 0.223991 | 0.005926 | 1042 | 1997-11-07 | date 1995-01-01 0.993847 1995-01-02 0.98... | date 1997-11-08 0.957338 Freq: D, Name: Y2,... | 1997-11-08 0.946285 Name: Y2, dtype: float64 |
| 3 | 0.002280 | 0.193312 | 0.005919 | 1043 | 1997-11-08 | date 1995-01-01 0.993847 1995-01-02 0.98... | date 1997-11-09 0.949454 Freq: D, Name: Y2,... | 1997-11-09 0.951734 Name: Y2, dtype: float64 |
| 4 | 0.023035 | 0.291921 | 0.010263 | 1044 | 1997-11-09 | date 1995-01-01 0.993847 1995-01-02 0.98... | date 1997-11-10 0.969647 Freq: D, Name: Y2,... | 1997-11-10 0.946613 Name: Y2, dtype: float64 |
The average test score (MAE) is in line with what we obtained in validation:
Hide the code
test_results.iloc[:, 0].mean()
0.008843598177944362
Understanding the difference between rolling forecasting origin and using predict with the test set
Based on our experience with sklearn regression models, you may have thought that to get the test score we could simply use code like the following:
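(The original cell is not shown here; what follows is a minimal sketch of such a call, assuming the forecaster and the train/test objects defined earlier.)
fh_test = ForecastingHorizon(y_test.index, is_relative=False)  # absolute test dates
skARIMA_forecaster.fit(y_train[Y_original])                    # a single fit on the training set
y_test_pred = skARIMA_forecaster.predict(fh_test)              # one predict call for the whole test span
print(mean_absolute_error(y_test, y_test_pred))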
How is this different (and why is it so comparatively bad)? How are the predictions from predict obtained? The key concept is the strategy parameter of evaluate. The refit strategy means that the model is retrained at each step of the cross-validation; this is what we called the rolling forecasting origin strategy. When we use strategy = "no-update_params" instead, the model is not refitted (a sketch of such an evaluate call is shown after the plot). This is similar to what happens when we call predict with the test set: a single fit of the model on the whole training data is used to predict the entire test set. As a consequence, the quality of the predictions deteriorates very quickly as we move further into the future; in fact, ARIMA models without refitting (without a rolling origin) revert to the mean of the series after a few steps. The following plot illustrates this for our example.
Hide the code
fig, ax = plt.subplots(figsize=(12, 6))
pd.DataFrame(
    {"skpred": [(test_results.iloc[idx, -1].values)[0] for idx in test_results.index]},
    index=y_test.index,
).plot(ax=ax, color="red", label="sktime ARIMA (rolling forecast with refit)")
y_train[Y_original].tail(50).plot(ax=ax, color="black", label="Train")
y_test.plot(ax=ax, color="blue", label="Test (actual values)")
y_test_pred.plot(ax=ax, color="green", label="Test (with a single call of predict)")
plt.legend(["sktime ARIMA (rolling forecast with refit)", "Train", "Test (actual values)", "Test (with a single call of predict)"])
plt.show()
plt.close()
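For completeness, the behaviour without refitting can also be measured with evaluate itself by setting strategy to "no-update_params". In this hedged sketch, reusing the splitter and forecaster from above, the model is fitted only on the initial window; at each later cutoff the new observations are appended to the forecaster but its parameters are not re-estimated:
# Same evaluation as before, but the ARIMA parameters are never re-estimated
test_results_noupdate = evaluate(
    forecaster=skARIMA_forecaster,
    y=df_ts[Y_original],
    cv=cv_test,
    strategy="no-update_params",
    scoring=mean_absolute_error,
    return_data=True,
)
print(test_results_noupdate.iloc[:, 0].mean())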
Forecasting and Model Performance with statsmodels ARIMA
We said before that the sktime implementation offers the advantage of simplicity over the statsmodels implementation. In this section we will show how to use statsmodels to obtain the same predictions as we did with the rolling forecasting origin strategy.
Note: this is our code from last year; we include it here for completeness and as a fallback in case of sktime problems.
We begin by creating a dataframe to store the predictions. Initially it only contains the original time series.
Hide the code
df_ts_pred = df_ts.copy()
df_ts_pred.head()
| date | Y2 |
|---|---|
| 1995-01-01 | 0.993847 |
| 1995-01-02 | 0.985104 |
| 1995-01-03 | 1.000622 |
| 1995-01-04 | 1.006893 |
| 1995-01-05 | 0.981376 |
Next we apply the Box-Cox transformation to the series and store the transformed values in a new column. Note that we are using the lambda value obtained from the training set (by calling transform on the already fitted transformer).
Hide the code
print("The value of lambda used for Box-Cox transformation is ", bxcx_transformer.lambda_)if box_cox: df_ts_pred[Y] = bxcx_transformer.transform(df_ts[Y_original])# df_ts_pred[Y_original + "_pred"] = np.nandf_ts_length = df_ts.shape[0]df_ts_pred.head()
The value of lambda used for Box-Cox transformation is -2.6727376137143977
| date | Y2 | Y2_bxcx |
|---|---|---|
| 1995-01-01 | 0.993847 | -0.006223 |
| 1995-01-02 | 0.985104 | -0.015313 |
| 1995-01-03 | 1.000622 | 0.000621 |
| 1995-01-04 | 1.006893 | 0.006806 |
| 1995-01-05 | 0.981376 | -0.019280 |
We will now explicitly loop through the values of the test set (this is the rolling origin), and at each step we will refit the model with the time series values up to that point, and make the prediction for the next step. We will store the predictions in a list that will then be used to fill the corresponding column of the dataframe (for the rows corresponding to the test set).
The values of p, d, q are already known from the previous steps. We will use them to fit the model to the expanding windows.
This is the prediction loop. Here we use the get_prediction method of the fitted model, which returns a probabilistic forecast; we are only interested in the point forecasts, stored in its predicted_mean attribute.
Hide the code
Y_pred = []  # list that collects the rolling predictions
for i in range(train_end, df_ts_length, window):
    # refit the ARIMA model on all (Box-Cox transformed) data up to the current origin
    rolling_model = smARIMA(df_ts_pred[Y][:i], order=(p, d, q), trend="n").fit()
    # predict up to i + window - 1 and keep only the last `window` (out-of-sample) values
    predictions = rolling_model.get_prediction(0, i + window - 1)
    window_pred = predictions.predicted_mean.iloc[-window:]
    Y_pred.extend(window_pred)
/Users/fernando/miniconda3/envs/MLMIC25fc/lib/python3.10/site-packages/statsmodels/base/model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
warnings.warn("Maximum Likelihood optimization failed to "
Now we add these predictions to the dataframe, in a column named after this model.
We are not done yet! In this manual approach we have to reverse the Box-Cox transformation to get the actual predictions. We will store these in a new column of the dataframe.
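The original cells for these two steps are not shown above; the following is a minimal sketch of what they might look like. The label model_name is a hypothetical choice here (only the "_unbxcx" suffix matches the column used in the plot below), and scipy's inv_boxcox is one possible way to undo the transformation; box_cox, train_end, Y_pred and bxcx_transformer come from the earlier cells.
from scipy.special import inv_boxcox   # inverse of the Box-Cox transform

model_name = "ARIMA_rolling"            # hypothetical label for this model
test_index = df_ts_pred.index[train_end:]

# 1) store the rolling predictions (still on the Box-Cox scale) for the test rows
df_ts_pred.loc[test_index, model_name] = Y_pred[: len(test_index)]

# 2) reverse the Box-Cox transformation using the lambda fitted on the training set
if box_cox:
    df_ts_pred[model_name + "_unbxcx"] = inv_boxcox(
        df_ts_pred[model_name], bxcx_transformer.lambda_
    )
else:
    df_ts_pred[model_name + "_unbxcx"] = df_ts_pred[model_name]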
Finally we can plot the predictions and the actual values.
Hide the code
fig, ax = plt.subplots(figsize=(12, 6))
y_train[Y_original].tail(50).plot(ax=ax, color="black", label="Train")
y_test.plot(ax=ax, color="red", label="Test")
y_test_pred.plot(ax=ax, color="green", label="Single call of predict (sktime ARIMA)")
df_ts_pred[model_name + "_unbxcx"][(train_end - 1):].plot(ax=ax, label="statsmodels (Box-Cox)", linewidth=6, color="blue", alpha=0.5)
pd.DataFrame(
    {"skpred": [(test_results.iloc[idx, -1].values)[0] for idx in test_results.index]},
    index=y_test.index,
).plot(ax=ax, color="orange", label="sktime ARIMA")
plt.axvline(y_train.index[-1], color="magenta", linestyle="--", label="Train/Test Split")
plt.legend(
    ["Train set", "Test set (actual values)", "Single call of predict (sktime ARIMA)",
     "statsmodels loop", "sktime ARIMA (rolling forecast with refit)"],
    fontsize="x-small",
)
plt.show()
plt.close()
Note that the predictions of sktime with the rolling origin and those from statsmodels are the same (the model is the same, and the predictions are made in the same way). But the framework provided by sktime automates the process, while statsmodels requires quite a lot of manual intervention.
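As a quick numerical check (a hedged sketch using the objects built above), we can look at the largest absolute difference between the two rolling forecasts; it should be very small if both pipelines are configured identically:
# sktime rolling-refit predictions, rebuilt as a Series indexed like the test set
sk_rolling = pd.Series(
    [(test_results.iloc[idx, -1].values)[0] for idx in test_results.index],
    index=y_test.index,
)
# statsmodels rolling predictions after undoing the Box-Cox transformation
sm_rolling = df_ts_pred[model_name + "_unbxcx"].loc[y_test.index]
print((sk_rolling - sm_rolling).abs().max())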
Note also how quickly the predictions deteriorate when the model is used without refitting, that is, when a single call to predict covers the whole test forecasting horizon.
In the Next Sessions
Our plan for the next session is to extend these methods to time series with a seasonal component (like the Canadian Gas Production series) using SARIMA models. Then we will add additional time series as exogenous predictors (SARIMAX models). We will also discuss autoarima strategies.