Chapter 4, Part 2: Stationarity

Machine Learning

Author

F. San Segundo & N. Rodríguez

Based on notes by Professors A. Muñoz, J. Portela & J. Pizarroso

Published

March 2025



Session Setup

Libraries

Let us begin by loading the libraries we will use.

### Load necessary modules -------------------------------

%config InlineBackend.figure_format = 'png' # ‘png’, ‘retina’, ‘jpeg’, ‘svg’, ‘pdf’


import numpy as np 
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
sns.set_context('notebook')

# statsmodels
import statsmodels.stats.api as sms
import statsmodels.api as sm
import statsmodels.tsa as tsa

from statsmodels.tsa.seasonal import seasonal_decompose, STL
from statsmodels.tsa.stattools import adfuller, kpss
from statsmodels.formula.api import ols


import pmdarima as pmd

import sys
sys.path.insert(0, '.')
# from utils import *


plt.rc('axes', titlesize='medium')
plt.rc('axes', titlelocation='left')
plt.rc('axes.spines', right=False)
plt.rc('axes.spines', top=False)
sizets = (8,4.5)
plt.rc('figure', figsize=sizets)

%matplotlib inline
from fc_4_2_utils.utils import *

# Time Series libraries

import datetime

# pmdarima
import pmdarima as pmd

# datasets
import rdatasets
import yfinance as yf

# tqdm 
from tqdm import tqdm

# sktime 
from sktime.datasets import load_airline
from sktime.transformations.series.boxcox import BoxCoxTransformer
from sktime.utils.plotting import plot_series

Types of Models for Time Series

Pure autoregressive models for time series

Our forecasting goal is to predict the value \(y_t\) (the future) using the previous history of the time series: \[ \mathbf{ \bar y_{t-1}} = (y_1, y_2, \ldots, y_{t-1}) \] The first family of models that we will explore in the next sessions can be expressed as: \[ y_t = f(\mathbf{\bar y_{t-1}}, \mathbf{\bar \epsilon_{t-1}}) \] where \(\mathbf{\bar \epsilon_{t-1}}\) is also a time series of error or noise terms. Since the output \(y_t\) is numeric, this kind of model can be seen as a (special kind of) regression problem in which we regress the next value of the time series using its past and noise terms as inputs.

We call this a pure autoregressive model because the time series is regressed on itself. That is, the output \(y_t\) at time \(t\) is regressed on past values \(y_{t-1}, y_{t-2}, \ldots, y_{t-p}\) (and on the noise terms) and only on them, without any other external input.
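To make the notation concrete, here is a minimal toy sketch in the spirit of the models we will study later: each new value is a fixed linear function of its own recent past plus a noise term. The coefficients 0.6 and 0.3 (and the whole setup) are arbitrary choices of ours, just for illustration.

# Toy pure autoregressive recursion: y_t = f(y_{t-1}, y_{t-2}, eps_t) with a simple linear f
rng_ar = np.random.default_rng(0)
n_obs = 200
eps = rng_ar.normal(0, 1, size=n_obs)      # the noise series epsilon_t
y_sim = np.zeros(n_obs)
for t in range(2, n_obs):
    y_sim[t] = 0.6 * y_sim[t - 1] + 0.3 * y_sim[t - 2] + eps[t]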

There are two natural and interconnected questions to ask about these models:

  • How is this special or different from our previous discussion of regression problems?
  • What do we mean by noise in the context of time series?

Before addressing these questions, let us talk about a generalization of these models.

Dynamic Regression Models for Time Series

Sometimes we will consider one or more additional time series that contain information to help us predict the value of \(y_t\). In those cases we will arrange these additional time series in a matrix \(\mathbf{X_{t-1}}\) of exogenous variables \[ \mathbf{X_{t-1}} = \left( \begin{array}{cccc} \uparrow&\uparrow&\cdots&\uparrow\\ \mathbf{\bar x_{1;t-1}}& \mathbf{\bar x_{2;t-1}}&\cdots&\mathbf{\bar x_{p;t-1}}\\ \downarrow&\downarrow&\cdots&\downarrow \end{array} \right) \] where each column of the matrix is a time series containing the (past) values of the corresponding auxiliary input.

Therefore, a general formulation of the forecasting problem for time series is this. Given the previous history of the output time series \(\mathbf{\bar y_{t-1}}\) and, optionally, a matrix of exogenous variables up to time \(t-1\), find a model to predict the next value of the output series \[ y_t = f(\mathbf{\bar y_{t-1}}, \mathbf{X_{t-1}}, \mathbf{\bar \epsilon_{t-1}}) \] where again \(\mathbf{\bar \epsilon_{t-1}}\) is the time series of noise terms.
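As a small illustration of how the exogenous series can be arranged in practice (toy data and column names of our own choosing), each column of the exogenous block plays the role of one \(\mathbf{\bar x_{j;t-1}}\):

# Toy arrangement (sketch): target series y plus two exogenous series, stored column-wise
dates = pd.date_range("2024-01-01", periods=6, freq="D")
frame = pd.DataFrame({
    "y":  [10, 12, 13, 15, 14, 16],          # series we want to forecast
    "x1": [1.0, 1.1, 1.3, 1.2, 1.4, 1.5],    # exogenous input 1
    "x2": [5, 4, 6, 7, 6, 8],                # exogenous input 2
}, index=dates)
# To predict y at time t we may use the past of y together with frame[["x1", "x2"]]
# observed up to time t - 1.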


Theoretical Foundations

Stochastic Processes and Statistical Properties of Time Series

The answer to our first question about time series models requires that we introduce some previous ideas about time series in the wider context of stochastic processes. A discrete stochastic process \(Y(u, t)\) is a family of time-indexed random variables. Here we consider time as discrete, meaning that we only consider a sequence of time values \[t_1, t_2, t_3, \ldots\] In particular this means that:

  • for a fixed time value \(t_k\), we have a random variable \(Y_{t_k}(u) = Y(u, t_k)\) defined on a certain common sample space \(\Omega\) (this is where \(u\) lives), and having its corresponding probability distribution, mean, variance (assuming these exist) and so on.
  • for a fixed \(u_0\) we get a time series: \[Y^{u_0}(t) = \left[Y(u_0, t_1),\,\, Y(u_0, t_2),\,\, Y(u_0, t_3),\,\, \ldots\right]\] abbreviated as \[Y^{u_0}(t) = \left[y^{u_0}_{t_1},\,\, y^{u_0}_{t_2},\,\, y^{u_0}_{t_3},\,\, \ldots\right]\] that we call a realization of the stochastic process.

Example.

To make things more concrete and to understand the role of \(u\), let us look at an example and then implement it in Python. Let: \[ X(u, t) = 3\sin(2 \pi t / 500 + A(u)) \] where \(A(u)\) is a real valued random variable and \(t = 1, 2, \ldots\) is a positive natural number. In particular, let us assume that \(A(u)\) is uniformly distributed in \([0,2\pi]\). We can think of \(A(u)\) as a weird dice. Rolling that dice will result in a random value for \(A\) (according to its distribution); say we get \(A(u_0) = 2.5\) for a certain we-don’t-really-care-which value \(u_0\). Think of \(u_0\) as a shorthand for roll the dice. Then we have \[ X(u_0, t) = 3\sin(2 \pi t / 500 + 2.5) \] and this defines a time series (seasonal, with period 500), which is a realization of the stochastic process. Another roll of the dice, say \(u_1\), and you get a different time series.

Let us implement this in Python. We will construct the first 1500 values (\(t = 0, \ldots, 1499\) in the code, since Python indexes from zero) for 10 different realizations of the stochastic process (i.e. for 10 different values of \(u_0\) above, 10 different rolls of a peculiar dice).

n = 1500
k = 10

X = np.zeros((n, k))
rng = np.random.default_rng(2024)

Ts = np.arange(n)

for i in range(k):
    A = rng.uniform(low=0, high=2 * np.pi, size = 1)
    X[:, i] = 3 * np.sin(2 * np.pi * Ts/500 + A)

Let us store the resulting values in a pandas Dataframe to make processing easier:

Xdf = pd.DataFrame(X, columns=["X" + str(i) for i in range(k)])
Xdf.head()
X0 X1 X2 X3 X4 X5 X6 X7 X8 X9
0 -2.680097 2.924941 2.793115 -2.856263 -0.079119 2.338127 1.424163 2.721056 2.315777 2.625455
1 -2.696825 2.933090 2.779138 -2.844509 -0.041428 2.361563 1.457230 2.736716 2.291629 2.643488
2 -2.713126 2.940776 2.764722 -2.832305 -0.003730 2.384625 1.490067 2.751943 2.267119 2.661103
3 -2.728999 2.947997 2.749869 -2.819654 0.033968 2.407311 1.522669 2.766736 2.242251 2.678299
4 -2.744441 2.954753 2.734582 -2.806558 0.071661 2.429617 1.555030 2.781092 2.217029 2.695071

And let us plot the different realizations as time series:

fig, ax = plt.subplots()
Xdf.plot(ax=ax, zorder=-10)
t_a = 325
plt.axvline(x = t_a, ls="--")
plt.scatter(x=[t_a]*k, y=X[t_a, :], zorder=1)
if (k > 5):
    ax.get_legend().remove()
ax.grid(visible=True, which='both', axis='x')
plt.show();plt.close()

We have added a dashed line at \(t_a = 325\) to help you visualize the random variable \(X_{325}\), which is part of this stochastic process. The dots illustrate that we have generated a \(k\)-sized sample of \(X_{t_a}\). The sample corresponds to a row of the pandas DataFrame (while the different time series correspond to columns). We could use this sample to estimate the mean, variance and other statistical properties of \(X_{t_a}\). The key point is that in this example we have access to different realizations of the stochastic process, different time series of the process.

Exercise 001

Do this: find the mean of \(X_{t_a = 325}\) and plot a density curve for \(X_{t_a = 325}\). Then change the value of k to 1000 and repeat the computation of the mean and density curve. What do you observe? Does the mean depend on the value of \(t\)? And what about the variance of the time series, does it depend on \(t\)?

About the Role of u in the Code for this Example and the Consequences for Notation.

Note, in particular, that we do not care about or need to know the value \(u_0\); all we really care about is the value of \(A(u)\) and its probability distribution. This shows clearly in the Python code above, because we needed no variable for the value of \(u\). All we had to do was roll the dice. Can you spot the lines of code where we did this dice rolling?

As a consequence, we will often remove \(u\) from our notation and write e.g. \(X_t\) instead of \(X(u, t)\). We will only include \(u\) explicitly when we want to consider different values of \(u\), that is, different time series of the same stochastic process.

We usually work with a single time series. We then use \(x_t\) to refer to the particular value of the random variable \(X_t\) that we have in our time series.

Making Noise

The above example certainly defines a stochastic process: if you roll the dice again, you will get a different time series. But the individual time series themselves are certainly not noisy. In fact, they are individually ruled by a deterministic equation: once \(u\) is fixed, there is nothing random about them.

Next we are going to introduce a different type of stochastic process that is, in comparison, extremely noisy.

White Noise, Different Shades of

The following definition provides a first answer to our pending question of what noise means in the context of time series.

  • A stochastic process \(W_t = W(u,t)\) is white noise if the random variables \[W_{t_1}, W_{t_2}, W_{t_3}, \ldots\] are uncorrelated and identically distributed with mean zero. In particular, they all have the same variance \(\sigma^2\).

  • A white noise stochastic process is called independent white noise if those \(W_{t_i}\) random variables are (not just uncorrelated but) independent.

  • Finally, an independent white noise is gaussian white noise if the \(W_{t_i}\) are iid (independent identically distributed) normal variables \(N(0, \sigma)\).

Let us implement a gaussian white noise process in Python. Let us suppose that the \(W_{t_i}\) are iid normal variables \(N(0, 0.25)\) (mean zero, standard deviation 0.25). In particular this means that each row of the dataset is a sample (of size \(k\)) of that normal variable. So this time we use that to fill the dataset by rows.


W = np.zeros((n, k))

for i in range(n):
    W[i, :] = np.random.default_rng(i).normal(loc = 0, scale = 0.25, size = k)

Wdf = pd.DataFrame(W, columns=["W" + str(i) for i in range(k)])
Wdf.head()    
W0 W1 W2 W3 W4 W5 W6 W7 W8 W9
0 0.031433 -0.033026 0.160106 0.026225 -0.133917 0.090399 0.326000 0.236770 -0.175934 -0.316355
1 0.086396 0.205405 0.082609 -0.325789 0.226339 0.111594 -0.134238 0.145280 0.091143 0.073533
2 0.047263 -0.130687 -0.103266 -0.610367 0.449927 0.286041 -0.081356 0.193452 0.070303 -0.138456
3 0.510230 -0.638916 0.104525 -0.141942 -0.113162 -0.053899 -0.504997 -0.057983 -0.216303 0.830750
4 -0.162948 -0.043679 0.415931 0.164787 -0.410349 -0.001301 -0.155866 0.037158 -0.402047 0.060443

Let us plot the \(k\) time series that we have obtained as realizations of this stochastic process. Now that looks noisy!

Wdf.plot()
plt.show();plt.close()


Combining the two examples

Let us now consider a combined stochastic process \(Y\) which is simply the sum of the two preceding examples. That is, the random variable \(Y_t\) is defined as \[Y_t = X_t + W_t\] for each value of \(t\).

We can quickly visualize what that means in Python.

Y = X + W
Ydf = pd.DataFrame(Y, columns=["Y" + str(i) for i in range(k)])
Ydf.plot()
plt.show();plt.close()  

You can see that the time series in this last plot are beginning to look much more like the ones we saw in the previous session. Each time series (= realization of the process = column of the dataset) displays a signal (the \(X\) part) plus noise (the \(W\) part). Most of our time series models in the next sessions will be similar: they include a signal part and a noise part that we sum to get the complete model structure (in additive models; we multiply them in multiplicative models).


The Ensemble and a Mental Image of Stochastic Processes

After the preceding examples, a mental image of a stochastic process \(Y(u, t)\) as some kind of two-way table or matrix is probably emerging:

\[ \left( \begin{array}{ccccccc} Y(u_1, t = 1)&Y(u_2, t = 1)&Y(u_3, t = 1)&\cdots&Y(u_k, t = 1)&\cdots&\cdots\\ Y(u_1, t = 2)&Y(u_2, t = 2)&Y(u_3, t = 2)&\cdots&Y(u_k, t = 2)&\cdots&\cdots\\ \vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots\\ \vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots\\ Y(u_1, t = t_a)&Y(u_2, t = t_a)&Y(u_3, t = t_a)&\cdots&Y(u_k, t = t_a)&\cdots&\cdots\\ \vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots\\ \vdots&\vdots&\vdots&\vdots&\vdots&\ddots&\ddots \end{array} \right) \] Once again: each column is a time series generated by the process using a fixed value of \(u\), and each row is a sample of the random variable \(Y_{t_a}\) that corresponds to a fixed time instant \(t_a\). Here we use the \(Y(u, t)\) symbol to emphasize the different values of \(u\).

However, be careful with a too literal interpretation. The dots try to suggest that there is a possible column for each point \(u\) in the sample space \(\Omega\), which can be continuous and high dimensional! So a better mental picture at least replaces the idea of rows with a horizontal axis (better yet, a copy of \(\Omega\)). For discrete time processes the idea that a row corresponds to an instant in time is fine, but keep in mind that there exist continuous time stochastic processes; for those we need to replace the vertical dimension of this mental image with a continuous axis as well. And we need an additional axis for the values that the stochastic process generates (the \(Y\) axis). The resulting set of points \[(t, u, Y(t, u))\in \mathbb R\times\Omega\times\mathbb R\] is called the ensemble of the stochastic process and it contains all the possible time series that the stochastic process can possibly generate. This web page contains some helpful visualizations of the ensemble for a stochastic process.

How is this Relevant for Time Series Modeling?

As we have said, our models for time series assume a certain signal/noise structure of the time series that we are trying to first fit and then predict with the model. In particular, we need to be able to recognize noise when we see it! And that means that we need to be able to estimate the statistical properties of a time series to see if it qualifies as noise.

So, how do we do that? In statistics the basis of any estimation is a sample. And here is where things get tricky with time series. The random variables that form the stochastic process correspond to fixed time instants \(t_a\), and a sample of one of them is a row-like section of the stochastic process, as we have seen. But in Machine Learning our sample is a time series, which is a column-like section of the process. To emphasize the problem: with a single time series (the usual situation) we have a sample of \(Y_{t_a}\) of size one! How are we going to get estimates with this, the tiniest of samples?


Stationarity

Strict Stationarity

The notion of stationarity offers a way out of this sampling problem. Remember that when we introduced white noise stochastic processes, one of their key properties was that the random variables that form them are independent or at least uncorrelated. In particular, if I show you a portion of a white noise series it is impossible to tell (to estimate) to which time values it corresponds. White noise series are time-shift invariant.

This idea of time-shift invariance is at the core of the idea of stationarity. More precisely, a stochastic process \(Y_t\) is strictly stationary if the joint distribution of a random vector such as \[\left(Y_{t_{a_1}}, Y_{t_{a_2}},\ldots, Y_{t_{a_s}}\right)\] is invariant under time shifts, that is, it coincides with the joint distribution of \[\left(Y_{t_{a_1} + k}, Y_{t_{a_2} + k},\ldots, Y_{t_{a_s} + k}\right)\] for any shift \(k\) and any vector length \(s\). For example, a gaussian white noise process (being made of iid variables) is a strictly stationary stochastic process.

Why is strict stationarity helpful? Because if the process is stationary then the entire time series can be used as a sample to estimate the (common) statistical properties. For example, to estimate the mean we can simply use as estimator the average of all the available time series values: \[\hat\mu = \bar Y = \dfrac{Y_1+ Y_2 + \cdots + Y_n}{n}\] because the random variables \(Y_1,\ldots,Y_n\) all have the same mean \(\mu\).
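As a quick sanity check of this idea (a sketch using the white noise table Wdf built above), the time average computed along a single realization (a column) is close to the ensemble average computed across realizations at a fixed time (a row); both should be close to the true mean 0, although the row average uses only k = 10 values and is therefore noisier.

# Time average of one realization vs. ensemble average at a fixed time (sketch)
print("time average of column W0:        ", Wdf["W0"].mean())
print("ensemble average of row t_a = 325:", Wdf.iloc[325].mean())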


Autocovariance and Autocorrelation (ACF)

A proper statistical estimate, like \(\hat\mu\) above, is not complete until we understand its variability and sampling distribution. In order to do that, we turn to the variance of \(\hat\mu\): \[ \operatorname{var}(\bar Y) = \operatorname{var}\left(\dfrac{Y_1+ Y_2 + \cdots + Y_n}{n}\right) = \dfrac{1}{n^2}\operatorname{cov}\left(\sum_{t = 1}^n Y_t, \sum_{s = 1}^n Y_s\right) \]

Here we will use a very useful property of covariance: it is bilinear. Thus, if we have linear combinations of (finite variance) random variables \[U = \sum_{j = 1}^m a_j X_j,\qquad V = \sum_{k = 1}^r b_k Y_k\] then \[\operatorname{cov}(U, V) = \sum_{j = 1}^m\sum_{k = 1}^r a_jb_k \operatorname{cov}(X_j, Y_k)\] And besides \[\operatorname{var}(U) = \operatorname{cov}(U, U)\]

With this: \[ \operatorname{var}(\bar Y) = \dfrac{1}{n^2}\sum_{t = 1}^n\sum_{s = 1}^n\operatorname{cov}\left( Y_t, Y_s\right) \] Let us give the covariances that appear here a name; the autocovariance function of the stochastic process is \[ \gamma_Y(s, t) = \operatorname{cov}\left( Y_t, Y_s\right) \] As usual, the problem with covariance is that it depends on the scale of the variables. So we define a normalized, scale-free version. The autocorrelation function (ACF) is: \[ \rho_Y(s, t) = \dfrac{\operatorname{cov}(Y_t, Y_s)}{\sqrt{\operatorname{cov}(Y_t, Y_t)\operatorname{cov}(Y_s, Y_s)}} \] and it is always a number between -1 and 1.

Autocovariance and Autocorrelation (ACF) for Strictly Stationary Processes

Now suppose that the stochastic process is strictly stationary. Then for any time shift \(k\) the distribution of the vectors \((Y_t, Y_s)\) and \((Y_{t + k}, Y_{s + k})\) must be the same. In particular \[ \operatorname{cov}\left( Y_t, Y_s\right) = \operatorname{cov}(Y_{t + k}, Y_{s + k}) \] and this means (take \(k = -t\)) that the autocovariance and the ACF only depend on the difference \(h = s - t\). For a stationary time series we simplify the notation by defining: \[ \gamma_Y(h) = \gamma_Y(0, h) = \gamma_Y(t, t + h), \qquad \rho_Y(h) = \rho_Y(0, h) = \rho_Y(t, t + h) \] Note that the value of \(t\) is irrelevant in these equations.


Again: why is this relevant or useful? Time Lags of a Time Series.

Remember that the problem is that we only have one time series (in fact \(n\) values of the time series): \[y_1, y_2, \ldots, y_n\] But if we can assume stationarity, then we can use lagged (time-shifted) copies of our time series. Given an integer \(h\) we define the corresponding lagged copy of the time series as: \[ z_t = y_{t + h}, \qquad t = 1, \ldots, n - h \] And then the pairs \((y_t, z_t)\) can be used as observations of \((Y_t, Y_{t + h})\) when estimating \[ \rho_Y(h) = \rho_Y(0, h) = \rho_Y(t, t + h)\qquad\text{In particular we always have:}\quad \rho_Y(0) = 1 \] We emphasize that this is possible because stationarity guarantees that the result only depends on \(h\). In the non stationary case a lag of the time series would be useless here.

By the way, now we can answer the other pending question: the presence of autocorrelation is what makes time series special and different from the other regression problems we have seen. For example, we need to take it into account when doing estimates like the one for \(\operatorname{var}(\bar Y)\) above. In standard linear regression models we assume that we have samples formed by iid (in particular independent and therefore uncorrelated) copies of the random variables in the problem.

Lagged Time Series in Python

Let us see how this works in practice using Python. We will use the same Canadian gas example that we used in the previous session, so we will begin by loading and preparing it. Then we will use the shift method in pandas to obtain the first 15 lagged time series out of it.

can_gas = pd.read_csv('../4_1_Introduction_to_Forecasting/tsdata/fpp3/Canadian_gas.csv')
can_gas["Month"] = pd.to_datetime(can_gas["Month"], format="%Y %b")
can_gas.set_index('Month', drop=True, inplace=True)
can_gas.head()
Volume
Month
1960-01-01 1.4306
1960-02-01 1.3059
1960-03-01 1.4022
1960-04-01 1.1699
1960-05-01 1.1161

Now we create the lags, from \(h = 0\) (a copy of the original time series) to \(h = 15\). We use a dictionary comprehension for this. Note that the range uses 16 to get the original series plus 15 lags, and note the way we add names to the columns. We will examine the result and analyze the consequences below.

can_gas_lags = pd.DataFrame({"lag" + str(lag): can_gas["Volume"].shift(lag) for lag in range(16)})
can_gas_lags
lag0 lag1 lag2 lag3 lag4 lag5 lag6 lag7 lag8 lag9 lag10 lag11 lag12 lag13 lag14 lag15
Month
1960-01-01 1.4306 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1960-02-01 1.3059 1.4306 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1960-03-01 1.4022 1.3059 1.4306 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1960-04-01 1.1699 1.4022 1.3059 1.4306 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1960-05-01 1.1161 1.1699 1.4022 1.3059 1.4306 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2004-10-01 17.8268 16.9067 17.6412 17.8783 17.1657 18.0935 17.8475 18.3332 17.7823 19.2439 19.1440 17.7200 17.7920 16.9053 17.5883 17.3844
2004-11-01 17.8322 17.8268 16.9067 17.6412 17.8783 17.1657 18.0935 17.8475 18.3332 17.7823 19.2439 19.1440 17.7200 17.7920 16.9053 17.5883
2004-12-01 19.4526 17.8322 17.8268 16.9067 17.6412 17.8783 17.1657 18.0935 17.8475 18.3332 17.7823 19.2439 19.1440 17.7200 17.7920 16.9053
2005-01-01 19.5284 19.4526 17.8322 17.8268 16.9067 17.6412 17.8783 17.1657 18.0935 17.8475 18.3332 17.7823 19.2439 19.1440 17.7200 17.7920
2005-02-01 16.9441 19.5284 19.4526 17.8322 17.8268 16.9067 17.6412 17.8783 17.1657 18.0935 17.8475 18.3332 17.7823 19.2439 19.1440 17.7200

542 rows × 16 columns

Overlapping Lags and ACF Estimation.

Our plan is to use the columns of this table of lags to estimate the ACF. For example, to estimate \(\rho_Y(1)\) we will compute the correlation between the first two columns of this dataframe. Obviously, the lagged time series contain missing values, because we cannot get out-of-sample lagged values. Thus the lag1 time series has one missing value, the lag2 series has two missing values, and so on. In particular this means that our estimate of \(\rho_Y(1)\) will be based on \(n - 1\) values, the estimate of \(\rho_Y(2)\) on \(n - 2\) values, etc. This implies that the sampling support for the values of \(\rho_Y(h)\) decreases as we consider bigger lags. In practice we will therefore attribute more importance to the estimates of the first values of the ACF (for smaller lags) and we will rely less on estimates based on bigger lags.
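A quick way to see this shrinking overlap (using the can_gas_lags table built above): count() reports the number of non-missing values in each column, which drops by one with each additional lag.

# Number of usable (non-missing) values per lagged column (sketch)
print(can_gas_lags.count())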

ACF Values and Plot in Python.

We could try a manual computation of the ACF values, but getting it right means dealing with those missing values in the right way. Instead, we will use the solution provided by the statsmodels library.

can_gas_ACF = sm.tsa.stattools.acf(can_gas["Volume"])
can_gas_ACF
array([1.        , 0.98630929, 0.97389895, 0.95920413, 0.94504748,
       0.93438939, 0.92468341, 0.9240095 , 0.92392526, 0.92766874,
       0.93077195, 0.93117848, 0.93330983, 0.91975445, 0.90729176,
       0.8922351 , 0.87805474, 0.866886  , 0.85682908, 0.85573537,
       0.85563592, 0.85900748, 0.86164275, 0.86181158, 0.86357829,
       0.85043021, 0.83791706, 0.82291474])
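For intuition, a rough manual estimate (a sketch) can be obtained by correlating the original column of can_gas_lags with each lagged column over their overlapping rows. The values come out close to, but not identical with, the statsmodels output, because acf uses the full-sample mean and an n denominator in its estimator.

# Rough manual ACF estimate from the lag table (sketch); pandas' corr skips missing pairs
manual_acf = [can_gas_lags["lag0"].corr(can_gas_lags[f"lag{h}"]) for h in range(16)]
np.round(manual_acf, 3)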

The library also includes a function for a nice graphical representation of the ACF values, a so-called ACF plot, which appears in the left panel below. In the right panel we show the plot of the original time series together with the lag-6 time series (the lag is set by which_lag in the code):

which_lag = 6
fig, axs = plt.subplots(1, 2, figsize=(16,4), sharex=False, sharey=False)
ax0 = axs[0]
sm.graphics.tsa.plot_acf(can_gas["Volume"], ax=ax0, lags=25, zero=False, title='ACF for the Canadian Gas Example')
ax0.set(ylim=(-1,1), xlabel='Lag', ylabel='acf')

ax1 = axs[1]
ax1 = can_gas["Volume"].head(100).plot(color="blue", label="Original")
ax1 = can_gas_lags[f"lag{which_lag}"].head(100).plot(color="red", label=f"lag{which_lag}")
ax1.set_title('Canadian Monthly Gas Production')
ax1.grid(visible=True, which='both', axis='x')
ax1.grid(visible=False, which='major', axis='y')
ax1.legend()
plt.tight_layout()
plt.show();plt.close()

Initial Interpretation of the Canadian Gas ACF Plot

Recall that the definition of the ACF values is \[\rho_Y(h) = \rho_Y(t, t + h)\] and that \(\rho\) represents correlation. That means that \(\rho\) measures how strong the linear relation between \(Y_t\) and \(Y_{t + h}\) is. The right panel illustrates that in this example this relation is very strong: for most of the time series span, the slopes of the graphs of \(Y_t\) and \(Y_{t + h}\) are very similar. This holds clearly for small values of \(h\), but if you factor in the seasonality, the relation carries on for not so small values of \(h\). This is what the ACF on the left illustrates: the height of each vertical bar indicates the value of \(\rho_Y(h)\) for increasing lags. And we can see in this example correlation values close to 1 even after 25 lags. Note also that there is an uptick in the correlations at lag = 12, corresponding to the seasonal period of the time series.

Note also the shaded area in the ACF plot. Soon we will see that we can use statistics to test the hypothesis that the ACF values are significantly different from 0. An ACF value outside the shaded area is significantly different from 0.

Finally note that you can select the number of lags to include in the ACF plot, and that we have selected zero=False to exclude the ACF at lag = 0, which is always equal to 1 and sometimes messes with the vertical scale of the plot.

Exercise 02

Set which_lag to 12 in the code above and repeat the plots.

Lag Plots

Lag plots are scatterplots of \(Y_t\) vs \(Y_{t + h}\) for different values of \(h\). They are particularly interesting for smaller values of \(h\) and for small multiples of the seasonal period. They offer a different visualization of the information in the ACF. Below we include the lag plots for all the lags included in an extended version of can_gas_lags. Note that the correlations are very strong, and in the case of lags 12 and 24 they are even stronger than those of their close neighbors.

can_gas_lags = pd.DataFrame({"lag" + str(lag): can_gas["Volume"].shift(lag) for lag in range(25)})
fig, axs = plt.subplots(5, 5, figsize=(12,20), sharex=True, sharey=True)
for (i, ax) in enumerate(axs.ravel()):
    sns.regplot(x = can_gas_lags.iloc[:, i], y = can_gas_lags.iloc[:, 0], order=1, ax = ax, 
    line_kws=dict(linewidth=1, color="yellow"), scatter_kws=dict(s=1)) 
    lim = ax.get_xlim()
    ax.plot(lim, lim, 'k--', alpha=.25, zorder=-10)
    ax.set(xlim=lim, ylim=lim, title=f'lag {i}', aspect='equal')
plt.tight_layout()
plt.show();plt.close()


ACF for (Gaussian) White Noise

The mean of a white noise process \(W_t\) is \(\mu_W = 0\). The autocovariance function is: \[ \gamma_W(s, t) = \begin{cases} \sigma_W^2& s = t\\[3mm] 0 & s \neq t \end{cases} \] and so the ACF equals 1 at lag \(h = 0\) and is \(0\) for any lag \(h > 0\). Let us plot the ACF for one of the gaussian white noise series in the columns of the dataset Wdf from the example above (any column will do). As you can see, all the ACF values are very small, almost all of them inside the shaded region (this is a test at the 95% confidence level, so we expect some ACF values to appear significant at random; but even in those cases the values are anyway very small). This behavior is of course to be expected, because the random variables \(W_t\) and \(W_{t+h}\) in a gaussian white noise time series are not only uncorrelated, but also independent. A direct consequence of the definition is that every gaussian white noise process is strictly stationary.

fig, axs = plt.subplots(1, 2, figsize=(16,4), sharex=False, sharey=False)

ax0 = axs[0]
sm.graphics.tsa.plot_acf(Wdf["W0"], ax=ax0, lags=25, zero=False, title='ACF for Gaussian White Noise')
ax0.set(ylim=(-1,1), xlabel='Lag', ylabel='acf')

ax1 = axs[1]
ax1 = Wdf["W0"].plot(color="blue", label="Original")
ax1.set_title('White Noise')
ax1.grid(visible=True, which='both', axis='x')
ax1.grid(visible=False, which='major', axis='y')
plt.tight_layout()
plt.show();plt.close()

Weak Stationarity.

As we have seen, strict stationarity opens the possibility of using the time series values to make estimates of the statistical properties of the underlying stochastic process, even though we only have one realization of that process. However, checking whether an observed time series comes from a strictly stationary process is not easy at all. We can, however, relax the requirements and still get the properties we need to serve as the basis for modeling.

A stochastic process (and its realizations as time series) is called weakly stationary if:

  1. Its mean and variance are constant (they do not depend on \(t\)).
  2. The autocovariance function \(\gamma(t, s)\) only depends on the difference \(|s - t|\).

Every strictly stationary process is clearly also weakly stationary. The advantage of this notion of weak stationarity is that it is simpler to check when it does not hold, as we will see below. Because of this, from now on the word stationary will be used to mean weakly stationary. If we mean strict stationarity we will make it explicit.
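One simple informal check (our own sketch, not a formal test, and not a substitute for what follows) is to split a series into a few blocks and compare the block means and variances; for a weakly stationary series they should be roughly constant. Here we try it on one white noise realization from Wdf.

# Block means and variances for one white noise realization (sketch)
blocks = np.array_split(Wdf["W0"].to_numpy(), 5)
for j, b in enumerate(blocks):
    print(f"block {j}: mean = {b.mean():+.3f}, variance = {b.var():.3f}")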


Detecting Non Stationarity

Trend, Stable Variance and Stationarity.

If a time series has a trend, that means that the average value (the mean) of \(Y_t\) and \(Y_{t + h}\) will not be the same for certain values of \(h\). In particular this means that a series with a trend cannot possibly be stationary. The presence of a trend can be detected in the ACF because the values do not approach zero, or approach zero very slowly (with a roughly linear decay). We have already seen an example in the ACF for the Canadian gas time series.

If the variance of a time series clearly increases or decreases in different sections of the time series, then, just as in the case of a non-constant mean, the variance of \(Y_t\) is not constant and so the time series cannot possibly be stationary. The following example of the recordings of a seismograph during an earthquake (Example 1.7 from (Shumway and Stoffer 2017)) is a clear case of non-constant variance, as the amplitude of the oscillations recorded by the seismometer increases during the earthquake. The ACF of this series exhibits an oscillating (but non-periodic) slow decay.

EQ5 = pd.read_csv('./EQ5.csv')
fig, axs = plt.subplots(1, 2, figsize=(16,4), sharex=False, sharey=False)
ax0 = axs[0]
sm.graphics.tsa.plot_acf(EQ5["seismograph"], ax=ax0, lags=100, zero=False, title='ACF for the Earthquake Example')
ax0.set(ylim=(-1,1), xlabel='Lag', ylabel='acf')

ax1 = axs[1]
ax1 = EQ5["seismograph"].plot(color="blue", label="Original")
ax1.set_title('Original Time Series')
ax1.grid(visible=True, which='both', axis='x')
ax1.grid(visible=False, which='major', axis='y')
plt.tight_layout()
plt.show();plt.close()


Random Walks.

This is another important example of a time series that appears in many applications. Here we will use it to see yet another example of an ACF.

A stochastic process \(Y_t\) is a random walk if it satisfies the equation \[Y_t = Y_{t -1} + W_t, \qquad\text{where }W_t\text{ is white noise.}\] If we assume that \(Y_0 = 0\), note that \[Y_1 = W_1,\quad Y_2 = W_1 + W_2\qquad\text{ and, in general, }\quad Y_t = \sum_{i=1}^t W_i\] Thus a random walk is a cumulative sum of white noise values. This in turn means that the autocovariance of a random walk is: \[ \gamma_Y(s, t) = \min(s, t)\,\sigma_w^2 \] because in \(\operatorname{cov}\left(\sum_{i=1}^s W_i, \sum_{j=1}^t W_j\right)\) only the \(\min(s, t)\) terms with \(i = j\) contribute, each with value \(\sigma_w^2\), the variance of the white noise process. This shows that the autocovariance depends on the particular values of \(t\) and \(s\), and therefore a random walk is not stationary.
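As a sanity check of the formula \(\gamma_Y(s, t) = \min(s, t)\,\sigma_w^2\), here is a small simulation sketch (our own illustration, assuming \(Y_0 = 0\) and iid normal steps with \(\sigma_w = 0.25\)): the empirical covariance across many simulated walks should be close to the theoretical value.

# Empirical check of gamma_Y(s, t) = min(s, t) * sigma_w^2 for a random walk (sketch)
rng_rw = np.random.default_rng(0)
sigma_w = 0.25
steps = rng_rw.normal(0, sigma_w, size=(5000, 300))   # 5000 realizations, 300 steps each
walks = steps.cumsum(axis=1)                          # column t - 1 holds Y_t
s, t = 100, 250
emp_cov = np.cov(walks[:, s - 1], walks[:, t - 1])[0, 1]
print(emp_cov, min(s, t) * sigma_w**2)                # both should be close to 6.25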

Random Walks in Python.

Let us simulate a random walk in Python. The only thing we need to do is to take a white noise time series (we will use the W1 column from the Wdf example above) and then use the recursive equation above to generate the values \(y_1, y_2, y_3, \ldots\) as a cumulative sum of the white noise series values.

rng = np.random.default_rng(2024)
RW = Wdf["W1"].cumsum()
RWdf = pd.DataFrame({'RW':RW})

And now let us plot the time series and its ACF. Note again this pattern of slow decay in the ACF values:

fig, axs = plt.subplots(1, 2, figsize=(16,4), sharex=False, sharey=False)

ax0 = axs[0]
sm.graphics.tsa.plot_acf(RWdf["RW"], ax=ax0, lags=25, zero=False, title='ACF for Random Walk')
ax0.set(ylim=(-1,1), xlabel='Lag', ylabel='acf')

ax1 = axs[1]
ax1 = RWdf["RW"].plot(color="blue")
ax1.set_title('Random Walk')
ax1.grid(visible=True, which='both', axis='x')
ax1.grid(visible=False, which='major', axis='y')
plt.tight_layout()
plt.show();plt.close()


Seasonality and Stationarity

Is a seasonal time series stationary? As we will see, this is not a simple question. Earlier we saw an example, the sinusoidal stochastic process \[X_t = 3 \sin(2\pi t/500 + A)\] where \(A\) is uniformly distributed in \([0, 2\pi]\). And in the exercise following that example we tried to gain some intuition that the mean and variance of the time series do not depend on \(t\). For this process stationarity can be established formally (see e.g. this online reference from the EE2S31 Signal Processing course at TU Delft, page 28). However, the ACF exhibits a pattern that we would normally associate with non-stationarity.

fig, axs = plt.subplots(1, 2, figsize=(16,4), sharex=False, sharey=False)

ax0 = axs[0]
sm.graphics.tsa.plot_acf(Xdf["X2"], ax=ax0, lags=1000, zero=False, title='ACF for Sinusoidal with Random Uniform Phase')
ax0.set(ylim=(-1,1), xlabel='Lag', ylabel='acf')

ax1 = axs[1]
ax1 = Xdf["X2"].plot(color="blue")
ax1.set_title('Original Time Series')
ax1.grid(visible=True, which='both', axis='x')
ax1.grid(visible=False, which='major', axis='y')
plt.tight_layout()
plt.show();plt.close()

Let us consider a second example of a sinusoidal stochastic process \(Z_t\) given by \[Z_t = 3 \sin(t/50 + B)\] but this time let us assume that the phase \(B\) is a normal random variable with mean 0 and standard deviation 0.5. We repeat the same steps, first plotting a large number of realizations of this process (many time series that differ only in the roll of the dice).

n = 1500
k = 1000

Z = np.zeros((n, k))
rng = np.random.default_rng(2024)

Ts = np.arange(n)

for i in range(k):
    B = rng.normal(loc=0, scale=0.5, size=1)   # the phase B of this realization
    Z[:, i] = 3 * np.sin(Ts/50 + B)

Zdf = pd.DataFrame(Z, columns=["Z" + str(i) for i in range(k)])
Zdf.head()
Z0 Z1 Z2 Z3 Z4 Z5 Z6 Z7 Z8 Z9 ... Z990 Z991 Z992 Z993 Z994 Z995 Z996 Z997 Z998 Z999
0 1.476112 2.195401 1.627373 -1.402843 -1.924380 0.100776 1.252454 0.755556 2.359540 1.099995 ... -2.423743 1.482323 0.223951 0.768504 -1.446321 -1.550680 -0.754521 0.762182 -0.305112 -0.602334
1 1.528048 2.235851 1.677449 -1.349530 -1.877969 0.160718 1.306721 0.813467 2.396120 1.155592 ... -2.387902 1.534187 0.283735 0.826344 -1.393469 -1.499010 -0.696302 0.820057 -0.245366 -0.543439
2 1.579372 2.275406 1.726854 -1.295677 -1.830807 0.220595 1.360465 0.871052 2.431742 1.210727 ... -2.351107 1.585438 0.343405 0.883854 -1.340059 -1.446741 -0.637805 0.877604 -0.185522 -0.484327
3 1.630065 2.314051 1.775569 -1.241306 -1.782912 0.280385 1.413665 0.928290 2.466391 1.265378 ... -2.313371 1.636054 0.402938 0.941010 -1.286113 -1.393893 -0.579053 0.934800 -0.125604 -0.425021
4 1.680106 2.351770 1.823573 -1.186438 -1.734304 0.340062 1.466300 0.985155 2.500053 1.319523 ... -2.274710 1.686016 0.462310 0.997790 -1.231653 -1.340487 -0.520070 0.991622 -0.065636 -0.365545

5 rows × 1000 columns

And let us again plot the different realizations of \(Z_t\) as time series:

fig, ax = plt.subplots()
Zdf.plot(ax=ax, zorder=-10)
t_a = 325
plt.axvline(x = t_a, ls="--")
plt.scatter(x=[t_a]*k, y=Z[t_a, :], zorder = 1)
if (k > 10):
    ax.get_legend().remove()
plt.show();plt.close()

Exercise 003

In this case, does the mean of the time series depend on the value of \(t\)? And what about the variance?

In this case it can also be shown formally that the process is not stationary. Now let us look at the corresponding ACF:

fig, axs = plt.subplots(1, 2, figsize=(16,4), sharex=False, sharey=False)

ax0 = axs[0]
sm.graphics.tsa.plot_acf(Zdf["Z2"], ax=ax0, lags=1000, zero=False, title='ACF for Sinusoidal with Normal Random Phase')
ax0.set(ylim=(-1,1), xlabel='Lag', ylabel='acf')

ax1 = axs[1]
ax1 = Zdf["Z2"].plot(color="blue")
ax1.set_title('Original Time Series')
ax1.grid(visible=True, which='both', axis='x')
ax1.grid(visible=False, which='major', axis='y')
plt.tight_layout()
plt.show();plt.close()

Detecting Stationarity in Time Series Samples

As you can see, it looks pretty similar! The take-home message from these examples is that the ACF is helpful in making us suspect non-stationarity, but it is not a formal proof. In fact, it is important to understand that the very same sinusoidal trace could have been produced by either mechanism: once the phase has been drawn (from the uniform or from the normal dice), a single realization gives no way to tell which distribution produced it. As we will see in the next session, we will use models that assume that the time series originates from a stationary process. But ultimately, this is an assumption. We will use the results in this session as EDA and preprocessing to remove the non-stationary behavior, so that the time series we feed into our models looks compatible with a stationary generating process. Whether this modeling strategy is right or not is left for the performance evaluation step and will be discussed further in the next session.


Hypothesis Tests for Stationarity

A number of hypothesis tests have been devised to test for stationarity. However, as with the ACF plots, they must be used with caution, as they can even lead to apparently contradictory conclusions. Two commonly used tests are:

  • ADF Test: the null hypothesis is that the time series is not stationary and not seasonal.
  • KPSS Test: the null hypothesis is that the time series is stationary and not seasonal.

Both tests are implemented in statsmodels.tsa; we will apply both of them to some of our examples.

For the random walk (which we know for certain to be non-stationary!) the resulting p-values confirm the non-stationarity of the series: the large ADF p-value means we cannot reject its null (non-stationarity), and the small KPSS p-value rejects its null (stationarity).

RW_adf = adfuller(RWdf["RW"])
RW_kpss = kpss(RWdf["RW"])
RW_adf[1], RW_kpss[1]
/var/folders/v_/pl7__6kj0nx_lfsnm4_g91q80000gn/T/ipykernel_30156/596027656.py:2: InterpolationWarning: The test statistic is outside of the range of p-values available in the
look-up table. The actual p-value is smaller than the p-value returned.

  RW_kpss = kpss(RWdf["RW"])
(0.791832704806252, 0.01)

For the stationary sinusoidal (uniform random phase) the p-value of the ADF tells us to reject the null. This means that the series is either stationary or seasonal, the latter being true. But the KPSS test fails to reject its null, even though we know it to be false because of the (strong) seasonality.

X_2_adf = adfuller(Xdf["X2"])
X_2_kpss = kpss(Xdf["X2"], nlags="legacy")
X_2_adf[1], X_2_kpss[1]
/var/folders/v_/pl7__6kj0nx_lfsnm4_g91q80000gn/T/ipykernel_30156/2546339701.py:2: InterpolationWarning: The test statistic is outside of the range of p-values available in the
look-up table. The actual p-value is greater than the p-value returned.

  X_2_kpss = kpss(Xdf["X2"], nlags="legacy")
(0.0, 0.1)

In the case of the non stationary sinusoid (normal random phase) the ADF again rejects the null, but the KPSS does not reject it, even though the series is not stationary and it is seasonal.

Z_adf = adfuller(Zdf["Z2"] )
Z_kpss = kpss(Zdf["Z2"], nlags="legacy")
Z_adf[1], Z_kpss[1]
/var/folders/v_/pl7__6kj0nx_lfsnm4_g91q80000gn/T/ipykernel_30156/2005724078.py:2: InterpolationWarning: The test statistic is outside of the range of p-values available in the
look-up table. The actual p-value is greater than the p-value returned.

  Z_kpss = kpss(Zdf["Z2"], nlags="legacy")
(0.0, 0.1)
Exercise 004

What happens with the earthquake data?


Transformations to Achieve Stationarity

Dealing with trend and seasonality

Introduction

A series with a trend or with seasonality cannot be stationary. In the case of a trend, the mean of the time series is not constant. In the case of seasonality, the statistical properties (mean and autocovariance) typically depend on the position within the seasonal cycle. However, some important forecasting model families are based on the assumption that the time series is stationary. This is the case for ARIMA models, for example. Therefore, often the first step in the analysis of a time series is to remove the trend and the seasonality. In order to do that we introduce the important concept of differencing.

Differencing

The backshift or lag operator \(B\) is defined by: \[BY_t = Y_{t - 1}\] The powers of the backshift operator express further lags of the time series. For example: \[B^3Y_t = Y_{t - 3}\] The operation \[ (1 - B)Y_t = Y_t - Y_{t-1} \] is a (first-order regular) difference of the time series. Similarly, for a seasonal time series with period \(p\) we define the first-order seasonal difference operator as: \[ (1 - B^p)Y_t = Y_t - Y_{t - p} \] Note of course that when we apply these operations to a time series, the first values of the difference will be missing, because they require past values that are not present in our data.

In Python we can use diff to obtain these differences of a time series. For example, for the Canadian gas example (recall, seasonality 12) the first-order regular and seasonal differences are obtained with:

can_gas["Volume_d1"] = can_gas["Volume"].diff()
can_gas["Volume_d12"] = can_gas["Volume"].diff(periods=12)
can_gas.head(18)
Volume Volume_d1 Volume_d12
Month
1960-01-01 1.4306 NaN NaN
1960-02-01 1.3059 -0.1247 NaN
1960-03-01 1.4022 0.0963 NaN
1960-04-01 1.1699 -0.2323 NaN
1960-05-01 1.1161 -0.0538 NaN
1960-06-01 1.0113 -0.1048 NaN
1960-07-01 0.9660 -0.0453 NaN
1960-08-01 0.9773 0.0113 NaN
1960-09-01 1.0311 0.0538 NaN
1960-10-01 1.2521 0.2210 NaN
1960-11-01 1.4419 0.1898 NaN
1960-12-01 1.5637 0.1218 NaN
1961-01-01 1.7450 0.1813 0.3144
1961-02-01 1.5835 -0.1615 0.2776
1961-03-01 1.6770 0.0935 0.2748
1961-04-01 1.5155 -0.1615 0.3456
1961-05-01 1.4051 -0.1104 0.2890
1961-06-01 1.2464 -0.1587 0.2351
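A quick check (sketch): diff(k) in pandas is exactly the subtraction of the k-shifted series, i.e. the operator \((1 - B^k)\) applied to the series.

# diff(12) coincides with subtracting the 12-shifted series (sketch)
check_d12 = can_gas["Volume"] - can_gas["Volume"].shift(12)
print(check_d12.equals(can_gas["Volume_d12"]))   # expected: True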

Let us plot the original time series and also the differenced ones.

fig, axs = plt.subplots(3, 1, figsize=(10,8), sharex=False, sharey=False)

ax0 = axs[0]
can_gas["Volume"].plot(color="blue", ax=ax0)
ax0.set_title('Canadian Monthly Gas Production')

ax1 = axs[1]
can_gas["Volume_d1"].plot(color="blue", ax=ax1)
ax1.set_title('First Regular Difference')

ax2 = axs[2]
can_gas["Volume_d12"].plot(color="blue", ax=ax2)
ax2.set_title('First Seasonal Difference')

plt.tight_layout()
plt.show();plt.close()

Exercise 005

Do the same with the other time series examples in this session, but only take seasonal differences in seasonal time series!

Differencing to remove trends and seasonality

After some experimenting, you should have noticed that:

  • Regular differencing removes (or at least greatly reduces) the trend of a time series. If the series is seasonal, it does not remove the seasonality.
  • Seasonal differencing removes (or at least greatly reduces) the seasonality of a seasonal time series. And if the series has a trend, seasonal differencing may also reduce or even remove the trend.

Therefore, when we have a time series that exhibits seasonality and/or a trend we will proceed as follows:

  • If the series has seasonality we apply seasonal differencing to remove it. If it also has a trend, we then examine whether the seasonal differencing has removed the trend as well. If it has not, we apply regular differencing to remove the trend.
  • If the series has a trend but no seasonality, we only apply regular differencing to remove it.

Keep also in mind that sometimes a single difference (seasonal or regular) may not be enough to do the job. In some cases we need to apply a second difference of a certain kind to remove a seasonal behavior or a trend. But we will never apply a third difference of the same kind, as this leads to overly complex model dynamics.

Australian Beer Example

The data in aus_production.csv contains quarterly production of selected commodities in Australia from 1956 to 2010. This data is contained in the R library tsibbledata (https://cran.r-project.org/web/packages/tsibbledata/tsibbledata.pdf) and is also used in the book (Hyndman and Athanasopoulos 2021). Let us begin by reading the data.

ausprod  = pd.read_csv('../4_1_Introduction_to_Forecasting/tsdata/tsibbledata/aus_production.csv')
ausprod
Quarter Beer Tobacco Bricks Cement Electricity Gas
0 1956 Q1 284 5225.0 189.0 465 3923 5
1 1956 Q2 213 5178.0 204.0 532 4436 6
2 1956 Q3 227 5297.0 208.0 561 4806 7
3 1956 Q4 308 5681.0 197.0 570 4418 6
4 1957 Q1 262 5577.0 187.0 529 4339 5
... ... ... ... ... ... ... ...
213 2009 Q2 398 NaN NaN 2160 57471 238
214 2009 Q3 419 NaN NaN 2325 58394 252
215 2009 Q4 488 NaN NaN 2273 57336 210
216 2010 Q1 414 NaN NaN 1904 58309 205
217 2010 Q2 374 NaN NaN 2401 58041 236

218 rows × 7 columns

To parse the quarter data correctly we need to do a string replacement:

ausprod["Quarter"] =  ausprod["Quarter"].str.replace(' ', '-')
ausprod.head()
Quarter Beer Tobacco Bricks Cement Electricity Gas
0 1956-Q1 284 5225.0 189.0 465 3923 5
1 1956-Q2 213 5178.0 204.0 532 4436 6
2 1956-Q3 227 5297.0 208.0 561 4806 7
3 1956-Q4 308 5681.0 197.0 570 4418 6
4 1957-Q1 262 5577.0 187.0 529 4339 5
ausprod["Date"] = pd.to_datetime(ausprod["Quarter"])
ausprod.head()
Quarter Beer Tobacco Bricks Cement Electricity Gas Date
0 1956-Q1 284 5225.0 189.0 465 3923 5 1956-01-01
1 1956-Q2 213 5178.0 204.0 532 4436 6 1956-04-01
2 1956-Q3 227 5297.0 208.0 561 4806 7 1956-07-01
3 1956-Q4 308 5681.0 197.0 570 4418 6 1956-10-01
4 1957-Q1 262 5577.0 187.0 529 4339 5 1957-01-01
ausprod.set_index('Date', drop=True, inplace=True)
ausprod.head()
Quarter Beer Tobacco Bricks Cement Electricity Gas
Date
1956-01-01 1956-Q1 284 5225.0 189.0 465 3923 5
1956-04-01 1956-Q2 213 5178.0 204.0 532 4436 6
1956-07-01 1956-Q3 227 5297.0 208.0 561 4806 7
1956-10-01 1956-Q4 308 5681.0 197.0 570 4418 6
1957-01-01 1957-Q1 262 5577.0 187.0 529 4339 5

We will focus on the time series corresponding to beer production. We can visualize the time series and its ACF. Both plots show a clear trend and seasonality with period 4.

df_ts = ausprod
var = "Beer"

fig, axs = plt.subplots(1, 2, figsize=(16,4), sharex=False, sharey=False)

ax0 = axs[0]
sm.graphics.tsa.plot_acf(df_ts[var], ax=ax0, lags=25, zero=False, title='ACF')
ax0.set(ylim=(-1,1), xlabel='Lag', ylabel='acf')

ax1 = axs[1]
ax1 = df_ts[var].plot(color="blue")
ax1.set_title('Australia Beer Production')
ax1.grid(visible=True, which='both', axis='x')
ax1.grid(visible=False, which='major', axis='y')
plt.tight_layout()
plt.show();plt.close()

As described, we first take a seasonal difference:

ausprod["Beer_sd4"] = ausprod["Beer"].diff(periods=4)
ausprod.head(12)
Quarter Beer Tobacco Bricks Cement Electricity Gas Beer_sd4
Date
1956-01-01 1956-Q1 284 5225.0 189.0 465 3923 5 NaN
1956-04-01 1956-Q2 213 5178.0 204.0 532 4436 6 NaN
1956-07-01 1956-Q3 227 5297.0 208.0 561 4806 7 NaN
1956-10-01 1956-Q4 308 5681.0 197.0 570 4418 6 NaN
1957-01-01 1957-Q1 262 5577.0 187.0 529 4339 5 -22.0
1957-04-01 1957-Q2 228 5651.0 214.0 604 4811 7 15.0
1957-07-01 1957-Q3 236 5317.0 227.0 603 5259 7 9.0
1957-10-01 1957-Q4 320 6152.0 222.0 582 4735 6 12.0
1958-01-01 1958-Q1 272 5758.0 199.0 554 4608 5 10.0
1958-04-01 1958-Q2 233 5641.0 229.0 620 5196 7 5.0
1958-07-01 1958-Q3 237 5802.0 249.0 646 5609 8 1.0
1958-10-01 1958-Q4 313 6265.0 234.0 637 4977 6 -7.0

Let us visualize the result. Note that we remove the missing data before plotting the ACF:

df_ts = ausprod
var = "Beer_sd4"

fig, axs = plt.subplots(1, 2, figsize=(16,4), sharex=False, sharey=False)

ax0 = axs[0]
sm.graphics.tsa.plot_acf(df_ts[var].dropna(), ax=ax0, lags=25, zero=False, title='ACF of Seasonally Differenced Series')
ax0.set(ylim=(-1,1), xlabel='Lag', ylabel='acf')

ax1 = axs[1]
ax1 = df_ts[var].plot(color="blue")
ax1.set_title('Australia Beer Production, after seasonal difference')
ax1.grid(visible=True, which='both', axis='x')
ax1.grid(visible=False, which='major', axis='y')
plt.tight_layout()
plt.show();plt.close()

The plots, the ACF in particular, show that we have managed to remove the seasonality. But since there is still some hint of a trend, we will now apply a regular difference (to the already seasonally differenced data) and plot the result again.

ausprod["Beer_sd4_d1"] = ausprod["Beer_sd4"].diff()

df_ts = ausprod
var = "Beer_sd4_d1"

fig, axs = plt.subplots(1, 2, figsize=(16,4), sharex=False, sharey=False)

ax0 = axs[0]
sm.graphics.tsa.plot_acf(df_ts[var].dropna(), ax=ax0, lags=25, zero=False, title='ACF of Seasonally and Regularly Differenced Series')
ax0.set(ylim=(-1,1), xlabel='Lag', ylabel='acf')

ax1 = axs[1]
ax1 = df_ts[var].plot(color="blue")
ax1.set_title('Australia Beer Production, after Differencing (both seasonal and regular)')
ax1.grid(visible=True, which='both', axis='x')
ax1.grid(visible=False, which='major', axis='y')
plt.tight_layout()
plt.show();plt.close()

As we will see in the next session, this resulting ACF with only a few significant initial terms is a good starting point for modeling with SARIMA models. Let us check the stationarity with the ADF and KPSS tests:

ausbeer_diff = ausprod["Beer_sd4_d1"].dropna()
ausbeer_adf = adfuller(ausbeer_diff)
ausbeer_kpss = kpss(ausbeer_diff)
ausbeer_adf[1], ausbeer_kpss[1]
/var/folders/v_/pl7__6kj0nx_lfsnm4_g91q80000gn/T/ipykernel_30156/882291498.py:3: InterpolationWarning: The test statistic is outside of the range of p-values available in the
look-up table. The actual p-value is greater than the p-value returned.

  ausbeer_kpss = kpss(ausbeer_diff)
(9.951622692227455e-11, 0.1)

ADF rejects the null of non-stationarity (and non-seasonality) and KPSS does not reject the null of stationarity (and non-seasonality). Therefore they agree and give us another confirmation that the differenced time series looks like the result of a stationary process.

Automatic differencing order determination

The pmdarima library contains two functions, ndiffs and nsdiffs, that indicate the recommended number of regular and seasonal differences, respectively, to apply to reach this stationary behavior. If you apply them to the beer production time series you will see that the recommendation agrees with what we have done.

Always take this recommendation with skepticism! The best way to proceed is to experiment with different orders of differencing and check the results with the ACF plot. Train your eyes and your good judgement and trust them more than you trust these automated procedures!

ndiffs, nsdiffs = pmd.arima.ndiffs, pmd.arima.nsdiffs

print(f"Recommended number of regular differences: {ndiffs(ausprod['Beer'])}")
print(f"Recommended number of seasonal differences: {nsdiffs(ausprod['Beer'], 4)}") 
Recommended number of regular differences: 1
Recommended number of seasonal differences: 1
Choosing the right seasonality is critical

One of the first steps in time series EDA is the identification of the seasonal period of the time series, if seasonal behavior is present. This is critical because the seasonality determines the period of the seasonal difference. If we choose the wrong period, the differencing will not remove the seasonality. Even worse, a spurious dynamic may be introduced in the time series that will severely affect the performance of the models you try to fit to the data. Double check the seasonal period of the time series before proceeding further with the analysis!

The sampling frequency of the time series is a good starting point to determine the possible seasonal periods. We already suggested that the table in Section 2.1 of (Hyndman and Athanasopoulos 2021) can be used as a guide for the relation between the type of observations (yearly, monthly, weekly, daily, hourly) and the seasonal period that we can expect to find in our data.

Box-Cox transformations to stabilize the variance

Introduction to the Box-Cox family of transformations

A series whose variance changes over time is not stationary. In order to stabilize the variance we can use different transformations. One of the most popular is the Box-Cox transformation, defined as: \[ Y_t^{(\lambda)} = \begin{cases} \dfrac{Y_t^\lambda - 1}{\lambda} & \text{if }\lambda \neq 0\\[3mm] \log(Y_t) & \text{if }\lambda = 0 \end{cases} \] Note in particular that:

  • For \(\lambda = 0\) the transformation is the logarithm.
  • For \(\lambda = 1\) the transformation is essentially the identity (shifted by one unit).
  • For \(\lambda = 0.5\) the transformation is essentially the square root.
  • For \(\lambda = 2\) the transformation is essentially the square.

Therefore the Box-Cox family of transformations includes the logarithm and power transformations. The goal of this transformation is to make the variance of the time series homogeneous through the time span of the series.

The parameter \(\lambda\) that best achieves this is estimated by maximizing the log-likelihood of the transformed data, and we will use the sktime implementation of the Box-Cox transformation to do this.
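Before turning to sktime, here is a minimal sketch of the formula itself (the helper name box_cox and the toy values are our own; the input must be strictly positive). scipy.stats.boxcox also estimates \(\lambda\) by maximum likelihood, which is the same idea described above.

# Direct implementation of the Box-Cox formula (sketch)
import scipy.stats as stats

def box_cox(y, lam):
    y = np.asarray(y, dtype=float)
    return np.log(y) if lam == 0 else (y**lam - 1) / lam

y_demo = np.array([1.0, 2.0, 5.0, 10.0, 50.0])
print(box_cox(y_demo, 0.0))         # the logarithm
print(box_cox(y_demo, 1.0))         # essentially the identity (shifted by one unit)
_, lam_mle = stats.boxcox(y_demo)   # maximum-likelihood estimate of lambda
print(lam_mle)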

Let us see a famous example of a time series whose variance increases with time: the airline passengers data. We will first plot the time series. The time plot clearly shows an increasing variance.

y = load_airline()
y.plot()
plt.show();plt.close()  

Next we apply the BoxCoxTransformer from sktime to the time series. This transformation is implemented as a transformer, with the scikit-learn style interface that should be familiar to you by now.

We apply the transformation to the time series and plot the result. We have upscaled the transformed time series by a factor of 20 to make the plot more useful. Note how the variance becomes more stable after the transformation.

boxcox_transformer = BoxCoxTransformer()
y_bxcx = 20 * boxcox_transformer.fit_transform(y)
y.plot(label="Original")
y_bxcx.plot(label="Box-Cox Transformed (rescaled)")
plt.legend()
plt.show();plt.close()

After fitting the transformer we can access the best value of \(\lambda\) with the lambda_ attribute. Recall that if this value is close to 1, then the transformation is essentially the identity and therefore the variance of the time series is already stable.

boxcox_transformer.lambda_
0.14802254856840585
How do we decide whether to apply a Box-Cox transformation?

The first thing that you should always do is plot the time series and check if the variance seems to change through time. If it does, then you should consider a Box-Cox transformation. Once again: trust your eyes and your judgement more than any automated procedure!

However, in some cases we are not certain about whether we need to apply a Box-Cox transformation. In those cases, you can use some tests to check if the variance is stable. The most common test is the Breusch-Pagan test. This test is implemented in the statsmodels library and we will apply it below to the airline passengers time series. The null hypothesis of this test is that the variance is constant. Therefore if the p-value is small, then we reject the null hypothesis and conclude that the variance is not constant.

To apply the test we need to fit a linear regression model to the time series and then apply the test to the residuals. We will use ols from statsmodels to fit the model. First we create a pandas dataframe with the time series and an integer time index, using standard names to make the code reusable.

Hide the code
y = y.reset_index(drop=True).reset_index()
y.columns = ['time', 'value']
y
time value
0 0 112.0
1 1 118.0
2 2 132.0
3 3 129.0
4 4 121.0
... ... ...
139 139 606.0
140 140 508.0
141 141 461.0
142 142 390.0
143 143 432.0

144 rows × 2 columns

Now we fit the linear regression

Hide the code
formula = 'value ~ time'
olsr = ols(formula, y).fit()

And we apply the Breusch-Pagan test to the residuals and extract the p-value:

Hide the code
_, breusch_pagan_p_value, _, _ = sms.het_breuschpagan(olsr.resid, olsr.model.exog)
breusch_pagan_p_value # If the p-value is less than 0.05, then the residuals are heteroskedastic
4.5590018568832897e-07

As expected for the airline passengers time series, the p-value is very small, indicating that the variance is not constant. A way to see this visually is to plot the absolute values of the residuals of the linear regression model against time. We do this below.

Hide the code
sns.scatterplot(x = y.index, y = np.abs(olsr.resid), label="Absolute OLS residuals vs time")
plt.show();plt.close()

The funnel-like shape of the residuals plot is a clear indication that the variance is not constant. There are other tests that can be used to check for this, but the Breusch-Pagan test is the most common one. Check this Medium post by V. Cerqueira for further details (Cerqueira is one of the authors of (Cerqueira and Roque 2024)). You can also play with the online interactive tool in Section 3.1 of Forecasting: Principles and Practice (Hyndman and Athanasopoulos 2021).
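One of those alternatives is White's test, also available in statsmodels. The sketch below (our addition, reusing the same OLS fit and regressors as above) applies it in exactly the same way; the null hypothesis is again constant variance.

Hide the code
# White's test for heteroskedasticity (alternative to Breusch-Pagan).
_, white_p_value, _, _ = sms.het_white(olsr.resid, olsr.model.exog)
white_p_value  # a small p-value again points to non-constant variance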

Resampling a Time Series

Resampling

Even though resampling is not directly connected with the notion of stationarity, it is a frequently used tool in the analysis of time series and you may come across it in your work. Resampling is the process of changing the sampling frequency of a time series: for example, converting an hourly time series into a daily one. Note that meaningful aggregation only goes from a higher frequency to a lower one: you cannot recover hourly detail from daily data, because that information was never recorded.

Let us see a couple of examples to understand how this works. We will use the resample method in pandas to do this.
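Before the real examples, here is a minimal synthetic sketch (toy data and names of our own, not part of the datasets used below) of what resample does when aggregating hourly data to daily frequency:

Hide the code
# Toy sketch: 48 hours of synthetic data, downsampled to daily frequency.
toy = pd.Series(np.arange(48), index=pd.date_range("2024-01-01", periods=48, freq="H"))
toy.resample("D").sum()   # total per day (two values, one per day)
toy.resample("D").mean()  # or the daily average, when a total makes no sense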

Pedestrian Count Example

This is, quoting the dataset description: “A dataset containing the hourly pedestrian counts from 2015-01-01 to 2016-12-31 at 4 sensors in the city of Melbourne.”

Check this link to the dataset for more information. We will load the data and select one of the sensors, Bourke Street Mall (North), to work with.

Hide the code
pedestrians = pd.read_csv('../4_1_Introduction_to_Forecasting/tsdata/tsibble/pedestrian.csv')
pedestrians = pedestrians[pedestrians["Sensor"] == "Bourke Street Mall (North)"].copy()  # copy to avoid chained-assignment warnings
pedestrians.drop(columns=["Sensor"], inplace=True)
pedestrians
Date_Time Date Time Count
14566 2015-02-16T13:00:00Z 2015-02-17 0 61
14567 2015-02-16T14:00:00Z 2015-02-17 1 16
14568 2015-02-16T15:00:00Z 2015-02-17 2 15
14569 2015-02-16T16:00:00Z 2015-02-17 3 17
14570 2015-02-16T17:00:00Z 2015-02-17 4 9
... ... ... ... ...
30975 2016-12-31T08:00:00Z 2016-12-31 19 2467
30976 2016-12-31T09:00:00Z 2016-12-31 20 1730
30977 2016-12-31T10:00:00Z 2016-12-31 21 1342
30978 2016-12-31T11:00:00Z 2016-12-31 22 1409
30979 2016-12-31T12:00:00Z 2016-12-31 23 749

16414 rows × 4 columns

Hide the code
pedestrians["Date_Time"] = pd.to_datetime(pedestrians["Date_Time"])
pedestrians.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16414 entries, 14566 to 30979
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype              
---  ------     --------------  -----              
 0   Date_Time  16414 non-null  datetime64[ns, UTC]
 1   Date       16414 non-null  object             
 2   Time       16414 non-null  int64              
 3   Count      16414 non-null  int64              
dtypes: datetime64[ns, UTC](1), int64(2), object(1)
memory usage: 641.2+ KB
Hide the code
pedestrians.set_index('Date_Time', drop=True, inplace=True)
pedestrians.drop(columns=["Date", "Time"], inplace=True)
pedestrians
Count
Date_Time
2015-02-16 13:00:00+00:00 61
2015-02-16 14:00:00+00:00 16
2015-02-16 15:00:00+00:00 15
2015-02-16 16:00:00+00:00 17
2015-02-16 17:00:00+00:00 9
... ...
2016-12-31 08:00:00+00:00 2467
2016-12-31 09:00:00+00:00 1730
2016-12-31 10:00:00+00:00 1342
2016-12-31 11:00:00+00:00 1409
2016-12-31 12:00:00+00:00 749

16414 rows × 1 columns

Now suppose that we want to convert this hourly time series to a daily time series. We can do this with the resample method in pandas. We will use the sum method to aggregate the hourly counts into daily counts, because we are interested in the total number of pedestrians per day. We would get it like this:

Hide the code
pedestrians_daily = pedestrians.resample('D').sum()
pedestrians_daily
Count
Date_Time
2015-02-16 00:00:00+00:00 3056
2015-02-17 00:00:00+00:00 23591
2015-02-18 00:00:00+00:00 25733
2015-02-19 00:00:00+00:00 30238
2015-02-20 00:00:00+00:00 33199
... ...
2016-12-27 00:00:00+00:00 39335
2016-12-28 00:00:00+00:00 38138
2016-12-29 00:00:00+00:00 28675
2016-12-30 00:00:00+00:00 38681
2016-12-31 00:00:00+00:00 31855

685 rows × 1 columns

Hide the code
pedestrians.plot()
plt.title("Hourly Pedestrian Counts")
plt.show();plt.close()
pedestrians_daily.plot()
plt.title("Daily Pedestrian Counts")
plt.show();plt.close()

You can see in this example how the daily resampling operation allows us to more clearly identify seasonality patterns in the time series.
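For instance, one quick way (our own illustration, not part of the original analysis) to make the weekly pattern explicit is to average the daily counts by day of the week:

Hide the code
# Average daily pedestrian count by day of the week (0 = Monday, ..., 6 = Sunday).
pedestrians_daily.groupby(pedestrians_daily.index.dayofweek)["Count"].mean()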

Another resampling example: global temperature data

In our previous session we already introduced the NOAA global temperature anomaly data. We will use it to show how to resample a time series to a different frequency, aggregating the monthly data into a yearly series. The crucial difference with the preceding example is that here we use the mean method to aggregate the monthly values into yearly values: adding up temperature anomalies would not make sense, but the mean anomaly for each year is a meaningful quantity.


Hide the code
NOAA_global_temp = pd.read_csv("../4_1_Introduction_to_Forecasting/noaa.globaltmp.comb.csv", parse_dates=True, index_col=0, na_values=-9999.)
NOAA_global_temp.columns = ['Temp_Anomaly'] 
NOAA_global_temp
Temp_Anomaly
Date
1850-01-01 -0.775
1850-02-01 -0.551
1850-03-01 -0.566
1850-04-01 -0.679
1850-05-01 -0.609
... ...
2024-08-01 0.972
2024-09-01 0.964
2024-10-01 1.063
2024-11-01 NaN
2024-12-01 NaN

2100 rows × 1 columns

Now let us apply the resampling operation:

Hide the code
NOAA_global_temp_yearly = NOAA_global_temp.resample('Y').mean()
NOAA_global_temp_yearly
Temp_Anomaly
Date
1850-12-31 -0.501000
1851-12-31 -0.395167
1852-12-31 -0.357000
1853-12-31 -0.411667
1854-12-31 -0.372083
... ...
2020-12-31 0.704583
2021-12-31 0.553000
2022-12-31 0.584750
2023-12-31 0.875167
2024-12-31 0.969000

175 rows × 1 columns

Hide the code
NOAA_global_temp.plot()
plt.title("Monthly Global Temperature Anomalies")
plt.show();plt.close()
NOAA_global_temp_yearly.plot()
plt.title("Yearly Global Temperature Anomalies")
plt.show();plt.close()

Note again in this example how resampling can help reveal the long-term trends and patterns in the time series. For example, it is well known that the “El Niño” phenomenon is associated with a rise in sea surface temperatures, which shows up in the global anomaly series as unusually warm years. Can you spot this in the yearly time series?
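As a hint, the sketch below (our own illustration) marks two widely documented strong El Niño episodes, 1997-98 and 2015-16, on the yearly series; both show up as unusually warm years.

Hide the code
# Illustration only: mark the 1997-98 and 2015-16 El Niño episodes on the yearly series.
plt.plot(NOAA_global_temp_yearly.index, NOAA_global_temp_yearly["Temp_Anomaly"])
for year in [1998, 2016]:
    plt.axvline(pd.Timestamp(f"{year}-06-30"), color="red", linestyle="--", alpha=0.6)
plt.title("Yearly Global Temperature Anomalies with strong El Niño episodes marked")
plt.show();plt.close()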


Other Machine Learning Tasks for Time Series

Before ending this session we would like to mention that forecasting is not the only goal in time series analysis. We may also be interested in time series classification. For example, if you have a time series representation of a bird song, can you tell which species it is? There are also clustering problems: imagine you are a biologist arriving at a new, unexplored region where several different (as yet unknown to science) species of bats use ultrasound to move around. Can you use recordings of their sounds to guess how many species of bats there are?


In the Next Session

First we will see how to define so-called baseline models for time series forecasting. These are simple models that are easy to implement and that can be used as a reference against which to compare the performance of more complex models. Then we will learn about autoregressive and moving average processes, and we will see that these constitute the foundation of a very important family of forecasting models, the SARIMA models.

References

Cerqueira, Vitor, and Luís Roque. 2024. Deep Learning for Time Series Cookbook: Use PyTorch and Python Recipes for Forecasting, Classification, and Anomaly Detection. Packt Publishing. https://www.packtpub.com/product/deep-learning-for-time-series-cookbook/9781805129233.
Hyndman, R. J., and G. Athanasopoulos. 2021. Forecasting: Principles and Practice. OTexts. https://otexts.com/fpp3/.
Shumway, R. H., and D. S. Stoffer. 2017. Time Series Analysis and Its Applications: With r Examples. Springer Texts in Statistics. Springer International Publishing. https://link.springer.com/book/10.1007/978-3-319-52452-8.