Based on notes by Professors A.Muñoz, J.Portela & J.Pizarroso
Published
March 2025
Chapter 4, Part 2: Stationarity
Session Setup
Libraries
Let us begin by loading the libraries we will use.
### Load necessary modules -------------------------------
%config InlineBackend.figure_format = 'png'  # 'png', 'retina', 'jpeg', 'svg', 'pdf'

import numpy as np
import pandas as pd
import scipy.stats as stats

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
sns.set_context('notebook')

# statsmodels
import statsmodels.stats.api as sms
import statsmodels.api as sm
import statsmodels.tsa as tsa
from statsmodels.tsa.seasonal import seasonal_decompose, STL
from statsmodels.tsa.stattools import adfuller, kpss
from statsmodels.formula.api import ols

import pmdarima as pmd

import sys
sys.path.insert(0, '.')
# from utils import *

plt.rc('axes', titlesize='medium')
plt.rc('axes', titlelocation='left')
plt.rc('axes.spines', right=False)
plt.rc('axes.spines', top=False)
sizets = (8, 4.5)
plt.rc('figure', figsize=sizets)
%matplotlib inline
from fc_4_2_utils.utils import *

# Time Series libraries
import datetime

# pmdarima
import pmdarima as pmd

# datasets
import rdatasets
import yfinance as yf

# tqdm
from tqdm import tqdm

# sktime
from sktime.datasets import load_airline
from sktime.transformations.series.boxcox import BoxCoxTransformer
from sktime.utils.plotting import plot_series
Types of Models for Time Series
Pure autoregressive models for time series
Our forecasting goal is to predict the value \(y_t\) (the future) using the previous history of the time series: \[
\mathbf{ \bar y_{t-1}} = (y_1, y_2, \ldots, y_{t-1})
\] The first family of models that we will explore in the next sessions can be expressed as: \[
y_t = f(\mathbf{\bar y_{t-1}}, \mathbf{\bar \epsilon_{t-1}})
\] where \(\mathbf{\bar \epsilon_{t-1}}\) is also a time series of error or noise terms. Since the output \(y_t\) is numeric, this kind of model can be seen as a (special kind of) regression problem in which we regress the next value of the time series using its past and noise terms as inputs.
We call this a pure autoregressive model because the time series is regressed on itself. That is, the output \(y_t\) at time \(t\) is regressed on past values \(y_{t-1}, y_{t-2}, \ldots, y_{t-p}\) and only on them, without any other external input.
There are two natural and interconnected questions to ask about these models:
How is this special or different from our previous discussion of regression problems?
What do we mean by noise in the context of time series?
Before addressing these questions, let us talk about a generalization of these models.
Dynamic Regression Models for Time Series
Sometimes we will consider one or more additional time series that contain information to help us predict the value of \(y_t\). In those cases we will arrange these additional time series in a matrix \(\mathbf{X_{t-1}}\) of exogenous variables \[
\mathbf{X_{t-1}} =
\left(
\begin{array}{cccc}
\uparrow & \uparrow & \cdots & \uparrow\\
\mathbf{\bar x_{1;t-1}} & \mathbf{\bar x_{2;t-1}} & \cdots & \mathbf{\bar x_{p;t-1}}\\
\downarrow & \downarrow & \cdots & \downarrow
\end{array}
\right)
\] where each column of the matrix is a time series containing the (past) values of the corresponding auxiliary input.
Therefore, a formulation of the forecasting problem for time series is this. Given the previous history of the output time series \(\mathbf{\bar y_{t-1}}\) and, optionally, a matrix of exogenous variables up to time \(t-1\), find a model to predict the next value of the output series \[
y_t = f(\mathbf{\bar y_{t-1}}, \mathbf{X_{t-1}}, \mathbf{\bar \epsilon_{t-1}})
\] where again \(\mathbf{\bar \epsilon_{t-1}}\) is the time series of noise terms.
Theoretical Foundations
Stochastic Processes and Statistical Properties of Time Series
The answer to our first question about time series models requires that we introduce some previous ideas about time series in the wider context of stochastic processes. A discrete stochastic process \(Y(u, t)\) is a family of time indexed random variables. Here we consider time as discrete, meaning that we only consider a sequence of time values \[t_1, t_2, t_3, \ldots\] In particular this means that:
+ for a fixed time value \(t_k\), we have a random variable \(Y_{t_k}(u) = Y(u, t_k)\) defined on a certain common sample space \(\Omega\) (this is where \(u\) lives), and having its corresponding probability distribution, mean, variance (assuming these exist) and so on.
+ for a fixed \(u_0\) we get a time series: \[Y^{u_0}(t) = \left[Y(u_0, t_1),\,\, Y(u_0, t_2),\,\, Y(u_0, t_3),\,\, \ldots\right]\] abbreviated as \[Y^{u_0}(t) = \left[y^{u_0}_{t_1},\,\, y^{u_0}_{t_2},\,\, y^{u_0}_{t_3},\,\, \ldots\right]\] that we call a realization of the stochastic process.
Example.
To make things more concrete and to understand the role of \(u\), let us look at an example and then we will implement it in Python. Let: \[
X(u, t) = 3\sin(2 \pi t / 500 + A(u))
\] where \(A(u)\) is a real valued random variable, and \(t = 1, 2, \ldots\) is a positive natural number. In particular let us assume that \(A(u)\) is uniformly distributed in \([0,2\pi]\). We can think of \(A(u)\) as a weird dice. Rolling that dice will result in a random value for \(A\) (according to its distribution); say we get \(A(u_0) = 2.5\) for a certain we-don’t-really-care-which value \(u_0\). Think of \(u_0\) as a shorthand for roll the dice. Then we have \[
X(u_0, t) = 3\sin(2 \pi t / 500 + 2.5)
\] and this defines a time series (seasonal, with period 500), which is a realization of the stochastic process. Another roll of the dice, say \(u_1\), and you get a different time series.
Let us implement this in Python. We will construct the first 1500 values (i.e. \(t=1,\ldots, t=1500\)) for 10 different realizations of the stochastic process (i.e. for 10 different values of \(u_0\) above, 10 different rolls of a peculiar dice).
n = 1500
k = 10
X = np.zeros((n, k))
rng = np.random.default_rng(2024)
Ts = np.arange(n)
for i in range(k):
    A = rng.uniform(low=0, high=2 * np.pi, size=1)
    X[:, i] = 3 * np.sin(2 * np.pi * Ts / 500 + A)
Let us store the resulting values in a pandas DataFrame to make processing easier:
Xdf = pd.DataFrame(X, columns=["X" + str(i) for i in range(k)])
Xdf.head()
         X0        X1        X2        X3        X4        X5        X6        X7        X8        X9
0 -2.680097  2.924941  2.793115 -2.856263 -0.079119  2.338127  1.424163  2.721056  2.315777  2.625455
1 -2.696825  2.933090  2.779138 -2.844509 -0.041428  2.361563  1.457230  2.736716  2.291629  2.643488
2 -2.713126  2.940776  2.764722 -2.832305 -0.003730  2.384625  1.490067  2.751943  2.267119  2.661103
3 -2.728999  2.947997  2.749869 -2.819654  0.033968  2.407311  1.522669  2.766736  2.242251  2.678299
4 -2.744441  2.954753  2.734582 -2.806558  0.071661  2.429617  1.555030  2.781092  2.217029  2.695071
And let us plot the different realizations as time series:
We have added a dashed line at \(t_a = 325\) to help you visualize the random variable \(X_{325}\), which is part of this stochastic process. The dots illustrate that we have generated a \(k\)-sized sample of \(X_{t_a}\). The sample corresponds to a row of the pandas dataset (while the different time series correspond to columns). We could use this sample to estimate the mean, variance and other statistical properties of \(X_{t_a}\). The key point is that in this example we have access to different realizations of the stochastic process, different time series of the same process.
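As a quick illustration, the row-wise sample can be inspected like this (a minimal sketch; it only shows the mechanics, the exercise below asks you to go further):

# Row t_a = 325 of Xdf is a k-sized sample of the random variable X_325
t_a = 325
sample_325 = Xdf.iloc[t_a]       # one value per realization (one per column)
print(sample_325.mean(), sample_325.var())
sns.kdeplot(sample_325)          # estimated density curve of X_325
plt.show(); plt.close()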
Exercise 001
Do this: find the mean of \(X_{t_a = 325}\) and plot a density curve for \(X_{t_a = 325}\). Then change the value of k to 1000 and repeat the computation of the mean and density curve. What do you observe? Does the mean depend on the value of \(t\)? What about the variance of the time series, does it depend on \(t\)?
About the Role of u in the Code for this Example and the Consequences for Notation.
Note, in particular, that we do not care about or need to know the value \(u_0\); all we really care about is the value of \(A(u)\) and its probability distribution. The Python code above shows this clearly, because we needed no variable for the value of \(u\). All we had to do was roll the dice. Can you spot the place in the code where we did this dice rolling?
As a consequence, we will often remove \(u\) from our notation and write e.g. \(X_t\) instead of \(X(u, t)\). We will only include \(u\) explicitly when we want to consider different values of \(u\), that is, different time series of the same stochastic process.
We usually work with a single time series. We then use \(x_t\) to refer to the particular value of the random variable \(X_t\) that we have in our time series.
Making Noise
The above example certainly defines a stochastic process: if you roll the dice again, you will get a different time series. But the individual time series themselves are certainly not noisy. In fact, each of them is ruled by a deterministic equation: once \(u\) is fixed, there is nothing random about it.
Next we are going to introduce a different type of stochastic process that is, in comparison, extremely noisy.
White Noise, Different Shades of
The following definition provides a first answer to our pending question of what noise means in the context of time series.
A stochastic process \(W_t = W(u,t)\) is white noise if the random variables \[W_{t_1}, W_{t_2}, W_{t_3}, \ldots\] are uncorrelated and identically distributed with mean zero. In particular, they all have the same variance \(\sigma^2\).
A white noise stochastic process is called independent white noise if those \(W_{t_i}\) random variables are (not just uncorrelated but) independent.
Finally, an independent white noise is gaussian white noise if the \(W_{t_i}\) are iid (independent identically distributed) normal variables \(N(0, \sigma)\).
Let us implement a gaussian white noise process in Python. Let us suppose that the \(W_{t_i}\) are iid normal variables \(N(0, 0.25)\). In particular this means that each row of the dataset is a sample (of size \(k\)) of this normal distribution. So this time we fill the dataset by rows.
W = np.zeros((n, k))
for i in range(n):
    W[i, :] = np.random.default_rng(i).normal(loc=0, scale=0.25, size=k)
Wdf = pd.DataFrame(W, columns=["W" + str(i) for i in range(k)])
Wdf.head()
         W0        W1        W2        W3        W4        W5        W6        W7        W8        W9
0  0.031433 -0.033026  0.160106  0.026225 -0.133917  0.090399  0.326000  0.236770 -0.175934 -0.316355
1  0.086396  0.205405  0.082609 -0.325789  0.226339  0.111594 -0.134238  0.145280  0.091143  0.073533
2  0.047263 -0.130687 -0.103266 -0.610367  0.449927  0.286041 -0.081356  0.193452  0.070303 -0.138456
3  0.510230 -0.638916  0.104525 -0.141942 -0.113162 -0.053899 -0.504997 -0.057983 -0.216303  0.830750
4 -0.162948 -0.043679  0.415931  0.164787 -0.410349 -0.001301 -0.155866  0.037158 -0.402047  0.060443
Let us plot the \(k\) time series that we have obtained as realizations of this stochastic process. Now that looks noisy!
Wdf.plot()
plt.show(); plt.close()
Combining the two examples
Let us now consider a combined stochastic process \(Y\) which is simply the sum of the two preceding examples. That is, the random variable \(Y_t\) is defined as \[Y_t = X_t + W_t\] for each value of \(t\).
We can quickly visualize what that means in Python.
Y = X + W
Ydf = pd.DataFrame(Y, columns=["Y" + str(i) for i in range(k)])
Ydf.plot()
plt.show(); plt.close()
You can see that the time series in this last plot are beginning to look much more like the ones we saw in the previous session. Each time series (= realization of the process = column of the dataset) displays a signal (the \(X\) part) plus noise (the \(W\) part). Most of our time series models in the next sessions will be similar: they include a signal part and a noise part that we sum to get the complete model structure (in additive models; we multiply them in multiplicative models).
The Ensemble and a Mental Image of Stochastic Processes
After the preceding examples, a mental image of a stochastic process \(Y(u, t)\) as some kind of two-way table or matrix is probably emerging:
\[
\left(
\begin{array}{ccccccc}
Y(u_1, t = 1)&Y(u_2, t = 1)&Y(u_3, t = 1)&\cdots&Y(u_k, t = 1)&\cdots&\cdots\\
Y(u_1, t = 2)&Y(u_2, t = 2)&Y(u_3, t = 2)&\cdots&Y(u_k, t = 2)&\cdots&\cdots\\
\vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots\\
\vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots\\
Y(u_1, t = t_a)&Y(u_2, t = t_a)&Y(u_3, t = t_a)&\cdots&Y(u_k, t = t_a)&\cdots&\cdots\\
\vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots\\
\vdots&\vdots&\vdots&\vdots&\vdots&\ddots&\ddots
\end{array}
\right)
\] Once again: each column is a time series generated by the process using a fixed value of \(u\) and each row is a sample of the random variable \(Y_{t_a}\) that corresponds to a fixed time instant \(t_a\). Here we use the \(Y(u, t)\) symbol to emphasize the different values of \(u\).
However, be careful with a too literal interpretation. The dots try to suggest that there is a possible column for each point \(u\) in the sample space \(\Omega\), which can be continuous and high dimensional! So a better mental picture at least replaces the discrete set of columns with a continuous horizontal axis (better yet, a copy of \(\Omega\)). For discrete time processes the idea that a row corresponds to an instant in time is fine, but keep in mind that there exist continuous time stochastic processes; for those we need to replace the vertical dimension of this mental image with a continuous axis as well. And we need an additional axis for the values that the stochastic process generates (the \(Y\) axis). The resulting set of points \[(t, u, Y(t, u))\in \mathbb R\times\Omega\times\mathbb R\] is called the ensemble of the stochastic process and it contains all the possible time series that the stochastic process can possibly generate. This web page contains some helpful visualizations of the ensemble of a stochastic process.
How is this Relevant for Time Series Modeling?
As we have said, our models for time series assume a certain signal/noise structure of the time series that we are trying to first fit and then predict with the model. In particular, we need to be able to recognize noise when we see it! And that means that we need to be able to estimate the statistical properties of a time series to see if it qualifies as noise.
So, how do we do that? In statistics the basis of any estimation is a sample. And here is where things get tricky with time series. The random variables that form the stochastic process are obtained by fixing a time instant \(t_a\). A sample of such a variable is a row-like section of the stochastic process, as we have seen. But in Machine Learning our sample is a single time series, which is a column-like section of the process. To emphasize the problem: with a single time series (the usual situation) we have a sample of \(Y_{t_a}\) of size one! How are we going to get estimates with this, the tiniest of samples?
The notion of stationarity offers a way out of this sampling problem. Remember that when we introduced white noise stochastic processes, one of their key properties was that the random variables that form them are independent, or at least uncorrelated. In particular, if I show you a portion of a white noise series it is impossible to tell (to estimate) to which time values it corresponds. White noise series are time-shift invariant.
This idea of time-shift invariance is at the core of the idea of stationarity. More precisely, a stochastic process \(Y_t\) is strictly stationary if the joint distribution of a random vector such as \[\left(Y_{t_{a_1}}, Y_{t_{a_2}},\ldots, Y_{t_{a_s}}\right)\] does not change when all the time instants are shifted by the same amount; that is, it coincides with the joint distribution of \(\left(Y_{t_{a_1}+h}, \ldots, Y_{t_{a_s}+h}\right)\) for any shift \(h\) and any vector length \(s\). For example, a gaussian white noise process is strictly stationary.
Why is strict stationarity helpful? Because if the process is stationary then the entire time series can be used as a sample to estimate the (common) statistical properties. For example, to estimate the mean we can simply use as estimator the average of all the available time series values: \[\hat\mu = \bar Y = \dfrac{Y_1+ Y_2 + \cdots + Y_n}{n}\] This works because the random variables \(Y_1,\ldots,Y_n\) all have the same mean \(\mu\).
Autocovariance and Autocorrelation (ACF)
A proper statistical estimate, like \(\hat\mu\) above, is not complete until we understand its variability and sampling distribution. In order to do that, we turn to the variance of \(\hat\mu\): \[
\operatorname{var}(\bar Y) = \operatorname{var}\left(\dfrac{Y_1+ Y_2 + \cdots + Y_n}{n}\right) =
\dfrac{1}{n^2}\operatorname{cov}\left(\sum_{t = 1}^n Y_t, \sum_{s = 1}^n Y_s\right)
\]
Here we will use a very useful property of covariance: it is bilinear. Thus, if we have linear combinations of (finite variance) random variables \[U = \sum_{j = 1}^m a_j X_j,\qquad V = \sum_{k = 1}^r b_k Y_k\] then \[\operatorname{cov}(U, V) = \sum_{j = 1}^m\sum_{k = 1}^r a_jb_k \operatorname{cov}(X_j, Y_k)\] And besides \[\operatorname{var}(U) = \operatorname{cov}(U, U)\]
With this: \[
\operatorname{var}(\bar Y) = \dfrac{1}{n^2}\sum_{t = 1}^n\sum_{s = 1}^n\operatorname{cov}\left( Y_t, Y_s\right)
\] Let us give the covariances that appear here a name; the autocovariance function of the stochastic process is \[
\gamma_Y(s, t) = \operatorname{cov}\left( Y_t, Y_s\right)
\] As usual, the problem with covariance is that it depends on the scale of the variables, so we define a normalized version. The autocorrelation function (ACF) is: \[
\rho_Y(s, t) = \dfrac{\operatorname{cov}(Y_t, Y_s)}{\sqrt{\operatorname{cov}(Y_t, Y_t)\operatorname{cov}(Y_s, Y_s)}}
\] and it is always a number between -1 and 1.
Autocovariance and Autocorrelation (ACF) for Strictly Stationary Processes
Now suppose that the stochastic process is strictly stationary. Then for any time shift \(k\) the distribution of the vectors \((Y_t, Y_s)\) and \((Y_{t + k}, Y_{s + k})\) must be the same. In particular \[
\operatorname{cov}\left( Y_t, Y_s\right) = \operatorname{cov}(Y_{t + k}, Y_{s + k})
\] and this means (take \(k = -t\)) that the autocovariance and the ACF only depend on the difference \(h = s - t\). For a stationary time series we simplify the notation by defining: \[
\gamma_Y(h) = \gamma_Y(0, h) = \gamma_Y(t, t + h), \qquad \rho_Y(h) = \rho_Y(0, h) = \rho_Y(t, t + h)
\] Note that the value of \(t\) is irrelevant in these equations.
Again: why is this relevant or useful? Time Lags of a Time Series.
Remember that the problem is that we only have one time series (in fact \(n\) values of the time series): \[y_1, y_2, \ldots, y_n\] But if we can assume stationarity, then we can use lagged (time-shifted) copies of our time series. Given an integer \(h\) we define the corresponding lagged copy of the time series as: \[
z_1 = y_{1 + h},\ \ z_2 = y_{2 + h},\ \ \ldots,\ \ z_{n - h} = y_{n}
\] And then the pairs of original and lagged values can be used as observations of \((Y_t, Y_{t + h})\) when estimating \[
\rho_Y(h) = \rho_Y(t, t + h)\qquad\text{In particular we always have:}\quad \rho_Y(0) = 1
\] We emphasize that this is possible because stationarity guarantees that the result only depends on \(h\). In the non stationary case a lag of the time series would be useless here.
By the way, now we can answer the other pending question: the presence of autocorrelation is what makes time series special and different from the other regression problems we have seen. For example, we need to take it into account when doing estimates like the one for \(\operatorname{var}(\bar Y)\) above. In standard linear regression models we assume that we have samples formed by iid (in particular independent and therefore uncorrelated) copies of the random variables in the problem.
Lagged Time Series in Python
Let us see how this works in practice using Python. We will use the same Canadian gas example that we used in the previous session, so we will begin by loading and preparing it. Then we will use the shift method in pandas to obtain the first 15 lagged time series out of it.
Now we create the lags, for \(h = 0\) (a copy of the original time series) to \(h = 15\). We use a dictionary comprehension for this. Note that the range in the for loop uses 16 to get lags 0 through 15, and note how we name the columns. We will examine the result and analyze the consequences below.
can_gas_lags = pd.DataFrame({"lag" + str(lag): can_gas["Volume"].shift(lag) for lag in range(16)})
can_gas_lags
               lag0     lag1     lag2     lag3     lag4     lag5     lag6     lag7     lag8     lag9    lag10    lag11    lag12    lag13    lag14    lag15
Month
1960-01-01   1.4306      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN
1960-02-01   1.3059   1.4306      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN
1960-03-01   1.4022   1.3059   1.4306      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN
1960-04-01   1.1699   1.4022   1.3059   1.4306      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN
1960-05-01   1.1161   1.1699   1.4022   1.3059   1.4306      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN
...             ...      ...      ...      ...      ...      ...      ...      ...      ...      ...      ...      ...      ...      ...      ...      ...
2004-10-01  17.8268  16.9067  17.6412  17.8783  17.1657  18.0935  17.8475  18.3332  17.7823  19.2439  19.1440  17.7200  17.7920  16.9053  17.5883  17.3844
2004-11-01  17.8322  17.8268  16.9067  17.6412  17.8783  17.1657  18.0935  17.8475  18.3332  17.7823  19.2439  19.1440  17.7200  17.7920  16.9053  17.5883
2004-12-01  19.4526  17.8322  17.8268  16.9067  17.6412  17.8783  17.1657  18.0935  17.8475  18.3332  17.7823  19.2439  19.1440  17.7200  17.7920  16.9053
2005-01-01  19.5284  19.4526  17.8322  17.8268  16.9067  17.6412  17.8783  17.1657  18.0935  17.8475  18.3332  17.7823  19.2439  19.1440  17.7200  17.7920
2005-02-01  16.9441  19.5284  19.4526  17.8322  17.8268  16.9067  17.6412  17.8783  17.1657  18.0935  17.8475  18.3332  17.7823  19.2439  19.1440  17.7200

542 rows × 16 columns
Overlapping Lags and ACF Estimation.
Our plan is to use the columns of this table of lags to estimate the ACF. For example, to estimate \(\rho_Y(1)\) we will compute the correlation between the first two columns of this dataframe. Obviously, the lagged time series contain missing values, because we cannot obtain lagged values from outside the sample. Thus the lag1 time series has one missing value, the lag2 series has two missing values, and so on. In particular this means that our estimate of \(\rho_Y(1)\) will be based on \(n - 1\) values, the estimate of \(\rho_Y(2)\) on \(n - 2\) values, etc. This implies that the sampling support for \(\rho_Y(h)\) decreases as we consider bigger lags. In practice we will therefore attribute more importance to the estimates of the first values of the ACF (for smaller lags) and rely less on estimates based on bigger lags.
ACF Values and Plot in Python.
We could try a manual computation of the ACF values, but getting it right means dealing with those missing values in the right way. Instead, we will use the solution provided by the statsmodels library.
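For example, a quick manual check of \(\rho_Y(1)\) can be done with the lag table (pandas drops the incomplete pairs when computing the correlation). This is only a sketch, to be compared with the statsmodels estimate; the two use slightly different conventions, so the values are close but not identical:

# Manual estimate of rho_Y(1): correlation between the series and its lag-1 copy
rho_1_manual = can_gas_lags["lag0"].corr(can_gas_lags["lag1"])
print(f"Manual estimate of rho(1): {rho_1_manual:.3f}")

# Estimate used by statsmodels (the same one shown in the ACF plot below)
from statsmodels.tsa.stattools import acf
rho_1_sm = acf(can_gas["Volume"], nlags=1)[1]
print(f"statsmodels estimate of rho(1): {rho_1_sm:.3f}")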
The library also includes a function for a nice graphical representation of the ACF values, a so-called ACF plot, which appears in the left panel below. In the right panel we show the plot of the original time series together with one of its lagged copies (lag 6, selected with which_lag in the code):
which_lag = 6
fig, axs = plt.subplots(1, 2, figsize=(16, 4), sharex=False, sharey=False)
ax0 = axs[0]
sm.graphics.tsa.plot_acf(can_gas["Volume"], ax=ax0, lags=25, zero=False,
                         title='ACF for the Canadian Gas Example')
ax0.set(ylim=(-1, 1), xlabel='Lag', ylabel='acf')
ax1 = axs[1]
ax1 = can_gas["Volume"].head(100).plot(color="blue", label="Original")
ax1 = can_gas_lags[f"lag{which_lag}"].head(100).plot(color="red", label=f"lag{which_lag}")
ax1.set_title('Canadian Monthly Gas Production')
ax1.grid(visible=True, which='both', axis='x')
ax1.grid(visible=False, which='major', axis='y')
ax1.legend()
plt.tight_layout()
plt.show(); plt.close()
Initial Interpretation of the Canadian Gas ACF Plot
Recall that the definition of the ACF values is \[\rho_Y(h) = \rho_Y(t, t + h)\] and that \(\rho\) represents correlation. That means that \(\rho\) measures how strong the linear relation between \(Y_t\) and \(Y_{t + h}\) is. The right panel illustrates that in this example this relation is very strong: for most of the time series span, the slopes of the graphs of \(Y_t\) and \(Y_{t + h}\) are very similar. This holds clearly for small values of \(h\), but if you factor in the seasonality, the relation carries on for not so small values of \(h\). This is what the ACF on the left illustrates: the height of each vertical bar indicates a value of \(\rho_Y(h)\) for increasing lags. And we can see in this example correlation values close to 1 even after 25 lags. Note also that there is an uptick in the correlations at lag = 12, corresponding to the seasonal period of the time series.
Note also the shaded area in the ACF plot. Soon we will see that we can use statistics to test the hypothesis that the ACF values are significantly different from 0. An ACF value outside the shaded area is significantly different from 0.
Finally note that you can select the number of lags to include in the ACF plot, and that we have selected zero=False to exclude the ACF at lag = 0, which is always equal to 1 and sometimes messes with the vertical scale of the plot.
Exercise 02
Replace the lag used above (set which_lag to 12) and repeat the plots.
Lag Plots
Lag plots are scatterplots of \(Y_t\) vs \(Y_{t + h}\) for different values of \(h\). They are particularly interesting for small values of \(h\) and for small multiples of the seasonal period. They offer a different visualization of the information in the ACF. Below we include the lag plots for all the lags included in an extended version of can_gas_lags. Note that the correlations are very strong and, in the case of lags 12 and 24, even stronger than those of their close neighbors.
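A sketch of how such lag plots can be produced with pandas (here only for a few selected lags):

# Lag plots: scatter of Y_t (x-axis) vs Y_{t+h} (y-axis) for a few selected lags
fig, axs = plt.subplots(1, 4, figsize=(16, 4))
for ax, h in zip(axs, [1, 2, 12, 24]):
    pd.plotting.lag_plot(can_gas["Volume"], lag=h, ax=ax)
    ax.set_title(f"lag = {h}")
plt.tight_layout()
plt.show(); plt.close()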
The mean of a white noise process \(W_t\) is \(\mu_W = 0\). The autocovariance function is: \[
\gamma_W(s, t) =
\begin{cases}
\sigma_W^2& s = t\\[3mm]
0 & s \neq t
\end{cases}
\] and so the ACF equals 1 at lag \(h = 0\) and is \(0\) for any lag \(h > 0\). Let us plot the ACF for one of the gaussian white noise series in the columns of the dataset Wdf from the example above (any column will do). As you can see, all the ACF values are very small, almost all of them inside the shaded region (this is a 95% significance test, so we expect a few ACF values to appear significant just by chance; but even in those cases the values are anyway very small). This behavior is of course to be expected, because the random variables \(W_t\) and \(W_{t+h}\) in a gaussian white noise time series are not only uncorrelated, but independent. A consequence of the definition is that every gaussian white noise process is strictly stationary.
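A sketch of the corresponding ACF plot for one of the white noise columns:

# ACF of a gaussian white noise realization: all values should be close to zero
fig, ax = plt.subplots(figsize=(8, 4))
sm.graphics.tsa.plot_acf(Wdf["W0"], ax=ax, lags=40, zero=False,
                         title='ACF for a Gaussian White Noise Series')
ax.set(ylim=(-1, 1), xlabel='Lag', ylabel='acf')
plt.show(); plt.close()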
As we have seen, strict stationarity opens the possibility of using the time series values to make estimates of the statistical properties of the underlying stochastic process, even though we only have one realization of that process. However, the problem is that checking whether an observed time series is strictly stationary is not easy at all. We can, however, relax the requirements and still get the properties we need to serve as the basis for modeling.
A stochastic process (and its realizations as time series) is called weakly stationary if:
1. Its mean and variance are constant (do not depend on \(t\)).
2. The autocovariance function \(\gamma(t, s)\) only depends on the difference \(|s - t|\).
Every strictly stationary process (with finite variance) is clearly also weakly stationary. The advantage of this notion of weak stationarity is that it is simpler to check when it does not hold, as we will see below. Because of this, from now on the word stationary will be used to mean weakly stationary. If we mean strict stationarity we will make it explicit.
Detecting Non Stationarity
Trend, Stable Variance and Stationarity.
If a time series has a trend, the average value (the mean) of \(Y_t\) and \(Y_{t + h}\) will not be the same for certain values of \(h\). In particular this means that a series with a trend cannot possibly be stationary. The presence of a trend can be detected in the ACF because the values do not approach zero, or approach zero very slowly (with a linear decay). We have already seen an example in the ACF for the Canadian gas time series.
If the variance of a time series clearly increases or decreases across different sections of the series then, just as in the case of a non constant mean, the variance of \(Y_t\) is not constant, and so a time series with non constant variance cannot possibly be stationary. The following example of the recordings of a seismograph during an earthquake (Example 1.7 from (Shumway and Stoffer 2017)) is a clear case of non constant variance in the time series, as the amplitude of the oscillations recorded by the seismometer increases during the earthquake. The ACF of this series exhibits an oscillating (but non periodic) slow decay.
The next example is another important type of time series that appears in many applications. Here we will also use it to see yet another example of an ACF.
A stochastic process \(Y_t\) is a random walk if it satisfies the equation \[Y_t = Y_{t -1} + W_t, \qquad\text{where }W_t\text{ is white noise.}\] If we assume that \(Y_1 = 0\) note that \[Y_3 = Y_2 + W_3 = W_2 + W_3\qquad\text{ and, in general, }\quad Y_k = \sum_{i=2}^k W_i\] Thus a random walk is a sum of white noise values. This in turn means that the autocovariance of a random walk grows with time; it is of the form \[
\gamma_Y(s, t) = \min(s, t)\,\sigma_w^2
\] where \(\sigma_w^2\) is the variance of the white noise series (up to the convention used for the starting value). This shows that the autocovariance depends on the particular values of \(t\) and \(s\), and therefore a random walk is not stationary.
Random Walks in Python.
Let us simulate a random walk in Python. The only thing we need to do is create a white noise time series (we will use the W1 column from the Wdf example above) and then use the recursive equation above to generate the remaining values \(y_2, y_3, \ldots\) as a cumulative sum of the white noise series values.
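A sketch of this simulation; the name RWdf matches the series used in the stationarity tests further below (the sketch starts the walk at \(W_1\) rather than exactly at 0, which is irrelevant for our purposes):

# Random walk: cumulative sum of a white noise series (here the W1 column of Wdf)
RWdf = pd.DataFrame({"RW": Wdf["W1"].cumsum()})

fig, axs = plt.subplots(1, 2, figsize=(16, 4))
RWdf["RW"].plot(ax=axs[0], color="blue", title="Simulated Random Walk")
sm.graphics.tsa.plot_acf(RWdf["RW"], ax=axs[1], lags=50, zero=False,
                         title="ACF of the Random Walk")
plt.tight_layout()
plt.show(); plt.close()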
Is a seasonal time series stationary? As we will see, this is not a simple question. We saw before an example, the sinusoidal stochastic process \[X_t = 3 \sin(2\pi t/500 + A)\] where \(A\) is uniformly distributed in \([0, 2\pi]\). In the exercise following that example we tried to gain some intuition that the mean and variance of the time series do not depend on \(t\). For this process, stationarity can be established formally (see e.g. this online reference from the EE2S31 Signal Processing course at TU Delft, page 28). However, the ACF exhibits a pattern that we would normally associate with non stationarity.
fig, axs = plt.subplots(1, 2, figsize=(16, 4), sharex=False, sharey=False)
ax0 = axs[0]
sm.graphics.tsa.plot_acf(Xdf["X2"], ax=ax0, lags=1000, zero=False,
                         title='ACF for Sinusoidal with Random Uniform Phase')
ax0.set(ylim=(-1, 1), xlabel='Lag', ylabel='acf')
ax1 = axs[1]
ax1 = Xdf["X2"].plot(color="blue")
ax1.set_title('Original Time Series')
ax1.grid(visible=True, which='both', axis='x')
ax1.grid(visible=False, which='major', axis='y')
plt.tight_layout()
plt.show(); plt.close()
Let us consider a second example of a sinusoidal stochastic process \(Z_t\) given by \[Z_t = 3 \sin(t/50 + B)\] but this time let us assume that the phase \(B\) is a normal random variable with mean 0 and standard deviation 0.5. We repeat the same steps, first plotting a large number of realizations of this process (many time series that only differ in the roll of the dice).
n = 1500
k = 1000
Z = np.zeros((n, k))
rng = np.random.default_rng(2024)
Ts = np.arange(n)
for i in range(k):
    B = rng.normal(loc=0, scale=0.5, size=1)
    Z[:, i] = 3 * np.sin(Ts / 50 + B)
Zdf = pd.DataFrame(Z, columns=["Z" + str(i) for i in range(k)])
Zdf.head()
         Z0        Z1        Z2        Z3        Z4        Z5        Z6        Z7        Z8        Z9  ...      Z990      Z991      Z992      Z993      Z994      Z995      Z996      Z997      Z998      Z999
0  1.476112  2.195401  1.627373 -1.402843 -1.924380  0.100776  1.252454  0.755556  2.359540  1.099995  ... -2.423743  1.482323  0.223951  0.768504 -1.446321 -1.550680 -0.754521  0.762182 -0.305112 -0.602334
1  1.528048  2.235851  1.677449 -1.349530 -1.877969  0.160718  1.306721  0.813467  2.396120  1.155592  ... -2.387902  1.534187  0.283735  0.826344 -1.393469 -1.499010 -0.696302  0.820057 -0.245366 -0.543439
2  1.579372  2.275406  1.726854 -1.295677 -1.830807  0.220595  1.360465  0.871052  2.431742  1.210727  ... -2.351107  1.585438  0.343405  0.883854 -1.340059 -1.446741 -0.637805  0.877604 -0.185522 -0.484327
3  1.630065  2.314051  1.775569 -1.241306 -1.782912  0.280385  1.413665  0.928290  2.466391  1.265378  ... -2.313371  1.636054  0.402938  0.941010 -1.286113 -1.393893 -0.579053  0.934800 -0.125604 -0.425021
4  1.680106  2.351770  1.823573 -1.186438 -1.734304  0.340062  1.466300  0.985155  2.500053  1.319523  ... -2.274710  1.686016  0.462310  0.997790 -1.231653 -1.340487 -0.520070  0.991622 -0.065636 -0.365545

5 rows × 1000 columns
And let us again plot the different realizations of \(Z_t\) as time series:
In this case, does the mean of the time series depend on the value of \(t\)? And what about the variance?
In this case it can also be shown formally that the process is not stationary. Now let us look at the corresponding ACF:
fig, axs = plt.subplots(1, 2, figsize=(16, 4), sharex=False, sharey=False)
ax0 = axs[0]
sm.graphics.tsa.plot_acf(Zdf["Z2"], ax=ax0, lags=1000, zero=False,
                         title='ACF for Sinusoidal with Random Normal Phase')
ax0.set(ylim=(-1, 1), xlabel='Lag', ylabel='acf')
ax1 = axs[1]
ax1 = Zdf["Z2"].plot(color="blue")
ax1.set_title('Original Time Series')
ax1.grid(visible=True, which='both', axis='x')
ax1.grid(visible=False, which='major', axis='y')
plt.tight_layout()
plt.show(); plt.close()
Detecting Stationarity in Time Series Samples
As you can see, it looks pretty similar! The take-home message from these examples is that the ACF is helpful in making us suspect non stationarity, but it is not a formal proof. In fact, it is important to understand that the exact same time series could have been produced by either of the two sinusoidal processes: once we have rolled either the uniform or the normal dice, there is no way to tell which one was used to obtain the phase. As we will see in the next session, we will use models that assume that the time series originates from a stationary process. But ultimately, this is an assumption. We will use the results in this session as EDA and preprocessing to remove the non-stationary behavior, so that the time series we feed into our models looks compatible with a stationary generating process. Whether this modeling strategy is right or not is left for the performance evaluation step and will be discussed further in the next session.
Hypothesis Tests for Stationarity
A number of hypothesis tests have been devised to test for stationarity. However, as with the ACF plots, they must be used with caution, as they can even lead to apparently contradictory conclusions. Two commonly used tests are:
ADF test: the null hypothesis is that the time series is not stationary and not seasonal.
KPSS test: the null hypothesis is that the time series is stationary and not seasonal.
Both tests are implemented in statsmodels.tsa; we will apply both of them to some of our examples.
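The calls follow the pattern below (a sketch for the random walk series; the same pattern is repeated for Xdf["X2"] and Zdf["Z2"], and each pair printed in the outputs below appears to be the ADF p-value followed by the KPSS p-value):

# ADF null hypothesis: the series is non-stationary. KPSS null hypothesis: the series is stationary.
RW_adf = adfuller(RWdf["RW"])
RW_kpss = kpss(RWdf["RW"])
(RW_adf[1], RW_kpss[1])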
InterpolationWarning: The test statistic is outside of the range of p-values available in the look-up table. The actual p-value is smaller than the p-value returned.
  RW_kpss = kpss(RWdf["RW"])

(0.791832704806252, 0.01)
For the stationary sinusoid (uniform random phase) the p-value of the ADF test tells us to reject the null. This means that the series is either stationary or seasonal, and here both happen to be true. But the KPSS test fails to reject its null, even though we know it to be false because of the (strong) seasonality.
InterpolationWarning: The test statistic is outside of the range of p-values available in the look-up table. The actual p-value is greater than the p-value returned.
  X_2_kpss = kpss(Xdf["X2"], nlags="legacy")

(0.0, 0.1)
In the case of the non stationary sinusoid (normal random phase) the ADF test again rejects its null, but the KPSS test does not reject its null, even though the series is not stationary and it is seasonal.
InterpolationWarning: The test statistic is outside of the range of p-values available in the look-up table. The actual p-value is greater than the p-value returned.
  Z_kpss = kpss(Zdf["Z2"], nlags="legacy")

(0.0, 0.1)
Exercise 004
What happens with the earthquake data?
Transformations to Achieve Stationarity
Dealing with trend and seasonality
Introduction
A series with a trend or with seasonality cannot be stationary. In the case of a trend, the mean of the time series is not constant. In the case of seasonality, the statistical properties of the series (its mean and/or its autocovariance) depend on the position within the seasonal cycle. However, some important forecasting model families are based on the assumption that the time series is stationary. This is the case for ARIMA models, for example. Therefore, the first step in the analysis of a time series is often to remove the trend and the seasonality. In order to do that we introduce the important concept of differencing.
Differencing
The backshift or lag operator \(B\) is defined by: \[BY_t = Y_{t - 1}\] Powers of the backshift operator express further lags of the time series. For example: \[B^3Y_t = Y_{t - 3}\] The operation \[
(1 - B)Y_t = Y_t - Y_{t-1}
\] is a (first order, regular) difference of the time series. Similarly, for a seasonal time series with period \(p\) we define the first order seasonal difference operator to be: \[
(1 - B^p)Y_t = Y_t - Y_{t - p}
\] Note of course that when we apply these operations to a time series, the first values of the differenced series will be missing, because they require past values that are not present in our data.
In Python we can use the diff method to obtain these differences of a time series. For example, for the Canadian gas example (recall, seasonality 12) the first order regular and seasonal differences are obtained with:
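Something along these lines (a sketch using the can_gas dataframe from the lag example above):

# First order regular difference: (1 - B) Y_t = Y_t - Y_{t-1}
can_gas_reg_diff = can_gas["Volume"].diff()

# First order seasonal difference with period 12: (1 - B^12) Y_t = Y_t - Y_{t-12}
can_gas_seas_diff = can_gas["Volume"].diff(12)

can_gas_reg_diff.plot(label="regular difference")
can_gas_seas_diff.plot(label="seasonal difference (12)")
plt.legend()
plt.show(); plt.close()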
Do the same with the other time series examples in this session, but only take seasonal differences in seasonal time series!
Differencing to remove trends and seasonality
After some experimenting, you should have noticed that:
+ Regular differencing removes (or at least greatly reduces) the trend in a time series. If the series is seasonal, it does not remove the seasonality.
+ Seasonal differencing removes (or at least greatly reduces) the seasonality in a seasonal time series. And if the series has a trend, seasonal differencing may also reduce or even remove the trend.
Therefore, when we have a time series that exhibits seasonality and/or a trend we will proceed as follows:
+ If the series has seasonality we apply seasonal differencing to remove it. If it also has a trend, we examine whether the seasonal differencing has removed the trend. If it has not, then we apply regular differencing to remove the trend.
+ If the series has a trend but no seasonality then we only apply regular differencing to remove it.
Keep also in mind that sometimes a single difference (seasonal or regular) may not be enough to do the job. In some cases we need to apply a second difference of a certain kind to remove a seasonal behavior or a trend. But we will never apply a third difference of the same kind, as this leads to too complex modeling dynamics.
Australian Beer Example
The data in aus_production.csv contains quarterly production of selected commodities in Australia from 1956 to 2010. This data is contained in the R library tsibbledata (https://cran.r-project.org/web/packages/tsibbledata/tsibbledata.pdf), and is also used in the book (Hyndman and Athanasopoulos 2021). Let us begin by reading the data.
We will focus on the time series corresponding to beer production. We can visualize the time series and its ACF. Both plots show a clear trend and seasonality with period 4.
The plots, the ACF in particular, show that we have managed to remove the seasonality. But since there is still some hint of a trend, we will now apply a regular difference (to the already seasonally differenced data) and plot the result again.
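A sketch of the two differencing steps just described, assuming the beer series lives in the ausprod dataframe used below; the name ausbeer_diff is the one that reappears in the stationarity tests:

# Seasonal difference (period 4) followed by a regular difference; drop the missing initial values
ausbeer_diff = ausprod["Beer"].diff(4).diff().dropna()

fig, axs = plt.subplots(1, 2, figsize=(16, 4))
ausbeer_diff.plot(ax=axs[0], color="blue", title="Differenced Australian Beer Production")
sm.graphics.tsa.plot_acf(ausbeer_diff, ax=axs[1], lags=25, zero=False,
                         title="ACF of the Differenced Series")
plt.tight_layout()
plt.show(); plt.close()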
As we will see in the next session, this resulting ACF, with only a few significant initial terms, is a good starting point for modeling with SARIMA models. Let us check the stationarity of the differenced series with the ADF and KPSS tests:
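A sketch of the check (again, the pair printed below appears to be the ADF p-value followed by the KPSS p-value):

# ADF null hypothesis: non-stationary. KPSS null hypothesis: stationary.
ausbeer_adf = adfuller(ausbeer_diff)
ausbeer_kpss = kpss(ausbeer_diff)
(ausbeer_adf[1], ausbeer_kpss[1])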
InterpolationWarning: The test statistic is outside of the range of p-values available in the look-up table. The actual p-value is greater than the p-value returned.
  ausbeer_kpss = kpss(ausbeer_diff)

(9.951622692227455e-11, 0.1)
The ADF test rejects the null of non-stationarity (and non-seasonality) and the KPSS test does not reject the null of stationarity (and non-seasonality). Therefore they agree, and give us another confirmation that the differenced time series looks like the result of a stationary process.
Automatic differencing order determination
The pmdarima library contains two functions, ndiffs and nsdiffs, that indicate the recommended number of regular and seasonal differences to apply to get to this stationary behavior. If you apply them to the beer production time series you will see that the recommendation agrees with what we have done.
Always take this recommendation with skepticism! The best way to proceed is to experiment with different orders of differencing and check the results with the ACF plot. Train your eyes and your good judgement and trust them more than you trust these automated procedures!
ndiffs, nsdiffs = pmd.arima.ndiffs, pmd.arima.nsdiffs
print(f"Recommended number of regular differences: {ndiffs(ausprod['Beer'])}")
print(f"Recommended number of seasonal differences: {nsdiffs(ausprod['Beer'], 4)}")
Recommended number of regular differences: 1
Recommended number of seasonal differences: 1
Choosing the right seasonality is critical
One of the first steps in time series EDA is the identification of the seasonal period of the time series, if seasonal behavior is present. This is critical because the seasonality determines the period of the seasonal difference. If we choose the wrong period, the differencing will not remove the seasonality. Even worse, a spurious dynamic may be introduced in the time series that will severely affect the performance of the models you try to fit to the data. Double check the seasonal period of the time series before proceeding further with the analysis!
The sampling frequency of the time series is a good starting point to determine the possible seasonal periods. We already suggested that the table in Section 2.1 of (Hyndman and Athanasopoulos 2021) can be used as a guide for the relation between the type of observations (yearly, monthly, weekly, daily, hourly) and the seasonal periods that we can expect to find in our data.
Box-Cox transformations to stabilize the variance
Introduction to the Box-Cox family of transformations
A series whose variance changes over time is not stationary. In order to stabilize the variance we can use different transformations. One of the most popular is the Box-Cox transformation. This transformation is defined as: \[
Y_t^{(\lambda)} =
\begin{cases}
\dfrac{Y_t^\lambda - 1}{\lambda} & \text{if }\lambda \neq 0\\[3mm]
\log(Y_t) & \text{if }\lambda = 0
\end{cases}
\] Note in particular that:
+ For \(\lambda = 0\) the transformation is the logarithm.
+ For \(\lambda = 1\) the transformation is essentially the identity (with a unit shift).
+ For \(\lambda = 0.5\) the transformation is essentially the square root.
+ For \(\lambda = 2\) the transformation is essentially the square.
Therefore the Box-Cox family of transformations includes the logarithm and power transformations. The goal of this transformation is to make the variance of the time series homogeneous through the time span of the series.
The parameter \(\lambda\) that best achieves this is estimated by maximizing the log-likelihood of the transformed data, and we will use the sktime implementation of the Box-Cox transformation to do this.
Let us see a famous example of a time series whose variance increases with time: the airline passengers data. We will first plot the time series. The time plot clearly shows an increasing variance.
y = load_airline()
y.plot()
plt.show(); plt.close()
Next we apply the BoxCoxTransformer from sktime to the time series. This transformation is implemented as a transformer, with the scikit interface that should be familiar to you by now.
We apply the transformation to the time series and plot the result. We have upscaled the transformed time series by a factor of 20 to make the plot more useful. Note how the variance becomes more stable after the transformation.
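A sketch of these two steps with sktime (the factor of 20 is just the plotting rescaling mentioned above):

# Fit the Box-Cox transformer and transform the series
boxcox_transformer = BoxCoxTransformer()
y_boxcox = boxcox_transformer.fit_transform(y)

# Plot the original series next to the (upscaled) transformed one
plot_series(y, 20 * y_boxcox, labels=["original", "20 x Box-Cox"])
plt.show(); plt.close()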
After fitting the transformer we can access the best value of \(\lambda\) with the lambda_ attribute. Recall that if this value is close to 1, then the transformation is essentially the identity and therefore the variance of the time series is already stable.
boxcox_transformer.lambda_
0.14802254856840585
How do we decide whether to apply a Box-Cox transformation?
The first thing that you should always do is plot the time series and check if the variance seems to change through time. If it does, then you should consider a Box-Cox transformation. Once again: trust your eyes and your judgement more than any automated procedure!
However, in some cases we are not certain about whether we need to apply a Box-Cox transformation. In those cases, you can use some tests to check if the variance is stable. The most common test is the Breusch-Pagan test. This test is implemented in the statsmodels library and we will apply it below to the airline passengers time series. The null hypothesis of this test is that the variance is constant. Therefore if the p-value is small, then we reject the null hypothesis and conclude that the variance is not constant.
To apply the test we need to fit a linear regression model to the time series and then apply the test to the residuals. We will use ols from statsmodels to fit the model. First we create a pandas dataframe with the time series and an integer time index, using standard names to make the code reusable.
y = y.reset_index(drop=True).reset_index()
y.columns = ['time', 'value']
y
     time  value
0       0  112.0
1       1  118.0
2       2  132.0
3       3  129.0
4       4  121.0
..    ...    ...
139   139  606.0
140   140  508.0
141   141  461.0
142   142  390.0
143   143  432.0

144 rows × 2 columns
Now we fit the linear regression
formula = 'value ~ time'
olsr = ols(formula, y).fit()
And we apply the Breusch-Pagan test to the residuals and extract the p-value:
_, breusch_pagan_p_value, _, _ = sms.het_breuschpagan(olsr.resid, olsr.model.exog)
breusch_pagan_p_value  # If the p-value is less than 0.05, the residuals are heteroskedastic
4.5590018568832897e-07
As expected for the airline passengers time series, the p-value is very small, indicating that the variance is not constant. A way to understand this is to plot the residuals of the linear regression model. We will do this below.
sns.scatterplot(x=y.index, y=np.abs(olsr.resid), label="OLS Residuals vs time")
plt.show(); plt.close()
The funnel-like shape of the residuals plot is a clear indication that the variance is not constant. There are other tests that can be used to check for this, but the Breusch-Pagan test is the most common one. Check this Medium post by V. Cerqueira for further details (Cerqueira is one of the authors of (Cerqueira and Roque 2024)). You can also play with the online interactive tool in Section 3.1 of Forecasting: Principles and Practice.
Resampling a Time Series
Resampling
Even though resampling is not directly connected with the notion of stationarity, it is a frequently used tool in the analysis of time series and you may come across it in your work. Resampling is the process of changing the sampling frequency of the time series. For example, you may have an hourly time series and want to convert it to a daily time series. This is a resampling operation. When aggregating like this we can of course only resample to a lower frequency: you cannot turn daily data into hourly data by aggregation!
Let us see a couple of examples to understand how this works. We will use the resample method in pandas to do this.
Pedestrian Count Example
This is (quote): A dataset containing the hourly pedestrian counts from 2015-01-01 to 2016-12-31 at 4 sensors in the city of Melbourne.
Check this link to the dataset for more information. We will load the data and select one of the sensors (Bourke Street Mall (North)) to work with.
pedestrians = pd.read_csv('../4_1_Introduction_to_Forecasting/tsdata/tsibble/pedestrian.csv')
pedestrians = pedestrians[pedestrians["Sensor"] == "Bourke Street Mall (North)"]
pedestrians.drop(columns=["Sensor"], inplace=True)
pedestrians
Now suppose that we want to convert this hourly time series to a daily time series. We can do this with the resample method in pandas. We will use the sum method to aggregate the hourly counts into daily counts, because we are interested in the total number of pedestrians per day. We would get it like this:
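A sketch of the daily aggregation; the Date_Time and Count column names are the ones used by the tsibble pedestrian dataset, so double check them against the dataframe you just loaded:

# Use the timestamps as the index and aggregate the hourly counts into daily totals
pedestrians["Date_Time"] = pd.to_datetime(pedestrians["Date_Time"])
pedestrians_daily = pedestrians.set_index("Date_Time")["Count"].resample("D").sum()

pedestrians_daily.plot(title="Daily Pedestrian Counts, Bourke Street Mall (North)")
plt.show(); plt.close()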
You can see in this example how the daily resampling operation allows us to more clearly identify seasonality patterns in the time series.
Another resampling example: sea surface temperature data
In our previous session we already introduced the sea surface temperature data from the NOAA. We will use this data to show how to resample a time series to a different frequency. We will use the monthly data to create a yearly time series. The crucial difference with the preceding example is that we will use the mean method to aggregate the monthly temperatures into yearly temperatures. Adding up the temperatures would not make sense in this case, but finding the mean temperature for each year is a meaningful operation.
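A sketch of the yearly resampling step, assuming NOAA_global_temp is the monthly series (with a DatetimeIndex) loaded in the previous session:

# Yearly mean of the monthly values ("YE" = year end; use "A" or "Y" in older pandas versions)
NOAA_global_temp_yearly = NOAA_global_temp.resample("YE").mean()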
NOAA_global_temp.plot()
plt.title("Monthly Global Temperature Anomalies")
NOAA_global_temp_yearly.plot()
plt.title("Yearly Global Temperature Anomalies")
plt.show(); plt.close()
Note again in this example how resampling can help reveal the long term trends and patterns in the time series. For example it is well known that the “El niño” phenomenon is associated with a rise in sea surface temperatures. Can you spot this in the yearly time series?
Other Machine Learning Tasks for Time Series
Before ending this session we would like to mention that forecasting is not the only goal in time series analysis. We may also be interested in time series classification. For example, if you have a time series representation of a bird song, can you tell which species it is? There are also clustering problems. Imagine you are a biologist arriving at a new, unexplored region where several different (but yet unknown to science) species of bats use ultrasound to move around. Can you use recordings of their sounds to guess how many species of bats there are?
In the Next Session
First we will see how to define so-called baseline models for time series forecasting. These are simple models that are easy to implement and that can be used to compare the performance of more complex models. Then we will learn about autoregressive and moving average processes, and we will see that these constitute the foundation of a very important family of forecasting models, the SARIMA models.