To explain how we can do time series forecasting, we will use a well-known data set: the monthly number of passengers of an American airline between 1949 and 1960.
import seaborn as sns
total_data = sns.load_dataset("flights")
total_data.head()
The raw data set could be used to perform the usual Machine Learning process as we have seen in past modules. This time, however, we need to transform it to generate a time series with two dimensions: the time dimension and the dimension of the data we want to analyze and predict. In this case, the time dimension will be composed of the month (month) and the year (year), and the data we will observe over time will be the number of passengers (passengers).
import pandas as pd

# Convert the abbreviated month names ("Jan", "Feb", ...) into month numbers
total_data["month"] = pd.to_datetime(total_data.month, format = "%b").dt.month

# Build a date from the year and month columns and set it as the index
total_data["date"] = pd.to_datetime(total_data[["year", "month"]].assign(day = 1))
total_data = total_data.set_index("date")

# Keep only the variable we want to observe over time
ts = total_data["passengers"]
ts.head()
Next, we will plot the time series to analyze it visually:
import matplotlib.pyplot as plt
fig, axis = plt.subplots(figsize = (10, 5))
sns.lineplot(data = ts)
plt.tight_layout()
plt.show()
To analyze a time series, as we saw in the theory, we must study several parameters: its trend, its seasonality, its stationarity, its variability and its autocorrelation.
Through visual analysis, we might be able to estimate these metrics by eye, but it is always better to ground the analysis in mathematical data. For the task of analyzing time series and making predictions on them, we will rely on the statsmodels library.
The decomposition of a time series is a statistical process that separates a time series into several distinct elements. Each of these components represents a part of the underlying structure of the time series. Decomposing a time series can be very useful to better understand the data and make informed decisions when building forecasting models.
We use the seasonal_decompose function of the statsmodels library to decompose the time series into its trend, seasonality and residual components.
from statsmodels.tsa.seasonal import seasonal_decompose
decomposition = seasonal_decompose(ts, period = 12)
decomposition
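Before examining each component separately, note that the resulting object also offers a convenience method that draws all of the components at once; a quick sketch:

# Plot the observed series, trend, seasonality and residuals in one figure
decomposition.plot()
plt.tight_layout()
plt.show()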
A trend refers to the general direction in which the data is moving. To access its information, we resort to the trend component of the resulting decomposition.
trend = decomposition.trend
fig, axis = plt.subplots(figsize = (10, 5))
sns.lineplot(data = ts)
sns.lineplot(data = trend)
plt.tight_layout()
plt.show()
This confirms what has been observed: a clear positive trend over the years.
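If we wanted to put a rough number on that trend (a quick sketch, not part of the original analysis), we could fit a straight line to the extracted trend component; numpy's polyfit is enough for this:

import numpy as np

# The moving average leaves NaNs at both ends of the trend component
trend_clean = trend.dropna()

# Fit a first-degree polynomial: the slope is the average monthly growth
slope, intercept = np.polyfit(np.arange(len(trend_clean)), trend_clean, 1)
print(f"Average growth: ~{slope:.2f} passengers per month")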
Seasonality refers to repetitive patterns in the data. To access its information, we resort to the seasonal component of the resulting decomposition.
seasonal = decomposition.seasonal
fig, axis = plt.subplots(figsize = (10, 5))
sns.lineplot(data = ts)
sns.lineplot(data = seasonal)
plt.tight_layout()
plt.show()
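As a complement (a small sketch, not in the original analysis), we can summarize the seasonal component as the average effect of each calendar month, which makes the mid-year peak explicit:

# Average seasonal effect per calendar month (1 = January, ..., 12 = December)
monthly_effect = seasonal.groupby(seasonal.index.month).mean()
print(monthly_effect)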
To evaluate the stationarity of the time series, we can apply the so-called (augmented) Dickey-Fuller test, which is a hypothesis test in which the null hypothesis is that the series is non-stationary (it has a unit root), and the alternative is that it is stationary:
from statsmodels.tsa.stattools import adfuller
def test_stationarity(timeseries):
    print("Dickey-Fuller test results:")

    # adfuller returns the test statistic, the p-value, the number of lags used
    # and the number of observations, followed by the critical values
    dftest = adfuller(timeseries, autolag = "AIC")
    dfoutput = pd.Series(dftest[0:4], index = ["Test Statistic", "p-value", "#Lags Used", "Number of Observations Used"])
    for key, value in dftest[4].items():
        dfoutput["Critical Value (%s)" % key] = value
    return dfoutput

test_stationarity(ts)
Here we can see that the p-value is greater than 0.05, which means that we cannot reject the null hypothesis, and we take this series to be non-stationary.
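If we prefer a direct yes/no answer, a minimal helper (a sketch, not part of the original code) can compare the p-value, the second element returned by adfuller, against the usual 5% significance level:

def is_stationary(timeseries, alpha = 0.05):
    # Reject the unit-root null hypothesis only if the p-value is small enough
    return adfuller(timeseries, autolag = "AIC")[1] < alpha

is_stationary(ts)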
Variability involves the study of the residuals: that is, how the data fluctuate once the trend and seasonality have been removed. To access their information, we resort to the resid component of the resulting decomposition.
residual = decomposition.resid
fig, axis = plt.subplots(figsize = (10, 5))
sns.lineplot(data = ts)
sns.lineplot(data = residual)
plt.tight_layout()
plt.show()
This partly confirms what was observed, since the residuals are more pronounced at the beginning and at the end of the period studied.
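To back this up with a number (a quick check, assuming the DatetimeIndex built earlier), we can compare the spread of the residuals in the early and late years:

residual_clean = residual.dropna()

# Standard deviation of the residuals at both ends of the period
print(residual_clean[:"1952"].std())
print(residual_clean["1958":].std())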
Autocorrelation is the correlation of a time series with a lagged copy of itself. The following plot helps us see whether values in the time series are correlated with previous values.
from statsmodels.graphics.tsaplots import plot_acf
plot_acf(ts)
plt.tight_layout()
plt.show()
There is a high correlation between the points and their lagged copies, which decreases as the lag grows.
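A common companion plot (not used in the original analysis, but available in the same statsmodels module) is the partial autocorrelation function, often read together with the ACF when reasoning about the AR order p of the model introduced below:

from statsmodels.graphics.tsaplots import plot_pacf

# Correlation of the series with each lag, controlling for shorter lags
plot_pacf(ts)
plt.tight_layout()
plt.show()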
An ARIMA(p, d, q) model consists of three hyperparameters:

p: The order of the autoregressive (AR) component.
d: The degree of differencing needed to make the time series stationary.
q: The order of the moving average (MA) component.

The study of these hyperparameters is beyond the scope of this exercise, since it is a purely mathematical-statistical analysis. Nowadays, there are tools that make our lives easier by internally estimating the most appropriate hyperparameters and generating the best possible model, such as the pmdarima package and its auto_arima function. The only thing we have to keep in mind is that, to get the best results, the series must be stationary; since this one is not, we must transform it:
# Differencing turns each value into the change with respect to the previous month, removing the trend
ts_stationary = ts.diff().dropna()
test_stationarity(ts_stationary)
Now the series is stationary, and we can apply the automatic ARIMA method:
from pmdarima import auto_arima

# auto_arima searches over candidate (p, d, q) orders; m = 12 sets the length
# of the seasonal cycle for monthly data
model = auto_arima(ts_stationary, seasonal = True, trace = True, m = 12)
As we can see, the function searches the possible solution space to estimate the best hyperparameters; the order of the selected ARIMA model is shown at the end of the trace output. The model returned by this function is fully usable, like any other we have seen, and its summary() method returns statistical and performance information of great value:
model.summary()
Once the model has been trained, it can be used to predict into the future (here, the next 10 months):
# Forecast the next 10 values of the (differenced) series
forecast = model.predict(10)
forecast
fig, axis = plt.subplots(figsize = (10, 5))
sns.lineplot(data = ts_stationary)
sns.lineplot(data = forecast, c = "green")
plt.tight_layout()
plt.show()
Our model is now able to make forward predictions, although on the stationary (differenced) series rather than on the original passenger counts.
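To recover actual passenger numbers, the differencing has to be undone. A minimal sketch (not part of the original notebook): cumulatively sum the predicted changes and add the last observed value of the original series.

# Reverse the diff(): each level is the previous level plus the predicted change
forecast_passengers = ts.iloc[-1] + forecast.cumsum()
forecast_passengers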