python

machine-learning

To explain how we can do a time series forecasting, we will use a famous data set on the evolution of the number of passengers on a famous American airline from 1949 to 1960.

In [1]:

```
import seaborn as sns
total_data = sns.load_dataset("flights")
total_data.head()
```

Out[1]:

The raw data set would be used to perform the usual Machine Learning process as we have seen in past modules. This time, we need to apply a transformation of it to generate a time series with two dimensions: the time dimension and the dimension of the data we want to analyze and predict. In this case the time dimension will be composed by the month (`month`

) and the year (`year`

) and the data we will observe over time will be the number of passengers (`passengers`

).

In [2]:

```
import pandas as pd
total_data["month"] = pd.to_datetime(total_data.month, format = "%b").dt.month
total_data["date"] = pd.to_datetime(total_data[["year", "month"]].assign(day = 1))
total_data = total_data.set_index("date")
ts = total_data["passengers"]
ts.head()
```

Out[2]:

Next, we will visualize the time series to perform a visual analysis of it:

In [3]:

```
import matplotlib.pyplot as plt
fig, axis = plt.subplots(figsize = (10, 5))
sns.lineplot(data = ts)
plt.tight_layout()
plt.show()
```

To analyze a time series, as we saw in the theory, we must study several parameters:

**Trend**: An upward trend is apparent, indicating that the number of passengers has increased over time. This may be due to several factors: the growth of the airline industry and the provision of more resources to move passengers, price reduction, increased interest in air travel, and so on.**Seasonality**: There is some seasonality in the data, with certain months consistently having more flights than others. This could be due to seasonal demand (more people flying during vacations, for example).**Variability**: There are some points of variability in the time series, especially between periods of increasing and decreasing demand.**Outliers**: Studying the trend and seasonality of the time series, no outliers are observed in the time series.**Inflection points**: Depending on the year, the increase in the number of passengers is not regular, and sometimes there are variations in the slope; these are inflection points.

Through visual analysis, we might be able to estimate these metrics by eye, but it is always better to orient the analysis to mathematical data. For the task of making predictions on time series and analyzing them, we will rely on the `statsmodels`

library.

The decomposition of a time series is a statistical process that separates a time series into several distinct elements. Each of these components represents a part of the underlying structure of the time series. Decomposing a time series can be very useful to better understand the data and make informed decisions when building forecasting models.

We use the `seasonal_decompose`

function of the `statsmodels`

library to decompose the time series into its trend, seasonality and residual components.

In [4]:

```
from statsmodels.tsa.seasonal import seasonal_decompose
decomposition = seasonal_decompose(ts, period = 12)
decomposition
```

Out[4]:

A trend refers to the general direction in which the data is moving. To access its information, we resort to the `trend`

component of the resulting `decomposition`

.

In [5]:

```
trend = decomposition.trend
fig, axis = plt.subplots(figsize = (10, 5))
sns.lineplot(data = ts)
sns.lineplot(data = trend)
plt.tight_layout()
plt.show()
```

This confirms what has been observed: a clear positive trend over the years.

Seasonality refers to repetitive patterns in the data. To access its information, we resort to the `seasonal`

component of the resulting `decomposition`

.

In [6]:

```
seasonal = decomposition.seasonal
fig, axis = plt.subplots(figsize = (10, 5))
sns.lineplot(data = ts)
sns.lineplot(data = seasonal)
plt.tight_layout()
plt.show()
```

To evaluate the stationarity of the time series we can apply the so-called **Dickey-Fuller test**, which is a hypothesis test in which the null hypothesis is that the series is stationary, and the alternative is that it is non-stationary:

In [7]:

```
from statsmodels.tsa.stattools import adfuller
def test_stationarity(timeseries):
print("Dickey-Fuller test results:")
dftest = adfuller(timeseries, autolag = "AIC")
dfoutput = pd.Series(dftest[0:4], index = ["Test Statistic", "p-value", "#Lags Used", "Number of Observations Used"])
for key,value in dftest[4].items():
dfoutput["Critical Value (%s)"%key] = value
return dfoutput
test_stationarity(ts)
```

Out[7]:

Here we can see that the `p-value`

is greater than 0.05, which means that our null hypothesis will be rejected, and we will take this series as non-stationary.

Variability involves the study of residuals: that is how the data fluctuate once trend and seasonality have been studied. To access their information, we resort to the `resid`

component of the resulting `decomposition`

.

In [8]:

```
residual = decomposition.resid
fig, axis = plt.subplots(figsize = (10, 5))
sns.lineplot(data = ts)
sns.lineplot(data = residual)
plt.tight_layout()
plt.show()
```

This partly confirms what was observed, since the waste load is more noticeable at the beginning and end of the period studied.

Autocorrelation is the correlation of a time series with a lagged copy of itself. This chart helps us see if values in the time series are correlated with previous values.

In [9]:

```
from statsmodels.graphics.tsaplots import plot_acf
plot_acf(ts)
plt.tight_layout()
plt.show()
```

There is a high correlation between the points and their delayed copies, which decreases over time.

An $ARIMA(p, d, q)$ model consists of three hyperparameters:

`p`

: The order of the autoregressive (AR) component.`d`

: The degree of differencing needed to make the time series stationary.`q`

: The order of the moving average (MA) component.

The study of these hyperparameters escapes our function, since it is a purely mathematical-statistical analysis. Nowadays, there are tools that make our lives easier by estimating internally the most appropriate hyperparameters and generating the best possible model, such as the `pmdarima`

package and its `auto_arima`

function. The only thing we have to consider is that in order to optimize its results to the maximum, we must transform the series into stationary, and as in the case of this series, it is not, we must transform it:

In [10]:

```
ts_stationary = ts.diff().dropna()
test_stationarity(ts_stationary)
```

Out[10]:

Now the series is stationary, and we can apply the automatic ARIMA method:

In [11]:

```
from pmdarima import auto_arima
model = auto_arima(ts_stationary, seasonal = True, trace = True, m = 12)
```

As we can see, the function searches the possible solution space to estimate the best parameters. In this case, we would have an $ARIMA(0, 0, 1)$. The model returned by this function is fully usable, like any other we have seen, and its `summary()`

function returns statistical and performance information that is of great value:

In [12]:

```
model.summary()
```

Out[12]:

Once the model has been trained, it can be used to predict into the future (we will predict the next `10`

months)

In [13]:

```
forecast = model.predict(10)
forecast
```

Out[13]:

In [14]:

```
import matplotlib.pyplot as plt
fig, axis = plt.subplots(figsize = (10, 5))
sns.lineplot(data = ts_stationary)
sns.lineplot(data = forecast, c = "green")
plt.tight_layout()
plt.show()
```

Our model is now able to make forward predictions on our stationary series.