When it comes to time series analysis, it is important to choose the right model to make accurate predictions. Two of the most commonly used models are autoregressive (AR) and moving average (MA). The question is, how do you decide which one to use? One useful tool to help make this decision is the ACF plot.

The ACF (autocorrelation function) plot is a graph that shows the correlation between a time series and its lagged values. It is a measure of how similar a time series is to itself at different points in time. ACF plots can be used to determine whether an AR or MA model is more appropriate for a given time series.
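To make the idea of "correlation between a series and its lagged values" concrete, here is a minimal sketch that computes it by hand with NumPy. The helper name lag_corr and the two synthetic series are assumptions made for illustration; they are not part of the MSFT analysis in this post. Library routines such as the statsmodels ACF normalize by the overall mean and variance, so their numbers can differ slightly from this plain Pearson correlation.

```python
import numpy as np

def lag_corr(x, k):
    """Pearson correlation between the series and a copy of itself shifted by k steps (k >= 1)."""
    x = np.asarray(x, dtype=float)
    return np.corrcoef(x[:-k], x[k:])[0, 1]

rng = np.random.default_rng(0)
noise = rng.normal(size=500)   # white noise: no memory of past values
walk = np.cumsum(noise)        # random walk: each value is the previous one plus noise

for k in (1, 5, 10):
    print(k, round(lag_corr(noise, k), 3), round(lag_corr(walk, k), 3))
```

The white noise correlations should hover near zero at every lag, while the random walk stays highly correlated with its own past. That slow decay is the typical signature of a non-stationary series.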

To use ACF plots to decide on AR vs MA models, follow these steps:

Step 1: Determine the order of differencing

The first step is to determine the order of differencing needed to make the time series stationary. Stationarity is important for time series analysis because it ensures that the statistical properties of the time series remain constant over time. To determine the order of differencing, look at the p-value of the ADF test.

import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt
import statsmodels.api as sm
import pmdarima as pm
import yfinance as yf
import seaborn as sns
from statsmodels.tsa.arima.model import ARIMA

msft = yf.Ticker('MSFT')
df = msft.history(period='5y')

sns.light_palette("seagreen", as_cmap=True)
sns.set_style("darkgrid", {"grid.color": ".6", "grid.linestyle": ":"})
sns.lineplot(df['Close'])
plt.title('MSFT')

ADF Test

The Augmented Dickey-Fuller (ADF) test is a statistical test used to determine whether a time series has a unit root, which indicates non-stationarity. Non-stationarity means that the statistical properties of the series, such as its mean and variance, change over time. The null hypothesis of the test is that a unit root is present, so a small p-value is evidence that the series is stationary. The ADF test is commonly used in econometrics and finance to analyze the stationarity of economic and financial data, such as stock prices, interest rates, and exchange rates. The test is named after the statisticians David Dickey and Wayne Fuller. The augmented version extends the original Dickey-Fuller test by adding lagged difference terms to the test regression, which accounts for serial correlation in the series.

adf_result = sm.tsa.stattools.adfuller(df['Close'])
print('ADF Statistic:', adf_result[0])
print('p-value:', adf_result[1])
print('Critical Values:', adf_result[4])

ADF Statistic: -1.1569117974100418
p-value: 0.6918512889859891
Critical Values: {'1%': -3.4355964295197743, '5%': -2.863856825923603, '10%': -2.5680035060041626}

If the p-value of the ADF test is greater than 0.05, we cannot reject the null hypothesis of a unit root, so we keep differencing the series until it becomes stationary. The concept of stationarity will be explained in a separate post. Differencing is the discrete-time analogue of differentiation, and the pandas diff function does the job. We keep differencing until the p-value of the ADF test falls below the 0.05 threshold. The number of times we had to apply diff is the order of differencing needed to make the time series stationary.

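# Difference the series until the ADF test rejects a unit root (p <= 0.05)
series = df['Close']
d = 0
p = sm.tsa.stattools.adfuller(series)[1]
while p > 0.05:
    # dropna removes the NaN that diff leaves at the start of the series
    series = series.diff().dropna()
    d += 1
    p = sm.tsa.stattools.adfuller(series)[1]
print("Difference value (d) for the time-series: ", d)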

Difference value (d) for the time-series:  1

Since we are looking at a stock price, d = 1 is the expected result: differencing the stock prices once gives us a stationary series.

The difference of stock prices does not produce martingales, since a difference is not the same as a return. Under the standard model of stock prices, the difference of the log of the prices (the log return) is approximately normally distributed, which is why stock prices are said to follow a log-normal distribution. We will not concern ourselves with the distribution here and will focus on the time-series analysis. In real-world applications, however, we need to worry about the distribution of the underlying process (Levy processes etc.).
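The distinction between a price difference, a simple return, and a log difference is easy to check numerically. The sketch below uses a handful of made-up prices (not MSFT data) and plain NumPy. With pandas, the log difference of the series used in this post would be np.log(df['Close']).diff().

```python
import numpy as np

# Made-up closing prices, for illustration only
prices = np.array([100.0, 102.0, 101.0, 105.0, 110.0])

price_diff = np.diff(prices)            # P_t - P_{t-1}, in currency units
simple_ret = price_diff / prices[:-1]   # (P_t - P_{t-1}) / P_{t-1}, unitless
log_ret = np.diff(np.log(prices))       # log(P_t) - log(P_{t-1})

print("differences:   ", price_diff)
print("simple returns:", np.round(simple_ret, 4))
print("log returns:   ", np.round(log_ret, 4))
```

The log return equals log(1 + simple return), so the two are close for small moves, while the raw difference scales with the price level.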

Step 2: Plot the ACF

The second step is to plot the ACF for the time series data. The ACF plot will show the correlation between the time series and its lagged values. The plot will have lag on the x-axis and the correlation coefficient on the y-axis.

sm.graphics.tsa.plot_acf(df['Close'].diff().dropna(), lags=20)
plt.show()

Step 3: Look for significant spikes in the ACF plot

After determining the order of differencing, look for significant spikes in the ACF plot of the differenced series. A significant spike is one that falls outside the confidence band, which is the range where the sample autocorrelation would be expected to fall if there were no real correlation at that lag. The spike at lag 0 is always 1, because every series is perfectly correlated with itself, so it carries no information. If the ACF cuts off sharply (significant spikes at the first few lags and nothing after), then an MA model is appropriate. If the ACF decays gradually over many lags, then an AR model is appropriate. If neither pattern is clear cut, then an ARMA model may be appropriate.
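To see how the ACF differs between the two model families, here is a small simulation, assuming a coefficient of 0.8 for both processes and a hand-written sample_acf helper (both are illustrative choices, not taken from the MSFT data). In theory, an MA(1) process has a non-zero autocorrelation only at lag 1, while the autocorrelation of an AR(1) process shrinks geometrically.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation of x at lags 0..nlags."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[: len(x) - k], x[k:]) / denom for k in range(nlags + 1)])

rng = np.random.default_rng(42)
n = 5000
eps = rng.normal(size=n + 1)

# MA(1): x_t = eps_t + 0.8 * eps_{t-1}
ma1 = eps[1:] + 0.8 * eps[:-1]

# AR(1): x_t = 0.8 * x_{t-1} + eps_t
ar1 = np.zeros(n)
for t in range(1, n):
    ar1[t] = 0.8 * ar1[t - 1] + eps[t]

acf_ma = sample_acf(ma1, 5)
acf_ar = sample_acf(ar1, 5)
print("MA(1) ACF:", np.round(acf_ma, 2))
print("AR(1) ACF:", np.round(acf_ar, 2))
```

The MA(1) values should be close to zero from lag 2 onward, while the AR(1) values should fall off slowly, roughly as powers of 0.8.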

Step 4: Determine the order of the AR or MA model

The final step is to determine the order of the AR or MA model. For an MA model, the order can be read from the ACF plot: a significant spike at lag 1 only suggests an MA(1) model, and significant spikes at lags 1 and 2 suggest an MA(2) model. For an AR model, the ACF decays gradually and does not show the order directly. The order is usually read from the partial autocorrelation (PACF) plot instead, where a cutoff after lag 1 suggests an AR(1) model and a cutoff after lag 2 suggests an AR(2) model. In our case, the ACF drops off right at the start of the plot and shows no sustained run of significant spikes (there is a spurious spike at lag 9, which we can ignore). A low-order MA model such as MA(1) is therefore a reasonable candidate for this time series. We can check this conclusion using the auto_arima function as well.
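Reading an order from the ACF comes down to checking which spikes leave the confidence band. The sketch below does this for a simulated MA(2) series, assuming the simple white-noise band of 1.96 / sqrt(N). The coefficients 0.6 and 0.4 and the series itself are made up for illustration, and the shaded region drawn by plot_acf may be computed somewhat differently.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000
eps = rng.normal(size=n + 2)

# MA(2): x_t = eps_t + 0.6 * eps_{t-1} + 0.4 * eps_{t-2}
x = eps[2:] + 0.6 * eps[1:-1] + 0.4 * eps[:-2]

# Sample autocorrelation at lags 0..10
xc = x - x.mean()
acf = np.array([np.dot(xc[: n - k], xc[k:]) for k in range(11)]) / np.dot(xc, xc)

# Approximate 95% band for the autocorrelation of white noise
band = 1.96 / np.sqrt(n)

# Lag 0 is skipped: it is always 1 and says nothing about the model
significant = [k for k in range(1, 11) if abs(acf[k]) > band]
print("band: +/-", round(band, 4))
print("significant lags:", significant)
```

Lags 1 and 2 should stand well clear of the band. An isolated higher lag may occasionally cross it by chance, which is the kind of spurious spike mentioned above.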

AUTO_ARIMA

We can use the auto_arima function from pmdarima to find an appropriate ARIMA model. The function searches over candidate model orders and, by default, selects the one that minimizes the AIC.

model = pm.auto_arima(df['Close'], seasonal=True, m=12)
print(model.order)
print(model.seasonal_order)

(0, 1, 1)
(0, 0, 0, 12)
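The idea of choosing a model by minimizing the AIC can be sketched in a few lines. This is a simplified stand-in and not what auto_arima does internally: it fits only pure AR(p) models by ordinary least squares on a simulated AR(2) series and uses the Gaussian AIC formula m * log(RSS / m) + 2k. The helper ar_aic, the coefficients 0.5 and 0.3, and the search range are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
eps = rng.normal(size=n)

# Simulated AR(2): x_t = 0.5 * x_{t-1} + 0.3 * x_{t-2} + eps_t
x = np.zeros(n)
for t in range(2, n):
    x[t] = 0.5 * x[t - 1] + 0.3 * x[t - 2] + eps[t]

def ar_aic(x, p, max_p):
    """Fit AR(p) by least squares on a common sample and return a Gaussian AIC."""
    y = x[max_p:]
    # Columns: intercept, then x_{t-1}, ..., x_{t-p}
    cols = [np.ones(len(y))] + [x[max_p - k : len(x) - k] for k in range(1, p + 1)]
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    m = len(y)
    return m * np.log(rss / m) + 2 * (p + 1)

max_p = 4
aic = {p: ar_aic(x, p, max_p) for p in range(max_p + 1)}
best_p = min(aic, key=aic.get)

for p, value in aic.items():
    print(p, round(value, 1))
print("order with lowest AIC:", best_p)
```

The AIC should drop sharply up to p = 2 and then flatten, since extra lags add a penalty without improving the fit much.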

In conclusion, the ACF plot is a useful tool for deciding on an AR vs MA model for time series analysis. By following the steps outlined above, you can determine whether an AR or MA model is appropriate and the order of the model needed to make accurate predictions.

There are beautiful flowcharts in the book by Peixeiro [2], which I found very useful in identifying the time-series model. I am including a variation below.

References

[1] MSFT Stock Price Data: https://finance.yahoo.com/quote/MSFT/history/
[2] Peixeiro, M. (2020). Time Series Forecasting Using Python: An introduction to traditional and deep learning models for time series forecasting. Apress.
