
Forecasting time series with R using the forecast package

# install & load required packages
# install.packages("forecast") # for forecasting time series
# install.packages("ggplot2")
library(forecast)
library(ggplot2)
# Alternatively, you can install & load the fpp2 package, which installs and loads the forecast and ggplot2 packages by default.
# install.packages("fpp2")
library(fpp2)
## Loading required package: fma
## Loading required package: expsmooth
library(xts)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

The predictability of an event or a quantity depends on several factors including:

  • how well we understand the factors that contribute to it
  • how much data are available
  • whether the forecasts can affect the thing we are trying to forecast.

For example, forecasts of electricity demand can be highly accurate because all three conditions are usually satisfied. We have a good idea of the contributing factors: electricity demand is driven largely by temperatures, with smaller effects for calendar variation such as holidays, and economic conditions. Provided there is a sufficient history of data on electricity demand and weather conditions, and we have the skills to develop a good model linking electricity demand and the key driver variables, the forecasts can be remarkably accurate. (Forecasting: Principles and Practice, Rob J Hyndman and George Athanasopoulos, 2nd edition)

On the other hand, when forecasting currency exchange rates, only one of the conditions is satisfied: there is plenty of available data. However, we have a very limited understanding of the factors that affect exchange rates, and forecasts of the exchange rate have a direct effect on the rates themselves. If there are well-publicized forecasts that the exchange rate will increase, then people will immediately adjust the price they are willing to pay and so the forecasts are self-fulfilling. In a sense, the exchange rates become their own forecasts. This is an example of the “efficient market hypothesis”. Consequently, forecasting whether the exchange rate will rise or fall tomorrow is about as predictable as forecasting whether a tossed coin will come down as a head or a tail. In both situations, you will be correct about 50% of the time, whatever you forecast. In situations like this, forecasters need to be aware of their own limitations, and not claim more than is possible.(Forecasting: Principles and Practice, Rob J Hyndman and George Athanasopoulos, 2nd edition)

Forecasting is a common statistical task in business, where it helps to inform decisions about the scheduling of production, transportation and personnel, and provides a guide to long-term strategic planning. However, business forecasting is often done poorly, and is frequently confused with planning and goals. They are three different things.(Forecasting: Principles and Practice, Rob J Hyndman and George Athanasopoulos, 2nd edition)

Forecasting

is about predicting the future as accurately as possible, given all of the information available, including historical data and knowledge of any future events that might impact the forecasts.(Forecasting: Principles and Practice, Rob J Hyndman and George Athanasopoulos, 2nd edition)

Goals

are what you would like to have happen. Goals should be linked to forecasts and plans, but this does not always occur. Too often, goals are set without any plan for how to achieve them, and no forecasts for whether they are realistic.(Forecasting: Principles and Practice, Rob J Hyndman and George Athanasopoulos, 2nd edition)

Planning

is a response to forecasts and goals. Planning involves determining the appropriate actions that are required to make your forecasts match your goals. Forecasting should be an integral part of the decision-making activities of management, as it can play an important role in many areas of a company. Modern organizations require short-term, medium-term and long-term forecasts, depending on the specific application.(Forecasting: Principles and Practice, Rob J Hyndman and George Athanasopoulos, 2nd edition)

Forecasting methods

There are three main types of methods that can be used.

  1. Forecasting using predictor variables: this is called an explanatory model. \[\begin{equation} nasdaq = f(gdp, inflation, interestRates) \end{equation}\]

  2. Forecasting using its own past values: a time series model. \[\begin{equation} nasdaq_{t+1} = f(nasdaq_t, nasdaq_{t-1}, nasdaq_{t-2}, \dots, nasdaq_{t-k}) \end{equation}\]

In more general form: \[\begin{equation} y_{t+1} = f(y_t, y_{t-1}, y_{t-2}, \dots, y_{t-k}) \end{equation}\]

  3. A combination of the above two: in the third type of model we combine the two. \[\begin{equation} nasdaq_{t+1} = f(nasdaq_t, gdp, inflation, \dots) \end{equation}\]

These types of “mixed models” have been given various names in different disciplines. They are known as dynamic regression models, panel data models, longitudinal models, transfer function models, and linear system models (assuming that f is linear) (Forecasting: Principles and Practice, Rob J Hyndman and George Athanasopoulos, 2nd edition).

A forecasting task usually involves five basic steps:

  1. Problem definition
  2. Gathering information
  3. Preliminary(Exploratory) analysis
  4. Choosing and fitting models
  5. Using and evaluating a forecasting model

Statistical forecasting perspective

The thing we are trying to forecast is unknown (or we wouldn’t be forecasting it), and so we can think of it as a random variable. For example, the total sales for next month could take a range of possible values, and until we add up the actual sales at the end of the month, we don’t know what the value will be. So until we know the sales for next month, it is a random quantity (Forecasting: Principles and Practice, Rob J Hyndman and George Athanasopoulos, 2nd edition).

Because next month is relatively close, we usually have a good idea what the likely sales values could be. On the other hand, if we are forecasting the sales for the same month next year, the possible values it could take are much more variable. In most forecasting situations, the variation associated with the thing we are forecasting will shrink as the event approaches. In other words, the further ahead we forecast, the more uncertain we are. (Forecasting: Principles and Practice, Rob J Hyndman and George Athanasopoulos, 2nd edition)

When we obtain a forecast, we are estimating the middle of the range of possible values the random variable could take. Very often, a forecast is accompanied by a prediction interval giving a range of values the random variable could take with relatively high probability. For example, a 95% prediction interval contains a range of values which should include the actual future value with probability 95%.(Forecasting: Principles and Practice, Rob J Hyndman and George Athanasopoulos, 2nd edition)

When a forecast is made, we usually show prediction intervals rather than all possible scenarios. The forecast itself is the average of all possible future values and is often called the “point forecast”.

Throughout this post I will use four data sets: the NASDAQ composite index, US GDP, CPI and interest rate data, and we will forecast the future prices of the NASDAQ composite index.

Let’s now use simple methods to understand the structure of the nasdaq data.

class(nasdaq)
## [1] "data.frame"
dim(nasdaq)
## [1] 2074    7
names(nasdaq)
## [1] "Date"      "Open"      "High"      "Low"       "Close"     "Adj.Close"
## [7] "Volume"
str(nasdaq)
## 'data.frame':    2074 obs. of  7 variables:
##  $ Date     : chr  "2010-01-04" "2010-01-05" "2010-01-06" "2010-01-07" ...
##  $ Open     : num  2294 2307 2308 2298 2292 ...
##  $ High     : num  2311 2314 2314 2301 2318 ...
##  $ Low      : num  2294 2296 2296 2285 2291 ...
##  $ Close    : num  2308 2309 2301 2300 2317 ...
##  $ Adj.Close: num  2308 2309 2301 2300 2317 ...
##  $ Volume   : num  1.93e+09 2.37e+09 2.25e+09 2.27e+09 2.15e+09 ...
head(nasdaq)
##         Date    Open    High     Low   Close Adj.Close     Volume
## 1 2010-01-04 2294.41 2311.15 2294.41 2308.42   2308.42 1931380000
## 2 2010-01-05 2307.27 2313.73 2295.62 2308.71   2308.71 2367860000
## 3 2010-01-06 2307.71 2314.07 2295.68 2301.09   2301.09 2253340000
## 4 2010-01-07 2298.09 2301.30 2285.22 2300.05   2300.05 2270050000
## 5 2010-01-08 2292.24 2317.60 2290.61 2317.17   2317.17 2145390000
## 6 2010-01-11 2324.78 2326.28 2302.21 2312.41   2312.41 2077890000
tail(nasdaq)
##            Date    Open    High     Low   Close Adj.Close     Volume
## 2069 2018-03-22 7257.55 7303.19 7164.38 7166.68   7166.68 2347160000
## 2070 2018-03-23 7170.68 7194.31 6992.67 6992.67   6992.67 2390410000
## 2071 2018-03-26 7125.20 7225.83 7022.34 7220.54   7220.54 2326060000
## 2072 2018-03-27 7255.47 7255.54 6963.68 7008.81   7008.81 2325990000
## 2073 2018-03-28 6978.30 7036.09 6901.07 6949.23   6949.23 2518670000
## 2074 2018-03-29 6984.66 7120.46 6935.78 7063.45   7063.45 2554500000

As seen from the structure, the “Date” column is not in a proper format. Let’s now format it as Date.

nasdaq$Date <- as.Date(nasdaq$Date, format = "%Y-%m-%d")

# checking its class
class(nasdaq$Date)
## [1] "Date"
# lets just take Date and Adj.Close column
nasdaq <- nasdaq[, c("Date", "Adj.Close")]

# renaming the Date and Adj.Close columns to date and nasdaq, respectively.
names(nasdaq) <- c("date", "nasdaq")

Let’s apply the same steps to the other data sets.

class(cpi)
class(gdp)
class(ir)

str(cpi)
str(gdp)
str(ir)

cpi$DATE <- as.Date(cpi$DATE, format = "%Y-%m-%d")
gdp$DATE <- as.Date(gdp$DATE, format = "%Y-%m-%d")
ir$DATE <- as.Date(ir$DATE, format = "%Y-%m-%d")

names(cpi) <- c("date", "cpi")
names(gdp) <- c("date", "gdp")
names(ir) <- c("date", "ir")
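Since the same two steps (parsing the date, renaming the columns) repeat for every data set, they can also be wrapped in a small helper function. A minimal sketch; `clean_series` is a hypothetical name, and it assumes each raw data frame has the date column first and a single value column second:

```r
# hypothetical helper: parse the first column as a Date and rename both columns
clean_series <- function(df, value_name) {
  df[[1]] <- as.Date(df[[1]], format = "%Y-%m-%d")
  names(df) <- c("date", value_name)
  df
}

cpi <- clean_series(cpi, "cpi")
gdp <- clean_series(gdp, "gdp")
ir  <- clean_series(ir, "ir")
```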

Now we will follow the five steps above to forecast the nasdaq index’s future values. Let’s start with step 1:

Problem definition: what will be the value of nasdaq composite index over the next month?

Having completed step 1, we can move on to step 2:

Gathering information: We can use Yahoo Finance to gather historical nasdaq data; I assume you already know how to download data and load it into an R session. In this step we download and load the relevant data sets. The natural question is what data are most relevant for forecasting nasdaq composite index prices. In addition to the nasdaq historical values, let’s also download the US consumer price index (month-on-month growth rate) and interest rates. Note that gdp is available quarterly and cpi is monthly. I have downloaded the effective federal funds rate with weekly frequency.

To import historical nasdaq data:

  • Go to nasdaq
  • Specify the dates you’d like to download. For my analysis I have downloaded data from January 1, 2010 to March 31, 2018.

To download historical interest rates (effective federal funds rate):

To download historical inflation data:

To download historical growth (quarterly percentage change) data:

  • Go to gdp

Load all of the above data sets into R.
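Assuming the four series were saved as local CSV files (the file names below are hypothetical; adjust them to wherever you saved the downloads), they can be loaded with read.csv():

```r
# file names are placeholders for the CSVs downloaded in the previous step
nasdaq <- read.csv("nasdaq.csv", stringsAsFactors = FALSE)
cpi    <- read.csv("cpi.csv",    stringsAsFactors = FALSE)
gdp    <- read.csv("gdp.csv",    stringsAsFactors = FALSE)
ir     <- read.csv("ir.csv",     stringsAsFactors = FALSE)
```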

After having all the relevant data we can move on to step 3:

Preliminary(Exploratory) analysis: We need to explore our data sets. The best way of exploring a data set is to visualize it. Let’s see what our data sets are telling us.

Analyzing and forecasting a time series requires a suitable data structure. To continue with our work here we need to create time series (ts) objects for our data sets.

nasdaq.ts <- ts(nasdaq$nasdaq, start = 1, frequency = 1)
gdp.ts <- ts(gdp$gdp, start = c(2010, 1), frequency = 4)
cpi.ts <- ts(cpi$cpi, start = c(2010, 1), frequency = 12)
ir.ts <- ts(ir$ir, start = c(2010, 1), frequency = 52)

Let’s also create an xts object for nasdaq, since daily trading data has no clear ts frequency. We can pass it to autoplot() and we will be able to see the actual dates on the axis.

nasdaq.xts <- xts(nasdaq$nasdaq, order.by = nasdaq$date)

Check the difference between the two plots.

autoplot(nasdaq.ts)

autoplot(nasdaq.xts)

The autoplot() function is very useful for visualizing objects regardless of their type.

autoplot(nasdaq.xts)

autoplot(gdp.ts)

autoplot(cpi.ts)

autoplot(ir.ts)

Finding the index of the maximum value (treated here as a potential outlier) in the nasdaq data.

nasdaq.outlier <- which.max(nasdaq.ts)
nasdaq.outlier
## [1] 2061

Looking at the seasonal frequencies of the four series.

frequency(nasdaq.ts)
## [1] 1
frequency(gdp.ts)
## [1] 4
frequency(cpi.ts)
## [1] 12
frequency(ir.ts)
## [1] 52

In addition to the normal time plot we can also check the seasonal plot. A seasonal plot is similar to a time plot except that the data are plotted against the individual “seasons” in which the data were observed.

ggseasonplot(cpi.ts)

ggseasonplot(ir.ts)

ggseasonplot(gdp.ts)

Looking at the generated graphs we can discover many interesting features. The main time series patterns which can be observed are:

Trend: A trend exists when there is a long-term increase or decrease in the data. The graph of the nasdaq historical series shows a strong uptrend.

Seasonal: A seasonal pattern occurs when a time series is affected by seasonal factors such as the time of the year or the day of the week. Seasonality is always of a fixed and known frequency(Forecasting: Principles and Practice, Rob J Hyndman and George Athanasopoulos, 2nd edition).

Cyclic: A cycle occurs when the data exhibit rises and falls that are not of a fixed frequency. These fluctuations are usually due to economic conditions, and are often related to the “business cycle”. The duration of these fluctuations is usually at least 2 years(Forecasting: Principles and Practice, Rob J Hyndman and George Athanasopoulos, 2nd edition).

Many time series include trend, cycles and seasonality. When choosing a forecasting method, we will first need to identify the time series patterns in the data, and then choose a method that is able to capture the patterns properly(Forecasting: Principles and Practice, Rob J Hyndman and George Athanasopoulos, 2nd edition).

A useful variation on the seasonal plot uses polar coordinates. Setting polar=TRUE makes the time series axis circular rather than horizontal, as shown below.

ggseasonplot(gdp.ts, polar = TRUE)

An alternative plot that emphasises the seasonal patterns is where the data for each season are collected together in separate mini time plots. The horizontal lines indicate the means for each quarter and month for gdp and cpi. This form of plot enables the underlying seasonal pattern to be seen clearly, and also shows the changes in seasonality over time. It is especially useful in identifying changes within particular seasons. In some cases, this is the most useful way of viewing seasonal changes over time.

ggsubseriesplot(gdp.ts)

ggsubseriesplot(cpi.ts)

nasdaq_ir <- merge(nasdaq, ir, by = "date")

The graphs discussed so far are useful for visualizing individual time series. When we would like to explore relationships between time series, a scatterplot is a very convenient way to do this.

ggplot(data = nasdaq_ir) + geom_point(mapping = aes(x = ir, y = nasdaq))

Correlation

It is quite common to check correlations to measure the strength of the relationship between two variables.

Caution: The correlation coefficient only measures the strength of the linear relationship, and can sometimes be misleading. It is important not to rely only on correlation coefficients but also to look at plots of the data.
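A quick illustration of that caution: a perfect but nonlinear relationship can produce a correlation near zero, so the coefficient alone would suggest no relationship at all.

```r
# y is fully determined by x, yet the linear correlation is essentially zero
x <- seq(-3, 3, by = 0.1)
y <- x^2
cor(x, y)      # approximately 0: the relationship is nonlinear, not absent
plot(x, y)     # the plot reveals the parabola the coefficient hides
```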

library(GGally)
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:fma':
## 
##     pigs

Scatterplot matrices

When there are several potential predictor variables, it is useful to plot each variable against each other variable.

nasdaq_ir[,3:2] %>% GGally::ggpairs()

autoplot(ts(nasdaq_ir[,2:3]), facets=TRUE)


Nasdaq historical prices vs cpi.

nasdaq_cpi <- merge(nasdaq, cpi, by = "date")
autoplot(ts(nasdaq_cpi[,2:3]), facets=TRUE)

nasdaq_cpi[,3:2] %>% GGally::ggpairs()

Nasdaq historical prices vs gdp.

nasdaq_gdp <- merge(nasdaq, gdp, by = "date")
autoplot(ts(nasdaq_gdp[,2:3]), facets=TRUE)

nasdaq_gdp[,3:2] %>% GGally::ggpairs()

Lag plots

gglagplot(nasdaq.ts) + ggtitle("NASDAQ Composite")

gglagplot(gdp.ts) + ggtitle("GDP")

gglagplot(cpi.ts) + ggtitle("Consumer Price Index")

gglagplot(ir.ts) + ggtitle("Interest Rate")

Autocorrelation

Just as correlation measures the extent of a linear relationship between two variables, autocorrelation measures the linear relationship between lagged values of a time series.

ggAcf(nasdaq.ts)

The ACF plot shows the strong trend in the nasdaq series: the autocorrelations for small lags are large and positive, and they decrease slowly as the lag increases.

White noise

Time series that show no autocorrelation are called “white noise”. Let’s look at the cpi and gdp data sets. For a white noise series, we expect each autocorrelation to be close to zero. Of course, they will not be exactly equal to zero as there is some random variation. For a white noise series, we expect 95% of the spikes in the ACF to lie within [-2/sqrt(T), +2/sqrt(T)], where T is the length of the time series. It is common to plot these bounds on a graph of the ACF (the blue dashed lines). If one or more large spikes are outside these bounds, or if substantially more than 5% of spikes are outside these bounds, then the series is probably not white noise.
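The ±2/sqrt(T) bounds are easy to illustrate on simulated data; a small sketch drawing T = 100 values of pure noise:

```r
set.seed(123)
wn <- ts(rnorm(100))   # 100 independent draws: white noise by construction
2 / sqrt(100)          # the 95% bound: +/- 0.2 for T = 100
ggAcf(wn)              # roughly 95% of the spikes should fall within +/- 0.2
```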

autoplot(cpi.ts)

ggAcf(cpi.ts)

There is a significant spike at lag one, so this does not fully look like white noise.

Before we move on to creating benchmark forecasts, let’s split the nasdaq data set into two pieces, train and test. This can be managed via the window() function.

nasdaq.train <- window(nasdaq.ts, end = 2013)
nasdaq.test <- window(nasdaq.ts, start = 2014)

Creating benchmark forecast

There are some simple but effective methods that will be used as benchmarks for the nasdaq forecast.

Average method: The forecasts of all future values are equal to the mean of the historical data.

mean.benchmark <- meanf(nasdaq.train, h = 60)
autoplot(mean.benchmark)

Naive method: The forecasts of all future values are equal to the last observation in the data.

naive.benchmark <- naive(nasdaq.train, h = 60)
autoplot(naive.benchmark)

Drift method: A variation on the naive method is to allow the forecasts to increase or decrease over time, where the amount of change over time (called the “drift”) is set to be the average change seen in the historical data. This is equivalent to drawing a line between the first and last observations, and extrapolating it into the future.

drift.benchmark <- rwf(nasdaq.train, h = 60, drift = TRUE)
autoplot(drift.benchmark)
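That “line between the first and last observations” claim can be checked directly: the h-step drift forecast should equal the last observation plus h times the average historical change (y_T - y_1)/(T - 1). A sketch, assuming the drift.benchmark object from above:

```r
# reconstruct the drift forecasts by hand and compare with rwf()
T <- length(nasdaq.train)
slope <- (nasdaq.train[T] - nasdaq.train[1]) / (T - 1)
manual <- nasdaq.train[T] + (1:60) * slope
all.equal(as.numeric(drift.benchmark$mean), manual)  # should be TRUE
```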

Now let’s choose the best benchmark model based on the MAPE criterion. Below are various uses of the accuracy() function.

accuracy(drift.benchmark, nasdaq.ts)
##                         ME      RMSE       MAE         MPE      MAPE
## Training set -1.125476e-13  38.56353  27.71146 -0.01439089 0.7419408
## Test set      2.839767e+02 334.87234 298.71479  3.85559727 4.0702113
##                    MASE        ACF1 Theil's U
## Training set  0.9934178 -0.01033551        NA
## Test set     10.7085169  0.80508762  3.368776
accuracy(drift.benchmark, nasdaq.ts)["Test set", "MAPE"]
## [1] 4.070211
accuracy(drift.benchmark)
##                         ME     RMSE      MAE         MPE      MAPE
## Training set -1.125476e-13 38.56353 27.71146 -0.01439089 0.7419408
##                   MASE        ACF1
## Training set 0.9934178 -0.01033551
# choosing the best model based on MAPE
accuracy(mean.benchmark, nasdaq.ts)["Test set", "MAPE"]
## [1] 44.71328
accuracy(naive.benchmark, nasdaq.ts)["Test set", "MAPE"]
## [1] 4.889852
accuracy(drift.benchmark, nasdaq.ts)["Test set", "MAPE"]
## [1] 4.070211
best.model <- min(accuracy(mean.benchmark, nasdaq.ts)["Test set", "MAPE"],accuracy(naive.benchmark, nasdaq.ts)["Test set", "MAPE"], accuracy(drift.benchmark, nasdaq.ts)["Test set", "MAPE"])
print(paste("The best model is drift model with lowest MAPE : ", round(best.model, 2)))
## [1] "The best model is drift model with lowest MAPE :  4.07"

Seasonal naive method: Although this will not be used as a benchmark for the nasdaq forecast, I wanted to include it for information purposes. On the cpi data the seasonal naive forecast is as follows;

snaive.benchmark.cpi <- snaive(cpi.ts, h = 36)
autoplot(snaive.benchmark.cpi)

checkresiduals(snaive.benchmark.cpi) # model diagnostics

## 
##  Ljung-Box test
## 
## data:  Residuals from Seasonal naive method
## Q* = 39.542, df = 24, p-value = 0.02394
## 
## Model df: 0.   Total lags used: 24

Does not look too bad, does it? Although note that the Ljung-Box p-value (0.024) hints at some remaining autocorrelation in the residuals.

I am excited and would like to see how it performs on a held-out set. Let’s split the data set into two pieces, cpi.train and cpi.test.

cpi.train <- window(cpi.ts, end = 2015) # training data up to the start of 2015
cpi.test <- window(cpi.ts, start = 2016)
cpi.train.snaive <- snaive(cpi.train, h = 25)
autoplot(cpi.train.snaive)

checkresiduals(cpi.train.snaive)

## 
##  Ljung-Box test
## 
## data:  Residuals from Seasonal naive method
## Q* = 45.199, df = 24, p-value = 0.005517
## 
## Model df: 0.   Total lags used: 24
# checking the accuracy
accuracy(cpi.train.snaive, cpi.test)
##                       ME      RMSE       MAE      MPE     MAPE     MASE
## Training set -0.03354816 0.3234758 0.2714396      Inf      Inf 1.000000
## Test set      0.23966071 0.4223215 0.3029079 287.4996 308.2497 1.115931
##                    ACF1 Theil's U
## Training set 0.42170657        NA
## Test set     0.06745185  1.067656
autoplot(cbind(ts(cpi.train.snaive$mean, frequency = 1), ts(cpi.test, frequency = 1)), facets = F) + ggtitle("Forecast vs Actual")

Our forecast is shown in red. As seen from the graph, it has slightly underestimated the cpi.

Let’s also try to fit seasonal naive to gdp series.

autoplot(snaive(gdp.ts, h = 12))

checkresiduals(snaive(gdp.ts, h = 12)) # model diagnostics

## 
##  Ljung-Box test
## 
## data:  Residuals from Seasonal naive method
## Q* = 11.665, df = 8, p-value = 0.1668
## 
## Model df: 0.   Total lags used: 8

BoxCox transformation: A Box-Cox transformation can help stabilise the variance of a series before modelling; BoxCox.lambda() selects a suitable transformation parameter automatically.

lambda <- BoxCox.lambda(nasdaq.train)
autoplot(BoxCox(nasdaq.train,lambda))

autoplot(nasdaq.train)

Simple exponential smoothing

This method is suitable for forecasting data with no clear trend or seasonality.

To make forecasts with simple exponential smoothing we use the ses() function.
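Under the hood, simple exponential smoothing updates a single level as a weighted average of the latest observation and the previous level, l_t = alpha * y_t + (1 - alpha) * l_{t-1}, and forecasts that level flat into the future. A minimal hand-rolled sketch (the function name and alpha = 0.5 are arbitrary choices for illustration; ses() estimates alpha from the data):

```r
# simple exponential smoothing by hand; alpha = 0.5 is arbitrary here
ses_by_hand <- function(y, alpha = 0.5) {
  level <- y[1]                       # initialise at the first observation
  for (t in seq_along(y)) {
    level <- alpha * y[t] + (1 - alpha) * level
  }
  level                               # flat point forecast for all horizons
}
ses_by_hand(c(10, 12, 11, 13))        # returns 12
```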

nasdaq.exp.smot <- ses(nasdaq.train, h = 61)
summary(nasdaq.exp.smot)
## 
## Forecast method: Simple exponential smoothing
## 
## Model Information:
## Simple exponential smoothing 
## 
## Call:
##  ses(y = nasdaq.train, h = 61) 
## 
##   Smoothing parameters:
##     alpha = 0.9933 
## 
##   Initial states:
##     l = 2308.4245 
## 
##   sigma:  38.6206
## 
##      AIC     AICc      BIC 
## 30029.80 30029.82 30046.63 
## 
## Error measures:
##                    ME     RMSE      MAE        MPE      MAPE      MASE
## Training set 2.298229 38.62063 27.88564 0.04920158 0.7461496 0.9996621
##                     ACF1
## Training set -0.00359401
## 
## Forecasts:
##      Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 2014       6903.704 6854.209 6953.198 6828.009 6979.399
## 2015       6903.704 6833.943 6973.465 6797.013 7010.394
## 2016       6903.704 6818.360 6989.047 6773.182 7034.225
## 2017       6903.704 6805.213 7002.194 6753.075 7054.332
## 2018       6903.704 6793.625 7013.782 6735.353 7072.055
## 2019       6903.704 6783.146 7024.262 6719.326 7088.081
## 2020       6903.704 6773.507 7033.900 6704.585 7102.822
## 2021       6903.704 6764.534 7042.873 6690.863 7116.544
## 2022       6903.704 6756.106 7051.301 6677.973 7129.434
## 2023       6903.704 6748.134 7059.273 6665.781 7141.627
## 2024       6903.704 6740.551 7066.856 6654.183 7153.224
## 2025       6903.704 6733.305 7074.102 6643.101 7164.306
## 2026       6903.704 6726.355 7081.052 6632.472 7174.935
## 2027       6903.704 6719.667 7087.740 6622.244 7185.164
## 2028       6903.704 6713.214 7094.194 6612.374 7195.033
## 2029       6903.704 6706.972 7100.435 6602.829 7204.579
## 2030       6903.704 6700.922 7106.485 6593.576 7213.831
## 2031       6903.704 6695.048 7112.359 6584.592 7222.815
## 2032       6903.704 6689.335 7118.073 6575.854 7231.553
## 2033       6903.704 6683.770 7123.638 6567.344 7240.064
## 2034       6903.704 6678.342 7129.065 6559.043 7248.365
## 2035       6903.704 6673.042 7134.365 6550.937 7256.470
## 2036       6903.704 6667.861 7139.546 6543.014 7264.394
## 2037       6903.704 6662.792 7144.616 6535.260 7272.147
## 2038       6903.704 6657.827 7149.581 6527.667 7279.740
## 2039       6903.704 6652.960 7154.447 6520.224 7287.183
## 2040       6903.704 6648.186 7159.221 6512.923 7294.484
## 2041       6903.704 6643.499 7163.908 6505.756 7301.652
## 2042       6903.704 6638.896 7168.511 6498.715 7308.692
## 2043       6903.704 6634.371 7173.036 6491.795 7315.612
## 2044       6903.704 6629.921 7177.486 6484.989 7322.418
## 2045       6903.704 6625.542 7181.865 6478.292 7329.115
## 2046       6903.704 6621.231 7186.176 6471.699 7335.708
## 2047       6903.704 6616.985 7190.422 6465.205 7342.202
## 2048       6903.704 6612.800 7194.607 6458.806 7348.602
## 2049       6903.704 6608.676 7198.732 6452.497 7354.910
## 2050       6903.704 6604.608 7202.800 6446.276 7361.132
## 2051       6903.704 6600.594 7206.813 6440.138 7367.270
## 2052       6903.704 6596.633 7210.774 6434.080 7373.327
## 2053       6903.704 6592.723 7214.685 6428.099 7379.308
## 2054       6903.704 6588.861 7218.547 6422.193 7385.214
## 2055       6903.704 6585.045 7222.362 6416.358 7391.049
## 2056       6903.704 6581.275 7226.132 6410.592 7396.815
## 2057       6903.704 6577.549 7229.858 6404.893 7402.514
## 2058       6903.704 6573.865 7233.543 6399.258 7408.149
## 2059       6903.704 6570.221 7237.186 6393.686 7413.721
## 2060       6903.704 6566.617 7240.790 6388.174 7419.233
## 2061       6903.704 6563.051 7244.357 6382.720 7424.687
## 2062       6903.704 6559.521 7247.886 6377.322 7430.085
## 2063       6903.704 6556.028 7251.379 6371.980 7435.427
## 2064       6903.704 6552.569 7254.838 6366.690 7440.717
## 2065       6903.704 6549.145 7258.263 6361.452 7445.955
## 2066       6903.704 6545.752 7261.655 6356.265 7451.143
## 2067       6903.704 6542.392 7265.015 6351.125 7456.282
## 2068       6903.704 6539.063 7268.344 6346.034 7461.373
## 2069       6903.704 6535.764 7271.643 6340.988 7466.419
## 2070       6903.704 6532.494 7274.913 6335.987 7471.420
## 2071       6903.704 6529.253 7278.155 6331.030 7476.377
## 2072       6903.704 6526.039 7281.368 6326.116 7481.292
## 2073       6903.704 6522.853 7284.554 6321.242 7486.165
## 2074       6903.704 6519.693 7287.714 6316.410 7490.997
autoplot(nasdaq.exp.smot)

autoplot(nasdaq.exp.smot) + autolayer(fitted(nasdaq.exp.smot),series = "Fitted")

It looks like the model fits very well. I am excited to see the accuracy and the residuals.

accuracy(nasdaq.exp.smot, nasdaq.test)
##                      ME      RMSE       MAE        MPE      MAPE
## Training set   2.298229  38.62063  27.88564 0.04920158 0.7461496
## Test set     350.145246 393.77901 355.25203 4.76766284 4.8428156
##                    MASE        ACF1 Theil's U
## Training set  0.9996621 -0.00359401        NA
## Test set     12.7352994  0.83024431  3.947749
checkresiduals(nasdaq.exp.smot)

## 
##  Ljung-Box test
## 
## data:  Residuals from Simple exponential smoothing
## Q* = 13.006, df = 8, p-value = 0.1116
## 
## Model df: 2.   Total lags used: 10
# adding this as well and comparing with all the others.
accuracy(nasdaq.exp.smot, nasdaq.ts)["Test set", "MAPE"]
## [1] 4.842816
# showing the best model so far
best.model <- min(accuracy(mean.benchmark, nasdaq.ts)["Test set", "MAPE"],accuracy(naive.benchmark, nasdaq.ts)["Test set", "MAPE"], accuracy(drift.benchmark, nasdaq.ts)["Test set", "MAPE"], accuracy(nasdaq.exp.smot, nasdaq.ts)["Test set", "MAPE"])
best.model
## [1] 4.070211
print(paste("The best model is drift model with lowest MAPE : ", round(best.model, 2)))
## [1] "The best model is drift model with lowest MAPE :  4.07"

Exponential smoothing with trend

There are two approaches to forecasting data with a trend using exponential smoothing; one uses a local linear trend and the other uses a damped linear trend.

We can use the holt() function to forecast. Let’s now forecast the nasdaq data with the local linear trend approach.

# performing forecast on training data
nasdaq.exp.loc.linear <- holt(nasdaq.train, h = 61) 

# checking the forecast with summary() function
summary(nasdaq.exp.loc.linear)
## 
## Forecast method: Holt's method
## 
## Model Information:
## Holt's method 
## 
## Call:
##  holt(y = nasdaq.train, h = 61) 
## 
##   Smoothing parameters:
##     alpha = 0.9902 
##     beta  = 1e-04 
## 
##   Initial states:
##     l = 2310.9521 
##     b = 2.2948 
## 
##   sigma:  38.5539
## 
##      AIC     AICc      BIC 
## 30026.85 30026.88 30054.88 
## 
## Error measures:
##                      ME     RMSE      MAE         MPE      MAPE      MASE
## Training set 0.04363645 38.55395 27.70352 -0.01337676 0.7417591 0.9931332
##                       ACF1
## Training set -0.0004673646
## 
## Forecasts:
##      Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 2014       6906.175 6856.766 6955.584 6830.611 6981.740
## 2015       6908.479 6838.943 6978.015 6802.133 7014.824
## 2016       6910.783 6825.755 6995.810 6780.745 7040.821
## 2017       6913.086 6814.981 7011.191 6763.048 7063.125
## 2018       6915.390 6805.754 7025.025 6747.716 7083.063
## 2019       6917.693 6797.627 7037.759 6734.068 7101.318
## 2020       6919.997 6790.335 7049.659 6721.696 7118.298
## 2021       6922.300 6783.704 7060.897 6710.335 7134.266
## 2022       6924.604 6777.613 7071.595 6699.800 7149.408
## 2023       6926.908 6771.974 7081.841 6689.958 7163.857
## 2024       6929.211 6766.722 7091.700 6680.706 7177.716
## 2025       6931.515 6761.805 7101.224 6671.966 7191.063
## 2026       6933.818 6757.181 7110.455 6663.675 7203.961
## 2027       6936.122 6752.818 7119.426 6655.782 7216.461
## 2028       6938.425 6748.687 7128.164 6648.246 7228.605
## 2029       6940.729 6744.766 7136.692 6641.030 7240.428
## 2030       6943.033 6741.036 7145.029 6634.105 7251.960
## 2031       6945.336 6737.480 7153.193 6627.447 7263.225
## 2032       6947.640 6734.083 7161.197 6621.033 7274.247
## 2033       6949.943 6730.833 7169.053 6614.844 7285.043
## 2034       6952.247 6727.720 7176.774 6608.863 7295.631
## 2035       6954.550 6724.733 7184.368 6603.075 7306.026
## 2036       6956.854 6721.865 7191.844 6597.469 7316.240
## 2037       6959.158 6719.106 7199.209 6592.031 7326.285
## 2038       6961.461 6716.451 7206.471 6586.751 7336.171
## 2039       6963.765 6713.894 7213.636 6581.621 7345.909
## 2040       6966.068 6711.429 7220.708 6576.630 7355.506
## 2041       6968.372 6709.050 7227.694 6571.773 7364.971
## 2042       6970.676 6706.753 7234.598 6567.041 7374.310
## 2043       6972.979 6704.534 7241.424 6562.428 7383.530
## 2044       6975.283 6702.390 7248.176 6557.929 7392.636
## 2045       6977.586 6700.316 7254.857 6553.537 7401.635
## 2046       6979.890 6698.309 7261.471 6549.249 7410.531
## 2047       6982.193 6696.366 7268.021 6545.058 7419.329
## 2048       6984.497 6694.484 7274.510 6540.961 7428.033
## 2049       6986.801 6692.661 7280.940 6536.954 7436.647
## 2050       6989.104 6690.895 7287.313 6533.033 7445.176
## 2051       6991.408 6689.182 7293.633 6529.194 7453.621
## 2052       6993.711 6687.522 7299.901 6525.435 7461.988
## 2053       6996.015 6685.911 7306.119 6521.752 7470.278
## 2054       6998.318 6684.348 7312.289 6518.143 7478.494
## 2055       7000.622 6682.832 7318.412 6514.604 7486.640
## 2056       7002.926 6681.360 7324.491 6511.134 7494.718
## 2057       7005.229 6679.931 7330.527 6507.729 7502.729
## 2058       7007.533 6678.544 7336.521 6504.388 7510.677
## 2059       7009.836 6677.197 7342.475 6501.109 7518.564
## 2060       7012.140 6675.889 7348.391 6497.889 7526.391
## 2061       7014.444 6674.619 7354.268 6494.727 7534.160
## 2062       7016.747 6673.385 7360.109 6491.620 7541.874
## 2063       7019.051 6672.187 7365.915 6488.568 7549.534
## 2064       7021.354 6671.022 7371.686 6485.568 7557.141
## 2065       7023.658 6669.891 7377.424 6482.619 7564.697
## 2066       7025.961 6668.793 7383.130 6479.719 7572.203
## 2067       7028.265 6667.726 7388.804 6476.868 7579.662
## 2068       7030.569 6666.689 7394.448 6474.063 7587.074
## 2069       7032.872 6665.682 7400.062 6471.304 7594.440
## 2070       7035.176 6664.705 7405.647 6468.589 7601.762
## 2071       7037.479 6663.755 7411.204 6465.917 7609.042
## 2072       7039.783 6662.832 7416.733 6463.287 7616.279
## 2073       7042.086 6661.937 7422.236 6460.698 7623.475
## 2074       7044.390 6661.067 7427.713 6458.149 7630.631
# plotting the forecast
autoplot(nasdaq.exp.loc.linear)

# let's check its residuals and see whether they look like white noise
checkresiduals(nasdaq.exp.loc.linear)

## 
##  Ljung-Box test
## 
## data:  Residuals from Holt's method
## Q* = 13.044, df = 6, p-value = 0.04234
## 
## Model df: 4.   Total lags used: 10
# let's also perform a test
Box.test(nasdaq.exp.loc.linear$residuals)
## 
##  Box-Pierce test
## 
## data:  nasdaq.exp.loc.linear$residuals
## X-squared = 0.0004397, df = 1, p-value = 0.9833
# p-value is 0.9833, far above 0.05, which suggests the residuals are white noise.
# Note that Box.test() tests only lag 1 by default, while checkresiduals() used
# 10 lags; the Ljung-Box p-value above (0.04234) is just below 0.05, so some
# autocorrelation remains at longer lags.
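The two tests disagree here because they examine different numbers of lags. The settings used by checkresiduals() can be reproduced with Box.test() directly; below is a sketch that refits Holt's method on the built-in WWWusage series (a stand-in, used only so the snippet runs on its own, since nasdaq.train is defined earlier in the post):

```r
library(forecast)

# refit Holt's linear trend method on a built-in series for illustration
fit <- holt(WWWusage, h = 10)

# Ljung-Box test at 10 lags, subtracting the 4 estimated parameters
# (alpha, beta, and two initial states), matching checkresiduals()
Box.test(residuals(fit), lag = 10, type = "Ljung-Box", fitdf = 4)
```

With the defaults (lag = 1, fitdf = 0, type = "Box-Pierce"), Box.test() instead reproduces the single-lag Box-Pierce result shown above.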

# checking its accuracy
accuracy(nasdaq.exp.loc.linear, nasdaq.ts)
##                        ME      RMSE       MAE         MPE      MAPE
## Training set   0.04363645  38.55395  27.70352 -0.01337676 0.7417591
## Test set     278.56615547 331.22161 293.29157  3.78211323 3.9965088
##                    MASE          ACF1 Theil's U
## Training set  0.9931332 -0.0004673646        NA
## Test set     10.5141016  0.8279277754  3.318818
# just checking MAPE
accuracy(nasdaq.exp.loc.linear, nasdaq.ts)["Test set", "MAPE"]
## [1] 3.996509

Let’s again compare the accuracy of the models to find the best one so far.

best.model <- min(accuracy(mean.benchmark, nasdaq.ts)["Test set", "MAPE"],
                  accuracy(naive.benchmark, nasdaq.ts)["Test set", "MAPE"],
                  accuracy(drift.benchmark, nasdaq.ts)["Test set", "MAPE"],
                  accuracy(nasdaq.exp.smot, nasdaq.ts)["Test set", "MAPE"],
                  accuracy(nasdaq.exp.loc.linear, nasdaq.ts)["Test set", "MAPE"])
best.model
## [1] 3.996509
print(paste("The best model is exponential model with local linear with lowest MAPE : ", round(best.model, 2)))
## [1] "The best model is exponential model with local linear with lowest MAPE :  4"
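The repeated accuracy() calls get unwieldy as models accumulate. A tidier pattern is to keep the fitted objects in a named list and extract the test-set MAPEs with sapply(); a sketch, assuming the five model objects fitted earlier in the post are still in the workspace:

```r
# candidate models fitted earlier in the post
models <- list(mean  = mean.benchmark,
               naive = naive.benchmark,
               drift = drift.benchmark,
               ses   = nasdaq.exp.smot,
               holt  = nasdaq.exp.loc.linear)

# test-set MAPE for each model
mapes <- sapply(models, function(m) accuracy(m, nasdaq.ts)["Test set", "MAPE"])

mapes                    # all MAPEs at a glance
names(which.min(mapes))  # name of the best model
```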

We have already used the local linear trend approach and obtained the best result on the test set so far. Let’s now move on to the second approach: the damped trend method. We can again use the holt() function, this time setting damped = TRUE.

# performing forecast on training data
nasdaq.exp.damped.linear <- holt(nasdaq.train, damped = TRUE, h = 61)

# checking the forecast with summary() function
summary(nasdaq.exp.damped.linear)
## 
## Forecast method: Damped Holt's method
## 
## Model Information:
## Damped Holt's method 
## 
## Call:
##  holt(y = nasdaq.train, h = 61, damped = TRUE) 
## 
##   Smoothing parameters:
##     alpha = 0.9934 
##     beta  = 1e-04 
##     phi   = 0.8229 
## 
##   Initial states:
##     l = 2313.9931 
##     b = -0.9252 
## 
##   sigma:  38.6208
## 
##      AIC     AICc      BIC 
## 30035.83 30035.87 30069.47 
## 
## Error measures:
##                    ME     RMSE      MAE        MPE      MAPE      MASE
## Training set 2.296246 38.62084 27.88755 0.04914543 0.7462363 0.9997306
##                      ACF1
## Training set -0.003787145
## 
## Forecasts:
##      Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 2014       6903.696 6854.201 6953.190 6828.000 6979.391
## 2015       6903.694 6833.926 6973.462 6796.992 7010.395
## 2016       6903.692 6818.335 6989.049 6773.150 7034.234
## 2017       6903.691 6805.180 7002.201 6753.032 7054.350
## 2018       6903.690 6793.585 7013.794 6735.299 7072.080
## 2019       6903.689 6783.099 7024.279 6719.262 7088.115
## 2020       6903.688 6773.454 7033.922 6704.512 7102.864
## 2021       6903.687 6764.475 7042.900 6690.780 7116.595
## 2022       6903.687 6756.040 7051.333 6677.881 7129.492
## 2023       6903.686 6748.062 7059.310 6665.680 7141.693
## 2024       6903.686 6740.474 7066.898 6654.074 7153.298
## 2025       6903.686 6733.222 7074.149 6642.985 7164.387
## 2026       6903.685 6726.267 7081.104 6632.347 7175.023
## 2027       6903.685 6719.574 7087.796 6622.112 7185.259
## 2028       6903.685 6713.116 7094.254 6612.235 7195.135
## 2029       6903.685 6706.870 7100.500 6602.683 7204.687
## 2030       6903.685 6700.816 7106.554 6593.424 7213.946
## 2031       6903.685 6694.937 7112.432 6584.433 7222.936
## 2032       6903.685 6689.220 7118.149 6575.689 7231.680
## 2033       6903.685 6683.651 7123.718 6567.172 7240.197
## 2034       6903.685 6678.219 7129.150 6558.865 7248.504
## 2035       6903.684 6672.916 7134.453 6550.754 7256.615
## 2036       6903.684 6667.731 7139.638 6542.825 7264.544
## 2037       6903.684 6662.658 7144.711 6535.067 7272.302
## 2038       6903.684 6657.690 7149.679 6527.468 7279.901
## 2039       6903.684 6652.820 7154.549 6520.020 7287.349
## 2040       6903.684 6648.042 7159.326 6512.714 7294.655
## 2041       6903.684 6643.353 7164.016 6505.541 7301.827
## 2042       6903.684 6638.746 7168.623 6498.496 7308.873
## 2043       6903.684 6634.218 7173.151 6491.571 7315.797
## 2044       6903.684 6629.765 7177.604 6484.761 7322.608
## 2045       6903.684 6625.383 7181.986 6478.059 7329.309
## 2046       6903.684 6621.069 7186.299 6471.462 7335.907
## 2047       6903.684 6616.820 7190.549 6464.963 7342.405
## 2048       6903.684 6612.633 7194.736 6458.560 7348.809
## 2049       6903.684 6608.505 7198.863 6452.247 7355.121
## 2050       6903.684 6604.435 7202.934 6446.021 7361.347
## 2051       6903.684 6600.419 7206.950 6439.879 7367.489
## 2052       6903.684 6596.455 7210.914 6433.818 7373.551
## 2053       6903.684 6592.542 7214.827 6427.833 7379.536
## 2054       6903.684 6588.677 7218.691 6421.923 7385.446
## 2055       6903.684 6584.860 7222.509 6416.084 7391.284
## 2056       6903.684 6581.087 7226.281 6410.315 7397.054
## 2057       6903.684 6577.358 7230.010 6404.612 7402.757
## 2058       6903.684 6573.672 7233.697 6398.973 7408.395
## 2059       6903.684 6570.026 7237.343 6393.397 7413.971
## 2060       6903.684 6566.419 7240.950 6387.882 7419.487
## 2061       6903.684 6562.851 7244.518 6382.424 7424.944
## 2062       6903.684 6559.319 7248.049 6377.023 7430.345
## 2063       6903.684 6555.823 7251.545 6371.677 7435.691
## 2064       6903.684 6552.363 7255.006 6366.384 7440.984
## 2065       6903.684 6548.936 7258.433 6361.143 7446.226
## 2066       6903.684 6545.541 7261.827 6355.952 7451.417
## 2067       6903.684 6542.179 7265.190 6350.810 7456.559
## 2068       6903.684 6538.847 7268.521 6345.715 7461.654
## 2069       6903.684 6535.546 7271.822 6340.666 7466.703
## 2070       6903.684 6532.274 7275.094 6335.662 7471.707
## 2071       6903.684 6529.031 7278.338 6330.701 7476.667
## 2072       6903.684 6525.815 7281.553 6325.784 7481.585
## 2073       6903.684 6522.627 7284.742 6320.907 7486.461
## 2074       6903.684 6519.465 7287.904 6316.072 7491.297
# plotting the forecast
autoplot(nasdaq.exp.damped.linear)

# let's check its residuals and see whether they look like white noise
checkresiduals(nasdaq.exp.damped.linear)

## 
##  Ljung-Box test
## 
## data:  Residuals from Damped Holt's method
## Q* = 13.023, df = 5, p-value = 0.02317
## 
## Model df: 5.   Total lags used: 10
# let's also perform a test
Box.test(nasdaq.exp.damped.linear$residuals)
## 
##  Box-Pierce test
## 
## data:  nasdaq.exp.damped.linear$residuals
## X-squared = 0.028871, df = 1, p-value = 0.8651
# p-value is 0.8651, far above 0.05, which suggests the residuals are white noise.
# As before, Box.test() tests only lag 1 by default; the Ljung-Box p-value above
# (0.02317) indicates some remaining autocorrelation at longer lags.

# checking its accuracy
accuracy(nasdaq.exp.damped.linear, nasdaq.ts)
##                      ME      RMSE       MAE        MPE      MAPE
## Training set   2.296246  38.62084  27.88755 0.04914543 0.7462363
## Test set     350.163494 393.79555 355.26901 4.76791433 4.8430486
##                    MASE         ACF1 Theil's U
## Training set  0.9997306 -0.003787145        NA
## Test set     12.7359083  0.830244200  3.947916
# just checking MAPE
accuracy(nasdaq.exp.damped.linear, nasdaq.ts)["Test set", "MAPE"]
## [1] 4.843049

Compared with the local linear trend approach, the damped trend approach has produced lower-quality forecasts on the nasdaq data (test-set MAPE of 4.84 versus 4.00).

Let’s again compare the accuracy of the models to find the best one so far.

best.model <- min(accuracy(mean.benchmark, nasdaq.ts)["Test set", "MAPE"],
                  accuracy(naive.benchmark, nasdaq.ts)["Test set", "MAPE"],
                  accuracy(drift.benchmark, nasdaq.ts)["Test set", "MAPE"],
                  accuracy(nasdaq.exp.smot, nasdaq.ts)["Test set", "MAPE"],
                  accuracy(nasdaq.exp.loc.linear, nasdaq.ts)["Test set", "MAPE"],
                  accuracy(nasdaq.exp.damped.linear, nasdaq.ts)["Test set", "MAPE"])
best.model
## [1] 3.996509
print(paste("The best model is exponential model with local linear with lowest MAPE : ", round(best.model, 2)))
## [1] "The best model is exponential model with local linear with lowest MAPE :  4"

Holt’s method also extends to data with both trend and seasonality (the Holt-Winters seasonal method); however, I’ll not show it on our data here since the nasdaq series is not seasonal.
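For a seasonal series, the hw() function from the forecast package fits the Holt-Winters seasonal method, which adds a seasonal component to Holt’s trend equations. A quick sketch on the built-in AirPassengers series (used here only because nasdaq is not seasonal):

```r
library(forecast)

# multiplicative seasonality, since the seasonal swings in AirPassengers
# grow with the level of the series
air.hw <- hw(AirPassengers, seasonal = "multiplicative", h = 24)

autoplot(air.hw)
accuracy(air.hw)   # training-set accuracy only, as no test set is held out
```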

There is also a function that performs all of the above steps for us: ets(), which stands for Error, Trend, and Seasonality. It chooses the optimal model and parameters automatically. Automated forecasting is a great way of obtaining quick results.

nasdaq.optimal <- ets(nasdaq.train)
# the result is ETS(A,A,N) which stands for Additive errors, additive trend and no seasonality

# let's perform forecast
nasdaq.optimal.fc <- nasdaq.optimal %>% forecast(h = 61)

# plotting forecast
autoplot(nasdaq.optimal.fc)

# checking residuals
checkresiduals(nasdaq.optimal.fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,A,N)
## Q* = 13.043, df = 6, p-value = 0.04236
## 
## Model df: 4.   Total lags used: 10
# accuracy
accuracy(nasdaq.optimal.fc, nasdaq.ts)
##                        ME      RMSE       MAE         MPE      MAPE
## Training set   0.04491026  38.55395  27.70356 -0.01334001 0.7417598
## Test set     278.60418059 331.25262 293.32051  3.78263685 3.9969016
##                    MASE          ACF1 Theil's U
## Training set  0.9931348 -0.0005204969        NA
## Test set     10.5151391  0.8279258151  3.319129

ARIMA (Autoregressive Integrated Moving Average) Model

The auto.arima() function selects the ARIMA order automatically; combined with forecast(), it gives us a fitted model and forecasts in one pipeline.

# fitting and forecasting
nasdaq.arima <- nasdaq.train %>% auto.arima() %>% forecast(h = 61)  

# plotting forecast
autoplot(nasdaq.arima)

# plotting actual and fitted values together
autoplot(nasdaq.arima) + autolayer(fitted(nasdaq.arima), series = "Fitted")

# checking residuals
checkresiduals(nasdaq.arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,0) with drift
## Q* = 13.069, df = 9, p-value = 0.1595
## 
## Model df: 1.   Total lags used: 10
# model diagnostics look good. The p-value is 0.1595, which suggests the residuals are white noise.


# checking accuracy
accuracy(nasdaq.arima, nasdaq.ts)
##                        ME      RMSE       MAE         MPE      MAPE
## Training set 1.145621e-03  38.55398  27.69883 -0.01433411 0.7416218
## Test set     2.796615e+02 332.12677 294.15798  3.79720643 4.0083022
##                    MASE        ACF1 Theil's U
## Training set  0.9929653 -0.01033705        NA
## Test set     10.5451615  0.82789425  3.327893
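To close the comparison, the ETS and ARIMA forecasts can be folded into the same MAPE table as the earlier models; a sketch, assuming all of the model objects from the post are still in the workspace:

```r
mapes <- c(mean   = accuracy(mean.benchmark, nasdaq.ts)["Test set", "MAPE"],
           naive  = accuracy(naive.benchmark, nasdaq.ts)["Test set", "MAPE"],
           drift  = accuracy(drift.benchmark, nasdaq.ts)["Test set", "MAPE"],
           ses    = accuracy(nasdaq.exp.smot, nasdaq.ts)["Test set", "MAPE"],
           holt   = accuracy(nasdaq.exp.loc.linear, nasdaq.ts)["Test set", "MAPE"],
           damped = accuracy(nasdaq.exp.damped.linear, nasdaq.ts)["Test set", "MAPE"],
           ets    = accuracy(nasdaq.optimal.fc, nasdaq.ts)["Test set", "MAPE"],
           arima  = accuracy(nasdaq.arima, nasdaq.ts)["Test set", "MAPE"])

sort(round(mapes, 4))   # smallest (best) first
```

Based on the accuracies printed above, Holt’s local linear trend method remains the winner, with a test-set MAPE of about 4.00.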