Load the data and create a subset data frame, landings, with the years 1964 to 1987.
data(greeklandings, package="FishForecast")
landings = subset(greeklandings, Year <= 1987)
head(landings)
| Year | Species | metric.tons | log.metric.tons |
|------|---------|-------------|-----------------|
| 1964 | Anchovy | 5.45e+03 | 8.6  |
| 1965 | Anchovy | 4.26e+03 | 8.36 |
| 1966 | Anchovy | 5.15e+03 | 8.55 |
| 1967 | Anchovy | 7.27e+03 | 8.89 |
| 1968 | Anchovy | 6.35e+03 | 8.76 |
| 1969 | Anchovy | 7.04e+03 | 8.86 |
unique(landings$Species)
## [1] Anchovy Sardine Chub.mackerel Horse.mackerel Mackerel
## [6] Jack.Mackerel
## 6 Levels: Anchovy Sardine Chub.mackerel Horse.mackerel ... Jack.Mackerel
val <- "Sardine"
dat <- subset(landings, Species==val)
class(dat)
## [1] "data.frame"
Because it is a data.frame, you can access the columns like so:
dat$log.metric.tons
## [1] 9.471504 9.269656 9.344679 9.103534 9.118203 9.270052 9.085083 9.119573
## [9] 9.371089 9.491134 9.298525 9.432539 9.447339 9.404961 9.383293 9.450538
## [17] 9.397840 9.429829 9.423539 9.233666 9.245341 9.349267 9.246258 9.178334
Plot the time series with ggplot2.
library(ggplot2)
ggplot(dat, aes(x=Year, y=log.metric.tons)) +
geom_line()
Or with base graphics:
plot(dat$Year, dat$log.metric.tons, type="l")
First add \(t\), \(t^2\) and \(t^3\) columns to the landings data frame. \(t\) is Year minus the first Year, so the first year is \(t = 0\).
landings$t <- landings$Year-landings$Year[1]
landings$t2 <- (landings$t)^2
landings$t3 <- (landings$t)^3
dat <- subset(landings, Species==val)
head(dat)
| Year | Species | metric.tons | log.metric.tons | t | t2 | t3 |
|------|---------|-------------|-----------------|---|----|----|
| 1964 | Sardine | 1.3e+04  | 9.47 | 0 | 0  | 0   |
| 1965 | Sardine | 1.06e+04 | 9.27 | 1 | 1  | 1   |
| 1966 | Sardine | 1.14e+04 | 9.34 | 2 | 4  | 8   |
| 1967 | Sardine | 8.99e+03 | 9.1  | 3 | 9  | 27  |
| 1968 | Sardine | 9.12e+03 | 9.12 | 4 | 16 | 64  |
| 1969 | Sardine | 1.06e+04 | 9.27 | 5 | 25 | 125 |
We are fitting this model:
\[\log(\text{Sardine catch}) = \alpha + \beta_1 t + \beta_2 t^2 + \beta_3 t^3 + e_t\]
model <- lm(log.metric.tons ~ t + t2 + t3, data=dat)
Look at the model.
summary(model)
##
## Call:
## lm(formula = log.metric.tons ~ t + t2 + t3, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.155240 -0.052781 0.008525 0.063977 0.200622
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.361e+00 7.045e-02 132.860 < 2e-16 ***
## t -6.414e-02 2.710e-02 -2.366 0.02816 *
## t2 8.790e-03 2.773e-03 3.170 0.00481 **
## t3 -2.808e-04 7.917e-05 -3.547 0.00202 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1001 on 20 degrees of freedom
## Multiple R-squared: 0.4566, Adjusted R-squared: 0.375
## F-statistic: 5.601 on 3 and 20 DF, p-value: 0.005907
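The summary gives standard errors; if you also want confidence intervals on the coefficients, confint() works on any lm fit. This check is my addition, not part of Stergiou and Christou's analysis.
# Not in the original analysis: 95% confidence intervals on the coefficients
confint(model)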
You don’t need to add \(t^2\) and \(t^3\), etc., to your data frame. You can use the I() function instead.
model2 <- lm(log.metric.tons ~ t + I(t^2) + I(t^3) + I(t^4), data=dat)
summary(model2)
##
## Call:
## lm(formula = log.metric.tons ~ t + I(t^2) + I(t^3) + I(t^4),
## data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.115300 -0.053090 -0.008895 0.041783 0.165885
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.464e+00 6.822e-02 138.726 < 2e-16 ***
## t -1.754e-01 4.298e-02 -4.080 0.000638 ***
## I(t^2) 3.164e-02 7.842e-03 4.035 0.000707 ***
## I(t^3) -1.847e-03 5.176e-04 -3.569 0.002047 **
## I(t^4) 3.406e-05 1.116e-05 3.052 0.006562 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08412 on 19 degrees of freedom
## Multiple R-squared: 0.6353, Adjusted R-squared: 0.5586
## F-statistic: 8.275 on 4 and 19 DF, p-value: 0.0004846
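Since the cubic model is nested within the quartic model, the two fits can be compared with a nested-model F-test via anova() or with AIC(). This comparison is my addition, not part of the original analysis.
# My addition: compare the cubic (model) and quartic (model2) fits
anova(model, model2)
AIC(model, model2)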
The proper way to fit a time-varying regression with a polynomial in \(t\) is to use poly(), but Stergiou and Christou fit a raw polynomial regression. I repeat Stergiou and Christou’s approach here, but keep in mind that you should use poly() for a real analysis. For comparison, here is the same model fit with orthogonal polynomials via poly():
model3 <- lm(log.metric.tons ~ poly(t,3), data=dat)
summary(model3)
##
## Call:
## lm(formula = log.metric.tons ~ poly(t, 3), data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.155240 -0.052781 0.008525 0.063977 0.200622
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.31524 0.02043 455.923 < 2e-16 ***
## poly(t, 3)1 0.08314 0.10009 0.831 0.41602
## poly(t, 3)2 -0.18809 0.10009 -1.879 0.07488 .
## poly(t, 3)3 -0.35504 0.10009 -3.547 0.00202 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1001 on 20 degrees of freedom
## Multiple R-squared: 0.4566, Adjusted R-squared: 0.375
## F-statistic: 5.601 on 3 and 20 DF, p-value: 0.005907
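poly() can also return raw (non-orthogonal) polynomial columns if you pass raw = TRUE. As a quick check (my addition), the coefficients from that fit should match the raw polynomial fit from earlier.
# My addition: raw polynomials via poly(); coefficients should match model
model3.raw <- lm(log.metric.tons ~ poly(t, 3, raw = TRUE), data = dat)
all.equal(unname(coef(model3.raw)), unname(coef(model)))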
These are the estimated coefficients from the raw polynomial fit (model).
coef(model)
## (Intercept) t t2 t3
## 9.3605023339 -0.0641416599 0.0087901653 -0.0002808215
These are the model’s predicted (fitted) values.
fitted(model)
## 45 46 47 48 49 50 51 52
## 9.360502 9.304870 9.265133 9.239607 9.226606 9.224445 9.231441 9.245907
## 53 54 55 56 57 58 59 60
## 9.266159 9.290512 9.317281 9.344781 9.371327 9.395234 9.414817 9.428392
## 61 62 63 64 65 66 67 68
## 9.434273 9.430776 9.416215 9.388906 9.347163 9.289303 9.213639 9.118487
Plot the fitted values on the data.
plot(dat$t, dat$log.metric.tons)
lines(dat$t, fitted(model))
Or with ggplot:
fitted.df <- data.frame(t = dat$t, fitted=fitted(model))
ggplot(dat, aes(x=t, y=log.metric.tons)) +
geom_point() +
geom_line(color='red',data = fitted.df, aes(x=t, y=fitted))
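You can also get predictions beyond the data range with predict(). The years below (1988-1990, i.e. \(t\) = 24 to 26) are my example, not part of Stergiou and Christou's analysis; keep in mind that extrapolating a polynomial far beyond the data is risky.
# My example: predictions with intervals for t = 24, 25, 26 (1988-1990)
newdat <- data.frame(t = 24:26, t2 = (24:26)^2, t3 = (24:26)^3)
predict(model, newdata = newdat, interval = "prediction")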
These are the residuals, the difference between the data and the fitted line.
residuals(model)
## 45 46 47 48 49
## 0.1110015820 -0.0352141158 0.0795458353 -0.1360722943 -0.1084026116
## 50 51 52 53 54
## 0.0456061531 -0.1463576622 -0.1263341902 0.1049301962 0.2006218414
## 55 56 57 58 59
## -0.0187555657 0.0877587806 0.0760127752 0.0097270668 -0.0315241974
## 60 61 62 63 64
## 0.0221455159 -0.0364331808 -0.0009467464 0.0073234387 -0.1552398025
## 65 66 67 68
## -0.1018228357 0.0599644510 0.0326187567 0.0598468097
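Before the formal tests, it can help to plot the residuals against \(t\) to look for any left-over pattern. This plot is my addition.
# My addition: residuals versus t; look for trends or patterns
plot(dat$t, residuals(model))
abline(h = 0, lty = 2)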
For the Ljung-Box test, we need to know how many parameters were estimated; this is passed to Box.test() as the fitdf argument. We can get that number from the estimated coefficients.
coef(model)
## (Intercept) t t2 t3
## 9.3605023339 -0.0641416599 0.0087901653 -0.0002808215
np <- length(coef(model))
Stergiou and Christou use lag=14, which is rather large for 24 data points, but we will use it to match their analysis.
x <- resid(model)
Box.test(x, lag = 14, type = "Ljung-Box", fitdf=np)
##
## Box-Ljung test
##
## data: x
## X-squared = 15.127, df = 10, p-value = 0.1275
We can also use the Breusch-Godfrey test, which is more standard for regression models. The null hypothesis is that the residuals are temporally independent, so we do not want to reject the null.
The checkresiduals() function in the forecast package shows diagnostic plots and also reports the Breusch-Godfrey test results.
library(forecast)
checkresiduals(model)
##
## Breusch-Godfrey test for serial correlation of order up to 7
##
## data: Residuals
## LM test = 6.7563, df = 7, p-value = 0.4547
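If you only want the Breusch-Godfrey test without the plots, the bgtest() function in the lmtest package works directly on lm fits. This is a sketch, assuming the lmtest package is installed; checkresiduals() used order 7 above.
# Assumes the lmtest package is installed
library(lmtest)
bgtest(model, order = 7)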
If the p-value is less than 0.05, there is support for temporally autocorrelated residuals, which means that your assumption of uncorrelated residuals is not supported. You can also look at the ACF of the residuals.
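For example (my addition), the ACF of the residuals:
# ACF of the model residuals; bars outside the dashed bands suggest autocorrelation
acf(resid(model))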
You can also fit lower-order polynomials and compare them, for example a first-order and a second-order model:
lm1 <- lm(log.metric.tons ~ t, data=dat)
lm2 <- lm(log.metric.tons ~ t + I(t^2), data=dat)
The same analysis can be repeated for any of the other species:
unique(landings$Species)
## [1] Anchovy Sardine Chub.mackerel Horse.mackerel Mackerel
## [6] Jack.Mackerel
## 6 Levels: Anchovy Sardine Chub.mackerel Horse.mackerel ... Jack.Mackerel
Use summary() on each fit and evaluate what level of polynomial is supported.
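As a sketch of how such a comparison might look (my example, using Anchovy and comparing fits with AIC):
# My example: polynomials of order 1 to 4 for Anchovy, compared with AIC
datA <- subset(landings, Species == "Anchovy")
fits <- lapply(1:4, function(i) lm(log.metric.tons ~ poly(t, i), data = datA))
sapply(fits, AIC)  # lower AIC indicates more support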