Load the data and create a subset data frame, landings, with the years 1964 to 1987.
data(greeklandings, package="FishForecast")
landings = subset(greeklandings, Year <= 1987)
head(landings)
| Year | Species | metric.tons | log.metric.tons |
|------|---------|-------------|-----------------|
| 1964 | Anchovy | 5.45e+03 | 8.6  |
| 1965 | Anchovy | 4.26e+03 | 8.36 |
| 1966 | Anchovy | 5.15e+03 | 8.55 |
| 1967 | Anchovy | 7.27e+03 | 8.89 |
| 1968 | Anchovy | 6.35e+03 | 8.76 |
| 1969 | Anchovy | 7.04e+03 | 8.86 |
unique(landings$Species)
## [1] Anchovy Sardine Chub.mackerel Horse.mackerel Mackerel
## [6] Jack.Mackerel
## 6 Levels: Anchovy Sardine Chub.mackerel Horse.mackerel ... Jack.Mackerel
val <- "Sardine"
dat <- subset(landings, Species==val)
class(dat)
## [1] "data.frame"
Because it is a data.frame, you can access the columns like so:
dat$log.metric.tons
## [1] 9.471504 9.269656 9.344679 9.103534 9.118203 9.270052 9.085083 9.119573
## [9] 9.371089 9.491134 9.298525 9.432539 9.447339 9.404961 9.383293 9.450538
## [17] 9.397840 9.429829 9.423539 9.233666 9.245341 9.349267 9.246258 9.178334
Plot the time series with ggplot2.
library(ggplot2)
ggplot(dat, aes(x=Year, y=log.metric.tons)) +
geom_line()
Or with base graphics:
plot(dat$Year, dat$log.metric.tons, type="l")
First add \(t\), \(t^2\) and \(t^3\) columns to the landings data frame. \(t\) is Year minus the first Year, so the first year is \(t = 0\).
landings$t <- landings$Year-landings$Year[1]
landings$t2 <- (landings$t)^2
landings$t3 <- (landings$t)^3
dat <- subset(landings, Species==val)
head(dat)
| Year | Species | metric.tons | log.metric.tons | t | t2 | t3 |
|------|---------|-------------|-----------------|---|----|----|
| 1964 | Sardine | 1.3e+04  | 9.47 | 0 | 0  | 0   |
| 1965 | Sardine | 1.06e+04 | 9.27 | 1 | 1  | 1   |
| 1966 | Sardine | 1.14e+04 | 9.34 | 2 | 4  | 8   |
| 1967 | Sardine | 8.99e+03 | 9.1  | 3 | 9  | 27  |
| 1968 | Sardine | 9.12e+03 | 9.12 | 4 | 16 | 64  |
| 1969 | Sardine | 1.06e+04 | 9.27 | 5 | 25 | 125 |
We are fitting this model:
\[\log(\text{Sardine catch}) = \alpha + \beta_1 t + \beta_2 t^2 + \beta_3 t^3 + e_t\]
model <- lm(log.metric.tons ~ t + t2 + t3, data=dat)
Look at the model.
summary(model)
##
## Call:
## lm(formula = log.metric.tons ~ t + t2 + t3, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.155240 -0.052781 0.008525 0.063977 0.200622
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.361e+00 7.045e-02 132.860 < 2e-16 ***
## t -6.414e-02 2.710e-02 -2.366 0.02816 *
## t2 8.790e-03 2.773e-03 3.170 0.00481 **
## t3 -2.808e-04 7.917e-05 -3.547 0.00202 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1001 on 20 degrees of freedom
## Multiple R-squared: 0.4566, Adjusted R-squared: 0.375
## F-statistic: 5.601 on 3 and 20 DF, p-value: 0.005907
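The summary gives standard errors; if you also want confidence intervals on the coefficients, confint() works on any lm fit. This check is my addition, not part of Stergiou and Christou's analysis.
# Not in the original analysis: 95% confidence intervals on the coefficients
confint(model)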
You don’t need to add \(t^2\) and \(t^3\), etc., to your data frame. You can use the I() function instead.
model2 <- lm(log.metric.tons ~ t + I(t^2) + I(t^3) + I(t^4), data=dat)
summary(model2)
##
## Call:
## lm(formula = log.metric.tons ~ t + I(t^2) + I(t^3) + I(t^4),
## data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.115300 -0.053090 -0.008895 0.041783 0.165885
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.464e+00 6.822e-02 138.726 < 2e-16 ***
## t -1.754e-01 4.298e-02 -4.080 0.000638 ***
## I(t^2) 3.164e-02 7.842e-03 4.035 0.000707 ***
## I(t^3) -1.847e-03 5.176e-04 -3.569 0.002047 **
## I(t^4) 3.406e-05 1.116e-05 3.052 0.006562 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08412 on 19 degrees of freedom
## Multiple R-squared: 0.6353, Adjusted R-squared: 0.5586
## F-statistic: 8.275 on 4 and 19 DF, p-value: 0.0004846
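Since the cubic model is nested within the quartic model, the two fits can be compared with a nested-model F-test via anova() or with AIC(). This comparison is my addition, not part of the original analysis.
# My addition: compare the cubic (model) and quartic (model2) fits
anova(model, model2)
AIC(model, model2)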
The proper way to fit a time-varying regression with a polynomial in \(t\) is to use poly(), but Stergiou and Christou fit a raw polynomial regression. I repeat Stergiou and Christou’s approach here, but keep in mind that you should use poly() for a real analysis. For comparison, here is the same model fit with orthogonal polynomials via poly():
model3 <- lm(log.metric.tons ~ poly(t,3), data=dat)
summary(model3)
##
## Call:
## lm(formula = log.metric.tons ~ poly(t, 3), data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.155240 -0.052781 0.008525 0.063977 0.200622
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.31524 0.02043 455.923 < 2e-16 ***
## poly(t, 3)1 0.08314 0.10009 0.831 0.41602
## poly(t, 3)2 -0.18809 0.10009 -1.879 0.07488 .
## poly(t, 3)3 -0.35504 0.10009 -3.547 0.00202 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1001 on 20 degrees of freedom
## Multiple R-squared: 0.4566, Adjusted R-squared: 0.375
## F-statistic: 5.601 on 3 and 20 DF, p-value: 0.005907
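poly() can also return raw (non-orthogonal) polynomial columns if you pass raw = TRUE. As a quick check (my addition), the coefficients from that fit should match the raw polynomial fit from earlier.
# My addition: raw polynomials via poly(); coefficients should match model
model3.raw <- lm(log.metric.tons ~ poly(t, 3, raw = TRUE), data = dat)
all.equal(unname(coef(model3.raw)), unname(coef(model)))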
These are the estimated coefficients from the raw polynomial fit (model).
coef(model)
## (Intercept) t t2 t3
## 9.3605023339 -0.0641416599 0.0087901653 -0.0002808215
These are the model’s predicted (fitted) values.
fitted(model)
## 45 46 47 48 49 50 51 52
## 9.360502 9.304870 9.265133 9.239607 9.226606 9.224445 9.231441 9.245907
## 53 54 55 56 57 58 59 60
## 9.266159 9.290512 9.317281 9.344781 9.371327 9.395234 9.414817 9.428392
## 61 62 63 64 65 66 67 68
## 9.434273 9.430776 9.416215 9.388906 9.347163 9.289303 9.213639 9.118487
Plot the fitted values on the data.
plot(dat$t, dat$log.metric.tons)
lines(dat$t, fitted(model))
Or with ggplot:
fitted.df <- data.frame(t = dat$t, fitted=fitted(model))
ggplot(dat, aes(x=t, y=log.metric.tons)) +
geom_point() +
geom_line(color='red',data = fitted.df, aes(x=t, y=fitted))
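You can also get predictions beyond the data range with predict(). The years below (1988-1990, i.e. \(t\) = 24 to 26) are my example, not part of Stergiou and Christou's analysis; keep in mind that extrapolating a polynomial far beyond the data is risky.
# My example: predictions with intervals for t = 24, 25, 26 (1988-1990)
newdat <- data.frame(t = 24:26, t2 = (24:26)^2, t3 = (24:26)^3)
predict(model, newdata = newdat, interval = "prediction")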
These are the residuals, the difference between the data and the fitted line.
residuals(model)
## 45 46 47 48 49
## 0.1110015820 -0.0352141158 0.0795458353 -0.1360722943 -0.1084026116
## 50 51 52 53 54
## 0.0456061531 -0.1463576622 -0.1263341902 0.1049301962 0.2006218414
## 55 56 57 58 59
## -0.0187555657 0.0877587806 0.0760127752 0.0097270668 -0.0315241974
## 60 61 62 63 64
## 0.0221455159 -0.0364331808 -0.0009467464 0.0073234387 -0.1552398025
## 65 66 67 68
## -0.1018228357 0.0599644510 0.0326187567 0.0598468097
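Before the formal tests, it can help to plot the residuals against \(t\) to look for any left-over pattern. This plot is my addition.
# My addition: residuals versus t; look for trends or patterns
plot(dat$t, residuals(model))
abline(h = 0, lty = 2)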
For the Ljung-Box test, we need to know how many parameters were estimated; this is passed to Box.test() as the fitdf argument. We can get that number from the estimated coefficients.
coef(model)
## (Intercept) t t2 t3
## 9.3605023339 -0.0641416599 0.0087901653 -0.0002808215
np <- length(coef(model))
Stergiou and Christou use lag=14, which is rather large for 24 data points, but we will use it to match their analysis.
x <- resid(model)
Box.test(x, lag = 14, type = "Ljung-Box", fitdf=np)
##
## Box-Ljung test
##
## data: x
## X-squared = 15.127, df = 10, p-value = 0.1275
We can also use the Breusch-Godfrey test, which is more standard for regression models. The null hypothesis is that the residuals are temporally independent, so we do not want to reject the null.
The checkresiduals() function in the forecast package shows diagnostic plots and also reports the Breusch-Godfrey test results.
library(forecast)
checkresiduals(model)
##
## Breusch-Godfrey test for serial correlation of order up to 7
##
## data: Residuals
## LM test = 6.7563, df = 7, p-value = 0.4547
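If you only want the Breusch-Godfrey test without the plots, the bgtest() function in the lmtest package works directly on lm fits. This is a sketch, assuming the lmtest package is installed; checkresiduals() used order 7 above.
# Assumes the lmtest package is installed
library(lmtest)
bgtest(model, order = 7)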
If the p-value is less than 0.05, there is support for temporally autocorrelated residuals, which means that your assumption of uncorrelated residuals is not supported. You can also look at the ACF of the residuals.
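For example (my addition), the ACF of the residuals:
# ACF of the model residuals; bars outside the dashed bands suggest autocorrelation
acf(resid(model))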
You can also fit lower-order polynomials and compare them, for example a first-order and a second-order model:
lm1 <- lm(log.metric.tons ~ t, data=dat)
lm2 <- lm(log.metric.tons ~ t + I(t^2), data=dat)
The same analysis can be repeated for any of the other species:
unique(landings$Species)
## [1] Anchovy Sardine Chub.mackerel Horse.mackerel Mackerel
## [6] Jack.Mackerel
## 6 Levels: Anchovy Sardine Chub.mackerel Horse.mackerel ... Jack.Mackerel
Use summary() on each fit and evaluate what level of polynomial is supported.
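As a sketch of how such a comparison might look (my example, using Anchovy and comparing fits with AIC):
# My example: polynomials of order 1 to 4 for Anchovy, compared with AIC
datA <- subset(landings, Species == "Anchovy")
fits <- lapply(1:4, function(i) lm(log.metric.tons ~ poly(t, i), data = datA))
sapply(fits, AIC)  # lower AIC indicates more support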