6.6 Orthogonalization

The last approach that we will discuss for dealing with collinearity is orthogonalization. With this technique, we replace the set of collinear covariates \(X\) with a set of orthogonal, i.e. uncorrelated, covariates \(Z\) that are linear combinations of the original, collinear, covariates. Principal Components Analysis (PCA) is one example of using orthogonalization to deal with collinear covariates. Another example is polynomial regression with the poly() function, where we replace the collinear variates \(x\), \(x^2\), \(x^3\), etc. with a set of orthogonal covariates.

There is no unique way to create a set of orthogonal covariates; many different orthogonal sets are possible. We will show three ways you might create your orthogonal set: PCA, Gram-Schmidt orthogonalization, and residuals from linear regressions. Note that it is important to use standardized covariates, with the mean removed and the variance scaled to 1, when doing orthogonalization. Thus we will use dfz instead of df.
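The data frame dfz is created earlier in the chapter. As a reminder, a minimal sketch of how you might build it from df (which holds the anchovy catch and the covariates) is below; the exact scaling used earlier may differ slightly (e.g. sample versus population standard deviation).

# A sketch only: standardize every column of df (including anchovy)
# to mean 0 and variance 1
dfz <- as.data.frame(scale(df))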

6.6.0.1 Principal component regression

Principal component regression is a linear regression in which you replace your collinear covariates with the orthogonal variates created by Principal Components Analysis (PCA). PCA creates an orthogonal set of variates \(Z\) in which the first variate accounts for as much of the variability in the dataset as possible, the second accounts for as much of the remaining variance as possible, the third accounts for as much of the variance remaining after the first two, and so on. Each variate is orthogonal to all of the preceding variates.

Singular value decomposition is a standard way to compute \(Z\) for PCA.

\[Z = XV\] where \(V\) is the matrix of right singular vectors from the singular value decomposition of the matrix of covariates (\(X=UDV^\top\)). \(V\) is the ‘loadings’ matrix in a PCA. The orthogonal covariates \(Z\) are linear combinations of the original, collinear, covariates. PCA covariates have the nice feature that they are ordered in terms of the amount of variance they explain, but the orthogonal variates, called axes in a PCA, can be a bit hard to interpret.

Let’s see an example with our data set. First we will create a matrix of our collinear variates. We need to use the scaled variates.

X <- as.matrix(dfz[,colnames(dfz)!="anchovy"])

We create our orthogonal PCA variates using the svd() function which does a singular value decomposition. We will re-label the variates as ‘principal components (PC)’. A corrplot shows that our variates are now uncorrelated.

library(corrplot)
loadings <- svd(X)$v                  # V = the loadings matrix
rownames(loadings) <- colnames(X)
Z <- X %*% loadings                   # Z = XV, the principal components
colnames(Z) <- paste0("PC", 1:ncol(Z))
corrplot(cor(Z))
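As a quick check (not part of the original analysis), the SVD-based variates should match the principal components returned by prcomp(), which we use again further below, up to a possible sign flip of each column, because \(X\) is already centered and scaled.

# prcomp() on the standardized X gives the same components as Z = XV,
# up to a possible sign flip of each column
pca_check <- prcomp(X)
all.equal(abs(unname(Z)), abs(unname(pca_check$x)))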

These new variates are linear combinations of the original variates. The “loadings” indicate the weight of each original variate in the new variate (“principal component”).

library(reshape2)
library(ggplot2)
meltR <- melt(loadings)
ggplot(meltR, aes(x=Var1, y = value)) + 
  geom_bar(stat="identity") + 
  coord_flip() + 
  facet_wrap(. ~ Var2) +
  ggtitle("Loadings")

The \(Z\) matrix gives us a set of orthogonal variates, but some of them do not explain much of the variance. We know this should be the case because we have collinearity in our data. The singular values (which are the square roots of the eigenvalues of \(X^\top X\)) show how much of the variance in \(X\) is explained by each principal component (column of \(Z\)). In the plot below, the singular values are computed for \(X/\sqrt{n}\), so that the crossproduct of the scaled matrix is the correlation matrix. The eigenvalues of a correlation matrix average 1, so with this scaling any singular value much less than 1 indicates a component that explains little of the variance.

n <- nrow(X)  # number of observations; n may already be defined earlier in the chapter
sing.val <- svd(X/sqrt(n))$d
plot(sing.val, xlab="axis", ylab="singular value")
abline(h=1, col="red")
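The squared singular values, divided by their sum, give the proportion of the total variance explained by each principal component. A quick check of the cumulative proportions (not shown in the original) is another way to decide how many components to keep.

# Cumulative proportion of variance explained by each principal component
prop.var <- sing.val^2 / sum(sing.val^2)
round(cumsum(prop.var), 2)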

We could run a linear regression with all 11 orthogonal variates (principal components), but that would not be helpful. The point of orthogonalization is to find a smaller set of variates that explains the structure in the larger set of collinear variates. Based on the singular value plot, we will use the first 2 principal components. These 2 capture a fair bit of the variability in the anchovy catch.

dfpca <- data.frame(anchovy=dfz$anchovy, Z[,1:2])
pcalm <- lm(anchovy ~ ., data=dfpca)
summary(pcalm)
## 
## Call:
## lm(formula = anchovy ~ ., data = dfpca)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0413 -0.4149  0.1160  0.4328  0.8402 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.209e-15  1.166e-01   0.000  1.00000    
## PC1          3.405e-01  5.090e-02   6.688 1.65e-06 ***
## PC2         -2.278e-01  7.671e-02  -2.970  0.00757 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5592 on 20 degrees of freedom
## Multiple R-squared:  0.7281,	Adjusted R-squared:  0.7009 
## F-statistic: 26.78 on 2 and 20 DF,  p-value: 2.209e-06
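For comparison (this check is not in the original text), we can see how much explanatory power is lost by using only two principal components rather than all of the collinear covariates.

# Adjusted R-squared: 2-component PCR versus the full regression on all covariates
c(PCR = summary(pcalm)$adj.r.squared,
  full = summary(lm(anchovy ~ ., data = dfz))$adj.r.squared)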

We can also plot the anchovy catch, broken into small, medium, and large, against the 2 components and see that the 2 components do separate these broad catch levels.

library(ggplot2)
library(grid)
library(gridExtra)
df_pca <- prcomp(X)
df_out <- as.data.frame(df_pca$x)
df_out$group <- cut(dfz$anchovy, 3)
p <- ggplot(df_out, aes(x=PC1, y=PC2, color=group)) +
  geom_point()
p

We can recover the effect sizes for the original (scaled) variates from \(V\tilde{\beta}\), where \(\tilde{\beta}\) are the estimated coefficients for the principal components and \(V\) contains the corresponding columns of the loadings matrix. This gives a similar picture to the other methods we used: Year, TOP/HPP/BOP and Trachurus come out as important.

coef(pcalm)
##   (Intercept)           PC1           PC2 
##  6.209379e-15  3.404684e-01 -2.278441e-01
# effects on the original (scaled) variates = V[,1:2] %*% coefficients of the PCs
eff <- data.frame(value=svd(X)$v[,1:2] %*% coef(pcalm)[-1], var=colnames(X))
ggplot(eff, aes(x=var, y = value)) + 
  geom_bar(stat="identity") + 
  coord_flip() +
  ggtitle("Effects on scaled original vars")
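To see how these PCR-implied effects compare with the unstable coefficients from regressing on all the collinear covariates directly, we can line them up side by side. This comparison is not part of the original text, and full_z and comp are illustrative names.

# PCR-implied effects versus coefficients from the full (collinear) regression
full_z <- lm(anchovy ~ ., data = dfz)
comp <- data.frame(PCR = as.numeric(eff$value),
                   full = coef(full_z)[eff$var],
                   row.names = eff$var)
round(comp, 2)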

Principal Components Regression (PCR) is a method for dimension reduction, like Lasso regression or variable selection, but the new principal components can be hard to interpret because they are linear combinations of the original variates. In addition, if you are trying to understand whether a particular variate improves your model, PCR is not going to help you. Another approach for creating orthogonal variates is Gram-Schmidt orthogonalization, and this can help you study the effect of adding specific variates.

6.6.0.2 Gram-Schmidt Orthogonalization

The Gram-Schmidt orthogonalization treats the variates in a specific order. The first orthogonal variate is the first variate, the second orthogonal variate is the variation in the second variate that is not explained by the first, the third is the variation in the third variate that is not explained by the first two, etc. This makes your orthogonal variates easier to interpret if they have some natural or desired ordering. For example, let’s say we want to study whether adding TOP to our model helps explain variance over what is already explained by Year. Putting both TOP and Year in a model won’t be helpful because they are highly correlated and we’ll just get effect sizes that offset each other (one negative, one positive) with high standard errors. Instead, we’ll add Year and then add a second variate that is the variability in TOP that is not explained by Year. Why not add TOP first? The ordering is up to you and depends on the specific question you are asking. In this case, we are asking what TOP adds to a model with Year, not what Year adds to a model with TOP.
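The third approach mentioned at the start of this section, residuals from linear regressions, is a simple way to do exactly this: the part of TOP not explained by Year is the residuals from a regression of TOP on Year. Here is a minimal sketch; the names dfgs and TOP_res are illustrative, not from the original.

# The part of TOP not explained by Year = residuals from regressing TOP on Year
dfgs <- data.frame(anchovy = dfz$anchovy,
                   Year = dfz$Year,
                   TOP_res = residuals(lm(TOP ~ Year, data = dfz)))
# Year and TOP_res are uncorrelated, so their coefficients do not offset each other
summary(lm(anchovy ~ Year + TOP_res, data = dfgs))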

Let \(Z\) be the matrix of orthogonal variates and \(X\) be the matrix of original, collinear, covariates. The first column of \(Z\) is \(Z_1 = X_1\). The second column of \(Z\) is \[Z_2 = X_2 - Z_1(Z_1^\top Z_1)^{-1}Z_1^\top X_2.\] The third column of \(Z\) is \[Z_3 = X_3 - Z_1(Z_1^\top Z_1)^{-1}Z_1^\top X_3 - Z_2(Z_2^\top Z_2)^{-1}Z_2^\top X_3.\]

Here is R code to create the first 3 columns of the Z matrix. The cross-product of Z is diagonal, indicating that our new variates are orthogonal.

# pr(y, x) projects y onto the column space of x
pr <- function(y, x){ x%*%solve(t(x)%*%x)%*%t(x)%*%y }
Z <- cbind(X[,1], 0, 0)                               # Z1 = X1
Z[,2] <- X[,2] - pr(X[,2], Z[,1])                     # X2 minus its projection onto Z1
Z[,3] <- X[,3] - pr(X[,3], Z[,1]) - pr(X[,3], Z[,2])  # X3 minus its projections onto Z1 and Z2
zapsmall(crossprod(Z))
##      [,1]     [,2]    [,3]
## [1,]   23 0.000000  0.0000
## [2,]    0 6.757089  0.0000
## [3,]    0 0.000000 22.1069
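One way to convince yourself that \(Z\) carries the same information as the first three columns of \(X\) is to check that regressions on either set give identical fitted values; only the coefficients and their interpretation change. This check is not in the original text.

# Z spans the same column space as X[,1:3], so the fitted values are identical
fit_X <- lm(dfz$anchovy ~ X[,1:3])
fit_Z <- lm(dfz$anchovy ~ Z)
all.equal(fitted(fit_X), fitted(fit_Z))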

To create our orthogonal variates, we have to give some thought to the ordering. Also, not all the variates in our example are collinear, so we don’t need to do Gram-Schmidt orthogonalization on all of them. From the variance inflation factors, Year, HPP, TOP, BOP and Trachurus have the worst collinearity problems.

full <- lm(anchovy ~ ., data=df)
car::vif(full)
##       Year  Trachurus        air        slp        sst       vwnd      wspd3 
## 103.922970  18.140279   3.733963   3.324463   2.476689   2.010485   1.909992 
##        BOP        FIP        HPP        TOP 
##  13.676208   8.836446  63.507170 125.295727

We’ll do Gram-Schmidt orthogonalization on these 5. First let’s re-order our variates to put these 5 first. Below we orthogonalize the first three (Year stays as is, then BOP is orthogonalized against Year, and HPP against Year and BOP); the remaining collinear columns would be handled in the same way.

pr <- function(y, x){ x%*%solve(t(x)%*%x)%*%t(x)%*%y }
# re-order the columns so the collinear variates come first
Z <- X[,c("Year","BOP","HPP","TOP","Trachurus","FIP","air","slp","sst","vwnd","wspd3")]
# orthogonalize BOP and HPP against the columns that precede them
Z[,"BOP"] <- X[,"BOP"] - pr(X[,"BOP"], Z[,"Year"])
Z[,"HPP"] <- X[,"HPP"] - pr(X[,"HPP"], Z[,"Year"]) - pr(X[,"HPP"], Z[,"BOP"])
zapsmall(crossprod(Z))
##                 Year       BOP        HPP       TOP  Trachurus        FIP
## Year       23.000000  0.000000   0.000000 21.489875  19.328398 -13.995187
## BOP         0.000000  6.757089   0.000000  2.743112   6.757089  -0.836240
## HPP         0.000000  0.000000  22.106900  2.906483   0.000000   7.049420
## TOP        21.489875  2.743112   2.906483 23.000000  20.802454  -9.247200
## Trachurus  19.328398  6.757089   0.000000 20.802454  23.000000 -12.597306
## FIP       -13.995187 -0.836240   7.049420 -9.247200 -12.597306  23.000000
## air        -3.834139 -1.309927  22.106900 -1.207694  -4.532004   9.544555
## slp        12.722819  3.030045  -6.167315 11.179809  13.721858 -12.356850
## sst        -5.105424 -0.381354  14.974748 -1.749126  -4.671774   8.493349
## vwnd        2.278039  2.424582 -11.802674  3.176227   4.338966  -1.073977
## wspd3       2.219684 -0.922046  -6.961443 -0.226925   0.943299  -5.505128
##                  air        slp       sst       vwnd     wspd3
## Year       -3.834139  12.722819 -5.105424   2.278039  2.219684
## BOP        -1.309927   3.030045 -0.381354   2.424582 -0.922046
## HPP        22.106900  -6.167315 14.974748 -11.802674 -6.961443
## TOP        -1.207694  11.179809 -1.749126   3.176227 -0.226925
## Trachurus  -4.532004  13.721858 -4.671774   4.338966  0.943299
## FIP         9.544555 -12.356850  8.493349  -1.073977 -5.505128
## air        23.000000  -8.875634 15.899760 -12.652455 -7.152721
## slp        -8.875634  23.000000 -9.371464   7.341141 -1.298503
## sst        15.899760  -9.371464 23.000000  -7.755410 -3.926768
## vwnd      -12.652455   7.341141 -7.755410  23.000000  4.991220
## wspd3      -7.152721  -1.298503 -3.926768   4.991220 23.000000
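Once orthogonalized, these variates can be used in a regression like any other covariates; each coefficient measures what that variate adds beyond the variates that precede it in the ordering. Here is a minimal sketch with the first three orthogonalized columns (dford is an illustrative name, not from the original).

# Regression on the orthogonalized Year, BOP and HPP; the BOP and HPP coefficients
# measure what those variates add beyond the variates that come before them
dford <- data.frame(anchovy = dfz$anchovy, Z[, c("Year", "BOP", "HPP")])
summary(lm(anchovy ~ ., data = dford))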