Collinearity in Regression

Author: Alex Kaizer

Affiliation: University of Colorado-Anschutz Medical Campus

This page is part of the University of Colorado-Anschutz Medical Campus’ BIOS 6618 Recitation collection. To see other questions, visit the BIOS 6618 Recitation collection page or use the search bar to look for keywords.

Collinearity

Collinearity occurs when two explanatory variables have a linear association. In a regression context, we often refer to this as multicollinearity: the situation where two or more explanatory variables are highly linearly related.

Why might we be concerned about multicollinearity? There are two major reasons:

  1. It can lead to instability in our beta coefficient estimates, depending on which variable(s) are included in the model.
  2. It can lead to reduced precision (i.e., increased variance of our beta coefficient estimates).

In our lecture on MLR diagnostics, we introduced the use of the variance inflation factor (VIF) as a way to evaluate multicollinearity. Our rule of thumb was that we should be concerned when VIF > 10 for any coefficient.
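
As a refresher, the VIF for the \(j\)th predictor is \(1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing \(X_j\) on all of the other predictors. The vif() function from the car package handles this for us, but a minimal by-hand sketch (assuming a fitted lm object named mod with only numeric predictors; the function name vif_by_hand is our own placeholder) could look like:

Code
# By-hand VIF sketch: regress each predictor on the others and compute 1/(1 - R^2)
vif_by_hand <- function(mod){
  X <- model.matrix(mod)[, -1, drop=FALSE] # drop the intercept column
  sapply(colnames(X), function(j){
    r2_j <- summary(lm(X[, j] ~ X[, colnames(X) != j, drop=FALSE]))$r.squared
    1 / (1 - r2_j) # VIF_j = 1 / (1 - R_j^2)
  })
}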

Let’s explore some different sources and issues with multicollinearity.

Structural Multicollinearity

This form of multicollinearity is introduced by including polynomial or interaction terms in the model. In both cases, correlation among the related terms is expected by construction.

Polynomial Regression

Let’s first see an example of polynomial regression with raw polynomial terms:

Code
library(car) # load package for vif function
dat <- read.csv('../../.data/nhanes1516_sexhormone.csv') # NHANES 2015-16 sex hormone data
dat <- dat[which(dat$MALE==TRUE),] # restrict to male participants
dat <- dat[!is.na(dat$SHBG),] # remove records with missing SHBG

dat[,c('age','age2','age3')] <- poly(dat$RIDAGEYR, 3, raw=TRUE) # raw (non-orthogonal) polynomial terms for age
mod_raw3 <- lm(SHBG ~ age + age2 + age3, data=dat)
vif(mod_raw3)
     age     age2     age3 
167.0021 852.3380 298.5833 

We see that our three age terms (i.e., \(X_{age}\), \(X_{age}^2\), and \(X_{age}^{3}\)) have VIF values far greater than 10. This is expected, since the quadratic and cubic terms are simply the square and cube of the original age term.
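
As a quick check (not part of the original output, so treat the exact values as something to verify), we could look directly at the pairwise correlations among the raw age terms, which should be extremely high:

Code
# Raw polynomial terms are nearly perfectly correlated with one another
round(cor(dat[, c('age','age2','age3')]), 3)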

If we fit orthogonal polynomial terms, we see that the multicollinearity is removed:

Code
dat[,c('orthage','orthage2','orthage3')] <- poly(dat$RIDAGEYR, 3, raw=FALSE) # orthogonal polynomial terms for age
mod_orth3 <- lm(SHBG ~ orthage + orthage2 + orthage3, data=dat)
vif(mod_orth3)
 orthage orthage2 orthage3 
       1        1        1 

Now, all VIF values equal 1 (the minimum possible value), suggesting no concerns with multicollinearity.
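
By construction, the orthogonal polynomial columns are mutually uncorrelated, which we could verify directly (a quick check, not shown in the original output):

Code
# Orthogonal polynomial columns are constructed to be mutually uncorrelated
round(cor(dat[, c('orthage','orthage2','orthage3')]), 3)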

As a brief aside, we can see how poly() created our raw and orthogonal \(X_{age}\) terms:

Code
head(dat)
    X  SEQN MALE RIDAGEYR TESTOSTERONE ESTRADIOL  SHBG age age2   age3
1   1 83732 TRUE       62          367      19.9 42.86  62 3844 238328
2   2 83733 TRUE       53          505      28.3 29.04  53 2809 148877
3   3 83734 TRUE       78          104      12.7 27.02  78 6084 474552
8   8 83741 TRUE       22          543      34.4 21.76  22  484  10648
10 10 83743 TRUE       18          381      27.1 17.86  18  324   5832
11 11 83744 TRUE       56          685      31.9 73.08  56 3136 175616
       orthage      orthage2    orthage3
1   0.01766653 -0.0036872569 -0.02046891
2   0.01087120 -0.0151759086 -0.01646149
3   0.02974713  0.0324476628  0.02458503
8  -0.01253495 -0.0060449164  0.01960798
10 -0.01555510  0.0006320036  0.01440223
11  0.01313631 -0.0120533373 -0.01934986

Interaction Term

Structural multicollinearity can also occur due to the presence of an interaction:

Code
mod_int_all <- lm(SHBG ~ RIDAGEYR + ESTRADIOL + TESTOSTERONE + RIDAGEYR*ESTRADIOL + RIDAGEYR*TESTOSTERONE + RIDAGEYR*ESTRADIOL*TESTOSTERONE, data=dat)
mod_int_two <- lm(SHBG ~ RIDAGEYR + ESTRADIOL + TESTOSTERONE + RIDAGEYR*ESTRADIOL + RIDAGEYR*TESTOSTERONE, data=dat)
mod_int_none <- lm(SHBG ~ RIDAGEYR + ESTRADIOL + TESTOSTERONE, data=dat)

# compare VIFs from models with 3-way and 2-way interaction, only 2-way interaction, only main effects
vif(mod_int_all)
                       RIDAGEYR                       ESTRADIOL 
                       8.518435                       16.211995 
                   TESTOSTERONE              RIDAGEYR:ESTRADIOL 
                      14.326071                       32.468986 
          RIDAGEYR:TESTOSTERONE          ESTRADIOL:TESTOSTERONE 
                      25.602547                       27.562449 
RIDAGEYR:ESTRADIOL:TESTOSTERONE 
                      39.002071 
Code
vif(mod_int_two)
             RIDAGEYR             ESTRADIOL          TESTOSTERONE 
             3.989791             11.198447             10.400867 
   RIDAGEYR:ESTRADIOL RIDAGEYR:TESTOSTERONE 
            18.727071             14.772772 
Code
vif(mod_int_none)
    RIDAGEYR    ESTRADIOL TESTOSTERONE 
    1.177162     2.094151     1.908693 

We see the presence of multicollinearity in the models with interactions, but this is to be expected given the interaction terms. Because the main-effects-only model shows no multicollinearity, the inflation is likely caused by the interaction terms themselves rather than by underlying relationships among the predictors.
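
If this structural multicollinearity were a concern, a common remedy is to mean-center the continuous predictors before forming the interaction terms. A hedged sketch (the centered variable names age_c, est_c, and test_c are ours, not part of the original data set) could look like:

Code
# Mean-center the continuous predictors before forming interactions (illustrative remedy)
dat$age_c  <- dat$RIDAGEYR     - mean(dat$RIDAGEYR, na.rm=TRUE)
dat$est_c  <- dat$ESTRADIOL    - mean(dat$ESTRADIOL, na.rm=TRUE)
dat$test_c <- dat$TESTOSTERONE - mean(dat$TESTOSTERONE, na.rm=TRUE)

# Refit the two-way interaction model on the centered variables and recheck the VIFs
mod_int_two_c <- lm(SHBG ~ age_c*est_c + age_c*test_c, data=dat)
vif(mod_int_two_c)

Centering does not change the fitted model’s predictions, but it typically reduces the correlation between each main effect and its interaction term, which lowers the corresponding VIFs.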

Data Multicollinearity

Another type of multicollinearity arises when two variables in the data are closely related (e.g., they measure similar constructs/phenomena or are correlated by chance). This suggests that the variables may be attempting to explain the same or very similar aspects of the variability in our outcome \(Y\).

To examine this, let’s run a simulation: we simulate data for a multiple linear regression and add an extra variable that is highly correlated with one of the simulated predictors:

Code
set.seed(6618) # set seed for reproducibility
n <- 100

x1 <- rbinom(n=n, size=1, prob=0.3)
x2 <- rnorm(n=n, mean=5, sd=2)
x3 <- rnorm(n=n, mean=0, sd=3)
x2_corr <- x2 + runif(n=n, -0.25, 0.25) # simulate highly correlated variable

y <- 5 + 0.8*x1 + 2*x2 - 1*x3 + rnorm(n=n, mean=0, sd=2) # simulate outcome from the true model (x2_corr is not used)

# Fit true model, evaluate coefficients and VIF
lm_true <- lm(y ~ x1 + x2 + x3)
summary(lm_true)

Call:
lm(formula = y ~ x1 + x2 + x3)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.9318 -1.2701  0.0105  1.3527  4.2596 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.09362    0.49547  10.280   <2e-16 ***
x1           0.12039    0.37765   0.319    0.751    
x2           2.05506    0.08522  24.114   <2e-16 ***
x3          -1.00879    0.05932 -17.005   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.774 on 96 degrees of freedom
Multiple R-squared:  0.8905,    Adjusted R-squared:  0.887 
F-statistic: 260.2 on 3 and 96 DF,  p-value: < 2.2e-16
Code
vif(lm_true)
      x1       x2       x3 
1.001741 1.015950 1.017582 

Now let’s add x2_corr to the model and also check its Pearson correlation with x2:

Code
# Fit model with both x2 and x2_corr, evaluate coefficients and VIF
lm_mc <- lm(y ~ x1 + x2 + x3 + x2_corr)
summary(lm_mc)

Call:
lm(formula = y ~ x1 + x2 + x3 + x2_corr)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.9656 -1.2914  0.0406  1.3509  4.1876 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.09832    0.49824  10.233   <2e-16 ***
x1           0.11427    0.38024   0.301    0.764    
x2           1.73501    1.24708   1.391    0.167    
x3          -1.01146    0.06051 -16.715   <2e-16 ***
x2_corr      0.32064    1.24639   0.257    0.798    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.783 on 95 degrees of freedom
Multiple R-squared:  0.8905,    Adjusted R-squared:  0.8859 
F-statistic: 193.2 on 4 and 95 DF,  p-value: < 2.2e-16
Code
vif(lm_mc)
        x1         x2         x3    x2_corr 
  1.005669 215.433748   1.048501 216.117251 
Code
cor(x2, x2_corr) # check correlation
[1] 0.997602

We see that our estimate of the effect of \(X_2\) has decreased, its standard error has increased substantially, and it is no longer statistically significant. Evaluating the VIF, we see it is far greater than 10! This is because our two variables have a Pearson correlation of 0.998.
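
The link between this correlation and the VIF is direct: \(VIF_j = 1/(1 - R_j^2)\), and since \(X_{2,corr}\) is essentially the only variable related to \(X_2\), the pairwise correlation alone nearly reproduces the reported VIF (a rough back-of-the-envelope check):

Code
# Approximate the VIF for x2 using only its correlation with x2_corr
r <- cor(x2, x2_corr)
1 / (1 - r^2) # roughly 209; x1 and x3 contribute the small remainder up to the ~215 above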

In practice, we’d need to decide whether to keep \(X_2\) or \(X_{2,corr}\) in the model, but we wouldn’t have the benefit of having simulated the data to know the truth. Instead, we’d need to make a decision based on the context of the data and the problem we are investigating.

If we did choose to keep \(X_{2,corr}\), we would see the following results:

Code
# Fit model with x2_corr in place of x2, evaluate coefficients and VIF
lm_corr <- lm(y ~ x1 + x2_corr + x3)
summary(lm_corr)

Call:
lm(formula = y ~ x1 + x2_corr + x3)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2969 -1.2145 -0.0964  1.3580  3.8000 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.16641    0.49824  10.369   <2e-16 ***
x1           0.08148    0.38136   0.214    0.831    
x2_corr      2.05060    0.08601  23.842   <2e-16 ***
x3          -1.02518    0.05999 -17.088   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.792 on 96 degrees of freedom
Multiple R-squared:  0.8883,    Adjusted R-squared:  0.8848 
F-statistic: 254.5 on 3 and 96 DF,  p-value: < 2.2e-16
Code
vif(lm_corr)
      x1  x2_corr       x3 
1.001805 1.019173 1.020688 

In this case, given the high correlation, our results look very similar to those from lm_true, even though x2_corr was not actually used to simulate the outcome values.