Confidence Interval Differences in lm and glm

Author: Alex Kaizer

Affiliation: University of Colorado-Anschutz Medical Campus

This page is part of the University of Colorado-Anschutz Medical Campus’ BIOS 6618 Recitation collection. To view other questions, you can view the BIOS 6618 Recitation collection page or use the search bar to look for keywords.

Confidence Interval Approaches

In lecture we discussed how to calculate confidence intervals for our beta coefficients either “by hand” or with software such as R (or SAS). For example, by hand our confidence interval formula is generally along the lines of

\[ \hat{\beta}_k \pm t_{1-\alpha/2, n-p-1} SE(\hat{\beta}_k) \]

where \(n\) is our number of observations and \(p\) is our number of predictors in the model.
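
To make the pieces of this formula concrete, here is a minimal sketch of how the critical value can be obtained in R, assuming the \(n=10\), single-predictor setting we simulate below:

Code
n <- 10      # number of observations
p <- 1       # number of predictors (slope only)
alpha <- 0.05

# critical value from the t-distribution with n-p-1 = 8 degrees of freedom
qt(1 - alpha/2, df = n - p - 1)  # approximately 2.306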

Let’s illustrate the calculation of the confidence interval for the slope \(\hat{\beta}_{1}\) in the following simulated simple linear regression model:

Code
set.seed(1103) # set seed for reproducibility

# Simulate data to use for our true regression model
x4 <- rnorm(n=10,mean=100,sd=15) # simulate a continuous predictor

# Simulate the error term for our regression model with SD=10
error <- rnorm(n=10, mean=0, sd=10) 

# Generate the outcome from the "true" model
y2 <- -15 + 0.5*x4 + error

# Fit regression model with both lm and glm
lm_ci <- lm(y2 ~ x4)
glm_ci <- glm(y2 ~ x4)
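
Before comparing intervals, note that lm and glm (with its default Gaussian family and identity link) fit the same model here, so their coefficient estimates and standard errors agree; only the construction of the confidence interval differs. A minimal sketch to check this on our fits:

Code
# both comparisons should return TRUE for our gaussian/identity glm
all.equal( coef(lm_ci), coef(glm_ci) )
all.equal( summary(lm_ci)$coefficients[,'Std. Error'],
           summary(glm_ci)$coefficients[,'Std. Error'] )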

Using confint with lm vs. By Hand

Our summary output for our lm object is

Code
summary(lm_ci)

Call:
lm(formula = y2 ~ x4)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.5470 -6.5629 -0.0575  5.6679 12.3070 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept) -17.8352    17.0028  -1.049   0.3249  
x4            0.5303     0.1627   3.260   0.0115 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.782 on 8 degrees of freedom
Multiple R-squared:  0.5705,    Adjusted R-squared:  0.5168 
F-statistic: 10.63 on 1 and 8 DF,  p-value: 0.01153

and its calculated confidence intervals are

Code
confint(lm_ci)
                  2.5 %     97.5 %
(Intercept) -57.0436575 21.3732652
x4            0.1551655  0.9053396

If we calculate the 95% CI by hand from the output provided we arrive at:

\[ \hat{\beta}_1 \pm t_{1-\alpha/2, n-p-1} SE(\hat{\beta}_1) = 0.5303 \pm 2.306004 \times 0.1627 \]

We pull \(\hat{\beta}_1\) and \(SE(\hat{\beta}_1)\) directly from our regression output. We can calculate the value from our \(t\)-distribution corresponding to \(t_{1-\alpha/2, n-p-1} = t_{1-0.05/2, 10-1-1}\) by using qt(0.975, df=8). The estimated 95% CI is therefore

\[ (0.1551131, \; 0.9054869) \]

This does NOT exactly match the confidence interval provided by confint for our lm model. There is no need to be overly concerned, however: the discrepancy is simply due to rounding, since R retains many more decimal places internally than the summary output displays. If we instead store the pieces of information and have R calculate the confidence interval “by hand”, we do achieve agreement:

Code
b1 <- summary(lm_ci)$coefficients['x4','Estimate']      # slope estimate at full precision
b1_se <- summary(lm_ci)$coefficients['x4','Std. Error'] # its standard error
tval <- qt(0.975, df=8)                                 # t critical value with n-p-1 = 8 df
b1 + c(-1,1)*tval*b1_se                                 # 95% CI by hand
[1] 0.1551655 0.9053396

Using confint with glm vs. By Hand

What happens if we use glm instead of lm to calculate our confidence intervals?

Code
summary(glm_ci)

Call:
glm(formula = y2 ~ x4)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept) -17.8352    17.0028  -1.049   0.3249  
x4            0.5303     0.1627   3.260   0.0115 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 60.55895)

    Null deviance: 1128.05  on 9  degrees of freedom
Residual deviance:  484.47  on 8  degrees of freedom
AIC: 73.184

Number of Fisher Scoring iterations: 2
Code
confint(glm_ci)
Waiting for profiling to be done...
                  2.5 %     97.5 %
(Intercept) -51.1600159 15.4896236
x4            0.2114512  0.8490538

Ruh-oh, looks like this interval is very different from our lm output! In this case, we need to remember that glm uses the standard normal distribution in its calculation of the confidence interval:

\[ \hat{\beta}_1 \pm Z_{1-\alpha/2} SE(\hat{\beta}_1) = 0.5303 \pm 1.959964 \times 0.1627 = (0.2114139,\; 0.849186) \]

Again, our “by hand” and confint intervals are slightly different due to rounding. If we have R do the calculations for us, the confidence intervals match:

Code
b1 <- summary(glm_ci)$coefficients['x4','Estimate']      # slope estimate at full precision
b1_se <- summary(glm_ci)$coefficients['x4','Std. Error'] # its standard error
zval <- qnorm(0.975)                                     # standard normal critical value
b1 + c(-1,1)*zval*b1_se                                  # 95% CI by hand
[1] 0.2114512 0.8490538
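
As an aside, the “Waiting for profiling to be done...” message is a hint that confint() on a glm object actually computes profile-likelihood intervals; for a Gaussian model with the identity link these reduce to the normal-based Wald intervals we just reproduced by hand. If you want the Wald-style interval directly, confint.default() uses the standard normal quantile under the hood. A minimal sketch:

Code
# Wald-style CI based on qnorm(), with no profiling;
# this should match the z-based interval calculated above
confint.default(glm_ci)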

When Should lm and glm More-or-Less Match?

We know that the \(t\)-distribution looks increasingly normal as \(n \to \infty\). Therefore, as our sample size increases, the difference between the two confidence intervals becomes increasingly small. Let’s examine the confidence intervals across a range of increasing sample sizes to see this firsthand:

Code
set.seed(515)

size_vec <- c(10,30,100,300,1000,3000,10000,30000)
conf_mat <- matrix(nrow=length(size_vec), ncol=5, dimnames = list(paste0('N=',size_vec), c('lm LCI','glm LCI','b1','glm UCI','lm UCI')) )

for( s in 1:length(size_vec) ){
  # simulate a data set of the current size from a known true model
  x <- rnorm(n=size_vec[s], mean=100, sd=50)
  y <- 100 + 3*x + rnorm(n=size_vec[s], mean=0, sd=45)
  # fit the same model with lm and glm
  lms <- lm(y ~ x)
  glms <- glm(y ~ x)
  # store each set of confidence intervals once (avoids re-profiling the glm fit)
  lm_int <- confint(lms)
  glm_int <- confint(glms)
  conf_mat[s,] <- c( lm_int[2,1], glm_int[2,1], coef(lms)[2], glm_int[2,2], lm_int[2,2] )
}

library(kableExtra)
kableExtra::kbl(conf_mat, col.names=c('lm LCI','glm LCI','$\\hat{\\beta}_{1}$','glm UCI','lm UCI'), align='ccccc', escape=F) %>%
      kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
lm LCI glm LCI \(\hat{\beta}_{1}\) glm UCI lm UCI
N=10 2.435313 2.550756 3.204622 3.858487 3.973930
N=30 2.652286 2.669110 3.041944 3.414778 3.431602
N=100 2.741768 2.744375 2.952861 3.161347 3.163953
N=300 2.903734 2.904129 3.000912 3.097694 3.098089
N=1000 2.903226 2.903297 2.962270 3.021243 3.021314
N=3000 2.952634 2.952647 2.985838 3.019028 3.019041
N=10000 2.979706 2.979708 2.997415 3.015122 3.015124
N=30000 2.989690 2.989691 2.999973 3.010256 3.010256

We see that by \(N=30000\) the confidence intervals are nearly identical (the lower confidence limits differ by only 0.000001, and the upper limits agree to the precision shown).
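
Another way to see the same convergence is to compare the multipliers themselves: the \(t\) critical value on \(n-p-1 = n-2\) degrees of freedom approaches the standard normal critical value as \(n\) grows. A minimal sketch, reusing size_vec from the code above:

Code
# t vs. z critical values used by the lm- and glm-style intervals
data.frame( n = size_vec,
            t_crit = qt(0.975, df = size_vec - 2),
            z_crit = qnorm(0.975) )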