Author
Affiliation

Alex Kaizer

University of Colorado-Anschutz Medical Campus

This page is part of the University of Colorado-Anschutz Medical Campus’ BIOS 6618 Labs collection, the mini-lecture content delivered at the start of class before breaking out into small groups to work on the homework assignment.

What’s on the docket this week?

This week we will walk through the steps to generating our own ANOVA table for the summary of linear regression models in R. SAS provides a nicely formatted tabled automatically with its PROC REG output, but it is a bit trickier to create in R.

The ANOVA Table

We know the general set-up of the ANOVA looks something like the table below:

Code
library(kableExtra)

anova_df <- data.frame( 'Source' = c('Model','Error','Total'),
                        'Sums of Squares' = c('SS~Model~','SS~Error~','SS~Total~'),
                        'Degrees of Freedom' = c('p','n-p-1','n-1'),
                        'Mean Square' = c('MS~Model~','MS~Error~',''),
                        'F Value' = c('MS~Model~/MS~Error~','',''),
                        'p-value' = c('Pr(F~p,n-p-1~ > F)','',''))

kbl(anova_df, col.names=c('Source','Sums of Squares','Degrees of Freedom','Mean Square','F-value','p-value'),
    align='lccccc', escape=F) %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
Source Sums of Squares Degrees of Freedom Mean Square F-value p-value
Model SSModel p MSModel MSModel/MSError Pr(Fp,n-p-1 > F)
Error SSError n-p-1 MSError
Total SSTotal n-1

where \(p\) is the number of independent variables in the model (\(p=1\) for simple linear regression) and \(n\) is our total sample size.

It is also possible to represent the table in terms of the relevant equations to calculate each component:

Code
anova_df <- data.frame( 'Source' = c('Model','Error','Total'),
                        'Sums of Squares' = c('$\\sum_{i=1}^{n} (\\hat{Y}_{i} - \\bar{Y})^{2}$','$\\sum_{i=1}^{n} (Y_{i} - \\hat{Y}_{i})^{2}$','$\\sum_{i=1}^{n} (Y_{i} - \\bar{Y})^{2}$'),
                        'Degrees of Freedom' = c('$p$','$n-p-1$','$n-1$'),
                        'Mean Square' = c('$\\frac{\\sum_{i=1}^{n} (\\hat{Y}_{i} - \\bar{Y})^{2}}{p}$','$\\frac{\\sum_{i=1}^{n} (Y_{i} - \\hat{Y}_{i})^{2}}{n-p-1}$',''),
                        'F Value' = c('$F = \\frac{\\text{MS}_{\\text{Model}}}{\\text{MS}_{\\text{Error}}}$','',''),
                        'p-value' = c('$Pr(F_{p,n-p-1} > F)$','',''))

kbl(anova_df, col.names=c('Source','Sums of Squares','Degrees of Freedom','Mean Square','F-value','p-value'),
    align='lccccc', escape=F) %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
Source Sums of Squares Degrees of Freedom Mean Square F-value p-value
Model \(\sum_{i=1}^{n} (\hat{Y}_{i} - \bar{Y})^{2}\) \(p\) \(\frac{\sum_{i=1}^{n} (\hat{Y}_{i} - \bar{Y})^{2}}{p}\) \(F = \frac{\text{MS}_{\text{Model}}}{\text{MS}_{\text{Error}}}\) \(Pr(F_{p,n-p-1} > F)\)
Error \(\sum_{i=1}^{n} (Y_{i} - \hat{Y}_{i})^{2}\) \(n-p-1\) \(\frac{\sum_{i=1}^{n} (Y_{i} - \hat{Y}_{i})^{2}}{n-p-1}\)
Total \(\sum_{i=1}^{n} (Y_{i} - \bar{Y})^{2}\) \(n-1\)

Unfortunately, R doesn’t provide the information of the ANOVA table in a nice easy-to-use format. Instead we have to use specific functions to extract relevant pieces of information. However, we can calculate these all piece-by-piece and combine into an ANOVA table.

To illustrate this, let’s first simulate some data to fit our linear regression model to:

Code
set.seed(515)
x <- rnorm(n=100, mean=10, sd=3) # simulate a single continuous predictor with mean=10 and SD=3
reg <- 10 + 3 * x # set the regression equation so the intercept=10 and the slope=3
y <- rnorm(n=100, mean=reg, sd=8) # simulate the outcome based on the conditional mean

mod1 <- glm(y ~ x) # fit linear regression model
summary(mod1)

Call:
glm(formula = y ~ x)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   8.1900     3.3368   2.454   0.0159 *  
x             3.2751     0.3068  10.674   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 83.66255)

    Null deviance: 17731.8  on 99  degrees of freedom
Residual deviance:  8198.9  on 98  degrees of freedom
AIC: 730.45

Number of Fisher Scoring iterations: 2

For example, using the following functions we can extract the corresponding pieces of information:

Code
info_df <- data.frame( 'Term' = c('$Y_{i}$','$\\bar{Y}$','$\\hat{Y}_{i}$','$p$','$n$'),
                       'Code' = c('`y`','`mean(y)`','`predict(mod1)`','`length(mod1$coefficients)-1`','`length(y)`'))
kbl(info_df, escape=F) %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
Term Code
\(Y_{i}\) y
\(\bar{Y}\) mean(y)
\(\hat{Y}_{i}\) predict(mod1)
\(p\) length(mod1$coefficients)-1
\(n\) length(y)

With this information we can apply the various equations from our ANOVA table above:

Code
ybar <- mean(y)
yhat <- predict(mod1)
p <- length(mod1$coefficients)-1
n <- length(y)

ssm <- sum( (yhat-ybar)^2 )
sse <- sum( (y-yhat)^2 )
sst <- sum( (y-ybar)^2 )

msm <- ssm/p
mse <- sse/(n-p-1)

f_val <- msm/mse
p_val <- pf(f_val, df1=p, df2=n-p-1, lower.tail=FALSE)

# Create an ANOVA table to summarize all our results:
p_val_tab <- if(p_val<0.001){'<0.001'}else{round(p_val,3)}

anova_table <- data.frame( 'Source' = c('Model','Error','Total'),
                        'Sums of Squares' = c(round(ssm,2), round(sse,2), round(sst,2)),
                        'Degrees of Freedom' = c(p, n-p-1, n-1),
                        'Mean Square' = c(round(msm,2), round(mse,2),''),
                        'F Value' = c(round(f_val,2),'',''),
                        'p-value' = c(p_val_tab,'',''))

kbl(anova_table, col.names=c('Source','Sums of Squares','Degrees of Freedom','Mean Square','F-value','p-value'), align='lccccc', escape=F) %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
Source Sums of Squares Degrees of Freedom Mean Square F-value p-value
Model 9532.85 1 9532.85 113.94 <0.001
Error 8198.93 98 83.66
Total 17731.78 99

Make an ANOVA Function

Better than having to recreate the same code manually each time, let’s write a function to generate a kable table of the ANOVA table for any glm or lm object we provide:

Code
linreg_anova_func <- function(mod, ndigits=2, p_ndigits=3, format='kable'){
### Function to create an ANOVA table linear regression results from lm or glm
# mod: an object with the fitted model results
# ndigits: number of digits to round to for most values, default is 2
# p_digits: number of digits to round the p-value to, default is 3
# format: desired format output (default is kable):
## "kable" for kable table
## "df" for data frame as table

  # extract outcome from the object produced by the glm or lm function
  if( class(mod)[1] == 'glm' ){
    y <- mod$y
  }
  if( class(mod)[1] == 'lm' ){
    y <- mod$model[,1] # first column contains outcome data
  }  
  
  ybar <- mean(y)
  yhat <- predict(mod)
  p <- length(mod$coefficients)-1
  n <- length(y)

  ssm <- sum( (yhat-ybar)^2 )
  sse <- sum( (y-yhat)^2 )
  sst <- sum( (y-ybar)^2 )
  
  msm <- ssm/p
  mse <- sse/(n-p-1)
  
  f_val <- msm/mse
  p_val <- pf(f_val, df1=p, df2=n-p-1, lower.tail=FALSE)
  
  # Create an ANOVA table to summarize all our results:
  p_digits <- (10^(-p_ndigits))
  p_val_tab <- if(p_val<p_digits){paste0('<',p_digits)}else{round(p_val,p_ndigits)}
  
  anova_table <- data.frame( 'Source' = c('Model','Error','Total'),
                          'Sums of Squares' = c(round(ssm,ndigits), round(sse,ndigits), round(sst,ndigits)),
                          'Degrees of Freedom' = c(p, n-p-1, n-1),
                          'Mean Square' = c(round(msm,ndigits), round(mse,ndigits),''),
                          'F Value' = c(round(f_val,ndigits),'',''),
                          'p-value' = c(p_val_tab,'',''))

  if( format == 'kable' ){  
    library(kableExtra)
    kbl(anova_table, col.names=c('Source','Sums of Squares','Degrees of Freedom','Mean Square','F-value','p-value'), align='lccccc', escape=F) %>%
      kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
  }else{
    anova_table
  }
}

Now let’s try applying the function to two data sets (our previous one, and one null), with both glm and lm:

Code
set.seed(515)
x <- rnorm(n=100, mean=10, sd=3) # simulate a single continuous predictor with mean=10 and SD=3
reg <- 10 + 3 * x # set the regression equation so the intercept=10 and the slope=3
y <- rnorm(n=100, mean=reg, sd=8) # simulate the outcome based on the conditional mean
mod1 <- glm(y ~ x) # fit linear regression model with glm
lm1 <- lm(y ~ x) # fit linear regression model with lm

set.seed(303)
x0 <- rnorm(n=100, mean=10, sd=3) # simulate a single continuous predictor with mean=10 and SD=3
reg0 <- 10 + 0 * x0 # set the regression equation so the intercept=10 and the slope=0 (i.e., null)
y0 <- rnorm(n=100, mean=reg0, sd=8) # simulate the outcome based on the conditional mean
mod0 <- glm(y0 ~ x0) # fit linear regression model with glm
lm0 <- lm(y0 ~ x0) # fit linear regression model with lm

First let’s check out our non-null scenario:

Code
linreg_anova_func(mod=mod1)
Source Sums of Squares Degrees of Freedom Mean Square F-value p-value
Model 9532.85 1 9532.85 113.94 <0.001
Error 8198.93 98 83.66
Total 17731.78 99
Code
linreg_anova_func(mod=lm1)
Source Sums of Squares Degrees of Freedom Mean Square F-value p-value
Model 9532.85 1 9532.85 113.94 <0.001
Error 8198.93 98 83.66
Total 17731.78 99

Now let’s check out the null scenario and tweak our number of digits (both as a raw and formatted output version):

Code
linreg_anova_func(mod=mod0, ndigits=3, p_ndigits=4, format='df')
  Source Sums.of.Squares Degrees.of.Freedom Mean.Square F.Value p.value
1  Model         150.943                  1     150.943   2.596  0.1104
2  Error        5698.200                 98      58.145                
3  Total        5849.143                 99                            
Code
linreg_anova_func(mod=lm0, ndigits=3, p_ndigits=4, format='kable')
Source Sums of Squares Degrees of Freedom Mean Square F-value p-value
Model 150.943 1 150.943 2.596 0.1104
Error 5698.200 98 58.145
Total 5849.143 99

And as a final check, let’s compare the summary(lm) to our \(F\)-test result above:

Code
summary(lm0)

Call:
lm(formula = y0 ~ x0)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.5065  -5.0881  -0.5393   6.5040  15.8947 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  13.4344     2.7142   4.950 3.09e-06 ***
x0           -0.4132     0.2564  -1.611     0.11    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.625 on 98 degrees of freedom
Multiple R-squared:  0.02581,   Adjusted R-squared:  0.01587 
F-statistic: 2.596 on 1 and 98 DF,  p-value: 0.1104