Week 8 Lab

Author

Affiliation

Alex Kaizer

University of Colorado-Anschutz Medical Campus

This page is part of the University of Colorado-Anschutz Medical Campus’ BIOS 6618 Labs collection, the mini-lecture content delivered at the start of class before breaking out into small groups to work on the homework assignment.

What’s on the docket this week?

This week we will walk through the steps to generating our own ANOVA table for the summary of linear regression models in R. SAS provides a nicely formatted tabled automatically with its PROC REG output, but it is a bit trickier to create in R.

The ANOVA Table

We know the general set-up of the ANOVA looks something like the table below:

Code

library(kableExtra)

anova_df <- data.frame( 'Source' = c('Model','Error','Total'),
                        'Sums of Squares' = c('SS~Model~','SS~Error~','SS~Total~'),
                        'Degrees of Freedom' = c('p','n-p-1','n-1'),
                        'Mean Square' = c('MS~Model~','MS~Error~',''),
                        'F Value' = c('MS~Model~/MS~Error~','',''),
                        'p-value' = c('Pr(F~p,n-p-1~ > F)','',''))

kbl(anova_df, col.names=c('Source','Sums of Squares','Degrees of Freedom','Mean Square','F-value','p-value'),
    align='lccccc', escape=F) %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

Source	Sums of Squares	Degrees of Freedom	Mean Square	F-value	p-value
Model	SS_Model	p	MS_Model	MS_Model/MS_Error	Pr(F_p,n-p-1 > F)
Error	SS_Error	n-p-1	MS_Error
Total	SS_Total	n-1

where $p$ is the number of independent variables in the model ($p=1$ for simple linear regression) and $n$ is our total sample size.

It is also possible to represent the table in terms of the relevant equations to calculate each component:

Code

anova_df <- data.frame( 'Source' = c('Model','Error','Total'),
                        'Sums of Squares' = c('$\\sum_{i=1}^{n} (\\hat{Y}_{i} - \\bar{Y})^{2}$','$\\sum_{i=1}^{n} (Y_{i} - \\hat{Y}_{i})^{2}$','$\\sum_{i=1}^{n} (Y_{i} - \\bar{Y})^{2}$'),
                        'Degrees of Freedom' = c('$p$','$n-p-1$','$n-1$'),
                        'Mean Square' = c('$\\frac{\\sum_{i=1}^{n} (\\hat{Y}_{i} - \\bar{Y})^{2}}{p}$','$\\frac{\\sum_{i=1}^{n} (Y_{i} - \\hat{Y}_{i})^{2}}{n-p-1}$',''),
                        'F Value' = c('$F = \\frac{\\text{MS}_{\\text{Model}}}{\\text{MS}_{\\text{Error}}}$','',''),
                        'p-value' = c('$Pr(F_{p,n-p-1} > F)$','',''))

kbl(anova_df, col.names=c('Source','Sums of Squares','Degrees of Freedom','Mean Square','F-value','p-value'),
    align='lccccc', escape=F) %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

Source	Sums of Squares	Degrees of Freedom	Mean Square	F-value	p-value
Model	$\sum_{i=1}^{n} (\hat{Y}_{i} - \bar{Y})^{2}$	$p$	$\frac{\sum_{i=1}^{n} (\hat{Y}_{i} - \bar{Y})^{2}}{p}$	$F = \frac{\text{MS}_{\text{Model}}}{\text{MS}_{\text{Error}}}$	$Pr(F_{p,n-p-1} > F)$
Error	$\sum_{i=1}^{n} (Y_{i} - \hat{Y}_{i})^{2}$	$n-p-1$	$\frac{\sum_{i=1}^{n} (Y_{i} - \hat{Y}_{i})^{2}}{n-p-1}$
Total	$\sum_{i=1}^{n} (Y_{i} - \bar{Y})^{2}$	$n-1$

Unfortunately, R doesn’t provide the information of the ANOVA table in a nice easy-to-use format. Instead we have to use specific functions to extract relevant pieces of information. However, we can calculate these all piece-by-piece and combine into an ANOVA table.

To illustrate this, let’s first simulate some data to fit our linear regression model to:

Code

set.seed(515)
x <- rnorm(n=100, mean=10, sd=3) # simulate a single continuous predictor with mean=10 and SD=3
reg <- 10 + 3 * x # set the regression equation so the intercept=10 and the slope=3
y <- rnorm(n=100, mean=reg, sd=8) # simulate the outcome based on the conditional mean

mod1 <- glm(y ~ x) # fit linear regression model
summary(mod1)


Call:
glm(formula = y ~ x)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   8.1900     3.3368   2.454   0.0159 *  
x             3.2751     0.3068  10.674   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 83.66255)

    Null deviance: 17731.8  on 99  degrees of freedom
Residual deviance:  8198.9  on 98  degrees of freedom
AIC: 730.45

Number of Fisher Scoring iterations: 2

For example, using the following functions we can extract the corresponding pieces of information:

Code

info_df <- data.frame( 'Term' = c('$Y_{i}$','$\\bar{Y}$','$\\hat{Y}_{i}$','$p$','$n$'),
                       'Code' = c('`y`','`mean(y)`','`predict(mod1)`','`length(mod1$coefficients)-1`','`length(y)`'))
kbl(info_df, escape=F) %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

Term	Code
$Y_{i}$	`y`
$\bar{Y}$	`mean(y)`
$\hat{Y}_{i}$	`predict(mod1)`
$p$	`length(mod1$coefficients)-1`
$n$	`length(y)`

With this information we can apply the various equations from our ANOVA table above:

Code

ybar <- mean(y)
yhat <- predict(mod1)
p <- length(mod1$coefficients)-1
n <- length(y)

ssm <- sum( (yhat-ybar)^2 )
sse <- sum( (y-yhat)^2 )
sst <- sum( (y-ybar)^2 )

msm <- ssm/p
mse <- sse/(n-p-1)

f_val <- msm/mse
p_val <- pf(f_val, df1=p, df2=n-p-1, lower.tail=FALSE)

# Create an ANOVA table to summarize all our results:
p_val_tab <- if(p_val<0.001){'<0.001'}else{round(p_val,3)}

anova_table <- data.frame( 'Source' = c('Model','Error','Total'),
                        'Sums of Squares' = c(round(ssm,2), round(sse,2), round(sst,2)),
                        'Degrees of Freedom' = c(p, n-p-1, n-1),
                        'Mean Square' = c(round(msm,2), round(mse,2),''),
                        'F Value' = c(round(f_val,2),'',''),
                        'p-value' = c(p_val_tab,'',''))

kbl(anova_table, col.names=c('Source','Sums of Squares','Degrees of Freedom','Mean Square','F-value','p-value'), align='lccccc', escape=F) %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

Source	Sums of Squares	Degrees of Freedom	Mean Square	F-value	p-value
Model	9532.85	1	9532.85	113.94	<0.001
Error	8198.93	98	83.66
Total	17731.78	99

Make an ANOVA Function

Better than having to recreate the same code manually each time, let’s write a function to generate a kable table of the ANOVA table for any glm or lm object we provide:

Code

linreg_anova_func <- function(mod, ndigits=2, p_ndigits=3, format='kable'){
### Function to create an ANOVA table linear regression results from lm or glm
# mod: an object with the fitted model results
# ndigits: number of digits to round to for most values, default is 2
# p_digits: number of digits to round the p-value to, default is 3
# format: desired format output (default is kable):
## "kable" for kable table
## "df" for data frame as table

  # extract outcome from the object produced by the glm or lm function
  if( class(mod)[1] == 'glm' ){
    y <- mod$y
  }
  if( class(mod)[1] == 'lm' ){
    y <- mod$model[,1] # first column contains outcome data
  }  
  
  ybar <- mean(y)
  yhat <- predict(mod)
  p <- length(mod$coefficients)-1
  n <- length(y)

  ssm <- sum( (yhat-ybar)^2 )
  sse <- sum( (y-yhat)^2 )
  sst <- sum( (y-ybar)^2 )
  
  msm <- ssm/p
  mse <- sse/(n-p-1)
  
  f_val <- msm/mse
  p_val <- pf(f_val, df1=p, df2=n-p-1, lower.tail=FALSE)
  
  # Create an ANOVA table to summarize all our results:
  p_digits <- (10^(-p_ndigits))
  p_val_tab <- if(p_val<p_digits){paste0('<',p_digits)}else{round(p_val,p_ndigits)}
  
  anova_table <- data.frame( 'Source' = c('Model','Error','Total'),
                          'Sums of Squares' = c(round(ssm,ndigits), round(sse,ndigits), round(sst,ndigits)),
                          'Degrees of Freedom' = c(p, n-p-1, n-1),
                          'Mean Square' = c(round(msm,ndigits), round(mse,ndigits),''),
                          'F Value' = c(round(f_val,ndigits),'',''),
                          'p-value' = c(p_val_tab,'',''))

  if( format == 'kable' ){  
    library(kableExtra)
    kbl(anova_table, col.names=c('Source','Sums of Squares','Degrees of Freedom','Mean Square','F-value','p-value'), align='lccccc', escape=F) %>%
      kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
  }else{
    anova_table
  }
}

Now let’s try applying the function to two data sets (our previous one, and one null), with both glm and lm:

Code

set.seed(515)
x <- rnorm(n=100, mean=10, sd=3) # simulate a single continuous predictor with mean=10 and SD=3
reg <- 10 + 3 * x # set the regression equation so the intercept=10 and the slope=3
y <- rnorm(n=100, mean=reg, sd=8) # simulate the outcome based on the conditional mean
mod1 <- glm(y ~ x) # fit linear regression model with glm
lm1 <- lm(y ~ x) # fit linear regression model with lm

set.seed(303)
x0 <- rnorm(n=100, mean=10, sd=3) # simulate a single continuous predictor with mean=10 and SD=3
reg0 <- 10 + 0 * x0 # set the regression equation so the intercept=10 and the slope=0 (i.e., null)
y0 <- rnorm(n=100, mean=reg0, sd=8) # simulate the outcome based on the conditional mean
mod0 <- glm(y0 ~ x0) # fit linear regression model with glm
lm0 <- lm(y0 ~ x0) # fit linear regression model with lm

First let’s check out our non-null scenario:

Code

linreg_anova_func(mod=mod1)

Source	Sums of Squares	Degrees of Freedom	Mean Square	F-value	p-value
Model	9532.85	1	9532.85	113.94	<0.001
Error	8198.93	98	83.66
Total	17731.78	99

Code

linreg_anova_func(mod=lm1)

Source	Sums of Squares	Degrees of Freedom	Mean Square	F-value	p-value
Model	9532.85	1	9532.85	113.94	<0.001
Error	8198.93	98	83.66
Total	17731.78	99

Now let’s check out the null scenario and tweak our number of digits (both as a raw and formatted output version):

Code

linreg_anova_func(mod=mod0, ndigits=3, p_ndigits=4, format='df')

  Source Sums.of.Squares Degrees.of.Freedom Mean.Square F.Value p.value
1  Model         150.943                  1     150.943   2.596  0.1104
2  Error        5698.200                 98      58.145                
3  Total        5849.143                 99

Code

linreg_anova_func(mod=lm0, ndigits=3, p_ndigits=4, format='kable')

Source	Sums of Squares	Degrees of Freedom	Mean Square	F-value	p-value
Model	150.943	1	150.943	2.596	0.1104
Error	5698.200	98	58.145
Total	5849.143	99

And as a final check, let’s compare the summary(lm) to our $F$-test result above:

Code

summary(lm0)


Call:
lm(formula = y0 ~ x0)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.5065  -5.0881  -0.5393   6.5040  15.8947 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  13.4344     2.7142   4.950 3.09e-06 ***
x0           -0.4132     0.2564  -1.611     0.11    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.625 on 98 degrees of freedom
Multiple R-squared:  0.02581,   Adjusted R-squared:  0.01587 
F-statistic: 2.596 on 1 and 98 DF,  p-value: 0.1104

Source	Sums of Squares	Degrees of Freedom	Mean Square	F-value	p-value
Model	\(\sum_{i=1}^{n} (\hat{Y}_{i} - \bar{Y})^{2}\)	\(p\)	\(\frac{\sum_{i=1}^{n} (\hat{Y}_{i} - \bar{Y})^{2}}{p}\)	\(F = \frac{\text{MS}_{\text{Model}}}{\text{MS}_{\text{Error}}}\)	\(Pr(F_{p,n-p-1} > F)\)
Error	\(\sum_{i=1}^{n} (Y_{i} - \hat{Y}_{i})^{2}\)	\(n-p-1\)	\(\frac{\sum_{i=1}^{n} (Y_{i} - \hat{Y}_{i})^{2}}{n-p-1}\)
Total	\(\sum_{i=1}^{n} (Y_{i} - \bar{Y})^{2}\)	\(n-1\)