This page is part of the University of Colorado-Anschutz Medical Campus’ BIOS 6618 Labs collection, the mini-lecture content delivered at the start of class before breaking out into small groups to work on the homework assignment.
What’s on the docket this week?
This week we will walk through the steps to generating our own ANOVA table for the summary of linear regression models in R. SAS provides a nicely formatted tabled automatically with its PROC REG output, but it is a bit trickier to create in R.
The ANOVA Table
We know the general set-up of the ANOVA looks something like the table below:
Code
library(kableExtra)anova_df <-data.frame( 'Source'=c('Model','Error','Total'),'Sums of Squares'=c('SS~Model~','SS~Error~','SS~Total~'),'Degrees of Freedom'=c('p','n-p-1','n-1'),'Mean Square'=c('MS~Model~','MS~Error~',''),'F Value'=c('MS~Model~/MS~Error~','',''),'p-value'=c('Pr(F~p,n-p-1~ > F)','',''))kbl(anova_df, col.names=c('Source','Sums of Squares','Degrees of Freedom','Mean Square','F-value','p-value'),align='lccccc', escape=F) %>%kable_styling(bootstrap_options ="striped", full_width = F, position ="left")
Source
Sums of Squares
Degrees of Freedom
Mean Square
F-value
p-value
Model
SSModel
p
MSModel
MSModel/MSError
Pr(Fp,n-p-1 > F)
Error
SSError
n-p-1
MSError
Total
SSTotal
n-1
where \(p\) is the number of independent variables in the model (\(p=1\) for simple linear regression) and \(n\) is our total sample size.
It is also possible to represent the table in terms of the relevant equations to calculate each component:
Code
anova_df <-data.frame( 'Source'=c('Model','Error','Total'),'Sums of Squares'=c('$\\sum_{i=1}^{n} (\\hat{Y}_{i} - \\bar{Y})^{2}$','$\\sum_{i=1}^{n} (Y_{i} - \\hat{Y}_{i})^{2}$','$\\sum_{i=1}^{n} (Y_{i} - \\bar{Y})^{2}$'),'Degrees of Freedom'=c('$p$','$n-p-1$','$n-1$'),'Mean Square'=c('$\\frac{\\sum_{i=1}^{n} (\\hat{Y}_{i} - \\bar{Y})^{2}}{p}$','$\\frac{\\sum_{i=1}^{n} (Y_{i} - \\hat{Y}_{i})^{2}}{n-p-1}$',''),'F Value'=c('$F = \\frac{\\text{MS}_{\\text{Model}}}{\\text{MS}_{\\text{Error}}}$','',''),'p-value'=c('$Pr(F_{p,n-p-1} > F)$','',''))kbl(anova_df, col.names=c('Source','Sums of Squares','Degrees of Freedom','Mean Square','F-value','p-value'),align='lccccc', escape=F) %>%kable_styling(bootstrap_options ="striped", full_width = F, position ="left")
Unfortunately, R doesn’t provide the information of the ANOVA table in a nice easy-to-use format. Instead we have to use specific functions to extract relevant pieces of information. However, we can calculate these all piece-by-piece and combine into an ANOVA table.
To illustrate this, let’s first simulate some data to fit our linear regression model to:
Code
set.seed(515)x <-rnorm(n=100, mean=10, sd=3) # simulate a single continuous predictor with mean=10 and SD=3reg <-10+3* x # set the regression equation so the intercept=10 and the slope=3y <-rnorm(n=100, mean=reg, sd=8) # simulate the outcome based on the conditional meanmod1 <-glm(y ~ x) # fit linear regression modelsummary(mod1)
Call:
glm(formula = y ~ x)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.1900 3.3368 2.454 0.0159 *
x 3.2751 0.3068 10.674 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 83.66255)
Null deviance: 17731.8 on 99 degrees of freedom
Residual deviance: 8198.9 on 98 degrees of freedom
AIC: 730.45
Number of Fisher Scoring iterations: 2
For example, using the following functions we can extract the corresponding pieces of information:
With this information we can apply the various equations from our ANOVA table above:
Code
ybar <-mean(y)yhat <-predict(mod1)p <-length(mod1$coefficients)-1n <-length(y)ssm <-sum( (yhat-ybar)^2 )sse <-sum( (y-yhat)^2 )sst <-sum( (y-ybar)^2 )msm <- ssm/pmse <- sse/(n-p-1)f_val <- msm/msep_val <-pf(f_val, df1=p, df2=n-p-1, lower.tail=FALSE)# Create an ANOVA table to summarize all our results:p_val_tab <-if(p_val<0.001){'<0.001'}else{round(p_val,3)}anova_table <-data.frame( 'Source'=c('Model','Error','Total'),'Sums of Squares'=c(round(ssm,2), round(sse,2), round(sst,2)),'Degrees of Freedom'=c(p, n-p-1, n-1),'Mean Square'=c(round(msm,2), round(mse,2),''),'F Value'=c(round(f_val,2),'',''),'p-value'=c(p_val_tab,'',''))kbl(anova_table, col.names=c('Source','Sums of Squares','Degrees of Freedom','Mean Square','F-value','p-value'), align='lccccc', escape=F) %>%kable_styling(bootstrap_options ="striped", full_width = F, position ="left")
Source
Sums of Squares
Degrees of Freedom
Mean Square
F-value
p-value
Model
9532.85
1
9532.85
113.94
<0.001
Error
8198.93
98
83.66
Total
17731.78
99
Make an ANOVA Function
Better than having to recreate the same code manually each time, let’s write a function to generate a kable table of the ANOVA table for any glm or lm object we provide:
Code
linreg_anova_func <-function(mod, ndigits=2, p_ndigits=3, format='kable'){### Function to create an ANOVA table linear regression results from lm or glm# mod: an object with the fitted model results# ndigits: number of digits to round to for most values, default is 2# p_digits: number of digits to round the p-value to, default is 3# format: desired format output (default is kable):## "kable" for kable table## "df" for data frame as table# extract outcome from the object produced by the glm or lm functionif( class(mod)[1] =='glm' ){ y <- mod$y }if( class(mod)[1] =='lm' ){ y <- mod$model[,1] # first column contains outcome data } ybar <-mean(y) yhat <-predict(mod) p <-length(mod$coefficients)-1 n <-length(y) ssm <-sum( (yhat-ybar)^2 ) sse <-sum( (y-yhat)^2 ) sst <-sum( (y-ybar)^2 ) msm <- ssm/p mse <- sse/(n-p-1) f_val <- msm/mse p_val <-pf(f_val, df1=p, df2=n-p-1, lower.tail=FALSE)# Create an ANOVA table to summarize all our results: p_digits <- (10^(-p_ndigits)) p_val_tab <-if(p_val<p_digits){paste0('<',p_digits)}else{round(p_val,p_ndigits)} anova_table <-data.frame( 'Source'=c('Model','Error','Total'),'Sums of Squares'=c(round(ssm,ndigits), round(sse,ndigits), round(sst,ndigits)),'Degrees of Freedom'=c(p, n-p-1, n-1),'Mean Square'=c(round(msm,ndigits), round(mse,ndigits),''),'F Value'=c(round(f_val,ndigits),'',''),'p-value'=c(p_val_tab,'',''))if( format =='kable' ){ library(kableExtra)kbl(anova_table, col.names=c('Source','Sums of Squares','Degrees of Freedom','Mean Square','F-value','p-value'), align='lccccc', escape=F) %>%kable_styling(bootstrap_options ="striped", full_width = F, position ="left") }else{ anova_table }}
Now let’s try applying the function to two data sets (our previous one, and one null), with both glm and lm:
Code
set.seed(515)x <-rnorm(n=100, mean=10, sd=3) # simulate a single continuous predictor with mean=10 and SD=3reg <-10+3* x # set the regression equation so the intercept=10 and the slope=3y <-rnorm(n=100, mean=reg, sd=8) # simulate the outcome based on the conditional meanmod1 <-glm(y ~ x) # fit linear regression model with glmlm1 <-lm(y ~ x) # fit linear regression model with lmset.seed(303)x0 <-rnorm(n=100, mean=10, sd=3) # simulate a single continuous predictor with mean=10 and SD=3reg0 <-10+0* x0 # set the regression equation so the intercept=10 and the slope=0 (i.e., null)y0 <-rnorm(n=100, mean=reg0, sd=8) # simulate the outcome based on the conditional meanmod0 <-glm(y0 ~ x0) # fit linear regression model with glmlm0 <-lm(y0 ~ x0) # fit linear regression model with lm
First let’s check out our non-null scenario:
Code
linreg_anova_func(mod=mod1)
Source
Sums of Squares
Degrees of Freedom
Mean Square
F-value
p-value
Model
9532.85
1
9532.85
113.94
<0.001
Error
8198.93
98
83.66
Total
17731.78
99
Code
linreg_anova_func(mod=lm1)
Source
Sums of Squares
Degrees of Freedom
Mean Square
F-value
p-value
Model
9532.85
1
9532.85
113.94
<0.001
Error
8198.93
98
83.66
Total
17731.78
99
Now let’s check out the null scenario and tweak our number of digits (both as a raw and formatted output version):
And as a final check, let’s compare the summary(lm) to our \(F\)-test result above:
Code
summary(lm0)
Call:
lm(formula = y0 ~ x0)
Residuals:
Min 1Q Median 3Q Max
-15.5065 -5.0881 -0.5393 6.5040 15.8947
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.4344 2.7142 4.950 3.09e-06 ***
x0 -0.4132 0.2564 -1.611 0.11
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.625 on 98 degrees of freedom
Multiple R-squared: 0.02581, Adjusted R-squared: 0.01587
F-statistic: 2.596 on 1 and 98 DF, p-value: 0.1104
Source Code
---title: "Week 8 Lab"author: name: Alex Kaizer roles: "Instructor" affiliation: University of Colorado-Anschutz Medical Campustoc: truetoc_float: truetoc-location: leftformat: html: code-fold: show code-overflow: wrap code-tools: true---```{r, echo=F, message=F, warning=F}library(kableExtra)library(dplyr)```This page is part of the University of Colorado-Anschutz Medical Campus' [BIOS 6618 Labs](/labs/index.qmd) collection, the mini-lecture content delivered at the start of class before breaking out into small groups to work on the homework assignment.# What's on the docket this week?This week we will walk through the steps to generating our own ANOVA table for the summary of linear regression models in `R`. SAS provides a nicely formatted tabled automatically with its `PROC REG` output, but it is a bit trickier to create in `R`.# The ANOVA TableWe know the general set-up of the ANOVA looks something like the table below:```{r, message=F, class.source = 'fold-hide'}#| code-fold: true#| fig-cap: "where $p$ is the number of independent variables in the model ($p=1$ for simple linear regression) and $n$ is our total sample size."#| filters:#| - parse-latexlibrary(kableExtra)anova_df <-data.frame( 'Source'=c('Model','Error','Total'),'Sums of Squares'=c('SS~Model~','SS~Error~','SS~Total~'),'Degrees of Freedom'=c('p','n-p-1','n-1'),'Mean Square'=c('MS~Model~','MS~Error~',''),'F Value'=c('MS~Model~/MS~Error~','',''),'p-value'=c('Pr(F~p,n-p-1~ > F)','',''))kbl(anova_df, col.names=c('Source','Sums of Squares','Degrees of Freedom','Mean Square','F-value','p-value'),align='lccccc', escape=F) %>%kable_styling(bootstrap_options ="striped", full_width = F, position ="left")```It is also possible to represent the table in terms of the relevant equations to calculate each component:```{r, message=F, class.source = 'fold-hide'}#| code-fold: true#| fig-cap: "Unfortunately, `R` doesn't provide the information of the ANOVA table in a nice easy-to-use format. Instead we have to use specific functions to extract relevant pieces of information. However, we can calculate these all piece-by-piece and combine into an ANOVA table."#| filters:#| - parse-latexanova_df <-data.frame( 'Source'=c('Model','Error','Total'),'Sums of Squares'=c('$\\sum_{i=1}^{n} (\\hat{Y}_{i} - \\bar{Y})^{2}$','$\\sum_{i=1}^{n} (Y_{i} - \\hat{Y}_{i})^{2}$','$\\sum_{i=1}^{n} (Y_{i} - \\bar{Y})^{2}$'),'Degrees of Freedom'=c('$p$','$n-p-1$','$n-1$'),'Mean Square'=c('$\\frac{\\sum_{i=1}^{n} (\\hat{Y}_{i} - \\bar{Y})^{2}}{p}$','$\\frac{\\sum_{i=1}^{n} (Y_{i} - \\hat{Y}_{i})^{2}}{n-p-1}$',''),'F Value'=c('$F = \\frac{\\text{MS}_{\\text{Model}}}{\\text{MS}_{\\text{Error}}}$','',''),'p-value'=c('$Pr(F_{p,n-p-1} > F)$','',''))kbl(anova_df, col.names=c('Source','Sums of Squares','Degrees of Freedom','Mean Square','F-value','p-value'),align='lccccc', escape=F) %>%kable_styling(bootstrap_options ="striped", full_width = F, position ="left")```To illustrate this, let's first simulate some data to fit our linear regression model to:```{r,echo=T}set.seed(515)x <-rnorm(n=100, mean=10, sd=3) # simulate a single continuous predictor with mean=10 and SD=3reg <-10+3* x # set the regression equation so the intercept=10 and the slope=3y <-rnorm(n=100, mean=reg, sd=8) # simulate the outcome based on the conditional meanmod1 <-glm(y ~ x) # fit linear regression modelsummary(mod1)```For example, using the following functions we can extract the corresponding pieces of information:```{r,message=F, class.source = 'fold-hide'}#| code-fold: true#| fig-cap: "With this information we can apply the various equations from our ANOVA table above:"#| filters:#| - parse-latexinfo_df <-data.frame( 'Term'=c('$Y_{i}$','$\\bar{Y}$','$\\hat{Y}_{i}$','$p$','$n$'),'Code'=c('`y`','`mean(y)`','`predict(mod1)`','`length(mod1$coefficients)-1`','`length(y)`'))kbl(info_df, escape=F) %>%kable_styling(bootstrap_options ="striped", full_width = F, position ="left")``````{r}ybar <-mean(y)yhat <-predict(mod1)p <-length(mod1$coefficients)-1n <-length(y)ssm <-sum( (yhat-ybar)^2 )sse <-sum( (y-yhat)^2 )sst <-sum( (y-ybar)^2 )msm <- ssm/pmse <- sse/(n-p-1)f_val <- msm/msep_val <-pf(f_val, df1=p, df2=n-p-1, lower.tail=FALSE)# Create an ANOVA table to summarize all our results:p_val_tab <-if(p_val<0.001){'<0.001'}else{round(p_val,3)}anova_table <-data.frame( 'Source'=c('Model','Error','Total'),'Sums of Squares'=c(round(ssm,2), round(sse,2), round(sst,2)),'Degrees of Freedom'=c(p, n-p-1, n-1),'Mean Square'=c(round(msm,2), round(mse,2),''),'F Value'=c(round(f_val,2),'',''),'p-value'=c(p_val_tab,'',''))kbl(anova_table, col.names=c('Source','Sums of Squares','Degrees of Freedom','Mean Square','F-value','p-value'), align='lccccc', escape=F) %>%kable_styling(bootstrap_options ="striped", full_width = F, position ="left")```## Make an ANOVA FunctionBetter than having to recreate the same code manually each time, let's write a function to generate a `kable` table of the ANOVA table for any `glm` or `lm` object we provide:```{r, class.source = 'fold-show'}linreg_anova_func <-function(mod, ndigits=2, p_ndigits=3, format='kable'){### Function to create an ANOVA table linear regression results from lm or glm# mod: an object with the fitted model results# ndigits: number of digits to round to for most values, default is 2# p_digits: number of digits to round the p-value to, default is 3# format: desired format output (default is kable):## "kable" for kable table## "df" for data frame as table# extract outcome from the object produced by the glm or lm functionif( class(mod)[1] =='glm' ){ y <- mod$y }if( class(mod)[1] =='lm' ){ y <- mod$model[,1] # first column contains outcome data } ybar <-mean(y) yhat <-predict(mod) p <-length(mod$coefficients)-1 n <-length(y) ssm <-sum( (yhat-ybar)^2 ) sse <-sum( (y-yhat)^2 ) sst <-sum( (y-ybar)^2 ) msm <- ssm/p mse <- sse/(n-p-1) f_val <- msm/mse p_val <-pf(f_val, df1=p, df2=n-p-1, lower.tail=FALSE)# Create an ANOVA table to summarize all our results: p_digits <- (10^(-p_ndigits)) p_val_tab <-if(p_val<p_digits){paste0('<',p_digits)}else{round(p_val,p_ndigits)} anova_table <-data.frame( 'Source'=c('Model','Error','Total'),'Sums of Squares'=c(round(ssm,ndigits), round(sse,ndigits), round(sst,ndigits)),'Degrees of Freedom'=c(p, n-p-1, n-1),'Mean Square'=c(round(msm,ndigits), round(mse,ndigits),''),'F Value'=c(round(f_val,ndigits),'',''),'p-value'=c(p_val_tab,'',''))if( format =='kable' ){ library(kableExtra)kbl(anova_table, col.names=c('Source','Sums of Squares','Degrees of Freedom','Mean Square','F-value','p-value'), align='lccccc', escape=F) %>%kable_styling(bootstrap_options ="striped", full_width = F, position ="left") }else{ anova_table }}```Now let's try applying the function to two data sets (our previous one, and one null), with both `glm` and `lm`:```{r, class.source = 'fold-show'}set.seed(515)x <-rnorm(n=100, mean=10, sd=3) # simulate a single continuous predictor with mean=10 and SD=3reg <-10+3* x # set the regression equation so the intercept=10 and the slope=3y <-rnorm(n=100, mean=reg, sd=8) # simulate the outcome based on the conditional meanmod1 <-glm(y ~ x) # fit linear regression model with glmlm1 <-lm(y ~ x) # fit linear regression model with lmset.seed(303)x0 <-rnorm(n=100, mean=10, sd=3) # simulate a single continuous predictor with mean=10 and SD=3reg0 <-10+0* x0 # set the regression equation so the intercept=10 and the slope=0 (i.e., null)y0 <-rnorm(n=100, mean=reg0, sd=8) # simulate the outcome based on the conditional meanmod0 <-glm(y0 ~ x0) # fit linear regression model with glmlm0 <-lm(y0 ~ x0) # fit linear regression model with lm```First let's check out our non-null scenario:```{r, class.source = 'fold-show'}linreg_anova_func(mod=mod1)linreg_anova_func(mod=lm1)```Now let's check out the null scenario and tweak our number of digits (both as a raw and formatted output version):```{r, class.source = 'fold-show'}linreg_anova_func(mod=mod0, ndigits=3, p_ndigits=4, format='df')linreg_anova_func(mod=lm0, ndigits=3, p_ndigits=4, format='kable')```And as a final check, let's compare the `summary(lm)` to our $F$-test result above:```{r, class.source = 'fold-show'}summary(lm0)```