Data

In this HTML we will implement the multiple linear regression for the Rosner data set examining systolic blood pressure based on weight (oz) and age (days) for newborns:

SBP Weight Age
89 135 3
90 120 4
83 100 3
77 105 2
92 130 4
98 125 5
82 125 2
85 105 3
96 120 5
95 90 4
80 120 2
79 95 3
86 120 3
97 150 4
92 160 3
88 125 3

We will first implement the multiple linear regression model using matrices, then check our result using lm.

Matrix Approach

Calculating the various components

Create the outcome vector and design matrix

Y <- matrix( c(89,90,83,77,92,98,82,85,96,95,80,79,86,97,92,88), ncol=1 )
X <- matrix( c( rep(1, length(Y)),
                c(135,120,100,105,130,125,125,105,120,90,120,95,120,150,160,125),
                c(3,4,3,2,4,5,2,3,5,4,2,3,3,4,3,3)), 
             ncol=3, byrow=F)
cbind(Y,X)
##       [,1] [,2] [,3] [,4]
##  [1,]   89    1  135    3
##  [2,]   90    1  120    4
##  [3,]   83    1  100    3
##  [4,]   77    1  105    2
##  [5,]   92    1  130    4
##  [6,]   98    1  125    5
##  [7,]   82    1  125    2
##  [8,]   85    1  105    3
##  [9,]   96    1  120    5
## [10,]   95    1   90    4
## [11,]   80    1  120    2
## [12,]   79    1   95    3
## [13,]   86    1  120    3
## [14,]   97    1  150    4
## [15,]   92    1  160    3
## [16,]   88    1  125    3

\(\mathbf{X}^\top \mathbf{X}\)

XtX <- t(X) %*% X
XtX
##      [,1]   [,2] [,3]
## [1,]   16   1925   53
## [2,] 1925 236875 6405
## [3,]   53   6405  189

\(\mathbf{X}^\top \mathbf{Y}\)

XtY <- t(X) %*% Y
XtY
##        [,1]
## [1,]   1409
## [2,] 170350
## [3,]   4750

\(\mathbf{Y}^\top \mathbf{Y}\)

YtY <- t(Y) %*% Y
YtY
##        [,1]
## [1,] 124751

\((\mathbf{X}^\top \mathbf{X})^{-1}\)

XtX_inv <- solve(XtX)
XtX_inv
##             [,1]          [,2]          [,3]
## [1,]  3.34152652 -0.0217335058 -0.2005174644
## [2,] -0.02173351  0.0001918187 -0.0004059419
## [3,] -0.20051746 -0.0004059419  0.0752776910

Calculate \(\hat{\boldsymbol{\beta}}\)

beta_vec <- XtX_inv %*% t(X) %*% Y
beta_vec
##            [,1]
## [1,] 53.4501940
## [2,]  0.1255833
## [3,]  5.8877191

Calculate the MSE (i.e., the variance of Y|X)

mse <- ( YtY - t(beta_vec) %*% t(X) %*% Y ) / ( length(Y) - 2 - 1 ) #denominator is n-p-1, where n=16, p=2 (2 different predictors)
mse
##          [,1]
## [1,] 6.146297

Calculate the covariance matrix for our betas

Sigma_mat <- XtX_inv * as.numeric( mse ) #note: here we had to convert our sigma^2(Y|X) into a numeric from a 1x1 matrix object, otherwise R gives a "Error...non-conformable arrays"
Sigma_mat
##            [,1]         [,2]        [,3]
## [1,] 20.5380142 -0.133580580 -1.23243988
## [2,] -0.1335806  0.001178975 -0.00249504
## [3,] -1.2324399 -0.002495040  0.46267904

Testing a single covariate for \(\hat{\beta}_1\)

Set up a vector to extract \(\hat{\beta}_1\)

c_vec <- matrix(c(0,1,0), nrow=1) 

Extract \(\hat{\beta}_1\)

beta_1 <- c_vec %*% beta_vec
beta_1
##           [,1]
## [1,] 0.1255833

Calculate \(Var(\hat{\beta}_1)\)

var_beta_1 <- c_vec %*% XtX_inv %*% t(c_vec) * as.numeric(mse)
var_beta_1
##             [,1]
## [1,] 0.001178975

Calculate the t-statistic

t_stat <- beta_1 / sqrt( var_beta_1 )
t_stat
##          [,1]
## [1,] 3.657459

Calculate the p-value for our two-sided t-test

pval <- 2 * (1-pt( t_stat, df = length(Y) - 2 - 1) ) # df = n-p-1
pval
##             [,1]
## [1,] 0.002895789

Putting it all together

We can also calculate a 95% confidence interval based on our estimated variability and tie it all together in our brief, but complete, framework:

lci <- beta_1 - qt(0.975,13)*sqrt(var_beta_1)
uci <- beta_1 + qt(0.975,13)*sqrt(var_beta_1)
c(lci, uci)
## [1] 0.05140441 0.19976212

There is a significant increase in SBP (mmHg) for a one unit increase in weight (oz) (p=0.0029). On average, for a one ounce increase in weight, SBP increases by 0.1255833 mmHg (95% CI: 0.0514044, 0.1997621), when adjusting for age.

Compare results to linear regression output

birth <- data.frame(
  sbp = c(89,90,83,77,92,98,82,85,96,95,80,79,86,97,92,88),
  wgt = c(135,120,100,105,130,125,125,105,120,90,120,95,120,150,160,125),
  age = c(3,4,3,2,4,5,2,3,5,4,2,3,3,4,3,3)
)
mod1 <- lm(sbp ~ wgt + age, data=birth)
summary(mod1)
## 
## Call:
## lm(formula = sbp ~ wgt + age, data = birth)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0438 -1.3481 -0.2395  0.9688  6.6964 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 53.45019    4.53189  11.794 2.57e-08 ***
## wgt          0.12558    0.03434   3.657   0.0029 ** 
## age          5.88772    0.68021   8.656 9.34e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.479 on 13 degrees of freedom
## Multiple R-squared:  0.8809, Adjusted R-squared:  0.8626 
## F-statistic: 48.08 on 2 and 13 DF,  p-value: 9.844e-07
confint(mod1)
##                   2.5 %     97.5 %
## (Intercept) 43.65964398 63.2407441
## wgt          0.05140441  0.1997621
## age          4.41822526  7.3572130