---
title: "Generalized Linear Models and Their Connection to Ordinary Least Squares Linear Regression"
author:
  - name: Alex Kaizer
    roles: "Instructor"
    affiliation: University of Colorado-Anschutz Medical Campus
toc: true
toc_float: true
toc-location: left
format:
  html:
    code-fold: show
    code-overflow: wrap
    code-tools: true
---
```{r, echo=F, message=F, warning=F}
library(kableExtra)
library(dplyr)
```
This page is part of the University of Colorado-Anschutz Medical Campus' [BIOS 6618 Recitation](/recitation/index.qmd) collection. To view other questions, you can visit the [BIOS 6618 Recitation](/recitation/index.qmd) collection page or use the search bar to look for keywords.
# Generalized Linear Models
It turns out that much of what we have done this semester with linear regression can be directly related to a larger family of models known as *generalized linear models*. These models are specified by a **family** and a **link function**.
Families refer to the type of outcome:
- Gaussian (i.e., normal) for continuous outcomes
- Gamma for continuous outcomes restricted to positive values (also often skewed)
- Binomial for binary outcomes
- Poisson for count outcomes
- Quasi-models (binomial, Poisson) that relax the dispersion parameter of our exponential families (i.e., are a little more flexible)
The link function specifies how we connect the mean of our outcome to our predictors and can vary by family. For Gaussian models there are three common options (a brief sketch of the corresponding `glm()` syntax follows the list):
- Identity: $E[Y] = \beta_0 + \beta_1 X_1 + \beta_2 X_2$
- Inverse: $\frac{1}{E[Y]} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 \implies E[Y] = \frac{1}{\beta_0 + \beta_1 X_1 + \beta_2 X_2}$
- Log: $\log(E[Y]) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 \implies E[Y] = \exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2) = \exp(\beta_0) \times \exp(\beta_1 X_1) \times \exp(\beta_2 X_2)$ (i.e., our multiplicative interpretation, since the relationship between $E[Y]$ and $\mathbf{X}$ is no longer additive)
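For reference, the families and links above map onto R's `glm()` calls roughly as follows. This is a minimal sketch added for illustration (it is not part of the original examples); `y` and `x` are placeholders rather than the variables simulated below, and the links shown are defaults or common choices for each family.

```{r, eval=F}
# NOT RUN: illustrative glm() calls for each family above; y and x are placeholders
glm(y ~ x, family = gaussian(link = "identity"))  # continuous outcome
glm(y ~ x, family = Gamma(link = "log"))          # positive, often skewed continuous outcome
glm(y ~ x, family = binomial(link = "logit"))     # binary outcome
glm(y ~ x, family = poisson(link = "log"))        # count outcome
glm(y ~ x, family = quasipoisson(link = "log"))   # count outcome with a flexible dispersion parameter
```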
Let's look at some examples of these and how they sometimes connect to other topics from the semester.
## Gaussian, Identity Link
Let's see some examples of how our models change with different approaches. We'll simulate a simple case of $n=50$ with a few predictors:
```{r}
set.seed(1207) # set seed for reproducibility
n <- 50
x1 <- rnorm(n=n, mean=10, sd=2)       # continuous predictor
x2 <- rexp(n=n, rate=0.5)             # skewed, positive continuous predictor
x3 <- rbinom(n=n, size=1, prob=0.6)   # binary predictor
error <- rnorm(n=n, mean=0, sd=3)     # random error term
y <- 150 + -2*x1 + 4*x2 - 5*x3 + error # outcome generated from an additive linear model
```
```{r}
# Gaussian, Identity link
m_identity <- glm(y ~ x1 + x2 + x3, family=gaussian(link = "identity"))
summary(m_identity)
```
This is our tried and true model, our linear regression with an additive interpretation. In this case, our coefficient estimates are pretty darn close to the true values we set in the simulation!
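As a quick check of the connection to ordinary least squares (a small sketch added here for illustration; `m_lm` is simply a new name for the `lm()` fit), the Gaussian family with the identity link should reproduce the OLS estimates from `lm()`:

```{r}
# The Gaussian family with identity link is ordinary least squares, so lm()
# should return essentially identical coefficient estimates
m_lm <- lm(y ~ x1 + x2 + x3)
cbind(glm = coef(m_identity), lm = coef(m_lm))
```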
## Gaussian, Inverse Link
```{r}
# Gaussian, Inverse link
m_inverse <- glm(y ~ x1 + x2 + x3, family=gaussian(link = "inverse"))
summary(m_inverse)
```
These estimates don't have as nice an interpretation, and it is hard to tell if they seem "right" given our simulation setting.
However, if we instead define a new outcome whose values are the inverse of the linear predictor (plus error), the estimates will line up with what we set in our simulation:
```{r}
y2 <- 1 / (150 + -2*x1 + 4*x2 - 5*x3 + error) # outcome defined as the inverse of the linear predictor (plus error)
m_inverse2 <- glm(y2 ~ x1 + x2 + x3, family=gaussian(link = "inverse"))
summary(m_inverse2)
```
This suggests that the relationship between $Y$ and $\mathbf{X}$ follows an inverse link (although note these are not exactly the same $\hat{\beta}$'s we obtained from the identity link). However, it still doesn't give us a nice interpretation of the $\hat{\beta}$'s.
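One way to make the inverse-link fit a bit more concrete (a short sketch added for illustration) is to look at the fitted values on both scales: `type = "link"` returns the linear predictor $\mathbf{X}\hat{\boldsymbol{\beta}}$, while `type = "response"` returns its inverse, the estimated mean of `y2`.

```{r}
# Compare observed y2 to fitted values on the link scale (linear predictor)
# and the response scale (1 / linear predictor) for the first few observations
head(cbind(observed = y2,
           fitted_link = predict(m_inverse2, type = "link"),
           fitted_response = predict(m_inverse2, type = "response")))
```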
## Gaussian, Log Link
The idea behind this model is perhaps most similar to what we saw earlier this semester but with some different assumptions:
```{r}
# Gaussian, Log link
m_log <- glm(y ~ x1 + x2 + x3, family=gaussian(link = "log"))
summary(m_log)
```
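Because the log link implies the multiplicative relationship noted in the link list above, one way to interpret these coefficients (a brief sketch added for illustration) is to exponentiate them:

```{r}
# Exponentiated log-link coefficients: each is the estimated multiplicative
# change in the mean of y for a one-unit increase in that predictor,
# holding the other predictors constant
exp(coef(m_log))
```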
Let's compare it to a Gaussian family with the identity link but with $\log(Y)$:
```{r}
# Gaussian, Identity Link with log(Y)
m_identity_lnY <- glm(log(y) ~ x1 + x2 + x3, family=gaussian(link = "identity"))
summary(m_identity_lnY)
```
In this case, our coefficient estimates are not identical (although, in fact, they happen to be quite similar). BUT, it turns out these two models are making very different assumptions:
- The log link transforms the model itself (i.e., we have a log-likelihood of $l(\mu,\sigma^2;y) = \sum_{i=1}^{n} \left\{\frac{y_i \exp(\mathbf{x}_i'\boldsymbol{\beta}) - (\exp(\mathbf{x}_i'\boldsymbol{\beta}))^2 / 2}{\sigma^2} - \frac{y_i^2}{2 \sigma^2} - \frac{1}{2}\log(2\pi\sigma^2) \right\}$). The Gaussian errors are still on the natural scale, so the error variance is constant for all mean values of $Y$.
- The $\log(Y)$ with identity link transforms the data itself. Here, if we retransform $\log(Y)$ back to $Y$ the variance will change with the mean.
Why might we care? It turns out the log link avoids the retransformation problem: the $\log(Y)$ model estimates $E[\log(Y)|X] = \mathbf{X}'\boldsymbol{\beta}$, but $E(\log(Y)|X) \neq \log(E(Y|X))$, whereas the log link models $\log(E(Y|X)) = \mathbf{X}'\boldsymbol{\beta}$ directly.
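As a rough illustration of why this matters (a sketch added here, using the model objects fit above), we can compare each model's implied predictions for $Y$: the log-link model estimates $E[Y|X]$ directly, while exponentiating fitted values from the $\log(Y)$ model targets $\exp(E[\log(Y)|X])$, which by Jensen's inequality is generally smaller than $E[Y|X]$.

```{r}
# Naively exponentiating fitted values from the log(Y) model does not target
# the same quantity as the log-link model's fitted means E[Y|X]
head(cbind(observed_y = y,
           log_link_fit = predict(m_log, type = "response"),
           exp_logY_fit = exp(predict(m_identity_lnY))))
```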