This page is part of the University of Colorado-Anschutz Medical Campus’ BIOS 6618 Recitation collection. To view other questions, you can view the BIOS 6618 Recitation collection page or use the search bar to look for keywords.
Polynomial Regression
As a motivating example, let’s consider a example from the NHANES 2015-2016 cycle. We are interested in the association between sex hormone binding globulin (SHBG) and age in years for males 6-80 years old from the NHANES 2015-2016 cycle (\(n=3390\) with SHBG). Here we will fit our models using the poly() function with raw=T to mimic a model with higher order terms:
Interpretation of Trends
Let’s examine the model fit for each of our models:
Order 1: \(\hat{Y} = 55.903 + -0.095 X\)
Order 2: \(\hat{Y} = 102.362 + -3.351 X + 0.039 X^2\)
Order 3: \(\hat{Y} = 152.581 + -9.177 X + 0.203 X^2 + -0.001 X^3\)
Order 4: \(\hat{Y} = 221.308 + -20.371 X + 0.730 X^2 + -0.011 X^3 + 0.00005 X^4\)
Our simple linear regression model (i.e., a polynomial regression with order 1) is easy to interpret and follows our usual pattern: For a 1 year increase in age, SHBG significantly decreases on average by 0.095 nmol/L (95% CI: 0.042 to 0.148 nmol/L decrease; p<0.001).
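For reference, this slope and its 95% confidence interval can be pulled directly from the fitted order-1 model (assuming mod_lm is the simple linear regression of SHBG on RIDAGEYR fit to these data, as above):
Code
# slope for age (per year), standard error, t value, and p-value
coef(summary(mod_lm))['RIDAGEYR', ]
# 95% confidence interval for the age slope
confint(mod_lm, 'RIDAGEYR')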
However, the interpretation of any model with higher order terms is far more challenging. Now, the expected change in our outcome also depends on the value of both \(X\) and \(X^2\), since \(X=20 \implies X^2=400\) is not an equivalent change to \(X=60 \implies X^2=3600\). In other words, a one-unit increase in age will have a different effect depending on the starting point, given the non-linear trend over age.
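To make this concrete, here is a quick numeric check using the rounded quadratic (Order 2) equation above; the exact values will differ slightly with the unrounded coefficients:
Code
# predicted SHBG from the rounded quadratic fit
f2hat <- function(x) 102.362 - 3.351*x + 0.039*x^2
f2hat(21) - f2hat(20) # change for a 1-year increase at age 20: about -1.75 nmol/L
f2hat(61) - f2hat(60) # change for a 1-year increase at age 60: about +1.37 nmol/L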
Writing a “brief, but complete” interpretation is a little more challenging since we cannot meaningfully interpret the individual beta coefficients, and the individual p-values are affected by multicollinearity (in this model with raw, not orthogonal, polynomial terms). However, you can consider a variety of approaches:
One of the easiest ways to interpret these models is to plot the fitted trend and describe it.
Evaluate the overall contribution of the polynomials (e.g., with a partial \(F\)-test).
Describe the predicted outcome at select predictor values (e.g., using GLHT; this may be challenging if adjusting for other variables since those would also have an influence); a brief example is sketched after this list.
Calculate the local/global min/max, estimate inflection points, or estimate rates of change using calculus.
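As a quick illustration of the third approach (here using predict() at selected ages rather than a formal GLHT, and assuming mod_poly2 is the unadjusted Order 2 fit to these data):
Code
# expected SHBG (with 95% confidence intervals) at a few selected ages
predict(mod_poly2, newdata = data.frame(RIDAGEYR = c(20, 40, 60)), interval = 'confidence')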
Calculus in R
Consider our quadratic polynomial. We can estimate the min/max point by taking the first order derivative with respect to \(X\) of our model and setting it equal to 0, then solving for \(X\):
\[\begin{align*}
\hat{Y} = f(X) &= 102.362 + -3.351 X + 0.039 X^2 \\
f'(X) &= -3.351 + 2 \times 0.039 X = -3.351 + 0.078 X \\
0 &= -3.351 + 0.078 X \implies X = \frac{3.351}{0.078} = 42.96
\end{align*}\]
If we plug this back into our regression equation we get the predicted minimum: \(\hat{Y} = 102.362 + -3.351(42.96) + 0.039(42.96)^2 = 30.38\) nmol/L of SHBG.
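As a quick check of that arithmetic:
Code
x_min <- 3.351 / 0.078 # age where the first derivative equals 0
x_min # about 42.96 years
102.362 - 3.351*x_min + 0.039*x_min^2 # predicted minimum, about 30.4 nmol/L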
R also has the ability to do this calculus for us:
Code
# first write our expression
f <- expression( 102.362 + -3.351*x + 0.039*x^2 )
# then use the D() function and specify the variable to derive with respect to
D(f, 'x')
0.039 * (2 * x) - 3.351
Code
# uniroot() function calculates where it equals 0 given a range of X values
find_root <- function(x) eval( D(f, 'x') )
uniroot(find_root, c(6, 80))
More generally, we can use mosaicCalc::findZeros() to find all candidate points:
Code
library(mosaicCalc)
# first write our expressions for the first order derivatives, then findZeros
f2 <- D( 102.362 + -3.351*x + 0.039*x^2 ~ x )
findZeros(f2(x) ~ x, xlim = range(6, 80))
f3 <- D( 152.581 + -9.177*x + 0.203*x^2 + -0.001*x^3 ~ x )
findZeros(f3(x) ~ x, xlim = range(6, 80))
f4 <- D( 221.308 + -20.371*x + 0.730*x^2 + -0.011*x^3 + 0.00005*x^4 ~ x )
findZeros(f4(x) ~ x, xlim = range(6, 80))
# notice that none were found for f4 due to rounding; we can instead use paste0() and pull the coefficients directly:
f4_eq <- paste0(coef(mod_poly4)[1], '+', coef(mod_poly4)[2], '*x +', coef(mod_poly4)[3], '*x^2 +', coef(mod_poly4)[4], '*x^3 +', coef(mod_poly4)[5], '*x^4 ~ x')
f4_eq # print regression equation with more decimals to check
f4_coef <- D( as.formula(f4_eq) )
findZeros(f4_coef(x) ~ x, xlim = range(6, 80))
From these models, we could interpret that the minimum SHBG is expected, on average, at 43 (quadratic), 28.7 (cubic), or 25.9 (quartic) years old. We could also estimate the minimum (and a 95% CI) by plugging this age back into our regression equation.
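For example, a sketch of this plug-in estimate for the quadratic model using predict() (note this treats the estimated age of the minimum, about 43 years, as fixed and does not account for its own estimation uncertainty):
Code
# estimated mean SHBG and 95% CI at the age minimizing the quadratic fit
predict(mod_poly2, newdata = data.frame(RIDAGEYR = 42.96), interval = 'confidence')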
Polynomial Regression and Parsimony
When we fit a regression, we want to find a model that avoids under- and over-fitting the data. Further, we also want to consider the parsimony of our model (i.e., does it have to be as complex or could it be simpler?).
Parsimony in the context of polynomial regression means considering models with fewer higher order terms both to avoid overfitting and having extra, potentially unnecessary, coefficients in our model.
The evaluation of parsimony is often more subjective, but as a starting place we can use quantitative approaches to determine how many polynomial terms are statistically justified.
Partial F-test on Raw Polynomials
If we fit models on our raw polynomials, we may have concerns about collinearity, which could make it challenging to identify the appropriate number of terms.
Let’s start by fitting a sequence of models and running partial F-tests:
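The sequence of raw-polynomial fits and the single anova() call that produces the partial F-tests below can be set up as follows:
Code
mod_lm     <- lm(SHBG ~ RIDAGEYR, data = dat)
mod_poly2  <- lm(SHBG ~ poly(RIDAGEYR, 2,  raw = T), data = dat)
mod_poly3  <- lm(SHBG ~ poly(RIDAGEYR, 3,  raw = T), data = dat)
mod_poly4  <- lm(SHBG ~ poly(RIDAGEYR, 4,  raw = T), data = dat)
mod_poly5  <- lm(SHBG ~ poly(RIDAGEYR, 5,  raw = T), data = dat)
mod_poly6  <- lm(SHBG ~ poly(RIDAGEYR, 6,  raw = T), data = dat)
mod_poly7  <- lm(SHBG ~ poly(RIDAGEYR, 7,  raw = T), data = dat)
mod_poly8  <- lm(SHBG ~ poly(RIDAGEYR, 8,  raw = T), data = dat)
mod_poly9  <- lm(SHBG ~ poly(RIDAGEYR, 9,  raw = T), data = dat)
mod_poly10 <- lm(SHBG ~ poly(RIDAGEYR, 10, raw = T), data = dat)
mod_poly11 <- lm(SHBG ~ poly(RIDAGEYR, 11, raw = T), data = dat)
# partial F-tests: each row tests the contribution of the newly added highest-order term
anova(mod_lm, mod_poly2, mod_poly3, mod_poly4, mod_poly5, mod_poly6,
      mod_poly7, mod_poly8, mod_poly9, mod_poly10, mod_poly11)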
Analysis of Variance Table
Model 1: SHBG ~ RIDAGEYR
Model 2: SHBG ~ poly(RIDAGEYR, 2, raw = T)
Model 3: SHBG ~ poly(RIDAGEYR, 3, raw = T)
Model 4: SHBG ~ poly(RIDAGEYR, 4, raw = T)
Model 5: SHBG ~ poly(RIDAGEYR, 5, raw = T)
Model 6: SHBG ~ poly(RIDAGEYR, 6, raw = T)
Model 7: SHBG ~ poly(RIDAGEYR, 7, raw = T)
Model 8: SHBG ~ poly(RIDAGEYR, 8, raw = T)
Model 9: SHBG ~ poly(RIDAGEYR, 9, raw = T)
Model 10: SHBG ~ poly(RIDAGEYR, 10, raw = T)
Model 11: SHBG ~ poly(RIDAGEYR, 11, raw = T)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 3388 4286690
2 3387 3276833 1 1009857 1351.0353 < 2.2e-16 ***
3 3386 2873485 1 403348 539.6180 < 2.2e-16 ***
4 3385 2618205 1 255281 341.5269 < 2.2e-16 ***
5 3384 2573535 1 44670 59.7615 1.403e-14 ***
6 3383 2558283 1 15252 20.4048 6.482e-06 ***
7 3382 2550263 1 8020 10.7289 0.001065 **
8 3381 2537447 1 12817 17.1467 3.545e-05 ***
9 3380 2529858 1 7589 10.1524 0.001454 **
10 3379 2525059 1 4799 6.4198 0.011330 *
11 3378 2524950 1 109 0.1465 0.701973
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on these comparisons, it seems that a model with order 10 would be considered the “best”, since the addition of the order 11 term is the first comparison that is not statistically significant (p=0.702).
Lack-of-Fit F-test
When we have replicate observations at each level of the predictor, we can also estimate the pure error versus the model (lack-of-fit) error and use it to identify the number of polynomial terms. In the case of our NHANES data, age is only recorded in whole years, so we have many replicate observations at each age and can fit a saturated model treating age as a categorical (factor) variable:
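A compact sketch of these comparisons: fit the saturated (“pure error”) model with age as a factor, then run a partial F-test of each raw-polynomial fit against it (this assumes the mod_lm and mod_poly2 through mod_poly11 fits from above):
Code
# saturated model: one mean per observed age
mod_pure <- lm(SHBG ~ as.factor(RIDAGEYR), data = dat)
fits <- list('Order 1' = mod_lm,     'Order 2' = mod_poly2,  'Order 3' = mod_poly3,
             'Order 4' = mod_poly4,  'Order 5' = mod_poly5,  'Order 6' = mod_poly6,
             'Order 7' = mod_poly7,  'Order 8' = mod_poly8,  'Order 9' = mod_poly9,
             'Order 10' = mod_poly10, 'Order 11' = mod_poly11)
# lack-of-fit p-values for Orders 1 through 11
round(sapply(fits, function(m) anova(mod_pure, m)[2, 'Pr(>F)']), 4)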
Order 1 Order 2 Order 3 Order 4 Order 5 Order 6 Order 7 Order 8
0.0000 0.0000 0.0000 0.0000 0.0001 0.0044 0.0234 0.2086
Order 9 Order 10 Order 11
0.4763 0.6655 0.6374
Our results indicate that an octic (order 8) polynomial regression is not significantly better than a septic (order 7) polynomial regression. However, the septic (order 7) is better than the sextic (order 6).
This differs from our approach using just raw polynomials, where we chose a more complex order 10 model as “best”.
Orthogonal Polynomial Contrasts
An approach we alluded to, but didn’t delve into the details of, is the construction of orthogonal polynomial contrasts, which remove the correlation between polynomial terms. While the math behind it is a bit lengthy, it is fairly easy to implement in R by removing the raw=T argument from poly() (or setting it to the default raw=FALSE):
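The orthogonal fits are set up the same way as before, just with raw=F; the summary of the Order 11 fit is shown below:
Code
mod_orth1  <- lm(SHBG ~ poly(RIDAGEYR, 1,  raw = F), data = dat)
mod_orth2  <- lm(SHBG ~ poly(RIDAGEYR, 2,  raw = F), data = dat)
mod_orth3  <- lm(SHBG ~ poly(RIDAGEYR, 3,  raw = F), data = dat)
mod_orth4  <- lm(SHBG ~ poly(RIDAGEYR, 4,  raw = F), data = dat)
mod_orth5  <- lm(SHBG ~ poly(RIDAGEYR, 5,  raw = F), data = dat)
mod_orth6  <- lm(SHBG ~ poly(RIDAGEYR, 6,  raw = F), data = dat)
mod_orth7  <- lm(SHBG ~ poly(RIDAGEYR, 7,  raw = F), data = dat)
mod_orth8  <- lm(SHBG ~ poly(RIDAGEYR, 8,  raw = F), data = dat)
mod_orth9  <- lm(SHBG ~ poly(RIDAGEYR, 9,  raw = F), data = dat)
mod_orth10 <- lm(SHBG ~ poly(RIDAGEYR, 10, raw = F), data = dat)
mod_orth11 <- lm(SHBG ~ poly(RIDAGEYR, 11, raw = F), data = dat)
# print summary of the Order 11 orthogonal polynomial regression
summary(mod_orth11)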
Call:
lm(formula = SHBG ~ poly(RIDAGEYR, 11, raw = F), data = dat)
Residuals:
Min 1Q Median 3Q Max
-100.74 -15.81 -4.69 10.99 254.05
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 52.2391 0.4696 111.250 < 2e-16 ***
poly(RIDAGEYR, 11, raw = F)1 -125.7103 27.3399 -4.598 4.42e-06 ***
poly(RIDAGEYR, 11, raw = F)2 1004.9164 27.3399 36.756 < 2e-16 ***
poly(RIDAGEYR, 11, raw = F)3 -635.0966 27.3399 -23.230 < 2e-16 ***
poly(RIDAGEYR, 11, raw = F)4 505.2531 27.3399 18.480 < 2e-16 ***
poly(RIDAGEYR, 11, raw = F)5 -211.3525 27.3399 -7.731 1.40e-14 ***
poly(RIDAGEYR, 11, raw = F)6 123.4989 27.3399 4.517 6.48e-06 ***
poly(RIDAGEYR, 11, raw = F)7 89.5518 27.3399 3.276 0.00107 **
poly(RIDAGEYR, 11, raw = F)8 -113.2105 27.3399 -4.141 3.54e-05 ***
poly(RIDAGEYR, 11, raw = F)9 87.1125 27.3399 3.186 0.00145 **
poly(RIDAGEYR, 11, raw = F)10 -69.2721 27.3399 -2.534 0.01133 *
poly(RIDAGEYR, 11, raw = F)11 10.4627 27.3399 0.383 0.70197
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 27.34 on 3378 degrees of freedom
Multiple R-squared: 0.4131, Adjusted R-squared: 0.4112
F-statistic: 216.2 on 11 and 3378 DF, p-value: < 2.2e-16
We see that the 11th order polynomial term is the first to have a p-value >0.05. This agrees with our raw polynomial calculation that Order 10 would be “best”.
With the correlation removed between our predictors, we can also see that a sequence of F-tests for adding each higher order term results in the same p-values:
Code
# print ANOVA comparisons adding 1 higher orthogonal term; notice the p-values match those from the regression coefficient table for mod_orth11
anova(mod_orth1, mod_orth2, mod_orth3, mod_orth4, mod_orth5, mod_orth6,
      mod_orth7, mod_orth8, mod_orth9, mod_orth10, mod_orth11)
Analysis of Variance Table
Model 1: SHBG ~ poly(RIDAGEYR, 1, raw = F)
Model 2: SHBG ~ poly(RIDAGEYR, 2, raw = F)
Model 3: SHBG ~ poly(RIDAGEYR, 3, raw = F)
Model 4: SHBG ~ poly(RIDAGEYR, 4, raw = F)
Model 5: SHBG ~ poly(RIDAGEYR, 5, raw = F)
Model 6: SHBG ~ poly(RIDAGEYR, 6, raw = F)
Model 7: SHBG ~ poly(RIDAGEYR, 7, raw = F)
Model 8: SHBG ~ poly(RIDAGEYR, 8, raw = F)
Model 9: SHBG ~ poly(RIDAGEYR, 9, raw = F)
Model 10: SHBG ~ poly(RIDAGEYR, 10, raw = F)
Model 11: SHBG ~ poly(RIDAGEYR, 11, raw = F)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 3388 4286690
2 3387 3276833 1 1009857 1351.0353 < 2.2e-16 ***
3 3386 2873485 1 403348 539.6180 < 2.2e-16 ***
4 3385 2618205 1 255281 341.5269 < 2.2e-16 ***
5 3384 2573535 1 44670 59.7615 1.403e-14 ***
6 3383 2558283 1 15252 20.4048 6.482e-06 ***
7 3382 2550263 1 8020 10.7289 0.001065 **
8 3381 2537447 1 12817 17.1467 3.545e-05 ***
9 3380 2529858 1 7589 10.1524 0.001454 **
10 3379 2525059 1 4799 6.4198 0.011330 *
11 3378 2524950 1 109 0.1465 0.701973
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
However, if we did a series of partial F-tests comparing to our saturated pure fit model (mod_pure), we’d arrive at the same conclusion as the lack-of-fit \(F\)-test that a model with order 7 is “best”:
Order 1 Order 2 Order 3 Order 4 Order 5 Order 6 Order 7 Order 8
0.0000 0.0000 0.0000 0.0000 0.0001 0.0044 0.0234 0.2086
Order 9 Order 10 Order 11
0.4763 0.6655 0.6374
Here, since we are comparing to the saturated model, we can be fairly confident that order 7 may be best. If we did not have a saturated model (e.g., no replicates), we may have chosen the order 10 as optimal based on the orthogonal polynomials (and potentially decided a lower order based on parsimony).
Which Model to Use?
We saw that the sequential partial F-tests on the raw (and orthogonal) polynomials suggested a model with Order 10, while the comparisons against the saturated model identified a model with Order 7. Both of these are quite high orders. We may also wish to evaluate other summaries, like the adjusted \(R^2\) value:
Code
rawr2 <- c('Order 1'  = summary(mod_lm)$adj.r.squared,
           'Order 2'  = summary(mod_poly2)$adj.r.squared,
           'Order 3'  = summary(mod_poly3)$adj.r.squared,
           'Order 4'  = summary(mod_poly4)$adj.r.squared,
           'Order 5'  = summary(mod_poly5)$adj.r.squared,
           'Order 6'  = summary(mod_poly6)$adj.r.squared,
           'Order 7'  = summary(mod_poly7)$adj.r.squared,
           'Order 8'  = summary(mod_poly8)$adj.r.squared,
           'Order 9'  = summary(mod_poly9)$adj.r.squared,
           'Order 10' = summary(mod_poly10)$adj.r.squared,
           'Order 11' = summary(mod_poly11)$adj.r.squared)
ortr2 <- c('Orth Order 1'  = summary(mod_orth1)$adj.r.squared,
           'Orth Order 2'  = summary(mod_orth2)$adj.r.squared,
           'Orth Order 3'  = summary(mod_orth3)$adj.r.squared,
           'Orth Order 4'  = summary(mod_orth4)$adj.r.squared,
           'Orth Order 5'  = summary(mod_orth5)$adj.r.squared,
           'Orth Order 6'  = summary(mod_orth6)$adj.r.squared,
           'Orth Order 7'  = summary(mod_orth7)$adj.r.squared,
           'Orth Order 8'  = summary(mod_orth8)$adj.r.squared,
           'Orth Order 9'  = summary(mod_orth9)$adj.r.squared,
           'Orth Order 10' = summary(mod_orth10)$adj.r.squared,
           'Orth Order 11' = summary(mod_orth11)$adj.r.squared)
rbind('Raw' = rawr2, 'Orthogonal' = ortr2)
Order 1 Order 2 Order 3 Order 4 Order 5 Order 6
Raw 0.003378928 0.2379376 0.3315431 0.3907489 0.4009665 0.4043407
Orthogonal 0.003378928 0.2379376 0.3315431 0.3907489 0.4009665 0.4043407
Order 7 Order 8 Order 9 Order 10 Order 11
Raw 0.4060323 0.4088426 0.4104361 0.4113803 0.4112316
Orthogonal 0.4060323 0.4088426 0.4104361 0.4113803 0.4112316
We can note 2 things from these \(R^2_{adj}\) results:
The estimated percent of variability explained by each raw and orthogonal polynomial model of the same order are identical! This is because while the orthogonal polynomials transform our predictors to remove collinearity, the same data/information is included in the model.
It appears that after Order 4, the jumps in \(R^2_{adj}\) get smaller and smaller. This may suggest that a model with Order 4 is sufficient.
To evaluate, let’s compare some of the models while also illustrating the dangers of extrapolation:
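A sketch of this comparison, deliberately predicting over ages -10 to 100 (well beyond the observed 6-80 years) to show the extrapolation behavior; the colors and line types are arbitrary:
Code
plot(x = dat$RIDAGEYR, y = dat$SHBG, xlab = 'Age (Years)', ylab = 'SHBG (nmol/L)',
     col = 'gray85', cex = 0.5, xlim = c(-10, 100))
age <- -10:100 # intentionally extends beyond the observed age range
lines(age, predict(mod_poly4,  newdata = data.frame(RIDAGEYR = age)), lwd = 2, lty = 1, col = 'black')
lines(age, predict(mod_poly7,  newdata = data.frame(RIDAGEYR = age)), lwd = 2, lty = 2, col = 'orangered2')
lines(age, predict(mod_poly10, newdata = data.frame(RIDAGEYR = age)), lwd = 2, lty = 4, col = 'blue')
lines(age, predict(mod_poly11, newdata = data.frame(RIDAGEYR = age)), lwd = 2, lty = 5, col = 'purple')
legend('top', horiz = T, bty = 'n', lty = c(1, 2, 4, 5), lwd = 2,
       legend = c('Order 4', 'Order 7', 'Order 10', 'Order 11'),
       col = c('black', 'orangered2', 'blue', 'purple'))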
First, note extrapolation beyond our observed data quickly gets crazy and shouldn’t be done for our polynomial models.
Second, I think the choice of “best” between Order 4 and Order 7 depends on our own belief about how well each fits the data without being potentially overfit. Further, Orders 10 and 11 are quite similar to Order 7, so even if we had used a method and chosen Order 10, it wouldn’t be wrong (just overly complex).
Other Modeling Approaches
There are also other modeling strategies that may be helpful beyond polynomial regression and are covered in some other lectures:
Spline modeling provides more flexibility for the trends (but is still hard to interpret)
Piecewise regression models are more interpretable and identify break/changepoints where the trend changes/shifts