Flexible Modeling of Continuous Predictors

Lecture 11

Chris Slaughter

October 30, 2025

Code
library(rms)
library(ggplot2)
library(splines)
library(lspline)
library(lmtest)
library(sandwich)
library(car)
library(knitr)
tryCatch(source('pander_registry.R'), error = function(e) invisible(e))

1 Flexible models of continuous predictors

1.1 Overview

  • When covariates are measured on a continuous scale, we have many different approaches to analysis

  • To date, we have focused primarily on approaches that facilitate interpretation of our parameter estimates

    • Untransformed
      • Additive changes
      • Compares groups differing by \(X\) units in the covariate of interest
    • Log transformed
      • Fold (multiplicative) changes
      • Compares groups with a \(k\)-fold difference in the covariate of interest
  • Flexible modeling of continuous covariates is another option

    • Interpretation often relies on plots, except in special cases

    • It is best to pre-specify how we intend to model continuous covariates to avoid data-driven modeling decisions

1.1.1 Common continuous measures

  • Dose. Typically, dose of some treatment. Higher doses presumed to lead to greater (or lesser) response.
  • Exposure. Typically, greater precision (more power) to consider amount of exposure rather than just Exposed versus Unexposed
  • Many others. Age, serum cholesterol, blood pressure, height, etc.
  • “Dose-response” in the notes that follow refers more generally to “doses” of any type of continuous covariate (“doses of age”, “doses of cholesterol”, etc.)

1.1.2 Scientific questions when modeling dose-response

  • We have many choices for modeling dose-response

  • Consider the scientific question when deciding on an appropriate model

  • How does the scientific question you want to answer relate to how you model the dose-response relationship?

    • Models for detecting an association in subgroups defined by \(X\)
    • Models for estimating the mean response in subgroups defined by \(X\)
    • Models for predicting individual response in subgroups defined by \(X\)
  • Will describe methods for

    • Modeling complex dose-response curves

    • Flexible methods (non-parametric loess smooths, parametric regression splines)

1.2 Hierarchical Models

1.2.1 Full and restricted models

  • When testing for associations, we are implicitly comparing two models

    • “Full” model

      • Usually corresponds to alternative hypothesis

      • Contains all terms in the “restricted” model plus some terms that we want to test for inclusion

      • Example: \(g(\theta) = \beta_0 + \beta_1 * X_1 + \beta_2 * X_2\)

    • “Restricted” model

      • Usually corresponds to the null hypothesis

      • Terms in the model are the subset of the terms in the full model that are not being tested

      • Example: \(g(\theta) = \beta_0 + \beta_1 * X_1\)

    • Jargon: The restricted model is “nested” in the full model

      • If we set some parameters in the full model equal to zero, we will get the restricted model

      • We have very good methods for testing differences between nested models

        • Is a model with weight and height as predictors better than a model with just weight?
      • Our methods are not as good for comparing models that are not nested

        • Is a model with just weight as a predictor better than a model with just height as a predictor?
    • Non-nested models (e.g. \(X_1\) is weight, \(X_2\) is height)

      • \(g(\theta) = \beta_0 + \beta_1 * X_1\)

      • \(g(\theta) = \gamma_0 + \gamma_2 * X_2\)

1.2.2 Scientific interpretation of full versus restricted models

  • Our scientific interpretation of our statistical tests depends on the meaning of the restricted model compared to the full model

    • Consider the additional associations that are possible with the full model that are not possible with the restricted model
  • Appears very straightforward when we are only considering one full model compared to one restricted model

    • In practice, we may have many restricted models in mind

    • Multiple testing problem appears if we are considering many restricted models

1.2.2.1 Example 1: Adjusted effects

  • Full model: FEV vs smoking, age, height

  • Restricted model: FEV vs age, height

  • If the full model provides no clear advantage over the restricted model, then we conclude that there is insufficient evidence to suggest an association between FEV and smoking when controlling for age and height

  • Note how carefully I stated the above

    • “Insufficient evidence” (not evidence that smoking is not associated)

    • Conditional on controlling for age and height. Smoking, alone, may be an important predictor of FEV. The question answered by the full and reduced model is a comparison of FEV “among individuals of the same age and height but differing in smoking status” not just individuals who differ in smoking status.

1.2.2.2 Example 2: Tests of linearity

  • Full model: Survival vs cholesterol and cholesterol\(^2\)

    • Linear and quadratic terms for cholesterol

    • Model allows for a parabolic shape for cholesterol

  • Restricted model: Survival vs cholesterol

  • If the full model provides no clear advantage over the restricted model, we conclude that there is insufficient evidence to suggest a U-shaped trend in survival with cholesterol

  • Note that a quadratic term does not include all possibilities for non-linearity. We will cover other (better) approaches later in the notes
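
  • A sketch of this comparison on simulated data (the variables below are hypothetical, and a continuous response is used in place of survival so that lm() can be used)

Code
set.seed(1)
chol <- rnorm(200, mean=200, sd=30)
y <- 50 - 0.002*(chol - 200)^2 + rnorm(200, 0, 5)   # U-shaped truth
full <- lm(y ~ chol + I(chol^2))     # linear plus quadratic term
restricted <- lm(y ~ chol)           # linear term only
coeftest(full, vcov=sandwich)        # robust Wald test of the quadratic term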

1.2.3 Statistical comparison of full and reduced models

  • Wald test

    • Fits only the full model

    • Tests if the parameter (or parameters) in the full model are equal to zero

    • Can be used for robust and non-robust standard error estimates (fit using GEE)

  • Likelihood Ratio test

    • Fits both the full and reduced model

    • Compares the likelihood functions to determine if they are significantly different

    • Can be used for non-robust models only (fit using likelihood)

  • Score test

    • Only fits the reduced model

    • Considers the gradient of the likelihood function. After fitting the reduced model, looks at what is left over to see if it can be predicted by additional covariates

    • Not generally reported in regression model setting, but some common tests are Score tests (e.g. log-rank test)

  • In large samples, the Wald, Likelihood Ratio, and Score tests converge

    • Given our focus on robust standard error estimates, we have used Wald tests almost exclusively
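
  • As a minimal sketch (hypothetical simulated data), the Wald and likelihood ratio tests can be compared directly for a nested pair of linear models using waldtest() and lrtest() from the lmtest package loaded above

Code
set.seed(2)
x1 <- rnorm(100); x2 <- rnorm(100)
y <- 1 + 0.5*x1 + 0.3*x2 + rnorm(100)
full <- lm(y ~ x1 + x2)          # full model
restricted <- lm(y ~ x1)         # restricted (nested) model
waldtest(full, restricted, vcov=sandwich)   # Wald test with robust standard errors
lrtest(full, restricted)                    # likelihood ratio test (non-robust)

  • In large samples the two tests should give similar conclusions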

1.2.4 Models with interactions (effect modification)

  • Models to test for effect modification (interactions) are hierarchical

    • Full Model: Include the interactions

    • Reduced Model: Set the interactions to be tested in the full model to zero

  • Best scientific approach is to pre-specify the statistical model that will be used for analysis

  • Sometimes we choose a relatively large model including interactions

    • Allows us to address more questions (e.g. effect modification)

    • Sometimes gives greater precision

      • Trade-offs between more parameters to estimate versus smaller within group variability
  • Deciding which parameters to test from the full model

    • It can be difficult to decide the statistical test that corresponds to specific scientific questions

    • Need to consider the restricted model that corresponds to your null hypothesis

      • Which parameters in the full model need to be set to zero to test your desired hypothesis

1.2.4.1 Examples of different scientific tests from the same full model

  • In the following examples, the full model is the same but we change the restricted model based on the scientific question

    • Full model: Survival vs sex, smoking, sex-smoking interaction

    • Example Question 1: Is there effect modification?

      • Restricted model: Survival vs sex and smoking

        • Test that the sex-smoking interaction is zero
    • Example question 2: Is there an association between survival and sex?

      • Restricted model: Survival vs smoking

        • Test that parameters for sex and sex-smoking interaction are zero

        • Does sex affect survival in either smokers or non-smokers?

    • Example question 3: Is there an association between survival and smoking?

      • Restricted model: Survival vs sex

        • Test that parameters for smoking and sex-smoking interaction are zero
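
  • A sketch of how these three questions map onto joint tests of the same full model, using simulated data with a continuous response standing in for survival (the variables and effect sizes below are hypothetical); linearHypothesis() is from the car package loaded above

Code
set.seed(3)
sex <- rbinom(300, 1, 0.5)       # 1 = female, 0 = male
smoke <- rbinom(300, 1, 0.4)     # 1 = smoker, 0 = non-smoker
y <- 70 - 5*smoke + 2*sex + 3*sex*smoke + rnorm(300, 0, 8)
full <- lm(y ~ sex + smoke + sex:smoke)
# Question 1: effect modification (interaction term only)
linearHypothesis(full, "sex:smoke", vcov=sandwich(full))
# Question 2: any association with sex (sex main effect and interaction)
linearHypothesis(full, c("sex", "sex:smoke"), vcov=sandwich(full))
# Question 3: any association with smoking (smoking main effect and interaction)
linearHypothesis(full, c("smoke", "sex:smoke"), vcov=sandwich(full))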

1.2.4.2 Be careful when making data-driven model building decisions

  • We are often tempted to remove parameters that are not statistically significant before proceeding to other tests

    • You will find many textbooks that advocate using this analysis approach

      • e.g. fit the interaction, if it is not significant, exclude it from the model

      • Such texts are invariably written by mathematicians with no understanding of the underlying science

      • When you work in the same scientific area, you will eventually “discover” false-positive associations if you use this model building technique

    • Such data-driven analyses tend to suggest that failure to reject the null means equivalence

      • But, if you make this conclusion, you are wrong (large p-values do not indicate equivalence)

      • Absence of evidence is not evidence of absence

    • A procedure where we fit one model, and then use the results to determine the next model we will fit tends to underestimate the true standard error

      • A consequence of the general multiple testing problem

      • Will lead to declaring statistical significance more often than you should (inflates type-I error rate)

      • May also lead to parameter estimates that are biased away from the null

        • Statistical significance increases as a function of the effect size and the inverse of the standard error

        • If, by chance, you see a large effect in your sample, you will be more likely to declare significance

  • How we interpret “negative” studies

    • There are many other reasons for not rejecting the null hypothesis of zero slope

      • There may be no association

      • There may be an association, but not in the parameter considered (i.e. the mean response)

      • There may be an association in the parameter considered, but the best fitting line has zero slope (a curvilinear association in the parameter)

      • There may be a first order trend in the parameter, but we lack sufficient statistical precision to be confident that it truly exists (a type-II error)

    • Be careful not to just assume that it is the first reason (no association)

    • Statistically, there is no way of discriminating between these four possibilities unless you have an infinite sample size

1.3 Modeling Complex Dose-Response

All models are wrong, but some are useful.

1.3.1 S-Shaped dose-response curves

  • It is common in biology, medicine, and science to find “S-shaped” dose-response curves
Code
set.seed(1234)
n <- 100
plotdata <- data.frame(x=1:n,
                       y=20 + 50 / (1+exp(-.15*(1:n-50))))
plotdata$y <- plotdata$y + rnorm(n,0,4)
ggplot(plotdata, aes(x=x, y=y)) + geom_point() + theme_bw() + geom_smooth()

S-shaped dose response curve fit with loess smoother
  • If we happen to sample over the entire range of \(X\) (in this example, from 0 to 100), we would observe all parts of the S-curve

  • However, if we were only to sample from a limited range of \(X\), we might observe only parts of the curve where it was approximately linear

Code
plotdata$xint <- cut(plotdata$x, c(0,30,70,100))
ggplot(plotdata, aes(x=x, y=y)) + facet_wrap(~ xint) + geom_point() + theme_bw() + geom_smooth()

The S-shaped dose response curve fit is approximately linear within three ranges of X.
  • Even if we expect a complex dose response relationship, we must also consider the range of dose that we are sampling over when considering the needed complexity of the model

1.3.1.1 Loess smoothing

  • The default smooth for ‘geom_smooth’ when there are fewer than 1000 data points is a loess smooth. These fits appear in the plots above

  • Loess is a non-parametric smooth where fitting is done locally.

  • That is, for the fit at point x, the fit is made using points in a neighborhood of x, weighted by their distance from x

  • There are arguments to control the degree of smoothing, or how close we consider points to be in the “neighborhood of x”

  • ‘loess’ provides a nice visual representation of a variety of data without having to specify a model

  • We will primarily be concerned with parametric smoothers as the parameters can be estimated (and in special cases interpreted) from multivariable regression models
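
  • The amount of smoothing is controlled by the span argument (the fraction of the data used for each local fit). A brief sketch, reusing the simulated S-curve data from above

Code
ggplot(plotdata, aes(x=x, y=y)) + geom_point() + theme_bw() +
  geom_smooth(method="loess", span=0.3, se=FALSE, aes(color="span = 0.3")) +
  geom_smooth(method="loess", span=0.9, se=FALSE, aes(color="span = 0.9")) +
  scale_colour_manual(name="Smoother", values=c("Blue", "Green"))

Loess smooths of the simulated S-curve data with a small (0.3) and large (0.9) span. Smaller spans follow local wiggles more closely; larger spans give a smoother curve.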

1.3.2 Linear predictors

  • The most commonly-used regression models all consider “linear predictors”

    • Linear refers to linear in the parameters (\(\beta\)s), and is not a reference to transformations of the predictors

    • The modeled predictors can be transformations of the scientific measurements, and the model remains linear in the parameters

      • \(g[\theta | X_i] = \beta_0 + \beta_1 \textrm{log}(X_i)\)

      • \(g[\theta | X_i] = \beta_0 + \beta_1 X_i + \beta_2 X_i^2\)

    • Examples of models that are not linear in the parameters (We do not consider such models)

      • \(g[\theta | X_i] = \beta_0 X_i^{\beta_1}\)

      • \(g[\theta | X_i] = \beta_0 + \beta_1 \textrm{exp}(-\beta_2 X_i)\)

      • \(g[\theta | X_i] = \frac{\beta_0 + \beta_1 X_i}{1 + \beta_2 X_i}\)
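
  • As a small sketch of this distinction (hypothetical simulated data): models that only transform the predictors are still fit with ordinary lm(), while a model that is nonlinear in the parameters requires nonlinear least squares via nls(), shown here only for contrast

Code
set.seed(4)
x <- runif(100, 1, 10)
# Linear in the parameters: transformed predictors, ordinary least squares
y1 <- 2 + 1.5*log(x) + rnorm(100, 0, 0.3)
lm(y1 ~ log(x) + I(x^2))
# Not linear in the parameters: requires nonlinear least squares (we do not use such models here)
y2 <- 5 - 3*exp(-0.4*x) + rnorm(100, 0, 0.2)
nls(y2 ~ b0 + b1*exp(-b2*x), start=list(b0=5, b1=-3, b2=0.5))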

1.3.3 Types of transformations

  • We transform predictors to provide more flexible description of complex associations between the response and some scientific measure

    • Threshold effects

    • Exponentially increasing effects

    • U-shaped functions

    • S-shaped functions

    • etc.

  • Flexible modeling of continuous predictors was briefly covered in the notes on Poisson regression. We used a flexible dose-response model to examine the association between BMI and reflux rate by Esophagitis status

Code
bmi.data <- read.csv("data/bmi.csv", header=TRUE)
# Events are pH less than 4
bmi.data$events <- bmi.data$totalmins4
m.spline2.adj <- glm(events ~ ns(bmi,4) + offset(log(totalmins)) + esop, data=bmi.data, family="poisson")
m.spline3.adj <- lm(events / totalmins ~ ns(bmi,4) + esop, data=bmi.data)
par(mfrow=c(1,2), mar=c(5,4,4,0.5))
plot(18:40, exp(predict(m.spline2.adj, newdata=data.frame(bmi=18:40, totalmins=720, esop=1), type="link")), type='l', ylab="Predicted number of events per day", xlab="BMI", ylim=c(0,100), main="Poisson Reg")
axis(4, labels=FALSE, ticks=TRUE)
legend("bottomright", c("Esophagitis Pos","Esophagitis Neg"), inset=0.05, col=1, lty=1:2)
lines(18:40, exp(predict(m.spline2.adj, newdata=data.frame(bmi=18:40, totalmins=720, esop=0), type="link")), lty=2)
plot(18:40, 720*predict(m.spline3.adj, newdata=data.frame(bmi=18:40, esop=1), type="response"), type='l', col='Blue', ylab="", xlab="BMI", ylim=c(0,100), main="Linear Reg", axes=FALSE)
axis(1)
axis(4)
axis(2, labels=FALSE, ticks=TRUE)
box()
lines(18:40, 720*predict(m.spline3.adj, newdata=data.frame(bmi=18:40, esop=0), type="response"), type='l', col='Blue', lty=2)
legend("bottomright", c("Esophagitis Pos","Esophagitis Neg"), inset=0.05, col="Blue", lty=1:2)

  • 1:1 Transformations

    • Sometimes we can transform 1 scientific measurement into 1 modeled predictor and achieve approximate linearity

    • Ex: log transformations will sometimes address apparent threshold effects

Code
set.seed(33)
n <- 200

x <- runif(n)
logx <- log(x)

y <- 0 + logx + rnorm(n)
par(mfrow=c(1,2))
plot(x,y, main="Untransformed")
lines(lowess(y~x), lwd=2)

plot(logx, y, main="Log Transformed X")
lines(lowess(y ~ logx), lwd=2)

Log transformation addresses threshold effect in untransformed X
  • Ex: cubing height produces a more linear association with FEV
Code
fev <- read.csv(file="data/FEV.csv")
par(mfrow=c(1,2))
plot(fev$height, fev$fev, main="Untransformed", pch='.', xlab="Height",ylab="FEV")
lines(lowess(fev$fev ~ fev$height), lwd=2)

fev$height3 <- fev$height^3
plot(fev$height3, fev$fev, main="Cubed Height", pch='.', xlab="Height^3",ylab="FEV")
lines(lowess(fev$fev ~ fev$height3), lwd=2)

Cubing height produces a more linear association with FEV
  • 1:Many transformations

    • Sometimes we transform 1 scientific measurement into several modeled predictors

      • Polynomial regression

      • Dummy variables (factor or class variables)

      • Piecewise linear

      • Splines

  • Polynomial regression

    • Fit linear term plus higher order terms (squared, cubic, …)

    • Can fit an arbitrarily-complex function

      • An \(n\)-th order polynomial can fit \(n+1\) points exactly
    • Generally, very difficult to interpret parameters

      • I will usually graph functions when I want an interpretation
    • Special uses

      • 2nd order (quadratic) model to look for U-shaped trends (e.g. alcohol consumption and cardiovascular risk? Potential hypothesis being that moderate consumption of alcohol beneficial relative to either no consumption or high consumption)

      • Tests for linearity achieved by testing that all higher order terms have parameters equal to zero (hierarchical model)

      • Full Model: \(g(\theta) = \beta_0 + \beta_1 * X + \beta_2 * X^2 + \beta_3 * X^3\)

      • Reduced Model: \(g(\theta) = \beta_0 + \beta_1 * X\)
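
  • A tiny sketch of the point above that an \(n\)-th order polynomial can fit \(n+1\) points exactly (the values below are hypothetical)

Code
xp <- c(1, 2, 4)
yp <- c(3, 7, 6)
fit.exact <- lm(yp ~ poly(xp, 2))             # quadratic: 3 parameters for 3 points
cbind(observed=yp, fitted=fitted(fit.exact))  # fitted values reproduce the data exactly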

1.3.4 Example: FEV and Height association

  • We can try to assess whether any association between mean FEV and height follows a straight line

  • Fit a third order (cubic) polynomial due to the known scientific relationship between volume and height (using robust standard error estimates)

Code
# Create squared and cubic terms
fev$ht2 <- fev$height^2
fev$ht3 <- fev$height^3

m1 <- lm(fev ~ height + ht2 + ht3, data=fev)
coeftest(m1, vcov=sandwich)
t test of coefficients
  Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.5691e-01 1.2339e+01 0.037030 0.97047
height 3.0595e-02 6.3266e-01 0.048359 0.96144
ht2 -1.5222e-03 1.0746e-02 -0.141648 0.88740
ht3 2.5797e-05 6.0481e-05 0.426524 0.66987
  • From the model output, is height a significant predictor of FEV?

  • Note that the p-values for each term are not significant

    • But each of these tests addresses an irrelevant question

      • height: After adjusting for the 2nd and 3rd order terms, is the linear term important?

      • ht2: After adjusting for the linear and 3rd order terms, is the quadratic term important?

      • ht3: After adjusting for the linear and 2nd order terms, is the cubic term important?

    • We need to test the 2nd and 3rd order terms simultaneously

      • This tests whether the height-FEV relationship is linear
Code
linearHypothesis(m1, c("ht2",
                       "ht3"),
                 vcov=sandwich(m1))
Linear hypothesis test
Res.Df  Df       F  Pr(>F)
   652
   650   2  30.641       0
  • We find clear evidence that the trend in mean FEV versus height is non-linear (\(p < 0.001\))

  • Note that if we had seen \(p > 0.05\), we could not be sure it was linear. The relationship may have been non-linear, but not in a way the cubic polynomial could detect

1.3.4.1 Example: log FEV and polynomial height

  • We can try to assess whether any association between mean log FEV and height follows a straight line relationship

  • Again, we will fit a third order (cubic) polynomial, but this time we do not have any good scientific justification for such a model

Code
fev$logfev <- log(fev$fev)

m2 <- lm(logfev ~ height + ht2 + ht3, data=fev)
coeftest(m2, vcov=sandwich)
t test of coefficients
  Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.7918e+00 4.9700e+00 -0.561739 0.57449
height 7.0664e-02 2.4759e-01 0.285404 0.77543
ht2 -1.8281e-04 4.0899e-03 -0.044697 0.96436
ht3 3.2400e-07 2.2404e-05 0.014461 0.98847
  • Note that again, the p-values for the individual terms are not significant and are still addressing uninteresting scientific questions

  • Test the squared and cubed terms simultaneously

Code
linearHypothesis(m2, c("ht2",
                       "ht3"),
                 vcov=sandwich(m2))
Linear hypothesis test
Res.Df  Df        F   Pr(>F)
   652
   650   2  0.29449  0.74501
  • We do not find clear evidence that the trend in mean log FEV versus height is non-linear

    • This does not prove linearity because it could have been nonlinear in a way that a cubic polynomial could not detect

    • However, my guess is that the cubic polynomial would have picked up most reasonable patterns of non-linearity likely to occur in this setting

  • We have not addressed the question of whether log FEV is associated with height

    • This question could have been addressed in the cubic model by

      • Testing all three height-derived variables simultaneously

      • OR looking at the overall F-test (because only height variables are in the model)

    • Another alternative would be to fit a model with only the linear term for height

      • In general, it is a very bad idea to go fishing for models, so pick an approach beforehand and use that

      • If I suspected a complex dose-response relationship beforehand, I would fit the complex model and test all of the height coefficients

      • If I just cared about showing a first order trend, the following output would answer that question

Code
m3 <- lm(logfev ~ height, data=fev)
coeftest(m3, vcov=sandwich)
t test of coefficients
  Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.271312 0.068447 -33.184 0
height 0.052119 0.001121 46.494 0

1.3.5 Dummy Variables

  • Indicator variables for all but one group

  • This is the only appropriate method for modeling nominal (unordered) categorical variables

    • Example: Marital status using indicators for

      • Married (married=1, everything else=0)

      • Widowed (widowed=1, everything else=0)

      • Divorced (divorced=1, everything else=0)

      • Single would then be the intercept

  • While it is the only approach for nominal variables, dummy variable coding is often used for other settings

  • Makes regression equivalent to Analysis of Variance (ANOVA)

    • ANOVA is an old technique developed by R.A. Fisher in 1921

    • Can reproduce results of ANOVA exactly with regression, so ANOVA is of limited use today (but you will still find it around)
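
  • As a sketch of the marital status coding above, model.matrix() shows the indicator variables R creates for a factor (the values below are hypothetical)

Code
marital <- factor(c("Single", "Married", "Widowed", "Divorced", "Married"))
marital <- relevel(marital, ref="Single")   # Single as the reference group
model.matrix(~ marital)                     # intercept plus indicators for the other groups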

1.3.5.1 Example: Mean salary by Field

  • Field is a nominal variable, so we must use dummy variables

    • I decided to use Other as a reference group. By default, R will order alphabetically so Arts would have been the default reference group
Code
salary <- read.csv("data/salary.csv")
salary <- salary[salary$year==95,]

salary$field <- factor(salary$field)
salary$field <- relevel(salary$field, ref="Other")

m4 <- lm(salary ~ field, data=salary)
coeftest(m4, vcov=sandwich)
t test of coefficients
  Estimate Std. Error t value Pr(>|t|)
(Intercept) 6291.6 61.01 103.1252 0
fieldArts -1013.6 104.73 -9.6775 0
fieldProf 1225.0 133.56 9.1718 0
Code
coefci(m4, vcov=sandwich)
  2.5 % 97.5 %
(Intercept) 6171.97 6411.31
fieldArts -1218.99 -808.13
fieldProf 963.05 1487.01
  • Interpretation based on coding used

    • Intercept corresponds to the mean salary in the Other field

      • These faculty will have both arts==0 and prof==0

      • Estimated mean salary of $6292 per month (95% CI: [6172, 6411])

      • Highly statistically different from $0 per month (a worthless test)

    • Slope for arts is difference in mean salary between Fine Arts and Other fields

      • Fine Arts faculty will have arts==1 and prof==0; Other faculty will have arts==0 and prof==0

      • Estimated difference in mean salary is $1014 lower (95% CI: [808, 1219] lower salary)

      • Highly statistically different from $0 per month (a useful test, if we specified it a priori)

    • Slope for prof is difference in mean salary between Professional and Other fields

      • Professional faculty will have arts==0 and prof==1; Other faculty will have arts==0 and prof==0

      • Estimated difference in mean salary is $1225 higher (95% CI: [963, 1487] higher salary)

      • Highly statistically different from $0 per month (a useful test, if we specified it a priori)

  • Because we modeled the three groups with two predictors plus an intercept, the estimates agree exactly with the sample means

    • No borrowing of information across field

    • If we had used a different reference group rather than Other, would also see agreement between model estimates and sample means

  • Hypothesis test: To test for mean salary differences by field

    • We have modeled field using two variables

    • Both slopes would have to be zero for there to be no association between field and mean salary

    • Want to simultaneously test the two slopes

Code
linearHypothesis(m4, c("fieldArts","fieldProf"),
                 vcov=sandwich(m4))
Linear hypothesis test
Res.Df  Df       F  Pr(>F)
  1596
  1594   2  121.08       0
  • We can conclude there is a significant difference in mean salary comparing faculty in different fields (\(P<0.001\))

1.3.5.2 Continuous variables

  • We can also use dummy variables to represent continuous variables

    • Continuous variables measured at discrete levels (e.g. dose in an interventional experiment)

    • Continuous variables divided into categories (often a bad idea)

  • Dummy variables will fit groups exactly

    • If no other predictors in the model, parameter estimates correspond exactly to descriptive statistics
  • With continuous variables, dummy variable coding assumes a step-function is true

  • Modeling with dummy variables ignores order of predictor of interest

Code
m.dummy <- lm(height ~ factor(age), data=fev)
newdata <- data.frame(age=3:19)
newdata$ht.dummy <- predict(m.dummy, newdata=newdata)
ggplot(fev, aes(x=age, y=height)) + geom_point() + theme_bw() +
  geom_point(data=newdata, aes(x=age, y=ht.dummy), col="red", size=3)

Red points are means at each level of age
  • The red points in the plot represent the mean height by subgroups defined by years of age

    • Dummy variables are used for each age (measured in years)

    • Step function (e.g. when age changes from 8.999 to 9.000, expected height increases)

    • Several ages have few subjects, which makes the estimates unreliable

  • Do we really believe that the average height of 18 year olds is less than the average height of 14, 15, 16, 17, and 19 year olds?

    • Probably not (the model is overfit)
    • By ignoring the ordering of age, we are not borrowing strength across adjacent age groups
    • When building the model, we can decide how much information we want to borrow from nearby groups
  • Predicted heights for a subset of the observed ages from the dummy variable model are shown below; predictions from two polynomial models are added in the next section

Code
newdata[12:17,]
  age ht.dummy
12 14 67.520
13 15 66.368
14 16 67.385
15 17 68.938
16 18 66.000
17 19 67.833
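
  • A quick check (using the objects created above) that the dummy variable fit reproduces the descriptive statistics: the model-based prediction equals the sample mean height within each year of age

Code
# Sample mean height by year of age versus the model-based predictions
cbind(sample.mean = tapply(fev$height, fev$age, mean),
      model.fit   = newdata$ht.dummy)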

1.4 Flexible Methods

  • We have methods that can fit a wide variety of curve shapes

    • Dummy variables: Step function with tiny steps

    • Polynomials: If high degree, allows many patterns of curvature

    • Splines: Piecewise linear or piecewise polynomial

1.4.1 Polynomial functions

  • We can fit high order polynomials to flexibly model a continuous covariate

  • A continuous covariate with \(k\) levels can be perfectly modeled using polynomials up to degree \(k-1\).

  • In the Height and Age dataset, there are 17 unique values of age (age is rounded to the nearest year), so we can perfectly model the means using polynomials up to degree 16

    • \(E[Height | Age = a] = \beta_0 + \beta_1 * a + \beta_2 * a^2 + \ldots + \beta_{16} * a^{16}\)

    • Each \(\beta\) is difficult to interpret, but the predicted values at the integer levels of age make sense.

    • Predicted values of height at non-integer values of age (e.g. age of 18.5) may not make sense

    • Polynomials become especially unstable near the upper and lower bounds of the age distribution. Stability near the upper and lower bounds is a general concern in many spline approaches.

  • A polynomial of lesser degree could also be considered. I show a model that includes polynomials for age up to degree 3 in the following plot. It is also less stable near the upper and lower bounds of age, but not as obviously so as the degree-16 polynomial.

Code
ggplot(fev, aes(x=age, y=height)) + geom_point() + theme_bw() + 
  geom_smooth(method=lm, formula=y ~ poly(x,16), se=FALSE) + 
  geom_smooth(method=lm, formula=y ~ poly(x,3), se=FALSE, col="Green") +
  geom_point(data=newdata, aes(x=age, y=ht.dummy), col="red", size=3)

Sixteen degree and three degree polynomial model
Code
m.poly16 <- lm(height ~ poly(age,16), data=fev)
m.poly3 <- lm(height ~ poly(age,3), data=fev)
newdata$ht.poly16 <- predict(m.poly16, newdata)
newdata$ht.poly3 <- predict(m.poly3, newdata)
newdata[12:17,]
  age ht.dummy ht.poly16 ht.poly3
12 14 67.520 67.520 67.288
13 15 66.368 66.368 67.745
14 16 67.385 67.385 67.829
15 17 68.938 68.937 67.513
16 18 66.000 66.000 66.768
17 19 67.833 67.833 65.566

1.4.2 Piecewise Linear Splines

  • Piecewise linear curves

    • Joined at “knots”

    • Straight lines in between the knots

    • Convenient parametrizations such that (1) coefficients are slopes of consecutive segments or (2) coefficients are slope changes at consecutive knots

    • Advantages

      • Interpretability. Has the interpretation as the slope (or slope change) from the regression model, but within a range of \(X\)

      • Best to specify knot locations in advance for desired scientific interpretation

      • A restricted model with a single slope term is nested within the spline model (the hierarchical structure allows comparison of the complex model to the simpler one)

    • Disadvantages

      • Abrupt change in slope at each knot location
  • Stata: mkspline newvar0 knot1 newvar1 knot2 newvar2 ... knotp newvarp = oldvar

    • Would then run regression using newvar0 … newvarp

    • Assumes straight lines between min and knot1; knot1 and knot2; etc.

  • In R, we can use the lspline package to make the linear splines

  • The following results will look at age and height using linear splines with 2 knots and 4 knots

Code
ggplot(fev, aes(x=age, y=height)) + geom_point() + theme_bw() + 
  geom_smooth(method=lm, formula=y ~ lspline(x, knots=c(9.5,15.5)), se=FALSE) + 
  geom_smooth(method=lm, formula=y ~ lspline(x, knots=c(6.5,9.5,12.5,15.5)), se=FALSE, col="Green") +
  geom_point(data=newdata, aes(x=age, y=ht.dummy), col="red", size=3)

Piecewise linear splines with 2 or 4 knots. Red points are mean height at each subgroup of age.

1.4.2.1 Piecewise linear spline interpretation (marginal is FALSE, the default)

  • Focusing on the model with 2 knots for interpretation within age groups of [3 to 9], [10 to 15], and [16 to 19]
    • Knots at 9.5 and 15.5 define three age groups
Code
m.linspline.nonmar <- lm(height ~ lspline(age, knots=c(9.5,15.5)), data=fev)
coeftest(m.linspline.nonmar, vcov=sandwich)
t test of coefficients
  Estimate Std. Error t value Pr(>|t|)
(Intercept) 38.49107 0.813179 47.3341 0.00000
lspline(age, knots = c(9.5, 15.5))1 2.48855 0.099940 24.9005 0.00000
lspline(age, knots = c(9.5, 15.5))2 1.08521 0.088609 12.2472 0.00000
lspline(age, knots = c(9.5, 15.5))3 -0.55308 0.371351 -1.4894 0.13687
Code
coefci(m.linspline.nonmar, vcov=sandwich)
  2.5 % 97.5 %
(Intercept) 36.89430 40.08785
lspline(age, knots = c(9.5, 15.5))1 2.29231 2.68480
lspline(age, knots = c(9.5, 15.5))2 0.91122 1.25921
lspline(age, knots = c(9.5, 15.5))3 -1.28228 0.17611
  • (Intercept) is the expected height when age is 0. Since this is outside the range of our data, it has no scientific interpretation.

  • The first slope term (‘lspline(age, knots = c(9.5, 15.5))1’) is the expected change in height comparing two individuals who differ in age by one year when both subjects are 9 years old or less. Comparing two children who are both 3 to 9 years old and differ in age by one year, the older child on average will be 2.49 inches taller than the younger child. We are 95% confident the true difference in height is between 2.29 and 2.68 inches taller.

    • That is, the model assumes that the change in height comparing 3 to 4, 4 to 5, 5 to 6, 6 to 7, 7 to 8, 8 to 9 is the same (a similar growth rate over these age ranges)

    • The model borrows strength across the age ranges by assuming the same slope

    • The model borrows strength across the next age range by assuming the mean height at the knot (9.5 years) is the same at the upper end of the first age group and the lower end of the second age group.

  • The second slope term (‘lspline(age, knots = c(9.5, 15.5))2’) is the expected change in height comparing two individuals who differ in age by one year when both subjects are between 10 and 15 years old

  • The third slope term (‘lspline(age, knots = c(9.5, 15.5))3’) is the expected change in height comparing two individuals who differ in age by one year when both subjects are 16 years old or older

  • We can test if the age-height slope significantly changes with age category

    • \(H_0: \beta_1 = \beta_2, \beta_1 = \beta_3\)
    • \(H_A\): At least one not equal
Code
linearHypothesis(m.linspline.nonmar, 
                 c("lspline(age, knots = c(9.5, 15.5))1=lspline(age, knots = c(9.5, 15.5))2",
                   "lspline(age, knots = c(9.5, 15.5))1=lspline(age, knots = c(9.5, 15.5))3"),
                  vcov=sandwich(m.linspline.nonmar))
Linear hypothesis test
Res.Df  Df       F  Pr(>F)
   652
   650   2  85.721       0
  • We reject the null hypothesis (\(P<0.001\)) and conclude the linear spline model is a better fit than the restricted model, \(E[Height | Age = a] = \beta_0 + \beta_1 * a\)

  • The following is the basis matrix for the linear spline (non-marginal) for ages 3 to 19. It is relatively easy to calculate this matrix

    • First column is age capped at the first knot: the minimum of age and 9.5
    • Second column is progress through the middle segment: the minimum of age and 15.5, minus 9.5, and never below 0
    • Third column is progress beyond the second knot: the maximum of 0 and (age minus 15.5)
Code
cbind(3:19, lspline(3:19,  knots = c(9.5, 15.5)))
  age     1     2     3
3 3.0 0.0 0.0
4 4.0 0.0 0.0
5 5.0 0.0 0.0
6 6.0 0.0 0.0
7 7.0 0.0 0.0
8 8.0 0.0 0.0
9 9.0 0.0 0.0
10 9.5 0.5 0.0
11 9.5 1.5 0.0
12 9.5 2.5 0.0
13 9.5 3.5 0.0
14 9.5 4.5 0.0
15 9.5 5.5 0.0
16 9.5 6.0 0.5
17 9.5 6.0 1.5
18 9.5 6.0 2.5
19 9.5 6.0 3.5
  • We have transformed one measured predictor, age, into three predictors
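
  • As a check on the description above, the non-marginal basis can be built by hand with pmin() and pmax(); this is a sketch, where column \(j\) tracks how far age has progressed through segment \(j\)

Code
age <- 3:19
basis.by.hand <- cbind(pmin(age, 9.5),                     # segment 1: age capped at the first knot
                       pmax(pmin(age, 15.5), 9.5) - 9.5,   # segment 2: progress between the knots
                       pmax(age, 15.5) - 15.5)             # segment 3: progress beyond the second knot
max(abs(basis.by.hand - lspline(age, knots=c(9.5, 15.5)))) # 0 if the two constructions agree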

1.4.2.2 Piecewise linear spline interpretation (marginal is TRUE)

  • Focusing on the model with 2 knots for interpretation within age groups of [3 to 9], [10 to 15], and [16 to 19]
    • Knots at 9.5 and 15.5 define these age groups
Code
m.linspline.mar <- lm(height ~ lspline(age, knots=c(9.5,15.5), marginal=TRUE), data=fev)
coeftest(m.linspline.mar, vcov=sandwich)
t test of coefficients
  Estimate Std. Error t value Pr(>|t|)
(Intercept) 38.4911 0.81318 47.3341 0.00000000
lspline(age, knots = c(9.5, 15.5), marginal = TRUE)1 2.4886 0.09994 24.9005 0.00000000
lspline(age, knots = c(9.5, 15.5), marginal = TRUE)2 -1.4033 0.16335 -8.5911 0.00000000
lspline(age, knots = c(9.5, 15.5), marginal = TRUE)3 -1.6383 0.42028 -3.8981 0.00010701
Code
coefci(m.linspline.mar, vcov=sandwich)
  2.5 % 97.5 %
(Intercept) 36.8943 40.08785
lspline(age, knots = c(9.5, 15.5), marginal = TRUE)1 2.2923 2.68480
lspline(age, knots = c(9.5, 15.5), marginal = TRUE)2 -1.7241 -1.08259
lspline(age, knots = c(9.5, 15.5), marginal = TRUE)3 -2.4636 -0.81303
  • (Intercept) is the expected height when age is 0. Since this is outside the range of our data, it has no scientific interpretation.

  • The first slope term (‘lspline(age, knots = c(9.5, 15.5), marginal = TRUE)1’) is the expected change in height comparing two individuals who differ in age by one year when both subjects are 9 years old or less.

  • The second slope term (‘lspline(age, knots = c(9.5, 15.5), marginal = TRUE)2’) is the change in the age-height slope for 10 to 15 year olds versus 3 to 9 year olds. This output is useful for estimating and testing the slope changes over age.

  • The third slope term (‘lspline(age, knots = c(9.5, 15.5), marginal = TRUE)3’) is the change in the age-height slope for 16 to 19 year olds versus 10 to 15 year olds (the adjacent younger age group). This output is useful for estimating and testing the slope changes over age.

  • Using this model, we can test if the age-height slope significantly changes with age category

    • \(H_0: \beta_2 = 0, \beta_3 = 0\)
    • \(H_A\): At least one not equal
Code
linearHypothesis(m.linspline.mar, 
                 c("lspline(age, knots = c(9.5, 15.5), marginal = TRUE)2",
                   "lspline(age, knots = c(9.5, 15.5), marginal = TRUE)3"),
                 vcov=sandwich(m.linspline.mar))
Linear hypothesis test
Res.Df  Df       F  Pr(>F)
   652
   650   2  85.721       0
  • This is the same scientific test as for the non-marginal model, just a different parameterization. So, we obtain the same inference (the F-statistic and p-value are identical).
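
  • A quick sketch verifying the equivalence with the two fitted models from above: the fitted curves are identical, and the marginal coefficients are the consecutive slope changes implied by the non-marginal fit

Code
# Identical fitted values from the two parameterizations (difference is numerically zero)
max(abs(fitted(m.linspline.nonmar) - fitted(m.linspline.mar)))
# Consecutive slope changes implied by the non-marginal fit ...
diff(coef(m.linspline.nonmar)[2:4])
# ... match the 2nd and 3rd spline coefficients of the marginal fit
coef(m.linspline.mar)[3:4]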

1.4.3 Polynomial splines

  • Polynomial splines introduce squared or cubic terms to increase the smoothness of the estimate. In particular, compared to linear splines, there are no abrupt changes in the slope at each knot point.

  • There are various types of polynomial splines, each with different formulas applied to generate the basis matrix for the regression model

    • B-spline: ‘bs()’
      • Represents the family of piecewise polynomials with the specified interior knots and degree
    • Natural cubic spline: ‘ns()’
      • Enforces the constraint that the function is linear beyond the boundary knots, which can either be supplied or default to the extremes of the data
    • Restricted cubic spline: ‘rcs()’
      • rcs is also a linear tail-restricted cubic spline function
  • We can choose to modify several parameters, depending on the function

    • Number of knots

    • Where to place the knots (or use defaults based on quantiles and number of knots)

    • Degree of the polynomial. In practice, rarely is a degree over 3 considered

    • Increasing the number of knots and the degree of the polynomial will increase the flexibility of the model

  • Several different comparisons are shown in the following plots

B-spline versus polynomial function
Code
ggplot(fev, aes(x=age, y=height)) + geom_point() + theme_bw() + 
  geom_smooth(method=lm, formula=y ~ poly(x, 3), se=FALSE, aes(color="Blue")) + 
  geom_smooth(method=lm, formula=y ~ bs(x,3), se=FALSE, aes(col="Green")) +
  geom_point(data=newdata, aes(x=age, y=ht.dummy), col="Red", size=3) + 
  scale_colour_manual(name="Model", values=c("Blue", "Green"), labels=c("poly(x,3)","bs(x,3)")) + theme(legend.position = c(0.8, 0.2))

A B-spline with 3 degrees of freedom is the same as a degree-3 polynomial. The two lines are not visible separately because the blue and green lines are on top of each other.
B-spline versus piecewise linear spline
Code
ggplot(fev, aes(x=age, y=height)) + geom_point() + theme_bw() + 
  geom_smooth(method=lm, formula=y ~ lspline(x,knots=c(9.5, 15.5)), se=FALSE, aes(color="Blue")) + 
  geom_smooth(method=lm, formula=y ~ bs(x,knots=c(9.5, 15.5), degree=3), se=FALSE, aes(color="Green")) +
  geom_point(data=newdata, aes(x=age, y=ht.dummy), col="red", size=3) + 
  scale_colour_manual(name="Model", values=c("Blue", "Green"), labels=c("lspline(x,knots=c(9.5, 15.5))","bs(x,knots=c(9.5, 15.5))")) + theme(legend.position = c(0.8, 0.2))

Cubic B-spline with knots at 9.5 and 15.5 compared to a piecewise linear spline with knots at 9.5 and 15.5
B-spline versus natural spline
Code
ggplot(fev, aes(x=age, y=height)) + geom_point() + theme_bw() + 
  geom_smooth(method=lm, formula=y ~ ns(x, df=3), se=FALSE, aes(color="Blue")) + 
  geom_smooth(method=lm, formula=y ~ bs(x,3, knots=c(9.5, 15.5)), se=FALSE, aes(color="Green")) +
  geom_point(data=newdata, aes(x=age, y=ht.dummy), col="red", size=3)+  
  scale_colour_manual(name="Model", values=c("Blue", "Green"), labels=c("ns(x, df=3)","bs(x,3, knots=c(9.5, 15.5))")) + theme(legend.position = c(0.8, 0.2))

Cubic B-spline with knots at 9.5 and 15.5 compared to a natural spline (3 df). The enforcement of the linearity constraint beyond the boundary knots is particularly apparent for the natural spline at the low end of age.
Natural spline versus restricted cubic spline
  • Natural splines and restricted cubic splines are both cubic splines with linearity restrictions beyond the boundary knot locations

  • They have different inputs which can result in different model fits

Code
ggplot(fev, aes(x=age, y=height)) + geom_point() + theme_bw() + 
  geom_smooth(method=lm, formula=y ~ ns(x, knots=c(9.5, 15.5)), se=FALSE, aes(color="Blue")) + 
  geom_smooth(method=lm, formula=y ~ rcs(x, parms=c(0, 9.5, 15.5, 19)), se=FALSE, aes(color="Green")) +
  geom_point(data=newdata, aes(x=age, y=ht.dummy), col="red", size=3) +  
  scale_colour_manual(name="Model", values=c("Blue", "Green"), labels=c("ns(x, knots=c(9.5, 15.5))","rcs(x, parms=c(0, 9.5, 15.5, 19))")) + theme(legend.position = c(0.8, 0.2))

Natural spline (with knots at 9.5 and 15.5) compared with a restricted cubic spline with parms set to give the same fit. Both spline approaches impose a linearity constraint at the tails. The fits are the same using these inputs.
Code
ggplot(fev, aes(x=age, y=height)) + geom_point() + theme_bw() + 
  geom_smooth(method=lm, formula=y ~ ns(x, df=3), se=FALSE, aes(color="Blue")) + 
  geom_smooth(method=lm, formula=y ~ rcs(x, nk=5), se=FALSE, aes(color="Green")) +
  geom_point(data=newdata, aes(x=age, y=ht.dummy), col="red", size=3) +  
  scale_colour_manual(name="Model", values=c("Blue", "Green"), labels=c("ns(x, df=3)","rcs(x, nk=5)")) + theme(legend.position = c(0.8, 0.2))

Natural cubic spline (3 d.f.) compared with a restricted cubic spline using default inputs. Both spline approaches impose a linearity constraint at the tails, but with different default knot locations the fits are different.
Natural spline versus piecewise linear spline
Code
ggplot(fev, aes(x=age, y=height)) + geom_point() + theme_bw() + 
  geom_smooth(method=lm, formula=y ~ ns(x, df=3), se=FALSE, aes(color="Blue")) + 
  geom_smooth(method=lm, formula=y ~ lspline(x, knots=c(9.5,15.5)), se=FALSE, aes(color="Green")) +
  geom_point(data=newdata, aes(x=age, y=ht.dummy), col="red", size=3) +  
  scale_colour_manual(name="Model", values=c("Blue", "Green"), labels=c("ns(x, df=3)","lspline(x, knots=c(9.5,15.5))")) + theme(legend.position = c(0.8, 0.2))

Natural spline (3 d.f.) compared with a piecewise linear spline with knots at 9.5 and 15.5. The natural spline is linear beyond its boundary knots by construction; the piecewise linear spline is linear within every segment, including the tails.
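
  • These spline terms can be used in hierarchical tests just like the polynomial terms earlier. A sketch using a natural spline for age: a robust Wald test of all spline terms addresses whether age is associated with mean height at all, and, because a straight line is itself a natural cubic spline, a (non-robust) F comparison against the straight-line model addresses linearity

Code
m.ns3 <- lm(height ~ ns(age, 3), data=fev)
m.lin <- lm(height ~ age, data=fev)
# Any association between age and mean height: test all spline terms jointly
linearHypothesis(m.ns3, c("ns(age, 3)1", "ns(age, 3)2", "ns(age, 3)3"),
                 vcov=sandwich(m.ns3))
# Test of linearity: the straight-line model is nested in the natural spline model
anova(m.lin, m.ns3)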

1.5 Summary comments

  • There are a variety of methods for flexibly modeling continuous covariates

  • Approach you use should be pre-specified and based on scientific considerations

  • Splines for predictor of interest in an association study

    • If you want to be able to write numerical summaries of your results, consider piecewise linear splines

      • May provide an adequate summary of your results, particularly for comparison between two knots

      • Comparison between subgroups separated by a knot location more challenging as you have a change in slope

      • Formal test can be conducted to determine if the slope is significantly different on ranges of \(X\) (hierarchical models)

    • Best to pre-specify your knot locations based on relevant scientific meaning

      • Lacking scientific reasons, it is OK to use quantiles as the quantiles are not related to the outcome (unsupervised learning)

      • Should not pick knot locations after looking at the outcome unless you have a plan for how to adjust your analysis for making data-driven decisions (this type of adjustment is hard to do, so I recommend avoiding it)

  • Spline for estimating means or prediction of new observations

    • Flexible methods most appropriate in this approach

    • Be careful of behavior of your splines at the extremes of your continuous covariates

      • (Also, don’t extrapolate beyond the range of the data)
    • Degrees of freedom, or allowed model complexity needs to be considered in terms of the effective sample size for the type of outcome you are studying

  • Effective sample sizes

    • Linear regression: Number of subjects
    • Logistic regression: Lesser of number of events and number of non-events
    • Cox regression: Number of events
    • Poisson regression: If counts are relatively continuous, more like linear regression. If counts are all small, more like Cox regression
    • Multinomial regression: Number of events in smallest categorical outcome level
    • Ordinal regression: If there are only a few severity categories, more like logistic regression. If a wider range of severity levels, more like linear regression
  • Allowed model complexity is, roughly, one available degree of freedom per every 10-15 units of effective sample size

    • Degrees of freedom can be spent on dummy variables, knot locations, and the degree of the polynomial

    • Degrees of freedom can be saved by constraints, such as the linearity constraint at the tails for natural splines and restricted cubic spline. These are popular splines because the constraint often saves degrees of freedom while improving the generalizability of the fit at the tails of the continuous distribution

    • These are approximations, and the allowed model complexity can depend on other characteristics of the data too

  • Splines for effect modification (continuous effect modifier and/or continuous predictor of interest)

    • Choice of model should be dictated by scientific considerations first

    • Statistically, we have limited power to detect effect modification

    • Linear terms for modeling effect modification may suffice

  • Flexible models for a precision variable

    • Often most of the association of the precision variable with the outcome can be captured in a linear term

    • However, if justified, can consider a flexible model too

    • Goal of a precision variable is to explain residual variability in the outcome after accounting for the predictor of interest (and, possibly, confounders that are included in the model)

  • Flexible models for confounders

    • Often it is important to adjust for confounders adequately, so a spline or other flexible function can help

    • If we are not interested in interpreting the effect of the confounder, then we are free to model it in a variety of different ways

      • Argues for flexibly modeling continuous confounder (limited by the effective sample size)

      • Note that if the confounder effect really is a straight line, then a cubic spline with linearity constraints on the tails should also fit adequately

    • There is not much to be lost and potentially much to be gained by pre-specifying a flexible model for a continuous confounder

      • If it is a straight line, a flexible spline or a linear term can capture the association (either would adjust for confounding)

      • If it is not a straight line, a flexible spline can capture the association while a linear term cannot (the spline would more completely adjust for confounding)

Code
ggplot(fev, aes(x=height, y=logfev)) + geom_point() + geom_smooth(method=lm, se=FALSE) +
  geom_smooth(method=lm, formula=y ~ ns(x, df=3), se=FALSE, color="Green")

Linear (straight line) and natural spline (3 d.f.) fits to log FEV versus height. The relationship is approximately a straight line, which can be captured either by assuming a straight-line association or by allowing a flexible fit.
  • Example interpretation for a linear model with FEV as the outcome and smoking as the predictor of interest, controlling for age, height, and sex. There are 654 observations in the dataset, so the allowed model complexity is at least 43 degrees of freedom.
Code
nrow(fev) / 15
[1] 43.6
Code
fev$smoker <- (fev$smoke=="current smoker")+0
fev$male <- (fev$sex=="male")+0
m1 <- lm(fev ~ smoker + ns(age,3) + ns(height,3) + male, data= fev)
coeftest(m1, vcov=sandwich(m1))
t test of coefficients
  Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.090121 0.071522 15.2418 0.0000e+00
smoker -0.147092 0.076797 -1.9153 5.5892e-02
ns(age, 3)1 0.494765 0.102371 4.8330 1.6823e-06
ns(age, 3)2 0.506067 0.205409 2.4637 1.4011e-02
ns(age, 3)3 0.807844 0.147608 5.4729 6.3400e-08
ns(height, 3)1 1.430324 0.088439 16.1731 0.0000e+00
ns(height, 3)2 3.262242 0.215641 15.1281 0.0000e+00
ns(height, 3)3 2.709726 0.158847 17.0588 0.0000e+00
male 0.096504 0.033492 2.8814 4.0906e-03
  • The betas for the naturally-splined terms (age, height) are difficult to interpret directly. We could test for an age and/or height effect using the Wald approach if that were of scientific interest.

  • For the predictor of interest, smoker, we have the usual interpretation in terms of holding other covariates constant: “Among subjects of the same age, height, and sex but differing in smoking status…” However, we may be controlling for the confounding effects of age and height more completely by using flexible spline functions.