Lecture 04
October 30, 2025
Types of variables
Binary data: e.g. sex, death
Nominal (unordered categorical) data: e.g. race, marital status
Ordinal (ordered categorical data): e.g. cancer stage, asthma severity
Quantitative data: e.g. age, blood pressure
Right censored data: e.g. time to death
The measures used to summarize and compare distributions vary according to the type of variable
Means: Binary, quantitative
Medians: Ordered, quantitative, censored
Proportions: Binary, nominal, ordinal
Odds: Binary, nominal, ordinal
Hazards: Censored
Which regression model to use is based on the parameter being compared across groups
| Parameter | Approach |
|---|---|
| Means | Linear regression |
| Geometric means | Linear regression on log scale |
| Odds | Logistic regression |
| Rates | Poisson regression |
| Hazards | Proportional Hazards (Cox) regression |
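As a hedged illustration, these parameter/approach pairs map onto familiar R model-fitting calls (here d, y, x, time, and status are hypothetical placeholder names):

library(survival)  # for Surv() and coxph()
fit.mean  <- lm(y ~ x, data = d)                      # means: linear regression
fit.gmean <- lm(log(y) ~ x, data = d)                 # geometric means: linear regression on log scale
fit.odds  <- glm(y ~ x, family = binomial, data = d)  # odds: logistic regression
fit.rate  <- glm(y ~ x, family = poisson, data = d)   # rates: Poisson regression
fit.haz   <- coxph(Surv(time, status) ~ x, data = d)  # hazards: Cox regression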
| Notation | Meaning |
|---|---|
| \(Y_i\) | Response measured on the \(i\)th subject |
| \(X_i\) | Value of the predictor measured on the \(i\)th subject |
| \(\theta_i\) | Parameter summarizing distribution of \(Y_i \mid X_i\) |
The parameter (\(\theta_i\)) might be the mean, geometric mean, odds, rate, instantaneous risk of an event (hazard), etc.
In linear regression on means, \(\theta_i = E[Y_i | X_i]\)
Choice of correct \(\theta_i\) should be based on scientific understanding of problem
General notation for simple regression model
\[g(\theta_i) = \beta_0 + \beta_1 \times X_i\]
where:
| Notation | Meaning |
|---|---|
| \(g(\cdot)\) | Link function used for modeling |
| \(\beta_0\) | Intercept |
| \(\beta_1\) | Slope for predictor \(X\) |
The link function is often either the identity function (for modeling means) or log (for modeling geometric means, odds, hazards)
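As a small sketch, these links written out as plain R functions:

identity_link <- function(theta) theta      # identity link: model the mean directly
log_link <- function(theta) log(theta)      # log link: geometric means, rates, hazards
logit_link <- function(p) log(p / (1 - p))  # logit link: log of the odds for a probability p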
Borrowing information
Use other groups to make estimates in groups with sparse data
Intuitively, 67 and 69 year olds would provide some relevant information about 68 year olds
Assuming a straight line relationship tells us about other, even more distant, individuals
If we do not want to assume a straight line, we may only want to borrow information from nearby groups
Defining “Contrasts”
Define a comparison across groups to use when answering scientific questions
If the straight line relationship holds, the slope is the difference in parameter between groups differing by 1 unit in \(X\)
If the relationship in the parameter is non-linear, the slope is still the average difference in the parameter between groups differing by 1 unit in \(X\)
Slope is a (first order or linear) test for trend in the parameter
Statistical jargon: “a contrast” across groups
The major difference between different regression models is the interpretation of the parameters
How do I want to summarize the outcome?
Mean, geometric mean, odds, hazard
How do I want to compare groups?
Answering these two simple questions provides a starting road-map as to which regression model to choose
Issues related to the inclusion of covariates remain the same
Address the scientific question: Predictor of interest, effect modification
Address confounding
Increase precision
Use logistic regression when you want to make inference about the odds
Allows for continuous (or multiple) grouping variables
Is OK with binary grouping variables too
Compares odds of responses across groups using ratios
Binary response variable
When using regression with binary response variables, we typically model the (log) odds using logistic regression
Conceptually there should be no problem modeling the proportion (which is the mean of the distribution)
However, there are several technical reasons why we do not use linear regression very often with binary responses
Why not use linear regression for binary responses?
Many misconceptions about the advantages and disadvantages of analyzing the odds
Reasons I consider valid: Scientific basis
Uses of odds ratios in case control studies
Plausibility of linear trends and no effect modifiers
Reasons I consider valid: Statistical basis
First (scientific) reason: Case-Control Studies
Studying a rare disease, so we do the study in reverse
e.g. find subjects with cancer (and suitable controls) and then ascertain exposure of interest
Estimate distribution of the “effect” across groups defined by “cause”
Proportion (or odds) of smokers among people with or without lung cancer
| | Lung Cancer + | Lung Cancer - |
|---|---|---|
| Smoker | a | b |
| Non-Smoker | c | d |
In contrast, a cohort study samples by exposure (smoking) and then estimates the distribution of the effect in exposure groups
In a case-control study, we cannot estimate prevalence (without knowing selection probabilities)
Odds ratios are estimable in either case-control or cohort sampling scheme
Cohort study: Odds of cancer among smokers compared to odds of cancer among non-smokers
Case-control study: Odds of smoking among cancer cases compared to odds of smoking among controls
Mathematically, these two odds ratios are the same: both equal \(ad/bc\) in the table above
Odds ratios are easy to interpret when investigating rare events
Odds = prob / (1 - prob)
For rare events, (1 - prob) is approximately 1
Odds is approximately the probability
Odds ratios are approximately risk ratios
Case-control studies usually used when events are rare
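A quick numerical check of the rare-event approximation, using hypothetical risks of 1% and 2%:

p1 <- 0.02; p0 <- 0.01                           # hypothetical risks in two groups
odds1 <- p1/(1-p1); odds0 <- p0/(1-p0)           # the corresponding odds
c(risk.ratio = p1/p0, odds.ratio = odds1/odds0)  # 2.00 versus about 2.02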
Second (scientific) reason: Linearity
Proportions are bounded by 0 and 1
It is thus unlikely that a straight line relationship would exist between a proportion and a predictor
Unless the predictor itself is bounded
Otherwise, there eventually must be a threshold above which the probability does not increase (or only increases a little)
expit <- function(x) {exp(x)/(1+exp(x))}  # inverse logit: maps log odds to probability
plot(function(x) expit(x), -4, 4, ylab="Probability", xlab="Predictor")
Third (scientific) reason: Effect modification
The restriction on the range of probabilities makes it likely that effect modification will be present when summarizing with proportions
Example: Is the association between 2-year relapse rates and having a positive scan modified by gender?
Women relapse 40% of the time when the scan is negative, and 95% of the time when the scan is positive (an increase of 55 percentage points)
If men relapse 75% of the time when the scan is negative, then a positive scan can increase the relapse rate to at most 100%, an increase of at most 25 percentage points
| Proportions | Women | Men |
|---|---|---|
| Negative Scan | 40% | 75% |
| Positive Scan | 95% | (up to 100%) |
| Difference | 55% | Up to 25% |
| Ratio | 2.38 | \(\leq 1.33\) |
With the odds, the association can hold without effect modification
| Odds | Women | Men |
|---|---|---|
| Negative Scan | 0.67 | 3 |
| Positive Scan | 19 | (up to \(\infty\)) |
| Ratio | 28.5 | \(< \infty\) |
If the odds of relapse given a positive scan in men were 85.5, then the odds ratio would be exactly 28.5 (no effect modification)
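The arithmetic behind the two tables can be checked directly:

odds <- function(p) p / (1 - p)  # convert a proportion to odds
odds(0.40); odds(0.95)           # women: 0.67 and 19
odds(0.95) / odds(0.40)          # women's odds ratio: 28.5
odds(0.75)                       # men, negative scan: 3
odds(0.75) * 28.5                # men's positive-scan odds would need to be 85.5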
Fourth (statistical) reason: Mean-variance relationship
Classical linear regression requires equal variances across each predictor group
But, with binary data, the variance within a group depends on the mean
For binary \(Y\), \(E(Y) = p\) and \(Var(Y) = p(1-p)\)
With robust standard errors, the mean-variance relationship is not a major problem. However, a logistic model that correctly models the mean-variance relationship will be more efficient.
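A one-line picture of this mean-variance relationship shows the variance is largest at \(p = 0.5\) and shrinks to 0 at the boundaries:

p <- seq(0, 1, by = 0.01)
plot(p, p * (1 - p), type = "l", xlab = "p = E(Y)", ylab = "Var(Y) = p(1-p)")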
Modeling the odds of binary response variable \(Y\) on predictor \(X\)
Distribution: \(\textrm{Pr}(Y_i = 1) = p_i\)
Model: \(\textrm{logit}(p_i) = \textrm{log}\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 \times X_i\)
When \(X_i = 0\): log odds = \(\beta_0\)
When \(X_i = x\): log odds = \(\beta_0 + \beta_1 \times x\)
When \(X_i = x+1\): log odds = \(\beta_0 + \beta_1 \times x + \beta_1\)
To interpret as odds, exponentiate the regression parameters
To interpret as proportions, convert the odds (remember proportion = odds / (1 + odds))
Most common interpretations found by exponentiating the coefficients
Odds when predictor is 0 found by exponentiating the intercept: \(\exp(\beta_0)\)
Odds ratio between groups differing in the values of the predictor by 1 unit found by exponentiating the slope: \(\exp(\beta_1)\)
Stata commands
logit respvar predvar [, robust]
logistic respvar predvar [, robust]

R Commands
With rms package, lrm(respvar ~ predvar, ...)
robcov(fit) gives robust standard error estimates for a model fit with lrm()
In general, glm(respvar ~ predvar, family="binomial")
sandwich() in the sandwich library can give robust standard errors when using glm
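As a hedged sketch of that workflow (respvar, predvar, and mydata are placeholder names):

library(sandwich)  # heteroskedasticity-consistent covariance estimators
library(lmtest)    # coeftest() accepts a user-supplied covariance matrix
fit <- glm(respvar ~ predvar, family = "binomial", data = mydata)
coeftest(fit, vcov = vcovHC(fit, type = "HC0"))  # Wald tests with robust SEs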
Dataset at https://statcomp2.app.vumc.org/modern-regression/lectures/data/
Describes the survival status of individual passengers on the Titanic
Data on age available for many, but not all, subjects (data continually being updated)
Response variable is Survival
Predictor variable is Age
Possibly different age effects by sex (effect modification by sex)
#| fig-cap: Missing data patterns in the Titanic dataset
library(rms)
library(ggplot2)
titanic <- read.csv(file="data/titanic3.csv")
plot(naclus(titanic)) # study patterns of missing values
#| fig-cap: Scatterplot of age versus survival in the Titanic data with lowess smooth. This simple plot is not very useful because survival is either 0 or 1, making it hard to visualize any trends.
ggplot(titanic, aes(x=age, y=survived)) + geom_jitter(width=0, height=.02, alpha=.5) + geom_smooth()
#| fig-cap: Age versus survival by sex in the Titanic data, using a super smoother. The trends are clearer with this smoothing approach.
with(titanic,
plsmo(age, survived, group=sex, datadensity=TRUE, ylab="Survived (1=Yes, 0=No)", xlab="Age (years)")
)
Comments on the plots
Age is missing for many subjects, which we will not worry about in the following analysis
The simple scatterplot, even with a superimposed lowess smooth, is worthless. I have jittered the points and altered their opacity to help visualize overlapping points.
More advanced plotting available in R (in this case, the plsmo() function) can help to visualize the data
Regression model for survival on age (ignoring possible effect modification for now)
Answer question by assessing linear trends in log odds of survival by age
Estimate the best fitting line to log odds of survival within age groups
\[\textrm{logodds}(\textrm{Survival} | \textrm{Age}) = \beta_0 + \beta_1 \times \textrm{Age}\]
An association will exist if the slope \(\beta_1\) is nonzero
In that case, the odds (and probability) of survival will be different across different age groups
m.titanic <- glm(survived ~ age, data=titanic, family = "binomial")
summary(m.titanic)

| | Estimate | Std. Error | z value | Pr(>|z|) |
|---|---|---|---|---|
| (Intercept) | -0.1365340 | 0.1447153 | -0.94347 | 0.345443 |
| age | -0.0078985 | 0.0044065 | -1.79246 | 0.073059 |

(Dispersion parameter for binomial family taken to be 1)
| Null deviance: | 1414.6 on 1045 degrees of freedom |
| Residual deviance: | 1411.4 on 1044 degrees of freedom |
(263 observations deleted due to missingness)
AIC: 1415.38
Number of Fisher Scoring Iterations: 4
\(\textrm{logodds}(\textrm{Survival} | \textrm{Age}) = -0.1365 - 0.007899 \times \textrm{Age}\)
General interpretation
Intercept is labeled “(Intercept)”
Slope for age is labeled “age”
Interpretation of intercept
Estimated log odds of survival for newborns (age=0) is \(-0.136534\)
Odds of survival for newborns is \(e^{-0.136534} = 0.8724\)
Probability of survival
Prob = odds / (1 + odds)
\(0.8724 / (1 + .8724) = 0.4659\)
predict(m.titanic, newdata=data.frame(age=0), type='response')
        1
0.4659194
Interpretation of slope
Estimated difference in the log odds of survival for two groups differing by one year in age is \(-0.0078985\)
This estimate averages over males and females
Older groups tend to have lower log odds
Odds Ratio: \(e^{-0.0078985} = 0.9921\)
For a five-year difference in age: \(e^{-0.0078985 \times 5} = 0.9612\)
In Stata use “lincom age, or” or “lincom 5*age, or”
Note that if the straight line relationship does not hold true, we interpret the slope as an average difference in the log odds of survival per one year difference in age
There are several ways to get the odds ratio and confidence interval in R
# The coefficient and confidence interval (on the log-odds scale)
coef(m.titanic)["age"]
         age 
-0.007898504

confint.default(m.titanic, "age")

| | 2.5 % | 97.5 % |
|---|---|---|
| age | -0.016535 | 0.00073809 |
# Odds ratio for age and confidence interval for age (1 year increase)
exp(coef(m.titanic)["age"])
      age 
0.9921326

exp(confint.default(m.titanic, "age"))

| | 2.5 % | 97.5 % |
|---|---|---|
| age | 0.9836 | 1.0007 |
# Odds ratio for age and confidence interval for age (5 year increase)
exp(5*coef(m.titanic)["age"])
      age 
0.9612771

exp(5*confint.default(m.titanic, "age"))

| | 2.5 % | 97.5 % |
|---|---|---|
| age | 0.92065 | 1.0037 |
Using finalfit to create a nicer output table of the coefficients and confidence intervals
library(finalfit)
library(rms)
explanatory = c("age")
titanic$survived.factor <- factor(titanic$survived, levels=0:1, labels=c("Died","Survived"))
dependent = 'survived.factor'
label(titanic$age) <- "Age (years)"
finalfit(.data = titanic, dependent, explanatory)

| Dependent: survived.factor | | Died | Survived | OR (univariable) | OR (multivariable) |
|---|---|---|---|---|---|
| Age (years) | Mean (SD) | 30.5 (13.9) | 28.9 (15.1) | 0.99 (0.98-1.00, p=0.073) | 0.99 (0.98-1.00, p=0.073) |
The ideas of Signal and Noise found in simple linear regression do not translate well to logistic regression
We do not tend to quantify an error distribution with logistic regression
Valid statistical inference (CIs, p-values) about associations requires three general assumptions
Assumption 1: Approximately Normal distributions for the parameter estimates
Large N
Needed for either classical logistic regression or robust standard errors
Definition of large depends on the underlying probabilities (odds)
Recall the rule of thumb for chi-squared tests based on the expected number of events
Assumption 2: Assumptions about the independence of observations
Classical regression: Independence of all observations
Robust standard errors: Correlated observations within identified clusters
Assumption 3: Assumptions about variance of observations within groups
Classical regression: Mean-variance relationship for binary data
Classical logistic regression estimates SEs using model-based estimates
Hence in order to satisfy this requirement, linearity of log odds across groups must hold
Robust standard errors
Allows unequal variance across groups
Hence, do not need linearity of log odds across groups to hold
Valid statistical inference (CIs, p-values) about odds of response in specific groups requires a further assumption
Assumption 4: Adequacy of the linear model
If we are trying to borrow information about the log odds from neighboring groups, and we are assuming a straight line relationship, the straight line needs to be true
Needed for either classical or robust standard errors
Note that we can model transformations of the measured predictor if we feel a straight line is not appropriate
Inference about individual observations (prediction intervals, P-values) in specific groups requires no further assumptions because we have binary data
For binary data, if we know the mean (proportion), we know everything about the distribution including the variance
This differs from linear regression where we can have a correct model for the mean, but the assumption about the error distribution (Normality, homoskedasticity) can be incorrect
Slope is statistically different from 0 using robust standard errors
Observed data is atypical of a setting with no linear trend in odds of response across groups
Data suggests evidence of a trend toward larger (or smaller) odds in groups having larger values of the predictor
(To the extent the data appears linear, estimates of the group odds or probabilities will be reliable)
Many possible reasons why the slope is not statistically different from 0 using robust standard errors
There may be no association between the response and predictor
There may be an association in the parameter considered, but the best fitting line has zero slope
There may be a first order trend in the log odds, but we lacked the precision to be confident that it truly exists (a type II error)
Recall in our Titanic example that the effect of age appeared to differ by sex
We ignored this difference earlier, so our estimated age effect was a (weighted) average of the age effect in males and the age effect in females
Here is the plot again describing the trends we see in survival by age and sex (using plsmo).
#| fig-cap: Age versus survival by gender in the Titanic data, using a super smoother.
with(titanic,
plsmo(age, survived, group=sex, datadensity=TRUE, ylab="Survived (1=Yes, 0=No)", xlab="Age (years)")
)
We could describe the observed differences in two ways, both being correct
Gender modifies the age effect
Age modifies the gender effect
fit.titanic.female <- glm(survived ~ age, data=titanic, subset=sex=="female", family="binomial")
fit.titanic.female

| | Estimate | Std. Error | z value | Pr(>|z|) |
|---|---|---|---|---|
| (Intercept) | 0.493374 | 0.2541885 | 1.9410 | 0.0522612 |
| age | 0.022516 | 0.0085354 | 2.6379 | 0.0083415 |

| Degrees of Freedom | 387 Total (i.e. Null); 386 Residual |
| Null Deviance | 434.161 |
| Residual Deviance | 426.878 |
| AIC | 430.878 |
(78 observations deleted due to missingness)
fit.titanic.male <- glm(survived ~ age, data=titanic, subset=sex=="male", family="binomial")
fit.titanic.male

| | Estimate | Std. Error | z value | Pr(>|z|) |
|---|---|---|---|---|
| (Intercept) | -0.66076 | 0.2248071 | -2.9392 | 0.0032902 |
| age | -0.02376 | 0.0072757 | -3.2657 | 0.0010921 |

| Degrees of Freedom | 657 Total (i.e. Null); 656 Residual |
| Null Deviance | 667.847 |
| Residual Deviance | 656.557 |
| AIC | 660.557 |
(185 observations deleted due to missingness)
# Females
exp(5*coef(fit.titanic.female)["age"])
     age 
1.119161

exp(5*confint.default(fit.titanic.female,"age"))

| | 2.5 % | 97.5 % |
|---|---|---|
| age | 1.0294 | 1.2168 |
# Males
exp(5*coef(fit.titanic.male)["age"])
      age 
0.8879858

exp(5*confint.default(fit.titanic.male,"age"))

| | 2.5 % | 97.5 % |
|---|---|---|
| age | 0.82688 | 0.95361 |
\[\textrm{logit}(p_i | X_{1i},X_{2i}) = \beta_0 + \beta_1 \times X_{1i} + \beta_2 \times X_{2i} + \beta_3 \times X_{1i} \times X_{2i}\]
\[\textrm{logit}(p_i | X_{1i}, X_{2i}=0) = \beta_0 + \beta_1 \times X_{1i}\]
\[\textrm{logit}(p_i | X_{1i}, X_{2i}=1) = (\beta_0+\beta_2) + (\beta_1+\beta_3) \times X_{1i}\]
# female has already been defined in the dataset, but if I wanted to create this variable I could do so
titanic$female <- (titanic$sex=="female")+0
fit.titanic.interact <- glm(survived ~ age + female + age*female, data=titanic, family="binomial")
summary(fit.titanic.interact)

| | Estimate | Std. Error | z value | Pr(>|z|) |
|---|---|---|---|---|
|---|---|---|---|---|
| (Intercept) | -0.660761 | 0.2248071 | -2.9392 | 3.2902e-03 |
| age | -0.023760 | 0.0072757 | -3.2657 | 1.0921e-03 |
| female | 1.154134 | 0.3393376 | 3.4011 | 6.7106e-04 |
| age:female | 0.046276 | 0.0112156 | 4.1260 | 3.6909e-05 |
(Dispersion parameter for binomial family taken to be 1)
| Null deviance: | 1414.6 on 1045 degrees of freedom |
| Residual deviance: | 1083.4 on 1042 degrees of freedom |
(263 observations deleted due to missingness)
AIC: 1091.44
Number of Fisher Scoring Iterations: 4
# Interaction model
coef(fit.titanic.interact)
(Intercept)         age      female  age:female 
-0.66076066 -0.02375991  1.15413441  0.04627571
# Model fit just on male subjects
coef(fit.titanic.male)
(Intercept)         age 
-0.66076066 -0.02375991
# Model fit just on female subjects
coef(fit.titanic.female)
(Intercept)        age 
0.49337375 0.02251581
# Linear combinations from the interaction model give the female intercept and age slope
coef(fit.titanic.interact)[1] + coef(fit.titanic.interact)[3]
(Intercept) 
  0.4933738
coef(fit.titanic.interact)[2] + coef(fit.titanic.interact)[4]
       age 
0.02251581
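A possible extension, not shown above: the standard error of the female age slope \(\beta_1 + \beta_3\) from the interaction model, using the coefficient covariance matrix:

V <- vcov(fit.titanic.interact)  # covariance matrix of the coefficient estimates
# Var(b1 + b3) = Var(b1) + Var(b3) + 2*Cov(b1, b3)
sqrt(V["age", "age"] + V["age:female", "age:female"] + 2 * V["age", "age:female"])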
3.3 Comments on Interpretation
The slope for age is expressed as a difference in group log odds, not the difference due to aging. We did not do a longitudinal study in which repeated measurements were taken on the same subject.
If the group log odds are truly linear, then the slope has an exact interpretation as the difference in log odds of survival between groups differing by one year in age (at any age)
Otherwise, the slope estimates the first order trend of the sample data and we should not treat the estimates of group odds or probabilities as accurate
It is difficult to see in the above example, but the CIs around the odds ratios are not symmetric
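One way to see the asymmetry is to compare the distance from the estimated odds ratio to each confidence limit:

or <- exp(5 * coef(m.titanic)["age"])             # 0.9613
ci <- exp(5 * confint.default(m.titanic, "age"))  # 0.92065 to 1.0037
or - ci[1]  # distance to the lower limit: about 0.0406
ci[2] - or  # distance to the upper limit: about 0.0424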
“From logistic regression analysis, we estimate that for each 5 year difference in age, the odds of survival on the Titanic decreased by 3.9%, though this estimate is not statistically significant (\(p = 0.07\)). A 95% CI suggests that this observation is not unusual if a group that is five years older might have an odds of survival that was anywhere between 7.9% lower and 0.4% higher than the younger group.”
The confidence interval and statistical test given in the output is called a Wald test. Other tests (Score, Likelihood Ratio) are also possible.
All tests are asymptotically equivalent
The Wald test is easiest to obtain, but generally performs the poorest in small sample sizes
The Likelihood Ratio test performs the best in small samples. We will discuss it later, including how to obtain the test using post-estimation commands.
The Score test is not bad in small samples, but is often hard to obtain from software. It is exactly equal to the Chi-squared test for binary outcomes and categorical predictors.
3.3.1 Bayesian Estimates and Interpretation
The Bayesian approach to the logistic model requires specifying the model, prior distributions, and the likelihood
The model
Prior distributions on parameters
For the simple logistic regression model, we have parameters \(\beta_0\) and \(\beta_1\).
For now, we will use default prior distributions that are intended to be weakly informative in that they provide moderate regularization and help stabilize computation. See the Stan documentation for more details
Appropriate priors can be based on scientific considerations
Sensitivity analyses can evaluate the robustness of findings to different prior assumptions
The likelihood
\[\binom{n}{y} \pi^{y} (1 - \pi)^{n - y},\]
where \(y\) is the number of events observed in \(n\) trials and \(\pi\) is the event probability.
Because \(\pi\) is a probability, for a binomial model the link function \(g\) maps between the unit interval (the support of \(\pi\)) and the set of all real numbers \(\mathbb{R}\). When applied to a linear predictor \(\eta\) with values in \(\mathbb{R}\), the inverse link function \(g^{-1}(\eta)\) therefore returns a valid probability between 0 and 1.
The two most common link functions used for binomial GLMs are the logit and the probit.
With the logit (or log-odds) link function \(g(x) = \ln{\left(\frac{x}{1-x}\right)}\), the likelihood for a single observation becomes
\[\binom{n}{y}\left(\text{logit}^{-1}(\eta)\right)^y \left(1 - \text{logit}^{-1}(\eta)\right)^{n-y} = \binom{n}{y} \left(\frac{e^{\eta}}{1 + e^{\eta}}\right)^{y} \left(\frac{1}{1 + e^{\eta}}\right)^{n - y}\]
With the probit link function \(g(x) = \Phi^{-1}(x)\), the likelihood becomes
\[\binom{n}{y} \left(\Phi(\eta)\right)^{y} \left(1 - \Phi(\eta)\right)^{n - y},\]
where \(\Phi\) is the CDF of the standard normal distribution.
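A minimal sketch of fitting this model, assuming the rstanarm package (whose defaults provide the weakly informative priors described above):

library(rstanarm)
# Sketch: Bayesian logistic regression of survival on age with default priors.
# The seed argument is a hypothetical choice for reproducibility.
m.bayes <- stan_glm(survived ~ age, data = titanic,
                    family = binomial(link = "logit"), seed = 42)
summary(m.bayes)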
The mean_ppd is the sample average posterior predictive distribution of the outcome variable.
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
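One way (assuming the rstanarm fit sketched above) to inspect the priors actually used, including the autoscaled scale for the age coefficient:

prior_summary(m.bayes)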
Interpretation
Slope for age is of primary scientific importance
A priori, we assumed no association between age and survival. Specifically, we assumed a Normal prior with location (mean) 0 and scale (standard deviation) 0.17 for \(\beta_1\).
Conditional on the data, we estimate that for every 1 year increase in age, the log odds of survival changes by -0.0079 (95% credible interval -0.0169 to 0.0004).
To obtain the posterior odds ratio and 95% credible intervals, some additional commands are needed
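A possible sketch of those commands, working with posterior draws from the hypothetical rstanarm fit above:

draws <- as.matrix(m.bayes)      # posterior draws of the coefficients
or5 <- exp(-5 * draws[, "age"])  # odds ratio comparing a 5-year-younger subject to an older one
mean(or5)                        # posterior mean odds ratio (about 1.04)
quantile(or5, c(0.025, 0.975))   # 95% credible interval (about 1.00 to 1.09)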
The association between age and probability of survival was estimated using a Bayesian logistic regression model. The model did not adjust for other covariates and assumed a logit link function and Binomial likelihood. We assumed a weakly informative prior distribution for the log odds of survival given age (Normal prior with mean 0 and scale 0.17 for \(\beta_1\)). Conditional on the data, the posterior mean estimate indicates that comparing two subjects who differ in age by 5 years, the younger subject has a 1.04-fold increased odds of survival compared to the older subject. A 95% credible interval for this posterior odds ratio is from 1.00 to 1.09.
Note that this model does not consider gender, so it is averaging over the males and females. We will revisit this analysis, analyzing the age effect separately in males and females.