Lecture 07
October 30, 2025

library(rms)
library(ggplot2)
tryCatch(source('pander_registry.R'), error = function(e) invisible(e))

Scientific questions
Most often scientific questions are translated into comparing the distribution of some response variable across groups of interest
Groups are defined by the predictor of interest (POI)
Categorical predictors of interest: Treatment or control, knockout or wild type, ethnic group
Continuous predictors of interest: Age, BMI, cholesterol, blood pressure
If we only considered the response and POI, this is referred to as a simple (linear, logistic, PH, etc.) regression model
Often we need to consider additional variables other than POI because…
We want to make comparisons in different strata
Groups being compared differ in other ways
Less variability in the response if we control for other variables
Statistics: Covariates other than the Predictor of Interest are included in the model as…
Effect modifiers
Confounders
Precision variables
Two main statistical methods to adjust for covariates
Stratified analyses
Combines information about associations between the response and the POI across strata
Will not borrow information about (or even estimate) associations between response and adjustment variables
Adjustment in multiple regression
Can (but does not have to) borrow information about associations between response and all modeled variables
Could conduct a stratified analysis using regression
In practice, when researchers say they are using regression, they are almost certainly doing so to borrow information
Example: Is smoking associated with FEV in teenagers?
Stratified Analysis
Separately estimate mean FEV in 19 year olds, 18 year olds, 17 year olds, etc. by smoking status
Average means (using weights) to come up with overall effect of smoking on FEV
Key: Not trying to estimate a common effect of age across strata (not borrowing information across age)
No estimate of the age effect in this analysis
Multiple regression
Fit a regression model with FEV as the outcome, smoking as the POI, and age as an adjustment variable
Will provide you an estimate of the association between FEV and age (but do you care?)
Can borrow information across ages to estimate the age effect
Linear/spline function for age would borrow information
Separate indicator variable for each age would borrow less information (would still assume that all 19.1 and 19.2 year olds are the same)
Adjustment for two factors: Age and Sex
Stratified analyses
Calculate separate means by age and sex, combine using weighted averages as before
This method adjusts for the interaction of age and sex (in addition to age and sex main effects)
Multiple regression
“We adjusted for age and sex…” or “Holding age and sex constant, we found …”
Almost certainly the researchers adjusted for age and sex, but not the interaction of the two variables (though they could have)
General approach to conducting a stratified analysis
Divide the data into strata based on all combinations of the “adjustment” covariates
Within each stratum, perform an analysis comparing responses across POI groups
Use (weighted) average of estimated associations across groups
Combining responses: Easy if estimates are independent and approximately Normally distributed
For independent strata \(k\), \(k = 1, \ldots, K\)
Estimate in stratum \(k\): \(\hat{\theta}_k \sim N (\theta_k, se^2_k)\)
Weight in stratum \(k\): \(w_k\)
Stratified estimate is
\[\hat{\theta} = \frac{\sum_{k=1}^K w_k \hat{\theta}_k}{\sum_{k=1}^K w_k} \sim N\left(\frac{\sum_{k=1}^K w_k \theta_k}{\sum_{k=1}^K w_k}, \frac{\sum_{k=1}^K w_k^2 se^2_k}{\left(\sum_{k=1}^K w_k\right)^2} \right)\]
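As a small numerical sketch of this formula (with made-up stratum estimates, standard errors, and weights, not data from the lecture):

```python
import numpy as np

# Hypothetical stratum estimates theta_hat_k, standard errors se_k, weights w_k
theta_hat = np.array([1.2, 0.8, 1.0])
se = np.array([0.30, 0.20, 0.25])
w = np.array([0.5, 0.3, 0.2])

# Weighted stratified estimate and its standard error
theta_strat = np.sum(w * theta_hat) / np.sum(w)
se_strat = np.sqrt(np.sum(w**2 * se**2)) / np.sum(w)
```

Note that strata with large weights and small standard errors dominate both the point estimate and its precision.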
How to choose the weights?
Scientific role of the stratified estimate
Statistical precision of the stratified estimate
Just because the data are more variable in women than men, does that mean I should down-weight women? Maybe, maybe not.
Weights are usually chosen on statistical criteria
Weights should be chosen based on the statistical role of the adjustment variable
Effect modifiers
Confounding
Precision
Scientific criteria
Sometimes we anticipate effect modification by some variables, but
We do not choose to report estimates of the association between the response and POI in each stratum separately
We are interested in estimating the “average association” for a population
Choosing weights according to scientific importance
Want to estimate the average effect in some population of interest
The real population, or,
Some standard population used for comparisons
Example: Ecologic studies comparing incidence of hip fractures across countries
Hip fracture rates increase with age
Industrialized countries and developing world have very different age distributions
Choose a standard age distribution to remove confounding by age
Comment on oversampling
In political polls or epidemiologic studies we sometimes oversample some strata in order to gain precision
For fixed maximal sample size, we gain the most precision if the stratum sample size is proportional to the weight times the standard deviation of measurements in the stratum
Example: Oversample swing-voters relative to individuals who we can be more certain about their voting preferences
For independent strata \(k\), \(k = 1, \ldots, K\)
Sample size in stratum \(k\): \(n_k\)
Estimate in stratum \(k\): \(\hat{\theta}_k \sim N\left(\theta_k, se^2_k = \frac{V_k}{n_k} \right)\)
Importance weight for stratum \(k\): \(w_k\)
Optimal sample size when \(N = \sum_{k=1}^K n_k\) is:
\[\frac{w_1 \sqrt{V_1}}{n_1} = \frac{w_2 \sqrt{V_2}}{n_2} = \ldots = \frac{w_k \sqrt{V_k}}{n_k}\]
If the true association is the same in each stratum, we are free to consider statistical criteria
Statistical criteria
Optimal statistical weights
For independent strata \(k\), \(k = 1, \ldots, K\)
Sample size in stratum \(k\): \(n_k\)
Estimate in stratum \(k\): \(\hat{\theta}_k \sim N\left(\theta_k, se^2_k = \frac{V_k}{n_k} \right)\)
Importance weight for stratum \(k\): \(w_k\)
Optimal sample size when \(N = \sum_{k=1}^K n_k\) is:
\[\frac{w_1 \sqrt{V_1}}{n_1} = \frac{w_2 \sqrt{V_2}}{n_2} = \ldots = \frac{w_k \sqrt{V_k}}{n_k}\]
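A quick sketch of this allocation rule with hypothetical weights and variances: setting \(n_k \propto w_k \sqrt{V_k}\) makes \(w_k \sqrt{V_k} / n_k\) constant across strata.

```python
import numpy as np

# Hypothetical total sample size, importance weights, and stratum variances
N = 1000
w = np.array([0.5, 0.3, 0.2])
V = np.array([4.0, 1.0, 9.0])

# Allocate n_k proportional to w_k * sqrt(V_k)
n = N * w * np.sqrt(V) / np.sum(w * np.sqrt(V))

# The quantity w_k * sqrt(V_k) / n_k is now the same in every stratum
ratios = w * np.sqrt(V) / n
```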
We often ignore the aspect that variability may differ across strata
Example: Mantel-Haenszel Statistic
Popular method used to create a common odds ratio estimate across strata
Hypothesis test comparing odds (proportions) across two groups
Adjust for confounding in a stratified analysis
Weights chosen for statistical precision
Approximate weighting of difference in proportions based on harmonic means of sample sizes in each stratum
Usually viewed as a weighted odds ratio
(Why not weight by log odds or probabilities?)
For independent strata \(k\), \(k = 1, \ldots, K\)
Sample size in stratum \(k\): \(n_{1k}, n_{0k}\)
Estimates in stratum \(k\): \(\hat{p}_{1k}, \hat{p}_{0k}\)
Precision weight for stratum \(k\):
\[w_k = \frac{n_{1k} n_{0k}}{n_{1k} + n_{0k}} \div \sum_{k=1}^K \frac{n_{1k} n_{0k}}{n_{1k} + n_{0k}}\]
These weights work well in practice, and are not as complicated as some other weighting systems
Odds of being full professor by sex
. cc full female if year==95, by(field)
field | OR [95% Conf. Interval] M-H Weight
-----------------+-------------------------------------------------
Arts | .5384615 .2927119 .9835152 16.54545 (exact)
Other | .2540881 .1870044 .3440217 91.6448 (exact)
Prof | .3434705 .1640076 .7048353 14.42581 (exact)
-----------------+-------------------------------------------------
Crude | .290421 .226544 .3715365 (exact)
M-H combined | .3029764 .2378934 .385865
-------------------------------------------------------------------
Test of homogeneity (M-H) chi2(2) = 5.47 Pr>chi2 = 0.0648
Test that combined OR = 1:
Mantel-Haenszel chi2(1) = 99.10
Pr>chi2 = 0.0000

Questions about output
What hypothesis is being tested by the “Test of Homogeneity”?
Should we use the test of homogeneity to decide if we need to use the M-H adjustment?
Compare the Crude and M-H combined OR
Is there evidence of confounding?
Would there be evidence of confounding if the M-H estimate was further from the null (got smaller in this case)?
How would you determine if field is a precision variable from the output?
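As a check on the output above, the M-H combined OR can be recomputed by hand from the stratum 2×2 counts (female × full professor, 1995 data); this is a sketch in Python rather than the lecture's Stata/R workflow.

```python
# Stratum counts (a = Female & Full, b = Female & Assist/Assoc,
#                 c = Male & Full,   d = Male & Assist/Assoc)
strata = {
    "Arts":  (28, 52, 70, 70),
    "Other": (82, 205, 477, 303),
    "Prof":  (16, 26, 172, 96),
}

# Mantel-Haenszel combined OR: sum(a*d/n) / sum(b*c/n)
num = sum(a * d / (a + b + c + d) for a, b, c, d in strata.values())
den = sum(b * c / (a + b + c + d) for a, b, c, d in strata.values())
or_mh = num / den   # matches the M-H combined OR of .3029764 above

# The M-H weights in the output are b*c/n for each stratum
weights = {k: b * c / (a + b + c + d) for k, (a, b, c, d) in strata.items()}
```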
R code for similar output as above
salary <- read.csv(file="data/salary.csv")
salary.95 <- subset(salary, year==95)
salary.95$female <- factor((salary.95$sex=="F")+0,levels=0:1, labels=c("Male","Female"))
salary.95$full <- factor((salary.95$rank=="Full")+0,levels=0:1, labels=c("Assist/Assoc","Full"))
salary.95$field <- factor(salary.95$field)
partial.tables <- xtabs(~female + full + field, salary.95)
pander(partial.tables, keep.trailing.zeros = FALSE)

| female | full | Arts | Other | Prof |
|---|---|---|---|---|
| Male | Assist/Assoc | 70 | 303 | 96 |
| Male | Full | 70 | 477 | 172 |
| Female | Assist/Assoc | 52 | 205 | 26 |
| Female | Full | 28 | 82 | 16 |
marginal.table <- xtabs(~female + full, salary.95)
marginal.table

|  | Assist/Assoc | Full |
|---|---|---|
| Male | 469 | 719 |
| Female | 283 | 126 |
library(epiR)
epi.2by2(partial.tables)

             Outcome +    Outcome -      Total        Inc risk *
Exposed + 469 719 1188 39.48 (36.69 to 42.32)
Exposed - 283 126 409 69.19 (64.47 to 73.64)
Total 752 845 1597 47.09 (44.62 to 49.57)
Point estimates and 95% CIs:
-------------------------------------------------------------------
Inc risk ratio (crude) 0.57 (0.52, 0.63)
Inc risk ratio (M-H) 0.58 (0.53, 0.64)
Inc risk ratio (crude:M-H) 0.98
Inc odds ratio (crude) 0.29 (0.23, 0.37)
Inc odds ratio (M-H) 0.30 (0.24, 0.39)
Inc odds ratio (crude:M-H) 0.96
Attrib risk in the exposed (crude) * -29.72 (-34.98, -24.45)
Attrib risk in the exposed (M-H) * -28.77 (-39.96, -17.59)
Attrib risk (crude:M-H) 1.03
-------------------------------------------------------------------
M-H test of homogeneity of IRRs: chi2(2) = 6.957 Pr>chi2 = 0.031
M-H test of homogeneity of ORs: chi2(2) = 5.604 Pr>chi2 = 0.061
Test that M-H adjusted OR = 1: chi2(1) = 99.101 Pr>chi2 = <0.001
Wald confidence limits
M-H: Mantel-Haenszel; CI: confidence interval
* Outcomes per 100 population units
library(rms)
m.unadj <- lrm(full ~ female, data=salary.95)
dd <- datadist(salary.95)
options(datadist='dd')
m.unadj

|  | Model Likelihood Ratio Test | Discrimination Indexes | Rank Discrim. Indexes |
|---|---|---|---|
| Obs | LR chi2 | R2 | C |
| 1597 | 109.43 | 0.088401 | 0.61361 |
| Assist/Assoc | d.f. | g | Dxy |
| 752 | 1 | 0.47141 | 0.22722 |
| Full | Pr(> chi2) | gr | gamma |
| 845 | 0 | 1.6023 | 0.54988 |
| max |deriv| | gp | tau-a | |
| 0 | 0.11329 | 0.11329 | |
| Brier | |||
| 0.23233 |
| Coef | S.E. | Wald Z | Pr(>|Z|) | |
|---|---|---|---|---|
| Intercept | 0.42726 | 0.059355 | 7.1984 | 0 |
| female=Female | -1.23642 | 0.122446 | -10.0977 | 0 |
summary(m.unadj)

| Factor | Low | High | Diff. | Effect | S.E. | Lower 0.95 | Upper 0.95 |
|---|---|---|---|---|---|---|---|
| female - Female:Male | 1 | 2 | -1.2364 | 0.12245 | -1.4764 | -0.99643 | |
| Odds Ratio | 1 | 2 | 0.29042 | 0.22846 | 0.36919 |
Logistic regression Number of obs = 1597
LR chi2(1) = 109.43
Prob > chi2 = 0.0000
Log likelihood = -1049.533 Pseudo R2 = 0.0495
------------------------------------------------------------------------------
full | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
female | .290421 .035561 -10.10 0.000 .2284555 .3691939
------------------------------------------------------------------------------

m.unadj.arts <- lrm(full ~ female, data=salary.95[salary.95$field=="Arts",])
m.unadj.other <- lrm(full ~ female, data=salary.95[salary.95$field=="Other",])
m.unadj.prof <- lrm(full ~ female, data=salary.95[salary.95$field=="Prof",])

m.unadj.arts

|  | Model Likelihood Ratio Test | Discrimination Indexes | Rank Discrim. Indexes |
|---|---|---|---|
| Obs | LR chi2 | R2 | C |
| 220 | 4.6887 | 0.028228 | 0.57026 |
| Assist/Assoc | d.f. | g | Dxy |
| 122 | 1 | 0.28781 | 0.14052 |
| Full | Pr(> chi2) | gr | gamma |
| 98 | 0.030362 | 1.3335 | 0.3 |
| max |deriv| | gp | tau-a | |
| 2e-09 | 0.069738 | 0.069738 | |
| Brier | |||
| 0.24182 |
| Coef | S.E. | Wald Z | Pr(>|Z|) | |
|---|---|---|---|---|
| Intercept | 0.00000 | 0.16903 | 0.0000 | 1.000000 |
| female=Female | -0.61904 | 0.28899 | -2.1421 | 0.032188 |
summary(m.unadj.arts)

| Factor | Low | High | Diff. | Effect | S.E. | Lower 0.95 | Upper 0.95 |
|---|---|---|---|---|---|---|---|
| female - Female:Male | 1 | 2 | -0.61904 | 0.28899 | -1.1855 | -0.052625 | |
| Odds Ratio | 1 | 2 | 0.53846 | 0.30561 | 0.94874 |
m.unadj.other

|  | Model Likelihood Ratio Test | Discrimination Indexes | Rank Discrim. Indexes |
|---|---|---|---|
| Obs | LR chi2 | R2 | C |
| 1067 | 91.165 | 0.10927 | 0.62843 |
| Assist/Assoc | d.f. | g | Dxy |
| 508 | 1 | 0.5393 | 0.25685 |
| Full | Pr(> chi2) | gr | gamma |
| 559 | 0 | 1.7148 | 0.59478 |
| max |deriv| | gp | tau-a | |
| 0 | 0.12825 | 0.12825 | |
| Brier | |||
| 0.22855 |
| Coef | S.E. | Wald Z | Pr(>|Z|) | |
|---|---|---|---|---|
| Intercept | 0.45378 | 0.073463 | 6.1771 | 7e-10 |
| female=Female | -1.37007 | 0.149900 | -9.1399 | 0e+00 |
summary(m.unadj.other)

| Factor | Low | High | Diff. | Effect | S.E. | Lower 0.95 | Upper 0.95 |
|---|---|---|---|---|---|---|---|
| female - Female:Male | 1 | 2 | -1.3701 | 0.1499 | -1.6639 | -1.0763 | |
| Odds Ratio | 1 | 2 | 0.25409 | 0.1894 | 0.34086 |
m.unadj.prof

|  | Model Likelihood Ratio Test | Discrimination Indexes | Rank Discrim. Indexes |
|---|---|---|---|
| Obs | LR chi2 | R2 | C |
| 310 | 10.095 | 0.043396 | 0.564 |
| Assist/Assoc | d.f. | g | Dxy |
| 122 | 1 | 0.25115 | 0.12801 |
| Full | Pr(> chi2) | gr | gamma |
| 188 | 0.0014867 | 1.2855 | 0.48868 |
| max |deriv| | gp | tau-a | |
| 9e-09 | 0.061301 | 0.061301 | |
| Brier | |||
| 0.2307 |
| Coef | S.E. | Wald Z | Pr(>|Z|) | |
|---|---|---|---|---|
| Intercept | 0.58315 | 0.12740 | 4.5773 | 0.00000471 |
| female=Female | -1.06865 | 0.34233 | -3.1217 | 0.00179824 |
summary(m.unadj.prof)

| Factor | Low | High | Diff. | Effect | S.E. | Lower 0.95 | Upper 0.95 |
|---|---|---|---|---|---|---|---|
| female - Female:Male | 1 | 2 | -1.0687 | 0.34233 | -1.7396 | -0.39769 | |
| Odds Ratio | 1 | 2 | 0.34347 | 0.17559 | 0.67187 |
. sort field
. by field: logistic full female if year==95
----------------------------------------------------------------------------------
-> field = Arts
Logistic regression Number of obs = 220
LR chi2(1) = 4.69
Prob > chi2 = 0.0304
Log likelihood = -148.83634 Pseudo R2 = 0.0155
------------------------------------------------------------------------------
full | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
female | .5384615 .1556108 -2.14 0.032 .305608 .9487343
------------------------------------------------------------------------------
----------------------------------------------------------------------------------
-> field = Other
Logistic regression Number of obs = 1067
LR chi2(1) = 91.17
Prob > chi2 = 0.0000
Log likelihood = -692.78622 Pseudo R2 = 0.0617
------------------------------------------------------------------------------
full | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
female | .2540881 .0380877 -9.14 0.000 .1894041 .3408624
------------------------------------------------------------------------------
----------------------------------------------------------------------------------
-> field = Prof
Logistic regression Number of obs = 310
LR chi2(1) = 10.10
Prob > chi2 = 0.0015
Log likelihood = -202.74823 Pseudo R2 = 0.0243
------------------------------------------------------------------------------
full | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
female | .3434705 .1175803 -3.12 0.002 .175589 .6718641
------------------------------------------------------------------------------

m.adj <- lrm(full ~ female + field, data=salary.95, x=TRUE, y=TRUE)

m.adj

|  | Model Likelihood Ratio Test | Discrimination Indexes | Rank Discrim. Indexes |
|---|---|---|---|
| Obs | LR chi2 | R2 | C |
| 1597 | 114.38 | 0.092258 | 0.63218 |
| Assist/Assoc | d.f. | g | Dxy |
| 752 | 3 | 0.54186 | 0.26436 |
| Full | Pr(> chi2) | gr | gamma |
| 845 | 0 | 1.7192 | 0.37262 |
| max |deriv| | gp | tau-a | |
| 0 | 0.12956 | 0.13181 | |
| Brier | |||
| 0.23154 |
| Coef | S.E. | Wald Z | Pr(>|Z|) | |
|---|---|---|---|---|
| Intercept | 0.19017 | 0.14608 | 1.3017 | 0.19300 |
| female=Female | -1.19979 | 0.12358 | -9.7084 | 0.00000 |
| field=Other | 0.22203 | 0.15474 | 1.4348 | 0.15134 |
| field=Prof | 0.41096 | 0.18557 | 2.2146 | 0.02679 |
summary(m.adj)

| Factor | Low | High | Diff. | Effect | S.E. | Lower 0.95 | Upper 0.95 |
|---|---|---|---|---|---|---|---|
| female - Female:Male | 1 | 2 | -1.1998 | 0.12358 | -1.442 | -0.95758 | |
| Odds Ratio | 1 | 2 | 0.30126 | 0.23645 | 0.38382 | ||
| field - Arts:Other | 2 | 1 | -0.22203 | 0.15474 | -0.52533 | 0.081261 | |
| Odds Ratio | 2 | 1 | 0.80089 | 0.59136 | 1.0847 | ||
| field - Prof:Other | 2 | 3 | 0.18893 | 0.13535 | -0.076346 | 0.4542 | |
| Odds Ratio | 2 | 3 | 1.208 | 0.9265 | 1.5749 |
. xi: logistic full female i.field if year==95
i.field _Ifield_1-3 (_Ifield_1 for field==Arts omitted)
Logistic regression Number of obs = 1597
LR chi2(3) = 114.38
Prob > chi2 = 0.0000
Log likelihood = -1047.0583 Pseudo R2 = 0.0518
------------------------------------------------------------------------------
full | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
female | .3012563 .0372301 -9.71 0.000 .2364515 .3838222
_Ifield_2 | 1.248611 .1932157 1.43 0.151 .9219526 1.691009
_Ifield_3 | 1.508263 .2798894 2.21 0.027 1.048381 2.169877
------------------------------------------------------------------------------

The M-H OR (.303) differs slightly from the multiple logistic regression OR (.301).
Conceptually, they are attempting to do the same thing, but are using different weights
Multivariable logistic regression would allow you to control for continuous covariates without complete stratification
We could also fit the logistic regression models using robust standard error estimates
We did not fit a saturated model to exactly predict the probability of being a full professor in all 6 combinations of field and female. So, it is possible that we have a poorly fitting mean model, and because of the mean-variance relationship, our variance estimates could be poor too.
robcov(m.adj)

|  | Model Likelihood Ratio Test | Discrimination Indexes | Rank Discrim. Indexes |
|---|---|---|---|
| Obs | LR chi2 | R2 | C |
| 1597 | 114.38 | 0.092258 | 0.63218 |
| Assist/Assoc | d.f. | g | Dxy |
| 752 | 3 | 0.54186 | 0.26436 |
| Full | Pr(> chi2) | gr | gamma |
| 845 | 0 | 1.7192 | 0.37262 |
| max |deriv| | gp | tau-a | |
| 0 | 0.12956 | 0.13181 | |
| Brier | |||
| 0.23154 |
| Coef | S.E. | Wald Z | Pr(>|Z|) | |
|---|---|---|---|---|
| Intercept | 0.19017 | 0.14981 | 1.2693 | 0.204322 |
| female=Female | -1.19979 | 0.12362 | -9.7056 | 0.000000 |
| field=Other | 0.22203 | 0.15891 | 1.3972 | 0.162359 |
| field=Prof | 0.41096 | 0.18933 | 2.1706 | 0.029959 |
summary(robcov(m.adj))

| Factor | Low | High | Diff. | Effect | S.E. | Lower 0.95 | Upper 0.95 |
|---|---|---|---|---|---|---|---|
| female - Female:Male | 1 | 2 | -1.1998 | 0.12362 | -1.4421 | -0.95751 | |
| Odds Ratio | 1 | 2 | 0.30126 | 0.23643 | 0.38385 | ||
| field - Arts:Other | 2 | 1 | -0.22203 | 0.15891 | -0.5335 | 0.089434 | |
| Odds Ratio | 2 | 1 | 0.80089 | 0.58655 | 1.0936 | ||
| field - Prof:Other | 2 | 3 | 0.18893 | 0.13568 | -0.077011 | 0.45486 | |
| Odds Ratio | 2 | 3 | 1.208 | 0.92588 | 1.576 |
m.saturated <- lrm(full ~ field + female + field*female, data=salary.95)
m.saturated

|  | Model Likelihood Ratio Test | Discrimination Indexes | Rank Discrim. Indexes |
|---|---|---|---|
| Obs | LR chi2 | R2 | C |
| 1597 | 119.75 | 0.096433 | 0.6345 |
| Assist/Assoc | d.f. | g | Dxy |
| 752 | 5 | 0.56112 | 0.269 |
| Full | Pr(> chi2) | gr | gamma |
| 845 | 0 | 1.7526 | 0.37917 |
| max |deriv| | gp | tau-a | |
| 0 | 0.13413 | 0.13413 | |
| Brier | |||
| 0.2308 |
| Coef | S.E. | Wald Z | Pr(>|Z|) | |
|---|---|---|---|---|
| Intercept | 0.00000 | 0.16903 | 0.0000 | 1.0000000 |
| field=Other | 0.45378 | 0.18430 | 2.4621 | 0.0138111 |
| field=Prof | 0.58315 | 0.21166 | 2.7550 | 0.0058684 |
| female=Female | -0.61904 | 0.28899 | -2.1421 | 0.0321885 |
| **field=Other * female=Female** | -0.75104 | 0.32556 | -2.3069 | 0.0210585 |
| **field=Prof * female=Female** | -0.44961 | 0.44801 | -1.0036 | 0.3155748 |
rms::lrtest(m.saturated, m.adj)

stats:
| L.R. Chisq | d.f. | P |
|---|---|---|
| 5.3751 | 2 | 0.068048 |
formula1: full ~ field + female + field * female
formula2: full ~ female + field
# Predict probability of being a full professor at all 6 combinations of field and female
newdata <- expand.grid(female=levels(salary.95$female),
field=levels(salary.95$field)
)
newdata$phat.adj <- predict(m.adj, newdata, type = "fitted")
newdata$phat.sat <- predict(m.saturated, newdata, type = "fitted")
newdata

| female | field | phat.adj | phat.sat |
|---|---|---|---|
| Male | Arts | 0.54740 | 0.50000 |
| Female | Arts | 0.26705 | 0.35000 |
| Male | Other | 0.60161 | 0.61154 |
| Female | Other | 0.31268 | 0.28571 |
| Male | Prof | 0.64591 | 0.64179 |
| Female | Prof | 0.35465 | 0.38095 |
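The adjusted fitted probabilities can be recovered by hand by inverting the logit link using the coefficients printed for m.adj; a quick arithmetic sketch (in Python rather than R, coefficients copied from the output above):

```python
import math

# Coefficients from the m.adj output: Intercept, female=Female, field=Other, field=Prof
b0, b_female, b_other, b_prof = 0.19017, -1.19979, 0.22203, 0.41096

def expit(lp):
    """Inverse logit: convert a linear predictor to a probability."""
    return 1 / (1 + math.exp(-lp))

p_male_arts = expit(b0)                          # reference group (Male, Arts)
p_female_other = expit(b0 + b_female + b_other)
p_female_prof = expit(b0 + b_female + b_prof)
```

These reproduce the phat.adj column, which smooths over strata; the phat.sat column from the saturated model instead reproduces the raw stratum proportions exactly, as verified below.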
partial.tables[,,'Arts']

|  | Assist/Assoc | Full |
|---|---|---|
| Male | 70 | 70 |
| Female | 52 | 28 |
70/(70+70)

[1] 0.5

28/(28+52)

[1] 0.35
partial.tables[,,'Other']

|  | Assist/Assoc | Full |
|---|---|---|
| Male | 303 | 477 |
| Female | 205 | 82 |
477/(477+303)

[1] 0.6115385

82/(82+205)

[1] 0.2857143
partial.tables[,,'Prof']

|  | Assist/Assoc | Full |
|---|---|---|
| Male | 96 | 172 |
| Female | 26 | 16 |
172/(172+96)

[1] 0.641791

16/(16+26)

[1] 0.3809524
Types of variables
Binary data: e.g. sex, death
Nominal (unordered categorical) data: e.g. race, marital status
Ordinal (ordered categorical data): e.g. cancer stage, asthma severity
Quantitative data: e.g. age, blood pressure
Right censored data: e.g. time to death
Which regression model you choose to use is based on the parameter being compared across groups
| Parameter | Approach |
|---|---|
| Means | Linear regression |
| Geometric means | Linear regression on log scale |
| Odds | Logistic regression |
| Rates | Poisson regression |
| Hazards | Proportional Hazards (Cox) regression |
| \(Y_i\) | Response measured on the \(i\)th subject |
| \(X_i\) | Value of the predictor measured on the \(i\)th subject |
| \(W_{1i}, W_{2i}, \ldots\) | Value of the adjustment variable for the \(i\)th subject |
| \(\theta_i\) | Parameter summarizing distribution of \(Y_i | X_i\) |
The parameter (\(\theta_i\)) might be the mean, geometric mean, odds, rate, instantaneous risk of an event (hazard), etc.
In multiple linear regression on means, \(\theta_i = E[Y_i | X_i, W_{1i}, W_{2i}, \ldots]\)
Choice of correct \(\theta_i\) should be based on scientific understanding of problem
General notation for multiple regression model
\(g(\theta_i) = \beta_0 + \beta_1 \times X_i + \beta_2 \times W_{1i} + \beta_3 \times W_{2i} + \ldots\)
| \(g( )\) | Link function used for modeling |
|---|---|
| \(\beta_0\) | Intercept |
| \(\beta_1\) | Slope for predictor of interest \(X\) |
| \(\beta_j\) | Slope for covariate \(W_{j-1}\) |
The link function is often either none (for modeling means) or log (for modeling geometric means, odds, hazards)
Borrowing information
Use other groups to make estimates in groups with sparse data
Intuitively, 67 and 69 year olds would provide some relevant information about 68 year olds
Assuming a straight line relationship tells us about other, even more distant, individuals
If we do not want to assume a straight line, we may only want to borrow information from nearby groups
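A simulated sketch of the contrast (hypothetical data, not the FEV dataset): the stratum mean at one age uses only that age's observations, while a straight-line fit borrows from every age.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated ages 9-19 and a response that rises linearly with age
age = rng.integers(9, 20, size=200)
y = 0.5 + 0.06 * age + rng.normal(0, 0.2, size=200)

# Stratified: the mean at age 18 uses only the 18-year-olds (no borrowing)
mean_at_18 = y[age == 18].mean()

# Regression: the fitted line at age 18 borrows information from all ages
slope, intercept = np.polyfit(age, y, 1)
fit_at_18 = intercept + slope * 18
```

Both estimates target the same mean, but the regression estimate has a smaller standard error because it uses all 200 observations, at the cost of assuming the straight-line form.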
Defining “Contrasts”
Define a comparison across groups to use when answering scientific questions
If the straight line relationship holds, the slope for the POI is the difference in parameter between groups differing by 1 unit in \(X\) when all other covariates are held constant
If a non-linear relationship in parameter, the slope is still the average difference in parameter between groups differing by 1 unit in \(X\) when all other covariates are held constant
Slope is a (first order or linear) test for trend in the parameter
Statistical jargon: “a contrast” across groups
The major difference between different regression models is the interpretation of the parameters
How do I want to summarize the outcome?
How do I want to compare groups?
Issues related to the inclusion of covariates remain the same
Address the scientific question: Predictor of interest, effect modification
Address confounding
Increase precision
Interpretation of parameters
Intercept
Corresponds to a population with all modeled covariates equal to zero
Slope
A comparison between groups differing by 1 unit in corresponding covariate, but agreeing on all other modeled covariates
Stratification versus regression
Generally, any stratified analysis could be performed as a regression model
Stratification adjusts for covariates and all interactions among those covariates
Our habit in regression is to just adjust for covariates as main effects, and consider interactions less often
In Stata or R, we use the same commands as were used for simple regression models
We just list more variable names
Interpretations of the CIs, p-values for coefficients estimates now relate to new scientific interpretations of intercept and slopes
Test of entire regression model also provided (a test that all slopes are equal to zero)
Scientific question: Is the maximum forced expiratory volume (FEV) related to smoking status in children?
Age ranges from 3 to 19, but no child under 9 smokes in the sample
Models we will compare
Unadjusted (simple) model: FEV and smoking
Adjusted for age: FEV and smoking with age (confounder)
Adjusted for age, height: FEV and smoking with age (confounder) and height (precision)
fev <- read.csv(file="data/FEV.csv")
# Restrict to ages 9 and above
fev <- fev[fev$age>=9,]
fev$logfev <- log(fev$fev)
fev$loght <- log(fev$height)
fev$smoker <- (fev$smoke=="current smoker") + 0
#table(fev$smoker, fev$smoke)
library(rms)
m.unadj <- ols(logfev ~ smoker, data=fev, x=TRUE, y=TRUE)
m.adj1 <- ols(logfev ~ smoker + age, data=fev, x=TRUE, y=TRUE)
m.adj2 <- ols(logfev ~ smoker + age + loght, data=fev, x=TRUE, y=TRUE)
robcov(m.unadj)

|  | Model Likelihood Ratio Test | Discrimination Indexes |
|---|---|---|
| Obs | LR chi2 | R2 |
| 439 | 9.3921 | 0.021167 |
| sigma | d.f. | R2 adj |
| 0.24765 | 1 | 0.018927 |
| d.f. | Pr(> chi2) | g |
| 437 | 0.0021792 | 0.025869 |
| Min | 1Q | Median | 3Q | Max |
|---|---|---|---|---|
| -0.6811 | -0.15903 | -0.0052531 | 0.15579 | 0.69848 |
| Coef | S.E. | t | Pr(>|t|) | |
|---|---|---|---|---|
| Intercept | 1.05817 | 0.012904 | 82.0031 | 0.0000000 |
| smoker | 0.10231 | 0.031582 | 3.2394 | 0.0012893 |
robcov(m.adj1)

|  | Model Likelihood Ratio Test | Discrimination Indexes |
|---|---|---|
| Obs | LR chi2 | R2 |
| 439 | 157.34 | 0.30121 |
| sigma | d.f. | R2 adj |
| 0.20949 | 2 | 0.298 |
| d.f. | Pr(> chi2) | g |
| 436 | 0 | 0.14784 |
| Min | 1Q | Median | 3Q | Max |
|---|---|---|---|---|
| -0.61071 | -0.15064 | 0.0027967 | 0.14619 | 0.5383 |
| Coef | S.E. | t | Pr(>|t|) | |
|---|---|---|---|---|
| Intercept | 0.351817 | 0.0573043 | 6.1394 | 1.900e-09 |
| smoker | -0.051349 | 0.0342645 | -1.4986 | 1.347e-01 |
| age | 0.063596 | 0.0051225 | 12.4150 | 0.000e+00 |
robcov(m.adj2)

|  | Model Likelihood Ratio Test | Discrimination Indexes |
|---|---|---|
| Obs | LR chi2 | R2 |
| 439 | 487.04 | 0.67025 |
| sigma | d.f. | R2 adj |
| 0.14407 | 3 | 0.66798 |
| d.f. | Pr(> chi2) | g |
| 435 | 0 | 0.23297 |
| Min | 1Q | Median | 3Q | Max |
|---|---|---|---|---|
| -0.50786 | -0.089931 | 0.010612 | 0.095589 | 0.37788 |
| Coef | S.E. | t | Pr(>|t|) | |
|---|---|---|---|---|
| Intercept | -11.094618 | 0.5129794 | -21.6278 | 0.0000e+00 |
| smoker | -0.053590 | 0.0240774 | -2.2257 | 2.6545e-02 |
| age | 0.021529 | 0.0034658 | 6.2120 | 1.2000e-09 |
| loght | 2.869659 | 0.1274099 | 22.5230 | 0.0000e+00 |
. regress logfev smoker if age>=9, robust
Linear regression Number of obs = 439
F( 1, 437) = 10.45
Prob > F = 0.0013
R-squared = 0.0212
Root MSE = .24765
------------------------------------------------------------------------------
| Robust
logfev | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
smoker | .1023056 .0316539 3.23 0.001 .0400927 .1645184
_cons | 1.05817 .0129335 81.82 0.000 1.032751 1.08359
------------------------------------------------------------------------------
.
. regress logfev smoker age if age>=9, robust
Linear regression Number of obs = 439
F( 2, 436) = 82.28
Prob > F = 0.0000
R-squared = 0.3012
Root MSE = .20949
------------------------------------------------------------------------------
| Robust
logfev | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
smoker | -.0513495 .0343822 -1.49 0.136 -.118925 .016226
age | .0635957 .0051401 12.37 0.000 .0534932 .0736981
_cons | .3518165 .0575011 6.12 0.000 .2388028 .4648303
------------------------------------------------------------------------------
.
. regress logfev smoker age loght if age >=9, robust
Linear regression Number of obs = 439
F( 3, 435) = 284.22
Prob > F = 0.0000
R-squared = 0.6703
Root MSE = .14407
------------------------------------------------------------------------------
| Robust
logfev | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
smoker | -.0535896 .0241879 -2.22 0.027 -.1011293 -.00605
age | .0215295 .0034817 6.18 0.000 .0146864 .0283725
loght | 2.869658 .1279943 22.42 0.000 2.618093 3.121222
_cons | -11.09461 .5153323 -21.53 0.000 -12.10746 -10.08176
------------------------------------------------------------------------------

Adjusting for covariates changes the scientific question
Unadjusted models: Slope compares parameters across groups differing by 1 unit in the modeled predictor
Adjusted models: Slope compares parameters across groups differing by 1 unit in the modeled predictor but similar with respect to other model covariates
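Because these models are fit on log(FEV), exponentiating a slope gives a ratio of geometric means; for instance, the smoker coefficient from the age- and height-adjusted model above (a quick arithmetic check in Python):

```python
import math

# Smoker coefficient from the model of log(FEV) adjusted for age and log(height)
coef_smoker = -0.053590

gm_ratio = math.exp(coef_smoker)   # ratio of geometric mean FEV, smokers vs non-smokers
pct_lower = 100 * (1 - gm_ratio)   # roughly a 5% lower geometric mean FEV for smokers
```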
Interpretation of Slopes
Unadjusted model: \(g(\theta | X_i) = \beta_0 + \beta_1 \times X_i\)
\(\beta_1\): Compares \(\theta\) for groups differing by 1 unit in \(X\)
Adjusted model: \(g(\theta | X_i, W_i) = \gamma_0 + \gamma_1 \times X_i + \gamma_2 \times W_i\)
Comparing unadjusted and adjusted models
Science questions
When does \(\gamma_1 = \beta_1\)?
When does \(\hat{\gamma}_1 = \hat{\beta}_1\)?
Statistics questions
When does \(se(\hat{\gamma_1}) = se(\hat{\beta_1})\)?
When does \(\hat{se}(\hat{\gamma_1}) = \hat{se}(\hat{\beta_1})\)?
In the above, note the placement of the \(\hat{ }\) (“hat”), which signifies estimates of population parameters
Lack of a hat indicates “the truth” in the population
When \(\hat{\gamma}_1 = \hat{\beta}_1\) (the formulas are the same), then it must be the case that \(se(\hat{\gamma_1}) = se(\hat{\beta_1})\)
Example of when \(\gamma_1 = \beta_1\)
Want to compare smokers to non-smokers (POI) with respect to their FEV (response) and have conducted a randomized experiment in which boys and girls (\(W\)) are equally represented as smokers and non-smokers
When we compare a random smoker to a random non-smoker, that average difference will be the same if we adjust for gender or not
Numbers in unadjusted and adjusted analyses are the same, but interpretation is different
Answering these four questions cannot be done in the general case
However, in linear regression we can derive exact results
These exact results can serve as a basis for examination of other regression models
Logistic regression
Poisson regression
Proportional hazards regression
Unadjusted model: \(E[Y_i | X_i] = \beta_0 + \beta_1 \times X_i\)
\(\beta_1\): Difference in mean \(Y\) for groups differing by 1 unit in \(X\)
Adjusted model: \(E[Y_i | X_i, W_i] = \gamma_0 + \gamma_1 \times X_i + \gamma_2 \times W_i\)
The slope of the unadjusted model will tend to be
\[\beta_1 = \gamma_1 + \rho_{XW} \frac{\sigma_W}{\sigma_X} \gamma_2\]
Hence, the true adjusted and unadjusted slopes for \(X\) estimate the same quantity only if
\(\rho_{XW} = 0\) (\(X\) and \(W\) are truly uncorrelated), OR
\(\gamma_2 = 0\) (no association between \(W\) and \(Y\) after adjusting for \(X\))
The estimated slope of the unadjusted model will be
\[\hat{\beta}_1 = \hat{\gamma}_1\left(1 + \hat{\gamma}_2 r_{XW} \left[\frac{s_W (1 - r^2_{XW})}{s_Y(r_{YX} - r_{YW} r_{XW})} \right] \right)\]
Hence, estimated adjusted and unadjusted slopes for \(X\) are equal only if
\(r_{XW} = 0\) (\(X\) and \(W\) are uncorrelated in the sample, which can be arranged by experimental design), OR
\(\hat{\gamma}_2 = 0\) (which cannot be predetermined because \(Y\) is random)
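These in-sample relationships can be checked numerically. The sketch below (pure Python, made-up data, not the lecture's FEV dataset) fits both models with the standard correlation-based OLS formulas and verifies the additive identity \(\hat{\beta}_1 = \hat{\gamma}_1 + \hat{\gamma}_2\, r_{XW}\, s_W / s_X\), which is algebraically equivalent to the multiplicative form above.

```python
import random, math

# Made-up data: W correlated with X, Y depends on both (a hypothetical sketch)
random.seed(1)
n = 1000
x = [random.gauss(0, 1) for _ in range(n)]
w = [0.6 * xi + random.gauss(0, 1) for xi in x]
y = [1.0 * xi + 2.0 * wi + random.gauss(0, 1) for xi, wi in zip(x, w)]

def mean(v): return sum(v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

sx, sw, sy = (math.sqrt(cov(v, v)) for v in (x, w, y))
r_xw = cov(x, w) / (sx * sw)
r_yx = cov(y, x) / (sy * sx)
r_yw = cov(y, w) / (sy * sw)

# Unadjusted slope: simple regression of Y on X
beta1 = cov(y, x) / cov(x, x)

# Adjusted slopes via the standard two-predictor OLS formulas
gamma1 = (sy / sx) * (r_yx - r_yw * r_xw) / (1 - r_xw ** 2)
gamma2 = (sy / sw) * (r_yw - r_yx * r_xw) / (1 - r_xw ** 2)

# Additive identity (exact in any sample): beta1 = gamma1 + gamma2 * r_XW * s_W / s_X
add_form = gamma1 + gamma2 * r_xw * sw / sx
# Equivalent multiplicative form
mult_form = gamma1 * (1 + gamma2 * r_xw * sw * (1 - r_xw ** 2) / (sy * (r_yx - r_yw * r_xw)))
print(abs(beta1 - add_form) < 1e-10, abs(beta1 - mult_form) < 1e-10)
```

Both forms agree with the directly computed unadjusted slope up to floating-point error.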
Unadjusted model: \([se(\hat{\beta}_1)]^2 = \frac{Var(Y|X)}{n Var(X)}\)
Adjusted model: \([se(\hat{\gamma}_1)]^2 = \frac{Var(Y|X, W)}{n Var(X)(1 - r^2_{XW})}\)
\[\begin{aligned} Var(Y|X) & = & \gamma_2^2 Var(W|X) + Var(Y | X, W) \\ \sigma^2_{Y|X} & = & \gamma_2^2 \sigma^2_{W|X} + \sigma^2_{Y|X,W} \end{aligned}\]
Hence, \(se(\hat{\beta}_1) = se(\hat{\gamma}_1)\) if,
\(r_{XW} = 0\) AND
\(\gamma_2 = 0\) OR \(Var(W|X) = 0\)
Unadjusted model: \([\hat{se}(\hat{\beta}_1)]^2 = \frac{SSE(Y|X)/(n-2)}{(n-1) s^2_X}\)
Adjusted model: \([\hat{se}(\hat{\gamma}_1)]^2 = \frac{SSE(Y|X, W)/(n-3)}{(n-1) s^2_X (1 - r^2_{XW})}\)
\[\begin{aligned} SSE(Y|X) & = & \sum(Y_i - \hat{\beta}_0 - \hat{\beta}_1 \times X_i)^2 \\ SSE(Y|X, W) & = & \sum(Y_i - \hat{\gamma}_0 - \hat{\gamma}_1 \times X_i - \hat{\gamma}_2 \times W_i)^2 \end{aligned}\]
Hence, \(\hat{se}(\hat{\beta}_1) = \hat{se}(\hat{\gamma}_1)\) if,
\(r_{XW} = 0\) AND
\(SSE(Y|X)/(n-2) = SSE(Y|X,W)/(n-3)\)
Note that when calculated on the same data, \(\hat{\beta}_1 = \hat{\gamma}_1\) if
\(\hat{\gamma}_2 = 0\), in which case \(SSE(Y | X) = SSE(Y | X, W)\), OR
\(r_{XW} = 0\), and \(SSE(Y | X) > SSE(Y | X, W)\) if \(\hat{\gamma}_2 \neq 0\)
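A quick numerical check of the estimated-standard-error formulas and the SSE comparison (pure Python, made-up data; `cv` is the sample covariance):

```python
import random, math

# Made-up data: W correlated with X and predictive of Y (hypothetical sketch)
random.seed(2)
n = 400
x = [random.gauss(0, 1) for _ in range(n)]
w = [0.5 * xi + random.gauss(0, 1) for xi in x]
y = [1.5 * xi + 1.0 * wi + random.gauss(0, 1) for xi, wi in zip(x, w)]

def cv(a, b):
    ma, mb = sum(a) / n, sum(b) / n
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (n - 1)

sx2, sw2 = cv(x, x), cv(w, w)
r_xw = cv(x, w) / math.sqrt(sx2 * sw2)

# Unadjusted fit and its SSE
b1 = cv(y, x) / sx2
b0 = sum(y) / n - b1 * sum(x) / n
sse1 = sum((yi - b0 - b1 * xi) ** 2 for yi, xi in zip(y, x))

# Adjusted fit and its SSE
g1 = (cv(y, x) - cv(y, w) * cv(x, w) / sw2) / (sx2 * (1 - r_xw ** 2))
g2 = (cv(y, w) - cv(y, x) * cv(x, w) / sx2) / (sw2 * (1 - r_xw ** 2))
g0 = sum(y) / n - g1 * sum(x) / n - g2 * sum(w) / n
sse2 = sum((y[i] - g0 - g1 * x[i] - g2 * w[i]) ** 2 for i in range(n))

# Estimated standard errors from the displayed formulas
se_b1 = math.sqrt((sse1 / (n - 2)) / ((n - 1) * sx2))
se_g1 = math.sqrt((sse2 / (n - 3)) / ((n - 1) * sx2 * (1 - r_xw ** 2)))

print(sse1 > sse2)   # adding W to the model can only reduce the SSE
print(se_b1, se_g1)
```

Because the models are nested, \(SSE(Y|X) \geq SSE(Y|X,W)\) always holds in-sample, with strict inequality whenever \(\hat{\gamma}_2 \neq 0\).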
We are interested in knowing the behavior of unadjusted and adjusted models according to whether
\(X\) and \(W\) are uncorrelated
\(W\) is associated with \(Y\) after adjustment for \(X\)
The 4 key cases are summarized below
| | \(r_{XW} = 0\) | \(r_{XW} \neq 0\) |
|---|---|---|
| \(\gamma_2 \neq 0\) | Precision | Confounding |
| \(\gamma_2 = 0\) | Irrelevant | Variance Inflation |
When \(X\) is associated with \(W\), and \(W\) is associated with \(Y\) after we control for \(X\), that is what we call confounding
When \(X\) is associated with \(W\), and \(W\) is not associated with \(Y\) after we control for \(X\), this inflates the variance of the association between \(X\) and \(Y\) (more on this follows)
When \(X\) is not associated with \(W\), and \(W\) is associated with \(Y\) after we control for \(X\), this increases the precision of our estimate of the association between \(X\) and \(Y\)
When \(X\) is not associated with \(W\), and \(W\) is not associated with \(Y\) after we control for \(X\), there is no reason to be concerned with modeling \(W\)
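The precision and variance-inflation cells of this table can be illustrated with a small simulation (a sketch with made-up parameter values; the `fit` helper implements the one- and two-predictor OLS formulas used above):

```python
import random, math

def fit(y, x, w=None):
    """OLS slope for x and its estimated SE, from the closed-form expressions
    for simple (y ~ x) and two-predictor (y ~ x + w) linear regression."""
    n = len(y)
    def cv(a, b):
        ma, mb = sum(a) / n, sum(b) / n
        return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (n - 1)
    sx2 = cv(x, x)
    if w is None:
        b1 = cv(y, x) / sx2
        b0 = sum(y) / n - b1 * sum(x) / n
        sse = sum((yi - b0 - b1 * xi) ** 2 for yi, xi in zip(y, x))
        return b1, math.sqrt((sse / (n - 2)) / ((n - 1) * sx2))
    sw2 = cv(w, w)
    rxw = cv(x, w) / math.sqrt(sx2 * sw2)
    g1 = (cv(y, x) - cv(y, w) * cv(x, w) / sw2) / (sx2 * (1 - rxw ** 2))
    g2 = (cv(y, w) - cv(y, x) * cv(x, w) / sx2) / (sw2 * (1 - rxw ** 2))
    g0 = sum(y) / n - g1 * sum(x) / n - g2 * sum(w) / n
    sse = sum((y[i] - g0 - g1 * x[i] - g2 * w[i]) ** 2 for i in range(n))
    return g1, math.sqrt((sse / (n - 3)) / ((n - 1) * sx2 * (1 - rxw ** 2)))

random.seed(7)
n = 500
x = [random.gauss(0, 1) for _ in range(n)]

# Precision case: W independent of X but strongly predictive of Y
w_p = [random.gauss(0, 1) for _ in range(n)]
y_p = [x[i] + 5 * w_p[i] + random.gauss(0, 1) for i in range(n)]

# Variance-inflation case: W correlated with X but irrelevant to Y given X
w_v = [x[i] + 0.5 * random.gauss(0, 1) for i in range(n)]
y_v = [x[i] + random.gauss(0, 1) for i in range(n)]

_, se_u_p = fit(y_p, x)
_, se_a_p = fit(y_p, x, w_p)
_, se_u_v = fit(y_v, x)
_, se_a_v = fit(y_v, x, w_v)
print("precision case: adjusted SE smaller ->", se_a_p < se_u_p)
print("variance inflation case: adjusted SE larger ->", se_a_v > se_u_v)
```

In the precision case the adjusted residual variance is far smaller, so the adjusted SE shrinks; in the variance-inflation case the residual variance barely changes while the \(1/(1 - r^2_{XW})\) factor inflates the adjusted SE.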
Adjusting for a true precision variable should not impact the point estimate of the association between the POI and the response, but will decrease the standard error
\(X, W\) independent in the population (or a completely randomized experiment) AND \(W\) is associated with \(Y\) independent of \(X\)
\(\rho_{XW} = 0\)
\(\gamma_2 \neq 0\)
| | True Value | Estimates |
|---|---|---|
| Slopes | \(\beta_1 = \gamma_1\) | \(\hat{\beta}_1 \approx \hat{\gamma}_1\) |
| Std Errors | \(se(\hat{\beta}_1) > se(\hat{\gamma}_1)\) | \(\hat{se}(\hat{\beta}_1) > \hat{se}(\hat{\gamma}_1)\) |
For logistic regression, we can no longer use the exact formulas derived for linear regression
Adjusting for a precision variable in logistic regression
Deattenuates slope away from the null
Standard errors reflect the mean-variance relationship
| | True Value | Estimates |
|---|---|---|
| Slopes if \(\beta_1 > 0\) | \(\beta_1 < \gamma_1\) | \(\hat{\beta}_1 < \hat{\gamma}_1\) |
| Slopes if \(\beta_1 < 0\) | \(\beta_1 > \gamma_1\) | \(\hat{\beta}_1 > \hat{\gamma}_1\) |
| Std Errors | \(se(\hat{\beta}_1) < se(\hat{\gamma}_1)\) | \(\hat{se}(\hat{\beta}_1) < \hat{se}(\hat{\gamma}_1)\) |
Note that the standard errors will be smaller in the unadjusted model due to the mean-variance relationship
Proportions have minimum variance when \(p\) gets close to 0 or 1, and maximum variance at \(p = 0.5\)
Odds have minimum variance when \(p\) is \(0.5\)
Precision variables should be driving probabilities toward 0 or 1
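The deattenuation can be seen without fitting anything: for a hypothetical conditional logistic model with an independent precision covariate \(W \sim N(0,1)\), the implied marginal (unadjusted) log odds ratio can be computed by numerically integrating over \(W\), and it comes out closer to 0 than the conditional slope (non-collapsibility of the odds ratio). A pure-Python sketch with made-up coefficients:

```python
import math

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

# Conditional (adjusted) model: logit P(Y=1 | X, W) = b0 + b1*X + b2*W,
# with W ~ N(0,1) independent of binary X (a pure precision variable).
b0, b1, b2 = 0.0, 1.0, 2.0   # hypothetical values

def marginal_p(x, grid=4001, lim=8.0):
    """P(Y=1 | X=x) = E_W[expit(b0 + b1*x + b2*W)] by trapezoidal integration."""
    h = 2 * lim / (grid - 1)
    total = 0.0
    for i in range(grid):
        w = -lim + i * h
        phi = math.exp(-w * w / 2) / math.sqrt(2 * math.pi)  # N(0,1) density
        weight = 0.5 if i in (0, grid - 1) else 1.0
        total += weight * expit(b0 + b1 * x + b2 * w) * phi
    return total * h

logit = lambda p: math.log(p / (1 - p))
p0, p1 = marginal_p(0), marginal_p(1)
marginal_logOR = logit(p1) - logit(p0)   # slope the unadjusted model estimates
print(marginal_logOR, "vs conditional slope", b1)
```

The marginal log odds ratio is attenuated toward 0 relative to the conditional slope `b1`, so adjusting for \(W\) deattenuates the estimate even though \(W\) is independent of \(X\).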
In Poisson regression, adjusting for a precision variable will
Have no effect on the slope (log ratios are linear in log means)
Standard errors reflect the mean-variance relationship (virtually no effect on power)
| | True Value | Estimates |
|---|---|---|
| Slopes | \(\beta_1 = \gamma_1\) | \(\hat{\beta}_1 \approx \hat{\gamma}_1\) |
| Std Errors | \(se(\hat{\beta}_1) \approx se(\hat{\gamma}_1)\) | \(\hat{se}(\hat{\beta}_1) \approx \hat{se}(\hat{\gamma}_1)\) |
In proportional hazards regression, adjusting for a precision variable
Deattenuates slope away from the null
Standard errors stay fairly constant (complicated result of binomial mean-variance relationship)
| | True Value | Estimates |
|---|---|---|
| Slopes if \(\beta_1 > 0\) | \(\beta_1 < \gamma_1\) | \(\hat{\beta}_1 < \hat{\gamma}_1\) |
| Slopes if \(\beta_1 < 0\) | \(\beta_1 > \gamma_1\) | \(\hat{\beta}_1 > \hat{\gamma}_1\) |
| Std Errors | \(se(\hat{\beta}_1) \approx se(\hat{\gamma}_1)\) | \(\hat{se}(\hat{\beta}_1) \approx \hat{se}(\hat{\gamma}_1)\) |
Will get some gain in power due to deattenuation of \(\beta_1\) while standard errors remain similar
However, it is rare to have PH assumption hold for both adjusted and unadjusted models
Stratified randomization in a designed experiment
\(r_{XW} = 0\)
\(\gamma_2 \neq 0\)
| | True Value | Estimates |
|---|---|---|
| Slopes | \(\beta_1 = \gamma_1\) | \(\hat{\beta}_1 = \hat{\gamma}_1\) |
| Std Errors | \(se(\hat{\beta}_1) = se(\hat{\gamma}_1)\) | \(\hat{se}(\hat{\beta}_1) > \hat{se}(\hat{\gamma}_1)\) |
Need to adjust for the blocking variable in regression analysis to get improved standard error
Causally associated with the response and associated with POI in the sample
\(r_{XW} \neq 0\)
\(\gamma_2 \neq 0\)
| | True Value | Estimates |
|---|---|---|
| Slopes | \(\beta_1 = \gamma_1 + \rho_{XW} \frac{\sigma_W}{\sigma_X} \gamma_2\) | \(\hat{\beta}_1 = \hat{\gamma}_1\left(1 + \hat{\gamma}_2 r_{XW} \left[\frac{s_W (1 - r^2_{XW})}{s_Y(r_{YX} - r_{YW} r_{XW})} \right] \right)\) |
| Std Errors | \(se(\hat{\beta}_1) \{>, =, or <\} se(\hat{\gamma}_1)\) | \(\hat{se}(\hat{\beta}_1) \{>, =, or <\} \hat{se}(\hat{\gamma}_1)\) |
Slopes could increase or decrease
Standard errors could increase, decrease, or stay the same
With logistic, Poisson, PH regression we cannot write down formula, but
As in linear regression, anything can happen
| | True Value | Estimates |
|---|---|---|
| Slopes | \(\beta_1 \{>, =, or <\} \gamma_1\) | \(\hat{\beta}_1 \{>, =, or <\} \hat{\gamma}_1\) |
| Std Errors | \(se(\hat{\beta}_1) \{>, =, or <\} se(\hat{\gamma}_1)\) | \(\hat{se}(\hat{\beta}_1) \{>, =, or <\} \hat{se}(\hat{\gamma}_1)\) |
Slopes could increase or decrease
Standard errors could increase, decrease, or stay the same
Associated with POI in the sample, but not associated with response
\(r_{XW} \neq 0\)
\(\gamma_2 = 0\)
| | True Value | Estimates |
|---|---|---|
| Slopes | \(\beta_1 = \gamma_1\) | \(\hat{\beta}_1 = \hat{\gamma}_1\left(1 + \hat{\gamma}_2 r_{XW} \left[\frac{s_W (1 - r^2_{XW})}{s_Y(r_{YX} - r_{YW} r_{XW})} \right] \right)\) |
| Std Errors | \(se(\hat{\beta}_1) < se(\hat{\gamma}_1)\) | \(\hat{se}(\hat{\beta}_1) < \hat{se}(\hat{\gamma}_1)\) |
With logistic, Poisson, PH regression we cannot write down formula, but
Similar to linear regression
| | True Value | Estimates |
|---|---|---|
| Slopes | \(\beta_1 = \gamma_1\) | \(\hat{\beta}_1 \{>, =, or <\} \hat{\gamma}_1\) |
| Std Errors | \(se(\hat{\beta}_1) < se(\hat{\gamma}_1)\) | \(\hat{se}(\hat{\beta}_1) < \hat{se}(\hat{\gamma}_1)\) |
Uncorrelated with POI in sample, and not associated with response
\(r_{XW} = 0\)
\(\gamma_2 = 0\)
Inclusion of irrelevant variables results in slight loss in precision in all regressions
| | True Value | Estimates |
|---|---|---|
| Slopes | \(\beta_1 = \gamma_1\) | \(\hat{\beta}_1= \hat{\gamma}_1\) |
| Std Errors | \(se(\hat{\beta}_1) = se(\hat{\gamma}_1)\) | \(\hat{se}(\hat{\beta}_1) < \hat{se}(\hat{\gamma}_1)\) |
Association between lung function and self-reported smoking in children
Compare geometric means of FEV of children who smoke to comparable non-smokers
Restrict analysis to children 9 years and older
No smokers younger than age 9
Still about 6 : 1 ratio of non-smokers to smokers
Little precision gained by keeping younger children
Borrowing information from young kids is problematic if the relationship between log(FEV) and the predictors is not linear
With confounding, want to get the model correct
Academic Exercise: Compare alternative models
In real life, we should choose a single model in advance of looking at the data
Here we will observe what happens to parameter estimates and SE across models
Smoking
Smoking adjusted for age
Smoking adjusted for age and height
robcov(m.unadj)

| | Model Likelihood Ratio Test | Discrimination Indexes |
|---|---|---|
| Obs 439 | LR chi2 9.3921 | R2 0.021167 |
| sigma 0.24765 | d.f. 1 | R2 adj 0.018927 |
| d.f. 437 | Pr(> chi2) 0.0021792 | g 0.025869 |
| Min | 1Q | Median | 3Q | Max |
|---|---|---|---|---|
| -0.6811 | -0.15903 | -0.0052531 | 0.15579 | 0.69848 |
| | Coef | S.E. | t | Pr(>|t|) |
|---|---|---|---|---|
| Intercept | 1.05817 | 0.012904 | 82.0031 | 0.0000000 |
| smoker | 0.10231 | 0.031582 | 3.2394 | 0.0012893 |
. regress logfev smoker if age>=9, robust
Linear regression Number of obs = 439
F( 1, 437) = 10.45
Prob > F = 0.0013
R-squared = 0.0212
Root MSE = .24765
------------------------------------------------------------------------------
| Robust
logfev | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
smoker | .1023056 .0316539 3.23 0.001 .0400927 .1645184
_cons | 1.05817 .0129335 81.82 0.000 1.032751 1.08359
------------------------------------------------------------------------------

Intercept
Geometric mean of FEV in nonsmokers is 2.88 l/sec
The scientific relevance is questionable here because we do not really know the population our sample represents
Calculations \(e^{1.05817} = 2.88\)
The p-value is completely unimportant here as it is testing that the log geometric mean is 0, or that the geometric mean is 1. Why would we care?
Because smoker is a binary variable, the estimate corresponds to a geometric mean. In many regression models, the intercept will have no meaningful interpretation
Smoking effect
Geometric mean of FEV is \(10.8\%\) higher in smokers than in nonsmokers (95% CI: \(4.1\%\) to \(17.9\%\) higher)
These results are atypical of what we might expect with no true difference between groups (\(p = 0.001\))
Calculations: \(e^{.102} = 1.108; e^{.040} = 1.041; e^{0.165} = 1.179\)
(Note that \(e^x\) is approximately \((1+x)\) for \(x\) close to 0)
Because smoker is a binary variable, this analysis is nearly identical to a two sample t-test allowing for unequal variances
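The point-estimate half of this equivalence is exact and easy to verify: with a binary predictor, the OLS slope equals the difference in group means. A minimal sketch with made-up data:

```python
import random

# Made-up data mimicking a binary smoking indicator and a log-scale response
random.seed(3)
smoker = [random.randint(0, 1) for _ in range(200)]
logfev = [1.0 + 0.1 * s + random.gauss(0, 0.25) for s in smoker]

# OLS slope of logfev on the binary indicator
n = len(logfev)
mx, my = sum(smoker) / n, sum(logfev) / n
slope = sum((x - mx) * (y - my) for x, y in zip(smoker, logfev)) / \
        sum((x - mx) ** 2 for x in smoker)

# Difference in group means
m1 = sum(y for x, y in zip(smoker, logfev) if x == 1) / sum(smoker)
m0 = sum(y for x, y in zip(smoker, logfev) if x == 0) / (n - sum(smoker))
print(abs(slope - (m1 - m0)) < 1e-12)  # the two quantities are identical
```

Only the standard error differs between the two procedures: the classical OLS SE pools the variance, while robust SEs (as in `robcov` or Stata's `, robust`) mirror the unequal-variance t-test.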
robcov(m.adj1)

| | Model Likelihood Ratio Test | Discrimination Indexes |
|---|---|---|
| Obs 439 | LR chi2 157.34 | R2 0.30121 |
| sigma 0.20949 | d.f. 2 | R2 adj 0.298 |
| d.f. 436 | Pr(> chi2) 0 | g 0.14784 |
| Min | 1Q | Median | 3Q | Max |
|---|---|---|---|---|
| -0.61071 | -0.15064 | 0.0027967 | 0.14619 | 0.5383 |
| | Coef | S.E. | t | Pr(>|t|) |
|---|---|---|---|---|
| Intercept | 0.351817 | 0.0573043 | 6.1394 | 1.900e-09 |
| smoker | -0.051349 | 0.0342645 | -1.4986 | 1.347e-01 |
| age | 0.063596 | 0.0051225 | 12.4150 | 0.000e+00 |
. regress logfev smoker age if age>=9, robust
Linear regression Number of obs = 439
F( 2, 436) = 82.28
Prob > F = 0.0000
R-squared = 0.3012
Root MSE = .20949
------------------------------------------------------------------------------
| Robust
logfev | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
smoker | -.0513495 .0343822 -1.49 0.136 -.118925 .016226
age | .0635957 .0051401 12.37 0.000 .0534932 .0736981
_cons | .3518165 .0575011 6.12 0.000 .2388028 .4648303
------------------------------------------------------------------------------

Intercept
Geometric mean of FEV in newborn nonsmokers is 1.42 l/sec
Intercept corresponds to the log geometric mean in a group having all predictors equal to \(0\)
There is no scientific relevance here as we are extrapolating beyond the range of our data
Calculations \(e^{0.352} = 1.422\)
Age effect
Geometric mean of FEV is \(6.6\%\) higher for each year difference in age between two groups with the same smoking status (95% CI: \(5.5\%\) to \(7.6\%\) higher)
Smoking effect (age adjusted interpretation)
Geometric mean of FEV is \(5.0\%\) lower in smokers than in nonsmokers of the same age (95% CI: \(11.2\%\) lower to \(1.6\%\) higher)
These results are not atypical of what we might expect with no true difference between groups of the same age (\(p = 0.136\))
Lack of statistical significance can also be noted by the fact that the CI for the ratio contains 1, or the CI for the percent difference contains 0
Calculations: \(e^{-.051} = 0.950; e^{-.119} = 0.888; e^{0.016} = 1.016\)
Comparing unadjusted and age adjusted analyses
Marked differences in effect of smoking suggests that there was indeed confounding
Age is a relatively strong predictor of FEV
Age is associated with smoking in the sample
Mean (SD) of age in analyzed non-smokers: 11.1 (2.04)
Mean (SD) of age in analyzed smokers: 13.5 (2.34)
Effect of age adjustment on precision
Lower Root MSE (0.209 vs 0.248) would tend to increase precision of estimate of smoking effect
Association between smoking and age tends to lower precision
Net effect: Less precision (adj SE 0.034 vs unadj SE 0.031)
robcov(m.adj2)

| | Model Likelihood Ratio Test | Discrimination Indexes |
|---|---|---|
| Obs 439 | LR chi2 487.04 | R2 0.67025 |
| sigma 0.14407 | d.f. 3 | R2 adj 0.66798 |
| d.f. 435 | Pr(> chi2) 0 | g 0.23297 |
| Min | 1Q | Median | 3Q | Max |
|---|---|---|---|---|
| -0.50786 | -0.089931 | 0.010612 | 0.095589 | 0.37788 |
| | Coef | S.E. | t | Pr(>|t|) |
|---|---|---|---|---|
| Intercept | -11.094618 | 0.5129794 | -21.6278 | 0.0000e+00 |
| smoker | -0.053590 | 0.0240774 | -2.2257 | 2.6545e-02 |
| age | 0.021529 | 0.0034658 | 6.2120 | 1.2000e-09 |
| loght | 2.869659 | 0.1274099 | 22.5230 | 0.0000e+00 |
. regress logfev smoker age loght if age >=9, robust
Linear regression Number of obs = 439
F( 3, 435) = 284.22
Prob > F = 0.0000
R-squared = 0.6703
Root MSE = .14407
------------------------------------------------------------------------------
| Robust
logfev | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
smoker | -.0535896 .0241879 -2.22 0.027 -.1011293 -.00605
age | .0215295 .0034817 6.18 0.000 .0146864 .0283725
loght | 2.869658 .1279943 22.42 0.000 2.618093 3.121222
_cons | -11.09461 .5153323 -21.53 0.000 -12.10746 -10.08176
------------------------------------------------------------------------------

Intercept
Geometric mean of FEV in newborn nonsmokers who are 1 inch in height is 0.000015 l/sec
Intercept corresponds to the log geometric mean in a group having all predictors equal to \(0\) (nonsmokers, age 0, log height 0)
There is no scientific relevance here because no such people exist in either our sample or the population
Age effect
Geometric mean of FEV is \(2.2\%\) higher for each year difference in age between two groups with the same height and smoking status (95% CI: \(1.5\%\) to \(2.9\%\) higher for each year difference in age)
Note that there is clear evidence that height confounded the age effect estimated in the analysis that only considered age and smoking, but there is still a clearly independent effect of age on FEV
Height effect
Geometric mean of FEV is \(31.5\%\) higher for each \(10\%\) difference in height between two groups with the same age and smoking status (95% CI: \(28.3\%\) to \(34.6\%\) higher for each \(10\%\) difference in height)
These results are highly atypical of what we might expect with no true difference in geometric means between height groups having similar smoking status and age (\(p < 0.001\))
Calculations: \(1.1^{2.867} = 1.315\)
Note that the CI for the regression coefficient is consistent with the scientifically hypothesized value of 3
Smoking effect (age, height adjusted interpretation)
Geometric mean of FEV is \(5.2\%\) lower in smokers than in nonsmokers of the same age and height (95% CI: \(9.6\%\) to \(0.6\%\) lower)
These results are atypical of what we might expect with no true difference between groups of the same age and height (\(p = 0.027\))
Calculations: \(e^{-.054} = 0.948; e^{-.101} = 0.904; e^{-0.006} = .994\)
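All of the percentage interpretations above come from the same back-transformations; a small sketch reproducing the quoted numbers from the reported coefficients:

```python
import math

# Percent difference in geometric means implied by a coefficient on the log scale
pct = lambda logcoef: 100 * (math.exp(logcoef) - 1)

# Smoking effect, age- and height-adjusted: coef -.0535896, CI (-.1011293, -.00605)
print(round(pct(-0.0535896), 1))   # ~ -5.2, i.e. 5.2% lower
print(round(pct(-0.1011293), 1))   # ~ -9.6
print(round(pct(-0.00605), 1))     # ~ -0.6

# Height enters as log(height), so a 10% height difference multiplies the
# geometric mean of FEV by 1.1^coef
print(round(100 * (1.1 ** 2.869658 - 1), 1))  # ~ 31.5, i.e. 31.5% higher
```

The same one-liners reproduce the unadjusted and age-adjusted percentages quoted earlier.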
Comparing age-adjusted to age- and height-adjusted analyses
No difference in effect of smoking suggests that there was no more confounding after age adjustment
Effect of height adjustment on precision
Lower Root MSE (0.144 vs 0.209) would tend to increase precision of estimate of smoking effect
Because there is little association between smoking and height after adjusting for age, the adjustment does not tend to lower precision
Net effect: Higher precision (adj SE 0.024 vs unadj SE 0.034)
Continue our academic exercise using logistic regression
Dichotomize FEV at median (2.93)
In real life, we would not want to make a continuous variable like FEV into a binary variable
Here we will observe what happens to parameter estimates and SE across models
Smoking
Smoking adjusted for age
Smoking adjusted for height
Smoking adjusted for age and height
fev$fevbin <- (fev$fev>=2.93)+0
lrm.unadj <- lrm(fevbin ~ smoker, data=fev)
lrm.adj1 <- lrm(fevbin ~ smoker + age, data=fev)
lrm.adj2 <- lrm(fevbin ~ smoker + loght, data=fev)
lrm.adj3 <- lrm(fevbin ~ smoker + age + loght, data=fev)

. logit fevbin smoker if age>=9
Logistic regression Number of obs = 439
LR chi2(1) = 15.81
Prob > chi2 = 0.0001
Log likelihood = -296.38409 Pseudo R2 = 0.0260
------------------------------------------------------------------------------
fevbin | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
smoker | 1.120549 .2959672 3.79 0.000 .540464 1.700634
_cons | -.1607732 .1037519 -1.55 0.121 -.3641231 .0425767
------------------------------------------------------------------------------
.
. logit fevbin smoker age if age>=9
Logistic regression Number of obs = 439
LR chi2(2) = 88.68
Prob > chi2 = 0.0000
Log likelihood = -259.95032 Pseudo R2 = 0.1457
------------------------------------------------------------------------------
fevbin | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
smoker | .1971403 .3402482 0.58 0.562 -.469734 .8640146
age | .470792 .0637495 7.39 0.000 .3458453 .5957386
_cons | -5.360912 .7056265 -7.60 0.000 -6.743914 -3.977909
------------------------------------------------------------------------------
.
. logit fevbin smoker loght if age>=9
Logistic regression Number of obs = 439
LR chi2(2) = 224.11
Prob > chi2 = 0.0000
Log likelihood = -192.23646 Pseudo R2 = 0.3682
------------------------------------------------------------------------------
fevbin | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
smoker | .4159311 .3646506 1.14 0.254 -.2987709 1.130633
loght | 33.08872 3.169516 10.44 0.000 26.87659 39.30086
_cons | -137.6043 13.16723 -10.45 0.000 -163.4116 -111.797
------------------------------------------------------------------------------
.
. logit fevbin smoker age loght if age>=9
Logistic regression Number of obs = 439
LR chi2(3) = 230.19
Prob > chi2 = 0.0000
Log likelihood = -189.19487 Pseudo R2 = 0.3782
------------------------------------------------------------------------------
fevbin | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
smoker | .047695 .3933191 0.12 0.903 -.7231963 .8185863
age | .1753759 .0723176 2.43 0.015 .033636 .3171157
loght | 31.04791 3.289031 9.44 0.000 24.60152 37.49429
_cons | -131.0647 13.5393 -9.68 0.000 -157.6013 -104.5282
------------------------------------------------------------------------------

Coefficients (logit scale)
| Model | Smoker | Age | Log(Height) |
|---|---|---|---|
| Smoke | 1.12 | | |
| Smoke+Age | 0.19 | 0.47 | |
| Smoke+Ht | 0.42 | | 33.1 |
| Smoke+Age+Ht | 0.05 | 0.18 | 31.0 |
Std Errors (logit scale)
| Model | Smoker | Age | Log(Height) |
|---|---|---|---|
| Smoke | 0.30 | | |
| Smoke+Age | 0.34 | 0.06 | |
| Smoke+Ht | 0.36 | | 3.2 |
| Smoke+Age+Ht | 0.39 | 0.07 | 3.3 |
Adjusting for height
Comparing Smoke+Age to Smoke+Age+Ht
Height is being added primarily as a precision variable
Attenuated the smoking effect estimate (0.19 to 0.05, closer to 0)
Increased the standard error estimate (0.34 to 0.39)
The increase in the standard error is predicted by the previous discussion; the attenuation of the point estimate suggests height is not acting purely as a precision variable here, since it retains some association with smoking after age adjustment
Comparing Smoke to Smoke+Ht
Attenuated the smoking effect (1.12 to 0.42, closer to 0)
Increased the standard error estimate
Height is now also acting as a confounding variable (because we are not modeling age), so these changes were not predictable
Adjusting for the confounding variable (age)
Comparing Smoke+Ht to Smoke+Age+Ht
Attenuated the estimated coefficient (0.42 to 0.05)
Increased the standard error
Comparing Smoke to Smoke+Age
Attenuated the coefficient estimate (1.12 to 0.19)
Increased the standard error
Changes in estimates and standard errors were not predictable for this, or any, confounder