Lab 5: Multivariable Regression

Background

Another way to think about regression is in terms of the amount of variability in the outcome (Y) that is explained by the predictors (X). In simple linear regression, we regress Y on X because we believe that X will explain some of the variability in Y. This leads to an alternate way of thinking about statistical tests for regression coefficients in terms of the variability of the outcome. Specifically, the null hypothesis is that X explains none of the variability in Y, and the alternative hypothesis is that X explains more variability than would be expected by chance alone. In this lab we will consider the interpretation of statistical tests in a multivariable model that adds another predictor (W) to the model. Salary will be the outcome (Y), male gender the predictor of interest (X), and year of degree the additional covariate (W).

An added-variable plot is a scatterplot of two sets of residuals: the residuals from regressing the dependent variable (\(Y\)) on all the independent variables except \(X_1\), plotted against the residuals from regressing \(X_1\) on those same independent variables. This construction nets out the influence of all the other independent variables. The fitted regression line through the origin between these residuals has the same slope as the coefficient on \(X_1\) in the full regression model that includes all the independent variables. An added-variable plot is the multivariable analogue of a simple scatterplot with a regression fit, which shows the relationship between a single X variable and a Y variable when there are no other covariates. It is a visually compelling way to show the nature of the partial correlation between \(X_1\) and \(Y\) as estimated in a multiple regression.
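As a concrete illustration, here is a sketch using simulated data (every name below is made up for the example) showing that the residual-on-residual slope reproduces the coefficient on \(X_1\) from the full model:

# Simulated data: y depends on x1 and x2, and x1 is correlated with x2
set.seed(42)
x2 <- rnorm(100)
x1 <- 0.5 * x2 + rnorm(100)
y  <- 2 * x1 + 3 * x2 + rnorm(100)

# Net out x2 from both y and x1
ry  <- resid(lm(y ~ x2))   # variability in y not explained by x2
rx1 <- resid(lm(x1 ~ x2))  # variability in x1 not explained by x2

# Added-variable plot for x1, with the residual-on-residual fit
plot(rx1, ry)
abline(lm(ry ~ rx1))
coef(lm(ry ~ rx1))["rx1"]      # same slope as ...
coef(lm(y ~ x1 + x2))["x1"]    # ... the x1 coefficient in the full model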

New functions

To obtain the residuals from a fitted model, use the resid() function on the model fit.
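For example, using R's built-in cars data (chosen here purely for illustration):

fit <- lm(dist ~ speed, data = cars)  # simple linear regression
head(resid(fit))                      # first few residuals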

The avPlots() function in the car package produces added-variable (also called partial-regression) plots for linear and generalized linear models.

In Part 1 and Part 2 of the lab, we will construct added-variable plots in steps, without the use of a specialized function.

Example 1: Duncan Data

Duncan’s Occupational Prestige. Data on the prestige and other characteristics of 45 U.S. occupations in 1950.

library(car)
Loading required package: carData
head(Duncan)
           type income education prestige
accountant prof     62        86       82
pilot      prof     72        76       83
architect  prof     75        92       90
author     prof     55        90       76
chemist    prof     64        86       90
minister   prof     21        84       87

Obtain summary statistics for the variables in Duncan:

summary(Duncan)
   type        income        education         prestige    
 bc  :21   Min.   : 7.00   Min.   :  7.00   Min.   : 3.00  
 prof:18   1st Qu.:21.00   1st Qu.: 26.00   1st Qu.:16.00  
 wc  : 6   Median :42.00   Median : 45.00   Median :41.00  
           Mean   :41.87   Mean   : 52.56   Mean   :47.69  
           3rd Qu.:64.00   3rd Qu.: 84.00   3rd Qu.:81.00  
           Max.   :81.00   Max.   :100.00   Max.   :97.00  

As a first graph, we view a histogram of the variable prestige:

with(Duncan, hist(prestige))

Examining the Data

The scatterplotMatrix() function in the car package produces scatterplots for all pairs of variables. A few relatively remote points are marked by case names, in this instance by occupation.

scatterplotMatrix( ~ prestige + education + income, 
    id=list(n=3), data=Duncan)

Regression Analysis

We use the lm() function to fit a linear regression model to the data:

(duncan.model <- lm(prestige ~ education + income, data=Duncan))

Call:
lm(formula = prestige ~ education + income, data = Duncan)

Coefficients:
(Intercept)    education       income  
    -6.0647       0.5458       0.5987  
summary(duncan.model)

Call:
lm(formula = prestige ~ education + income, data = Duncan)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.538  -6.417   0.655   6.605  34.641 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -6.06466    4.27194  -1.420    0.163    
education    0.54583    0.09825   5.555 1.73e-06 ***
income       0.59873    0.11967   5.003 1.05e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.37 on 42 degrees of freedom
Multiple R-squared:  0.8282,    Adjusted R-squared:   0.82 
F-statistic: 101.2 on 2 and 42 DF,  p-value: < 2.2e-16

Added-variable plot

Added-variable plots for Duncan’s regression, looking for influential cases:

avPlots(duncan.model, 
    id=list(cex=0.75, n=3, method="mahal"))
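As a quick check, the manual residual-on-residual construction reproduces the education coefficient (0.54583) from the full model above:

# Net income out of both prestige and education
e.prestige  <- resid(lm(prestige ~ income, data = Duncan))
e.education <- resid(lm(education ~ income, data = Duncan))
coef(lm(e.prestige ~ e.education))  # slope matches 0.54583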

Example 2: FEV and smoking, age, height

Recall the FEV data presented in the notes:

  • In unadjusted models, smokers had lower FEV than non-smokers. We suspect this is due to confounding by age

  • Age is suspected to be associated with smoking (in the sample) and FEV (in truth)

  • In unadjusted models, height is likely associated with FEV and age. There is likely an association between height and smoking through age.

  • In adjusted models, we might expect

  1. Lower FEV comparing a smoker to non-smoker of the same age and height
  2. Higher FEV comparing individuals differing by one year of age but with the same smoking status and height
  3. Higher FEV in taller individuals of the same age and smoking status

Try to identify these associations in the plots that follow.

fev <- read.table("http://www.emersonstatistics.com/Datasets/fev.txt", header = TRUE)
fev$smoker <- ifelse(fev$smoke == 2, 0, 1)
summary(fev)
     seqnbr          subjid           age              fev       
 Min.   :  1.0   Min.   :  201   Min.   : 3.000   Min.   :0.791  
 1st Qu.:164.2   1st Qu.:15811   1st Qu.: 8.000   1st Qu.:1.981  
 Median :327.5   Median :36071   Median :10.000   Median :2.547  
 Mean   :327.5   Mean   :37170   Mean   : 9.931   Mean   :2.637  
 3rd Qu.:490.8   3rd Qu.:53639   3rd Qu.:12.000   3rd Qu.:3.119  
 Max.   :654.0   Max.   :90001   Max.   :19.000   Max.   :5.793  
     height           sex            smoke           smoker       
 Min.   :46.00   Min.   :1.000   Min.   :1.000   Min.   :0.00000  
 1st Qu.:57.00   1st Qu.:1.000   1st Qu.:2.000   1st Qu.:0.00000  
 Median :61.50   Median :1.000   Median :2.000   Median :0.00000  
 Mean   :61.14   Mean   :1.486   Mean   :1.901   Mean   :0.09939  
 3rd Qu.:65.50   3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:0.00000  
 Max.   :74.00   Max.   :2.000   Max.   :2.000   Max.   :1.00000  

Scatterplot matrix

The scatterplot matrix summarizes the unadjusted associations.

scatterplotMatrix( ~ fev + age + height + smoker, data=fev)

Regression Analysis

We use the lm() function to fit a linear regression model to the data for those aged 9 and older:

(fev.model <- lm(fev ~ smoker + height + age, data=fev[fev$age>=9,]))

Call:
lm(formula = fev ~ smoker + height + age, data = fev[fev$age >= 
    9, ])

Coefficients:
(Intercept)       smoker       height          age  
    -6.4313      -0.1784       0.1353       0.0712  
summary(fev.model)

Call:
lm(formula = fev ~ smoker + height + age, data = fev[fev$age >= 
    9, ])

Residuals:
     Min       1Q   Median       3Q      Max 
-1.63401 -0.27416  0.00003  0.26550  1.82040 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -6.431311   0.355959 -18.068  < 2e-16 ***
smoker      -0.178429   0.065107  -2.741  0.00639 ** 
height       0.135303   0.006322  21.401  < 2e-16 ***
age          0.071201   0.011884   5.991 4.37e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4478 on 435 degrees of freedom
Multiple R-squared:  0.6632,    Adjusted R-squared:  0.6609 
F-statistic: 285.5 on 3 and 435 DF,  p-value: < 2.2e-16

Added-variable plot

Added-variable plots for the FEV regression visualize the adjusted associations:

avPlots(fev.model)

Initial Lab Setup

For this lab, we will be using salary data from the year 1995. We will focus on the variables salary, sex, and yrdeg.

Initial dataset manipulations

  1. Read in the salary dataset
  2. Remove any observations that are not from 1995 (use the ‘year’ variable)
  3. Describe the dataset
  4. Create an indicator variable for male gender (a sketch of these steps appears after this list)
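A minimal sketch of these steps. The file location, the coding of year, and the coding of sex are assumptions here; adjust them to match your copy of the data and its codebook.

# 1. Read in the salary dataset (file name and format are assumed)
salary <- read.table("salary.txt", header = TRUE)
# 2. Keep only observations from 1995 (year may be coded 95 or 1995; check the data)
salary95 <- salary[salary$year == 95, ]
# 3. Describe the dataset
summary(salary95)
# 4. Indicator for male gender (assumes sex is coded "M"/"F"; check the codebook)
salary95$male <- ifelse(salary95$sex == "M", 1, 0)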

Lab Part 1

Model 1: Fit a simple linear regression model with salary as the outcome and male as the predictor. Save the residuals from this model. Interpret these residuals in terms of the unexplained variability in salary.
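A sketch of Model 1, using the salary95 data frame and male indicator from the setup sketch above:

model1 <- lm(salary ~ male, data = salary95)
r1 <- resid(model1)  # salary variability left unexplained by male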

Model 2: Fit a simple linear regression model with yrdeg as the outcome and male as the predictor. Save the residuals from this model. Interpret these residuals in terms of the unexplained variability in yrdeg.
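A sketch of Model 2, in the same style:

model2 <- lm(yrdeg ~ male, data = salary95)
r2 <- resid(model2)  # yrdeg variability left unexplained by male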

Plot the residuals from model 2 (x-axis) versus the residuals from model 1 (y-axis). Describe any association you see. It may be helpful to add a lowess smooth or other smooth line to the plot.
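One way to draw this plot with a lowess smooth, continuing with the residuals r1 and r2 saved above:

plot(r2, r1,
     xlab = "Residuals from model 2 (yrdeg ~ male)",
     ylab = "Residuals from model 1 (salary ~ male)")
lines(lowess(r2, r1), col = "red")  # lowess smooth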

Model 3: Fit a simple linear regression model using the residuals from model 1 as the outcome and the residuals from model 2 as the predictor. Interpret the slope coefficient from this model.
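A sketch of Model 3, regressing residuals on residuals:

model3 <- lm(r1 ~ r2)
summary(model3)  # slope: change in unexplained salary per unit of unexplained yrdeg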

Model 4: Fit a multivariable linear regression model with salary as the outcome using predictors male and yrdeg. What is the interpretation of the male coefficient in this model? What is the interpretation of the yrdeg coefficient?
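A sketch of Model 4:

model4 <- lm(salary ~ male + yrdeg, data = salary95)
summary(model4)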

Compare the slope estimate for yrdeg from Model 3 to the slope estimate obtained in Model 4. Explain your findings.

Lab Part 2

Create a new variable FULL that takes on the value 1 for Full Professors and 0 for Assistant or Associate Professors.
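A sketch of this recoding. The name and coding of the rank variable are assumptions; check your codebook and adjust accordingly.

# Assumes a variable rank coded "Full", "Assoc", "Assist" (verify against the codebook)
salary95$FULL <- ifelse(salary95$rank == "Full", 1, 0)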

Determine if FULL explains some of the variability in salary after adjusting for year of degree and gender, both by fitting the multivariable regression model and by regressing the residuals from “Model A” on the residuals from “Model B” as was done previously (you will need to figure out what “Model A” and “Model B” should be). Compare the results from the two approaches.