Another way to think about regression is in terms of the amount of variability in the outcome (Y) that is explained by the predictors (X). In simple linear regression, we regress Y on X because we believe that X will explain some of the variability in Y. This leads to an alternate way of thinking about statistical tests for regression coefficients in terms of the variability of the outcome. Specifically, the null hypothesis is that X explains none of the variability in Y, and the alternative hypothesis is that X explains more variability than would be expected by chance alone. In this lab we will consider the interpretation of statistical tests in a multivariable model that adds another predictor (W) to the model. Salary will be the outcome (Y), male gender the predictor of interest (X), and year of degree the additional covariate (W).
An added-variable plot is a scatterplot of transformations of an independent variable (say, \(X_1\)) and the dependent variable (\(y\)) that nets out the influence of all the other independent variables. The fitted regression line through the origin between these transformed variables has the same slope as the coefficient on \(X_1\) in the full regression model that includes all the independent variables. An added-variable plot is the multivariable analogue of a simple scatterplot with a regression fit, which shows the relationship between a single \(x\) variable and \(y\) when there are no other covariates. An added-variable plot is a visually compelling method for showing the nature of the partial correlation between \(X_1\) and \(y\) as estimated in a multiple regression.
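The equal-slopes property described above can be sketched in R. This is an illustrative example using the built-in mtcars dataset (not the lab data): the slope from regressing residuals on residuals matches the coefficient from the full model.

```r
# Added-variable (partial regression) construction, illustrated with
# the built-in mtcars data: full model is mpg ~ wt + hp, and we net
# out hp from both mpg and wt.
full <- lm(mpg ~ wt + hp, data = mtcars)
e_y  <- resid(lm(mpg ~ hp, data = mtcars))  # mpg with hp netted out
e_x  <- resid(lm(wt  ~ hp, data = mtcars))  # wt with hp netted out
partial <- lm(e_y ~ e_x)                    # fit between transformed variables
coef(full)["wt"]                            # slope on wt in the full model
coef(partial)["e_x"]                        # same slope from the residual fit
```

The two printed slopes are identical; this is the Frisch–Waugh–Lovell result that underlies the added-variable plot.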
New functions
To obtain the residuals from a fitted model, apply the resid() function to the model object.
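A minimal example of resid(), using the built-in cars dataset rather than the lab data:

```r
# Fit a simple linear regression and extract its residuals.
fit <- lm(dist ~ speed, data = cars)
r   <- resid(fit)        # one residual per observation
length(r)                # same length as the number of rows in cars
sum(r)                   # residuals sum to (numerically) zero
```

With an intercept in the model, the residuals always sum to zero up to numerical precision.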
avPlots() in the car package produces added-variable, also called partial-regression, plots for linear and generalized linear models.
In Part 1 and Part 2 of the lab, we will construct added-variable plots in steps, without the use of a specialized function.
Example 1: Duncan Data
Duncan’s Occupational Prestige: data on the prestige and other characteristics of 45 U.S. occupations in 1950.
library(car)
Loading required package: carData
head(Duncan)
type income education prestige
accountant prof 62 86 82
pilot prof 72 76 83
architect prof 75 92 90
author prof 55 90 76
chemist prof 64 86 90
minister prof 21 84 87
Obtain summary statistics for the variables in Duncan:
summary(Duncan)
type income education prestige
bc :21 Min. : 7.00 Min. : 7.00 Min. : 3.00
prof:18 1st Qu.:21.00 1st Qu.: 26.00 1st Qu.:16.00
wc : 6 Median :42.00 Median : 45.00 Median :41.00
Mean :41.87 Mean : 52.56 Mean :47.69
3rd Qu.:64.00 3rd Qu.: 84.00 3rd Qu.:81.00
Max. :81.00 Max. :100.00 Max. :97.00
As a first graph, we view a histogram of the variable prestige:
with(Duncan, hist(prestige))
Examining the Data
The scatterplotMatrix() function in the car package produces scatterplots for all pairs of variables. A few relatively remote points are marked by case names, in this instance by occupation.
Example 2: FEV Data
Obtain summary statistics for the variables in the FEV dataset:
summary(fev)
seqnbr subjid age fev
Min. : 1.0 Min. : 201 Min. : 3.000 Min. :0.791
1st Qu.:164.2 1st Qu.:15811 1st Qu.: 8.000 1st Qu.:1.981
Median :327.5 Median :36071 Median :10.000 Median :2.547
Mean :327.5 Mean :37170 Mean : 9.931 Mean :2.637
3rd Qu.:490.8 3rd Qu.:53639 3rd Qu.:12.000 3rd Qu.:3.119
Max. :654.0 Max. :90001 Max. :19.000 Max. :5.793
height sex smoke smoker
Min. :46.00 Min. :1.000 Min. :1.000 Min. :0.00000
1st Qu.:57.00 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:0.00000
Median :61.50 Median :1.000 Median :2.000 Median :0.00000
Mean :61.14 Mean :1.486 Mean :1.901 Mean :0.09939
3rd Qu.:65.50 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:0.00000
Max. :74.00 Max. :2.000 Max. :2.000 Max. :1.00000
Scatterplot matrix
The scatterplot matrix summarizes the unadjusted associations.
scatterplotMatrix( ~ fev + age + height + smoker, data=fev)
Regression Analysis
We use the lm() function to fit a linear regression model to the data for subjects aged 9 and older.
fev.model <- lm(fev ~ smoker + height + age, data = fev[fev$age >= 9, ])
summary(fev.model)
Call:
lm(formula = fev ~ smoker + height + age, data = fev[fev$age >=
9, ])
Residuals:
Min 1Q Median 3Q Max
-1.63401 -0.27416 0.00003 0.26550 1.82040
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.431311 0.355959 -18.068 < 2e-16 ***
smoker -0.178429 0.065107 -2.741 0.00639 **
height 0.135303 0.006322 21.401 < 2e-16 ***
age 0.071201 0.011884 5.991 4.37e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4478 on 435 degrees of freedom
Multiple R-squared: 0.6632, Adjusted R-squared: 0.6609
F-statistic: 285.5 on 3 and 435 DF, p-value: < 2.2e-16
Added-variable plot
Added-variable plots for the FEV regression visualize the adjusted associations:
avPlots(fev.model)
Initial Lab Setup
For this lab, we will be using salary data from the year 1995. We will focus on the variables salary, sex, and yrdeg.
Initial dataset manipulations
Read in the salary dataset
Remove any observations that are not from 1995 (use the ‘year’ variable)
Describe the dataset
Create an indicator variable for male gender
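The setup steps above might be sketched as follows. This is a hedged sketch: a tiny made-up data frame stands in for the real salary file, and the variable codings (year recorded as 95, sex coded "M"/"F") are assumptions.

```r
# Stand-in for the salary dataset (values and codings are assumptions).
salary <- data.frame(
  year   = c(95, 95, 94, 95),
  salary = c(6000, 5500, 5000, 7000),
  sex    = c("F", "M", "M", "M"),
  yrdeg  = c(85, 90, 80, 75)
)

salary95 <- salary[salary$year == 95, ]            # keep only 1995 records
summary(salary95)                                  # describe the dataset
salary95$male <- as.numeric(salary95$sex == "M")   # indicator for male gender
```

With the real data, the first step would instead read the salary file from disk; the subsetting and indicator-creation steps are the same.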
Lab Part 1
Model 1: Fit a simple linear regression model with salary as the outcome and male as the predictor. Save the residuals from this model. Interpret these residuals in terms of the unexplained variability in salary.
Model 2: Fit a simple linear regression model with yrdeg as the outcome and male as the predictor. Save the residuals from this model. Interpret these residuals in terms of the unexplained variability in yrdeg.
Plot the residuals from model 2 (X-axis) versus the residuals from model 1 (y-axis). Describe any association you see. It may be helpful to add a lowess smooth or other smooth line to the plot.
Model 3: Fit a simple linear regression model using the residuals from model 1 as the outcome and the residuals from model 2 as the predictor. Interpret the slope coefficient from this model.
Model 4: Fit a multivariable linear regression model with salary as the outcome using predictors male and yrdeg. What is the interpretation of the male coefficient in this model? What is the interpretation of the yrdeg coefficient?
Compare the slope estimate from Model 3 to the slope estimate for yrdeg obtained in Model 4. Explain your findings.
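The Model 1–4 sequence can be sketched with simulated data standing in for the salary file (the variable names salary, male, and yrdeg follow the lab; the simulated coefficients are arbitrary assumptions).

```r
# Simulated stand-in for the salary data.
set.seed(1)
n      <- 200
male   <- rbinom(n, 1, 0.5)
yrdeg  <- sample(70:95, n, replace = TRUE)
salary <- 3000 + 500 * male - 40 * yrdeg + rnorm(n, sd = 300)
d <- data.frame(salary, male, yrdeg)

r1 <- resid(lm(salary ~ male, data = d))   # Model 1: salary unexplained by male
r2 <- resid(lm(yrdeg  ~ male, data = d))   # Model 2: yrdeg unexplained by male
plot(r2, r1)                               # residual-vs-residual scatterplot
lines(lowess(r2, r1))                      # lowess smooth
m3 <- lm(r1 ~ r2)                          # Model 3: residuals on residuals
m4 <- lm(salary ~ male + yrdeg, data = d)  # Model 4: multivariable model
coef(m3)["r2"]                             # slope from Model 3
coef(m4)["yrdeg"]                          # slope on yrdeg from Model 4
```

The two printed slopes agree exactly, which is the point of the Model 3 versus Model 4 comparison.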
Lab Part 2
Create a new variable FULL that takes on the value 1 for Full Professors and 0 for Assistant or Associate Professors.
Determine whether FULL explains some of the variability in salary after adjusting for year of degree and gender, both by fitting the multivariable regression model and by regressing the residuals from “Model A” on the residuals from “Model B” as was done previously (you will need to figure out what “Model A” and “Model B” should be). Compare the results from the two approaches.
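Creating the FULL indicator might look like the sketch below. The rank codings here are stand-in values, since the actual variable name and level labels in the salary dataset are assumptions.

```r
# Stand-in rank values (the real dataset's coding may differ).
rank <- c("Assist", "Full", "Assoc", "Full")
FULL <- as.numeric(rank == "Full")   # 1 for Full Professors, 0 otherwise
FULL
```

The same one-line pattern used to create the male indicator earlier applies here; only the variable and level being tested change.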