(Mini) Lab: Weighted Regression

Chris Slaughter

Preliminary

Goals

  1. Implement weighted regression (using inverse variance weights) in R. Compare the unweighted, weighted estimates. Also compare the standard error estimates.

  2. Demonstrate how R does not handle frequency weights appropriately using the weights option.

Packages

library(ggplot2)
library(sandwich)
library(lmtest)
Loading required package: zoo

Attaching package: 'zoo'
The following objects are masked from 'package:base':

    as.Date, as.Date.numeric

Data

fev <- read.table("http://www.emersonstatistics.com/Datasets/fev.txt", header = TRUE)
fev$ht3 <- fev$height^3/1E5

Descriptive plots

FEV and height

ggplot(fev, aes(y=fev, x=height)) + geom_point() + geom_smooth(method="lm")
`geom_smooth()` using formula = 'y ~ x'

FEV and height cubed

ggplot(fev, aes(y=fev, x=ht3)) + geom_point() + geom_smooth(method="lm")
`geom_smooth()` using formula = 'y ~ x'

Linear regression results

Unweighted, classical and sandwich

m1 <- lm(fev ~ ht3, data=fev)
coeftest(m1) # Classical std error

t test of coefficients:

             Estimate Std. Error t value  Pr(>|t|)    
(Intercept) -0.174112   0.061955 -2.8103  0.005098 ** 
ht3          1.198644   0.025501 47.0047 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(m1, vcov=sandwich) # Sandwich estimator

t test of coefficients:

             Estimate Std. Error t value  Pr(>|t|)    
(Intercept) -0.174112   0.062434 -2.7888  0.005445 ** 
ht3          1.198644   0.029889 40.1038 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fev$wts <- 1/(fev$ht3)^2
m2 <- lm(fev ~ ht3, weights=wts, data=fev)
coeftest(m2)

t test of coefficients:

             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.078921   0.045783 -1.7238  0.08522 .  
ht3          1.156422   0.022137 52.2393  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(m2, vcov=sandwich)

t test of coefficients:

             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.078921   0.049714 -1.5875   0.1129    
ht3          1.156422   0.024179 47.8271   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Comparison of standard errors

Unweighted Weighted
Classical (model based) 0.0255 0.0221
Robust 0.0299 0.0242

Note on Frequency weights in R

Note that R does frequency (replication) weights incorrectly. The simplest answer as to why this is the case is that frequency weights are not very useful in practice and require different methodology to implement. See the Details of the help(lm) for details on the problem. In the example below, the standard errors are not the same.

If you look closely at the output below, the estimate and the standard error of height from m3 (all weights set to 2) is the same as m1 above (unweighted, which implicitly is all weights set to 1). R is almost certainly standardizing the weights so that they sum to one. That is, in m1, every observation gets weight of \(1/n\); In m3, every observation gets a weight of \(2/2n=1/n\).

# For simplicity, just say every observation observed twice
fev$freqwts <- rep(2, nrow(fev))

m3 <- lm(fev ~ ht3, data=fev, weights=freqwts)
coeftest(m3)

t test of coefficients:

             Estimate Std. Error t value  Pr(>|t|)    
(Intercept) -0.174112   0.061955 -2.8103  0.005098 ** 
ht3          1.198644   0.025501 47.0047 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(m3, vcov=sandwich)

t test of coefficients:

             Estimate Std. Error t value  Pr(>|t|)    
(Intercept) -0.174112   0.062434 -2.7888  0.005445 ** 
ht3          1.198644   0.029889 40.1038 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fev.big <- rbind(fev,fev)
m4 <- lm(fev ~ ht3, data=fev.big)
coeftest(m4)

t test of coefficients:

             Estimate Std. Error t value  Pr(>|t|)    
(Intercept) -0.174112   0.043775 -3.9774 7.349e-05 ***
ht3          1.198644   0.018018 66.5257 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(m4, vcov=sandwich)

t test of coefficients:

             Estimate Std. Error t value  Pr(>|t|)    
(Intercept) -0.174112   0.044147 -3.9439 8.442e-05 ***
ht3          1.198644   0.021134 56.7154 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

m4 gives the correct results for the frequency weights. By doubling the sample size, we have decreased the standard errors of beta0 and beta1 by (about) the square root of 2.

sqrt(diag(vcov(m1)))
(Intercept)         ht3 
 0.06195478  0.02550051 
sqrt(diag(vcov(m4)))
(Intercept)         ht3 
 0.04377509  0.01801777 
sqrt(diag(vcov(m1))) / sqrt(diag(vcov(m4)))
(Intercept)         ht3 
   1.415298    1.415298 
sqrt(2)
[1] 1.414214