library(ggplot2)
library(sandwich)
library(lmtest)Loading required package: zoo
Attaching package: 'zoo'
The following objects are masked from 'package:base':
as.Date, as.Date.numeric
Implement weighted regression (using inverse variance weights) in R. Compare the unweighted, weighted estimates. Also compare the standard error estimates.
Demonstrate how R does not handle frequency weights appropriately using the weights option.
library(ggplot2)
library(sandwich)
library(lmtest)Loading required package: zoo
Attaching package: 'zoo'
The following objects are masked from 'package:base':
as.Date, as.Date.numeric
fev <- read.table("http://www.emersonstatistics.com/Datasets/fev.txt", header = TRUE)
fev$ht3 <- fev$height^3/1E5ggplot(fev, aes(y=fev, x=height)) + geom_point() + geom_smooth(method="lm")`geom_smooth()` using formula = 'y ~ x'
ggplot(fev, aes(y=fev, x=ht3)) + geom_point() + geom_smooth(method="lm")`geom_smooth()` using formula = 'y ~ x'
m1 <- lm(fev ~ ht3, data=fev)
coeftest(m1) # Classical std error
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.174112 0.061955 -2.8103 0.005098 **
ht3 1.198644 0.025501 47.0047 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(m1, vcov=sandwich) # Sandwich estimator
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.174112 0.062434 -2.7888 0.005445 **
ht3 1.198644 0.029889 40.1038 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fev$wts <- 1/(fev$ht3)^2
m2 <- lm(fev ~ ht3, weights=wts, data=fev)
coeftest(m2)
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.078921 0.045783 -1.7238 0.08522 .
ht3 1.156422 0.022137 52.2393 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(m2, vcov=sandwich)
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.078921 0.049714 -1.5875 0.1129
ht3 1.156422 0.024179 47.8271 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
| Unweighted | Weighted | |
| Classical (model based) | 0.0255 | 0.0221 |
| Robust | 0.0299 | 0.0242 |
Note that R does frequency (replication) weights incorrectly. The simplest answer as to why this is the case is that frequency weights are not very useful in practice and require different methodology to implement. See the Details of the help(lm) for details on the problem. In the example below, the standard errors are not the same.
If you look closely at the output below, the estimate and the standard error of height from m3 (all weights set to 2) is the same as m1 above (unweighted, which implicitly is all weights set to 1). R is almost certainly standardizing the weights so that they sum to one. That is, in m1, every observation gets weight of \(1/n\); In m3, every observation gets a weight of \(2/2n=1/n\).
# For simplicity, just say every observation observed twice
fev$freqwts <- rep(2, nrow(fev))
m3 <- lm(fev ~ ht3, data=fev, weights=freqwts)
coeftest(m3)
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.174112 0.061955 -2.8103 0.005098 **
ht3 1.198644 0.025501 47.0047 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(m3, vcov=sandwich)
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.174112 0.062434 -2.7888 0.005445 **
ht3 1.198644 0.029889 40.1038 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fev.big <- rbind(fev,fev)
m4 <- lm(fev ~ ht3, data=fev.big)
coeftest(m4)
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.174112 0.043775 -3.9774 7.349e-05 ***
ht3 1.198644 0.018018 66.5257 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(m4, vcov=sandwich)
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.174112 0.044147 -3.9439 8.442e-05 ***
ht3 1.198644 0.021134 56.7154 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
m4 gives the correct results for the frequency weights. By doubling the sample size, we have decreased the standard errors of beta0 and beta1 by (about) the square root of 2.
sqrt(diag(vcov(m1)))(Intercept) ht3
0.06195478 0.02550051
sqrt(diag(vcov(m4)))(Intercept) ht3
0.04377509 0.01801777
sqrt(diag(vcov(m1))) / sqrt(diag(vcov(m4)))(Intercept) ht3
1.415298 1.415298
sqrt(2)[1] 1.414214