
Given a clustered data set with no variation within a cluster, shouldn't a regression weighted with the inverse cluster size give the same results as a regression with only one observation per cluster?

Here is sample data (5 clusters, with identical observations within each cluster) and sample code in R:

data <- data.frame(
  y       = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 2, 3, 3),
  cl      = c("A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C", "C", "D", "E", "E"),
  weights = c(1/3, 1/3, 1/3, 1/4, 1/4, 1/4, 1/4, 1/5, 1/5, 1/5, 1/5, 1/5, 1, 1/2, 1/2),
  x1      = c(2, 2, 2, 1, 1, 1, 1, 3, 3, 3, 3, 3, 1, 2, 2),
  x2      = c(4, 4, 4, 2, 2, 2, 2, 4, 4, 4, 4, 4, 3, 3, 3),
  firsts  = c(1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0)
)

Regression with one observation per cluster

summary(lm(y ~ x1 + x2, data = data[data$firsts == 1, ]))
#> 
#> Call:
#> lm(formula = y ~ x1 + x2, data = data[data$firsts == 1, ])
#> 
#> Residuals:
#>       1       4       8      13      14 
#> -0.6000 -0.4667  0.1333  0.6000  0.3333 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)   3.3333     1.5202   2.193    0.160
#> x1            1.2667     0.7055   1.795    0.214
#> x2           -1.0667     0.7055  -1.512    0.270
#> 
#> Residual standard error: 0.7303 on 2 degrees of freedom
#> Multiple R-squared:  0.619,  Adjusted R-squared:  0.2381 
#> F-statistic: 1.625 on 2 and 2 DF,  p-value: 0.381

Regression with weights equal to inverse cluster size

summary(lm(y ~ x1 + x2, data = data, weights = weights))
#> 
#> Call:
#> lm(formula = y ~ x1 + x2, data = data, weights = weights)
#> 
#> Weighted Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -0.34641 -0.23333  0.05963  0.05963  0.60000 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   3.3333     0.6206   5.371 0.000168 ***
#> x1            1.2667     0.2880   4.398 0.000869 ***
#> x2           -1.0667     0.2880  -3.703 0.003018 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.2981 on 12 degrees of freedom
#> Multiple R-squared:  0.619,  Adjusted R-squared:  0.5556 
#> F-statistic: 9.75 on 2 and 12 DF,  p-value: 0.003056

Created on 2023-09-14 with reprex v2.0.2

The coefficients are the same but not the standard errors. Why?

Side note: The standard errors also aren't similar if I use cluster-robust standard errors.

Irazall
  • With anova(fit1) and anova(fit2) you can check that the sums of squares match because of how you've chosen the weights. The residual degrees of freedom don't match, though. So the residual standard error is not the same, and hence all the other differences between the two summary tables. – dipetkov Sep 14 '23 at 19:02
  • Thank you @dipetkov! But wouldn't this mean that I could artificially increase my degrees of freedom (and thus increase my t-value) by creating duplicate observations and then downsizing them via inverse cluster size weights? – Irazall Sep 14 '23 at 20:19
  • I'm not sure what the purpose of this exercise is. The weights in lm are "precision weights" to be used in the (weighted) least squares fitting. They are not "frequency weights" to represent multiple observations with the same measurements. – dipetkov Sep 14 '23 at 20:43
  • Threads about different types of weights: 1, 2, 3.... – dipetkov Sep 14 '23 at 20:45
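The check suggested in the comments can be sketched as follows (fit1 and fit2 are my names for the two models from the question; the data frame is the same as above):

```r
# Same data as in the question
df <- data.frame(
  y      = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 2, 3, 3),
  cl     = c("A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C", "C", "D", "E", "E"),
  w      = c(1/3, 1/3, 1/3, 1/4, 1/4, 1/4, 1/4, 1/5, 1/5, 1/5, 1/5, 1/5, 1, 1/2, 1/2),
  x1     = c(2, 2, 2, 1, 1, 1, 1, 3, 3, 3, 3, 3, 1, 2, 2),
  x2     = c(4, 4, 4, 2, 2, 2, 2, 4, 4, 4, 4, 4, 3, 3, 3),
  firsts = c(1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0)
)

# fit1: one observation per cluster; fit2: all rows, inverse-cluster-size weights
fit1 <- lm(y ~ x1 + x2, data = df[df$firsts == 1, ])
fit2 <- lm(y ~ x1 + x2, data = df, weights = w)

# The coefficients agree ...
all.equal(coef(fit1), coef(fit2))          # TRUE

# ... and so do the (weighted) residual sums of squares: each cluster of
# identical rows contributes n_g * (1/n_g) * r_g^2 = r_g^2
all.equal(deviance(fit1), deviance(fit2))  # TRUE

# But the residual degrees of freedom differ (5 - 3 = 2 vs 15 - 3 = 12),
# which is exactly what scales the two residual standard errors apart
c(df.residual(fit1), df.residual(fit2))    # 2 12
```

Since the residual standard error is sqrt(RSS / df), identical RSS with df of 2 versus 12 reproduces the 0.7303 versus 0.2981 seen in the two summaries.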

0 Answers