
Given a clustered data set with no variation within a cluster, shouldn't a regression weighted with the inverse cluster size give the same results as a regression with only one observation per cluster?

Here is sample data (5 clusters, with identical observations within each cluster) and sample code in R:

data <- data.frame(
  y       = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 2, 3, 3),
  cl      = c("A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C", "C", "D", "E", "E"),
  weights = c(1/3, 1/3, 1/3, 1/4, 1/4, 1/4, 1/4, 1/5, 1/5, 1/5, 1/5, 1/5, 1, 1/2, 1/2),
  x1      = c(2, 2, 2, 1, 1, 1, 1, 3, 3, 3, 3, 3, 1, 2, 2),
  x2      = c(4, 4, 4, 2, 2, 2, 2, 4, 4, 4, 4, 4, 3, 3, 3),
  firsts  = c(1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0)
)

Regression with one observation per cluster

summary(lm(y ~ x1 + x2, data = data[data$firsts == 1, ]))
#> 
#> Call:
#> lm(formula = y ~ x1 + x2, data = data[data$firsts == 1, ])
#> 
#> Residuals:
#>       1       4       8      13      14 
#> -0.6000 -0.4667  0.1333  0.6000  0.3333 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)   3.3333     1.5202   2.193    0.160
#> x1            1.2667     0.7055   1.795    0.214
#> x2           -1.0667     0.7055  -1.512    0.270
#> 
#> Residual standard error: 0.7303 on 2 degrees of freedom
#> Multiple R-squared:  0.619,  Adjusted R-squared:  0.2381 
#> F-statistic: 1.625 on 2 and 2 DF,  p-value: 0.381

Regression with weights equal to inverse cluster size

summary(lm(y ~ x1 + x2, data = data, weights = weights))
#> 
#> Call:
#> lm(formula = y ~ x1 + x2, data = data, weights = weights)
#> 
#> Weighted Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -0.34641 -0.23333  0.05963  0.05963  0.60000 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   3.3333     0.6206   5.371 0.000168 ***
#> x1            1.2667     0.2880   4.398 0.000869 ***
#> x2           -1.0667     0.2880  -3.703 0.003018 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.2981 on 12 degrees of freedom
#> Multiple R-squared:  0.619,  Adjusted R-squared:  0.5556 
#> F-statistic: 9.75 on 2 and 12 DF,  p-value: 0.003056

Created on 2023-09-14 with reprex v2.0.2

The coefficients are the same but not the standard errors. Why?

Side note: The standard errors also aren't similar if I use cluster-robust standard errors.

Irazall
  • With anova(fit1) and anova(fit2) you can check that the sums of squares match because of how you've chosen the weights. The residual degrees of freedom don't match, though. So the residual standard error is not the same, and hence all the other differences between the two summary tables. – dipetkov Sep 14 '23 at 19:02
  • Thank you @dipetkov! But wouldn't this mean that I could artificially increase my degrees of freedom (and thus increase my t-value) by creating duplicate observations and then downsizing them via inverse cluster size weights? – Irazall Sep 14 '23 at 20:19
  • I'm not sure what the purpose of this exercise is. The weights in lm are "precision weights" to be used in the (weighted) least squares fitting. They are not "frequency weights" to represent multiple observations with the same measurements. – dipetkov Sep 14 '23 at 20:43
  • Threads about different types of weights: 1, 2, 3.... – dipetkov Sep 14 '23 at 20:45
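The check suggested in the comments can be sketched as follows (fit1 and fit2 are my names for the two models from the question; the data frame is the same as above):

```r
# Same data as in the question
df <- data.frame(
  y      = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 2, 3, 3),
  cl     = c("A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C", "C", "D", "E", "E"),
  w      = c(1/3, 1/3, 1/3, 1/4, 1/4, 1/4, 1/4, 1/5, 1/5, 1/5, 1/5, 1/5, 1, 1/2, 1/2),
  x1     = c(2, 2, 2, 1, 1, 1, 1, 3, 3, 3, 3, 3, 1, 2, 2),
  x2     = c(4, 4, 4, 2, 2, 2, 2, 4, 4, 4, 4, 4, 3, 3, 3),
  firsts = c(1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0)
)

# fit1: one observation per cluster; fit2: all rows, inverse-cluster-size weights
fit1 <- lm(y ~ x1 + x2, data = df[df$firsts == 1, ])
fit2 <- lm(y ~ x1 + x2, data = df, weights = w)

# The coefficients agree ...
all.equal(coef(fit1), coef(fit2))          # TRUE

# ... and so do the (weighted) residual sums of squares: each cluster of
# identical rows contributes n_g * (1/n_g) * r_g^2 = r_g^2
all.equal(deviance(fit1), deviance(fit2))  # TRUE

# But the residual degrees of freedom differ (5 - 3 = 2 vs 15 - 3 = 12),
# which is exactly what scales the two residual standard errors apart
c(df.residual(fit1), df.residual(fit2))    # 2 12
```

Since the residual standard error is sqrt(RSS / df), identical RSS with df of 2 versus 12 reproduces the 0.7303 versus 0.2981 seen in the two summaries.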

0 Answers