When does the p-value (standard error) of a linear-model coefficient decrease as the number of levels of a categorical predictor increases, and why does this happen? I fail to see how collinearity and/or regressing residuals on one of the variables [1] is relevant to contrasts of categorical predictors.
Suppose we have:
dat <- data.frame(
  label = rep(LETTERS[1:3], each = 4),
  value = c(
    1.00, 0.96, 0.96, 1.03, # A
    0.74, 0.45, 0.01, 0.89, # B
    1.00, 1.02, 1.04, 1.06  # C
  )
)
Notice that one value in level B (0.01) is suspicious; in any case, the overall mean of group B is somewhat lower, too. Now, then:
round(coef(summary(lm(value ~ label, dat[1:8, ]))), 3) # levels A and B
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.988 0.137 7.182 0.000
# labelB -0.465 0.194 -2.391 0.054
round(coef(summary(lm(value ~ label, dat))), 3) # levels A, B, and C
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.988 0.113 8.777 0.000
# labelB -0.465 0.159 -2.922 0.017
# labelC 0.042 0.159 0.267 0.795
The p-value for level B has decreased roughly 3x in the second case (0.054 → 0.017). If anything, I would expect it to increase, e.g. to counteract the inflation of the family-wise error rate due to multiple testing.
I can't wrap my head around this: if A were my baseline condition (control), then, without going Bayesian, I could indefinitely increase my confidence that treatment B has an effect just by testing more treatments (C, D, E, ..., Z), without ever increasing the sample size in group A or B.
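The mechanics can be made explicit by rebuilding the `labelB` standard error by hand (a sketch not from the post itself): `lm()` pools the residual variance across *all* groups, and the SE of the B-vs-A contrast is `sigma_hat * sqrt(1/n_A + 1/n_B)`, so a tight third group shrinks `sigma_hat` and adds residual degrees of freedom.

```r
# Self-contained sketch: why adding group C shrinks the SE of labelB.
dat <- data.frame(
  label = rep(LETTERS[1:3], each = 4),
  value = c(1.00, 0.96, 0.96, 1.03,   # A
            0.74, 0.45, 0.01, 0.89,   # B
            1.00, 1.02, 1.04, 1.06)   # C
)
fit2 <- lm(value ~ label, dat[1:8, ])  # A and B only
fit3 <- lm(value ~ label, dat)         # A, B, and C

sigma(fit2)  # ~0.275: residual SD pooled over A and B (6 residual df)
sigma(fit3)  # ~0.225: tight group C pulls the pooled estimate down (9 df)

# Same contrast (B - A), but smaller sigma_hat and more residual df:
sigma(fit2) * sqrt(1/4 + 1/4)  # ~0.194, the labelB SE in the first fit
sigma(fit3) * sqrt(1/4 + 1/4)  # ~0.159, the labelB SE in the second fit
```

With an identical estimate of -0.465 divided by a smaller SE, and the t statistic referred to a t distribution with more degrees of freedom, the p-value drops on both counts.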
1) `lm` assumes homogeneity of variance. That seems to be the essence of it. 2) Perhaps also mention that if the homogeneity-of-variance assumption is not met, one could consult post1 and post2. – Vallo Varik Mar 10 '21 at 17:30
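Following up on the comment, a hedged sketch of the alternative: if the groups are not assumed to share a variance (and here B is visibly noisier than A or C), a Welch t-test compares A and B using only those two groups, so adding further treatments C..Z cannot move its p-value at all.

```r
# Welch t-test on A vs B only (var.equal = FALSE is the default,
# written out here for emphasis). Its p-value depends solely on
# groups A and B, so extra treatment groups can never change it.
dat <- data.frame(
  label = rep(LETTERS[1:3], each = 4),
  value = c(1.00, 0.96, 0.96, 1.03,   # A
            0.74, 0.45, 0.01, 0.89,   # B
            1.00, 1.02, 1.04, 1.06)   # C
)
t.test(value ~ label, dat[1:8, ], var.equal = FALSE)
```

Note that the Welch p-value comes out larger than either `lm()` p-value above, because group B's large variance is no longer diluted by pooling with quieter groups, and the Welch–Satterthwaite degrees of freedom are small.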