Demonstrating complete-pooling, no-pooling, and partial-pooling regression in R

Question

Gelman & Hill (pp. 255-259) demonstrate in R how to achieve a "complete-pooling regression", "no-pooling regression", and "partial-pooling regression".

I don't have their data to replicate what they did. But using the data below, I was wondering if my understanding of these 3 types of regression is correct?

library(lme4)                                   # needed for partial-pooling
group <- gl(2, 50, labels = c("Ctl","Trt"))     # group indicator
    y <- c(Ctl = rnorm(50), Trt = rnorm(50, 1)) # dependent variable
complete_pooling <- lm(y ~ 1)
      no_pooling <- lm(y ~ group)
 partial_pooling <- lmer(y ~ 1 + (1|group))

Demetri Pananos · Accepted Answer · 2020-07-07T21:37:43.677

So I went ahead and generated some data to demonstrate that these work as expected.

library(tidyverse)
library(lme4)
if(!require(modelr)){
  install.packages('modelr')
}
library(modelr)
pop_mean<-10
n_groups<-4
groups<-gl(n_groups, 20)
Z<-model.matrix(~groups-1)
group_means<-rnorm(n_groups, 0, 2.5)
y<- pop_mean + Z%*%group_means + rnorm(length(groups), 0, 0.5)
d<-tibble(y, groups)

The data generating mechanism from the top down is as follows...

$$ \theta_i \sim \mathcal{N}(10, 2.5) $$

$$y_{i,j} \sim \mathcal{N}(\theta_i, 0.5) $$

Let's take a look at complete, no, and partial pooling.

Complete Pooling

This should return the same as the sample mean of y. This assumes that all the data are generated from a single normal distribution, with some mean and variance. The complete pooling uses all the data to estimate that one mean.

complete_pooling<-lm(y~1, data = d)
summary(complete_pooling)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)

(Intercept)    9.264      0.214   43.29   <2e-16 ***

Signif. codes:  0 ‘*’ 0.001 ‘’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.914 on 79 degrees of freedom

No Pooling

In this scenario, we agree that the groups are distinct, but we estimate their means using the data from those groups.

no_pooling<-lm(y~groups-1, data = d) #remove the intercept from the model
summary(no_pooling)
Coefficients:
        Estimate Std. Error t value Pr(>|t|)

groups1   6.2116     0.1045   59.44   <2e-16 ***
groups2  10.9183     0.1045  104.48   <2e-16 ***
groups3  10.5156     0.1045  100.63   <2e-16 ***
groups4   9.4088     0.1045   90.04   <2e-16 ***

group_means + pop_means # pretty close
>>> 6.311974 10.878787 10.354225  9.634138

So we estimate the group means fairly well.

Partial Pooling

partial_pooling<-lmer(y~ 1 + 1|groups, data = d)
summary(partial_pooling)
Random effects:
 Groups   Name        Variance Std.Dev.
 groups   (Intercept) 4.5362   2.1298

 Residual             0.2184   0.4673

Number of obs: 80, groups:  groups, 4
Fixed effects:
            Estimate Std. Error t value
(Intercept)    9.264      1.066   8.688
modelr::data_grid(d, groups) %>% modelr::add_predictions(partial_pooling)
A tibble: 4 x 2
groups  pred
  <fct>  <dbl>
1 1       6.22
2 2      10.9 
3 3      10.5 
4 4       9.41

As you can see, the estimates for the groups are partially pooled towards the population mean (they are slightly less extreme than the complete pooling model).

Here is some code to reproduce these results. The results are not exactly the same because I didn't set the random seed when I wrote this.

library(tidyverse)
library(lme4)
if(!require(modelr)){
  install.packages('modelr')
}
library(modelr)
#Generate data
set.seed(123)
pop_mean<-10
n_groups<-4
groups<-gl(n_groups, 20)
Z<-model.matrix(~groups-1)
group_means<-rnorm(n_groups, 0, 2.5)
y<- pop_mean + Z%*%group_means + rnorm(length(groups), 0, 0.5)
d = tibble(y, groups)
complete_pooling<-lm(y~1, data = d)
no_pooling<-lm(y~groups-1, data = d)
partial_pooling<-lmer(y~ 1 + 1|groups, data = d)
modelr::data_grid(d, groups) %>% modelr::add_predictions(partial_pooling)

EDIT:

Here is an example with a fixed effect.

library(tidyverse)
library(lme4)
if(!require(modelr)){
  install.packages('modelr')
}
library(modelr)
#Generate data
set.seed(123)
pop_mean<-10
n_groups<-4
groups<-gl(n_groups, 20)
x<-rnorm(length(groups))
Z<-model.matrix(~groups-1)
group_means<-rnorm(n_groups, 0, 2.5)
y<- pop_mean + 2x + Z%%group_means + rnorm(length(groups), 0, 0.5)
d = tibble(y, groups,x)
complete_pooling<-lm(y~x, data = d)
no_pooling<-lm(y~groups + x -1, data = d)
partial_pooling<-lmer(y~ x + 1 + 1|groups, data = d)
modelr::data_grid(d, groups,x=0) %>% modelr::add_predictions(partial_pooling)

You will note that the effect estimates in the partial pooling model are pooled towards the complete pooling estimates. They are ever so slightly closer.

@rnorouzian The bug is fixed. If the book your referencing has a particular example, please post a minimal example in your question. I can't access the pages from you link. The model formula in the lm call will depend on what is being modelled. — Demetri Pananos, Jul 05 '20 at 04:47
@rnorouzian I removed the intercept just as a means of more easily demonstrating what the group means were. Leaving the intercept in the model is equally valid, but changes the coefficients from the estimated group means to the difference between a reference group's mean and the indicated group's mean. — Demetri Pananos, Jul 05 '20 at 05:07
I think that is the correct expression for the ICC. So far as fixed effects go, let me see if I have time later today — Demetri Pananos, Jul 07 '20 at 14:00
@rnorouzian I'm not particualtly interested in reading that article at the moment. I use x=0 because the output would essentially be the estimates of the means for the groups when x=0, similar to what is reported in the complete pooling model. — Demetri Pananos, Jul 09 '20 at 04:12
The coefficients are fixed, I just generate them as random. So I don't care what the values of group_means are, but in the model they are fixed, not random. — Demetri Pananos, Jul 09 '20 at 04:56
@rnorouzian can you explain a little more? You want to estimate the error the partial pooling model makes in estimating group level effects? — Demetri Pananos, Jul 11 '20 at 23:45
I'm not sure what theory dictates, but intuitively the standard error in the partial pooling model should be larger. — Demetri Pananos, Jul 12 '20 at 02:02
Again, its only my intuition and if the results are contrary to that then I suggestss actually referencing the theory. If you have future questions, I would suggest you ask them in a new thread and let other people provide answers. I won't be responding to further questions in this thread. — Demetri Pananos, Jul 12 '20 at 02:20
@ Demetri Pananos: Hello! I recently posted a question here on Partial Pooling (https://stats.stackexchange.com/questions/615785/understanding-the-meaning-of-pooling-in-statistics) . My question was deemed as a "duplicate" of this question (https://stats.stackexchange.com/questions/475549/demonstrating-complete-pooling-no-pooling-and-partial-pooling-regression-in-r), but I believe that my question is not a duplicate as it is asking a different question (i.e. why does partial pooling fundamentally not occur in fixed effects models). I would be curious to hear your opinions on this - thanks! — stats_noob, May 14 '23 at 04:17

Demonstrating complete-pooling, no-pooling, and partial-pooling regression in R

1 Answers1

Complete Pooling

No Pooling

Partial Pooling

A tibble: 4 x 2

Linked