GEEs in case of a small number of clusters with strongly heterogeneous cluster size

Question

My data set includes 400 records. Each record comprises values for the binary outcome variable $y$ and 12 categorical predictor variables $x_1, \ldots, x_{12}$, most of which are also binary. The records originate from 10 different studies, one of which is much larger than the others (it contributes most of the records, while each of the remaining studies contributes 4 to 30 records).

My plan is to fit a logistic regression model. Since I cannot rule out that records from the same study are correlated, I am considering GEEs instead of ordinary logistic regression, with the records from the same study forming the clusters. As the working correlation structure I would use 'exchangeable'. However, I am still unsure whether GEEs are a suitable approach in my case, which is distinguished by

the low number of clusters (viz., ten)
the large differences in cluster size.

Is there any advice or experience on this? I found in the literature that 40, 50 or even more clusters would be needed (opinions seem to diverge) in order to get reliable 'sandwich' standard errors for the model parameter estimates. For smaller numbers of clusters, some correction would be needed. However, I found little on how exactly to apply this correction, and little on the second issue, namely whether differences in cluster size affect the results of GEEs. I would highly appreciate any advice.

score 3 · Answer 1 · answered Dec 29 '15 at 04:17

There are clustered bootstrap approaches, including percentile-t and wild. See Cameron & Miller, 2015: A Practitioner's Guide to Cluster-Robust Inference.
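To make the percentile-t idea concrete, here is a minimal sketch in plain Python. It is not taken from Cameron & Miller: the function names (`proportion_with_cluster_se`, `percentile_t_ci`) and the simulated data are hypothetical, and for brevity the estimator is just the pooled proportion of $y = 1$ (an intercept-only model) with a cluster-robust standard error. For the logistic regression you would swap in a function that returns a coefficient and its cluster-robust standard error; the resampling loop stays the same.

```python
import math
import random


def proportion_with_cluster_se(clusters):
    """Pooled proportion of y = 1 and its cluster-robust (sandwich) SE.

    `clusters` is a list of lists of 0/1 outcomes, one inner list per cluster.
    """
    G = len(clusters)
    N = sum(len(c) for c in clusters)
    p = sum(sum(c) for c in clusters) / N
    # "Meat" of the sandwich: squared sums of residuals within each cluster.
    meat = sum((sum(c) - len(c) * p) ** 2 for c in clusters)
    var = (G / (G - 1)) * meat / N ** 2  # G/(G-1) finite-cluster factor
    return p, math.sqrt(var)


def percentile_t_ci(clusters, estimator, n_boot=999, alpha=0.05, seed=0):
    """Percentile-t confidence interval from a pairs cluster bootstrap."""
    rng = random.Random(seed)
    theta, se = estimator(clusters)
    G = len(clusters)
    t_stats = []
    for _ in range(n_boot):
        # Resample whole clusters with replacement, never single records.
        sample = [clusters[rng.randrange(G)] for _ in range(G)]
        theta_b, se_b = estimator(sample)
        if se_b > 1e-12:  # skip degenerate resamples with zero variance
            t_stats.append((theta_b - theta) / se_b)
    t_stats.sort()
    last = len(t_stats) - 1
    q_lo = t_stats[math.floor((alpha / 2) * last)]
    q_hi = t_stats[math.ceil((1 - alpha / 2) * last)]
    # The bootstrap t quantiles replace the usual normal or t critical values.
    return theta, se, (theta - q_hi * se, theta - q_lo * se)


if __name__ == "__main__":
    # Made-up data: one large cluster and nine small ones (400 records).
    rng = random.Random(1)
    sizes = [300, 4, 5, 6, 8, 10, 12, 14, 16, 25]
    clusters = []
    for n in sizes:
        p_g = min(max(rng.gauss(0.4, 0.15), 0.05), 0.95)
        clusters.append([1 if rng.random() < p_g else 0 for _ in range(n)])
    theta, se, (lo, hi) = percentile_t_ci(clusters, proportion_with_cluster_se)
    print(f"estimate={theta:.3f}  se={se:.3f}  95% CI=({lo:.3f}, {hi:.3f})")
```

Note that with one dominant cluster, resamples that happen to omit it (or include it several times) can look very different from the original sample, so it is worth inspecting the spread of the bootstrap t-statistics rather than only the final interval.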

One thing I don't see mentioned in the Cameron & Miller article, but which I see no reason not to apply to clustered data, is modifying the cluster-bootstrapped percentile-t approach with a variance-stabilizing transformation, as discussed in Tibshirani, 1988: Variance Stabilization and the Bootstrap.

Another option derives cluster-robust variance estimates on the basis of estimating separate regressions for each cluster. See Ibragimov & Müller, 2010: t-Statistic Based Correlation and Heterogeneity Robust Inference, although I've seen a slide deck where Cameron cautions that this approach is no good if there is any cross-cluster correlation (such as time fixed effects in panel data scenarios).
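A rough sketch of the per-cluster idea, again in plain Python and under loud assumptions: the data are simulated, there is a single binary predictor, the per-cluster estimate is a log odds ratio with 0.5 added to every cell (a convenience of this sketch, not part of Ibragimov & Müller), and the helper names `log_odds_ratio` and `cluster_t_statistic` are hypothetical. The per-cluster estimates are then treated as one sample and tested with an ordinary t-statistic on (number of clusters − 1) degrees of freedom. Keep in mind that a cluster with only 4 records cannot support a separate regression on 12 predictors, so in a setting like the question's this would probably require a much reduced model.

```python
import math
import random
import statistics


def log_odds_ratio(records):
    """Log odds ratio of y on a binary x from (x, y) pairs.

    0.5 is added to every cell (an assumption of this sketch) so that
    empty cells in small clusters do not produce infinite estimates.
    """
    cells = {(1, 1): 0.5, (1, 0): 0.5, (0, 1): 0.5, (0, 0): 0.5}
    for x, y in records:
        cells[(x, y)] += 1
    return math.log(cells[(1, 1)] * cells[(0, 0)] / (cells[(1, 0)] * cells[(0, 1)]))


def cluster_t_statistic(estimates, null_value=0.0):
    """t-statistic and degrees of freedom from per-cluster estimates."""
    q = len(estimates)
    mean = statistics.mean(estimates)
    se = statistics.stdev(estimates) / math.sqrt(q)
    return (mean - null_value) / se, q - 1


if __name__ == "__main__":
    # Made-up data: 10 clusters of unequal size, true log odds ratio of 1.
    rng = random.Random(2)
    sizes = [300, 12, 14, 16, 18, 20, 22, 25, 28, 30]
    estimates = []
    for n in sizes:
        records = []
        for _ in range(n):
            x = rng.randrange(2)
            prob = 1 / (1 + math.exp(-(-0.5 + 1.0 * x)))
            records.append((x, 1 if rng.random() < prob else 0))
        estimates.append(log_odds_ratio(records))
    t, df = cluster_t_statistic(estimates)
    # 2.262 is the two-sided 5% critical value of Student's t with 9 df.
    print(f"t = {t:.2f} on {df} df; reject at 5%: {abs(t) > 2.262}")
```

Each cluster gets equal weight in the final t-statistic regardless of its size, which is the point of the method but also means the small, noisy clusters drive the variance.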
