
I am currently analyzing a large dataset using generalized linear mixed-effects models. The data have a three-level structure: 400,000 individuals nested within 500 neighborhoods, which are in turn nested within 30 regions.

I have attempted the analysis in several software packages, including R's lme4 package, SPSS, and Stata, but without success: the computations either take an exceptionally long time or eventually fail with errors such as "convergence failed".

Interestingly, when I fit a fixed-effects model instead of a mixed-effects model, the results look meaningful, so I am reluctant to abandon this line of analysis. Unfortunately, I don't currently have access to high-end computing resources.

Has anyone encountered similar challenges? I would greatly appreciate any software recommendations or alternative methods for conducting this type of complex analysis without requiring high computational power.

Thank you for your time and expertise.


    Family: nbinom2  ( log )
    Formula:          phq ~ 1 + (1 | level3/level2)
    Data: data_raw
      AIC       BIC    logLik  deviance  df.resid 
1517727.3 1517771.0 -758859.6 1517719.3    406760 

Random effects:

Conditional model:
 Groups        Name        Variance Std.Dev.
 level2:level3 (Intercept) 0.08246  0.28715 
 level3        (Intercept) 0.00961  0.09803 
Number of obs: 406764, groups:  level2:level3, 510; level3, 34

Dispersion parameter for nbinom2 family (): 0.621 

Conditional model:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.64588    0.02252   28.68   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

I've used the glmmTMB package, and this is the summary of the null model! @RobertLong
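
For reference, a call along these lines produces a null model of this form (the variable and data-frame names are taken from the output above; this is a sketch rather than the verbatim code that was run):

    library(glmmTMB)

    # Intercept-only (null) negative binomial model with neighborhoods (level2)
    # nested within regions (level3) as random intercepts
    mod_null <- glmmTMB(
      phq ~ 1 + (1 | level3/level2),
      family = nbinom2,   # negative binomial with log link
      data   = data_raw
    )
    summary(mod_null)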

  • What is the model formula you are using (in lme4), and what is your research question? When you say "exceptionally long time", can you be more specific? – Robert Long Sep 11 '23 at 09:31
  • @RobertLong Thank you for your comments. My research aims to explore whether depression scores are influenced by neighborhood-level variables. We assumed a negative binomial distribution for the depression scores due to the excess of zeros. The model includes around 30 independent variables.

    As for the computational challenges, the model didn't converge even after running for an extended period. Specifically, I left the computer running the calculations for over six hours, only to find that the process had failed. It has been quite a frustrating experience.

    – Wernicke Sep 11 '23 at 14:00
  • OK, so pasting the whole model formula isn't practical. Please tell us the random part of the formula. 30 independent variables could be over-adjustment. Have you carefully considered which variables are confounders that should be adjusted for and which are mediators that should not be? What happens when you omit all the fixed effects and just retain the random part? – Robert Long Sep 11 '23 at 18:27
  • @RobertLong Thanks. The random part is (1 | Region/Neighborhood). No random slope, only a random intercept. The 30 fixed effects were chosen via elastic net and forward selection (possibly so many because of the large dataset). The model with only random effects converged, which suggests the issue lies with the fixed effects. Any further feature-selection advice would be appreciated. – Wernicke Sep 12 '23 at 01:57
  • No worries. However, stepwise selection or any algorithmic method for feature selection is a very bad idea. Just Google that sentence for lots of references about why. I would suggest drawing a DAG. Check this answer I wrote, to see how to do this: https://stats.stackexchange.com/questions/445578/how-do-dags-help-to-reduce-bias-in-causal-inference/445606#445606 – Robert Long Sep 12 '23 at 06:37
  • I've summarised these comments into an answer, which I will update if needed. One more thing - please can you post the output from summary(mod) of the fitted model (the one with no fixed effects) – Robert Long Sep 12 '23 at 08:39
  • I just saw your latest comments and added the summary(mod) to my answer. Thanks for being so helpful. – Wernicke Sep 12 '23 at 15:57

1 Answer


As gleaned from the comments, the problem seems to be related to the fixed effects. Since you seem to be interested in causal inference rather than prediction, stepwise procedures are a very bad way to choose fixed effects / features. For example, see here:

Algorithms for automatic model selection

A DAG is a good way to avoid / reduce bias due to incorrect adjustments. In your case it will also hopefully help solve the convergence issue by reducing the set of variables to adjust for. See the following for some detail about how to do this:

How do DAGs help to reduce bias in causal inference?
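
As an illustration of the workflow (the variables below are hypothetical placeholders, not a claim about your actual causal structure), the dagitty package in R can encode a DAG and return the minimally sufficient adjustment set(s):

    library(dagitty)

    # Hypothetical DAG: neighborhood deprivation affects depression directly
    # and via health behaviour (a mediator); age and income are confounders
    g <- dagitty("dag {
      deprivation -> depression
      deprivation -> behaviour
      behaviour -> depression
      age -> deprivation
      age -> depression
      income -> deprivation
      income -> depression
    }")

    # Minimal set(s) to adjust for when estimating the total effect of
    # deprivation on depression (behaviour, being a mediator, is excluded)
    adjustmentSets(g, exposure = "deprivation", outcome = "depression")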

Edit: I just noticed that you added the model output to the question. I was looking to see whether any variance components were close to zero, as that sometimes indicates that the random structure is too complex. That doesn't seem to be the case here. So I would add a minimally sufficient set of variables as fixed effects.

Also since you mention an abundance of zeros in the response, you could look into a zero-inflated model.
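
Just as a sketch (reusing the null-model structure from your question, not something I have fitted to your data), a zero-inflated negative binomial can be specified in glmmTMB via the ziformula argument:

    library(glmmTMB)

    # Zero-inflated negative binomial with a constant zero-inflation probability;
    # covariates can be added to ziformula if the excess zeros need explaining
    mod_zinb <- glmmTMB(
      phq ~ 1 + (1 | level3/level2),
      ziformula = ~ 1,
      family    = nbinom2,
      data      = data_raw
    )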

Robert Long
  • Thank you for your insightful suggestions. I am in the process of refining my model by incorporating the principles of DAGs, and I find that it's proving to be quite beneficial. – Wernicke Sep 12 '23 at 14:24
  • I’m glad to hear it! And you are very welcome. – Robert Long Sep 12 '23 at 18:26