
I have a dataset where I randomly sampled housing developments, and then within these I systematically sampled every habitat patch. Each observation is a patch_ID, with variables recorded for the patch such as project_ID (the development it came from), Area_ha (the area of the patch, in hectares), and Pre_post (whether the patch is from a pre-development site plan or a post-development one). I would like to fit a GLMM to investigate whether there is a difference between the size (Area_ha) of habitat patches before and after development (i.e. "pre" or "post" in the Pre_post variable), whilst accounting for my grouping factor (project_ID). I am using a GLMM because my Area_ha data are heavily right skewed and have no zero values. Area_ha is continuous, Pre_post is a factor with two levels ("pre", "post"), and project_ID is a factor with 25 levels (one for each of the 25 developments).

I successfully conducted this GLMM on a smaller dataset last month, using the code:

library(lme4)

glmm_gamma <- glmer(Area_ha ~ Pre_post + (1 | project_ID),
                    family = Gamma(link = "log"),
                    data = size3)

However, now I have increased the size of my dataset, and the same line of code as above to run the model is giving me the error:

Warning: Model failed to converge with max|grad| = 0.0209587 (tol = 0.002, component 1)

The data is still right skewed and I have removed the most troublesome outlier.
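
One check suggested in lme4's ?convergence help page is to look at the scaled gradient directly and to restart the fit from its previous estimates; a minimal sketch, assuming the fitted glmm_gamma object above:

# Scaled gradient check from ?lme4::convergence; values much larger than the
# 0.002 tolerance suggest a real problem, values close to it are often benign.
derivs <- glmm_gamma@optinfo$derivs
max(abs(with(derivs, solve(Hessian, gradient))))

# Restart the fit from its previous estimates with a generous iteration limit.
start_vals <- getME(glmm_gamma, c("theta", "fixef"))
glmm_restart <- update(glmm_gamma, start = start_vals,
                       control = glmerControl(optCtrl = list(maxfun = 2e4)))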

I have tried the following options to troubleshoot but to no avail:

#To rescale and centre my stuff:
mu <- mean(size3$Area_ha, na.rm = TRUE)
sigma <- sd(size3$Area_ha, na.rm = TRUE)
size3$Area_ha_scaled <- (size3$Area_ha - mu) / sigma
#But I then realised I can't do this because gamma distributions don't allow for negative values
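
A rescaling that does keep the values positive is dividing by a constant rather than centring; with a log link this only shifts the intercept by the log of that constant, so the Pre_post effect is unchanged. A minimal sketch, assuming the same size3 data frame (variable names are illustrative):

# Divide by a positive constant instead of centring: Area_ha stays positive,
# and with a log link only the intercept changes, not the Pre_post coefficient.
size3$Area_ha_rescaled <- size3$Area_ha / mean(size3$Area_ha, na.rm = TRUE)

glmm_gamma_rescaled <- glmer(Area_ha_rescaled ~ Pre_post + (1 | project_ID),
                             family = Gamma(link = "log"),
                             data = size3)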

Trying to check singularity:

tt <- getME(glmm_gamma,"theta")
ll <- getME(glmm_gamma,"lower")
min(tt[ll==0])
#0.8909557, so no worry of singularity
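
lme4 also provides a built-in helper for this check, which should agree with the manual version above; a minimal sketch, assuming the fitted glmm_gamma object:

# TRUE would indicate a singular fit (random-effect variance at or near zero)
isSingular(glmm_gamma, tol = 1e-4)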

Trying with more iterations

glmm_gamma <- glmer(Area_ha ~ Pre_post + (1 | project_ID), 
                    family = Gamma(link = "log"), 
                    data = size3,
                    control = glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 1000000)))
#Still didn't work
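
Another option at this point is lme4's allFit(), which refits the same model with every available optimizer so the results can be compared; if they broadly agree, the warning is usually not substantive. A minimal sketch, assuming the fitted glmm_gamma object above:

# Refit with every available optimizer; broadly agreeing log-likelihoods and
# fixed effects suggest the convergence warning is not substantive.
all_fits <- allFit(glmm_gamma)
summary(all_fits)$llik
summary(all_fits)$fixef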

I'm completely stumped as to what to do next and would really appreciate some help! I'm confused as to why adding more data has made my model stop converging - I'm not very confident with statistics, so hopefully this community can help!

I have put my data and some of my code here: https://github.com/sirianmckellen/stackex.git If anyone has any ideas on how I can move forward - whether this is a solvable issue or whether I need a new model type (and if so, what I should consider) - I'd be SO grateful! Thank you all!

  • For Data sharing see here: https://stats.meta.stackexchange.com/q/5258/341520 – Lukas Lohse Mar 07 '24 at 13:26
  • Thank you! I have made a GitHub folder (I think!) with my data in and my example code in. Do let me know if this works :) – sirianmckellan Mar 07 '24 at 14:28
  • Fitting GLMMs is not plug-and-play statistics. It is not true that "adding more data" will in general improve fitting algorithms. In this answer, I cover several diagnostic plots in the context of mixed modeling to assess the quality and reliability of fits https://stats.stackexchange.com/questions/35071/what-is-rank-deficiency-and-how-to-deal-with-it/151116#151116 – AdamO Mar 07 '24 at 15:48

1 Answer


Outlier deletion is questionable at the best of times, as it is basically throwing away information, but in your case you detect 10% of your data as outliers, which is just unacceptable. Also, if you already want to use a statistical model for skewed data, don't classify outliers with a symmetric criterion. On a log scale your data are fine.
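
To illustrate the point about the criterion, the same 1.5 * IQR rule can be applied on the log scale, where the data are roughly symmetric; a minimal sketch, assuming the size2 data frame loaded in the code below:

# The 1.5*IQR rule applied on log(Area_ha), i.e. on the scale where the data
# are roughly symmetric; this counts how many points it would flag.
log_area <- log(size2$Area_ha)
q <- quantile(log_area, probs = c(.25, .75), na.rm = TRUE)
iqr <- IQR(log_area, na.rm = TRUE)
sum(log_area < q[1] - 1.5 * iqr | log_area > q[2] + 1.5 * iqr, na.rm = TRUE)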

Now, since the GLMM does converge if we don't remove the "outliers", I checked with DHARMa, comparing the data against data simulated from the model, and we can see that the gamma model is a bad fit. To be precise, it is insufficiently right skewed, which explains why the model no longer converges at all once the outliers are removed. I guess the folk theorem of statistical computing holds: https://statmodeling.stat.columbia.edu/2008/05/13/the_folk_theore/

[DHARMa residual diagnostic plot for the gamma model]

A log-normal model, however, does great. Note that a sigma of 1 on the log scale is a lot of variation.
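
To put that in perspective, a quick back-of-the-envelope calculation: a residual sigma of 1 on the log scale means a typical observation differs from its expected value by roughly a factor of exp(1), i.e. about 2.7, in either direction.

exp(c(-1, 1))  # ~0.37 and ~2.72: one residual SD corresponds to roughly /2.7 or x2.7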

library(lme4)
library(DHARMa)

size2 <- read.csv("https://raw.githubusercontent.com/sirianmckellen/stackex/main/Size2.csv")
size2 <- size2[!is.na(size2$project_ID), ]

size_quartiles <- quantile(size2$Area_ha, probs = c(.25, .75), na.rm = TRUE)
size_quartiles

size_IQR <- IQR(size2$Area_ha, na.rm = TRUE)

size_Lower <- size_quartiles[1] - 1.5 * size_IQR
size_Upper <- size_quartiles[2] + 1.5 * size_IQR

size3 <- subset(size2, size2$Area_ha > size_Lower & size2$Area_ha < size_Upper)

table(size2$project_ID) - table(size3$project_ID)

glmm_gamma <- glmer(Area_ha ~ Pre_post + (1 | project_ID),
                    family = Gamma(link = "log"),
                    data = size2)
summary(glmm_gamma)
#glmm_gamma@optinfo
#sessionInfo()

m2 <- lmer(log(Area_ha) ~ Pre_post + (1 | project_ID), data = size2)
summary(m2)
confint(m2)

res_gamma <- simulateResiduals(glmm_gamma)
plot(res_gamma, title = "Gamma")

res_lognormal <- simulateResiduals(m2)
plot(res_lognormal, title = "log-normal")

A plot for your consideration:

library(tidyverse)

ggplot(size2, aes(y = Area_ha, x = as_factor(project_ID), color = Pre_post)) +
  geom_boxplot(varwidth = TRUE) +
  scale_y_log10() +
  theme_bw()

Lukas Lohse
  • Thank you so much! I really appreciate your help. I'm very new to statistics and only fit a GLMM as I'd read it could be a good fit, so this is hugely helpful. Could you possibly point me towards some resources on how to interpret linear mixed model outputs when values are logged? I assume some kind of back transformation might be needed for my p values, effect sizes, etc? Thank you!! – sirianmckellan Mar 11 '24 at 11:30
  • @sirianmckellan p-values don't need back-transformation, but lmer doesn't provide them for technical reasons (https://stats.stackexchange.com/q/22988/341520). Also, I think confidence intervals are superior. – Lukas Lohse Mar 11 '24 at 13:12
  • @sirianmckellan Other than that, you need to understand that everything is multiplicative, i.e. pre has a linear effect of 0.34, 95% CI (-0.02 - 0.68), meaning a ratio of 1.38 (0.97 - 1.98), or +38% (-3% - +98%). The sigmas can be reinterpreted from $\pm$ to $\cdot / \div$ via the exponent. – Lukas Lohse Mar 11 '24 at 13:18
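
A minimal sketch of that back-transformation in code, assuming the log-normal model m2 fitted in the answer:

# Fixed effects and their confidence intervals are on the log scale;
# exponentiating converts them into multiplicative ratios.
exp(fixef(m2))
exp(confint(m2, parm = "beta_", method = "Wald"))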