
I've constructed a logistic GLMM for predicting whether individuals of a single fish species will have an empty stomach or not (1 = empty, 0 = not empty). My predictor variables are salinity (sal), temperature (temp), standard length (SL), Station/location (which are not the same each year, but neither are they randomly chosen), and calendar year (CYR). I've modelled Station and CYR as random intercepts because (1) they're both factors with many levels, and (2) my hypothesis is that the log odds of a fish having an empty stomach (is that the correct way to say it?) depend on the prevailing conditions (presence of predators and prey, water quality, etc.), which all change with location and year. So, hopefully I've got the random intercepts correct.

Where I get confused is the logic for including random slopes and how it translates to my scenario. This resource has been very helpful, but I'm not sure I have it right, or how to "test" the idea or visualize a random slope in a logistic regression. My best guess is that the size of a fish (hence its age/experience) influences how good it is at seeking out and catching food, but that this effect also depends on location and time. As in, maybe smaller fish stay in certain locations during different years based on changing conditions (hence a random slope)?

Plots:

[Plot: boxplots of SL by zone. Many Stations make up a "zone", and smaller fish (the red boxplot with the smaller median, all with non-empty stomachs) seem to eat better at Whipray.]


Would an ANOVA comparing mod1 and mod2 show which is better (random slope + intercept vs. just random intercept)? Any help interpreting the summary output is also greatly appreciated! I'm not sure how to interpret the variances of the random effects, or how to decide whether my variables are too strongly correlated (or if it matters, since 2 of the 3 aren't significant). I also get a singular-fit warning for the more complicated model (mod2).

Only random intercepts:

mod1 <- glmer(empty ~ SL + sal + temp + (1 | CYR) + (1 | Station), family=binomial, data = c_neb, control=glmerControl(optimizer = "bobyqa"))

Random slopes and intercepts:

mod2 <- glmer(empty ~ SL + sal + temp + (SL | CYR) + (SL | Station), family=binomial, data = c_neb, control=glmerControl(optimizer = "bobyqa"))

Summaries:

  > summary(mod1)

Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
 Family: binomial  ( logit )
Formula: empty ~ SL + sal + temp + (1 | CYR) + (1 | Station)
   Data: c_neb
Control: glmerControl(optimizer = "bobyqa")

     AIC      BIC   logLik deviance df.resid 
   757.6    784.8   -372.8    745.6      682 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.1509 -0.5940 -0.4666  0.8429  3.3439 

Random effects:
 Groups  Name        Variance Std.Dev.
 Station (Intercept) 0.1986   0.4457  
 CYR     (Intercept) 0.1249   0.3534  
Number of obs: 688, groups:  Station, 90; CYR, 13

Fixed effects:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.6814783  1.1748120   0.580    0.562    
SL          -0.0300014  0.0064659  -4.640 3.49e-06 ***
sal          0.0001171  0.0246037   0.005    0.996    
temp        -0.0234828  0.0337120  -0.697    0.486    


Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
     (Intr) SL     sal   
SL   -0.474              
sal  -0.457 -0.011       
temp -0.627  0.322 -0.367

##################################################################################################

> summary(mod2)

Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
 Family: binomial  ( logit )
Formula: empty ~ SL + sal + temp + (SL | CYR) + (SL | Station)
   Data: c_neb
Control: glmerControl(optimizer = "bobyqa")

     AIC      BIC   logLik deviance df.resid 
   765.0    810.4   -372.5    745.0      678 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.0746 -0.5984 -0.4609  0.8671  3.1028 

Random effects:
 Groups  Name        Variance  Std.Dev. Corr 
 Station (Intercept) 3.057e-01 0.552944      
         SL          1.020e-04 0.010101 -0.58
 CYR     (Intercept) 3.257e-02 0.180476      
         SL          2.048e-05 0.004526 1.00 
Number of obs: 688, groups:  Station, 90; CYR, 13

Fixed effects:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.5848489  1.2086079   0.484    0.628    
SL          -0.0327655  0.0082548  -3.969 7.21e-05 ***
sal          0.0006626  0.0249615   0.027    0.979    
temp        -0.0176335  0.0351176  -0.502    0.616    


Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
     (Intr) SL     sal   
SL   -0.348              
sal  -0.463 -0.021       
temp -0.638  0.159 -0.340
optimizer (bobyqa) convergence code: 0 (OK)
boundary (singular) fit: see help('isSingular')

Nate
    Do you know the latitude/longitude location of the stations? I'm wondering whether a GAM with smooths for time and location might be another modeling option to consider. – dipetkov Dec 10 '22 at 19:25
  • Yes, I do have lat/lon! Great suggestion. I really don't understand how to build GAMMs yet though, haha. – Nate Dec 10 '22 at 20:07
  • Also, does anyone know if the r.squaredGLMM function from the MuMIn package can be used on logistic glmm or is it only for linear glmm? – Nate Dec 10 '22 at 20:11
  • Be very wary of pseudo-R-squared values for generalized models. See this page and its multiple links. As I understand the suggestion from @dipetkov you might be able to completely replace your random effects with smooths of location and time in a GAM. – EdM Dec 10 '22 at 20:20
  • @dipetkov, is a non-linear relationship between the log odds ratio and the independent variables another reason to try a GAMM? I was checking the assumption of linearity for a logistic glmm and it doesn't seem to hold. – Nate Dec 10 '22 at 23:56
  • I'm guessing that nearby stations are more "similar" to each other than farther apart stations. Similarly for year. See for example Why does including latitude and longitude in a GAM account for spatial autocorrelation?. – dipetkov Dec 11 '22 at 00:18

1 Answer


The results of your second model suggest that you don't have enough data to fit a model with two random slopes reliably this way.

When you fit random slopes in that second model, you ask for a random slope of SL within each of CYR and Station, with each of those slopes correlated with the corresponding intercept. The random slopes and intercepts are constrained to follow normal distributions, and you fit the corresponding normal variances and correlation coefficients. See this page, for example. That adds 4 more parameters to estimate than in the first model with only two random intercepts.

The warning boundary (singular) fit corresponds to the very small variances for the random slopes associated with SL: approximately 0.0001 or less for both. At least one of those variances couldn't be reliably distinguished from 0, even though the model converged numerically. Furthermore, the correlation between random slopes and intercepts for CYR is exactly 1, which is also a sign of a singular fit.
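You can see both symptoms directly from the fitted object (a quick check, assuming the mod2 object from your question):

```r
# TRUE here: some variance/correlation estimates sit on the
# boundary of the parameter space (lme4's default tolerance)
isSingular(mod2, tol = 1e-4)

# Inspect the estimated random-effects variance-covariance matrices;
# look for near-zero variances and correlations of exactly +/-1
VarCorr(mod2)
```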

The lme4 package reference manual page for isSingular discusses the problem and ways to proceed. Quoting in part:

While singular models are statistically well defined (it is theoretically sensible for the true maximum likelihood estimate to correspond to a singular fit), there are real concerns that (1) singular fits correspond to overfitted models that may have poor power... (3) standard inferential procedures such as Wald statistics and likelihood ratio tests may be inappropriate.

That last point argues against using a standard likelihood-ratio test to compare the two models.
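For reference, had the second fit not been singular, the comparison you ask about would just be the one-line nested-model test below; with a singular fit, however, its p-value shouldn't be trusted.

```r
# Likelihood-ratio test between the nested glmer fits; the
# chi-squared p-value is unreliable here because mod2 is singular
anova(mod1, mod2)
```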

If you think that both random slopes are important, based on your understanding of the subject matter, then you could take the advice on that manual page to use partially or fully Bayesian approaches. Otherwise, you could reduce the complexity of the model, perhaps by removing the random SL slope with respect to CYR, which seems to be the major source of difficulty.
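For example, a simplified model that keeps the SL slope across Stations but drops it for CYR would look like this (a sketch using your data; whether it fits cleanly can only be checked against c_neb itself):

```r
# Random SL slope by Station only; random intercept for calendar year
mod3 <- glmer(empty ~ SL + sal + temp + (SL | Station) + (1 | CYR),
              family = binomial, data = c_neb,
              control = glmerControl(optimizer = "bobyqa"))
isSingular(mod3)  # check whether the singularity warning persists
```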

EdM
  • Thank you for the quick response! I just have a few clarifying questions, if you don't mind. What part of the second model tells you this: "The results of your second model suggest that you don't have enough data"? For my own understanding. Also, is my reasoning for how to identify a potential need for a random slope correct (the effects of one variable differ across the levels of a grouping factor)? Finally, for the variance of each random effect, how do we tell if this is "a lot" of variance? – Nate Dec 09 '22 at 21:14
  • @Nate the isSingular help page says a singular fit means you have "estimated variance-covariance [vcov] matrices with less than full rank." Such a vcov matrix typically means either not enough data, or data so highly correlated that you can't make independent estimates of some coefficients with the data you have. Your idea of when random slopes make sense is correct; you just can't always fit them this way in practice. For quickly evaluating "a lot" of some random effect, I'd recommend comparing the standard deviation of the random effect against the corresponding fixed intercept or slope. – EdM Dec 09 '22 at 22:10
  • Thank you. Lastly, the non-significant intercept in the fixed effects table means I can't rule out that the intercept coef. might be zero? Are there any further interpretations or implications (anything to worry about/fix)? – Nate Dec 09 '22 at 22:57
  • Nevermind, think I got it: β0 is the log odds only when x1=x2=0 (i.e. what are the log odds of a fish having an empty stomach with standard length (SL)=0, which is meaningless in my case). https://stats.stackexchange.com/questions/92903/intercept-term-in-logistic-regression – Nate Dec 10 '22 at 00:03
  • By "the major source of difficulty" in the last paragraph, do you mean the fact that the SL variance in the (SL|CYR) term is estimated to be 2.048e-05? To me, the variance of 1.020e-04 for the random slope in the (SL|Station) term isn't much more convincing. When I think about it, I don't see how this random slope makes sense; it means that we expect that the effect of size varies in excess of the average differences between locations. – dipetkov Dec 10 '22 at 19:21
  • @dipetkov I'm referring to the singularity problem. (SL|Station) shows a reasonable correlation between slope and intercept, unlike the perfect correlation in (SL|CYR) presumably from the singular fit. Interpreting small variance values is tricky, as they depend on the scales of predictors and outcomes. The variance among slopes in (SL|Station) is 5 times that in (SL|CYR), at least suggesting less of a problem with (SL|Station). This OP claims a subject-matter basis for including these random slopes; I tend to defer to an OP on such matters unless it's very obviously wrong. – EdM Dec 10 '22 at 19:38
  • This is a logistic regression, so at least as far as the outcome is concerned, scale is irrelevant? I acknowledge that the OP is the domain expert but to me random slopes tend to make sense only when it's about a group-level variable (ie. a station-specific variable in this case). – dipetkov Dec 10 '22 at 19:40
  • @dipetkov in this case, yes. But the variance would be substantially different depending on whether SL had been expressed in millimeters or meters. For me, the main issue is the source of a less-than-full-rank vcov matrix. That (absent strictly numeric precision problems) should show up regardless of predictor scale, in this case evidently leading to a perfect slope-intercept correlation in (SL|CYR). I do like your suggestion for the OP to try modeling actual latitude/longitude and time values, perhaps a better choice for this problem than using random effects. – EdM Dec 10 '22 at 19:50
  • I see, I think. Thanks for the explanation. It's hard to reason about this without the data. (I can guess from the plot, though, that SL is most likely in centimeters.) – dipetkov Dec 10 '22 at 20:06
  • Yes, SL is measured in centimeters. We'd need a bigger boat for fish measured in meters :p – Nate Dec 10 '22 at 20:21