Overview

I got the following warning when I ran a binomial logistic regression with the glm function in R:

Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred 

Therefore, I am wondering if there is something inherent in the structure of my data that is causing this issue.

I have observation data on two species, Species 1 and Species 2 (response variable), and ecological data for these two species, including foraging height, attack maneuver, foliage density, and foraging substrate (predictor variables). My goal is to infer which predictor variables contribute to the ecological differences between those species.

Input data 'niche_data_recoded.csv':

Species Attack_maneuver Foraging_height Foraging_substrate Foliage_density
Species_1 1 3 1 3
Species_1 2 1 2 1
Species_1 1 3 1 3
Species_1 1 3 1 3
Species_1 3 5 3 3
Species_1 1 4 1 3
Species_1 3 4 3 2
Species_1 2 1 2 1
Species_1 2 1 2 1
Species_1 1 4 1 1
Species_2 1 1 1 1
Species_2 2 1 2 1
Species_2 2 1 2 1
Species_2 1 1 1 1
Species_2 2 1 2 1
Species_2 1 3 1 1
Species_2 2 1 2 1
Species_2 2 1 2 1
Species_2 2 1 2 1
Species_1 2 1 2 1
Species_1 2 1 2 1
Species_1 2 1 2 1
Species_1 2 1 2 1
Species_1 1 5 1 2
Species_1 1 2 1 2
Species_1 1 3 1 3
Species_1 2 1 2 4
Species_1 1 1 1 1
Species_2 2 1 2 1
Species_2 2 1 2 1
Species_2 2 1 2 1
Species_1 1 2 1 1
Species_1 2 1 2 1
Species_1 2 1 2 1
Species_1 2 1 2 1
Species_1 2 1 2 1
Species_1 2 1 2 1
Species_1 2 1 2 1
Species_1 1 1 1 1
Species_1 1 1 1 1
Species_1 2 1 2 1
Species_1 1 5 1 1
Species_1 1 5 1 2
Species_1 3 5 3 3
Species_1 1 5 1 1
Species_1 1 4 1 1
Species_1 1 5 1 1
Species_1 1 4 1 1
Species_1 1 3 1 1
Species_1 2 1 2 2
Species_1 2 1 2 1
Species_1 2 1 2 1
Species_1 2 1 2 1
Species_1 1 1 1 3
Species_1 1 1 1 1
Species_1 1 1 1 2
Species_1 2 1 2 1
Species_1 1 1 1 1
Species_2 2 1 2 1
Species_2 1 1 1 1
Species_2 2 1 2 1
Species_2 2 1 2 1
Species_2 1 1 1 1
Species_2 2 1 2 1
Species_2 1 1 1 1
Species_1 2 1 2 1
Species_1 1 3 1 2
Species_1 1 1 1 1
Species_1 1 3 1 1
Species_1 1 5 1 1
Species_1 2 1 3 1
Species_1 2 1 3 1
Species_1 2 1 3 1
Species_1 1 1 1 1
Species_1 2 1 2 1
Species_1 1 2 1 2
Species_1 2 1 2 2
Species_1 1 3 1 2
Species_1 1 1 1 2
Species_1 1 1 1 2
Species_1 3 5 3 4
Species_1 2 1 2 3
Species_2 1 1 1 1
Species_2 2 1 2 1
Species_2 1 1 1 1
Species_2 2 1 2 1
Species_2 2 1 2 1
Species_2 1 3 1 1
Species_2 2 1 2 1
Species_2 2 1 2 1
Species_2 2 1 2 1
Species_2 2 1 2 1
Species_2 2 1 2 1
Species_2 2 1 2 1
Species_2 1 2 1 1
Species_2 1 3 1 1
Species_2 2 1 2 1
Species_2 1 2 1 1
Species_2 2 1 2 1
Species_2 1 1 1 1
Species_2 2 1 2 1
Species_2 2 1 2 1
Species_2 1 1 1 1
Species_2 2 1 2 1
Species_2 2 1 2 1
Species_2 2 1 2 1
Species_2 2 1 2 2
Species_2 2 1 2 1
Species_2 2 1 2 1
Species_2 2 1 2 1
Species_2 2 1 2 1
Species_2 2 1 2 1
Species_2 2 1 2 1
Species_2 2 1 2 1

Code:

niche.dat <- read.csv('niche_data_recoded.csv', header=TRUE, na.strings=c(""))
cols <- c("Species", "Attack_maneuver", "Foraging_height", "Foraging_substrate", "Foliage_density")
niche.dat[cols] <- lapply(niche.dat[cols], factor)
str(niche.dat)

model <- glm(Species ~ Attack_maneuver + Foraging_height + Foraging_substrate + Foliage_density,family=binomial,data=niche.dat)
summary(model)

Output:

> model <- glm(Species ~ Attack_maneuver + Foraging_height + Foraging_substrate + Foliage_density,family=binomial,data=niche.dat)
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred 
> summary(model)

Call:
glm(formula = Species ~ Attack_maneuver + Foraging_height + Foraging_substrate + Foliage_density, family = binomial, data = niche.dat)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.45706  -1.25836  -0.00008   0.92146   2.04286

Coefficients: (1 not defined because of singularities)
                      Estimate Std. Error z value Pr(>|z|)
(Intercept)            0.18833    0.49204   0.383   0.7019
Attack_maneuver2     -18.75440 3765.84720  -0.005   0.9960
Attack_maneuver3       4.80062 5754.42487   0.001   0.9993
Foraging_height2       0.21481    1.17207   0.183   0.8546
Foraging_height3       0.07142    0.98094   0.073   0.9420
Foraging_height4     -19.31781 4991.11811  -0.004   0.9969
Foraging_height5     -17.73391 1922.24348  -0.009   0.9926
Foraging_substrate2   19.20303 3765.84717   0.005   0.9959
Foraging_substrate3         NA         NA      NA       NA
Foliage_density2      -2.59109    1.10344  -2.348   0.0189 *
Foliage_density3     -18.30482 1988.67327  -0.009   0.9927
Foliage_density4     -18.37471 4310.78632  -0.004   0.9966
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 156.77  on 113  degrees of freedom
Residual deviance: 111.63  on 103  degrees of freedom
AIC: 133.63

Number of Fisher Scoring iterations: 17

I'm also not sure what the NAs mean in this situation.

  • You have very large standard errors, which could mean you have problems with quasi-complete or complete separation. – Demetri Pananos Aug 30 '22 at 01:12
  • Note also that many coefficients are nearly identical but have opposite signs, which may indicate problems with collinearity between variables. I can't check your data right now, perhaps tomorrow, but these could be two probable sources. – Demetri Pananos Aug 30 '22 at 01:13
  • For Species_2, Foliage_density has the value of 1 for 50 out of 51 cases. This is one possible cause of the warning. – Dave2e Aug 30 '22 at 01:35
  • @Dave2e There is no warning after removing that predictor variable. But is removing that variable the best approach or is there a solution for keeping it in the model? Alternatively, is the warning safe to ignore? I'm also still not sure why there are NAs for Foraging_substrate3. – user12167116 Aug 30 '22 at 04:16
  • There's some really good discussion of (quasi)complete separation at the following linked question: https://stats.stackexchange.com/questions/11109/how-to-deal-with-perfect-separation-in-logistic-regression – James Stanley Aug 30 '22 at 04:51

1 Answer

Foraging substrate category 3 contains only Species_1 observations, so it cannot be modelled well with the logit link function. The fitting procedure tries to push the estimated probability for that category to one, $\hat{p} = 1/(1+e^{-X\hat\beta}) = 1$, which is only reached in the limit of an infinite linear predictor, so the associated coefficient diverges.
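A minimal sketch of this effect, using made-up toy data rather than the posted file (the data frame `toy`, its columns `y` and `x`, and the level names are all assumptions for illustration). Level "c" only ever occurs with `y = 1`, which shows up as a zero cell in the cross-tabulation and as a very large estimate with an even larger standard error in the fit:

```r
# Hypothetical toy data (not the poster's file):
# level "c" of x only ever occurs together with y = 1.
toy <- data.frame(
  y = c(0, 1, 0, 1, 0, 1, 1, 1, 1, 1),
  x = factor(c("a", "a", "a", "b", "b", "b", "b", "c", "c", "c"))
)

# A zero cell in the predictor-by-response table flags a level
# whose outcome is perfectly determined.
tab <- table(toy$x, toy$y)
print(tab)

# glm() still returns a fit; warnings are suppressed here only to keep
# the output short. The coefficient for level "c" drifts towards
# infinity until the iterations stop.
fit <- suppressWarnings(glm(y ~ x, family = binomial, data = toy))
print(coef(summary(fit)))
```

Running the same kind of `table()` call on each predictor against `Species` in the real data should show which levels are responsible; the large estimates with huge standard errors in the summary are likely the same symptom.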

See also

Why is logistic regression particularly prone to overfitting in high dimensions?

How to deal with perfect separation in logistic regression?
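On the NA reported for Foraging_substrate3: the summary says "1 not defined because of singularities", which is how R reports a coefficient whose dummy column is an exact linear combination of other columns in the design matrix. In the rows posted in the question, Attack_maneuver is 1 exactly when Foraging_substrate is 1, which would produce such a dependency. The sketch below reproduces that pattern on made-up toy data (the names `toy`, `attack` and `substrate` are assumptions, not the poster's objects):

```r
# Hypothetical toy data mimicking the pattern in the question:
# attack == 1 exactly when substrate == 1.
toy <- data.frame(
  y         = c(0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1),
  attack    = factor(c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3)),
  substrate = factor(c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3))
)

# The cross-tabulation shows the overlap between the two factors.
print(table(attack = toy$attack, substrate = toy$substrate))

fit <- glm(y ~ attack + substrate, family = binomial, data = toy)

# The design matrix has one more column than its rank, so one
# coefficient cannot be estimated and is reported as NA.
X <- model.matrix(fit)
print(c(columns = ncol(X), rank = qr(X)$rank))
print(coef(fit))
```

Here the dummy columns satisfy attack2 + attack3 = substrate2 + substrate3, so the last of them is dropped. This is a separate issue from separation: an NA of this kind means the two predictors carry overlapping information, not that an estimate diverged.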