I am trying to model which environmental variables increase the probability of my response variable occurring. My data cover 30 years of daily observations. I have narrowed my predictors down with PCA to 3 continuous variables plus month (temporal/categorical). My response is binary: presence (1) or absence (0). The problem is that the absences greatly outnumber the presences. I have roughly 10,000 zeros and only 200 ones.
I am coding in R. I have run a glm with the family set to binomial, i.e. a logistic regression. While the model terms come out significant, the overall model fit is poor. I believe this is due primarily to the imbalance in my binary response. I have tried weighting the 1s more heavily at various weight levels. Unfortunately, this did not improve the model's fit. I also tried a zero-inflated Poisson model, but those are meant for count data with many zeros and it did not work well with a binary variable. I have been comparing the deviances to judge model fit:

Null deviance: 2104.3 on 11687 degrees of freedom
Residual deviance: 2017.6 on 11683 degrees of freedom

I've been reading about embedded models, which may be what I try next; however, I am not familiar with them. I would appreciate any advice on how to deal with an imbalanced binary response. Thanks!
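For anyone who wants to reproduce the setup, here is a minimal sketch of what I described. The first part is just arithmetic on the two deviances reported above. The second part uses simulated data as a stand-in for mine: the column names (`pc1`, `pc2`, `pc3`, `month`, `presence`), the coefficients, and the weight of 10 are all made up for illustration and are not my real values.

```r
# --- Part 1: arithmetic on the deviances reported above ---
null_dev  <- 2104.3; null_df  <- 11687
resid_dev <- 2017.6; resid_df <- 11683

dev_drop <- null_dev - resid_dev     # reduction in deviance: 86.7
df_used  <- null_df - resid_df       # parameters added beyond the intercept: 4
# Likelihood-ratio test of the fitted model against the intercept-only model
lr_p <- pchisq(dev_drop, df = df_used, lower.tail = FALSE)
# McFadden-style pseudo R^2: share of the null deviance explained (about 0.041)
mcfadden <- 1 - resid_dev / null_dev

# --- Part 2: simulated stand-in for the data (all names/values hypothetical) ---
set.seed(42)
n <- 11688
dat <- data.frame(
  pc1   = rnorm(n),
  pc2   = rnorm(n),
  pc3   = rnorm(n),
  month = factor(sample(month.abb, n, replace = TRUE), levels = month.abb)
)
# Assumed linear predictor; the intercept is chosen so presences are rare
eta <- -4.3 + 0.6 * dat$pc1 - 0.4 * dat$pc2
dat$presence <- rbinom(n, 1, plogis(eta))

# Plain logistic regression
fit <- glm(presence ~ pc1 + pc2 + pc3 + month, family = binomial, data = dat)

# Same model with presences up-weighted (weight of 10 is arbitrary)
w <- ifelse(dat$presence == 1, 10, 1)
fit_w <- glm(presence ~ pc1 + pc2 + pc3 + month, family = binomial,
             data = dat, weights = w)

# Weighting changes the likelihood, so the deviances of fit and fit_w
# are not directly comparable with each other
c(unweighted = deviance(fit), weighted = deviance(fit_w))
```

On my reported numbers, the drop of 86.7 on 4 degrees of freedom is highly significant, yet it is only about 4% of the null deviance, which is the "significant but poor fit" pattern I am describing.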