
I am trying to model which environmental variables increase the probability of my response variable occurring. My data covers 30 years of daily observations. I have narrowed my predictor variables down with PCA to 3 continuous variables, plus month (temporal/categorical). My response is a binary presence (1) or absence (0). The problem is that the absences greatly outnumber the presences: I have roughly 10,000 zeros and only 200 ones.

I am coding in R. I have run a glm with the family set to binomial to run a logistic regression. While my model shows good significance and other measures, the overall model fit is poor. I believe this is due primarily to the unbalanced ratio in the binary response. I have tried weighting the 1s higher at various weight levels; unfortunately, this did not improve my model's fit. I did try a zero-inflated Poisson model, but these are better for count data with many zeros and did not work well with a binary variable. I have been comparing the deviances to determine model fit:

Null deviance: 2104.3 on 11687 degrees of freedom
Residual deviance: 2017.6 on 11683 degrees of freedom

I've been reading about embedded models, which may be what I try next, however I am not familiar with these. I would appreciate any advice on how to deal with an imbalanced binary response. Thanks!
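For reference, a minimal sketch of my setup (the data frame and column names here are placeholders, not my real ones):

```r
# Hypothetical data frame `dat` with columns: presence (0/1), pc1, pc2, pc3, month
fit <- glm(presence ~ pc1 + pc2 + pc3 + month,
           family = binomial, data = dat)
summary(fit)   # reports the null and residual deviances quoted above

# One of the weighting schemes I tried: up-weight the rare presences
w <- ifelse(dat$presence == 1, 50, 1)   # roughly 10,000 zeros / 200 ones
fit_w <- glm(presence ~ pc1 + pc2 + pc3 + month,
             family = binomial, data = dat, weights = w)
```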

  • This will probably get moved or locked because it's a stats question, not an R question. But class imbalance doesn't affect glm. Class imbalance is thought to affect tree regressions, and even that is debated. I don't think this is your problem. I think the problem is either the predictor variables don't explain as much variance in the response as you would like, the PCA is destroying useful information, or the linear form of glm is insufficient to describe more complex relationships. I would try glmnet with all the predictors and xgboost, both fit with cross-validation. – Arthur Oct 12 '23 at 12:22
  • On class imbalance, see https://stats.stackexchange.com/q/247871/232706 and linked questions. // How are you assessing that the "model fit is poor"? – Ben Reiniger Oct 12 '23 at 13:57
  • Do you mean that measures like accuracy, sensitivity, specificity, precision, recall, and $F_1$ score are poor? – Dave Oct 12 '23 at 14:46
  • How do you know your deviance to be poor? – Dave Oct 12 '23 at 16:49
  • The analysis of deviance provided has nothing to do with goodness of fit. Goodness of fit comes from comparing deviance of your model to that of a model that is richer than your model but still not having an enormous number of parameters. And yes accuracy, sens, spec, prec, recall, $F_1$ are poor accuracy score choices. Pseudo $R^2$ and $c$-index (concordance probability; AUROC) would be better but they are still not related to lack of fit. – Frank Harrell Oct 15 '23 at 12:42
  • Can you please add a description of the dependent and independent variables to your original question? PCA doesn't sound like a good idea (it is only useful if the variables are correlated, like measurements over a day, not distinct measurements, e.g. height/weight/diet). Is your original dependent variable binary, or have you decided to binarise it (and lose information)? – seanv507 Oct 18 '23 at 19:43
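
Following the comments, a rough sketch of two of the suggestions (a richer-model deviance comparison for goodness of fit, and cross-validated glmnet), reusing the hypothetical `dat` columns from the question:

```r
library(splines)   # ns()
library(glmnet)

# Baseline model from the question (hypothetical data frame `dat`)
fit_base <- glm(presence ~ pc1 + pc2 + pc3 + factor(month),
                family = binomial, data = dat)

# Goodness-of-fit check in the spirit of Frank Harrell's comment:
# compare against a richer, but still modest, model (splines plus one interaction)
fit_rich <- glm(presence ~ ns(pc1, 4) + ns(pc2, 4) + ns(pc3, 4) +
                  factor(month) + ns(pc1, 4):factor(month),
                family = binomial, data = dat)
anova(fit_base, fit_rich, test = "Chisq")   # a large deviance drop suggests lack of fit

# Cross-validated penalized logistic regression along the lines of Arthur's comment
x <- model.matrix(presence ~ pc1 + pc2 + pc3 + factor(month), dat)[, -1]
cv_fit <- cv.glmnet(x, dat$presence, family = "binomial", type.measure = "deviance")
coef(cv_fit, s = "lambda.min")
```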

1 Answer


You have over ten thousand observations. This means that hypothesis tests will be quite sensitive to even small differences. The hypothesis test sees an improvement in model performance (measured by deviance) over some baseline measure (always predicting the prior probability) and has the sample size to give a small p-value. That's all the statistical significance is telling you.

Then you look at the effect size, which is how much your model's performance improved over that baseline model. It might be that this improvement is not very much, which would mean that your model misses something about the relationship between your features and the binary outcome.
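
For a rough sense of that effect size, McFadden's pseudo-$R^2$ can be computed directly from the deviances quoted in the question:

```r
# McFadden's pseudo-R^2 = 1 - (residual deviance / null deviance)
1 - 2017.6 / 2104.3   # approximately 0.04
```

A value that close to zero suggests the model explains only slightly more than the intercept-only (prior-probability) model.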

The main point of this is that it is unsurprising for a large sample size like ten thousand to lead to a small p-value despite a small effect size, just as is the case for other hypothesis tests.

I don't see the imbalance as much of an issue here except that you have a fairly limited number of minority-class points. My suspicion is that your features (or how you use them) do not allow you to tell that an observation will belong to the minority class with a high probability. If most of your predicted probabilities are quite low and near the prior probability, this would be evidence in support of my suspicion.
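
One way to check that suspicion, assuming the fitted glm object from the question is called `fit` and the data frame is `dat`:

```r
p_hat <- fitted(fit)           # in-sample predicted probabilities
prior <- mean(dat$presence)    # roughly 200 / 10200

summary(p_hat)
hist(p_hat, breaks = 50)
# If nearly all of p_hat piles up near `prior`, the predictors are doing little
# to separate presences from absences.
```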

EDIT

Even if you consider the "effective sample size" that is discussed in the comments, you wind up with an effective sample size of $3\times 10200 \times\frac{200}{10200} \times \frac{10000}{10200} \approx 588$, so still quite a large (effective) sample size that should be able to detect fairly small differences. (Even the sample size of $200$ members of the minority class is fairly large.)

Dave
  • What matters more than the sample size is the effective sample size, which for binary Y is np(1-p) where p=proportion of Y=1. Also, for rare outcomes (rare Y=0 or Y=1) only tendencies (probabilities) are of interest so classification and classification accuracy measures are not relevant. – Frank Harrell Oct 23 '23 at 17:56
  • @FrankHarrell That calculation for the effective sample size does not make sense to me. If there are $1000$ from each category, the effective sample size is $500$. Why would the effective sample size be a quarter of the true sample size? I could believe the effective sample size to be limited by the number of observations of the minority category, but having $500$ in this situation is strange. – Dave Oct 23 '23 at 18:12
  • Sorry the formula should be 3np(1-p). Your logic is right for estimating an overall proportion. But when needing to predict Y from X when Y is binary, the effective sample size needs to take into account that if the number of Y=0 or number of Y=1 is small, there is less basis for prediction. In the limit if there are no events, you have no information so effective sample size=0. See here. – Frank Harrell Oct 23 '23 at 19:14
  • Thank you all for your thoughts. I had a colleague mention trying a random forest model instead. I am unfamiliar with these. Would this be a better approach? – Greatwhite4 Oct 25 '23 at 16:09
  • @Greatwhite4 A random forest can fit more complex relationships than your current PCA approach can, at the expense of a higher risk of overfitting to the data and giving results that will not generalize well. – Dave Oct 25 '23 at 18:25
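
A minimal sketch of that random-forest suggestion, reusing the hypothetical `dat` columns from above (the randomForest package is an assumption, not something specified in the thread); any such model should still be evaluated with proper cross-validation rather than in-sample fit:

```r
library(randomForest)

rf <- randomForest(factor(presence) ~ pc1 + pc2 + pc3 + factor(month),
                   data = dat, ntree = 500)
rf                                 # out-of-bag error and confusion matrix
head(predict(rf, type = "prob"))   # out-of-bag predicted class probabilities
```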