How critical is the positive/negative example ratio for training an LR model as a probability estimator?

Question

Is it correct to assume that when we train logistic regression as a probability estimator, the ratio of training examples with label 1 to those with label 0 is absolutely critical and will be the determining factor, at least for the mean of the probabilities it produces as estimates?

What do you mean by "least for mean of probabilities produces an estimate"? — gunes, Jan 30 '20 at 17:59
I meant that even a 'poor' model would at least learn to produce estimates that approximate the base rate.
For example, in weather prediction, suppose my feature X1 is just cloud presence (0 or 1), and when it is cloudy it rains with a 30% chance. Then even a poor model would be able to produce probability estimates with a mean of 30% when X1 is 1.

My question: is it correct to assume that, to be able to produce such probability estimates, the fraction of Rain examples among the training examples with X1==1 has to be 0.3 too (assuming no weights are assigned via some sampling)? — viggen, Jan 30 '20 at 20:07
Do you have a source from which you got this understanding?

@AdamO I probably came to this conclusion on my own (I don't remember whether I read something similar or something that led to it). I am not certain about it and could construct opposing arguments, which is why I am asking this question. — viggen, Jan 30 '20 at 22:08

kjetil b halvorsen · Answer 1 · 2022-11-19T14:08:24.357

Your question is somewhat unclear, but as I read it: no, the ratio of positive to negative examples is not crucial. All else being equal, it might be better to have equal numbers of both, but what is really crucial is the size of the smaller of the two classes. For details of sample size requirements for logistic regression, see "Sample size for logistic regression?"

For further details and references, see "Unbalanced data with logistic regression: good references?"
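A small simulation may help make this concrete, using the cloud/rain example from the comments. It is only a sketch under assumptions that are not in the thread: the 5% rain rate on clear days, the sample size, and the helper names `simulate` and `fit_logistic` are made up for illustration, and the model is fitted by plain unpenalized maximum likelihood (Newton's method), with no class weights or resampling. The share of cloudy days is varied so that the overall positive/negative ratio changes while the 30% rain rate on cloudy days stays fixed.

```python
import math
import random


def simulate(n, p_cloudy, p_rain_cloudy=0.30, p_rain_clear=0.05, seed=0):
    """Draw n (x, y) pairs: x = 1 if cloudy, y = 1 if it rains.

    The 0.05 rain rate on clear days is an illustrative assumption.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = 1 if rng.random() < p_cloudy else 0
        p_rain = p_rain_cloudy if x == 1 else p_rain_clear
        y = 1 if rng.random() < p_rain else 0
        rows.append((x, y))
    return rows


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, iters=25):
    """Unpenalized ML fit of P(y=1 | x) = sigmoid(b0 + b1 * x) by Newton's method."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in rows:
            p = sigmoid(b0 + b1 * x)
            r = y - p            # residual, drives the gradient
            w = p * (1.0 - p)    # weight, drives the (negative) Hessian
            g0 += r
            g1 += r * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * delta = g by hand.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


results = []
for p_cloudy in (0.1, 0.5, 0.9):
    rows = simulate(20000, p_cloudy)
    b0, b1 = fit_logistic(rows)
    base_rate = sum(y for _, y in rows) / len(rows)
    mean_pred = sum(sigmoid(b0 + b1 * x) for x, _ in rows) / len(rows)
    cloudy = [y for x, y in rows if x == 1]
    freq_cloudy = sum(cloudy) / len(cloudy)
    pred_cloudy = sigmoid(b0 + b1)
    results.append((base_rate, mean_pred, freq_cloudy, pred_cloudy))
    print(f"P(cloudy)={p_cloudy:.1f}  base rate={base_rate:.3f}  "
          f"mean prediction={mean_pred:.3f}  "
          f"rain freq | cloudy={freq_cloudy:.3f}  "
          f"fitted P(rain | cloudy)={pred_cloudy:.3f}")
```

In this setup the mean fitted probability matches the training base rate (a consequence of fitting an intercept by maximum likelihood), while the fitted probability for cloudy days follows the rain frequency among the cloudy training rows rather than the overall positive/negative ratio. How noisy that estimate is depends on how many examples of the rarer outcome are available, which is the point about the smaller class made above.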
