Is it correct to assume that when we train logistic regression as a probability estimator, the ratio of training examples with label 1 to those with label 0 is absolutely critical and will be the determining factor, at least for the mean of the probabilities it produces as estimates?
1 Answer
Your question is somewhat unclear, but as I read it: no, the ratio of positive to negative examples is not crucial. All else being equal, it might be better to have equal numbers of both, but what really matters is the number of examples in the smaller of the two classes. For details of sample size requirements for logistic regression, see Sample size for logistic regression?
For details and references, see Unbalanced data with logistic regression: good references?
kjetil b halvorsen
For example, in weather prediction, say my feature X1 is just cloud presence (0 or 1), and when it is cloudy it rains with a 30% chance. Then even a poor model should be able to produce probability estimates with a mean of 30% when X1 is 1.
My question: is it correct to assume that, to be able to produce such probability estimates, the fraction of rainy examples among the training examples with X1==1 has to be 0.3 too (assuming no weights are assigned via some sampling)?
– viggen Jan 30 '20 at 20:07
I probably came to this conclusion on my own (I don't remember whether I read something similar or something that led to it). I am not certain about it and could construct opposite arguments, which is why I am asking this question.
– viggen Jan 30 '20 at 22:08
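The cloudy/rain example from the comments can be checked numerically. Below is a minimal sketch, assuming an unregularized maximum-likelihood fit with an intercept and the single binary feature X1; the counts and the helper names (`fit_logistic`, `natural`, `rebalanced`) are made up for illustration and do not come from the thread. Under those assumptions the fitted probability for X1==1 should match the fraction of rainy examples among the training rows with X1==1, so changing that fraction in the training set changes the estimate.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(groups, n_iter=50):
    """Fit p(rain) = sigmoid(b0 + b1 * x1) by Newton's method.

    groups: list of (x1, n_rain, n_total) tuples, one per feature value.
    Assumes unregularized maximum likelihood with an intercept.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, n_rain, n_total in groups:
            p = sigmoid(b0 + b1 * x)
            resid = n_rain - n_total * p      # observed minus expected rain count
            w = n_total * p * (1.0 - p)       # curvature weight for this group
            g0 += resid
            g1 += resid * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Newton step: solve the 2x2 system H * delta = gradient.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical counts: (x1, rainy days, total days).
natural = [(0, 50, 1000), (1, 300, 1000)]    # 30% rain among cloudy days
rebalanced = [(0, 50, 1000), (1, 300, 600)]  # 50% rain among cloudy days

for name, data in [("natural", natural), ("rebalanced", rebalanced)]:
    b0, b1 = fit_logistic(data)
    p_cloudy = sigmoid(b0 + b1)
    n_all = sum(n for _, _, n in data)
    base_rate = sum(r for _, r, _ in data) / n_all
    mean_pred = sum(n * sigmoid(b0 + b1 * x) for x, _, n in data) / n_all
    print(f"{name}: P(rain | cloudy) = {p_cloudy:.3f}, "
          f"mean prediction = {mean_pred:.3f}, base rate = {base_rate:.3f}")
```

In this sketch the mean prediction over the training rows also equals the training base rate, which follows from the likelihood equation for the intercept. With regularization, class weights, or more features that the model cannot fit exactly, the per-group match need not hold.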