Say you want to build a predictive model to classify two groups: yellow cars vs. silver cars. There are over 1 million silver cars but only 30,000 yellow cars. Would the model be biased if you randomly sampled 10,000 silver cars and 10,000 yellow cars to build the training dataset? Or would the training data need yellow and silver cars in the same proportions as in the true population of cars?
Is a predictive model biased if you sample 10,000 values from 2 significantly different populations?
See "Does down-sampling change logistic regression coefficients?" – Scortchi - Reinstate Monica Oct 07 '16 at 21:53
Thanks Scortchi. I have a math background but had some difficulty following that post; the statistics are pretty detailed. Is it saying that the influence of downsampling is not significant? That post also hints at computational restrictions, which I am facing (billions of records in my case). – barker Oct 07 '16 at 22:01
1 Answer
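If the algorithm used to build the predictive model does not accept a weighted sample as input, then yes, the model will be biased: it is trained on a 50/50 mix of yellow and silver cars, while yellow cars make up only about 3% of the population, so without weights it will overstate how likely a car is to be yellow.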
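To make the weighting idea concrete, here is a minimal sketch in plain Python. It assumes exactly 1,000,000 silver cars (the question only says "over 1 million") and the 10,000 + 10,000 sample from the question. The function names `class_weights` and `correct_probability` are made up for illustration. The first computes inverse sampling rates that could be passed as per-row weights; the second shows one common way to rescale a probability learned on the balanced sample back to the population scale when the algorithm cannot take weights.

```python
# Assumed population counts (the question says "over 1 million" silver cars).
POPULATION = {"silver": 1_000_000, "yellow": 30_000}
# Balanced training sample described in the question.
SAMPLE = {"silver": 10_000, "yellow": 10_000}


def class_weights(population, sample):
    """Inverse sampling rate per class: how many population cars each sampled car stands for."""
    return {label: population[label] / sample[label] for label in sample}


def correct_probability(p_sample, weights, positive="yellow", negative="silver"):
    """Rescale P(yellow | x) learned on the balanced sample to the population.

    Sampling each class at a different rate multiplies the odds by the ratio of
    the sampling rates, so the odds are multiplied by the inverse ratio here.
    Requires 0 <= p_sample < 1.
    """
    odds_sample = p_sample / (1.0 - p_sample)
    odds_population = odds_sample * weights[positive] / weights[negative]
    return odds_population / (1.0 + odds_population)


weights = class_weights(POPULATION, SAMPLE)
print(weights)                             # silver rows count 100x, yellow rows 3x
print(correct_probability(0.5, weights))   # a "coin flip" on the sample, rescaled
```

Under these assumed counts, a car the balanced-sample model scores at 0.5 comes out at roughly 0.029 after rescaling, which is the population share of yellow cars. If the modelling library does accept weights (many scikit-learn estimators take a `sample_weight` argument in `fit`, for example), passing the per-class weights for each row is likely the simpler route.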
piccolbo
Thanks piccolbo, would you suggest probability-proportional-to-size sampling? I'm reading this wiki: https://en.wikipedia.org/wiki/Sampling_(statistics) – barker Oct 07 '16 at 21:10