
Say you want to build a predictive model to classify cars into two groups, yellow versus silver. There are over 1 million silver cars but only 30,000 yellow cars. Would the model be biased if you randomly sampled 10,000 silver cars and 10,000 yellow cars to build the training dataset? Or would the training data need yellow and silver cars in the same proportions as in the true population of cars?

barker

1 Answer


If the algorithm used to build the predictive model does not allow for a weighted sample as input, the answer is yes.
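As a rough illustration of what a weighted sample could look like for the numbers in the question, here is a minimal sketch. It assumes exactly 1,000,000 silver cars (the question only says "over 1 million"), and the helper names `inverse_sampling_weights` and `weighted_class_shares` are made up for this example. Each sampled car is weighted by the number of population cars it stands for, so the weighted class shares match the population again.

```python
# Population sizes from the question. 1,000,000 is an assumed round figure
# standing in for "over 1 million" silver cars.
population = {"silver": 1_000_000, "yellow": 30_000}

# Balanced training sample described in the question.
sample = {"silver": 10_000, "yellow": 10_000}


def inverse_sampling_weights(population, sample):
    """Weight each sampled row by how many population rows it represents."""
    return {label: population[label] / sample[label] for label in sample}


def weighted_class_shares(sample, weights):
    """Class proportions in the sample once the weights are applied."""
    total = sum(sample[label] * weights[label] for label in sample)
    return {label: sample[label] * weights[label] / total for label in sample}


weights = inverse_sampling_weights(population, sample)
shares = weighted_class_shares(sample, weights)

print(weights)  # one weight per class, to be attached to every row of that class
print(shares)   # weighted shares, equal to the population shares
```

Under these assumptions each silver row gets weight 100 and each yellow row gets weight 3. Many libraries accept such per-row weights when fitting; for example, a number of scikit-learn estimators take a `sample_weight` argument in `fit`. Whether a given algorithm supports this has to be checked case by case.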

piccolbo
  • Thanks piccolbo, would you suggest probability-proportional-to-size sampling? I'm reading this Wikipedia article: https://en.wikipedia.org/wiki/Sampling_(statistics) – barker Oct 07 '16 at 21:10