We are taking anonymised demographic data of n people with a rare medical condition and trying to train a binary classifier (currently xgboost) to reach more people like them with, say, a public service announcement.

To train, we have been taking a random sample of n non-patients (with their demographic data). The classifier helps us reach many people who we believe have a propensity for the disease.

I am starting to have doubts about whether an artificially balanced sample like this is a good thing. Should there be more non-patients in the training sample, since in real life there are MANY more non-patients, and if so, why?

Thanks!

  • you might find my answer here helpful https://stats.stackexchange.com/questions/488939/does-up-sampling-lead-to-lots-of-false-positives-in-production/569098#569098 – Dikran Marsupial Apr 01 '22 at 15:29
  • Thanks! Going through it! – Gamliel Beyderman Apr 01 '22 at 18:02
  • I am puzzled about how to post-process as you suggest. Will it affect the final result, though, as we just classify people into one of two classes and get a value of 1 or 0 for each person... – Gamliel Beyderman Apr 01 '22 at 21:02
  • You wrote: "Fortunately this bias tends to go away rapidly as the size of the dataset increases, so if you have lots of data, you don't need to do anything." Not doing anything = keeping the sample balanced? How do I quantify how much data is enough to overcome this bias? Let's say it is 40 thousand positive cases and 40 thousand negative: a lot, or a little? – Gamliel Beyderman Apr 01 '22 at 21:06
  • That is probably a lot, but it also depends on the dimensionality of the data: in high dimensions you need more data to characterise the statistical distribution of the minority class. It is important to consider the misclassification costs and factor those in. For problems like this it is probably best to use a probabilistic classifier that estimates the probability of class membership rather than a discrete 0/1 classifier, as you can then cope with misclassification costs and differences in priors using the tricks in my other answer, sketched below. – Dikran Marsupial Apr 01 '22 at 21:30
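
The prior-correction trick referred to in the last comment can be sketched as follows, assuming the classifier outputs probabilities and was trained on an artificially balanced sample; pi_train and pi_true are placeholder values you must supply (the positive fraction in the training sample and in the real population):

    import numpy as np

    def correct_for_priors(p_balanced, pi_train=0.5, pi_true=0.001):
        """Adjust P(y=1|x) estimated on a balanced training sample so it
        reflects the true (rare) class prior in the deployment population."""
        # Scale each class's probability by the ratio of true to training
        # priors, then renormalise so the two classes sum to one again.
        pos = p_balanced * (pi_true / pi_train)
        neg = (1.0 - p_balanced) * ((1.0 - pi_true) / (1.0 - pi_train))
        return pos / (pos + neg)

Thresholding the corrected probabilities at c_FP / (c_FP + c_FN), rather than always cutting at 0.5, is then a standard way to factor in asymmetric misclassification costs.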

1 Answer


This depends on the KPI that you want to achieve with your classifier. Depending on the relative costs of false positives and false negatives, you will want to optimize precision or recall.

Generally, I would recommend using a very large number of examples (many more negative than positive) and using the class-weight parameter of your ML algorithm to optimize performance based on your needs; a sketch follows.
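
As an illustration, in xgboost the analogous knob is scale_pos_weight; a minimal sketch, assuming X and y are your (hypothetical) feature matrix and 0/1 labels:

    import numpy as np
    from xgboost import XGBClassifier

    # Keep the realistic imbalance in the training data and compensate
    # with a class weight instead of under-sampling the negatives.
    n_neg = np.sum(y == 0)
    n_pos = np.sum(y == 1)

    model = XGBClassifier(
        scale_pos_weight=n_neg / n_pos,  # up-weight the rare positive class
        eval_metric="aucpr",             # PR-AUC is more informative than accuracy here
    )
    model.fit(X, y)
    probs = model.predict_proba(X)[:, 1]  # probabilities rather than hard 0/1 labels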

  • We want to maximize recall. But how to decide on the "right" imbalance? – Gamliel Beyderman Apr 01 '22 at 20:51
  • You want the test set to be as close to reality as possible. Then you could treat the training set imbalance as a hyperparameter and optimize it using model selection. – user2672299 Apr 01 '22 at 21:09
  • However, I would generally not recommend optimizing recall using under-sampling. You would rather optimize recall by changing the error function of your ML algorithm; see the sketch below. – user2672299 Apr 01 '22 at 21:10
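
Combining the two comments above: a sketch of treating the class weight (the soft analogue of the sampling ratio) as a hyperparameter and selecting it by cross-validated recall; X and y are again assumed, and the grid values are purely illustrative:

    from sklearn.model_selection import GridSearchCV
    from xgboost import XGBClassifier

    # Search over the re-weighting factor, scoring on recall, with
    # cross-validation folds that keep the realistic class imbalance.
    param_grid = {"scale_pos_weight": [1, 10, 100, 1000]}

    search = GridSearchCV(
        XGBClassifier(eval_metric="logloss"),
        param_grid,
        scoring="recall",  # the KPI the questioner wants to maximise
        cv=5,
    )
    search.fit(X, y)
    print(search.best_params_, search.best_score_)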