3

I am using a random forest classifier to predict plant color in my study species from a variety of environmental variables. My data comes from citizen scientists, and I am worried that the class imbalance I'm seeing between my color categories may be due to sampling bias from the observers. For example, the same flower may have been documented twice by different people. Or, within an area, color 1 may get documented more often than color 2 even if color 1 isn't the majority color there (perhaps because it's a prettier color).

There is also the issue of areas with high human population density being overrepresented relative to areas with lower population density (or poor cell coverage).

Is this something I should worry about when using a random forest? I'm worried it could cause the model to put too much emphasis on a particular predictor. If it is something I should be concerned about, what can I do?

mdewey
  • 17,806
Rachel
  • 41
  • Your problem occurs because you have miscast a prediction problem as a classification problem. See https://fharrell.com/post/classification. Use random forest to estimate probabilities and imbalance is not a problem. But note that random forest, because of its tendency to overfit, requires a small number of features and a large number of observations. – Frank Harrell Mar 07 '22 at 11:48
  • 1
    @FrankHarrell This doesn’t seem like the typical “class imbalance” question we get on here. Implicit in the estimation of tendencies that I’ve learned from people like you and Stephan Kolassa is that the prior probability of class membership is reasonable. If it is biased, as it seems to be here, then the posterior probabilities we get from the model would be misleading. – Dave Mar 07 '22 at 11:56
  • Yes, it's hard for me to fully understand your imbalance issue. Typically, having one observation seen by multiple observers would be handled with clustering (e.g., random effects). – Frank Harrell Mar 07 '22 at 22:01

1 Answer

1

Random forests aren't immune to this kind of bias. An overrepresented data segment will be overrepresented in the splitting criterion, so the trees will tend to favor splits that perform well for that segment at the expense of other segments. That's not to say the result will be poor, but there will be a bias. In the particular case of class imbalance, the final leaf scores will (on average) be biased in exactly the same way your data is.
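To make that concrete, here is a minimal synthetic sketch (not your data; scikit-learn is assumed, and all the numbers are made up for the example) showing how over-reporting of one color pulls the forest's predicted probabilities away from the true prevalence:

```python
# Synthetic sketch: balanced two-color data, where "color 1" records are
# reported far more often than "color 0" records.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, n_features=10, n_informative=5,
                           weights=[0.5, 0.5], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

# Simulated observer bias: keep every class-1 record but only ~30% of class 0.
rng = np.random.default_rng(0)
keep = (y_train == 1) | (rng.random(len(y_train)) < 0.3)

rf = RandomForestClassifier(n_estimators=300, random_state=0)
rf.fit(X_train[keep], y_train[keep])

# On an unbiased test set, the mean predicted probability of class 1 sits
# well above the true prevalence of 0.5, mirroring the biased sample.
print(rf.predict_proba(X_test)[:, 1].mean())
```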

If you can quantify the extent to which data segments are under- or over-represented, then you can add sample weights to the random forest to counteract that effect (see Wikipedia). Similarly, if you can quantify the class-balance shift that arises from the sampling, you can apply class weights to bring the leaf scores back to the right proportions. You can also apply a post-model adjustment for the class-balance issue on the final scores, see e.g. "Convert predicted probabilities after downsampling to actual probabilities in classification", but I don't think there's an analogue for data segments.
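Continuing the synthetic sketch above, here is roughly what those fixes look like in scikit-learn. The 0.3 correction factor is only known because we chose it in the simulation; in practice it (and the hypothetical `reporting_rate` array) would have to be estimated, e.g. from survey effort or population density:

```python
from sklearn.ensemble import RandomForestClassifier

# (a) Class weights: up-weight the under-reported class (class 0 was kept at
# a rate of ~0.3 in the sketch above) so the splits and leaf proportions
# behave as if the sample were unbiased.
rf_weighted = RandomForestClassifier(n_estimators=300,
                                     class_weight={0: 1.0 / 0.3, 1: 1.0},
                                     random_state=0)
rf_weighted.fit(X_train[keep], y_train[keep])
print(rf_weighted.predict_proba(X_test)[:, 1].mean())  # moves back toward 0.5

# (b) Per-record weights: if instead you can estimate each record's relative
# chance of being reported (a hypothetical `reporting_rate` array), pass its
# inverse to fit():
#     rf.fit(X, y, sample_weight=1.0 / reporting_rate)

# (c) Post-hoc score adjustment (cf. the linked question): if class 0 was
# retained at rate beta, the unbiased probability can be recovered from the
# raw score p_s via Bayes' rule.
def correct_probability(p_s, beta):
    return beta * p_s / (beta * p_s - p_s + 1.0)

print(correct_probability(rf.predict_proba(X_test)[:, 1], beta=0.3).mean())
```

Options (a) and (c) fix the class proportions; only the per-record weights in (b) address segment-level biases such as geographic over-sampling.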

Ben Reiniger
  • 4,514