
I often read about the problem of classification on imbalanced datasets and about methods to address it. Namely, off-the-shelf classifiers learn to minimize some form of total misclassification cost, and thus are biased towards the most frequent class in the training set.

My question is: what other serious problems exist in this setting that cannot be solved by simply adjusting the threshold applied to the classifier's response?

Gecko

2 Answers


You've made an important observation that many miss: so-called classifiers like logistic regression and neural networks return class membership probabilities rather than discrete classes.

And I am with you that it is misleading to say that your classifier is really accurate at, say, $98\%$ if you have a $99:1$ class imbalance and could get $99\%$ accuracy by always guessing the majority class.

Take it a step further and directly evaluate the probability values. Two common metrics are log loss and Brier score. I'll give the equations for each of those in the binary case, where $y$ is the $0/1$ vector of observed classes and $\hat y$ is the vector of predicted probabilities of class $1$.

$$ \text{Log Loss}(y, \hat y) = -\dfrac{1}{n}\sum_{i=1}^n\left[ y_i\log(\hat y_i) + (1-y_i)\log(1-\hat y_i) \right]\\ \text{Brier Score}(y, \hat y) = \dfrac{1}{n}\sum_{i=1}^n\left( y_i - \hat y_i \right)^2 $$
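
For concreteness, here is a minimal NumPy sketch of both metrics, written directly from the formulas above (the function names and the small clipping constant are my own, not from any particular library):

```python
import numpy as np

def log_loss(y, y_hat, eps=1e-15):
    """Mean negative log-likelihood of 0/1 labels y under predicted probabilities y_hat."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0) for probabilities at 0 or 1
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def brier_score(y, y_hat):
    """Mean squared difference between 0/1 labels and predicted probabilities."""
    return np.mean((y - y_hat) ** 2)

y = np.array([1, 0, 0, 1])
y_hat = np.array([0.9, 0.2, 0.4, 0.6])
print(log_loss(y, y_hat))     # ~0.338
print(brier_score(y, y_hat))  # ~0.0925
```

Both are proper scoring rules: they are optimized in expectation by reporting the true class probabilities, so they reward honest, well-calibrated predictions rather than hard class calls.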

By looking at the probability values, we do away with hard classifications and can make decisions based on how bad it is to make each kind of wrong classification. For instance, even in a balanced problem, it might be much worse to call a $0$ a $1$ than to call a $1$ a $0$, so we might require extreme evidence before calling a case a $1$, say $P(1)>0.9$, to keep the number of times a $0$ gets called a $1$ to a minimum.

This is an example of changing the threshold, like you mention in your question, but it has the advantage of being driven by the cost of misclassification, not by arbitrary targets for the misclassification rate that might be fairly unrelated to the actual costs. Further, the cost of misclassification might differ from subject to subject. By forcing a discrete choice, you lose the useful information contained in the probabilities.
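
A minimal sketch of that cost-driven thresholding (the cost values here are illustrative assumptions, not part of the answer):

```python
import numpy as np

# Assumed costs: calling a 0 a 1 (false positive) is 9x worse
# than calling a 1 a 0 (false negative).
c_fp, c_fn = 9.0, 1.0

# Expected cost of predicting 1 is (1 - p) * c_fp; of predicting 0 is p * c_fn.
# Predicting 1 is cheaper exactly when p > c_fp / (c_fp + c_fn).
threshold = c_fp / (c_fp + c_fn)  # 0.9 here, matching the P(1) > 0.9 example

p = np.array([0.55, 0.85, 0.95])  # predicted probabilities of class 1
decisions = (p > threshold).astype(int)
print(threshold, decisions)  # 0.9 [0 0 1]
```

Note how the $0.9$ threshold in the example above corresponds to a $9{:}1$ cost ratio; the threshold falls out of the costs rather than being chosen by eye.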

Frank Harrell has two good blog posts on this topic (1) (2) and often writes about it here on Cross Validated.

Dave

"Namely, off-the-shelf classifiers learn to minimize some form of total miss-clasffication cost, and thus have a bias towards the most frequent class in the training set. ... My question is: what other serious problems exist..."

This is not a problem. The bias is correct; it is just Bayes' rule:

$$p(\mathcal{C}_+|\vec{x}) = \frac{p(\vec{x}|\mathcal{C}_+)\,p(\mathcal{C}_+)}{p(\vec{x})}$$

Note that the posterior probability that a pattern belongs to the minority positive class is proportional to the prior $p(\mathcal{C}_+)$, i.e. the frequency of positive patterns in the data set (which should be the same as the frequency in operational use).
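
A toy illustration of that point in odds form (posterior odds = likelihood ratio × prior odds); the likelihood ratio of $10$ is an assumed value for the example:

```python
# Same evidence (likelihood ratio of 10 in favour of the positive class),
# evaluated under a balanced prior and under a 99:1 imbalanced prior.
lr = 10.0

for prior_pos in (0.5, 0.01):
    prior_odds = prior_pos / (1 - prior_pos)
    post_odds = lr * prior_odds            # Bayes' rule in odds form
    posterior = post_odds / (1 + post_odds)
    print(prior_pos, round(posterior, 3))
# 0.5  0.909
# 0.01 0.092
```

The same evidence yields a posterior above $0.9$ under a balanced prior but below $0.1$ under the imbalanced one. That is not a defect of the classifier; it is exactly what Bayes' rule says should happen.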

If it is unacceptable that all patterns are assigned to the majority class, it is just an indication that the misclassification costs are not equal and the threshold on the probability of class membership should be adjusted accordingly.
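
To make that concrete (a standard decision-theoretic derivation; the symbols $c_{FP}$ and $c_{FN}$ for the costs of a false positive and a false negative are my own): assign a pattern to the positive class whenever predicting positive has the lower expected cost,

$$(1 - p(\mathcal{C}_+|\vec{x}))\,c_{FP} < p(\mathcal{C}_+|\vec{x})\,c_{FN} \quad\Longleftrightarrow\quad p(\mathcal{C}_+|\vec{x}) > \frac{c_{FP}}{c_{FP}+c_{FN}},$$

which recovers the usual $0.5$ threshold when the two costs are equal.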

In most cases there is no "class imbalance problem"; it is just the "cost-sensitive learning problem" in disguise.

Now if the dataset is very small, there can be an undue bias against the minority class (see my answer to this related question), but in that case there generally isn't much you can do about it anyway.

Dikran Marsupial