3

I'm working on a classification problem on a dataset containing three classes, in proportions {"0": 0.43, "1": 0.25, "2": 0.30}. However, whenever I train a model, it never predicts class "1" (literally ZERO predictions).

Even weirder, this happens regardless of the model I train (I tried the usual suspects: SVM, RF, tree boosting, etc.). So I guess there is something in the data that I did not catch, but the fact that there is not even a single prediction for class 1 is really disturbing.

Attached is the distribution of my test set vs. the distribution of my predictions (here for a baseline SVM):

[Figure: test set class distribution]

[Figure: predictions class distribution]

Thanks.

Mordechai
  • What adjustments have you tried within methods? Also, what about multinomial logistic reg.? – Peter Flom Feb 23 '24 at 19:31
  • I have tried oversampling the "1" class that poses me problem, but it still does not appear. I'll try multinomial logistic regression, thanks. Coming back to you with the result! – Mordechai Feb 23 '24 at 19:35
  • 2
    What do you get when you predict probabilities, e.g., predict_proba instead of predict? $//$ Is there a reason to believe that your variables should be able to distinguish between the categories? – Dave Feb 23 '24 at 20:21
  • When predicting the probabilities, class 1 is mostly under 25%, while classes 0 and 2 share the remaining 75%. As for "Is there a reason to believe that your variables should be able to distinguish between the categories?", I am not sure how I would assess that. – Mordechai Feb 26 '24 at 19:52
  • $1)$ So the least likely category is predicted as the least likely. That seems reasonable, does it not? $//$ $2)$ You might not be able to assess this easily. However, one consideration is whether a human could use the features to make an accurate prediction. For instance, we know the pixels of the MNIST digits to be adequate for making accurate classifications because humans can look at the images and make accurate classifications without additional information. – Dave Feb 26 '24 at 20:21

3 Answers

3

Let's take a much simpler example that might lead to that kind of result, using this frequency table roughly based on your charts:

$$\begin{array}{c|cc} & \text{feature } A & \text{feature } B \\ \hline \text{class } 0 & 1050 & 20 \\ \text{class } 1 & 550 & 100 \\ \text{class } 2 & 150 & 600 \\ \end{array}$$

A model trained on this data is likely to predict class $0$ when observing a new case of $A$, and class $2$ when observing a new case of $B$; it would never predict class $1$. These predictions would be correct the majority of the time, and no other approach would be likely to do better (unless you are allowed to make non-categorical class predictions like $0.4$ or $1.7$).

Your actual model is likely to be more sophisticated than this, but if a particular class is always less common than one of the other classes no matter what additional information you condition on, then you could see a similar failure to ever predict that class.
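
As a sanity check, here is a minimal sketch of this effect in Python (using scikit-learn and the toy frequency table above; any probabilistic classifier would show the same behaviour):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Rebuild the frequency table above as a toy dataset:
# keys are (feature value, class), with 0 = feature A and 1 = feature B.
counts = {(0, 0): 1050, (0, 1): 550, (0, 2): 150,   # feature A
          (1, 0): 20,   (1, 1): 100, (1, 2): 600}   # feature B
X = np.array([[f] for (f, c), n in counts.items() for _ in range(n)])
y = np.array([c for (f, c), n in counts.items() for _ in range(n)])

clf = LogisticRegression().fit(X, y)
print(clf.predict([[0], [1]]))        # [0 2]: class 1 is never the argmax
print(clf.predict_proba([[0], [1]]))  # yet class 1 has sizeable probability
```

Class $1$ gets roughly $31\%$ probability given $A$ and $14\%$ given $B$, but since it is never the *most* likely class, `predict` never returns it.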

Henry
2

There is nothing particularly strange about this; it may well be that the model is giving the optimal answer to the question it has been posed. If the density of patterns belonging to the minority class (weighted by its prior probability) is everywhere lower than the weighted density of the majority class, then no pattern is ever more likely to belong to the minority class than to the majority class. In that case, the optimal accuracy is obtained by assigning all patterns to the majority class.

Here is a simple two-class example, with Gaussian class conditional densities:

[Figure: Gaussian class-conditional densities for a two-class example]
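
The same picture can be sketched numerically (a minimal illustration with made-up priors and Gaussian parameters):

```python
import numpy as np
from scipy.stats import norm

# Made-up two-class example: a broad majority class and a narrower
# minority class, each density weighted by its prior probability.
x = np.linspace(-6, 6, 1001)
p_majority = 0.8 * norm.pdf(x, loc=0, scale=2)
p_minority = 0.2 * norm.pdf(x, loc=0, scale=1)

# The weighted minority density is lower everywhere, so the Bayes-optimal
# (minimum error rate) rule predicts the majority class for every input.
print((p_minority < p_majority).all())  # True
```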

If this is not a satisfactory answer, it usually means that the misclassification costs are not equal (in your case, classifying a class 1 pattern as class 0 or class 2 is a worse error than the other way round). The solution is to work out what the matrix of misclassification costs should be and use "minimum risk classification". If you have a probabilistic classifier, that can be done after training. If you use a discrete classifier, like the SVM, then the misclassification costs must be built in at training time (for the SVM, by having different values of C, the regularisation parameter, for each class).
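
Here is a minimal sketch of minimum risk classification on top of a probabilistic classifier (the cost matrix is hypothetical; in practice it has to come from the application):

```python
import numpy as np

# Hypothetical cost matrix: cost[i, j] is the cost of predicting class j
# when the true class is i; misclassifying a class 1 pattern is costly.
cost = np.array([[0.0, 1.0, 1.0],
                 [5.0, 0.0, 5.0],
                 [1.0, 1.0, 0.0]])

def minimum_risk_predict(proba, cost):
    """Pick the class with the lowest expected cost rather than the
    highest probability. `proba` has shape (n_samples, n_classes),
    e.g. the output of predict_proba."""
    expected_cost = proba @ cost   # shape (n_samples, n_classes)
    return expected_cost.argmin(axis=1)

# A pattern whose class 1 probability is well under 50% can still be
# assigned to class 1 once the costs are taken into account:
print(minimum_risk_predict(np.array([[0.40, 0.25, 0.35]]), cost))  # [1]
```

For the SVM, scikit-learn's `SVC(class_weight={0: 1, 1: 5, 2: 1})` is one way to build class-dependent values of C into training (the weights here are again hypothetical).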

Most classification methods are set up by default to assume that misclassification costs are equal, but that is often not the case in real-world classification tasks. Cost-sensitive learning is something that practitioners need to have in their toolbox.

Dikran Marsupial
  • 1
    +1 This is why it's so important to think about how well the features let the minority class stand out. If something about the minority class is screaming at you, then it will burn through the imbalance. If not, what is the model going to do, conclude that the unremarkable observation belongs to the remarkable category? – Dave Mar 07 '24 at 11:46
  • 1
    The imbalance is part of Bayes' rule; we shouldn't do anything about it unless we have a reason to believe it results in an undue bias, but that is very hard to diagnose (https://stats.stackexchange.com/questions/539638/how-do-you-know-that-your-classifier-is-suffering-from-class-imbalance). As you suggest, the likelihood/class-conditional density is also part of Bayes' rule, and that may dominate the prior (imbalance). – Dikran Marsupial Mar 07 '24 at 11:57
1

As Henry noted (+1), this is a natural consequence of model fitting when the classes can be clearly discriminated. A by-product is that the less likely events will get predicted wrong. You do not want perfect predictions from your model, as this is often a sign of overfitting.

On another note, it is much more important to look at out-of-sample predictions than in-sample predictions. Your model will usually predict more accurately on the data it was trained on, because that is the data that informed it. Testing on out-of-sample data is much more valuable for seeing how accurate your model is in the real world.
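
A minimal sketch of that workflow (with synthetic stand-in data; swap in the real features and labels):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic three-class data with roughly the question's class proportions.
X, y = make_classification(n_samples=2000, n_classes=3, n_informative=4,
                           weights=[0.43, 0.25, 0.30], random_state=0)

# Hold out a test set the model never sees during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = SVC().fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))  # per-class recall
```

Shawn Hemelstrand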

  • "You do not want perfect predictions in your model, as this is often a sign of overfitting." this is often not the case with modern ML methods like the SVM, where generalisation can be optimal for classifiers with zero error on the training set. There is a lot of interesting work on this ("benign overfitting") – Dikran Marsupial Mar 07 '24 at 12:55
  • I suppose my comments are more geared towards the typical case here then. – Shawn Hemelstrand Mar 07 '24 at 21:26