
I am working on an ML classification task which is similar to the following:

Apples have to be classified to three classes: Big, Medium and Small.

I need a metric that I can use to assess the system. I am considering using the mean F1 score over the three classes (Big, Medium and Small).

However, it looks like there is a problem with the F1 score here: it penalizes the system equally for a big-to-medium confusion and for a big-to-small confusion. Intuitively, the system should be penalized more for making a bigger mistake (confusing a big apple with a small one is a worse error than confusing a big apple with a medium one).
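
Here is a small sketch of what I mean, on toy labels I made up (computing the mean F1 via scikit-learn's macro average): the two predictions below differ only in which wrong class a single big apple is assigned to, yet they receive exactly the same score.

```python
from sklearn.metrics import f1_score

y_true = ["Big", "Big", "Big", "Medium", "Medium", "Small", "Small"]

# One big apple misclassified as Medium (a "near" miss) ...
y_pred_near = ["Big", "Big", "Medium", "Medium", "Medium", "Small", "Small"]
# ... versus the same apple misclassified as Small (a "far" miss).
y_pred_far = ["Big", "Big", "Small", "Medium", "Medium", "Small", "Small"]

print(f1_score(y_true, y_pred_near, average="macro"))  # ~0.87
print(f1_score(y_true, y_pred_far, average="macro"))   # ~0.87 (identical)
```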

What metric could I use instead of F1 score here?

Alexey

1 Answer


The F1 score is an evaluation metric for binary classifiers, so it is not very appropriate here (although it could be adapted).

More importantly, your problem is not a genuine classification problem. In classification, there is no order among the classes, and you cannot say "this class is nearer to that class than to another one". They are all "equally far apart".

So, to begin with, you should choose a more appropriate model: one that knows how to deal with response variables whose values are ordered. This is always the case in regression, where the response is a real number. But you do not really have a continuous real-valued response; you have just three discrete values with "different distances between them". Thus, your problem is more appropriately handled by ordinal regression models.
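
As a rough illustration (using scikit-learn and a toy integer encoding of the three sizes that I made up, not anything specific to your setup), ordinal-aware metrics such as the mean absolute error on the integer-coded classes, or quadratically weighted Cohen's kappa, do penalize a Big-to-Small confusion more than a Big-to-Medium one:

```python
from sklearn.metrics import cohen_kappa_score, mean_absolute_error

SIZE = {"Small": 0, "Medium": 1, "Big": 2}  # explicit ordering of the classes

y_true = [SIZE[c] for c in ["Big", "Big", "Big", "Medium", "Medium", "Small", "Small"]]
y_near = [SIZE[c] for c in ["Big", "Big", "Medium", "Medium", "Medium", "Small", "Small"]]
y_far  = [SIZE[c] for c in ["Big", "Big", "Small", "Medium", "Medium", "Small", "Small"]]

# Mean absolute error on the ordinal codes: a far miss costs 2, a near miss 1.
print(mean_absolute_error(y_true, y_near))  # ~0.14
print(mean_absolute_error(y_true, y_far))   # ~0.29

# Quadratically weighted Cohen's kappa: disagreements are weighted by the squared
# distance between the true and the predicted class index, so far misses hurt more.
print(cohen_kappa_score(y_true, y_near, weights="quadratic"))  # ~0.89
print(cohen_kappa_score(y_true, y_far, weights="quadratic"))   # ~0.61
```

The quadratic weights grow with the squared distance between the true and the predicted class index, which encodes exactly the "bigger mistake, bigger penalty" intuition from the question.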

frank
  • Thank you for bringing up the term "ordinal regression", it helped with further googling :) Here is an answer that addresses my question about choosing an evaluation metric for ordinal regression: https://stats.stackexchange.com/a/557563/319972 – Alexey Jul 31 '22 at 11:24