I am working on an ML classification task similar to the following:
Apples have to be classified into three classes: Big, Medium, and Small.
I need a metric to assess the system, and I am considering the mean (macro-averaged) F1 score over the three classes (Big, Medium, and Small).
However, there seems to be a problem with the F1 score here: it penalizes the system equally for a Big-to-Medium confusion and for a Big-to-Small confusion. Intuitively, since the classes have a natural order (Small < Medium < Big), the system should be penalized more for the bigger mistake: confusing a Big apple with a Small one is worse than confusing it with a Medium one.
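To make the issue concrete, here is a minimal sketch of what I mean (the label counts are made up, and I am using scikit-learn's `f1_score` just for illustration): in one case two Big apples are misclassified as Medium, in the other as Small, yet the macro-averaged F1 score comes out the same.

```python
# Minimal sketch (made-up data) showing that macro-averaged F1
# cannot distinguish the two kinds of mistakes.
from sklearn.metrics import f1_score

# Encode the classes as 0 = Small, 1 = Medium, 2 = Big; 10 apples per class.
y_true = [0] * 10 + [1] * 10 + [2] * 10

# Case A: two Big apples are misclassified as Medium.
y_pred_a = [0] * 10 + [1] * 10 + [1, 1] + [2] * 8

# Case B: two Big apples are misclassified as Small (the "bigger" mistake).
y_pred_b = [0] * 10 + [1] * 10 + [0, 0] + [2] * 8

print(f1_score(y_true, y_pred_a, average="macro"))  # ~0.933
print(f1_score(y_true, y_pred_b, average="macro"))  # ~0.933, identical score
```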
What metric could I use instead of the F1 score here?