
I have three classes for sentiment (negative, neutral, and positive). I created synthetic data for the positive class, so the class distribution is now 50% neutral, 45% positive, and 5% negative. I get the metrics below and I am not 100% sure how to interpret them, or whether the model is good enough to deploy to production. I want the model to catch the positives and neutrals and not misclassify anything into the negative class (i.e. very few false positives on the negative class, I suppose). How would you interpret this table?

0 = negative, 1 = neutral, 2 = positive, but the model was trained on one-hot encoded labels.

Classification Report

                  precision    recall  f1-score   support

               0       0.06      0.93      0.10       643
               1       0.95      0.16      0.27     36755
               2       0.06      0.62      0.11      2309

        accuracy                           0.20     39707
       macro avg       0.35      0.57      0.16     39707
    weighted avg       0.88      0.20      0.26     39707

Compared to this

Classification Report

                  precision    recall  f1-score   support

               0       0.40      0.45      0.43        44
               1       0.90      0.87      0.88       751
               2       0.46      0.52      0.49       123

        accuracy                           0.80       918
       macro avg       0.58      0.61      0.60       918
    weighted avg       0.81      0.80      0.81       918
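
For reference, the macro and weighted averages in these reports are simply the unweighted and support-weighted means of the per-class scores. A minimal sketch (plain NumPy, using the per-class F1 values and supports from the first report above) shows how they are obtained:

    import numpy as np

    # Per-class F1 and support from the first report (0=negative, 1=neutral, 2=positive)
    f1 = np.array([0.10, 0.27, 0.11])
    support = np.array([643, 36755, 2309])

    macro_f1 = f1.mean()                           # unweighted mean over classes
    weighted_f1 = np.average(f1, weights=support)  # mean weighted by class support

    print(f"macro F1:    {macro_f1:.2f}")    # ~0.16, dragged down by the two badly handled classes
    print(f"weighted F1: {weighted_f1:.2f}")  # ~0.26, dominated by the huge neutral class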

  • Don't use accuracy, precision, recall, sensitivity, specificity, or the F1 score. Every criticism in the following threads applies equally to all of these, and indeed to all evaluation metrics that rely on hard classifications: Why is accuracy not the best measure for assessing classification models? Is accuracy an improper scoring rule in a binary classification setting? Classification probability threshold – Stephan Kolassa Aug 03 '22 at 06:18
  • We use these metrics. Can we infer anything from these metrics alone? – nikolaosmparoutis Aug 03 '22 at 19:59
  • A major problem with these metrics is their dependence on setting a threshold for categorization, and every software package I know of assigns the point to the category with the highest probability. While this sounds reasonable at first, it might be that a minority class is never the most likely. While @StephanKolassa and I would and often do argue in favor of evaluating the probability predictions, a simple approach that might work fairly well is to vary the threshold for classification (see the sketch after this comment thread), though this gets complicated when there are more than two classes. – Dave Aug 03 '22 at 21:04
  • Further, your output variable is ordinal, not strictly categorical, in that it is worse to predict “positive” for a “negative” than it is to predict “neutral” for a negative. The tools of ordinal regression might be your friend. – Dave Aug 03 '22 at 21:05
  • The labels are one-hot. The architecture is in Keras, a deep convolutional neural network with a softmax output. Can we infer anything from the first report? I see that it is precise for the neutral class; on the negative class it looks very good, but maybe there are many positives classified into the negative class, which I do not want, or neutrals, which is slightly more acceptable. Do you think it will misclassify positives as negatives? In general we do not care about negatives being classified as positives, but we do care about the inverse, and about the neutrals. @Dave – nikolaosmparoutis Aug 04 '22 at 00:29
  • @Dave Could you please suggest a way to increase the precision on the negative class, or to increase the recall on the neutral class? – nikolaosmparoutis Aug 09 '22 at 18:16
  • To increase the recall on neutral, mark everything as neutral, and you’ll never miss a neutral case. As @StephanKolassa wrote, however, recall is a problematic metric. If you insist on using a problematic metric, it is hard to discuss best practices. Have you read the linked material? – Dave Aug 09 '22 at 18:21
  • The previous system uses them, and who can convince the boss that the system uses the "wrong" metrics and that we need to spend time building new ones again... But I will check them in my free time. – nikolaosmparoutis Aug 09 '22 at 20:19
  • @nikolaosmparoutis Have you tried a simpler classifier system, such as multinomial logistic (or ordinal - Dave makes a very good point there) regression? Deep learning doesn't always work better than simple classifier systems, and sometimes can work much worse (there is more to go wrong - more opportunities for operator error). Also deep networks tend to produce overly confident probability predictions, which makes them harder to work with (in terms of misclassification costs - see my answer). – Dikran Marsupial Aug 11 '22 at 15:41
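
To make the comment about thresholds concrete: with a plain arg-max rule, a minority class may never be predicted at all, even when the model assigns it a substantial probability. A small sketch (the softmax outputs below are made-up numbers, purely for illustration):

    import numpy as np

    # Hypothetical softmax outputs for 5 documents; columns = (negative, neutral, positive).
    proba = np.array([
        [0.30, 0.50, 0.20],
        [0.35, 0.40, 0.25],
        [0.05, 0.80, 0.15],
        [0.32, 0.34, 0.34],
        [0.10, 0.40, 0.50],
    ])

    # Arg-max never picks the negative class (column 0), even at 30-35% probability.
    print(proba.argmax(axis=1))   # [1 1 1 1 2]

    # Lowering the decision threshold for the negative class changes that:
    # flag a document as negative whenever P(negative) >= 0.25, otherwise
    # fall back to arg-max over the remaining two classes.
    neg_threshold = 0.25
    pred = np.where(proba[:, 0] >= neg_threshold, 0, proba[:, 1:].argmax(axis=1) + 1)
    print(pred)                   # [0 0 1 0 2]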

1 Answer


" I want the model to catch the positives and neutrals and not misclassification on negative class (aka very small False Positive on Negative class I suppose). "

The results you have are probably near optimal, assuming that you have no preferences about the types of error the model makes. If that assumption is incorrect (as it appears to be in this case), the results are not likely to be meaningful.

If you have particular concerns about one class, then that is an indication that the misclassification costs for your problem/analysis are not equal. What you should probably do is work out what those misclassification costs are and build them into your classification, using "minimum risk classification" (also known as "cost-sensitive learning", "Bayesian decision theory", etc.).
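
As a rough illustration of such a decision rule (the cost values below are hypothetical placeholders, not recommendations), you pick the class with the lowest expected cost under the model's predicted probabilities, instead of the class with the highest probability:

    import numpy as np

    # Hypothetical cost matrix: cost[true_class, predicted_class], classes = (negative, neutral, positive).
    # Predicting "positive" for a true "negative" is penalised most heavily in this example.
    cost = np.array([
        [0.0, 2.0, 10.0],   # true negative
        [1.0, 0.0,  1.0],   # true neutral
        [5.0, 1.0,  0.0],   # true positive
    ])

    def min_risk_predict(proba, cost):
        """Pick, per row, the class whose expected misclassification cost is lowest."""
        expected_cost = proba @ cost      # (n_samples, n_classes): expected cost of each candidate label
        return expected_cost.argmin(axis=1)

    proba = np.array([[0.25, 0.35, 0.40]])    # softmax output for one example document
    print(proba.argmax(axis=1))               # arg-max rule predicts positive (2)
    print(min_risk_predict(proba, cost))      # minimum-risk rule predicts neutral (1), because a
                                              # wrong "positive" on a true negative is too expensive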

I sort of agree with some of the comments saying not to use accuracy as a performance metric. It is based on the assumption that all misclassifications are equally bad, which is not always true, and a focus on accuracy tends to stop people from properly considering the misclassification costs, which can be of vital importance, especially for "imbalanced learning" tasks (as the minority class is often more "important" in some sense, and it is worth suffering some additional misclassifications of the majority class in order to catch more of the minority class).

I disagree with not using metrics based on hard classifications, though. If your application is one where you must make a hard classification, then it is likely that your metric of primary interest is based on that hard classification, and if you don't use it, you will be ignoring the primary goal of the project. You do need to be aware, however, of the shortcomings of these metrics, just as you need to be aware that proper scoring rules are no panacea either (see my answer here for an example where proper scoring rules chose the wrong model). That doesn't mean a hard-classification metric should be the only metric you use - sometimes it is good to use a variety of metrics to get an appreciation of different aspects of the model's performance.
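
If it helps, probability-based and hard-classification metrics are easy to report side by side; a minimal scikit-learn sketch (the random data here is just a placeholder for your real labels and softmax outputs):

    import numpy as np
    from sklearn.metrics import classification_report, log_loss

    # Placeholder data standing in for the real labels and predicted probabilities.
    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 3, size=200)         # 0=negative, 1=neutral, 2=positive
    proba = rng.dirichlet(np.ones(3), size=200)   # rows sum to 1, like a softmax output

    # Proper scoring rule evaluated on the probabilities themselves (no threshold involved).
    print("log loss:", log_loss(y_true, proba, labels=[0, 1, 2]))

    # Hard-classification metrics after applying a decision rule (here, plain arg-max).
    y_pred = proba.argmax(axis=1)
    print(classification_report(y_true, y_pred, digits=2))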

So, to summarise: work out plausible numeric costs for each type of misclassification and build them into your decision rule (rather than just picking the class with the highest probability).

Dikran Marsupial
  • But I do not use only F1 or only accuracy; there is a matrix and other values. I want to increase the precision on the negative class, or otherwise increase the recall on the neutral and positive classes. – nikolaosmparoutis Aug 29 '22 at 22:04
  • @nikolaosmparoutis I think my point boils down to the fact that you need to know what it is you are trying to do and know what your performance criteria are measuring and why they are appropriate for your problem. Having lots of performance metrics is not always a good thing. – Dikran Marsupial Aug 29 '22 at 22:06