
I am facing a multiclass classification problem with 5 categories: A has 107 instances, B has 101, C has 882, D has 229, and E has 129. I used KNN, random forest, and SVM, and the maximum accuracy score I got was 62%. So my question is: am I getting a low accuracy score because of the imbalanced data (since C has 882 instances, far more than any other category), or is it something else? NB: I looked at the y_pred vector, which holds the predicted values, and noticed that all of the values are 2 (I encoded C as 2). Why is that?
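A minimal, self-contained sketch of this kind of collapse (synthetic data with roughly the class counts above standing in for the real dataset, since the original code is not shown), showing how the predictions can be inspected:

```python
# Sketch only: synthetic data approximating the question's class counts
# (A=107, B=101, C=882, D=229, E=129), not the asker's actual pipeline.
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1448, n_classes=5, n_informative=5,
                           weights=[.074, .070, .609, .158, .089],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)

y_pred = KNeighborsClassifier().fit(X_train, y_train).predict(X_test)

# If the model collapses onto the majority class (C, encoded as 2), the
# confusion matrix shows one dominant column, and the report shows
# near-zero recall for A, B, D, and E.
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, zero_division=0))
```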

  • Biased generally means "not representative of the underlying population" in this context. The word you want is "imbalanced". – Matthew Drury Aug 18 '18 at 17:16
  • Edited. Do you have any solution for my problem? – jongli coder Aug 18 '18 at 18:17
  • Try constructing pairwise classifiers to see if you can make meaningful classifications between pairs of classes. That may give some insight into the problem. – Dikran Marsupial Aug 10 '21 at 17:10
  • https://stats.meta.stackexchange.com/questions/6349/profusion-of-threads-on-imbalanced-data-can-we-merge-deem-canonical-any – Dave Dec 15 '22 at 14:10

1 Answer


This is happening because of the imbalanced dataset. To avoid overfitting, you can use a boosting algorithm over trees of depth 1 and run a grid search to find the best boosting parameters; AdaBoost is available in Python via scikit-learn. Another measure is to edit the loss function of the algorithms you tried so that it weights errors in proportion to the class frequencies. For example, if you have 80% class A and 20% class B, weight errors on the rare class more heavily: $$L = 0.2 \cdot \text{Misclassified}_A + 0.8 \cdot \text{Misclassified}_B$$ Of course, you will have to play with the numbers, but the idea is there.
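A sketch of both suggestions with scikit-learn (the synthetic data is only a stand-in for the real X and y): boosted depth-1 trees with a grid search, and class weighting as a stand-in for the proportional loss above.

```python
# Sketch only: synthetic stand-in data, not the asker's dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1448, n_classes=5, n_informative=5,
                           weights=[.074, .070, .609, .158, .089],
                           random_state=0)

# Depth-1 trees (stumps) as the boosted base learner; class_weight="balanced"
# implements the proportional-loss idea by penalizing errors on rare classes
# in inverse proportion to their frequency.
stump = DecisionTreeClassifier(max_depth=1, class_weight="balanced")

# Grid search over the boosting parameters; balanced accuracy rather than
# plain accuracy, so predicting the majority class everywhere scores poorly.
grid = GridSearchCV(
    AdaBoostClassifier(estimator=stump),  # base_estimator= on scikit-learn < 1.2
    param_grid={"n_estimators": [50, 100, 200],
                "learning_rate": [0.1, 0.5, 1.0]},
    scoring="balanced_accuracy",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```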

ABIM
  • Why do you suggest boosting depth one trees? – Matthew Drury Aug 19 '18 at 04:24
  • You can use other depths, but deeper trees are usually more prone to overfitting. – ABIM Aug 19 '18 at 18:27
  • @BIM it is not possible to state that this is due to the class imbalance problem based on the information provided. It may be that 62% is as good as you can do on this dataset, if the density of C is higher than that of any other class anywhere; see my question here: https://stats.stackexchange.com/questions/539638/how-do-you-know-that-your-classifier-is-suffering-from-class-imbalance . It could also be that the hyperparameters of the classifiers are not well tuned, or any number of other issues. – Dikran Marsupial Aug 10 '21 at 17:09
  • Did you consider building a balanced dataset? The most obvious solution is to drop some observations of the dominating class, but that loses data. Other approaches are undersampling the dominating class or oversampling the minority classes (different techniques exist for these purposes). As suggested above, weighting schemes are also used to address imbalanced classification. – rsx Apr 13 '22 at 11:18
  • Dropping observations is bad statistical practice. The original problem exists because the task is framed as classification instead of estimating the probability of class membership and evaluating that with a proper accuracy score such as the Brier or logarithmic score. – Frank Harrell Aug 20 '23 at 11:54
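A sketch of the last comment's suggestion, assuming scikit-learn (synthetic stand-in data again): fit a probability model and evaluate the predicted probabilities with proper scoring rules rather than thresholded accuracy.

```python
# Sketch only: synthetic stand-in data, not the asker's dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

X, y = make_classification(n_samples=1448, n_classes=5, n_informative=5,
                           weights=[.074, .070, .609, .158, .089],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)

# Logarithmic score (lower is better); handles multiple classes directly.
print("log loss:", log_loss(y_test, proba))

# Multiclass Brier score: mean squared distance between the one-hot truth
# and the predicted probability vector (scikit-learn's brier_score_loss is
# binary-only, so compute it by hand).
onehot = label_binarize(y_test, classes=model.classes_)
print("Brier score:", np.mean(np.sum((proba - onehot) ** 2, axis=1)))
```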