
I am doing a classification task on a 5-class imbalanced dataset. The class distribution shows two majority classes and three minority classes, as shown below:

>>> from collections import Counter
>>> def class_distrib(arr):
...     print('-' * 70)
...     counter = Counter(arr)
...     for c, v in sorted(counter.items() , key=lambda x: x[0]):
...         per = v / len(arr) * 100
...         print('Class = %d,\tCount = %d,\tPercentage = %.2f%%' % (c, v, per))
...     print('Press Enter to continue ...')
...     #input()
...     print('-' * 70)
... 
>>> 
>>> 
>>> class_distrib(ytrain)
----------------------------------------------------------------------
Class = 0,  Count = 18749,  Percentage = 22.01%
Class = 1,  Count = 3482,   Percentage = 4.09%
Class = 2,  Count = 9566,   Percentage = 11.23%
Class = 3,  Count = 49741,  Percentage = 58.40%
Class = 4,  Count = 3634,   Percentage = 4.27%
Press Enter to continue ...
----------------------------------------------------------------------
>>>
>>>
>>> class_distrib(ytest)
----------------------------------------------------------------------
Class = 0,  Count = 14601,  Percentage = 32.08%
Class = 1,  Count = 1398,   Percentage = 3.07%
Class = 2,  Count = 8317,   Percentage = 18.27%
Class = 3,  Count = 20301,  Percentage = 44.60%
Class = 4,  Count = 904,    Percentage = 1.99%
Press Enter to continue ...
----------------------------------------------------------------------

On the train set, class 3 accounts for about 58% of the samples, and roughly 80% of all data points belong to classes 0 and 3. On the test set, classes 0 and 3 together make up about 77% of the samples.

This is a time-series dataset, and for reasons specific to the project we split train and test by data collection period (first 8 days for training, last 2 days for testing) rather than using the conventional 80/20 random split.
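For reference, a minimal sketch of this kind of period-based split (the column names 'timestamp' and 'label' are illustrative, not the actual ones in our data):

import pandas as pd

# Load the raw data; 'timestamp' and 'label' are placeholder column names.
df = pd.read_csv('data.csv', parse_dates=['timestamp'])

# First 8 days of collection -> train, remaining 2 days -> test.
cutoff = df['timestamp'].min() + pd.Timedelta(days=8)
train, test = df[df['timestamp'] < cutoff], df[df['timestamp'] >= cutoff]

Xtrain, ytrain = train.drop(columns=['timestamp', 'label']), train['label']
Xtest, ytest = test.drop(columns=['timestamp', 'label']), test['label']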

The Random Forest classifier results are biased in favour of the majority classes:

Confusion Matrix:   
[[ 8723    83   400  5287   108]
 [  339   463     0   595     1]
 [ 1500    28  1245  5525    19]
 [ 1416    84   645 18141    15]
 [  347     0     8   521    28]]

Classification Report:
              precision    recall  f1-score   support

           0       0.71      0.60      0.65     14601
           1       0.70      0.33      0.45      1398
           2       0.54      0.15      0.23      8317
           3       0.60      0.89      0.72     20301
           4       0.16      0.03      0.05       904

    accuracy                           0.63     45521
   macro avg       0.54      0.40      0.42     45521
weighted avg       0.62      0.63      0.59     45521
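These numbers come from a plain Random Forest; a minimal sketch of the evaluation (the hyperparameters shown are illustrative, not necessarily the exact ones we used):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report

# Baseline: plain Random Forest, no class weighting or resampling.
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
rf.fit(Xtrain, ytrain)
ypred = rf.predict(Xtest)

print('Confusion Matrix:')
print(confusion_matrix(ytest, ypred))
print('Classification Report:')
print(classification_report(ytest, ypred))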

Several imbalanced-learning strategies were tested, including SMOTE with Random Forest, AdaBoost, and SMOTEBoost, with no significant improvement over the Random Forest results above.
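A minimal sketch of the SMOTE + Random Forest variant, using imbalanced-learn (oversampling is applied only to the training data; the parameters are illustrative):

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# SMOTE resamples only during fit(), so the test set keeps its natural distribution.
pipe = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('rf', RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)),
])
pipe.fit(Xtrain, ytrain)
print(classification_report(ytest, pipe.predict(Xtest)))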

This makes us wonder what other factors we should investigate that may be worsening the imbalance problem. Could anyone suggest what we could do to improve the model's predictive performance? I have considered high-dimensional data visualization tools such as t-SNE, but I'm not sure how this would help me understand data-complexity issues in the dataset (t-SNE plot attached). I have also considered PCA. What would you suggest from experience?

[t-SNE plot of the dataset attached]
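A minimal sketch of the kind of t-SNE visualization I mean (subsampled for speed; the subsample size and t-SNE parameters are illustrative):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# t-SNE is expensive on ~85k rows, so embed a random subsample.
rng = np.random.default_rng(0)
idx = rng.choice(len(Xtrain), size=5000, replace=False)

emb = TSNE(n_components=2, init='pca', perplexity=30, random_state=0).fit_transform(
    np.asarray(Xtrain)[idx])

plt.scatter(emb[:, 0], emb[:, 1], c=np.asarray(ytrain)[idx], s=3, cmap='tab10')
plt.colorbar(label='class')
plt.title('t-SNE embedding of a 5k-point training subsample')
plt.show()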

