
I am doing a classification task on a 5-class imbalanced dataset. The class distribution shows two majority classes and three minority classes, as shown below:

>>> from collections import Counter
>>> def class_distrib(arr):
...     print('-' * 70)
...     counter = Counter(arr)
...     for c, v in sorted(counter.items() , key=lambda x: x[0]):
...         per = v / len(arr) * 100
...         print('Class = %d,\tCount = %d,\tPercentage = %.2f%%' % (c, v, per))
...     print('Press Enter to continue ...')
...     #input()
...     print('-' * 70)
... 
>>> 
>>> 
>>> class_distrib(ytrain)
----------------------------------------------------------------------
Class = 0,  Count = 18749,  Percentage = 22.01%
Class = 1,  Count = 3482,   Percentage = 4.09%
Class = 2,  Count = 9566,   Percentage = 11.23%
Class = 3,  Count = 49741,  Percentage = 58.40%
Class = 4,  Count = 3634,   Percentage = 4.27%
Press Enter to continue ...
----------------------------------------------------------------------
>>>
>>>
>>> class_distrib(ytest)
----------------------------------------------------------------------
Class = 0,  Count = 14601,  Percentage = 32.08%
Class = 1,  Count = 1398,   Percentage = 3.07%
Class = 2,  Count = 8317,   Percentage = 18.27%
Class = 3,  Count = 20301,  Percentage = 44.60%
Class = 4,  Count = 904,    Percentage = 1.99%
Press Enter to continue ...
----------------------------------------------------------------------

On the train set, class 3 accounts for about 58% of the samples, and roughly 80% of all data points belong to classes 0 and 3. On the test set, classes 0 and 3 together make up about 77% of the samples.

This is a time-series dataset, and for reasons specific to the project we split train and test by data collection period (first 8 days for training, last 2 days for testing) rather than using the conventional 80/20 random split.
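For reference, a minimal sketch of this kind of period-based split (the column names 'timestamp' and 'label' are illustrative, not the actual ones in our data):

import pandas as pd

# Load the raw data; 'timestamp' and 'label' are placeholder column names.
df = pd.read_csv('data.csv', parse_dates=['timestamp'])

# First 8 days of collection -> train, remaining 2 days -> test.
cutoff = df['timestamp'].min() + pd.Timedelta(days=8)
train, test = df[df['timestamp'] < cutoff], df[df['timestamp'] >= cutoff]

Xtrain, ytrain = train.drop(columns=['timestamp', 'label']), train['label']
Xtest, ytest = test.drop(columns=['timestamp', 'label']), test['label']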

The Random Forest classifier results are biased in favour of the majority classes:

Confusion Matrix:   
[[ 8723    83   400  5287   108]
 [  339   463     0   595     1]
 [ 1500    28  1245  5525    19]
 [ 1416    84   645 18141    15]
 [  347     0     8   521    28]]

Classification Report:
              precision    recall  f1-score   support

           0       0.71      0.60      0.65     14601
           1       0.70      0.33      0.45      1398
           2       0.54      0.15      0.23      8317
           3       0.60      0.89      0.72     20301
           4       0.16      0.03      0.05       904

    accuracy                           0.63     45521
   macro avg       0.54      0.40      0.42     45521
weighted avg       0.62      0.63      0.59     45521
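These numbers come from a plain Random Forest; a minimal sketch of the evaluation (the hyperparameters shown are illustrative, not necessarily the exact ones we used):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report

# Baseline: plain Random Forest, no class weighting or resampling.
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
rf.fit(Xtrain, ytrain)
ypred = rf.predict(Xtest)

print('Confusion Matrix:')
print(confusion_matrix(ytest, ypred))
print('Classification Report:')
print(classification_report(ytest, ypred))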

Several imbalanced-learning strategies were tested, including SMOTE with Random Forest, AdaBoost, and SMOTEBoost, with no significant improvement over the Random Forest results above.
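A minimal sketch of the SMOTE + Random Forest variant, using imbalanced-learn (oversampling is applied only to the training data; the parameters are illustrative):

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# SMOTE resamples only during fit(), so the test set keeps its natural distribution.
pipe = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('rf', RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)),
])
pipe.fit(Xtrain, ytrain)
print(classification_report(ytest, pipe.predict(Xtest)))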

This makes us wonder what other factors we should investigate that may be worsening the imbalance problem. Could anyone suggest what we could do to improve the model's predictive performance? I have considered high-dimensional data visualization tools such as t-SNE, but I'm not sure how this would help me understand data-complexity issues in the dataset (t-SNE plot attached). I have also considered PCA. What would you suggest from experience?

[t-SNE plot of the dataset attached]
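A minimal sketch of the kind of t-SNE visualization I mean (subsampled for speed; the subsample size and t-SNE parameters are illustrative):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# t-SNE is expensive on ~85k rows, so embed a random subsample.
rng = np.random.default_rng(0)
idx = rng.choice(len(Xtrain), size=5000, replace=False)

emb = TSNE(n_components=2, init='pca', perplexity=30, random_state=0).fit_transform(
    np.asarray(Xtrain)[idx])

plt.scatter(emb[:, 0], emb[:, 1], c=np.asarray(ytrain)[idx], s=3, cmap='tab10')
plt.colorbar(label='class')
plt.title('t-SNE embedding of a 5k-point training subsample')
plt.show()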

