99% data redundancy in binary classification problem

Question

I am working on a binary classification problem and there is 99.99% data redundancy. When I looked into the distributions of the two classes, they seem to be the same. Class imbalance is also part of the problem. The following is my analysis of the dataset:

t-SNE Plots

Generating t-SNE plots for different perplexity values (>4) produces the same kind of graph, in which the classes lie on top of each other.

PCA Analysis

The top PCA components have very low variances, spread across all of the features (42 features in total).

I then trained an XGBoost classifier with SMOTE upsampling on the top 5 PCA components (which together explain 80% of the variance). I ran k-fold cross-validation with hyperparameter tuning and analyzed the decision boundaries for different values of each XGBoost hyperparameter. My F1 score is 0.11, which is not a good score.

Can anyone share thoughts on this problem, given the information I have provided? My main concern is that the underlying nature of the distribution might not be good from a learning point of view. Even if I deduplicate and train the model, I might get good results, but when the model is tested on real-world data there will be too many points from both classes that have exactly the same values (distribution).

In reality, classes are often not well separated and overlap substantially. You cannot expect to get what you'd call a "good result" then. In such cases the data don't allow the class to be predicted with much precision, whatever method you use. That's just how it is. — Christian Hennig, May 27 '21 at 12:44
Thanks for the feedback. Yes, that's the problem. The classes overlap so much that the classifier ends up learning too much from the seen data but never performs well on unseen data. — Anonymous, May 27 '21 at 12:54

score 0 · Answer 1 · answered Apr 27 '23 at 16:30

I dispute that the classes are so inseparable. For instance, in the perplexity-4 plot, observations around $(50, -20)$ are almost certainly going to be red, yet points around $(40, 30)$ seem to be an even mix of red and green. Since green is so outnumbered by red, a $50/50$ chance of green in a region is quite remarkable!

Now, t-SNE can create groupings that do not exist just as it can miss groupings that do exist, so that perplexity 4 plot is not necessarily indicative of how the real data look in many dimensions. Nonetheless, having clusters like this strikes me as at least a positive sign.

One issue that may be contributing to your poor results is the reliance on a threshold-based, improper scoring rule like the $F_1$ score. Among other problems, the $F_1$ score does not evaluate the XGBoost model on its own. It evaluates the XGBoost model together with a decision rule based on a threshold that might be wildly inappropriate for your task. A standard software default is a probability threshold of $0.5$. In all four of the plots, there are few regions where a point has a probability of $0.5$ of being green.

You might find yourself having better luck evaluating the raw outputs of your model instead of applying a software-default threshold. At the very least, you can tune the threshold to change your $F_1$ score.
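As a rough illustration of both suggestions, here is a minimal pure-Python sketch. The labels and predicted probabilities are made-up toy values (not your data), and the function names are only for illustration. It compares the $F_1$ score at the default $0.5$ threshold with the best $F_1$ over all candidate thresholds, and also scores the raw probabilities with the Brier score and log loss, which need no threshold at all.

```python
import math

def f1_at_threshold(y_true, y_prob, threshold):
    """F1 for the positive class when predicting 1 iff prob >= threshold."""
    tp = fp = fn = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_f1_threshold(y_true, y_prob):
    """Return (best F1, threshold), trying every distinct predicted probability."""
    candidates = sorted(set(y_prob))
    return max((f1_at_threshold(y_true, y_prob, t), t) for t in candidates)

def brier_score(y_true, y_prob):
    """Mean squared error of the predicted probabilities (lower is better)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood of the labels (lower is better)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

# Toy, made-up data: 8 negatives, 2 positives; no probability reaches 0.5.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_prob = [0.02, 0.03, 0.05, 0.05, 0.08, 0.10, 0.12, 0.35, 0.30, 0.40]

default_f1 = f1_at_threshold(y_true, y_prob, 0.5)
tuned_f1, tuned_threshold = best_f1_threshold(y_true, y_prob)

# Baseline for comparison: always predict the observed prevalence.
prevalence = sum(y_true) / len(y_true)
baseline_prob = [prevalence] * len(y_true)

print("F1 at 0.5:", default_f1)
print("Best F1:", tuned_f1, "at threshold", tuned_threshold)
print("Brier (model vs. baseline):",
      brier_score(y_true, y_prob), brier_score(y_true, baseline_prob))
print("Log loss (model vs. baseline):",
      log_loss(y_true, y_prob), log_loss(y_true, baseline_prob))
```

In practice the threshold should be chosen on held-out data (for example, inside the cross-validation folds) rather than on the data used for the final evaluation, and the probabilities would come from something like your model's predicted class probabilities.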

Finally, it might be that the classes are simply quite similar and cannot be separated to a large extent on your data.
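Given the exact duplicates described in the question, one rough way to probe this is to group identical feature vectors and look at how mixed their labels are. The sketch below is only an assumption about how that could be done: the function name and the toy rows are hypothetical, labels are assumed to be coded 0/1, and rows are assumed to be convertible to tuples.

```python
from collections import defaultdict

def duplicate_label_summary(rows, labels):
    """Group identical feature vectors and report how mixed their labels are."""
    # feature tuple -> [count of label 0, count of label 1]
    counts = defaultdict(lambda: [0, 0])
    for x, y in zip(rows, labels):
        counts[tuple(x)][y] += 1

    n_rows = len(labels)
    # Rows whose exact feature vector occurs with both labels.
    n_conflicting = sum(neg + pos for neg, pos in counts.values() if neg and pos)
    # Empirical positive rate for each unique feature vector.
    positive_rate = {x: pos / (neg + pos) for x, (neg, pos) in counts.items()}

    return {
        "n_rows": n_rows,
        "n_unique": len(counts),
        "share_conflicting": n_conflicting / n_rows,
        "positive_rate": positive_rate,
    }

# Toy, made-up data: one feature vector appears with both labels.
rows = [(1, 0), (1, 0), (1, 0), (1, 0), (2, 5), (2, 5), (3, 1)]
labels = [0, 0, 0, 1, 0, 0, 1]

summary = duplicate_label_summary(rows, labels)
print(summary["n_unique"], "unique vectors out of", summary["n_rows"], "rows")
print("Share of rows with conflicting labels:", summary["share_conflicting"])
print("Positive rate per unique vector:", summary["positive_rate"])
```

For a feature vector that appears with both labels, no classifier can separate those rows. The most a model can do there is output a probability close to the observed positive rate, which is another reason to evaluate the probabilities rather than hard labels.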

I will leave some links on class imbalance and the drawbacks of threshold-based performance metrics.

Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?

Profusion of threads on imbalanced data - can we merge/deem canonical any?

Why is accuracy not the best measure for assessing classification models?

Academic reference on the drawbacks of accuracy, F1 score, sensitivity and/or specificity
