Let us say I have a bunch of rows for a classification problem:
$$X_1, \ldots, X_N, Y$$
Where $X_1, ..., X_N$ are the features/predictors and $Y$ is the class the row’s feature combination belongs to.
Many feature combinations and their classes are repeated in the dataset, which I am using to fit a classifier. Is it acceptable to remove the duplicates (I basically perform a `GROUP BY X1, ..., XN, Y` in SQL)? Thanks.
PS: This is for a binary, presence-only dataset where the class priors are quite skewed.
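To make the group-by step concrete, here is a minimal sketch in plain Python. The toy rows, the two-feature layout `(X1, X2, Y)`, and the helper `class_share` are hypothetical and only for illustration; they are not from the actual dataset. The sketch deduplicates while keeping the counts, and compares the class proportions before and after, since those can shift when duplicates are dropped.

```python
from collections import Counter

# Hypothetical toy rows in the form (X1, X2, Y); the values are made up.
rows = [
    (1, 0, 1),
    (1, 0, 1),
    (0, 1, 0),
    (1, 0, 1),
    (0, 1, 0),
    (1, 1, 0),
]

# Rough equivalent of:
#   SELECT X1, X2, Y, COUNT(*) FROM t GROUP BY X1, X2, Y
counts = Counter(rows)

# Deduplicated rows, in first-seen order (Counter preserves insertion order).
unique_rows = list(counts)


def class_share(data, label):
    """Fraction of rows in `data` whose last element (Y) equals `label`."""
    return sum(1 for r in data if r[-1] == label) / len(data)


share_before = class_share(rows, 1)         # computed on all rows
share_after = class_share(unique_rows, 1)   # computed on unique rows only

print(counts)
print(share_before, share_after)
```

Keeping `counts` around means the multiplicity of each row is not lost, in case it turns out to matter for the classifier.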
Say 20% of your data belongs to a particular class and a quarter of those rows seep into the test set. Algorithms such as decision trees will then create gateways to that class from the duplicate samples.

So basically you are saying that keeping duplicates might lead to data leakage? If so, then this is true for all algorithms, not just decision trees. – spectre Jan 06 '22 at 16:40
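The leakage concern can be checked directly. The sketch below uses a hypothetical toy dataset and plain Python: it does a naive row-level split and lists the test rows that also occur in the training set. It then shows one possible alternative, which is an assumption on my part and not something proposed in the thread: assigning each unique row to exactly one side of the split.

```python
import random

# Hypothetical toy dataset with repeated (X1, X2, Y) rows.
rows = [(1, 0, 1)] * 4 + [(0, 1, 0)] * 4 + [(1, 1, 0)] * 2

rng = random.Random(0)  # fixed seed so the shuffles are reproducible

# Naive row-level 70/30 split: copies of a row can land on both sides.
shuffled = rows[:]
rng.shuffle(shuffled)
split = len(shuffled) * 7 // 10
train, test = shuffled[:split], shuffled[split:]
train_set = set(train)
leaked = [r for r in test if r in train_set]

# Duplicate-aware split (an assumed mitigation, not from the thread):
# split the unique rows, then send every copy of a row to the same side.
unique_rows = sorted(set(rows))
rng.shuffle(unique_rows)
cut = max(1, len(unique_rows) * 7 // 10)
train_keys = set(unique_rows[:cut])
train_g = [r for r in rows if r in train_keys]
test_g = [r for r in rows if r not in train_keys]
leaked_g = [r for r in test_g if r in set(train_g)]

print(len(leaked), "test rows also appear in the naive training set")
print(len(leaked_g), "test rows also appear in the duplicate-aware training set")
```

Nothing in the check is specific to decision trees; it only looks at which rows are shared between the two sets, which is consistent with the comment that the issue applies to any algorithm.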