Remove duplicates from training set for classification

Question

Let us say I have a bunch of rows for a classification problem:

$$X_1, \ldots, X_N, Y$$

Where $X_1, ..., X_N$ are the features/predictors and $Y$ is the class the row’s feature combination belongs to.

Many feature combinations and their classes are repeated in the dataset, which I am using to fit a classifier. Is it acceptable to remove the duplicates (I basically perform a GROUP BY on X1 ... XN, Y in SQL)? Thanks.

PS:

This is for a binary, presence-only dataset where the class priors are quite skewed.

score 14 · Accepted Answer · answered Feb 20 '12 at 11:36

No, it is not acceptable. The repetitions are what provide the weight of the evidence.

If you remove your duplicates, a four-leaf clover is as significant as a regular, three-leaf clover, since each will occur once, whereas in real life there is a four-leaf clover for every 10,000 regular clovers.

Even if your priors are "quite skewed", as you say, the purpose of the training set is to accumulate real-life experience, which you will not achieve if you lose the frequency information.
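To make the frequency argument concrete, here is a minimal sketch using only the Python standard library. The toy rows and the `class_priors` helper are hypothetical and only mirror the clover example above; they are not taken from the asker's data.

```python
from collections import Counter

# Hypothetical toy data: each row is (features, label).
# 10,000 regular clovers for every four-leaf clover, as in the example above.
rows = [((3,), "regular")] * 10_000 + [((4,), "four_leaf")]

def class_priors(rows):
    """Return the empirical class frequencies of a list of (features, label) rows."""
    counts = Counter(label for _, label in rows)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Priors estimated from the full data keep the real-life frequencies.
full_priors = class_priors(rows)

# Deduplicating (the equivalent of the SQL GROUP BY) leaves one row per combination,
# so both classes now look equally common.
dedup_priors = class_priors(sorted(set(rows)))

print(full_priors)
print(dedup_priors)
```

After deduplication each class contributes exactly one row, so any classifier that estimates priors from the data would see the rare class as just as likely as the common one.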

score 1 · Answer 2 · answered Aug 27 '18 at 22:23

I agree with the previous answer, but I have some reservations. For certain classifiers, such as decision trees, it is advisable to remove duplicates when splitting samples into training and test sets. Say 20% of your data belongs to a particular class and a quarter of those samples seep into the test set; algorithms such as decision trees will then create gateways to that class from the duplicate samples. This can give misleading results on the test set, because there is essentially a very specific gateway to the correct output.

When you deploy that classifier on completely new data, it could perform astonishingly poorly if there are no samples similar to that 20%.
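One way to address this concern without discarding the duplicates is to split by unique feature combination, so that identical rows never appear on both sides of the split. The sketch below is an illustration under that assumption, not something proposed in the answer itself; `split_by_feature_group` and the toy rows are hypothetical names and data.

```python
import random
from collections import defaultdict

def split_by_feature_group(rows, test_fraction=0.25, seed=0):
    """Split (features, label) rows so that identical feature tuples stay on one side."""
    groups = defaultdict(list)
    for features, label in rows:
        groups[features].append((features, label))

    # Shuffle the unique feature combinations, not the individual rows.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    n_test = max(1, round(len(keys) * test_fraction))
    test_keys = set(keys[:n_test])

    # All duplicates of a combination follow that combination into train or test.
    train = [row for key in keys if key not in test_keys for row in groups[key]]
    test = [row for key in keys if key in test_keys for row in groups[key]]
    return train, test

# Hypothetical toy data with repeated feature combinations.
rows = [((0, 1), 1)] * 5 + [((1, 1), 0)] * 3 + [((1, 0), 0)] * 2 + [((0, 0), 1)] * 2

train, test = split_by_feature_group(rows)
train_features = {features for features, _ in train}
test_features = {features for features, _ in test}
print(len(train), len(test), train_features & test_features)
```

The duplicates, and therefore the frequency information, are kept within each side; only the overlap between training and test data is removed. The test score then reflects performance on feature combinations the model has not seen, which may or may not be what you want to measure.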

Argument: One may argue that this situation points to a flawed dataset, but I think it is true to real-life applications.

Removing duplicates for neural networks, Bayesian models, etc. is not acceptable.

Another feasible solution could be to down-weight the duplicates based on their frequency of occurrence. – Rakshit Kothari, Aug 27 '18 at 22:26

"Say, 20% of your data belonged to a particular class and 1/4th of those seeped into testing, then algorithms such as Decision Trees will create gateways to that class with the duplicate samples." So basically you are saying that keeping duplicates might lead to data leakage? If so, then this is true for all algorithms, not just decision trees. – spectre, Jan 06 '22 at 16:40
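The weighting idea from the first comment can be sketched as follows: collapse the data to unique rows, keep the count of each, and derive a per-row weight from that count. The square-root damping used here is purely an assumption for illustration; the comment does not say which weighting function to use.

```python
import math
from collections import Counter

# Hypothetical toy data: each row is (features, label).
rows = [((3,), "regular")] * 10_000 + [((4,), "four_leaf")]

# Collapse to unique rows while remembering how often each one occurred.
counts = Counter(rows)
unique_rows = list(counts)

# Weight equal to the count: equivalent to keeping every duplicate.
full_weights = [counts[row] for row in unique_rows]

# Damped weight (assumed choice): duplicates still count, but less than linearly.
damped_weights = [math.sqrt(counts[row]) for row in unique_rows]

for row, full, damped in zip(unique_rows, full_weights, damped_weights):
    print(row, full, damped)
```

Weights like these could then be passed to a learner that accepts per-sample weights. Note that any weighting other than the raw count changes the effective class priors, so the accepted answer's warning about losing frequency information still applies in proportion to the damping.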
