
I am new to machine learning and have a problem.

I have a dataset of 1191 samples, each with 10 features, belonging to 5 different classes. I trained a neural network on this dataset and obtained a good accuracy of about 0.9. However, I noticed that about 350 samples are duplicates, and every random selection of test data contains about 180 samples that also appeared in the training data.

My question is should I remove the duplicates from the dataset? Do they contribute to this accuracy?
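To illustrate what I mean, here is how I counted the duplicates with pandas. The data below is synthetic (my real features and column names are different), but the row counts match my dataset:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset: 10 features, 5 classes.
rng = np.random.default_rng(0)
base = pd.DataFrame(rng.integers(0, 3, size=(900, 10)),
                    columns=[f"f{i}" for i in range(10)])
base["label"] = rng.integers(0, 5, size=len(base))

# Append copies of some rows to simulate the duplicated samples.
df = pd.concat([base, base.sample(291, random_state=0)],
               ignore_index=True)

n_dup = df.duplicated().sum()  # rows identical to an earlier row
print(f"{len(df)} rows, {n_dup} duplicated")
```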

Yasaman
    Hi, welcome. Yes, these duplicates add weight to the fit of those specific observations (cases). Whether this effect is big or small is hard to tell from the information you have provided. – Jim Sep 22 '19 at 09:14

2 Answers


You should probably remove them. Duplicates are an extreme case of nonrandom sampling, and they bias your fitted model. Including them will essentially lead to the model overfitting this subset of points.

I say probably because you should (1) make sure they are not real data points that coincidentally have identical values, and (2) figure out why your data contains duplicates in the first place. For example, people sometimes intentionally 'oversample' rare categories in training data, though why this is done is not clear, as it probably helps only in rare cases.
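As a minimal sketch of the deduplication step (the data and column names below are made up, since the question doesn't show the actual dataset), dropping exact duplicates *before* splitting guarantees that no test row is a copy of a training row:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: continuous features plus a class label, with
# 200 rows duplicated to mimic the situation in the question.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.random((800, 4)), columns=list("abcd"))
df["label"] = rng.integers(0, 5, size=len(df))
df = pd.concat([df, df.sample(200, random_state=1)], ignore_index=True)

# Drop exact duplicates before splitting.
deduped = df.drop_duplicates().reset_index(drop=True)
train, test = train_test_split(deduped, test_size=0.2, random_state=1)

# Verify: no row in the test set also appears in the training set.
overlap = pd.merge(train, test, how="inner")
print(len(df) - len(deduped), "duplicates removed;",
      len(overlap), "rows shared between train and test")
```

Note that `drop_duplicates` will also remove coincidentally identical real observations, which is exactly why point (1) above matters before you apply it.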

As a side note, it's worth reading this thread: Why is accuracy not the best measure for assessing classification models?

mkt
    +1. I answered a similar question a few months ago here: https://stats.stackexchange.com/questions/602110/how-to-treat-duplicates-while-dealing-with-real-data/602140#602140 (I just flagged it as a duplicate of this one). Now, I'd be a bit wary of saying we should probably remove them, because it may be critical to determine whether the identical observations are erroneous duplicates or different observations whose values happen to coincide (using the term "probably" could encourage people not to look hard into the issue). – J-J-J Jan 08 '24 at 10:17
  • 1
    @J-J-J Fair point. If I wrote this answer today, I would probably emphasise the need for checking more. I will edit this later to do that. And +1 to your answer, I'd be happy for the duplicate to go the other way as your answer seems more comprehensive. – mkt Jan 08 '24 at 13:20
  • 1
    ah, too late, the other question has been closed! Anyway, no big deal I guess; I realized that's a FAQ, as searching for "removing duplicates" returns a few similar results (https://stats.stackexchange.com/search?q=removing+duplicates). Other relevant answers are probably scattered all around the website. – J-J-J Jan 08 '24 at 16:02
  • 1
    @J-J-J Duplicate direction reversed now. But we can see if there's a canonical answer about this out there and direct all threads towards it. – mkt Feb 07 '24 at 15:40
  • I tried to provide a relatively comprehensive answer, however it might be that another thread could be more adequate as a canonical answer. I prefer to let other people decide that. I also suspect that a few of these questions might have more to do with the "unbalanced dataset problem", and there are excellent canonical answers to this kind of question. Another situation that my answer does not cover is the case of bootstrapping (I vaguely remember reading a couple of comments about bootstrapping in a discussion about removing duplicates, but I can't find it again). – J-J-J Feb 08 '24 at 07:01

You should also make sure the samples in the test data are not the same as the samples in the training data, as that amounts to overfitting: the model will appear to predict the test set perfectly when that may simply be because it has already 'seen' that data during training.
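For instance, leakage in an existing split can be counted like this (the arrays below are toy data standing in for a split made before deduplicating, since no data is shown in the thread):

```python
import numpy as np
import pandas as pd

# Hypothetical split made *before* deduplicating: the first 10 rows
# end up in both the training and the test set.
rng = np.random.default_rng(2)
data = pd.DataFrame(rng.random((100, 3)), columns=["x1", "x2", "x3"])
train = pd.concat([data.iloc[:80], data.iloc[:10]], ignore_index=True)
test = pd.concat([data.iloc[80:], data.iloc[:10]], ignore_index=True)

# Count test rows that also occur in the training set.
train_rows = set(map(tuple, train.itertuples(index=False)))
leaked = sum(tuple(row) in train_rows
             for row in test.itertuples(index=False))
print(f"{leaked} of {len(test)} test rows were seen during training")
```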

userK
    Because this implies an unusual meaning of "overfitting" (and doesn't seem to be necessarily true), could you please explain your remark? – whuber Jan 05 '24 at 16:13
  • 1
    As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center. – Community Jan 05 '24 at 17:27
  • Consider a very simple model where you'd have just one independent binary variable (gender: woman/man), and one dependent binary variable (exam results: passed/failed). With regard to your answer, can you clarify how you would identify whether an observation is the same as another one? Or, in other words, what is your definition of "the same"? – J-J-J Jan 08 '24 at 17:43
  • Ah I see, I was thinking more in terms of a neural network model on text data but for numeric data this wouldn't apply I guess. In the case of text classification sometimes it can be good to have duplicates to give more weight to certain classes (e.g. under-represented classes) so it depends on the data and the use case. – userK Jan 10 '24 at 16:02