
Let’s assume that samples in a dataset are characterized by an ID, a timestamp, features, and a target, and that each sample is a real observation. After dropping totally-duplicated rows, how should duplicates (with regard to all features; ID and timestamp are excluded from the model) be treated when building a predictive model (classification)?

These are the different possibilities I thought of:

  1. Drop all duplicates
  2. Drop all duplicates and somehow assign a weight proportional to the number of times each unique occurrence was present in the dataset
  3. Keep duplicates
  4. Keep duplicates and somehow exclude them from the validation/test set(s)

In my opinion, 3. is wrong because duplicates could introduce a bias in the model; 4. could solve the issue, but it is not feasible in a (nested) cross-validation scenario. So dropping duplicates looks like the right thing to do: 1. is the standard approach with academic or competition datasets, and 2. is an improvement on it that I think is the best option, but I don't know how to implement it, also because I am using scikit-learn and its importance weighting in CV is broken.
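
For context, option 2. could look roughly like the sketch below (the column names `id`, `ts`, and `target` are hypothetical, and the classifier is just a placeholder): duplicates are collapsed and their multiplicity is passed as a sample weight.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical layout: "id" and "ts" are excluded from the model;
# all remaining columns (features + target) define what counts as a duplicate.
df = pd.read_csv("data.csv")  # assumption: the data lives in a flat file
key_cols = [c for c in df.columns if c not in ("id", "ts")]

# Collapse duplicates, keeping the multiplicity of each unique row as a weight.
dedup = df.groupby(key_cols).size().reset_index(name="weight")

X = dedup.drop(columns=["target", "weight"])  # assumes purely numeric features
y = dedup["target"]

clf = RandomForestClassifier(random_state=0)
clf.fit(X, y, sample_weight=dedup["weight"])
```

A single fit like this handles the weights fine; propagating them correctly into each fold of a (nested) cross-validation is the part I don't know how to do.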

What is the correct solution? Are there any other valid approaches (maybe related to under-/over-sampling) for treating duplicates?

  • Why should duplicates introduce a bias in the model? If you have observed a particular predictor-outcome combination 20 times rather than just once, that gives you valuable information. I would absolutely go with 3. What issue do you see here and want to address? Also, I sincerely hope that dropping all duplicates is not taught in courses, because it simply does not make sense. – Stephan Kolassa Jan 16 '23 at 10:53
  • It could happen that the same record is present more than once in the test set, so the model's predictive-performance estimate wouldn't be fair (think of the extreme case where the test set is made up of only one unique record). It could also happen that an observation is present in both the training and the validation/test set, violating the assumption that the two sets must be separate (so 4. was a possible solution). And yes, dropping all duplicates is what is taught in every course or tutorial. – AngelMarcos Jan 16 '23 at 11:36
  • What is "not fair" about having multiple records with the same predictor-outcome combination in the training set? (Of course, I am not talking about erroneous duplications of observations.) If you have 20 male smokers, all of whom develop cancer, and 2 male smokers who do not develop cancer, would you really discard "duplicates" until you are left with one male smoker with cancer and one male smoker without cancer? ... – Stephan Kolassa Jan 16 '23 at 12:08
  • ... Also, it is no problem at all to have the same predictor-outcome combination in both the training and the test set. On the contrary, if this combination occurs often in your population, then artificially removing such instances will indeed bias your model: how should it then learn that there is indeed a strong association between the predictor values and this particular outcome? In the example above, if you have male smokers with cancer in the training set, would you then propose explicitly excluding male smokers with cancer from the test set? – Stephan Kolassa Jan 16 '23 at 12:10
  • Finally, can you point us to a course or textbook where something like this is advocated? I am sure there are such "courses" on the internet, because there is a lot of misinformation especially around statistics and data science on the web. But a claim like removing all (non-erroneous!) duplicates is really just great evidence that this source is unreliable. (I work in retail forecasting. If all I have is the day of week and the sales value, does that mean that having observed two Tuesdays with zero sales is a problem, and we should remove one of them?) – Stephan Kolassa Jan 16 '23 at 12:13
  • As @StephanKolassa says, you shouldn't drop duplicates if they're not erroneous. Now, you still have to check if they're erroneous or not. If it's a survey on a human population, in particular if it's a long questionnaire, duplicates or near duplicates should be treated with caution if not suspicion, as they might be fabricated data, data coming from some software bug, etc. See https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2580502 for a reference. So you might want to investigate this, but none of the 4 options you suggest are appropriate to deal with this potential problem. – J-J-J Jan 16 '23 at 12:41
  • Thanks @StephanKolassa for your comments, I'm starting to understand the point. About your first comment, my concern is not about the training set but about the validation set: think about the extreme case where the latter is made of one unique record repeated multiple times; if the model is "well" built, it should always return the same prediction (correct/incorrect), generating a distortion in predictive performance, do you consider this "fair"? So, 4. could solve this, allowing duplicates in the training phase but not when validating performance, but it is still unfeasible in my CV scenario. – AngelMarcos Jan 16 '23 at 14:22
  • Finally, removing all duplicates is what I was taught during my Master's degree in Statistics, even if I only worked with "synthetic" or "fake" data, so I understand that with real data things could change. I can't show you that course, but I can cite a book I used during my studies: "If your dataset simply has duplicate rows, there is no need to worry about preserving the data; it is already a part of the finished dataset and you can merely remove or drop these rows from your cleaned data" (Data Wrangling with Python). – AngelMarcos Jan 16 '23 at 14:22
  • Here you can find some tutorials where dropping duplicates is recommended: 1, 2, 3 – AngelMarcos Jan 16 '23 at 14:24
  • @AngelMarcos The first tutorial link says "This could be due to things like data entry errors or data collection methods.", i.e. they're talking about erroneous duplicates. You have to make sure your duplicates are erroneous; you can't just remove an observation only because it's similar to another one. You have to understand why they are similar before choosing to keep or drop them (e.g. is it coming from the same person answering the same survey twice? Or is it just because two different persons happened to answer the same way?). – J-J-J Jan 16 '23 at 14:29
  • @JJJ OK then, how do I know if they are erroneous duplicates or not? I guess I should talk with a domain expert and evaluate them with their help (?) – AngelMarcos Jan 16 '23 at 14:35
  • Re this comment: if the validation set has a unique record that is repeated, then of course this is an erroneous duplication and should be cleaned. If there are simply many instances with identical predictor-outcome combinations, then that is simply the way your population is, and of course these data should stay in your validation set. Deleting correct data deletes information! – Stephan Kolassa Jan 16 '23 at 14:35
  • And looking at your sources, as @JJJ writes, every one of them discusses removing erroneous duplicates. Of course you should remove entries that were duplicated erroneously. In your use case, these should be the "totally-duplicated rows" you removed initially. After that, if you have two rows with different IDs and timestamps, but identical predictors and outcomes, my presumption would be that these rows are perfectly valid. Of course, there could always be some strange problem that could cause erroneous data like that, but you'd need to understand where the data came from to address this. – Stephan Kolassa Jan 16 '23 at 14:38
  • @AngelMarcos If you want to do this on your own, and if you're not the person who created the dataset, typically you should look at the dataset documentation (if any) to get a better idea of how the dataset has been constructed (e.g. if someone else already checked for erroneous duplicates and removed them). Otherwise, you should ask the person who designed the study or collected the data for more information about those duplicates. What you should do really depends on the context of the study, how it was designed, and how the data was collected. – J-J-J Jan 16 '23 at 16:09
  • Thank you so much to both JJJ and StephanKolassa, the discussion and your comments helped me to understand how to deal with duplicates in real data! I will go with 3. – AngelMarcos Jan 16 '23 at 17:26

1 Answer


Assuming your dataset is supposed to be a random sample drawn from your population, it depends on what you define as a duplicate. I see two scenarios:

  1. "Duplicate" defined as two (or more) distinct observations that happen to have the exact same features/values.

In this case, you should keep them, as that is just how your population of interest is. You immediately see why if you have a dataset of, say, 2,000 observations but only 2 features, each with 2 levels ("gender: man or woman" and "exam result: pass or fail"). If you drop all the observations that are identical, you'll end up with at best 4 observations: 1 man who passed the exam, 1 man who failed, 1 woman who passed, and 1 woman who failed. So in this case you would just be deleting useful information, and you'd be unable to infer anything about your population of interest.
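
A minimal illustration with made-up data: dropping duplicates in such a dataset collapses it to at most four rows and throws away exactly the proportions you would want to estimate.

```python
import numpy as np
import pandas as pd

# Made-up data: 2,000 observations, two binary variables.
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "gender": rng.choice(["man", "woman"], size=n),
    "result": rng.choice(["pass", "fail"], size=n, p=[0.7, 0.3]),
})

print(len(df))                          # 2000 observations
print(len(df.drop_duplicates()))        # at most 4 rows survive
print(df.value_counts(normalize=True))  # the proportions you would be throwing away
```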

  1. "Duplicate" defined as the same observation erroneously recorded twice or more

On the other hand, if you define a duplicate as the same observation incorrectly recorded multiple times, ideally you should drop them, because something has been interfering with the random sampling. In particular, it may be indicative of something going wrong in the data-collection process, like a software bug, a typing error, or even fraud by respondents or interviewers; see Kuriakose, N., & Robbins, M. (2016). Don't get duped: Fraud through duplication in public opinion surveys. Statistical Journal of the IAOS, 32(3), 283-291.

Duplicates of this kind may seriously bias estimates; see Sarracino, F., & Mikucka, M. (2017). Bias and efficiency loss in regression estimates due to duplicated observations: a Monte Carlo simulation. Survey Research Methods, 11(1), 17-44. So it may be a good idea not to ignore them if you know they shouldn't be there.
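
If you suspect duplicates of this second kind, a rough first step (assuming the layout described in the question, with hypothetical `id` and `ts` columns) is to flag rows that are identical on everything except ID and timestamp, and review them rather than dropping them blindly:

```python
import pandas as pd

df = pd.read_csv("data.csv")  # assumption: same hypothetical layout as in the question
check_cols = [c for c in df.columns if c not in ("id", "ts")]

# keep=False marks every member of a duplicated group, not just the later repeats,
# so each group can be inspected side by side (IDs, timestamps, who collected it, etc.).
suspects = df[df.duplicated(subset=check_cols, keep=False)]
print(suspects.sort_values(by=check_cols))
```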


In some situations, it might not be obvious whether duplicates (or near-duplicates) fall under the first or the second category, in particular when you have not been involved in the study-design and data-collection stages. If you're not sure, you should investigate the matter further, for example by asking the people who collected the data for additional information (hoping that they won't be trying to cover their tracks if it's a case of fraud on their part).

As a side note, you can find various references on preventing or detecting problems related to survey fraud or falsification (e.g. Schwanhäuser, S., Sakshaug, J. W., & Kosyakova, Y. (2022). How to catch a falsifier: Comparison of statistical detection methods for interviewer falsification. Public Opinion Quarterly, 86(1), 51-81).

J-J-J