2

I am building a CNN classification model. However, my data contains some duplicate images. Is it acceptable to remove the duplicates? If so, what technique can I use to detect and remove them?

kha
  • 255

1 Answer

1

If your dataset is large and contains only a few duplicates, removing them is not strictly necessary, but still recommended. If there are many duplicates, though, then you really should remove them, since the classifier tends to overfit those images and perform worse on unseen data. (By the way, you should also make sure that each class has the same number of training examples.)

If the images are exactly the same, you can hash them and simply discard every image whose hash has already been seen.
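A minimal sketch of the hashing idea, assuming the images sit as files under one directory. The function names `file_digest` and `find_exact_duplicates` are made up for this illustration. It hashes raw file bytes, so it only catches byte-identical files; the same picture saved in a different format or with different metadata would not match.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_exact_duplicates(root):
    """Return (duplicate, original) path pairs for byte-identical files.

    Files are visited in sorted order, so the first path seen for a
    given digest is treated as the original.
    """
    seen = {}        # digest -> first path with that content
    duplicates = []  # (later path, first path) pairs
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        digest = file_digest(path)
        if digest in seen:
            duplicates.append((path, seen[digest]))
        else:
            seen[digest] = path
    return duplicates
```

You could then review the returned pairs and delete or move the first element of each pair. It is probably worth running this over the whole dataset before splitting it, so the same image does not end up in both the training and the test set.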

If they show the same thing but differ slightly, you can try an approach based on the covariance matrix (I don't know exactly how, though; I have only heard that this is one approach).
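One possible reading of the covariance idea, offered only as a guess at what was meant: compare two images by the Pearson correlation of their pixel values, which is the covariance normalized by both standard deviations. The sketch below assumes each image has already been converted to grayscale, resized to a common small size, and flattened into a list of numbers. The names `pearson` and `near_duplicate_pairs` and the threshold of 0.98 are illustrative assumptions, not established values.

```python
import math


def pearson(a, b):
    """Pearson correlation (normalized covariance) of two pixel sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("images must be non-empty and the same size")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # A constant image has no variation to correlate with.
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def near_duplicate_pairs(images, threshold=0.98):
    """Return name pairs whose correlation is at least `threshold`.

    `images` maps a name to a flat list of grayscale pixel values.
    The threshold is an assumed starting point and needs tuning.
    """
    names = list(images)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if pearson(images[first], images[second]) >= threshold:
                pairs.append((first, second))
    return pairs
```

This compares every pair, so the cost grows quadratically with the number of images, and it is insensitive to overall brightness and contrast changes but not to crops, shifts, or rotations. Flagged pairs should be checked by eye before anything is deleted.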