
I am working with a pretty big dataset (800k samples) on a classification problem. What puzzles me is that models (a CNN and an MLP) with ~3,000 and ~3,000,000 parameters have pretty much the same (and decent) performance. The bigger model does not overfit, and the learning curves (loss and AUC) look similar. What are the possible reasons for that? Is the problem just too simple?

Yuri

1 Answer


The problem would be simple if the dataset contains enough information to easily separate the classes. In that situation it is very difficult to overfit.

If the data does not contain enough information to separate the classes, you should expect overfitting with some models.

The number of parameters alone does not necessarily imply overfitting either. Random forests, for example, can have very many parameters and yet do not overfit in proportion to that count.
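As a rough illustration (a minimal scikit-learn sketch on synthetic data, not your 800k-sample dataset): the forest below ends up with a very large number of fitted tree nodes, yet its held-out AUC remains high rather than degrading with the parameter count.

```python
# Minimal sketch (scikit-learn, synthetic data standing in for a real dataset):
# count the fitted tree nodes of a random forest and compare train/test AUC.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Rough "parameter count": each node stores split or leaf information.
n_nodes = sum(est.tree_.node_count for est in rf.estimators_)
print("total tree nodes:", n_nodes)

print("train AUC:", roc_auc_score(y_tr, rf.predict_proba(X_tr)[:, 1]))
print("test  AUC:", roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))
```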

As a sanity check, try one-nearest-neighbour classification. It is easy to implement and overfits horribly. If that doesn't overfit, not much else will.
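A minimal sketch of that check with scikit-learn, using synthetic data as a stand-in for your own X and y. On your data, a large gap between training and held-out accuracy would mean the problem is genuinely hard to generalize; a small gap supports the "problem is simple" explanation.

```python
# Minimal 1-NN sanity check (scikit-learn; synthetic data as a stand-in for X, y).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

knn = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)

# Training accuracy is trivially perfect (each point is its own nearest
# neighbour, barring duplicates); the question is how far the held-out
# accuracy falls below it.
print("train accuracy:", knn.score(X_tr, y_tr))
print("test  accuracy:", knn.score(X_te, y_te))
```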

David Ernst
  • Ah, that is what I meant (there should be no "not" in my question). Thank you for noticing! I tried to play with a toy dataset (IRIS) to see what is going on in that case (https://stats.stackexchange.com/questions/299645/cannot-overfit-on-the-iris-dataset) and couldn't overfit either even though the problem is not that simple. The only reasonable idea was "dataset is too small", but probably the problem is simple. – Yuri Aug 29 '17 at 01:12