Consider the two pairs of learning curves below.
- The red and green lines are the training and validation curves of some model 1, and
- the gray and orange lines are the training and validation curves of some model 2.
Both models were trained on the same data; the only difference is that model 1 is more complex than model 2 (see below for full specs).
I initially had model 1 only, but because of the wide gap between its training and validation curves, I concluded that it suffered from overfitting. I then designed model 2 with fewer hidden units to mitigate this. Although the gap between training and validation did shrink significantly, the overall loss worsened. I'm therefore confused about what this means. On the one hand, model 2 is "healthier" since it generalizes better, but on the other hand its predictive power is worse. I naively thought that reducing overfitting would also reduce the loss, but it seems there is a lose-lose trade-off between the two.
The setup:
- 3 million data points, of which 20% are held out as a validation set.
- 33 features that are either boolean or scalars normalized in the $[0, 1]$ range
- 1 boolean target
- The models are multi-layer perceptrons trained with mean absolute error as the loss. The hidden units use ReLU activations and the output is a single sigmoid unit. (I wanted prediction scores in the $[0, 1]$ range, as opposed to a hard boolean classification, so as to have a quantification of confidence.)
- The layer sizes of model 1 are [16, 8, 1] and those of model 2 are [8, 4, 1].
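To make the two architectures concrete, here is a minimal NumPy sketch of the forward pass described above (ReLU hidden layers, single sigmoid output, MAE as the loss). The helper names (`init_mlp`, `forward`, `mae`) are my own for illustration; the question does not state which framework was actually used, and the weights here are random, untrained.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_mlp(layer_sizes, n_features):
    """Random (untrained) weights for an MLP with the given layer sizes."""
    sizes = [n_features] + layer_sizes
    return [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """ReLU on all hidden layers, sigmoid on the final output unit."""
    for w, b in params[:-1]:
        x = relu(x @ w + b)
    w, b = params[-1]
    return sigmoid(x @ w + b)

def mae(y_true, y_pred):
    """Mean absolute error, the loss named in the question."""
    return np.mean(np.abs(y_true - y_pred))

# Dummy data mimicking the setup: 33 features in [0, 1], boolean target.
X = rng.random((1000, 33))
y = rng.integers(0, 2, size=(1000, 1)).astype(float)

model1 = init_mlp([16, 8, 1], n_features=33)  # the larger model
model2 = init_mlp([8, 4, 1], n_features=33)   # the smaller model

print(mae(y, forward(model1, X)))
print(mae(y, forward(model2, X)))
```

Because the sigmoid output is a score in $(0, 1)$ rather than a hard class, the MAE directly measures how far the confidence scores are from the 0/1 targets.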
