
I am training a binary classifier (e.g. logistic regression) on some multidimensional problem. I have tried leave-one-out and k-fold cross-validation. I have tried L1 and L2 regularization, and I have produced plots sweeping the regularization parameter over multiple orders of magnitude.

The typical behaviour is that, for some range of regularization parameters, the testing accuracy improves significantly. However, I have some datasets where the testing accuracy stays below chance (e.g. below 40%) for all values of the regularization parameter, while the training accuracy transitions between 100% and chance.

What does that mean? Is this expected for datasets that are not predictive of the labels? Or is this an indication that there is overfitting happening and that L1/L2 regularizers are potentially suboptimal for those datasets?
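For concreteness, here is a minimal sketch of the kind of sweep I mean, assuming R and the glmnet package (cv.glmnet runs the k-fold cross-validation over a grid of penalty strengths lambda; alpha = 1 gives the L1 penalty, alpha = 0 the L2 penalty; X and y are placeholders for my predictor matrix and 0/1 labels):

library(glmnet) # Illustrative; any penalized-logistic-regression implementation would do

# X: numeric predictor matrix, y: 0/1 labels (placeholders)
cv_l1 <- cv.glmnet(X, y, family = "binomial", alpha = 1,
                   type.measure = "class", nfolds = 10) # L1 penalty, 10-fold CV, misclassification error
cv_l2 <- cv.glmnet(X, y, family = "binomial", alpha = 0,
                   type.measure = "class", nfolds = 10) # L2 (ridge) penalty

plot(cv_l1) # Cross-validated misclassification error across the lambda grid
plot(cv_l2)
cv_l1$lambda.min # Penalty value with the lowest cross-validated error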

1 Answer


I think a good definition of overfitting is that a model achieves higher in-sample but lower out-of-sample performance than a simpler model. Andrew Gelman, for instance, uses this definition (more or less).

It sounds like your models can achieve reasonably good in-sample performance yet are exposed as having poor generalizability when they are assessed out-of-sample.

If a simple model that always predicts at the chance level, computed from the in-sample data$^{\dagger}$, outperforms your complex models on the out-of-sample data even though the complex models do better in-sample, then you have a simple model doing worse than a complex model in-sample yet better out-of-sample, and that satisfies the definition of overfitting.

$^{\dagger}$ I give an argument here for why the chance level should come from the training data, a stance now supported by an article in The American Statistician (Hawinkel, Waegeman & Maere, 2023).
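As a minimal sketch of what that benchmark looks like (y_train and y_test are hypothetical placeholders for your own 0/1 label vectors), the chance-level model simply predicts the training set's majority class for every observation:

majority <- as.integer(mean(y_train) >= 0.5) # Majority class in the training data
chance_in <- mean(y_train == majority) # Chance-level accuracy in-sample
chance_out <- mean(y_test == majority) # Accuracy of that same trivial rule out-of-sample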

EDIT (RESPONSE TO COMMENTS)

I gave a link to a simulation in the comments, and a response remarked that the problem there was regression rather than classification, and that it was not surprising to see overfitting on the MSE metric. To that I respond:

  1. MSE is a function of the classification accuracy when it is applied to $0/1$ labels and $0/1$ predictions. Summing the squared differences between the $0/1$ labels and the $0/1$ predictions is equivalent to counting the misclassifications, and dividing by the sample size gives the proportion of misclassified observations. That proportion plus the proportion of observations classified correctly (accuracy expressed as a proportion rather than a percentage) must equal $1$, so $\text{MSE} = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat y_i)^2 = 1 - \text{accuracy}$.

A simulation might be of help.

set.seed(2023)
N <- 1000
R <- 10000
acc <- mse <- rep(NA, R)
for (i in 1:R){

  y <- rbinom(N, 1, runif(N, 0, 1)) # Simulate binary outcomes
  yhat <- rbinom(N, 1, runif(N, 0, 1)) # Simulate binary predictions

  acc[i] <- length(which(y == yhat))/N # Accuracy
  mse[i] <- mean((y - yhat)^2) # MSE

}

cor(acc, mse) # I get -1: accuracy and MSE are functions of each other.
table(acc + mse) # All 1s; as expected, accuracy + MSE = 1 is the relationship.

  2. The classification situation is not so difficult to simulate. The results are about the same as before.
set.seed(2023)

# Define sample size
N <- 1000

# Define number of parameters
p <- 950

# Simulate data
X <- matrix(rnorm(N*p), N, p)

# Define the parameter vector to be 0, 0, ..., 0, 0, so no relationship
B <- rep(0, p) # c(1, rep(0, p-1))

# Define the conditional log-odds
z <- X %*% B

# Transform the log-odds to probability
p <- 1/(1 + exp(-z))

# Define the response variable
y <- rbinom(N, 1, p)

# Fit to 80% of the data
L <- glm(y[1:800] ~ ., data = data.frame(X[1:800, ]), family = binomial)

# Predict on the in-sample data
preds_in <- round(1/(1 + exp(-predict.glm(L, data.frame(X[1:800, ])))))

# Predict on the remaining 20%
preds_out <- round(1/(1 + exp(-predict.glm(L, data.frame(X[801:1000, ])))))

# Show the in-sample and out-of-sample accuracy,
# assuming the default threshold of 0.5 probability
length(which(preds_in == y[1:800]))/length(preds_in)
length(which(preds_out == y[801:1000]))/length(preds_out)

# Show the chance level from the training data
mean(y[1:800])

I get perfect in-sample accuracy, an out-of-sample accuracy of $51\%$, and a chance-level accuracy of $53.375\%$, higher than the out-of-sample accuracy. Thus, the simulation demonstrates the desired (in a way) situation where in-sample accuracy is great, yet out-of-sample accuracy is worse than the chance level.

I also get that the chance-level for the out-of-sample data is $53.5\%$ accuracy (mean(y[801:1000])), so while I would argue against using an out-of-sample calculation as your point of comparison, the results here do not hinge on using the chance level from in- or out-of-sample data.

REFERENCE

Stijn Hawinkel, Willem Waegeman & Steven Maere (2023). Out-of-sample R²: estimation and inference. The American Statistician. DOI: 10.1080/00031305.2023.2216252

Dave
  • Thank you for your answer. I do not currently follow how it addresses my main concern, sorry if I'm missing it. Maybe I did not explain my concern well enough, so I'll try again now. I feel that it is pathological for the out-of-sample accuracy to be below chance. I am not sure, but this seems to imply that the in-sample data is not representative of the out-of-sample data. I don't remember any more on which dataset I observed this phenomenon, but the simplest explanation I see is that I had forgotten to shuffle the data before splitting it into training and test sets. – Aleksejs Fomins Jun 29 '23 at 14:39
  • @AleksejsFomins If you way overfit to the training data, you can do much worse than chance on the out-of-sample data. Check out my simulation here for some starting code to create an example, and compare the MSE in that overfit model to the MSE of a model that predicts the training mean every time. That doesn’t imply a data issue. The issue is that my modeling in that simulation intentionally badly overfits, and we use techniques like out-of-sample validation to catch such issues. – Dave Jun 29 '23 at 14:42
  • Very interesting, thank you. I will have a look – Aleksejs Fomins Jun 29 '23 at 14:46
  • I have had a look at your simulations, but as far as I can tell they focus on regression and not classification. It is not at all surprising to me that MSE is worse for test data than for training data. Do you have an example simulation where accuracy on test data is consistently and significantly below 50%, given that the training set is drawn from the same distribution as the test set? – Aleksejs Fomins Jun 29 '23 at 16:15
  • @AleksejsFomins I have responded to your comments with an edit to my answer. – Dave Jun 29 '23 at 17:19