I think a good definition of overfitting is that a model achieves higher in-sample but lower out-of-sample performance than a simpler model. Andrew Gelman, for instance, uses this definition (more or less).
It sounds like your models can achieve reasonably good in-sample performance yet are exposed as having poor generalizability when they are assessed out-of-sample.
If a simple model that always predicts at the chance level derived from the in-sample data$^{\dagger}$ outperforms your complex models on the out-of-sample data, despite the complex models doing better in-sample, then you have a simple model doing worse than a complex model in-sample yet better out-of-sample, and that would seem to satisfy the definition of overfitting.
$^{\dagger}$ I give an argument here for why the chance level should come from the training data, a stance now supported by an article in The American Statistician (Hawinkel, Waegeman & Maere, 2023).
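As a concrete (and deliberately simplistic) illustration of that stance, here is a minimal R sketch in which the chance level is operationalized as the accuracy of always predicting the majority class found in the training labels; the y_train and y_test vectors are made up purely for the example.

set.seed(1)
y_train <- rbinom(800, 1, 0.55) # Hypothetical training labels
y_test <- rbinom(200, 1, 0.55)  # Hypothetical test labels
# Majority class and its chance-level accuracy, both computed from the training data only
majority_class <- as.numeric(mean(y_train) >= 0.5)
chance_level <- mean(y_train == majority_class)
chance_level
# The same training-based baseline evaluated on the out-of-sample data
mean(y_test == majority_class)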
EDIT (RESPONSE TO COMMENTS)
I gave a link to a simulation in the comments, and a response remarked that the problem there was regression rather than classification, so it was not surprising to see overfitting on the MSE metric. To that I respond:
- MSE is a function of the classification accuracy when the MSE is applied to $0/1$ labels and $0/1$ predictions. Summing the squared differences between the $0/1$ labels and the $0/1$ predictions is equivalent to counting the misclassifications, and dividing by the sample size then gives the proportion of misclassified observations. That proportion plus the proportion of observations classified correctly (accuracy expressed as a proportion instead of a percentage) must equal $1$, so $\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2 = \frac{\#\{i : y_i \neq \hat{y}_i\}}{N} = 1 - \text{accuracy}$.
A simulation might be of help.
set.seed(2023)
N <- 1000
R <- 10000
acc <- mse <- rep(NA, R)
for (i in 1:R){
  y <- rbinom(N, 1, runif(N, 0, 1))    # Simulate binary outcomes
  yhat <- rbinom(N, 1, runif(N, 0, 1)) # Simulate binary predictions
  acc[i] <- length(which(y == yhat))/N # Accuracy
  mse[i] <- mean((y - yhat)^2)         # MSE
}
cor(acc, mse) # I get -1: accuracy and MSE are functions of each other.
table(acc + mse) # All 1s; as expected, accuracy + MSE = 1 is the relationship.
- The classification situation is not so difficult to simulate. The results are about the same as before.
set.seed(2023)
# Define the sample size
N <- 1000
# Define the number of parameters
p <- 950
# Simulate the predictors
X <- matrix(rnorm(N*p), N, p)
# Define the parameter vector to be 0, 0, ..., 0, so no relationship
B <- rep(0, p)
# Define the conditional log-odds
z <- X %*% B
# Transform the log-odds to probabilities (new name to avoid overwriting p)
prob <- 1/(1 + exp(-z))
# Define the response variable
y <- rbinom(N, 1, prob)
# Fit to 80% of the data
L <- glm(y[1:800] ~ ., data = data.frame(X[1:800, ]), family = binomial)
# Predict on the in-sample data
preds_in <- round(1/(1 + exp(-predict.glm(L, data.frame(X[1:800, ])))))
# Predict on the remaining 20%
preds_out <- round(1/(1 + exp(-predict.glm(L, data.frame(X[801:1000, ])))))
# Show the in-sample and out-of-sample accuracy,
# assuming the default threshold of 0.5 probability
length(which(preds_in == y[1:800]))/length(preds_in)
length(which(preds_out == y[801:1000]))/length(preds_out)
# Show the chance level from the training data
mean(y[1:800])
I get perfect in-sample accuracy, an out-of-sample accuracy of $51\%$, and a chance-level accuracy of $53.375\%$, higher than the out-of-sample accuracy. Thus, the simulation demonstrates the desired (in a way) situation where in-sample accuracy is excellent, yet out-of-sample accuracy is worse than the chance level.
I also get that the chance level for the out-of-sample data is $53.5\%$ accuracy (mean(y[801:1000])), so while I would argue against using an out-of-sample calculation as your point of comparison, the results here do not hinge on whether the chance level comes from the in-sample or the out-of-sample data.
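To tie this back to the first bullet point, the same predictions can be scored on the MSE scale; reusing y, preds_in, and preds_out from the simulation above, the following check should show the identical gap, since MSE computed on $0/1$ values is just one minus the accuracy.

mean((preds_in - y[1:800])^2)     # In-sample MSE: 0, matching the perfect in-sample accuracy
mean((preds_out - y[801:1000])^2) # Out-of-sample MSE: about 0.49, i.e., 1 - 0.51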
REFERENCE
Stijn Hawinkel, Willem Waegeman & Steven Maere (2023) Out-of-sample $R^2$: estimation and inference, The American Statistician, DOI: 10.1080/00031305.2023.2216252