
I have the problem that my logistic regression model overfits, even though I use the combined L1 and L2 penalty ('elastic net').

I have a data set with 496 features and 186 samples and want to predict a binary target. For this purpose, I settled on logistic regression (because of the binary classification) with regularization, specifically elastic net, as it enables the model to drop features completely (because of the L1 penalty), which is very important given my feature-to-sample ratio (which is definitely not ideal, but as this is part of a course on digital science, my goal is a theoretically correct implementation of this kind of algorithm and not necessarily the best algorithm).
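
For reference, my understanding of scikit-learn's parameterization (i.e. the meaning of the C and l1_ratio used below) is roughly

$$\min_{w,\,c}\ \frac{1-\rho}{2}\,w^\top w \;+\; \rho\,\lVert w\rVert_1 \;+\; C \sum_{i=1}^{n} \log\!\big(1 + \exp(-y_i(x_i^\top w + c))\big),$$

where $\rho$ is the L1-ratio, $y_i \in \{-1, 1\}$, and a smaller $C$ means a stronger penalty. Please correct me if this is off.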

After tuning the hyper-parameters using cross-validation, the accuracy of my model was 1 for the training data but only 0.53 for the test data. I had assumed that using penalties and cross-validation would keep me from overfitting this badly, but I am very new to machine learning, so maybe I made some mistakes that you could hopefully highlight.

I used the following packages and functions:

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
import pandas as pd
import scripts.functions as fn # my own functions sheet
import numpy as np

I got my data:

# Setup 
## Get data incl. target
data = fn.get_data()
data.dropna(inplace = True)

X = data.drop(labels=["target"], axis=1)
X = X.to_numpy()
X_scaled = StandardScaler().fit_transform(X)

Y = data["target"].to_numpy()

I split the data into training and test sets:

x_train, x_test, y_train, y_test = train_test_split(X_scaled, Y, test_size=0.2, random_state=35)

I initialized the hyper-parameters and the model:

# set hyper parameters
l1_ratios = np.arange(0, 1, 0.01)
cs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3, 1e4]

model = LogisticRegressionCV(cv=5, penalty='elasticnet', solver='saga', l1_ratios=l1_ratios, Cs=cs, max_iter=5000, n_jobs=-1, refit=True)

Then I fit my training data and retrieve the selected best values of C and L1-Ratio and the accuracy for the training data:

model.fit(x_train, y_train)

acc_train = model.score(x_train, y_train)

print(f"The accuracy of the model for the training data is {round(acc_train, 3)}") Out: The accuracy of the model for the training data is 1.0

print(f"The hyper-parameter C is: {model.C_[0]}.\nThe hyper-parameter L1-Ratio is: {model.l1_ratio_[0]}") Out: The hyper-parameter C is: 0.1. The hyper-parameter L1-Ratio is: 0.07

Subsequently I wanted to validate the model using the 'independent' test data:

acc_val = model.score(x_test, y_test)

print(f"The accuracy of the model for the test data is {round(acc_val, 3)}") Out: The accuracy of the model for the validation data is 0.526

And now I am lost. To my understanding, using a model with some kind of penalty and then using cross-validation to tune that penalty are both steps meant to prevent overfitting. It seems like I would need a stronger penalty so that the accuracies for my training and test data converge, but shouldn't the cross-validation have found that already?
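
To get a feel for how noisy a single 80/20 split of 186 samples is, one thing I could do (a sketch only; it re-uses the X and Y from above and plugs in the selected C=0.1 and l1_ratio=0.07 purely for illustration) is to score the whole scaling-plus-model pipeline with repeated stratified cross-validation and log-loss instead of a single hold-out accuracy:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# scaler lives inside the pipeline, so it is re-fitted on each training fold
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="elasticnet", solver="saga",
                       C=0.1, l1_ratio=0.07, max_iter=5000),
)

# 5-fold CV repeated 10 times -> 50 hold-out estimates instead of a single split
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(pipe, X, Y, cv=cv, scoring="neg_log_loss", n_jobs=-1)

print(f"log-loss: {-scores.mean():.3f} (+/- {scores.std():.3f})")

The spread of those 50 scores would show how much a single train/test split can move around with this sample size.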

I assume that the problem is the data I have, meaning the proportionally large number of features for these few samples. But if there are other mistakes in my general procedure, I'd be very thankful for any advice!

P.S.: Unfortunately, I cannot share the data I use, and using different data would probably not reproduce the same problem. So please focus on my procedure.

    Some questions/comments: (1) Shrinkage is shrinkage. Allowing L1 shrinkage may help interpretability, but there is no reason to assume it would work better than L2 only on high-dimensional data. (2) What is the proportion of 1s and 0s in your binary target? Why are you optimizing accuracy? (3) Do you get similar numbers with a different seed? A single run of 5-fold cross-validation is less than what is typically recommended. (4) Scaling your data before splitting into train/test is leaking information from the test set. – Frans Rodenburg May 30 '23 at 12:44
  • Regarding (2), you may want to have a look here – Frans Rodenburg May 30 '23 at 12:48
  • (1) Are 496 features considered "high-dimensional"? (2) 94 times 1 (50.5%) and 92 times 0 (49.5%). I'm new to machine learning and thought accuracy was the way to go; I think I will change to the log-loss, but I did not understand everything from the posts about accuracy at first sight. (3) I tested a different seed (=1) and 10-fold CV instead of 5-fold. The accuracy (sorry, did not change it for that run) was 81% for the training data and 45% for the test data. Interestingly, the L1-ratio was now .97 (instead of .07). (4) Changed it (see the sketch after these comments). - Thanks for the answer! – peer May 30 '23 at 13:46
  • That ignores the key issues I described. This requires a lot of background study. – Frank Harrell May 30 '23 at 14:22
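
A leakage-free version of the scaling discussed in comment (4) could look roughly like this (a sketch that keeps the variable names from the question and fits the scaler on the training split only):

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# split the raw, unscaled features first ...
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=35)

# ... then learn the means and standard deviations from the training data only
scaler = StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)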

1 Answer


Your sample size is not within a factor of 10 of being adequate to meet your goals. The list of "selected" variables will essentially be a random draw. See here. In general, the chance of elastic net finding the "right" variables is essentially zero (and it is worse for the lasso). As discussed here, the minimum sample size needed just to estimate the logistic model's intercept parameter is $n=96$. The minimum sample size needed to fit one pre-specified binary feature in a logistic model is $n=184$.
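
For context, the $n=96$ figure presumably comes from the usual margin-of-error argument: with no covariates, estimating the logistic intercept amounts to estimating a single proportion, and requiring that estimate to be within $\pm 0.1$ of the true probability with 0.95 confidence in the worst case $p=0.5$ gives

$$n = \left(\frac{z_{0.975}}{0.1}\right)^{2} p\,(1-p) = \left(\frac{1.96}{0.1}\right)^{2} \times 0.25 \approx 96.$$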

Were the target to be a continuous variable with little measurement error, things would not be quite as bleak.

Frank Harrell
  • I was afraid of this kind of issue. Thanks for the feedback and the references! Nevertheless, as my goal is the theoretically correct implementation, I tried the model with the negative log-loss and now got very similar values for the training and the test data (-0.6931 and -0.6935, respectively). Also, the weight of the penalty is quite high (C=0.001), which was not the case for most of my previous computations. But I am not sure how to interpret the log-losses; I read that either values close to 0 or 1 are good, but I did not find anything regarding negative values. – peer May 31 '23 at 08:43
  • Nothing like that will be interpretable or reliable with such an incredibly low sample size. I recommend starting with data reduction (unsupervised learning) that is masked to $Y$, as exemplified here. Sparse principal components is one promising technique. Also make absolutely sure that the target is truly all-or-nothing binary. Otherwise a continuous or ordinal analysis should be undertaken, which helps with the effective sample size (but not enough in your case). – Frank Harrell May 31 '23 at 11:52
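
A note on the log-loss values in the comment above: if they come from scikit-learn's scoring="neg_log_loss", the sign is flipped so that higher is better, and a score of $-0.6931$ corresponds to a log-loss of about

$$-\frac{1}{n}\sum_{i=1}^{n}\big[y_i\ln\hat p_i + (1-y_i)\ln(1-\hat p_i)\big] \approx 0.6931 \approx \ln 2,$$

which is exactly what a model achieves by predicting $\hat p_i = 0.5$ for every sample. In other words, with $C = 0.001$ the penalty appears to be strong enough to shrink the model to essentially always predicting the base rate.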