I have the problem that my logistic regression model overfits, even though I use the combined L1 and L2 penalty ('elastic net').
I have a data set with 496 features and 186 samples and want to predict a binary target. For this purpose I chose logistic regression (since the target is binary) with regularization, specifically elastic net, because its L1 component allows the model to drop features completely, which is very important given my feature-to-sample ratio. That ratio is definitely not ideal, but since this is part of a course on digital science, my goal is a theoretically correct implementation of this kind of algorithm, not necessarily the best possible algorithm.
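For reference, this is my understanding of the elastic-net penalty term as scikit-learn defines it (just an illustrative sketch, with w as the coefficient vector and l1_ratio as the mixing parameter; the inverse regularization strength C scales the data-fit loss, not this term):

import numpy as np

def elastic_net_penalty(w, l1_ratio):
    # The L1 part can push coefficients to exactly zero (feature dropping);
    # the L2 part only shrinks them towards zero.
    return l1_ratio * np.sum(np.abs(w)) + (1 - l1_ratio) / 2 * np.sum(w ** 2)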
After tuning the hyper-parameters using cross-validation, the accuracy of my model was 1.0 on the training data but only 0.53 on the test data. I had assumed that using penalties and cross-validation would protect me from overfitting this badly, but I am very new to machine learning, so maybe I made some mistakes, which you could hopefully highlight.
I used the following packages and functions:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
import pandas as pd
import scripts.functions as fn # my own functions sheet
import numpy as np
I got my data:
# Setup
## Get data incl. target
data = fn.get_data()
data.dropna(inplace=True)
X = data.drop(labels=["target"], axis=1)
X = X.to_numpy()
X_scaled = StandardScaler().fit_transform(X)
Y = data["target"].to_numpy()
## Split data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(X_scaled, Y, test_size=0.2, random_state=35)
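As a quick sanity check on the split (assuming all 186 rows survive the dropna, test_size=0.2 should leave about 148 training and 38 test rows):

print(x_train.shape, x_test.shape)  # expected roughly (148, 496) and (38, 496)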
I initialized the hyper-parameters and the model:
# set hyper parameters
l1_ratios = np.arange(0, 1, 0.01)
cs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3, 1e4]
model = LogisticRegressionCV(cv=5, penalty='elasticnet', solver='saga',
l1_ratios=l1_ratios, Cs=cs, max_iter=5000,
n_jobs=-1, refit=True)
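Just to put the search in perspective (a rough back-of-the-envelope check, nothing more): the grid has 100 values for l1_ratios times 10 values for cs, and with cv=5 each fold is fitted on about four fifths of the training rows, so roughly 118 samples against the 496 features:

n_candidates = len(l1_ratios) * len(cs)
n_fold_train = len(x_train) * 4 // 5
print(f"{n_candidates} hyper-parameter combinations, ~{n_fold_train} samples per CV training fold")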
Then I fit the model on my training data and retrieved the selected best values of C and the L1 ratio, plus the accuracy on the training data:
model.fit(x_train, y_train)
acc_train = model.score(x_train, y_train)
print(f"The accuracy of the model for the training data is {round(acc_train, 3)}")
Out:
The accuracy of the model for the training data is 1.0
print(f"The hyper-parameter C is: {model.C_[0]}.\nThe hyper-parameter L1-Ratio is: {model.l1_ratio_[0]}")
Out:
The hyper-parameter C is: 0.1.
The hyper-parameter L1-Ratio is: 0.07
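For completeness, the number of features the L1 part actually dropped can be read off the fitted coefficients (model.coef_ has shape (1, n_features) for a binary target):

n_zero = np.sum(model.coef_[0] == 0)
print(f"{n_zero} of {model.coef_.shape[1]} coefficients are exactly zero")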
Subsequently, I wanted to validate the model using the 'independent' test data:
acc_val = model.score(x_test, y_test)
print(f"The accuracy of the model for the test data is {round(acc_val, 3)}")
Out:
The accuracy of the model for the test data is 0.526
And now I am lost. To my understanding, using a model with some kind of penalty and then tuning that penalty via cross-validation are both steps that should prevent overfitting. It seems like I would need a stronger penalty so that the accuracies on my training and test data converge, but shouldn't that be exactly what the cross-validation selects?
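In case it helps with the diagnosis: LogisticRegressionCV stores the per-fold grid scores in its scores_ attribute (a dict keyed by class label; with elastic net, each array has shape (n_folds, n_Cs, n_l1_ratios)), so the mean cross-validated accuracy of the selected model can be compared with the train/test accuracies above:

cv_scores = list(model.scores_.values())[0]  # shape (5, len(cs), len(l1_ratios))
mean_scores = cv_scores.mean(axis=0)         # mean accuracy over the 5 folds
print(f"Best mean CV accuracy over the grid: {mean_scores.max():.3f}")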
I assume that the problem is my data, i.e. the proportionally large number of features for these few samples. But if there are other mistakes in my general procedure, I'd be very thankful for any advice!
P.S.: Unfortunately, I cannot share the data I use, and using different data would probably not reproduce the same problem. So please focus on my procedure.