
I am performing a logistic elastic net regression to assess which variables influence the outcome and to evaluate the model. I am working with an imbalanced dataset of 50 cases and 1700 controls. My objective is to find the best approach for model development and evaluation.

At first I performed a traditional train-test split with an 80-20 ratio, keeping the same proportion of cases in the training and test data. However, I ended up with very few cases in the test set.
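
For concreteness, a minimal sketch of this kind of stratified 80-20 split (using scikit-learn and synthetic stand-in data in place of the real dataset) could look like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: ~50 cases and ~1700 controls, 10 arbitrary features
rng = np.random.default_rng(0)
n_cases, n_controls, n_features = 50, 1700, 10
X = rng.normal(size=(n_cases + n_controls, n_features))
y = np.concatenate([np.ones(n_cases, dtype=int),
                    np.zeros(n_controls, dtype=int)])

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,    # 80-20 split
    stratify=y,        # keep the case:control ratio the same in both parts
    random_state=42,
)
print(int(y_test.sum()))   # roughly 0.20 * 50 = 10 cases end up in the test set
```

With only 50 cases, stratification leaves roughly 10 of them in the test set, which is what makes the test metrics so noisy.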

I was wondering if I could split the analysis in two parts:

1- An exploratory model fitted on all the data, to visualize the coefficients, see which variables influence the outcome, and gauge their relative importance.

2- A cross-validation analysis in which I repeatedly split the data, train and test the model on each split, and evaluate its performance.

I would like some feedback, because I am not entirely sure whether there would be any concern with the second approach.
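
For illustration, a rough sketch of this two-part idea (assuming scikit-learn, synthetic stand-in data, and placeholder hyperparameters such as l1_ratio and C that would normally be tuned) could be:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 50 cases, 1700 controls
rng = np.random.default_rng(0)
X = rng.normal(size=(1750, 10))
y = np.concatenate([np.ones(50, dtype=int), np.zeros(1700, dtype=int)])

enet = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="elasticnet", solver="saga",
                       l1_ratio=0.5, C=1.0, max_iter=5000),  # placeholder values
)

# Part 1: exploratory fit on all the data, used only to look at the coefficients
enet.fit(X, y)
coefs = enet.named_steps["logisticregression"].coef_.ravel()

# Part 2: performance estimated with repeated stratified cross-validation,
# refitting the model from scratch on every training fold
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=1)
auc = cross_val_score(enet, X, y, cv=cv, scoring="roc_auc")
print(coefs.round(2), round(auc.mean(), 3))
```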

Xfar
If you have only $50$ cases and less than $3\%$ prevalence of cases in the population, then you are inevitably going to be limited in what conclusions you can draw, whatever method you use, unless your variables provide clear separation between cases and non-cases – Henry Sep 13 '23 at 10:31
@Henry we are aware that this is a big limitation, and this is an exploratory analysis. I thought about the second approach to maximize the use of the data for prediction, but I was unsure whether this approach would raise an eyebrow. – Xfar Sep 13 '23 at 10:45

1 Answer


Since you don't have a lot of data, and are explicitly looking at model selection and tuning, your first approach runs a high risk of overfitting. You may well see something that looks like a relationship, and start explaining to yourself just why there is a relationship... lying to yourself in the process. (Humans are very good at retroactively finding stories to "explain" "patterns" they see, whether or not they exist.) Unfortunately, since you have used all your data in this exploratory step, you don't have any independent data left on which to test any relationships you may have "found". The regularization that the elastic net brings will help, but it is not a panacea.

In the end, there is unfortunately no simple way to avoid the fact that if you have little data, you have a hard time learning from it. I would recommend that you let yourself be guided strongly by available knowledge or theory about your situation. A "data mining" approach will very likely result in overfitting as above.

Stephan Kolassa
  • What Stephan stated is all-important for your approach. In addition, the dataset is far too small for split-sample validation to work. Resampling, when all analysis steps are repeated afresh for each resample, works far better than data splitting unless $N$ is huge (e.g., $> 20000$). – Frank Harrell Sep 13 '23 at 11:38
  • Thanks for your answer. I am testing several outcomes with varying numbers of cases, and my idea was to find features that are shared between the outcomes and rank their relative importance. I wanted to discard models with little to no predictive power.

    I performed the train-test split, but obviously the number of cases in the test data is really low. That is why I wondered whether it would be wrong to take the coefficients from a model fitted on the whole dataset, while taking the metrics from cross-validation, training and testing as if that whole-data model didn't exist.

    – Xfar Sep 13 '23 at 13:02
  • Hm. If you are doing this exercise with multiple outcomes, that multiplies the possible number of models (google Gelman's "Garden of forking paths" paper). I would be extremely careful about drawing conclusions from this analysis, other than "this association looks intriguing and may merit following up". – Stephan Kolassa Sep 13 '23 at 13:27
  • That is the plan; this is completely exploratory, and it will be stated as such very clearly. In my first approach I checked the results, and the obvious and expected variables appeared as the most relevant features, which gave me some peace of mind. I think I will go for the most conservative approach and only consider models with a high F1-score, precision and recall for cases. I will definitely check that paper! – Xfar Sep 13 '23 at 13:41
  • Ah. I would recommend being VERY CAREFUL about F1, precision and recall, especially in an "unbalanced" situation. They all suffer from the exact same issues as accuracy as a KPI. – Stephan Kolassa Sep 13 '23 at 14:08
  • I was only going to check how well the models perform on the cases, to avoid the inflation due to correctly classified controls. I will check for alternatives. – Xfar Sep 13 '23 at 14:17
  • Restricting the evaluation to a subset of your data (e.g., cases, as opposed to controls) will only change the way an evaluation using F1/precision/recall incentivizes your model to fail. (Also, why wouldn't you include non-cases? These are also data points. Few analyses profit from throwing data away you already paid for. This is related.) – Stephan Kolassa Sep 13 '23 at 14:26
  • I didn't know about the existence of the Brier score, thank you very much for your insight :) – Xfar Sep 13 '23 at 14:41
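
For reference, here is a hedged sketch of what the comments above point towards: resampling in which the full tuning procedure is repeated afresh in every fold, evaluated with proper scoring rules such as the Brier score and log loss rather than F1/precision/recall. The data, grid and hyperparameter ranges are placeholders, not a prescription.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 50 cases, 1700 controls
rng = np.random.default_rng(0)
X = rng.normal(size=(1750, 10))
y = np.concatenate([np.ones(50, dtype=int), np.zeros(1700, dtype=int)])

enet = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="elasticnet", solver="saga", max_iter=5000),
)
param_grid = {
    "logisticregression__C": [0.01, 0.1, 1.0],        # placeholder grid
    "logisticregression__l1_ratio": [0.1, 0.5, 0.9],
}

# Inner loop: hyperparameter tuning, repeated from scratch inside each outer fold
inner = GridSearchCV(enet, param_grid, scoring="neg_log_loss",
                     cv=StratifiedKFold(5, shuffle=True, random_state=1))

# Outer loop: performance estimated with proper scoring rules
outer = StratifiedKFold(5, shuffle=True, random_state=2)
scores = cross_validate(
    inner, X, y, cv=outer,
    scoring={"brier": "neg_brier_score", "log_loss": "neg_log_loss"},
)
print(-scores["test_brier"].mean(), -scores["test_log_loss"].mean())
```

The key design choice is that the tuning step lives inside the outer cross-validation, so no part of the model selection ever sees the fold that is being used for evaluation.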