
I am struggling with the preprocessing for some analyses. I have a data frame with around 100 observations and quite a few possible predictors (categorical and numerical, about 20 in total). I am quite aware of the restricted number of cases and the inherent problems, but I would still like to come up with some predictions and a parsimonious model. I chose the glmStepAIC method from the caret package, using code similar to this:

library(caret)   # for trainControl() and train(); the glmStepAIC method relies on MASS
library(dplyr)   # for %>% and select()

train_control <- trainControl(method = "repeatedcv",
                              number = 10,
                              repeats = 5,
                              savePredictions = "final",
                              classProbs = TRUE)

mdl_step <- train(x = dataframe %>% select(-dv), y = dataframe$dv,
                  method = "glmStepAIC", family = binomial,
                  preProcess = c("center", "scale"), tuneLength = 10,
                  metric = "Accuracy", trControl = train_control)

The results are credible and the models make sense, but I am aware that the estimates are too optimistic [1], so I have been trying to find possible ways to validate the results. I have come across the idea of pre-processing the data with lasso or ridge regression, but I can't find any examples and don't know how to set this up. Can anyone give me some advice?

2 Answers


Both your use of an automated stepwise regression and of accuracy as your metric are highly troublesome, as the corresponding linked pages explain in detail. Those approaches are likely to overfit, as you recognize for stepwise selection, and won't work well on new data samples. You certainly cannot use this approach for inference (p-values, confidence intervals), as your final model doesn't take into account that you used the outcomes to choose the model.

No form of "pre-processing" will help if you then simply use it as a prelude to stepwise predictor selection or some other automated modeling process that uses the outcomes to build the model. You end up with the same problems.

You have two types of choices if you want a validated model.

First, and probably most useful, is to use a penalized model based on LASSO or ridge regression directly, not as a prelude to stepwise selection. The penalization of regression coefficients in those methods brings the final model's "optimism" down to reasonable levels.
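As a rough sketch of what that could look like in caret with its glmnet method (reusing the dataframe, dv and train_control objects from your question; the formula interface is used so that categorical predictors get dummy-coded for glmnet, which needs a numeric matrix):

library(glmnet)

mdl_glmnet <- train(dv ~ ., data = dataframe,
                    method = "glmnet",                  # elastic net: a mix of LASSO and ridge
                    preProcess = c("center", "scale"),
                    tuneLength = 10,                    # cross-validates both alpha and lambda
                    trControl = train_control,
                    metric = "Accuracy")                # better replaced by a proper scoring rule, see below

# For a pure LASSO (alpha = 1) or pure ridge (alpha = 0) fit, fix alpha in a tuneGrid, e.g.
# tuneGrid = expand.grid(alpha = 1, lambda = 10^seq(-4, 1, length.out = 50))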

Second is to document the optimism of your modeling process and adjust your model accordingly. Take a bootstrap sample of your data (same sample size, but with replacement). That mimics the process of taking your original data sample from the underlying population. Apply your automated modeling process to the bootstrap sample. Evaluate that model's performance both on the corresponding bootstrap sample and on the full data set: the difference (how much better it performs on the bootstrap sample than on the full data set) is an estimate of the "optimism" of the fit. Repeat a few hundred times to get an overall estimate of the "optimism" of the modeling process. That's called the "optimism bootstrap."
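caret can run a version of this procedure for you via its optimism_boot resampling method. A minimal sketch, reusing the dataframe and dv objects from your question and keeping your glmStepAIC call only to show how the optimism of that whole process (including the stepwise selection, which is re-run on every resample) would be estimated:

opt_control <- trainControl(method = "optimism_boot",   # optimism bootstrap resampling
                            number = 200,               # number of bootstrap resamples
                            classProbs = TRUE,
                            savePredictions = "final")

mdl_opt <- train(x = dataframe %>% select(-dv), y = dataframe$dv,
                 method = "glmStepAIC", family = binomial,
                 preProcess = c("center", "scale"),
                 trControl = opt_control, metric = "Accuracy")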

When you do either of the above, use a proper scoring rule like log-loss or the Brier score instead of "accuracy," which is based on an assumption about a probability cutoff for class assignment.
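In caret, that amounts to changing the summary function and the metric. A minimal sketch with mnLogLoss, caret's built-in log-loss summary (a Brier-score summary would need a custom summaryFunction):

ll_control <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 5,
                           classProbs = TRUE,            # log-loss needs class probabilities
                           summaryFunction = mnLogLoss,
                           savePredictions = "final")

mdl_ll <- train(dv ~ ., data = dataframe,
                method = "glmnet",
                preProcess = c("center", "scale"),
                tuneLength = 10,
                metric = "logLoss",                      # caret minimises this metric automatically
                trControl = ll_control)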

Frank Harrell's notes on Regression Modeling Strategies go into extensive detail on these matters.

EdM
  • Wow, thanks @EdM and @jmarkov for your answers and the impetus you give. I will get to grips with building a computational framework to implement all of what @EdM suggested, including the penalisation method as an option in train and method = 'optimism_boot' in trainControl, but especially updating the metric. Above all, my work in research has taught me a healthy scepticism towards overly optimistic results, no matter how well they fit what I suspected, probably even especially in cases where they fit too well, and especially with small data sets. – umrpedrod Jan 12 '23 at 22:54

If your model gives good predictions and "makes sense", in the sense that based on your previous knowledge it gives results you can reasonably expect, then in my opinion you are likely on the right track: your data are probably not biased and your best prediction model can be used to predict new observations. If your data are biased, I don't think you can validate your predictions by running a regression model with a regularization technique, because the results will carry the same bias as the prediction model.

Bias is strictly linked to the concept of causality. In the question you refer to, collinearity is mentioned. Collinearity is a statistical term that refers to the degree of correlation among the predictors, but it says nothing about the direction of their relationship. This means that if two predictors are highly correlated, in general you can't say much about them. If you have to choose which one to keep, I would be careful about using a purely statistical, data-driven technique such as lasso or ridge to make this decision. You need to use your domain knowledge (hypothesis-driven), that is, the knowledge you have of the research field and the data.

General suggestion: everything depends a lot on the kind of research question you are interested in. If you are only interested in predictions, you usually don't really need to make the model parsimonious. If you need parsimony, it probably means that you also want to understand something about the relationship between the outcome (dependent variable) and the predictors (independent variables, features, or whatever you want to call them).

Ask yourself: do I have a reason to think that predictions obtained using these data are not generalizable to new observations? In other words, do these data carry some sort of selection bias?

Additionally, particularly because you want your model to have a few good predictors, it's also important to understand the direction of the relationships among them. Otherwise you might run into confounding.

I recommend looking into DAGs (Directed Acyclic Graphs), which are intuitive tools that can help you better understand your data and its potential biases. If you have time to invest, I recommend this book, which you can download for free from its website; it will really help with a variety of different problems in data analysis.
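For a concrete starting point, here is a tiny sketch with the dagitty R package; the variable names (age, treatment, dv) are purely illustrative and not taken from your data:

library(dagitty)

# A toy DAG: age confounds the relationship between a predictor of interest and the outcome
g <- dagitty("dag {
  age -> treatment
  age -> dv
  treatment -> dv
}")

adjustmentSets(g, exposure = "treatment", outcome = "dv")   # minimal adjustment set: { age }
plot(graphLayout(g))                                         # draw the DAG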

Update: there is no single, straightforward answer to this, because the question is about the strategy of your analyses, so people with different backgrounds working in different fields can have different views. Some answers might provide a better explanation than others, depending on your background knowledge. Therefore, this answer might not be very useful to you, or it might be enlightening! It would be very good if you could give feedback on this answer (either by editing your question with updates or by commenting on my answer), so that others know what it should be complemented with, or that you are satisfied with it. There are no deadlines, but don't wait too long if you intend to do so. It might not be easy to explain what is not clear, because some topics are complex and sometimes too broad to handle; I guarantee that we all understand that! However, research thrives thanks to our differences, so please don't hold back your doubts :). Also, consider that an effort from your side will be rewarding both for you, in getting the best answer you need to proceed with your analysis, and for the answerer, who is here to share knowledge with everyone for free! Bonus info: everyone is trying their best here, but some people seeing your question might be top-level experts in problems related to what you work on. They might be of immense help, but their time may be very limited, and they can be discouraged from answering if they don't have enough elements to help... You get me ;).

jmarkov
  • Note that the term bias has multiple meanings in statistics and machine learning, so one has to be careful. – Richard Hardy Jan 10 '23 at 19:05
  • Do you mean that machine learning focuses more on the bias-variance trade-off from a modelling perspective? I am less confident about the machine learning side. – jmarkov Jan 11 '23 at 07:59
  • 2
    There are at least a couple of examples. First, bias of an estimator in statistics has nothing to do with causality. An estimator can be biased or unbiased for its target regardless of whether or not we may want to ascribe any causal interpretation to the target. Second, bias in the context of neural networks is the constant term (a node that acts as an additive constant); it would be called intercept in a regression setting. – Richard Hardy Jan 11 '23 at 08:49
  • I see! I would be interested in seeing how you would answer the question. Not sure how useful it would be for the public, considering that this question has no upvotes yet, but who knows... – jmarkov Jan 11 '23 at 12:09