
So I'm trying to build a multivariate logistic regression model in RStudio, and I'm not sure how to go about it. What seemed to make sense to me was to model every predictor against the response individually and see whether there's any individual significance. If there's no relationship between a single predictor and the response, why would I add it to my model?

Then I'd fit a model with only the predictors that were individually significant, and run AIC/BIC-based selection or ridge/lasso to shrink or remove non-significant predictors. What I'd have left would be my resulting model.

I just want to know whether that's a sensible way to go about constructing a multivariate regression, and if not, what better way or ways there are. Thanks

  • What are you doing your logistic regression for? Is this (1) just an exploratory analysis, or do you want to (2) perform inference and significance testing, or (3) predict for unseen data? – Stephan Kolassa Dec 03 '23 at 08:08
  • So I collected a sample of 10k data points (games in this case) and wanted to see whether I could predict game outcomes using a combination of predictors. So I guess it would be no. 2 – AdmiralMunson Dec 03 '23 at 08:11
  • Hm. That actually sounds much more like option 3, or possibly 1. It does not seem like you plan on calculating and reporting p values. – Stephan Kolassa Dec 03 '23 at 08:17
  • So I've run a couple of models in R and gotten p-values and all that. Maybe I'm not sure what you mean, because all I'm trying to do is add variables to a model in a reasonable way, so that I can understand relationships in the data (assuming they exist) and do some prediction. Does that make sense? – AdmiralMunson Dec 03 '23 at 08:19
  • A side comment: multivariate regression is not a synonym for multiple regression – utobi Dec 03 '23 at 08:44
  • I meant a logistic model that has more than one predictor variable; I apologize if I used the wrong term. Also, that thread seems interesting and I will read it more when I am more awake. Thank you – AdmiralMunson Dec 03 '23 at 08:53

2 Answers


The problem with your proposed approach is that a predictor may not correlate with the outcome on its own, while interactions between predictors do. Or you might have a curvilinear relationship between a predictor and the outcome. (For instance, very low and very high BMIs are associated with higher morbidity than medium BMIs.)

This is really little different from model selection for OLS or pretty much any other model. You should always start with domain knowledge and not just feed your data into some model selection algorithm, because the latter approach is pretty much guaranteed to have you chasing noise. Reliably finding relevant variables out of a large pool requires enormous amounts of data.

So best to first pare down your candidate predictors. Possibly think about transformations like splines to model potential nonlinearities, or interactions. (Look, you suddenly again have lots of predictors - so the caveats above apply again.) Then you might want to look at automatic model selection tools, like stepwise regression based on information criteria or statistical testing. This is highly problematic if you want to do inference (p values), but can be defended if your goal is prediction. Just don't go overboard with this and don't trust an automatic tool too much, because it will not save you from overfitting. Absolutely have a look at Algorithms for automatic model selection.

Ideally, bootstrap your model selection to get a feeling for how variable it is. Are some predictors always selected? Are others sometimes selected and sometimes not? One key outcome of your exercise should be a lot of humility as to whether you really found the "best" model.
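Here is one way to sketch that bootstrap (Python/scikit-learn, with simulated data and an arbitrarily chosen lasso penalty): refit an L1-penalized logistic regression on bootstrap resamples and record how often each predictor is selected.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, p = 500, 6
X = rng.normal(size=(n, p))
# only the first two predictors matter in this simulation
y = rng.binomial(1, 1 / (1 + np.exp(-(1.5 * X[:, 0] - 1.0 * X[:, 1]))))

B = 200
counts = np.zeros(p)
for _ in range(B):
    idx = rng.integers(0, n, size=n)          # bootstrap resample (with replacement)
    lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
    lasso.fit(X[idx], y[idx])
    counts += np.abs(lasso.coef_[0]) > 1e-8   # which coefficients survived?

print(dict(zip([f"x{j}" for j in range(p)], (counts / B).round(2))))
```

The truly relevant predictors should be selected in (nearly) every resample, while the noise predictors come and go - that instability is exactly the humility check described above.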

Also, I would recommend you keep a holdout set, or wrap all this in a cross-validation setup. Assess the probabilistic predictions from your logistic regression using proper scoring rules. Compare the performance of your selected model to that of an extremely simple model with only a few predictors chosen by domain knowledge - chances are that the very simple model will be quite hard to beat by a more complex one.
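A minimal sketch of that comparison (Python/scikit-learn, simulated data in which only one of ten predictors matters), scoring holdout predictions with the Brier score, one example of a proper scoring rule:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n, p = 1000, 10
X = rng.normal(size=(n, p))
# only the first predictor truly matters in this simulation
y = rng.binomial(1, 1 / (1 + np.exp(-1.5 * X[:, 0])))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

full = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
simple = LogisticRegression(max_iter=1000).fit(X_tr[:, :1], y_tr)

# Brier score: mean squared error of predicted probabilities (lower is better)
brier_full = brier_score_loss(y_te, full.predict_proba(X_te)[:, 1])
brier_simple = brier_score_loss(y_te, simple.predict_proba(X_te[:, :1])[:, 1])
print(f"full: {brier_full:.3f}  simple: {brier_simple:.3f}")
```

In setups like this the one-predictor model typically scores about as well as the ten-predictor one on the holdout set, which is exactly the point about simple benchmarks.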

Stephan Kolassa

Adding to Stephan's excellent answer, there are lots of reasons not to do automatic variable selection. He covers some of them. But there are others:

  • A variable could be a mediator without being either important or significant in its own right.

  • A small effect size might still be important (e.g., if theory says the effect should be large, finding a small one is itself informative).

  • A variable might be part of your hypotheses or research questions.

  • Your method relies on statistical significance, which has its own problems.

But, perhaps most important, doing automatic variable selection denies you the ability to think, it ignores any substantive knowledge that you or your colleagues have, and (if you are working for someone else), it tells your boss they don't have to pay you much.

As David Cox put it:

There are no routine statistical questions, only questionable statistical routines.

Or, as the old quip (often attributed to Andrew Lang) goes, such an analyst uses statistics like a drunken man uses a lamp post - more for support than illumination.

Peter Flom