I am trying to fit a multiple logistic regression. My dependent (outcome) variable has 22 events (yes) and 310 non-events (no), for a total sample of 332. I would like to include 20 independent variables (to minimize confounding). In this context, do I need to use a penalized or a standard regression method? Any suggestions and references would be appreciated. Thank you.
-
What problem are you trying to solve? What makes you think that penalization would or would not be a good solution? – Sycorax Oct 14 '20 at 02:05
-
How many predictors are you thinking about including in the multiple regression? For what purpose are you constructing this model? – EdM Oct 14 '20 at 02:25
-
What do you mean by 22 events? 22 categories? – Dave Oct 14 '20 at 03:13
-
@Sycorax Thank you for your reply. My outcome variable has 22 yes (coded as 1) and 310 no (coded as No). I found in the text that if your number of events is small, you need to use a penalized method. I got confused about which method I should use, so that it will not be a problem in the future when writing an article. – Prasant Shahi Oct 14 '20 at 03:22
-
@EdM, Thanks for your reply. I am using around 20 predictors (socio-economic/individual/parents' characteristics). I am constructing the model to find the association. – Prasant Shahi Oct 14 '20 at 03:24
-
@Dave, Thank you. 22 is the number of "yes" outcomes (22 yes and 310 no). I hope that makes it a bit clearer. – Prasant Shahi Oct 14 '20 at 03:26
-
@PrasantShahi You've said that "I found the text that if your event is small you need to use the penalized method." On its face, this would seem to answer your question. Can you elaborate on why you doubt "the text"? Also, which text? There are lots of statistical references in the world; it's hard to know if this text is germane to your problem. – Sycorax Oct 14 '20 at 03:27
-
@Sycorax, thank you. I got confused on how much is considered a small event. However, thank you for advising me not to be in too much confusion. Thank you – Prasant Shahi Oct 14 '20 at 03:32
-
A recent discussion of the number of events per variable "rule of thumb" and penalization methods can be found in the following article: Calculating the sample size required for developing a clinical prediction model. – chl Oct 14 '20 at 06:39
-
@chl, thank you for your suggestion. – Prasant Shahi Oct 14 '20 at 11:12
1 Answer
Whether you use the standard "rule of thumb" of 10-20 minority-class cases per predictor or the more refined approaches described in the paper linked by @chl in a comment, you will find that having only 22 events will severely limit your ability to "find the association" between the event and your approximately 20 candidate predictors. Unless there is some penalization, you probably should not be considering any more than about 2 predictors.
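To make the arithmetic behind that limit explicit, here is a minimal sketch of the events-per-variable rule of thumb. The helper name `max_predictors` is made up for illustration; the counts are the ones from the question.

```python
def max_predictors(n_events: int, n_nonevents: int, epv: int) -> int:
    """Largest number of predictors allowed under an events-per-variable rule.

    The minority class is what limits the model, so the smaller of the
    two class counts is divided by the required events per variable.
    """
    limiting = min(n_events, n_nonevents)
    return limiting // epv


n_events, n_nonevents = 22, 310  # counts from the question

for epv in (10, 20):
    allowed = max_predictors(n_events, n_nonevents, epv)
    print(f"EPV = {epv}: at most {allowed} predictor(s)")

# With all 20 candidate predictors, the events per variable would be:
print(f"EPV with 20 predictors: {n_events / 20:.1f}")
```

With 22 events this gives 2 predictors at 10 events per variable and 1 at 20, while using all 20 candidates leaves only about 1.1 events per predictor.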
You can try penalization, but that is not a cure-all for having too few events. If you use LASSO for penalization, your final model will probably only include about 2 or 3 of your 20 candidate predictors. Even if that model might provide some predictive ability on new cases, you will probably find that the particular 2 or 3 predictors that are included will vary substantially if you repeat the process on multiple bootstrap samples of your data. So you won't be able to say which predictors really represent "the association."
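One way to check this on your own data is to refit the LASSO on bootstrap resamples and record how often each predictor is selected. The following is only a sketch on synthetic data, assuming NumPy and scikit-learn are available: the simulated data set, the three "true" predictors, and the fixed penalty `C=0.1` are all assumptions made for illustration (in practice the penalty would be chosen by cross-validation inside each resample).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 332 rows, 20 predictors, rare events.
n, p = 332, 20
X = rng.standard_normal((n, p))
true_beta = np.zeros(p)
true_beta[:3] = 0.8                      # assumption: only 3 predictors matter
logit = -3.0 + X @ true_beta
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))


def selected_mask(X, y, C=0.1):
    """Fit an L1-penalized logistic regression; return which coefficients are nonzero."""
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    model.fit(X, y)
    return model.coef_.ravel() != 0


n_boot = 100
counts = np.zeros(p)
used = 0
for _ in range(n_boot):
    idx = rng.integers(0, n, size=n)     # resample rows with replacement
    if y[idx].min() == y[idx].max():     # skip resamples with a single class
        continue
    counts += selected_mask(X[idx], y[idx])
    used += 1

freq = counts / used                     # selection frequency per predictor
for j in np.argsort(-freq):
    print(f"x{j:02d}: selected in {100 * freq[j]:.0f}% of resamples")
```

If few predictors are selected in nearly every resample and many are selected only some of the time, the identity of the "chosen" predictors is not something the data can pin down.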
Penalization with ridge regression will keep all predictors in the model, but with 20 predictors and 22 events the regression coefficients will be very highly penalized. The large tradeoff thus made between variance and bias in setting the penalization level will make inference from the model unreliable. As the authors of the R penalized package put it:
It is a very natural question to ask for standard errors of regression coefficients or other estimated quantities. In principle such standard errors can easily be calculated, e.g. using the bootstrap.
Still, this package deliberately does not provide them. The reason for this is that standard errors are not very meaningful for strongly biased estimates such as arise from penalized estimation methods. Penalized estimation is a procedure that reduces the variance of estimators by introducing substantial bias. The bias of each estimator is therefore a major component of its mean squared error, whereas its variance may contribute only a small part.
Unfortunately, in most applications of penalized regression it is impossible to obtain a sufficiently precise estimate of the bias. Any bootstrap-based calculations can only give an assessment of the variance of the estimates. Reliable estimates of the bias are only available if reliable unbiased estimates are available, which is typically not the case in situations in which penalized estimates are used.
That again will limit your ability to "find the association" in a reliable way.
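The shrinkage behind that bias can be seen directly by fitting ridge-penalized logistic regressions at several penalty strengths. This is a rough sketch on synthetic data, assuming NumPy and scikit-learn; the simulated coefficients and the three values of `C` (scikit-learn's inverse penalty strength) are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Synthetic stand-in: 332 rows, 20 predictors with small true effects, rare events.
n, p = 332, 20
X = rng.standard_normal((n, p))
true_beta = np.full(p, 0.3)              # assumption: every predictor has a small effect
logit = -3.0 + X @ true_beta
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

print(f"events: {int(y.sum())} of {n}")
print(f"norm of true coefficients: {np.linalg.norm(true_beta):.2f}")

norms = {}
for C in (100.0, 1.0, 0.01):             # smaller C means a stronger ridge penalty
    model = LogisticRegression(C=C, max_iter=5000)  # the default penalty is L2 (ridge)
    model.fit(X, y)
    norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C = {C:>6}: norm of fitted coefficients = {norms[C]:.2f}")
```

As the penalty grows, the fitted coefficients are pulled toward zero regardless of their true values, which is the bias that makes standard errors around them hard to interpret.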
(+1) Frank Harrell also discussed the basic requirement to estimate reliably the intercept in a logistic regression. – chl Oct 14 '20 at 11:50