How to fix violated linear probability model assumptions?

Question

I am currently conducting research on the experience of loneliness (1 = lonely, 0 = not) amongst ethnic groups. I have been advised to conduct the analysis using a linear probability model. I am aware that this violates several assumptions (non-normal errors, heteroscedastic errors, predicted values lying outside [0, 1]). I understand that heteroscedasticity can be handled by computing robust standard errors. But I was wondering if there is any way these violated assumptions can be corrected or minimised in R. For example, is there a way to produce valid predictions alongside this model? I feel that if I present the corrected terms, this will support my justification for using a linear probability model.

You probably want logistic regression; a UCLA link to get started — user20650, Apr 20 '21 at 10:16
I am using survey weights so my sample size for some ethnic groups is less than 10, is this acceptable with a logistic regression model? — , Apr 20 '21 at 11:04

score 1 · Answer 1 · answered Apr 20 '21 at 18:21

For example, is there a way to produce valid predictions alongside this model?

Yes, there is, and it is called logistic regression. The details of logistic regression are too long to put here. You can find them in any number of books on Generalized Linear Models. Logistic regression fixes nearly all of the problems you've listed here:

Non-normality is no problem because we model the likelihood (appropriately) as binomial rather than normal.
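Dependence of the variance on the mean is also naturally handled, since the binomial variance is a function of the mean.
The inverse of the logit link, $\operatorname{logit}^{-1} : \mathbb{R} \to (0,1)$, maps the linear predictor to a probability, so we always get valid predictions.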
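To make the last point concrete, here is a minimal sketch in base R on simulated data. The variable names, the coefficients, and the sample size are all made up for illustration; nothing here comes from the asker's survey. It fits a linear probability model and a logistic regression to the same binary outcome and compares the range of the fitted probabilities.

```r
# Simulated data (hypothetical): one continuous predictor, binary outcome
set.seed(1)
n <- 500
x <- rnorm(n, sd = 2)
y <- rbinom(n, size = 1, prob = plogis(-1 + 1.5 * x))
dat <- data.frame(x = x, y = y)

# Linear probability model: ordinary least squares on the 0/1 outcome
lpm <- lm(y ~ x, data = dat)

# Logistic regression: binomial likelihood with the logit link
logit_fit <- glm(y ~ x, family = binomial(link = "logit"), data = dat)

# Fitted probabilities from each model
p_lpm   <- fitted(lpm)
p_logit <- fitted(logit_fit)

range(p_lpm)    # can fall below 0 or above 1
range(p_logit)  # always stays inside (0, 1)
```

With a strong continuous predictor like this one, the straight line from the LPM runs past 0 and 1 at the extremes of `x`, while the logistic fit cannot.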

The trade-off is (perhaps) in interpretability. Linear probability models are dead simple in their interpretation, but the coefficients of a logistic regression are log odds ratios. It's simple to state what these are, but difficult to interpret them in the same way as the coefficients of a linear model.
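As a rough illustration of that interpretability gap, the sketch below uses a tiny made-up two-group data set (`group` and `lonely` are hypothetical names, and the counts are invented). It shows how a logistic coefficient turns into an odds ratio, and how the same fit can still be reported on the probability scale as a risk difference.

```r
# Hypothetical toy data: 10 of 40 lonely in group A, 20 of 40 in group B
dat <- data.frame(
  group  = factor(rep(c("A", "B"), times = c(40, 40))),
  lonely = c(rep(1, 10), rep(0, 30), rep(1, 20), rep(0, 20))
)

fit <- glm(lonely ~ group, family = binomial, data = dat)

# The coefficient is a log odds ratio; exponentiate to get the odds ratio
log_or     <- coef(fit)[["groupB"]]
odds_ratio <- exp(log_or)  # (20/20) / (10/30), i.e. about 3

# Predicted probabilities per group, then their difference
new_groups <- data.frame(group = factor(c("A", "B"), levels = levels(dat$group)))
p_hat      <- unname(predict(fit, newdata = new_groups, type = "response"))
risk_diff  <- p_hat[2] - p_hat[1]  # about 0.50 - 0.25 = 0.25
```

An odds ratio of 3 and a risk difference of 25 percentage points describe the same data; the second is what an LPM coefficient would report directly.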

LPM vs logistic regression is hotly debated in some circles, and before progressing I think it may be beneficial to kindly ask whoever suggested LPM why they chose it over logistic regression.

Does sample size affect the use of logistic regression? I am currently using the svyglm() function from the survey package in R to compute my models, as I am applying the correct survey weights. For one ethnic group, the application of survey weights means the sample size is reduced to 9 observations. However, I thought a rule of thumb for logistic regression was 10 observations per predictor? Correct me if I am wrong. — Chloe00, Apr 20 '21 at 19:56
I typically use 7 events per predictor, but this is questionable, as I say below. First of all, with binary outcomes, EVENTS per predictor matters: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5394463/ . Moreover, rules of thumb are subjective: https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-016-0267-3 : one should check whether, in their particular case, there's multicollinearity and quasi-separation: — Federico Tedeschi, Dec 02 '22 at 11:41

score 0 · Answer 2 · answered Dec 02 '22 at 13:06

In your case, the usual caveats about the Linear Probability Model do not apply, in my opinion. The only issue I see is that you won't end up with odds ratios, but with risk differences. This is not a problem at all: you can use the probabilities estimated by your model to compute, in turn, both odds ratios and risk ratios.

Generalized Linear Models (here is an example of a paper suggesting this approach, calling it a "binomial model for the risk difference": https://academic.oup.com/aje/article/162/3/199/171116) may let you address non-normality (which may be an issue for inference in small samples) and heteroskedasticity (here: https://stats.stackexchange.com/a/140662/159259 you can see the formula to define the variance as a function of the mean in R).

As for values lying outside the 0-1 interval: with a categorical predictor with N categories, you'll just have the point estimates of N probabilities (given by the observed frequencies). One of them will belong to the reference category (the constant term representing its estimated probability), while the N-1 parameters associated with the other ethnic groups will be their risk-difference estimates with respect to that reference category. All estimated probabilities will lie within the 0-1 interval (and strictly inside it unless some group contains only lonely or only non-lonely respondents), assuming that you wouldn't consider ethnic group as a variable in the first place if everyone were from the same group, and that you wouldn't include any ethnic group that is not represented in your sample.
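The point about a single categorical predictor can be checked with a small sketch in base R. The data below are invented (three hypothetical groups labelled A, B, C, unweighted), so treat it as an illustration of the algebra rather than of the asker's survey-weighted analysis.

```r
# Hypothetical toy data: lonely counts are 5/20 (A), 4/10 (B), 7/10 (C)
dat <- data.frame(
  ethnicity = factor(rep(c("A", "B", "C"), times = c(20, 10, 10))),
  lonely    = c(rep(1, 5), rep(0, 15),
                rep(1, 4), rep(0, 6),
                rep(1, 7), rep(0, 3))
)

# Linear probability model with group A as the reference category
lpm <- lm(lonely ~ ethnicity, data = dat)
b   <- unname(coef(lpm))

# Intercept = P(lonely) in the reference group; other terms = risk differences
p_hat <- c(A = b[1], B = b[1] + b[2], C = b[1] + b[3])

# These coincide with the observed group proportions
group_prop <- tapply(dat$lonely, dat$ethnicity, mean)

# Risk ratios and odds ratios relative to group A, derived from p_hat
risk_ratio <- p_hat / p_hat[["A"]]        # B about 1.6, C about 2.8
odds       <- p_hat / (1 - p_hat)
odds_ratio <- odds / odds[["A"]]          # B about 2, C about 7
```

Because the model is saturated (one parameter per group), the fitted values are just the group proportions, which is why they cannot leave the 0-1 interval here. That guarantee disappears once continuous covariates enter the model.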
