
Background: This is a cross-sectional study that collected 30 thrombosis samples. We assessed the presence or absence of microplastic (MP) components as the dependent variable: 24 cases had MP (coded 1) and 6 cases did not (coded 0). We carried out single-factor and multi-factor logistic regression analyses, with clinical indicators as independent variables and the presence or absence of MP as the dependent variable.

Comments from the reviewer: The main concern raised is the small sample size, which is common in the field of microplastics. The odds ratios (OR) in the multi-factor regression model ranged from 6 to 12, with wide 95% confidence intervals (CI), indicating that the logistic regression model is unreliable. Furthermore, with 80% of the results being positive (24 out of 30 cases), the association estimated by the OR may be exaggerated. Given the small sample size, caution should be exercised when selecting variables to include in the regression model.

I have the following questions:

  1. Can we still use a multi-factor logistic regression model?
  2. The results of the single-factor logistic regression analysis suggest that the D-dimer concentration index has statistical significance (P=0.034) in relation to the presence or absence of MP, while the other indicators have p-values greater than 0.05. Can we include variables such as gender and age in the multi-factor logistic regression model? How many variables should be included at most in the multi-factor logistic regression model?
  3. If the multi-factor logistic regression model is not feasible, can we only perform a single-factor logistic regression analysis to demonstrate the influence of D-dimer on the dependent variable?
  4. If logistic regression is not used, can we utilize the adjusted Poisson regression (robust Poisson regression) mentioned in the referenced article (https://mp.weixin.qq.com/s/1EQjsKvLyWXRVFMX_u0y-w) and website (https://mengte.online/archives/11695)? Can the 95% CI of the odds ratio converge to single-digit values?
  5. If none of the above approaches are suitable, what other statistical methods can be employed? Bayesian methods? Exact logistic regression?

Thank you very much! Everyone is welcome to discuss these issues and offer suggestions.

DrJerryTAO
zhiheng yi

4 Answers


I do not think the problem comes from the small sample size per se. It rather comes from the fact that MPs are evidently present almost everywhere regardless of the values of the explanatory variables: only 6 samples out of 30 have a score of 0, and all the rest are 1s. There is barely any variance to explain. Therefore, including more explanatory variables seems irrelevant here.

I would try to obtain a continuous measure of MP instead. If this is not possible, I would not run any statistics, but simply say that MP can be found in most samples regardless of the value of the explanatory variables.

CaroZ
  • This sounds more like a comment than an answer. – Stephan Kolassa Feb 27 '24 at 11:51
  • The concentration of MP is also one of the outcome variables, which is a continuous variable. Therefore, we have already used a multiple linear regression model for statistical analysis. – zhiheng yi Feb 27 '24 at 11:59
  • So then why do you want to analyse it as presence/absence when it is obviously almost always present? – CaroZ Feb 27 '24 at 12:04
  • I am a beginner in statistics, and my methods are based on papers from similar researchers, so there may be many errors. I am curious: is it unreasonable to analyze the presence of MP under the assumption of a high probability of time occurrence (80%)? What are the specific reasons for this? – zhiheng yi Feb 27 '24 at 12:13
  • What is it you call "time occurrence"? For the mathematical reason, see Frank Harrell's answer above. I can only give you an answer based on common sense: if something is present almost everywhere, you do not have enough cases of non-occurrence to estimate the probability of non-occurrence. It would be a case of forcing an analysis onto data where a simple description of what you observe would be accurate. – CaroZ Feb 28 '24 at 08:35

There is no fix for this situation. Not only has the experiment used a minimum-information (binary) response variable, but it also does not have the sample size needed (96) just to estimate the intercept in a logistic regression model. To add to the problem, the distribution of Y is not balanced. The sample size needed here is more like 96 + 15 times the number of candidate predictors. The effective sample size is 3np(1-p), where p is the proportion of positive responses. More in https://hbiostat.org/rmsc/multivar and https://hbiostat.org/rmsc/lrm . The referee is exactly correct.
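
As a quick worked check of those formulas for this study (a hedged sketch that simply plugs n = 30 and p = 24/30 into the quantities quoted above):

    n <- 30
    p <- 24 / 30                 # proportion of MP-positive samples
    3 * n * p * (1 - p)          # effective sample size: 14.4
    96 + 15 * 1                  # minimum n for one candidate predictor: 111

An effective sample size of about 14, against a minimum of 96 just for the intercept, is the gap the answer refers to.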

Frank Harrell
  • Is the sample size needed (96 + 15k) stated in terms of the effective sample size in binary regression? For example, if n = 200 and p = 0.5, the effective sample size is 150, which supports estimation of one intercept and three slopes. The raw sample size of 200 seems to support one intercept and seven slopes instead. – DrJerryTAO Mar 13 '24 at 23:21
  • A much better formula is in https://onlinelibrary.wiley.com/doi/10.1002/sim.7992 which shows that rules of thumb aren't super reliable. But the 96 is very solid as the minimum. – Frank Harrell Mar 14 '24 at 22:36
  • Thanks. The paper argues that events per parameter (EPP) is not a constant threshold but that a global shrinkage factor of 0.9 is. I am curious why the resulting EPP is not symmetric around the event definition in binary logit models. In Section 2.4, if the outcome proportion is 0.1, the calculated EPP is 8.5. Accordingly, EPP = (1698 × 0.9)/20 = 76.4 if the outcome proportion is 0.9. But if we flip the outcome definition to its negation, so that an outcome proportion of 0.9 is equivalent to a negation proportion of 0.1, the EPP should still be 8.5, in contrast to the much higher 76.4 suggested by the paper. – DrJerryTAO Mar 15 '24 at 11:55
  • First of all, EPP is not a great concept so don't put that much stock in it. Second, it is based on the lower of the two proportions p, 1-p. Think of the effective sample size for binary Y which is symmetric: 3np(1-p). The effective sample size for continuous Y is n. – Frank Harrell Mar 15 '24 at 13:17

The usual rule of thumb with logistic regression is that you can include one IV for every 10 cases in the less common outcome category. For you, that's six people, so this rule would say "one IV at most".

The problem here is not your method, but the data. You haven't got enough. Overfitting happens when you try to fit models to data like this.

From my small amount of research, modified Poisson regression is a robust method intended to deal with problems such as clustered data. That's not your issue. Exact logistic regression may help, but I think only by making your results less significant and your CIs wider. This may not seem like help, but it is: when you have overfitting, your p values are likely to be too small and your CIs too narrow. See King and Ryan (2002), A Preliminary Investigation of Maximum Likelihood Regression vs. Exact Logistic Regression, particularly the section on the ESR data.

I am not a Bayesian, but my guess is that the results of a regression with so little data ought to have very little effect on your priors (whatever they are).

Peter Flom

  1. Can we still use a multi-factor logistic regression model?
  2. The results of the single-factor logistic regression analysis suggest that the D-dimer concentration index has statistical significance (P=0.034) in relation to the presence or absence of MP, while the other indicators have p-values greater than 0.05. Can we include variables such as gender and age in the multi-factor logistic regression model? How many variables should be included at most in the multi-factor logistic regression model?

As others have pointed out, the sample size is too small to allow many predictors. Too many predictors of a binary outcome tend to result in complete separation, with large coefficients and huge standard errors. Coefficients in generalized linear models are asymptotically consistent, meaning that they are biased but the biases get smaller as the sample size increases. Similarly, their standard errors are at best consistent. In small samples, this bias may be relatively large. The p values you reported, whether based on Wald, score, or likelihood-ratio tests, are not trustworthy, because they rely on asymptotic arguments that only apply when the sample size is large enough. See my comments on these tests at Likelihood ratio vs. score vs. Wald test: Different p values, which to use?
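
To illustrate the separation problem concretely (a simulated sketch, not your study data; the variables below are constructed so that the predictor perfectly separates the 6 zeros from the 24 ones):

    # 6 zeros and 24 ones, with a predictor whose values do not overlap
    # between the two outcome groups (complete separation by construction).
    set.seed(1)
    y  <- rep(c(0, 1), c(6, 24))
    x1 <- c(runif(6, 0, 1), runif(24, 2, 3))
    fit <- glm(y ~ x1, family = binomial)
    # glm warns that fitted probabilities of 0 or 1 occurred;
    # the estimated coefficient and standard error for x1 are enormous.
    summary(fit)$coefficients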

  3. If the multi-factor logistic regression model is not feasible, can we only perform a single-factor logistic regression analysis to demonstrate the influence of D-dimer on the dependent variable?

It is not a good idea to abandon multiple predictors if they actually affect the outcome and are correlated with each other. Dropping them results in omitted-variable bias in the estimates.
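
A small simulated sketch of that omitted-variable bias (hypothetical data, not the study variables): the outcome depends on both x and z, and x is correlated with z, so dropping z distorts the estimated coefficient of x.

    set.seed(1)
    n <- 1e4
    z <- rnorm(n)
    x <- 0.7 * z + rnorm(n)                       # x is correlated with the omitted z
    y <- rbinom(n, 1, plogis(0.5 * x + z))        # true coefficient of x is 0.5
    coef(glm(y ~ x + z, family = binomial))["x"]  # close to 0.5
    coef(glm(y ~ x,     family = binomial))["x"]  # noticeably larger than 0.5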

  4. If logistic regression is not used, can we utilize the adjusted Poisson regression (robust Poisson regression) mentioned in the referenced article (https://mp.weixin.qq.com/s/1EQjsKvLyWXRVFMX_u0y-w) and website (https://mengte.online/archives/11695)? Can the 95% CI of the odds ratio converge to single-digit values?

No, this will not solve the small-sample problem. Poisson regression directly reports a relative-risk estimate, but one can easily obtain the same quantity from a logistic regression by using post-estimation tools; see marginaleffects::comparisons() in R. Here "robust" refers to the standard errors only: a sandwich estimator corrects the coefficient variance-covariance matrix to make it consistent, where "consistent" means that it approaches the true value as the sample size increases. With the sample size fixed and small, the robustness or consistency of the standard errors is not very helpful.
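
A minimal sketch of that post-estimation step (the data frame dat and the variables mp, d_dimer, age, and sex are hypothetical placeholders, not the actual study data):

    library(marginaleffects)

    # Ordinary logistic regression for MP presence (hypothetical variable names)
    fit <- glm(mp ~ d_dimer + age + sex, family = binomial, data = dat)

    # Ratio of predicted probabilities (a risk-ratio-type contrast) for a unit
    # change in d_dimer, averaged over the sample, instead of the odds ratio.
    avg_comparisons(fit, variables = "d_dimer", comparison = "ratio")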

  5. If none of the above approaches are suitable, what other statistical methods can be employed? Bayesian methods? Exact logistic regression?

Exact logistic regression appears to be the best option. It is comparable to a permutation test in a one-factor analysis. See the tutorial at https://stats.oarc.ucla.edu/r/dae/exact-logistic-regression/, which also uses a sample size of 30: "Exact logistic regression is used...when the sample size is too small for a regular logistic regression."
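
A rough, hedged sketch of what that might look like with the elrm package (used in the linked UCLA tutorial). The data frame dat and the binary indicator d_dimer_high are hypothetical: exact/conditional methods work with discrete covariate patterns, so a continuous D-dimer value would have to be categorized or handled differently, and the MCMC settings below are only illustrative.

    library(elrm)

    # Collapse the data to one row per covariate pattern, with the number of
    # MP-positive samples and the number of trials in each pattern
    # (elrm expects a successes/trials response).
    agg <- aggregate(mp ~ d_dimer_high, data = dat, FUN = sum)
    agg$ntrials <- aggregate(mp ~ d_dimer_high, data = dat, FUN = length)$mp

    set.seed(1)
    fit_exact <- elrm(formula = mp/ntrials ~ d_dimer_high,
                      interest = ~ d_dimer_high,
                      iter = 22000, burnIn = 2000, dataset = agg)
    summary(fit_exact)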

The concentration of MP is also one of the outcome variables, which is a continuous variable. Therefore, we have already used a multiple linear regression model for statistical analysis.

You also have a sample-selection issue in the linear regression: we can only observe MP concentration when MP is present, so the effect of a predictor on MP concentration is biased if we estimate it from only the 24 observations where MP is present. See Toomet, O., & Henningsen, A. (2008). Sample selection models in R: Package sampleSelection. Journal of Statistical Software, 27(7). https://doi.org/10.18637/jss.v027.i07, and Toomet, O. (2020). Treatment effects with normal disturbances in the sampleSelection package. https://cran.r-project.org/web/packages/sampleSelection/vignettes/treatReg.pdf. I believe these sample-selection corrections also rely on large-sample properties. Nevertheless, you can bootstrap the small sample to get exact p values. I guess we need a stratified bootstrap here, each time sampling 6 cases from the non-MP cases and 24 cases from the MP cases, with replacement.
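
A minimal sketch of that stratified bootstrap with the boot package (the data frame dat and the columns mp and d_dimer are hypothetical placeholders, and the logistic-regression coefficient shown stands in for whatever estimate is actually of interest, e.g. a selection-model coefficient):

    library(boot)

    # Statistic to bootstrap: here, the D-dimer coefficient from a simple
    # logistic model; replace with the estimate of interest.
    stat_fun <- function(d, idx) {
      coef(glm(mp ~ d_dimer, family = binomial, data = d[idx, ]))["d_dimer"]
    }

    set.seed(1)
    # strata = dat$mp resamples the 6 non-MP and 24 MP cases separately,
    # each stratum with replacement, keeping the 6/24 split in every replicate.
    b <- boot(dat, statistic = stat_fun, R = 2000, strata = dat$mp)
    boot.ci(b, type = "perc")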

DrJerryTAO