Harrell's Regression Modeling Strategies suggests that the number of predictors should not exceed $m/10$, $m/15$ or $m/20$.* For logistic regression $m$ is $\min(n_1, n_2)$, where $n_1$ and $n_2$ are the numbers of observations in the two categories you are predicting (e.g. number of deaths, number of survivals).

If you are preregistering a study, so that you do not actually know $n_1$ and $n_2$, how should you proceed? Should you use domain knowledge to take a guess? What about cases where the study is the first of its kind, so that all previous knowledge is anecdotal?

*Presumably the choice among 10, 15 and 20 depends on how careful you want to be about overfitting, although this is not spelled out.
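To make the rule of thumb concrete, here is a minimal sketch in Python. The function names (`max_predictors`, `planned_budget`), the total sample size, and the guessed event rates are all hypothetical, chosen only for illustration. Taking the smallest budget across a range of guessed event rates is one possible way to pick a number before the data are seen, not something the book prescribes.

```python
def max_predictors(n1, n2, events_per_predictor=15):
    """Largest predictor count allowed by the m/10, m/15 or m/20 rule,
    where m = min(n1, n2) and the divisor is the caller's choice."""
    if events_per_predictor <= 0:
        raise ValueError("events_per_predictor must be positive")
    m = min(n1, n2)
    # Floor division: the predictor count must not exceed m / divisor.
    return m // events_per_predictor


def planned_budget(n_total, event_rate_guesses, events_per_predictor=15):
    """Predictor budget for each guessed event rate (illustrative only)."""
    budgets = {}
    for rate in event_rate_guesses:
        n1 = round(n_total * rate)  # guessed size of the rarer category
        n2 = n_total - n1
        budgets[rate] = max_predictors(n1, n2, events_per_predictor)
    return budgets


if __name__ == "__main__":
    # Known counts: 120 deaths and 880 survivals, so m = 120.
    for divisor in (10, 15, 20):
        print(divisor, max_predictors(120, 880, divisor))

    # Unknown counts: try several guessed event rates for a made-up n = 1000
    # and take the smallest budget as the conservative choice.
    budgets = planned_budget(1000, [0.05, 0.10, 0.20])
    print(budgets, min(budgets.values()))
```

With 120 events the three divisors give budgets of 12, 8 and 6 predictors, which shows how sensitive the answer is to both the divisor and the guessed event rate.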

Edit: actually, how is it even legitimate to look at $n_1$ and $n_2$? Surely this contributes to the forking paths problem?

Mohan
  • Your precision for most estimates from this study will depend on $\min(n_1, n_2)$ more than anything else. Is it possible to design your study to continue until you have a certain number of each event? – George Savva Nov 16 '23 at 14:55
  • Thank you for pointing out the article on the forking paths problem. Very good. – rolando2 Nov 16 '23 at 18:45

1 Answer

The forking paths problem arises when we overstate the statistical significance of a result because the analysis that produced it was only one of several we could have run. Attending to this problem certainly matters, but so, often, does statistical power. Examining $n_1$ and $n_2$ (learning how narrow the outcome distribution is) and finding that $\min(n_1, n_2)$ is fairly large may lead you to include more predictors and run more analyses than you would have if you had assumed $\min(n_1, n_2)$ to be very small. You can correct for the multiple comparisons that the data turn out to allow: adjust alpha using the methods of Bonferroni, Benjamini, Hochberg, and so on, or simply reduce alpha to a level that you and your colleagues find reasonable. In this way you can hopefully strike a balance between control of Type I and Type II error.
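As a rough illustration of what adjusting for multiple comparisons can look like, the sketch below computes Bonferroni-adjusted and Benjamini-Hochberg-adjusted p-values in plain Python. The p-values are made up for the example, and the function names are hypothetical. It is meant only to show the mechanics, not to recommend one method over another.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p-value by the number of tests,
    capping the result at 1."""
    k = len(p_values)
    return [min(1.0, p * k) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (controls the false discovery rate)."""
    k = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(k), key=lambda i: p_values[i])
    adjusted = [0.0] * k
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(k, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * k / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical p-values, e.g. one per predictor tested.
    p = [0.001, 0.012, 0.02, 0.03, 0.2]
    alpha = 0.05
    for name, adj in (("Bonferroni", bonferroni(p)),
                      ("Benjamini-Hochberg", benjamini_hochberg(p))):
        rejected = sum(a <= alpha for a in adj)
        print(name, [round(a, 4) for a in adj], "rejected:", rejected)
```

On these made-up values Bonferroni is the stricter of the two, which reflects the Type I versus Type II trade-off described above: a more conservative correction costs power.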

rolando2
  • Could you expand on how you would adjust alpha in this specific case? – Mohan Nov 16 '23 at 19:45
  • There is no single correct way. Some excellent commentary: https://stats.stackexchange.com/questions/630316/how-many-p-value-observations-do-you-think-are-required-before-doing-fdr-correct/630324#630324 . I also recommend Geoffrey Keppel's commentary in Design and Analysis: A Researcher's Handbook (Prentice Hall). – rolando2 Dec 13 '23 at 18:07