I found a rather old post (How large should a sample be for a given estimation technique and parameters?) giving a rough estimate of the number of data points needed when performing parameter estimation via linear regression techniques. I followed up on the sources cited in that post, but the cited source doesn't seem to contain the "20 data points per parameter" advice the post describes. I was hoping that someone could point to a source that discusses in more depth how much data is generally needed/advised when doing parameter estimation via linear regression techniques.
-
A rule I learned was 10 records per regression coefficient, but I can't "put my finger on the source." – wjktrs Jan 20 '24 at 05:48
1 Answer
In terms of avoiding overfitting, Section 4.4 of Frank Harrell's Regression Modeling Strategies summarizes (with references) the results of several simulation studies on the ratio of observations to candidate predictors. The guidance for ordinary least squares regression, for "typical signal:noise ratios found outside of tightly controlled experiments," is for at least 15 observations per candidate predictor.
For survival analysis it's 15 events per candidate predictor. For binary regression, a rough estimate is 15 members of the minority class per predictor. The Harrell reference provides more detailed formulas for binary and ordinal regression.
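As a rough illustration of why such a minimum matters (my own Python sketch, not from Harrell's text): if you regress pure noise on candidate predictors, the in-sample R² is inflated by roughly p/(n−1) even though the true R² is zero, so a low observations-per-predictor ratio makes a useless model look predictive.

```python
import numpy as np

rng = np.random.default_rng(1)

def apparent_r2(n_obs, n_predictors, n_sims=500):
    """Average in-sample R^2 when y is pure noise (true R^2 = 0).

    Any positive value here is pure overfitting; it grows as the
    observations-per-predictor ratio shrinks.
    """
    r2s = []
    for _ in range(n_sims):
        X = rng.normal(size=(n_obs, n_predictors))
        y = rng.normal(size=n_obs)  # no relationship to X at all
        X1 = np.column_stack([np.ones(n_obs), X])  # add intercept
        beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
        resid = y - X1 @ beta
        tss = (y - y.mean()) @ (y - y.mean())
        r2s.append(1 - (resid @ resid) / tss)
    return float(np.mean(r2s))

# 6 observations per predictor vs. 30 per predictor:
low_ratio = apparent_r2(n_obs=30, n_predictors=5)
high_ratio = apparent_r2(n_obs=150, n_predictors=5)
```

With only 6 observations per predictor the apparent R² sits well above zero; pushing the ratio up toward the 15-per-predictor guideline shrinks that optimism substantially.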
That's only a bare minimum to avoid overfitting, however. When designing a study you want to have a large enough number of observations to provide a reasonable chance of detecting a true effect of a specified size. Those calculations come under the heading of power analysis. Depending on the size of the effect that you want to detect, you might need a good deal more.
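When closed-form power calculations aren't handy, simulation works for essentially any design. A minimal sketch (Python, using SciPy's `linregress`; the slope, noise level, and α are illustrative assumptions, not values from any reference):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def power_sim(n, slope=0.3, sigma=1.0, alpha=0.05, n_sims=2000):
    """Simulated power to detect a nonzero slope in simple linear regression.

    Generates data with the assumed true slope, fits OLS, and counts how
    often the slope's p-value falls below alpha.
    """
    rejections = 0
    for _ in range(n_sims):
        x = rng.normal(size=n)
        y = slope * x + rng.normal(scale=sigma, size=n)
        if stats.linregress(x, y).pvalue < alpha:
            rejections += 1
    return rejections / n_sims
```

Running this for a grid of sample sizes shows how quickly (or slowly) power climbs toward a target like 0.8 for the effect size you actually care about, which is typically a much stricter requirement than the anti-overfitting minimum above.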
-
I imagine that's where cross-validation is useful, correct? I'm not exactly a wiz at statistical analysis and regression. I'll dive into your reference though, thank you! – David G. Jan 20 '24 at 14:33
-
@DavidG. for power analysis in experimental design you can use formal tests for many situations or, more generally, simulation. Cross-validation has several uses for building and evaluating models on data sets that have already been acquired. If you are going to do a lot of regression analysis, it's worth taking the time to learn how to use Frank Harrell's rms package in R. – EdM Jan 20 '24 at 15:11
"Overfitting" is important but it's a separate issue from the "parameter estimation" described in this question. The latter doesn't imply any form of model identification. Consequently, these rules of thumb are (IMHO) too conservative in many applications. – whuber Jan 20 '24 at 18:21