Can regression to the mean cause this?

Question

Let's say I have two independent samples (DS and VS). My data is also highly multivariate (e.g. one dependent variable, 300+ potential predictors, N<300). To reduce the dimensionality of this problem, I decided to aggregate the effect of my 300+ potential predictors into one index parameter. To do so, I regressed my dependent variable on each predictor separately in DS, which gives me an effect score (beta value) for each predictor. By multiplying each predictor by its beta value and summing the products, I get the accumulated net effect (risk index) of all 300+ predictors for a particular observation.

Finally, I tested if this risk index is associated with my dependent variable in the Validation Sample (VS). I expected that the beta value for the risk index would fall between 0 (no association) and positive values (predictive power). However, to my surprise I consistently got negative beta values for some sets of predictor variables, especially those that I didn't actually expect to be good predictors.

I'm wondering if regression to the mean might be a sufficient explanation for this?

score 2 · Answer 1 · answered Jun 12 '13 at 19:06

Regression to the mean in a predictive setting generally implies that a model overfitted in the training sample (DS) will have "lower" concordance in the validation sample (VS). "Lower" here means closer to the null: overfitting biases the training-sample estimate away from the null, and that bias does not carry over to new data. Here, I'm defining concordance as

$$c = \rho(\hat{Y}, Y) \times \sqrt{\frac{\mbox{var}(Y)}{\mbox{var}(\hat{Y})}}.$$
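As a side note on this definition: since $\rho(\hat{Y}, Y)\sqrt{\mbox{var}(Y)/\mbox{var}(\hat{Y})} = \mbox{cov}(\hat{Y}, Y)/\mbox{var}(\hat{Y})$, the concordance is the ordinary least squares slope from regressing $Y$ on $\hat{Y}$, which is the beta value the question tests in VS. A minimal sketch with NumPy follows; the data are simulated purely for illustration, and the function name `concordance` is an assumption, not something from the thread.

```python
import numpy as np

def concordance(y, y_hat):
    """c = corr(y_hat, y) * sqrt(var(y) / var(y_hat))."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    rho = np.corrcoef(y_hat, y)[0, 1]
    return rho * np.sqrt(np.var(y, ddof=1) / np.var(y_hat, ddof=1))

# Hypothetical predictions and outcomes, simulated for illustration only.
rng = np.random.default_rng(0)
y_hat = rng.normal(size=200)
y = 0.5 * y_hat + rng.normal(size=200)

c = concordance(y, y_hat)
slope = np.polyfit(y_hat, y, 1)[0]  # OLS slope of y regressed on y_hat

print(c, slope)  # the two values agree up to floating-point error
```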

If this concordance is negative in a test sample, it's probably a totally spurious result. I doubt that, over simulated resamples of truly independent and identically distributed data, you'd find that $\Pr(c < 0) = 1$.
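One way to check this is a small simulation of the procedure described in the question. The sketch below is an assumption-laden illustration, not the asker's actual analysis: it draws predictors and outcome as independent standard normal noise (so there is no true association), uses smaller dimensions than the question for speed, and all function names are hypothetical. Under this null the sign of the validation slope is symmetric, so the fraction of negative values should land near one half rather than near one.

```python
import numpy as np

def univariate_betas(X, y):
    """Slope from regressing y on each column of X separately."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return Xc.T @ yc / (Xc ** 2).sum(axis=0)

def validation_slope(X_ds, y_ds, X_vs, y_vs):
    """Build the risk index from DS betas, then regress y on it in VS."""
    beta = univariate_betas(X_ds, y_ds)
    index = X_vs @ beta                      # risk index in VS
    ic = index - index.mean()
    return ic @ (y_vs - y_vs.mean()) / (ic @ ic)

def fraction_negative(n=100, p=50, reps=500, seed=0):
    """Share of simulated replications with a negative validation slope."""
    rng = np.random.default_rng(seed)
    negatives = 0
    for _ in range(reps):
        # Null scenario (assumption): predictors and outcome are pure noise.
        X_ds, y_ds = rng.normal(size=(n, p)), rng.normal(size=n)
        X_vs, y_vs = rng.normal(size=(n, p)), rng.normal(size=n)
        if validation_slope(X_ds, y_ds, X_vs, y_vs) < 0:
            negatives += 1
    return negatives / reps

frac = fraction_negative()
print(frac)
```

Replacing the noise outcome with one that depends on some predictors, or sampling DS and VS from a common pool in a way that breaks independence, shows how the fraction moves under other scenarios.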

Your method in general seems a little bizarre and highly prone to overfitting, unless your sample is of size $n = 20 \times p$, i.e. approximately 6000 or bigger. There are other ways of pooling a myriad of predictors of potential interest, such as LASSO or PLS, which sound more relevant to your application. In psychometrics, for instance, these specialized regression techniques are used to create "scores" for behavioral diagnoses from large survey tools.

AdamO

62,637

Thanks for this answer. I'm quite aware that my approach is prone to overfitting. However, I don't intend to find the best predictor; I simply want to establish whether a very simple metric such as the one I have described is significantly associated with my dependent variable in an independent sample. This would allow me to make the statement that my predictors overall are linked to my dependent variable.
Would be great if you could point me to some psychology research that achieved the same through LASSO.
– aciM Jun 12 '13 at 19:26

Quite the opposite, in fact. Suppose you have two almost perfectly negatively correlated factors in a sample, such as $E[Y|X,W] = X + W$ with $X \approx -W$. Regressing $Y$ upon $X$ and $W$ will give a near perfect fit, but $\hat{Y}$ is nearly 0 for all predicted values. This predicted value has low concordance. – AdamO Jun 12 '13 at 19:36

It would be far better to use an approach based on good statistical principles. Lasso and elastic net are two worth trying, or just use ordinary quadratic penalization (ridge regression) and keep all 300 variables in the model. Note that your problem is univariate, as you have one $Y$ variable; it is a multivariable regression problem. Also note that if you don't use shrinkage, the only kind of dimensionality reduction that is usually reasonable is an approach masked to $Y$, e.g. variable clustering, redundancy analysis, or nonlinear principal components. – Frank Harrell Aug 11 '13 at 20:36
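For readers who want to try the quadratic-penalization suggestion, here is a minimal sketch of ridge regression in closed form with NumPy, fitted in DS and used to predict in VS. Everything in it is an illustrative assumption: the data are simulated, the helper names `ridge_fit` and `ridge_predict` are hypothetical, and the penalty values are arbitrary. In practice the penalty would be chosen by cross-validation within DS.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients on standardized predictors."""
    x_mean, x_std = X.mean(axis=0), X.std(axis=0)
    Xs = (X - x_mean) / x_std
    y_mean = y.mean()
    gram = Xs.T @ Xs + lam * np.eye(Xs.shape[1])
    coef = np.linalg.solve(gram, Xs.T @ (y - y_mean))
    return coef, (x_mean, x_std, y_mean)

def ridge_predict(X, coef, scaling):
    """Apply the DS scaling and coefficients to new data."""
    x_mean, x_std, y_mean = scaling
    return y_mean + ((X - x_mean) / x_std) @ coef

# Simulated stand-ins for DS and VS (assumption: 5 of 300 predictors matter).
rng = np.random.default_rng(1)
n, p = 150, 300
true_beta = np.zeros(p)
true_beta[:5] = 1.0
X_ds = rng.normal(size=(n, p))
y_ds = X_ds @ true_beta + rng.normal(size=n)
X_vs = rng.normal(size=(n, p))
y_vs = X_vs @ true_beta + rng.normal(size=n)

coef_light, _ = ridge_fit(X_ds, y_ds, lam=1.0)           # weak shrinkage
coef_heavy, scaling = ridge_fit(X_ds, y_ds, lam=1000.0)  # strong shrinkage
y_hat_vs = ridge_predict(X_vs, coef_heavy, scaling)

# Association between the ridge score and the outcome in VS.
print(np.corrcoef(y_hat_vs, y_vs)[0, 1])
```

The penalty keeps the system solvable even though there are more predictors than observations, and a larger penalty shrinks the coefficient vector further toward zero.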
