Imagine working on a linear multiple regression problem with a design matrix $\Phi$ built from some independent variables $x_k$, $k\in[1, r]$. The goal is to find an equation that explains the "true" relationship between a dependent variable $y$ and the independent variables $x_k$, e.g. from a recorded physical experiment. The model is defined as \begin{equation} \bar{y}=\Phi\theta \end{equation} with $\theta$ being the regression coefficients and $\bar{y}$ being the predictions of the estimator. Why do I need penalized regression or subset selection?
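To make the setup concrete, here is a minimal sketch of what I mean (the candidate terms, sample size, and noise level are arbitrary choices of mine, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(-1, 1, size=n)

# Candidate library of terms built from the independent variable;
# some columns may turn out to be unnecessary.
Phi = np.column_stack([x, np.sin(x), x**2])
y = 2.0 * x + rng.normal(scale=0.1, size=n)  # "true" model uses only x

theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # least-squares fit
y_bar = Phi @ theta                              # predictions of the estimator
print(theta)
```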
From what I have read, additional unnecessary columns in $\Phi$ add variance to the estimated parameters $\theta$ but no bias (Kennedy, *A Guide to Econometrics*, p. 94). (I assume this is because any correlation between an unnecessary column and the relevant variables will result in some non-zero coefficient for that term, and coefficients can act in opposing directions, as with $\sin(x)$ and $x$ at small $x$. Is that correct? If you could help me understand this through the covariance matrix of the estimator, that would be great.)
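As far as I understand, under homoskedastic noise with variance $\sigma^2$ the covariance of the least-squares estimator is $\operatorname{Var}(\hat\theta)=\sigma^2(\Phi^\top\Phi)^{-1}$, and this is where the extra variance should show up. Here is a small simulation that I believe illustrates the effect; the number of junk columns and how strongly I correlate them with $x_1$ are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, theta_true = 50, 2000, 2.0
est_small, est_big = [], []

for _ in range(reps):
    x1 = rng.normal(size=n)
    # unnecessary columns, deliberately correlated with x1
    junk = 0.7 * x1[:, None] + 0.7 * rng.normal(size=(n, 5))
    y = theta_true * x1 + rng.normal(size=n)

    # estimate of theta_1 with and without the unnecessary columns
    est_small.append(np.linalg.lstsq(x1[:, None], y, rcond=None)[0][0])
    est_big.append(np.linalg.lstsq(np.column_stack([x1, junk]), y, rcond=None)[0][0])

# both means are close to 2.0 (no bias), but the variance grows with junk columns
print("only x1:   mean %.3f, var %.4f" % (np.mean(est_small), np.var(est_small)))
print("with junk: mean %.3f, var %.4f" % (np.mean(est_big), np.var(est_big)))
```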
Why am I not able to remove this variance by using cross-validation and averaging the coefficients of the best subsets? Isn't their mean free of bias? I read about this approach in Brunton & Kutz, *Data-Driven Science & Engineering* (http://databookuw.com/databook.pdf) but could not find it anywhere else. The book also suggests thresholding small coefficients.
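This is the procedure I have in mind, sketched in code (the function name, the fold splitting, and the threshold value are all my own hypothetical choices, not taken from the book):

```python
import numpy as np

def cv_averaged_threshold(Phi, y, n_splits=5, threshold=0.1, seed=0):
    """Fit ordinary least squares on each training fold, average the
    coefficients over folds, then zero out the small ones (hard thresholding,
    as the book suggests)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_splits)
    coefs = []
    for k in range(n_splits):
        # train on all folds except the k-th
        train = np.concatenate([f for j, f in enumerate(folds) if j != k])
        theta_k, *_ = np.linalg.lstsq(Phi[train], y[train], rcond=None)
        coefs.append(theta_k)
    theta_avg = np.mean(coefs, axis=0)
    theta_avg[np.abs(theta_avg) < threshold] = 0.0  # hard thresholding
    return theta_avg
```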
I read https://stats.stackexchange.com/questions/472202/when-to-use-regularization-vs-cross-validation and understand that cross-validation does not do the same thing as regularization. But shouldn't the least-squares estimate converge to the true parameters as the amount of data goes to infinity?
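This is the convergence I mean, in a quick experiment of my own (the coefficient values and sample sizes are arbitrary): the estimates of the unnecessary coefficients seem to shrink toward zero as $n$ grows, so I don't see where penalization is still needed.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true = np.array([2.0, 0.0, 0.0])  # one relevant term, two unnecessary ones

for n in [50, 500, 5000, 50000]:
    Phi = rng.normal(size=(n, 3))
    y = Phi @ theta_true + rng.normal(size=n)
    theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    print(n, np.round(theta_hat, 3))  # approaches [2, 0, 0] as n grows
```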
If you know of a good piece of literature covering this, please share it. I am a mechanical engineering student and this topic goes well beyond anything we have covered in class. Thanks in advance!