There are existing questions on this site relating to overfitting; see Relationship between model over fitting and number of parameters and Can a Linear Regression Model (with no higher order coefficients) over-fit?

I don't see my specific question answered there, so I believe this post is warranted.

Setup: Data is sampled from $Y = \beta_0 + \beta_1 X + \epsilon$, where $\epsilon \sim N(0,1)$. Sampling, we obtain $(X_1, Y_1), \ldots, (X_n, Y_n)$. We then introduce a new variable $Z \sim N(0,1)$, independent of $X$ and $\epsilon$, and sample it $n$ times as well, so we end up with the data $(X_1, Z_1, Y_1), \ldots, (X_n, Z_n, Y_n)$.

We fit two linear models: one to the first set of data, i.e. $Y = \beta_0 + \beta_1 X$, and one to the second set, i.e. $Y = \beta_0 + \beta_1 X + \beta_2 Z$.
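
To fix ideas, here is a minimal simulation sketch of this setup. The parameter values $\beta_0 = 1$, $\beta_1 = 2$ and the sample size are illustrative assumptions, not part of the question:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative true parameters (beta_0 = 1, beta_1 = 2 are arbitrary choices)
n, b0, b1 = 100, 1.0, 2.0

X = rng.normal(size=n)
eps = rng.normal(size=n)   # epsilon ~ N(0, 1)
Z = rng.normal(size=n)     # pure noise, independent of X and eps
Y = b0 + b1 * X + eps      # the true model does not involve Z

# Design matrices: intercept + X, and intercept + X + Z
D1 = np.column_stack([np.ones(n), X])
D2 = np.column_stack([np.ones(n), X, Z])

# OLS fits; lstsq also returns the residual sum of squares
coef1, rss1, *_ = np.linalg.lstsq(D1, Y, rcond=None)
coef2, rss2, *_ = np.linalg.lstsq(D2, Y, rcond=None)

print("RSS without Z:", rss1[0])
print("RSS with Z:   ", rss2[0])   # never larger than rss1
```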

I can see how there may be some non-zero sample correlation between the $Z_i$ and the $\epsilon_i$ drawn from the true distribution, which intuitively could lead to overfitting.

However, in terms of plain linear algebra, I don't see how we can say anything about whether overfitting will occur. For example, if the number of parameters in the linear regression model is larger than the number of data points, then the system of equations has either no solutions or infinitely many; this tells us nothing about a specific case, or about the likelihood of the latter option occurring, only that it could occur.

I also don't see why, statistically, we should expect the least-squares loss to be lower for the second OLS model than for the first; that would imply overfitting, since the true distribution has the first form.
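
One way to make this comparison concrete is to repeat the experiment many times and compare in-sample and out-of-sample mean squared error for both fits. A hedged sketch along those lines (the replication count and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, b0, b1, reps = 30, 1.0, 2.0, 1000

in_sample = np.zeros((reps, 2))
out_sample = np.zeros((reps, 2))

for r in range(reps):
    # Training data
    X = rng.normal(size=n)
    Z = rng.normal(size=n)
    Y = b0 + b1 * X + rng.normal(size=n)
    # Fresh test data from the same process
    Xt = rng.normal(size=n)
    Zt = rng.normal(size=n)
    Yt = b0 + b1 * Xt + rng.normal(size=n)

    D1 = np.column_stack([np.ones(n), X])
    D2 = np.column_stack([np.ones(n), X, Z])
    c1, *_ = np.linalg.lstsq(D1, Y, rcond=None)
    c2, *_ = np.linalg.lstsq(D2, Y, rcond=None)

    in_sample[r] = [np.mean((Y - D1 @ c1) ** 2), np.mean((Y - D2 @ c2) ** 2)]
    T1 = np.column_stack([np.ones(n), Xt])
    T2 = np.column_stack([np.ones(n), Xt, Zt])
    out_sample[r] = [np.mean((Yt - T1 @ c1) ** 2), np.mean((Yt - T2 @ c2) ** 2)]

print("mean in-sample MSE (X only, X+Z):    ", in_sample.mean(axis=0))
print("mean out-of-sample MSE (X only, X+Z):", out_sample.mean(axis=0))
```

In such runs the in-sample MSE of the $X{+}Z$ fit is never higher, while its out-of-sample MSE is typically slightly worse, which is exactly the overfitting pattern asked about.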

I hope someone can explain why and how linear regression without polynomial terms can overfit a dataset, justifying the answer with statistics rather than just intuition.

  • I do not know if that is already statistics or still intuition to you, but to minimize the OLS loss w.r.t. $\beta_0,\beta_1,\beta_2$ instead of just w.r.t. $\beta_0,\beta_1$ gives you an additional free parameter to optimize over, so that will leave the loss, at worst, what it is and generally reduce it (see the sketch after these comments). – Christoph Hanck Dec 12 '22 at 07:09
  • Depends on the definition of "overfitting." But certainly, estimating a model that includes a variable that has little or no effect could be called "overfitting". An effect of such overfitting, as with overfitting by any definition, is to introduce extraneous variability in the predicted values, making predictions generally worse. – BigBendRegion Dec 12 '22 at 13:35
  • @ChristophHanck Intuitively that is the case. But it should be possible to calculate the loss in both cases and show it algebraically. – Spencer Gibson Dec 12 '22 at 16:00
  • https://math.stackexchange.com/questions/1976747/prove-that-r2-cannot-decrease-when-adding-a-variable – Christoph Hanck Dec 13 '22 at 05:07
  • @ChristophHanck That's perfect. Thank you. – Spencer Gibson Dec 15 '22 at 01:39
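
For reference, the algebraic argument behind the first comment and the linked proof can be sketched directly: the three-parameter model nests the two-parameter one (set $\beta_2 = 0$), so its minimized loss can never be larger. With $(\hat\beta_0, \hat\beta_1)$ denoting the two-parameter OLS minimizer,

$$
\min_{\beta_0,\beta_1,\beta_2} \sum_{i=1}^n \left(Y_i - \beta_0 - \beta_1 X_i - \beta_2 Z_i\right)^2
\;\le\;
\sum_{i=1}^n \left(Y_i - \hat\beta_0 - \hat\beta_1 X_i - 0\cdot Z_i\right)^2
\;=\;
\min_{\beta_0,\beta_1} \sum_{i=1}^n \left(Y_i - \beta_0 - \beta_1 X_i\right)^2.
$$

The inequality is strict whenever the $Z_i$ are not orthogonal to the residuals of the first fit, which for continuous noise happens with probability 1.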

0 Answers