There are existing questions on this site about overfitting, for example "Relationship between model over fitting and number of parameters" and "Can a Linear Regression Model (with no higher order coefficients) over-fit?".
I don't see my specific question answered there, so I believe this post is warranted.
Setup: Data are sampled from $Y = \beta_0 + \beta_1 X + \epsilon$ where $\epsilon \sim N(0,1)$. Sampling $n$ times gives $(X_1, Y_1), \ldots, (X_n, Y_n)$. We then introduce a new variable $Z \sim N(0,1)$, independent of $X$ and $\epsilon$, and sample it $n$ times as well, so we end up with the data $(X_1, Z_1, Y_1), \ldots, (X_n, Z_n, Y_n)$.
We fit two linear models by least squares: the first uses only $X$, i.e. $Y = \beta_0 + \beta_1 X$; the second also includes $Z$, i.e. $Y = \beta_0 + \beta_1 X + \beta_2 Z$.
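To make the setup concrete, here is a quick simulation sketch (numpy only; the particular coefficient values, sample size and seed are arbitrary choices of mine, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
beta0, beta1 = 1.0, 2.0           # arbitrary true coefficients

X = rng.normal(size=n)
eps = rng.normal(size=n)          # epsilon ~ N(0, 1)
Y = beta0 + beta1 * X + eps       # true model involves X only
Z = rng.normal(size=n)            # irrelevant predictor, independent of everything else

# Design matrices with an intercept column for the two models.
D1 = np.column_stack([np.ones(n), X])      # model 1: Y ~ X
D2 = np.column_stack([np.ones(n), X, Z])   # model 2: Y ~ X + Z

# Ordinary least squares fits.
coef1, *_ = np.linalg.lstsq(D1, Y, rcond=None)
coef2, *_ = np.linalg.lstsq(D2, Y, rcond=None)

rss1 = np.sum((Y - D1 @ coef1) ** 2)
rss2 = np.sum((Y - D2 @ coef2) ** 2)
print(rss1, rss2)
```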
I can see how the sampled $Z_i$ will typically have some non-zero sample correlation with the $\epsilon_i$ drawn from the true distribution, which intuitively might lead to overfitting.
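To state that intuition a bit more precisely (this is my own formalisation, so it may be off): even though $Z$ and $\epsilon$ are independent, their sample correlation

$$\hat\rho_{Z\epsilon} = \frac{\sum_{i=1}^n (Z_i - \bar Z)(\epsilon_i - \bar\epsilon)}{\sqrt{\sum_{i=1}^n (Z_i - \bar Z)^2}\,\sqrt{\sum_{i=1}^n (\epsilon_i - \bar\epsilon)^2}}$$

is non-zero with probability one for any finite $n$, and presumably the fitted $\hat\beta_2$ picks up this chance correlation.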
However, in terms of plain linear algebra, I don't see how to argue whether or not overfitting will occur. For example, if the number of parameters in the linear regression model exceeds the number of data points, then the linear system has either no solution or infinitely many; that tells us nothing about a specific case or about how likely the latter option is, only that it can occur.
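To illustrate the more-parameters-than-data-points situation I mean (a separate toy example, not my actual two-variable setup; the numbers are arbitrary):

```python
# Toy p > n case: more coefficients than observations.
import numpy as np

rng = np.random.default_rng(2)
n, p = 5, 10                        # more parameters than data points
D = rng.normal(size=(n, p))         # generic design matrix
y = rng.normal(size=n)

# lstsq returns the minimum-norm least-squares solution.
coef, *_ = np.linalg.lstsq(D, y, rcond=None)
print(np.sum((y - D @ coef) ** 2))  # essentially zero in this particular draw
```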
I also don't see why, statistically, we should expect the least-squares loss to be lower for the second OLS model than for the first, which would imply overfitting, since the true distribution is of the first form.
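Here is the kind of comparison I have in mind, as a Monte Carlo sketch (again numpy only; the number of replications, sample sizes and coefficients are arbitrary choices of mine). For both models I record the in-sample residual sum of squares and the squared error on fresh data from the same distribution:

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_test, reps = 50, 1000, 2000
beta0, beta1 = 1.0, 2.0

def fit_rss(D, y):
    """OLS fit; return coefficients and in-sample residual sum of squares."""
    coef, *_ = np.linalg.lstsq(D, y, rcond=None)
    return coef, np.sum((y - D @ coef) ** 2)

train_rss = np.zeros((reps, 2))
test_mse = np.zeros((reps, 2))

for r in range(reps):
    # Training data from the true model Y = beta0 + beta1*X + eps, plus irrelevant Z.
    X, Z, eps = rng.normal(size=(3, n))
    Y = beta0 + beta1 * X + eps
    # Fresh test data from the same distribution.
    Xt, Zt, et = rng.normal(size=(3, n_test))
    Yt = beta0 + beta1 * Xt + et

    D1, D1t = np.column_stack([np.ones(n), X]), np.column_stack([np.ones(n_test), Xt])
    D2, D2t = np.column_stack([D1, Z]), np.column_stack([D1t, Zt])

    for j, (D, Dt) in enumerate([(D1, D1t), (D2, D2t)]):
        coef, rss = fit_rss(D, Y)
        train_rss[r, j] = rss
        test_mse[r, j] = np.mean((Yt - Dt @ coef) ** 2)

print("mean in-sample RSS:", train_rss.mean(axis=0))  # model 2's is never larger in my runs
print("mean test MSE:     ", test_mse.mean(axis=0))   # model 2's comes out slightly larger on average
```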
I hope someone can explain why and how linear regression, without polynomial terms, can overfit a dataset, with a statistical justification rather than just intuition.