
I am reading up on linear regression from MIT 16.850.

Here is how the lecture goes:

  1. Given: $Y_{n,1}$ (targets), $X_{n,p}$ (data), $t_{p,1}$ (the parameters I'm optimizing over); true model: $Y = \beta X + \epsilon$
  2. I want to minimize $\|Y - Xt\|^2_2$ over $t$
  3. The solution is: $\operatorname{argmin}_t\|Y - Xt\|^2_2 = \hat{\beta} = (X^TX)^{-1}X^TY$
  4. This won't work if the rank of $X^TX$ is less than $p$, which is to say $X^TX$ is not invertible. If $X^TX$ is not invertible, then if $j$ is a solution, so is $j + \lambda v$ for any $v$ in the nullspace of $X^TX$. Example: say I have 2 data points and 3 variables; then there are infinitely many solutions (see the sketch after this list)
  5. Now, the professor says that even though we cannot talk about $\hat{\beta}$, we can still talk about $X\hat{\beta}$; that is, even though we cannot define $\hat{\beta} = (X^TX)^{-1}X^TY$, we can still define $X\hat{\beta} = X(X^TX)^{-1}X^TY$ even when $X$ is low rank. Why? The professor does give some arguments after this about $X\hat{\beta}$ being the projection of $Y$ onto the hyperplane defined by the linear combinations of the data vectors, but I do not understand them entirely
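To make step 4 concrete, here is a minimal numpy sketch of what I mean (the particular $X$ and the nullspace vector $v$ are just illustrative):

```python
import numpy as np

# 2 data points, 3 variables, as in step 4: X^T X is 3x3 but has rank 2.
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
Y = np.array([1.0, 2.0])

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))        # 2 < p = 3, so X^T X is singular

# (X^T X)^{-1} does not exist, but the pseudoinverse always does,
# giving one particular least-squares solution:
beta_hat = np.linalg.pinv(X) @ Y

# v is in the nullspace of X (and of X^T X) for this particular X:
v = np.array([1.0, -2.0, 1.0])
print(np.allclose(X @ v, 0))             # True

# beta_hat + lambda * v is also a solution, with the same fitted values:
print(np.allclose(X @ (beta_hat + 5.0 * v), X @ beta_hat))  # True
```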

Can you please help me understand why $X\hat{\beta}$ is defined and what its significance is?

  • Something is amiss, because that $(X^TX)^{-1}$ won't exist when $X$ lacks full rank. Might the professor have meant something about the generalized inverse, $(X^TX)^{-}?$ Note the slightly different exponent. – Dave Feb 22 '24 at 16:20
  • Yes, $(X^TX)^{-1}$ won't exist in that case, and he does point that out. But he says that we can still talk about $X(X^TX)^{-1}X^TY$. I have updated the question with the actual timestamp where he says that. – figs_and_nuts Feb 22 '24 at 16:24
  • 1
    I think he means the predicted value can exist, even if the usual formula. $X(X^TX)^{-1}X^Ty$ does not exist. If $X$ does not have full rank, then $(X^TX)^{-1}$ does not exist, and a formula involving that expression does not make sense. – Dave Feb 22 '24 at 16:37
  • 3
    See https://stats.stackexchange.com/questions/140848 or https://stats.stackexchange.com/questions/63143. The point is that $X\hat\beta$ is the projection of $Y$ onto the subspace spanned by the columns of $X.$ Because that subspace is closed and convex, the projection exists and is unique. – whuber Feb 22 '24 at 16:49
  • Are you sure the video defines $X\hat\beta = X(X^TX)^{-1}X^Ty?$ There's an OLS solution, $\hat\beta,$ whether $X$ has full rank or not, so $X\hat\beta$ existing does not require $X$ to have full rank, but that $(X^TX)^{-1}X^T$ bothers me if $X$ lacks full rank. – Dave Feb 22 '24 at 18:27

1 Answer


Rather than $Y=\beta X+\varepsilon,$ you need $Y= X\beta+\varepsilon.$ The matrix $X$ has $n$ rows and $p$ columns and $\beta$ has $p$ rows and just one column, so $X$ needs to be on the left and $\beta$ on the right.

If the columns of $X$ are not linearly independent, so that the matrix $X^\top X$ is not invertible, then the mapping $\beta\mapsto X\beta$ (i.e. the input is $\beta$ and the output is $X\beta~$) is not one-to-one.

Among all linear combinations of the columns of $X$ there is one that is closest to $Y.$ That one is the vector $\widehat Y$ of fitted values, and that is the one that would be called $X\widehat\beta.$ Since the columns of $X$ are linearly dependent, there is more than one way to write $\widehat Y$ as a linear combination of the columns of $X.$ Indeed, suppose the $n\times1$ zero vector can be written as a linear combination of the columns of $X$ in which some coefficients are not $0,$ and call that $p\times1$ vector of coefficients $\widetilde{\beta\,}.$ Then $X\widetilde{\beta\,}=0,$ so adding it to $\widehat\beta$ gives $\widehat\beta+\widetilde{\beta\,},$ and the resulting vector of fitted values, $X\left(\widehat\beta+\widetilde{\beta\,}\right),$ is the same as $X\widehat\beta=\widehat Y.$

Thus more than one value of $\beta$ will serve as a least-squares solution, but there is only one value of $X\beta$ that is closer to $Y$ than is any other linear combination of the columns of $X.$
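Here is a minimal numpy sketch of that argument (not from the lecture; the design matrix is made up, and the generalized inverse $(X^TX)^{-}$ is taken to be numpy's pseudoinverse, `np.linalg.pinv`):

```python
import numpy as np

rng = np.random.default_rng(0)

# A rank-deficient design: the third column is the sum of the first two,
# so the columns of X are linearly dependent (n = 10, p = 3, rank 2).
A = rng.standard_normal((10, 2))
X = np.column_stack([A, A[:, 0] + A[:, 1]])
Y = rng.standard_normal(10)

# One least-squares solution, via the Moore-Penrose pseudoinverse:
beta_hat = np.linalg.pinv(X) @ Y

# beta_tilde writes the zero vector as a nontrivial combination of the columns:
beta_tilde = np.array([1.0, 1.0, -1.0])        # col1 + col2 - col3 = 0
print(np.allclose(X @ beta_tilde, 0))          # True

# Adding beta_tilde gives a different beta with identical fitted values:
print(np.allclose(X @ (beta_hat + beta_tilde), X @ beta_hat))  # True

# Y_hat is the unique projection of Y onto the column space of X; it can
# be written with a generalized inverse as X (X^T X)^- X^T Y:
H = X @ np.linalg.pinv(X.T @ X) @ X.T          # projection ("hat") matrix
print(np.allclose(H @ Y, X @ beta_hat))        # True
```

The last check is the point of the answer: the coefficient vector is not unique, but the projection $\widehat Y$ is, and it can be computed without ever inverting $X^TX.$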