I am reading up on linear regression from MIT 16.850.
Here is how the lecture goes:
- Given: $Y_{n,1}$ (targets), $X_{n, p}$ (data), $t_{p, 1}$ (the parameter vector I'm optimizing over). True model: $Y = X\beta + \epsilon$, where $\beta_{p,1}$ is the true parameter vector and $\epsilon$ is noise.
- I want to minimize $\|Y - Xt\|^2_2$ over $t$
- The solution is: $\operatorname{argmin}_t\|Y - Xt\|^2_2 = \hat{\beta} = (X^TX)^{-1}X^TY$
- This won't work if the rank of $X^TX$ is less than $p$, which is to say the matrix $X^TX$ is not invertible. If $X^TX$ is not invertible, then if $j$ is a solution, so is $j + \lambda v$ for any scalar $\lambda$ and any $v$ in the nullspace of $X^TX$ (which is the same as the nullspace of $X$), so there are infinitely many solutions. Example: with 2 data points and 3 variables, $X^TX$ is $3 \times 3$ with rank at most 2.
- Now, the professor says that even though we cannot talk about $\hat{\beta}$, we can still talk about $X\hat{\beta}$; that is, even though we cannot define $\hat{\beta} = (X^TX)^{-1}X^TY$, we can still define $X\hat{\beta} = X(X^TX)^{-1}X^TY$, even in the low-rank case. Why? The professor does give some arguments after this, to the effect that $X\hat{\beta}$ is the projection of $Y$ onto the subspace spanned by linear combinations of the data vectors (the columns of $X$), but I do not understand it entirely (see the numerical sketch after this list).
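To make the question concrete, here is a quick NumPy sketch I wrote myself (it is not from the lecture; the random data, the seed, and the choice $\lambda = 5$ are just for illustration). It builds the 2-points / 3-variables case and checks numerically that two different minimizers $t$ give the same fitted values $Xt$:

```python
# A minimal numerical check (my own, not from the lecture) of the
# rank-deficient example above: n = 2 data points, p = 3 variables,
# so X^T X is 3x3 with rank at most 2 and cannot be inverted.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 3))          # n = 2, p = 3
Y = rng.normal(size=(2, 1))

# One minimizer: the minimum-norm least-squares solution via the pseudoinverse.
t1 = np.linalg.pinv(X) @ Y

# A second minimizer: add a vector from the nullspace of X
# (equivalently, the nullspace of X^T X).
_, _, Vt = np.linalg.svd(X)
v = Vt[-1:].T                        # right singular vector with zero singular value
t2 = t1 + 5.0 * v                    # j + lambda * v with lambda = 5

# Both achieve the same (minimal) residual norm; here it is 0 because with
# n = 2 < p = 3 the columns of X already span all of R^2.
print(np.linalg.norm(Y - X @ t1), np.linalg.norm(Y - X @ t2))

# The fitted values X t are identical for both minimizers, since X v = 0.
print(np.allclose(X @ t1, X @ t2))   # True

# X beta_hat is the orthogonal projection of Y onto the column space of X;
# with the pseudoinverse it is X (X^T X)^+ X^T Y, and in this example the
# projection is Y itself because col(X) = R^2.
P = X @ np.linalg.pinv(X.T @ X) @ X.T
print(np.allclose(P @ Y, X @ t1))    # True
print(np.allclose(P @ Y, Y))         # True here, since col(X) = R^2
```

This matches the professor's claim that $X\hat{\beta}$ comes out the same no matter which minimizer you pick, but I would like to understand the projection argument behind it.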
Can you please help me understand why $X\hat{\beta}$ is well defined and what its significance is?