I have recently been wondering about the following question.
In a standard linear regression problem ($y = X\beta$, solved for $\beta$), the solution is $\beta = X^{-1}y$ when $X$ is square and invertible; when $X$ is tall with full column rank, $y = X\beta$ generally has no exact solution, and the least-squares solution is $\beta = (X^TX)^{-1}X^Ty$.
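For concreteness, here is a quick numerical check (a sketch assuming numpy, with arbitrary example dimensions) that these two closed forms agree with a library least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)

# Square, invertible case: beta = X^{-1} y solves y = X beta exactly.
X_sq = rng.standard_normal((4, 4))
y_sq = rng.standard_normal(4)
beta_inv = np.linalg.inv(X_sq) @ y_sq
assert np.allclose(X_sq @ beta_inv, y_sq)

# Tall, full-column-rank case: beta = (X^T X)^{-1} X^T y
# matches the least-squares solution from np.linalg.lstsq.
X = rng.standard_normal((20, 3))
y = rng.standard_normal(20)
beta_normal = np.linalg.inv(X.T @ X) @ (X.T @ y)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_normal, beta_lstsq)
```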
However, I wonder whether there is another way to interpret this expression, for example viewing it as the inverse covariance $(X^TX)^{-1}$ (at least up to scaling, when the columns are centered) multiplied by $X^Ty$. What is the meaning of $X^Ty$ that makes this product come out to be the solution?
It seems that $X^Ty$ is just the vector of dot products between each feature column of $X$ and the label vector $y$, but I don't know whether there is a better explanation.
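To make the reading I have in mind explicit (again a sketch assuming numpy, with made-up data), $X^Ty$ is componentwise the dot product of each column with $y$, and after centering, $X^TX/n$ and $X^Ty/n$ are the sample covariances of the features with each other and with $y$:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 3))
y = rng.standard_normal(20)

# X^T y, column by column.
cross = X.T @ y
per_column = np.array([X[:, j] @ y for j in range(X.shape[1])])
assert np.allclose(cross, per_column)

# With centered columns, (X^T X)/n is the (biased) sample covariance of the features.
Xc = X - X.mean(axis=0)
n = X.shape[0]
assert np.allclose((Xc.T @ Xc) / n, np.cov(Xc, rowvar=False, bias=True))
```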