
Recently, I was wondering about this question.

In a standard linear regression problem ($y = X\beta$, which we solve for $\beta$), the solution is $\beta = X^{-1}y$ when $X$ is square and invertible, and the least-squares solution is $\hat\beta = (X^T X)^{-1}X^T y$ when $X$ has full column rank.

However, I wonder whether there is another interpretation of this expression: for example, viewing it as the inverse covariance $(X^TX)^{-1}$ multiplied by $X^Ty$. In that case, what is the meaning of $X^Ty$ that makes this the solution?

It seems that $X^Ty$ is just the vector of dot products between each feature column and the labels $y$. I don't know whether there is a better explanation.
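For concreteness, here is a minimal NumPy sketch of the quantities involved (the data is random and purely illustrative, not from any particular problem):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # 50 observations, 3 features
y = rng.normal(size=50)

# Closed-form least-squares solution: (X^T X)^{-1} X^T y
beta = np.linalg.solve(X.T @ X, X.T @ y)

# It matches NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta, beta_lstsq))  # True

# X^T y is just the dot product of each feature column with y.
print(np.allclose(X.T @ y, [X[:, j] @ y for j in range(X.shape[1])]))  # True
```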

3 Answers


I'll try to explain it from the linear algebra point of view, but I'm not sure if it's what you need.

First of all, when the system is inconsistent, we know that $\hat y$ is the orthogonal projection of $y$ onto the column space of $X$; in other words, $\hat y = X \hat \beta$. Secondly, we know that the residual $y - \hat y$ is the orthogonal component, which is orthogonal to the column space of $X$.

Moreover, orthogonality means that if a vector $a$ is orthogonal to a vector $b$, their dot product $a^T b$ is $0$. Finally, to express orthogonality to the columns of $X$ (rather than its rows), we multiply by the transpose $X^T$.

So we have the equation $X^T(y - X\hat \beta) = 0$.

Expanding the brackets and rearranging, we obtain the same equation you've been asking about:

$\hat \beta = (X^TX)^{-1}X^Ty$
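As a quick numerical check of this derivation, a small NumPy sketch (random data, purely for illustration) verifying that the residual is orthogonal to every column of $X$:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residual = y - X @ beta_hat

# The residual is orthogonal to every column of X: X^T (y - X beta_hat) = 0.
print(np.allclose(X.T @ residual, 0))  # True
```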

olejnik_

People sometimes break that quantity up a little differently and call $\mathbf{P} = X(X^T X)^{-1}X^T$ the projection matrix, influence matrix, or hat matrix. You can think of the projection matrix as mapping between the actual $y$ values and the predicted ones.

The projection matrix has a number of handy properties. In particular, the $k$th element of its main diagonal ($\mathbf{P}_{k,k}$) contains the leverage score for the $k$th piece of data, which can be a useful piece of diagnostic information.
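For illustration, a small NumPy sketch (random data; my own example, not part of the original answer) computing the hat matrix, the fitted values, and the leverage scores:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)

# Hat matrix P = X (X^T X)^{-1} X^T maps y to the fitted values.
P = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = P @ y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(y_hat, X @ beta_hat))  # True

# Leverage scores are the diagonal entries of P; each lies in [0, 1].
leverage = np.diag(P)
print(leverage.min() >= 0, leverage.max() <= 1)  # True True
```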

Matt Krause
  • Now you get me thinking. What's the meaning of the sum of the diagonal of the projection matrix? – horaceT Jul 07 '16 at 23:36
  • @horaceT The trace of the hat matrix is the number of free parameters (model d.f.). This applies to models you can write in the linear form $\hat{y}=Ay$, which includes a lot of models that are not plain linear regression models. Many smoothers can be written in this form, for example. – Glen_b Jul 07 '16 at 23:53
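A quick numerical illustration of Glen_b's point (a sketch with random data, assuming a plain linear regression model):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 3
X = rng.normal(size=(n, p))

P = X @ np.linalg.solve(X.T @ X, X.T)
# trace(P) equals the number of free parameters, here the p = 3 columns.
print(np.isclose(np.trace(P), p))  # True
```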

Suppose we have a linear system of $m$ equations in $\mathrm x \in \mathbb R^n$

$$\mathrm A \mathrm x = \mathrm b$$

where $\mathrm A \in \mathbb R^{m \times n}$ has full column rank, and $\mathrm b \in \mathbb R^m$. Left-multiplying both sides by $\mathrm A^T$, we obtain a linear system of $n \leq m$ equations in $\mathrm x \in \mathbb R^n$

$$\mathrm A^T \mathrm A \mathrm x = \mathrm A^T \mathrm b$$

which is usually known as the "normal equations". Since $\mathrm A$ has full column rank, the square matrix $\mathrm A^T \mathrm A$ is invertible. Hence, the latter linear system has the unique solution $(\mathrm A^T \mathrm A)^{-1} \mathrm A^T \mathrm b$, whereas the original linear system, $\mathrm A \mathrm x = \mathrm b$, may not even have a solution. Note that a solution to the normal equations is not necessarily a solution to the original linear system.
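A small NumPy sketch of this point (a hand-picked inconsistent system, chosen purely for illustration): the normal equations have a unique solution even though the original system has none.

```python
import numpy as np

# An inconsistent system: b is not in the column space of A.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([1.0, 1.0, 0.0])

# The normal equations A^T A x = A^T b have a unique solution
# whenever A has full column rank.
x = np.linalg.solve(A.T @ A, A.T @ b)
print(x)                      # least-squares solution [1/3, 1/3]
print(np.allclose(A @ x, b))  # False: x does not solve the original system
```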

So, what is the "meaning" of $\mathrm A^T \mathrm b$? Its entries are the inner products of $\mathrm b$ with the columns of $\mathrm A$, i.e., an unnormalized projection of $\mathrm b$ onto each column. The dimension of the right-hand side is reduced from $m \geq n$ to $n$, so that a unique solution can be found. Since the columns of $\mathrm A$ are not necessarily orthonormal, left-multiplication by $(\mathrm A^T \mathrm A)^{-1}$ provides the needed normalization.
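A sketch of that interpretation (random data, purely illustrative): $\mathrm A^T \mathrm b$ collects inner products with the columns, and when the columns are orthonormal the factor $(\mathrm A^T \mathrm A)^{-1}$ reduces to the identity, so $\mathrm A^T \mathrm b$ alone is already the solution.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(5, 2))
b = rng.normal(size=5)

# A^T b collects the inner product of b with each column of A.
print(np.allclose(A.T @ b, [A[:, j] @ b for j in range(A.shape[1])]))  # True

# With orthonormal columns, (A^T A)^{-1} is the identity and
# A^T b alone gives the least-squares solution.
Q, _ = np.linalg.qr(A)
print(np.allclose(np.linalg.solve(Q.T @ Q, Q.T @ b), Q.T @ b))  # True
```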