
Assume we have a linear model $E[\textbf{y}]=\textbf{X}\beta$. Least squares leads to the normal equations $\textbf{X}^T\textbf{y}=\textbf{X}^T\textbf{X}\hat{\beta}$. Now assume that $\textbf{X}$ does not have full rank; since the rank of $\textbf{X}$ equals the rank of $\textbf{X}^T\textbf{X}$, the matrix $\textbf{X}^T\textbf{X}$ does not have full rank either.
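
As a quick numerical illustration of the rank statement (a sketch assuming NumPy; the rank-deficient design below is just an arbitrary example):

```python
import numpy as np

# A design matrix whose third column is the sum of the first two, so rank(X) = 2 < p = 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
X = np.column_stack([X, X[:, 0] + X[:, 1]])

# rank(X) equals rank(X'X), so X'X is singular (not of full rank) as well.
print(np.linalg.matrix_rank(X), np.linalg.matrix_rank(X.T @ X))  # 2 2
print(np.linalg.det(X.T @ X))                                    # ~0
```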

A way to solve this problem is with generalized inverses.

Definition: Given a matrix $A$, a matrix $A^-$ is a generalized inverse of $A$ if $AA^-A=A$. Generalized inverses always exist, but they need not be unique [Foundations of Linear and Generalized Linear Models, Alan Agresti, pp. 30-31].
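
For example (a sketch assuming NumPy; `np.linalg.pinv` computes the Moore–Penrose pseudoinverse, which is one particular generalized inverse):

```python
import numpy as np

# A rank-1 matrix: the second column is twice the first.
A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])

# One choice of generalized inverse A^- (the Moore-Penrose pseudoinverse).
A_ginv = np.linalg.pinv(A)

# Check the defining property A A^- A = A.
print(np.allclose(A @ A_ginv @ A, A))  # True
```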

It can then be shown that a solution to the least squares normal equations is $\hat{\beta}=(\textbf{X}^T\textbf{X})^-\textbf{X}^T\textbf{y}$.

But how do we know this is a minimum? It is usually stated that when $X^TX$ is invertible the solution is a minimum because $\partial^2 L(\beta)/\partial \beta\,\partial\beta^T=2X^TX$ is positive definite. But when $X^TX$ does not have full rank, this matrix is only positive semidefinite, so how do we know that we obtain a minimum when using the generalized inverse?
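
A small numerical check of the claim (not a proof) may help frame the question; this sketch assumes NumPy and uses `np.linalg.pinv(X.T @ X)` as one particular generalized inverse of $\textbf{X}^T\textbf{X}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Rank-deficient design: the third column is the sum of the first two.
X = rng.normal(size=(10, 2))
X = np.column_stack([X, X[:, 0] + X[:, 1]])
y = rng.normal(size=10)

# Generalized-inverse solution of the normal equations X'X beta = X'y.
beta_g = np.linalg.pinv(X.T @ X) @ X.T @ y

# It satisfies the normal equations ...
print(np.allclose(X.T @ X @ beta_g, X.T @ y))            # True

# ... and its residual sum of squares equals the least-squares minimum.
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(np.sum((y - X @ beta_g) ** 2),
                  np.sum((y - X @ beta_ls) ** 2)))       # True
```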

  • The least-squares objective is evidently quadratic, which implies at least one global minimum exists. When $X$ is of reduced rank, there will be a linear subspace of global minima of dimension equal to the rank deficit. Different versions of the generalized inverse solution can identify different points within that subspace, but all of them minimize the least squares objective. You can work this out in detail in a simple case, such as $X=\begin{pmatrix}1&2\\1&2\\1&2\end{pmatrix}$ (which arises when making three observations where the explanatory variable is constant); see the sketch below. – whuber Feb 24 '22 at 17:26
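
Here is a minimal numerical sketch of the example in the comment above (assuming NumPy; `np.linalg.pinv` supplies one least-squares solution, and the vector $(2, -1)$ spans the null space of this particular $X$):

```python
import numpy as np

# Three observations with a constant explanatory variable (the comment's example).
X = np.array([[1.0, 2.0],
              [1.0, 2.0],
              [1.0, 2.0]])
y = np.array([1.0, 2.0, 3.0])

def rss(beta):
    return np.sum((y - X @ beta) ** 2)

# One minimizer (the minimum-norm solution from the pseudoinverse) ...
beta0 = np.linalg.pinv(X) @ y

# ... and others obtained by moving along the null space of X,
# spanned here by (2, -1) because 2*col1 - col2 = 0.
for t in (0.0, 1.0, -3.5):
    beta_t = beta0 + t * np.array([2.0, -1.0])
    print(beta_t, rss(beta_t))   # different beta, identical residual sum of squares
```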

2 Answers

5

Every $n\times p$ matrix $X$ has a Singular Value Decomposition

$$X = USV^\prime$$

where $U$ is an $n\times d$ matrix for $d\le p,$ $S$ is a $d\times d$ diagonal matrix with strictly positive diagonal entries $s_1, s_2, \ldots, s_d$ (the nonzero singular values), and $V$ is a $p \times d$ matrix with $U^\prime U = 1_d = V^\prime V$ (that is, $U$ and $V$ have orthonormal columns).
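
To make the decomposition concrete, here is a minimal NumPy sketch (the rank-deficient `X` is an arbitrary example; `np.linalg.svd` returns the factorization, which is then truncated to the $d$ nonzero singular values):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 3))   # n = 8, p = 3, rank d = 2

U_all, s_all, Vt_all = np.linalg.svd(X, full_matrices=False)
d = int(np.sum(s_all > 1e-10))                          # numerical rank
U, S, V = U_all[:, :d], np.diag(s_all[:d]), Vt_all[:d, :].T

print(d)                                 # 2, so p - d = 1 > 0
print(np.allclose(U.T @ U, np.eye(d)))   # True: orthonormal columns
print(np.allclose(V.T @ V, np.eye(d)))   # True
print(np.allclose(U @ S @ V.T, X))       # True: X = U S V'
```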

The "not of full rank" situation of the question occurs when $p-d \gt 0.$

Re-expressing the parameter vector as

$$(\alpha_1, \alpha_2, \ldots, \alpha_d) = \alpha = V^\prime\beta$$

and the response as the $d$-vector

$$z = U^\prime y$$

casts the model in the form

$$E[z] = E[U^\prime y] = U^\prime E[y] = U^\prime X\beta = U^\prime(USV^\prime)(V\alpha) = (U^\prime U)S(V^\prime V)\alpha = S\alpha,$$

which is a set of $d$ equations of the form $E[z_j] = s_j\alpha_j,$ $j = 1, 2, \ldots, d.$

The least squares objective function can therefore be expressed, by applying basic matrix identities and completing the square, as

$$\begin{aligned} ||y - X\beta||^2 &= (y-X\beta)^\prime(y-X\beta)\\ &= (y - US\alpha)^\prime(y - US\alpha)\\ &= \alpha^\prime(SU^\prime\,US^\prime)\alpha- 2y^\prime US\alpha + y^\prime y\\ &= \alpha^\prime (S^2)\alpha - 2y^\prime US\alpha + y^\prime y\\ &= \alpha^\prime (S^2)\alpha - 2z^\prime S\alpha + ||y||^2\\ &= \left(\sum_{j=1}^d s_j^2 \alpha_j^2 - 2s_jz_j\alpha_j\right) + ||y||^2\\ &= \left(\sum_{j=1}^d \left(s_j \alpha_j\right)^2 - 2\left(s_j\alpha_j\,z_j\right) + z_j^2\right) + ||y||^2 - \sum_{j=1}^d z_j^2\\ &= \sum_{j=1}^d \left(s_j\alpha_j - z_j\right)^2 + \left(||y||^2 - ||z||^2\right). \end{aligned}$$

The right hand side is a constant $||y||^2 - ||z||^2$ plus a sum of squares $(s_j\alpha_j - z_j)^2.$ Obviously it is uniquely minimized when all the squares are zero, whence

$$\hat\alpha_j = \frac{z_j}{s_j}$$

for $j = 1, 2, \ldots, d.$
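
As a check of this step (a sketch assuming NumPy, with the same kind of rank-deficient example as above), the coordinate-wise minimizers $\hat\alpha_j = z_j/s_j$ reproduce the ordinary least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 3))   # rank 2, p = 3
y = rng.normal(size=8)

U_all, s_all, Vt_all = np.linalg.svd(X, full_matrices=False)
d = int(np.sum(s_all > 1e-10))
U, s = U_all[:, :d], s_all[:d]

z = U.T @ y                  # transformed response
alpha_hat = z / s            # coordinate-wise minimizers

# The fitted values U S alpha_hat agree with those from np.linalg.lstsq ...
fitted_svd = U @ (s * alpha_hat)
fitted_ls = X @ np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(fitted_svd, fitted_ls))                # True

# ... and the minimal objective equals ||y||^2 - ||z||^2.
print(np.allclose(np.sum((y - fitted_svd) ** 2),
                  np.sum(y ** 2) - np.sum(z ** 2)))      # True
```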

In terms of the original parameter vector $\beta$, the objective therefore is minimized at every $\hat\beta$ satisfying

$$V^\prime \hat\beta = \hat\alpha.$$

Linear algebra tells us there is a $(p-d)$-dimensional space of such solutions, obtained by picking any $\hat\beta_0$ for which $V^\prime \hat\beta_0 = \hat\alpha$ and adding any element of the kernel of $V^\prime$ (the solutions to the homogeneous system of equations $V^\prime\beta = 0$). The least squares objective takes the constant value $||y||^2 - ||z||^2$ on this space.
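
A sketch of this solution set (assuming NumPy; for this example the full SVD's trailing right singular vectors span the kernel of $V^\prime$):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 3))   # rank 2, p = 3
y = rng.normal(size=8)

U_all, s_all, Vt_all = np.linalg.svd(X, full_matrices=True)
d = int(np.sum(s_all > 1e-10))
V = Vt_all[:d, :].T           # p x d
N = Vt_all[d:, :].T           # p x (p - d): a basis for the kernel of V'

z = U_all[:, :d].T @ y
beta0 = V @ (z / s_all[:d])   # one particular solution of V' beta = alpha_hat

# Moving within beta0 + kernel(V') leaves the objective at ||y||^2 - ||z||^2.
for t in (0.0, 2.0, -7.0):
    beta = beta0 + N @ np.full(N.shape[1], t)
    print(np.sum((y - X @ beta) ** 2))        # the same value each time
print(np.sum(y ** 2) - np.sum(z ** 2))        # equals that value
```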

whuber
  • No Calculus was needed here (not even implicitly: the SVD can be established without resort to Calculus, either). Thus this constitutes a Calculus-free derivation of the least squares solution and reveals that it comes down to the problem of showing the square of a real number cannot be negative. That is the very definition of negative! – whuber Mar 13 '23 at 16:26
0

Any time a predictor is an exact copy of another predictor, or a linear combination of other predictors, one of the eigenvalues of $\mathbf{X}^\top\mathbf{X}$ will be zero, so the ordinary inverse $(\mathbf{X}^\top\mathbf{X})^{-1}$ does not exist. However, throwing out the redundant variable won't change the minimum. Rather, I would say that the solution is not unique, since the dispersion matrix $\mathbf{X}^\top\mathbf{X}$ is only positive semidefinite.
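
A brief illustration of this point (a sketch assuming NumPy; the duplicated predictor is an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(2)
X_reduced = rng.normal(size=(20, 2))
X = np.column_stack([X_reduced, X_reduced[:, 1]])   # third predictor copies the second
y = rng.normal(size=20)

# X'X is singular: one eigenvalue is zero up to rounding error.
print(np.linalg.eigvalsh(X.T @ X))

# Dropping the redundant column leaves the minimized residual sum of squares unchanged.
b_full = np.linalg.lstsq(X, y, rcond=None)[0]
b_red = np.linalg.lstsq(X_reduced, y, rcond=None)[0]
print(np.sum((y - X @ b_full) ** 2),
      np.sum((y - X_reduced @ b_red) ** 2))          # equal minima
```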

I would also recommend getting away from thinking about the G-inverse of a matrix and focusing more on the singular value decomposition (SVD), which does not require the square real-symmetric matrix $\mathbf{X}^\top\mathbf{X}$ to be of full rank in order to obtain a (pseudo)inverse. Most statistical packages will likely use QR (or QL) iteration on the symmetric tridiagonal form (the symmetric special case of an upper Hessenberg reduction). I usually use QL, but it doesn't really matter.