4

Let $b=(X^TX)^{-1}X^Ty$ be the ordinary least squares estimate. Then the sum of squared residuals is

$$\begin{align} (y-Xb)^T(y-Xb)&=y^Ty-y^TXb-b^TX^Ty+b^TX^TXb\\ &=y^Ty-b^TX^Ty, \end{align}$$

where the second equality uses the normal equations $X^TXb=X^Ty$ and the fact that $y^TXb=(b^TX^Ty)^T=b^TX^Ty$ is a scalar.
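When $b$ solves the normal equations exactly, the two expressions agree; a quick sanity check on synthetic data (a hypothetical example, not the carbohydrate data below):

```r
# Sanity check of the RSS identity on synthetic data (hypothetical example).
set.seed(1)
X <- cbind(1, matrix(rnorm(40), nrow = 20, ncol = 2))  # design with intercept
y <- rnorm(20)
b <- solve(crossprod(X), crossprod(X, y))  # solve X'X b = X'y in one step
rss1 <- drop(crossprod(y - X %*% b))                   # (y - Xb)'(y - Xb)
rss2 <- drop(crossprod(y) - t(b) %*% crossprod(X, y))  # y'y - b'X'y
all.equal(rss1, rss2)                                  # TRUE up to rounding
```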

The carbohydrate data below are from the book An Introduction to Generalized Linear Models (Fourth Edition):

carbohydrate <- structure(list(carbohydrate = c(33, 40, 37, 27, 30, 43, 34, 48, 
30, 38, 50, 51, 30, 36, 41, 42, 46, 24, 35, 37), age = c(33, 
47, 49, 35, 46, 52, 62, 23, 32, 42, 31, 61, 63, 40, 50, 64, 56, 
61, 48, 28), weight = c(100, 92, 135, 144, 140, 101, 95, 101, 
98, 105, 108, 85, 130, 127, 109, 107, 117, 100, 118, 102), protein = c(14, 
15, 18, 12, 15, 15, 14, 17, 15, 14, 17, 19, 19, 20, 15, 16, 18, 
13, 18, 14)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-20L))

N <- dim(carbohydrate)[1]
X <- carbohydrate
X[, 1] <- rep(1, N)
y <- as.matrix(carbohydrate$carbohydrate)
# b = (X^TX)^{-1} X^Ty
X <- as.matrix(X)
XTX <- t(X) %*% X
XTy <- t(X) %*% y
XTX_inv <- matlib::inv(XTX)
b <- XTX_inv %*% XTy

# Residual SS
t(y - X %*% b) %*% (y - X %*% b)
# Residual SS
t(y) %*% y - 2 * (t(y) %*% X %*% b) + t(b) %*% XTX %*% b
# Residual SS
t(y) %*% y - t(b) %*% XTy

        [,1]
[1,] 567.6806
         [,1]
[1,] 567.6806
      [,1]
[1,] 545.8

Why is $(y-Xb)^T(y-Xb)\ne y^Ty-b^TX^Ty$?

It seems the manually calculated coefficients are not correct; with the coefficients from the lm function, all three expressions match:

res.lm=lm(carbohydrate~age+weight+protein,data=carbohydrate)
b2 <- res.lm$coefficients
#Residual S
t(y - X %*% b2) %*% (y - X %*% b2)
#Residual S
t(y) %*% y - 2*(t(y) %*% X %*% b2) + t(b2) %*% XTX %*% b2
#Residual
t(y) %*% y - t(b2) %*% XTy
[,1]
[1,] 567.6629
         [,1]
[1,] 567.6629
         [,1]
[1,] 567.6629
statmerkur
Dan Li

3 Answers

8

The problem can be traced back to numerical issues in the calculation of XTX_inv via matlib::inv:

matlib::inv(XTX) %*% XTX
#      carbohydrate         age      weight     protein
# [1,]   0.99999309 -0.00031792 -0.00078345 -0.00010996
# [2,]  -0.00000260  0.99988772 -0.00030058 -0.00004118
# [3,]   0.00000747  0.00033684  1.00084915  0.00011914
# [4,]   0.00000563  0.00026482  0.00063345  1.00008974

You can do better by using base::solve:

solve(XTX) %*% XTX
#              carbohydrate           age        weight       protein
# carbohydrate 1.000000e+00 -2.046363e-12 -3.637979e-12 -5.684342e-13
# age          5.551115e-17  1.000000e+00  1.421085e-14  8.881784e-16
# weight       1.387779e-17  1.776357e-15  1.000000e+00  6.661338e-16
# protein      0.000000e+00  0.000000e+00  0.000000e+00  1.000000e+00

Replacing matlib::inv(XTX) with solve(XTX) in your code resolves the apparent inconsistency.

Note, however, that stats::lm computes b2 by using a QR decomposition of X, which is more numerically stable:

b3 <- qr.coef(qr(X), y)
all.equal(b2, b3)
# [1] TRUE

b
#            [,1]
# [1,] 36.9598000
# [2,] -0.1137718
# [3,] -0.2277399
# [4,]  1.9579218

b2
#                    [,1]
# carbohydrate 36.9600559
# age          -0.1136764
# weight       -0.2280174
# protein       1.9577126


Addendum (2024-01-14)

Both matlib::inv and base::solve ultimately use Gaussian elimination with partial pivoting to calculate the inverse of a matrix.
matlib::inv does this by calling matlib::Inverse which in turn calls matlib::gaussianElimination. While the latter provides a reasonable implementation for didactic purposes, base::solve leverages the superior LAPACK subroutine dgesv for its computations.
That being said, if the tolerance argument tol of matlib::Inverse, used to check for zero pivots, is set to .Machine$double.eps, we get a better inverse:

matlib::Inverse(XTX, tol = .Machine$double.eps) %*% XTX
#       carbohydrate          age        weight       protein
# [1,]  1.000000e+00 1.136868e-12  4.547474e-12  2.273737e-13
# [2,] -2.386980e-15 1.000000e+00 -2.096101e-13 -3.863576e-14
# [3,]  1.115913e-13 5.175860e-12  1.000000e+00  1.780798e-12
# [4,]  6.261658e-14 2.927436e-12  7.105427e-12  1.000000e+00

Instead of solving the normal equations $X^\top X b = X^\top y$ directly (e.g., via Gaussian elimination with partial pivoting) to calculate the OLSE $b$, stats::lm uses a QR decomposition of $X$. Some details are given in this answer.
This avoids computations with the matrix cross-product $X^\top X$, whose condition number is the square of that of the (potentially ill-conditioned) design matrix $X$. Another advantage of stats::lm's (more precisely, dqrdc2's) QR decomposition with column pivoting is that it also works with rank-deficient (design) matrices.
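Both points can be illustrated with a small sketch (synthetic design matrices, not the carbohydrate data):

```r
# (i) The 2-norm condition number of X'X is the square of that of X.
set.seed(42)
X <- cbind(1, matrix(rnorm(40), nrow = 20, ncol = 2))
kX  <- kappa(X, exact = TRUE)             # exact condition number of X
kXX <- kappa(crossprod(X), exact = TRUE)  # exact condition number of X'X
all.equal(kXX, kX^2)                      # TRUE up to rounding

# (ii) QR with column pivoting copes with a rank-deficient design,
# where solving the normal equations fails outright.
Xdef <- cbind(X, dup = X[, 2])  # duplicated column -> rank-deficient design
y <- rnorm(20)
qr.coef(qr(Xdef), y)            # NA coefficient flags the aliased column
# solve(crossprod(Xdef), crossprod(Xdef, y))  # error: computationally singular
```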

statmerkur
  • 4
    I would actually say that in most cases we shouldn't invert a matrix at all. We want to solve a system. So more like solve(crossprod(X), crossprod(X,y)) to be proper about this. – usεr11852 Dec 13 '23 at 17:03
  • 2
    @usεr11852 Agreed. And qr.coef(qr(X), y) does solve a (simplified) system of linear equations, but without computing the matrix cross-product $X^\top X$, which can itself be numerically unstable. I'll expand on that later when I find some time. Compared to your suggestion this is slower but can also detect linearly dependent columns in the design matrix. – statmerkur Dec 13 '23 at 17:19
  • Cool, it was more to keep it close to the $(X^TX)^{-1} X^Ty$ wording. – usεr11852 Dec 13 '23 at 17:53
3

It appears that if you use XTX_inv <- solve(XTX), the results are identical. So the inverse of XTX obtained by matlib::inv() is less precise.

BenP
1

matlib describes itself as "A collection of matrix functions for teaching and learning", so it is not tuned for numerical accuracy.

Use solve instead.

carbohydrate <- structure(list(carbohydrate = c(33, 40, 37, 27, 30, 43, 34, 48, 
30, 38, 50, 51, 30, 36, 41, 42, 46, 24, 35, 37), age = c(33, 
47, 49, 35, 46, 52, 62, 23, 32, 42, 31, 61, 63, 40, 50, 64, 56, 
61, 48, 28), weight = c(100, 92, 135, 144, 140, 101, 95, 101, 
98, 105, 108, 85, 130, 127, 109, 107, 117, 100, 118, 102), protein = c(14, 
15, 18, 12, 15, 15, 14, 17, 15, 14, 17, 19, 19, 20, 15, 16, 18, 
13, 18, 14)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-20L))

N <- dim(carbohydrate)[1]
X <- carbohydrate
X[, 1] <- rep(1, N)
y <- as.matrix(carbohydrate$carbohydrate)
# b = (X^TX)^{-1} X^Ty
X <- as.matrix(X)
XTX <- t(X) %*% X
XTy <- t(X) %*% y
# XTX_inv <- matlib::inv(XTX)
XTX_inv <- solve(XTX)
b <- XTX_inv %*% XTy

# Residual SS
t(y - X %*% b) %*% (y - X %*% b)
# Residual SS
t(y) %*% y - 2 * (t(y) %*% X %*% b) + t(b) %*% XTX %*% b
# Residual SS
t(y) %*% y - t(b) %*% XTy

         [,1]
[1,] 567.6629
         [,1]
[1,] 567.6629
         [,1]
[1,] 567.6629
Hunaphu
  • 2
    Perhaps you could expand your answer to articulate what causes the loss of precision and how this function mitigates this – Sycorax Dec 20 '23 at 18:36