Can ordinary least squares regression be solved with Newton's method? If so, how many steps would be required to achieve convergence?
I know that Newton's method works on twice differentiable functions, I'm just not sure how this works with OLS.
If used for OLS regression, Newton's method converges in a single step, and is equivalent to using the standard, closed form solution for the coefficients.
On each iteration, Newton's method constructs a quadratic approximation of the loss function around the current parameters, based on the gradient and Hessian. The parameters are then updated by minimizing this approximation. For quadratic loss functions (as we have with OLS regression) the approximation is equivalent to the loss function itself, so convergence occurs in a single step.
This assumes we're using the 'vanilla' version of Newton's method. Some variants use a restricted step size, in which case multiple steps would be needed. It also assumes the design matrix has full rank. If this doesn't hold, the Hessian is non-invertible so Newton's method can't be used without modifying the problem and/or update rule (also, there's no unique OLS solution in this case).
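As a quick numerical check of the claim about the quadratic approximation (a minimal NumPy sketch with made-up random data; all names here are just for illustration), the second-order Taylor expansion of the OLS loss around any point reproduces the loss exactly, which is why a single Newton step suffices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))   # made-up design matrix
y = rng.normal(size=n)        # made-up responses

def loss(w):
    r = y - X @ w
    return 0.5 * r @ r

def quadratic_model(w, w0):
    """Second-order Taylor expansion of the loss around w0."""
    g = X.T @ X @ w0 - X.T @ y   # gradient at w0
    H = X.T @ X                  # Hessian (constant in w)
    dw = w - w0
    return loss(w0) + g @ dw + 0.5 * dw @ H @ dw

w0 = rng.normal(size=d)  # arbitrary expansion point
w = rng.normal(size=d)   # arbitrary evaluation point
print(np.isclose(loss(w), quadratic_model(w, w0)))  # True: the quadratic model is exact
```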
Assume the design matrix $X \in \mathbb{R}^{n \times d}$ has full rank. Let $y \in \mathbb{R}^n$ be the responses, and $w \in \mathbb{R}^d$ be the coefficients. The loss function is:
$$L(w) = \frac{1}{2} \|y - X w\|_2^2$$
The gradient and Hessian are:
$$\nabla L(w) = X^T X w - X^T y \quad \quad H_L(w) = X^T X$$
Newton's method sets the parameters to an initial guess $w_0$, then iteratively updates them. Let $w_t$ be the current parameters on iteration $t$. The updated parameters $w_{t+1}$ are obtained by subtracting the product of the inverse Hessian and the gradient:
$$w_{t+1} = w_t - H_L(w_t)^{-1} \nabla L(w_t)$$
Plug in the expressions for the gradient and Hessian:
$$w_{t+1} = w_t - (X^T X)^{-1} (X^T X w_t - X^T y)$$
$$= w_t - w_t + (X^T X)^{-1} X^T y = (X^T X)^{-1} X^T y$$
This is the standard, closed form expression for the OLS coefficients. Therefore, no matter what we choose for the initial guess $w_0$, we'll have the correct solution at $w_1$ after a single iteration.
Furthermore, this is a fixed point of the update: the expression for $w_{t+1}$ doesn't depend on $w_t$, so the solution won't change if we continue beyond one iteration. This confirms that Newton's method converges in a single step.
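To see this numerically, here's a minimal sketch (assuming NumPy and made-up random data) that takes one Newton step from an arbitrary starting point and compares it to the closed-form solution:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 4
X = rng.normal(size=(n, d))   # full rank with probability 1 for Gaussian data
y = rng.normal(size=n)

w0 = rng.normal(size=d)                       # arbitrary initial guess
gradient = X.T @ X @ w0 - X.T @ y             # gradient of (1/2)||y - Xw||^2 at w0
hessian = X.T @ X                             # Hessian
w1 = w0 - np.linalg.solve(hessian, gradient)  # one Newton step

w_closed_form = np.linalg.solve(X.T @ X, X.T @ y)  # (X^T X)^{-1} X^T y
print(np.allclose(w1, w_closed_form))              # True: one step reaches the OLS solution
```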
It takes one iteration, basically because Newton's method exactly minimizes a local quadratic approximation of the objective at each step. Since the squared error loss is itself quadratic, the approximation is exact, and the first step lands on the minimizer.
Newton's method does $$\beta \gets \beta-\frac{f'(\beta)}{f''(\beta)}$$ and we have $$f(\beta)=\|y-x\beta\|^2$$ $$f'(\beta)=-2x^T (y-x\beta)$$ $$f''(\beta)=2x^Tx$$
First, for simplicity, do it starting at $\beta=0$. The first iterate is $-f'(0)/f''(0)$, which is $$-(2x^Tx)^{-1}(-2x^T y)=(x^Tx)^{-1}x^Ty,$$ so we get the standard solution.
Starting somewhere else, the first iterate is $$\beta-(2x^Tx)^{-1}(-2x^T (y-x\beta))= \beta+(x^Tx)^{-1}x^Ty-(x^Tx)^{-1}x^Tx\beta=(x^Tx)^{-1}x^Ty,$$ since $(x^Tx)^{-1}x^Tx\beta=\beta$.
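A small sketch of this version of the update (NumPy, with made-up data; the factors of 2 in the gradient and Hessian cancel in the step), showing the starting point doesn't matter:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 60, 2
x = rng.normal(size=(n, d))
y = rng.normal(size=n)

def newton_step(beta):
    grad = -2 * x.T @ (y - x @ beta)           # f'(beta)
    hess = 2 * x.T @ x                         # f''(beta)
    return beta - np.linalg.solve(hess, grad)  # beta - f'(beta)/f''(beta)

beta_hat = np.linalg.solve(x.T @ x, x.T @ y)   # (x^T x)^{-1} x^T y

# One step lands on the OLS solution regardless of where we start.
print(np.allclose(newton_step(np.zeros(d)), beta_hat))
print(np.allclose(newton_step(rng.normal(size=d)), beta_hat))
```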