Can ordinary least squares regression be solved with Newton's method? If so, how many steps would be required to achieve convergence?
I know that Newton's method works on twice differentiable functions, I'm just not sure how this works with OLS.
If used for OLS regression, Newton's method converges in a single step, and is equivalent to using the standard, closed form solution for the coefficients.
On each iteration, Newton's method constructs a quadratic approximation of the loss function around the current parameters, based on the gradient and Hessian. The parameters are then updated by minimizing this approximation. For quadratic loss functions (as we have with OLS regression) the approximation is equivalent to the loss function itself, so convergence occurs in a single step.
This assumes we're using the 'vanilla' version of Newton's method. Some variants use a restricted step size, in which case multiple steps would be needed. It also assumes the design matrix has full rank. If this doesn't hold, the Hessian is non-invertible so Newton's method can't be used without modifying the problem and/or update rule (also, there's no unique OLS solution in this case).
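As a quick numerical check of the claim about the quadratic approximation (a minimal NumPy sketch with made-up random data; all names here are just for illustration), the second-order Taylor expansion of the OLS loss around any point reproduces the loss exactly, which is why a single Newton step suffices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))   # made-up design matrix
y = rng.normal(size=n)        # made-up responses

def loss(w):
    r = y - X @ w
    return 0.5 * r @ r

def quadratic_model(w, w0):
    """Second-order Taylor expansion of the loss around w0."""
    g = X.T @ X @ w0 - X.T @ y   # gradient at w0
    H = X.T @ X                  # Hessian (constant in w)
    dw = w - w0
    return loss(w0) + g @ dw + 0.5 * dw @ H @ dw

w0 = rng.normal(size=d)  # arbitrary expansion point
w = rng.normal(size=d)   # arbitrary evaluation point
print(np.isclose(loss(w), quadratic_model(w, w0)))  # True: the quadratic model is exact
```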
Assume the design matrix $X \in \mathbb{R}^{n \times d}$ has full rank. Let $y \in \mathbb{R}^n$ be the responses, and $w \in \mathbb{R}^d$ be the coefficients. The loss function is:
$$L(w) = \frac{1}{2} \|y - X w\|_2^2$$
The gradient and Hessian are:
$$\nabla L(w) = X^T X w - X^T y \quad \quad H_L(w) = X^T X$$
Newton's method sets the parameters to an initial guess $w_0$, then iteratively updates them. Let $w_t$ be the current parameters on iteration $t$. The updated parameters $w_{t+1}$ are obtained by subtracting the product of the inverse Hessian and the gradient:
$$w_{t+1} = w_t - H_L(w_t)^{-1} \nabla L(w_t)$$
Plug in the expressions for the gradient and Hessian:
$$w_{t+1} = w_t - (X^T X)^{-1} (X^T X w_t - X^T y)$$
$$= w_t - w_t + (X^T X)^{-1} X^T y = (X^T X)^{-1} X^T y$$
This is the standard, closed form expression for the OLS coefficients. Therefore, no matter what we choose for the initial guess $w_0$, we'll have the correct solution at $w_1$ after a single iteration.
Furthermore, this is a fixed point of the update: the expression for $w_{t+1}$ doesn't depend on $w_t$, so the solution won't change if we continue beyond one iteration. This confirms that Newton's method converges in a single step.
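To see this numerically, here's a minimal sketch (assuming NumPy and made-up random data) that takes one Newton step from an arbitrary starting point and compares it to the closed-form solution:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 4
X = rng.normal(size=(n, d))   # full rank with probability 1 for Gaussian data
y = rng.normal(size=n)

w0 = rng.normal(size=d)                       # arbitrary initial guess
gradient = X.T @ X @ w0 - X.T @ y             # gradient of (1/2)||y - Xw||^2 at w0
hessian = X.T @ X                             # Hessian
w1 = w0 - np.linalg.solve(hessian, gradient)  # one Newton step

w_closed_form = np.linalg.solve(X.T @ X, X.T @ y)  # (X^T X)^{-1} X^T y
print(np.allclose(w1, w_closed_form))              # True: one step reaches the OLS solution
```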
It takes one iteration, basically because Newton's method exactly minimizes a local quadratic approximation of the objective at each step. Since the squared error loss is itself quadratic, the approximation is exact, and the first step lands on the minimizer.
Newton's method does $$\beta \gets \beta-\frac{f'(\beta)}{f''(\beta)}$$ and we have $$f(\beta)=\|y-x\beta\|^2$$ $$f'(\beta)=-2x^T (y-x\beta)$$ $$f''(\beta)=2x^Tx$$
First, for simplicity, do it starting at $\beta=0$. The first iterate is $-f'(0)/f''(0)$, which is $$-(2x^Tx)^{-1}(-2x^T y)=(x^Tx)^{-1}x^Ty,$$ so we get the standard solution.
Starting somewhere else, the first iterate is $$\beta-(2x^Tx)^{-1}(-2x^T (y-x\beta))= \beta+(x^Tx)^{-1}x^Ty-(x^Tx)^{-1}x^Tx\beta=(x^Tx)^{-1}x^Ty,$$ since $(x^Tx)^{-1}x^Tx\beta=\beta$.
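A small sketch of this version of the update (NumPy, with made-up data; the factors of 2 in the gradient and Hessian cancel in the step), showing the starting point doesn't matter:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 60, 2
x = rng.normal(size=(n, d))
y = rng.normal(size=n)

def newton_step(beta):
    grad = -2 * x.T @ (y - x @ beta)           # f'(beta)
    hess = 2 * x.T @ x                         # f''(beta)
    return beta - np.linalg.solve(hess, grad)  # beta - f'(beta)/f''(beta)

beta_hat = np.linalg.solve(x.T @ x, x.T @ y)   # (x^T x)^{-1} x^T y

# One step lands on the OLS solution regardless of where we start.
print(np.allclose(newton_step(np.zeros(d)), beta_hat))
print(np.allclose(newton_step(rng.normal(size=d)), beta_hat))
```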