It is most easily understood by considering the linear problem
$$
Ax = b
$$
where $A$ and $b$ are the problem data, and $x$ is the vector of parameters we want to estimate. In practice the measurements $b$ contain errors, and these errors propagate into the estimate of $x$ through $A$. How?
Assume we have errors only in the measurements, $b$, and denote by $\delta b$ and $\delta x$ the resulting errors in the measurements and in the estimate, respectively.
Because of linearity,
$$
\delta b = A \delta x
$$
To see how much the measurement errors are magnified by the matrix $A$, you can compute the ratio of relative errors,
$$
\frac{||\delta x ||}{||x||}/\frac{||\delta b ||}{||b||}
$$
This ratio is bounded above by the condition number of $A$,
$$
\mathrm{cond}(A) = \frac{\lambda_{1}}{\lambda_{n}}
$$
where $\lambda_{1}$ and $\lambda_{n}$ are the largest and smallest eigenvalues of $A$ (more generally, its largest and smallest singular values), respectively. Hence, the larger the condition number, the more the measurement errors are magnified in the estimate.
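As a quick numerical sanity check, here is a minimal sketch in NumPy with a hypothetical $2 \times 2$ matrix (all numbers chosen purely for illustration); the measured amplification of the relative error comes out on the same order as $\mathrm{cond}(A)$:

```python
import numpy as np

# Hypothetical system with a large condition number (cond(A) = 1e4).
A = np.array([[1.0, 0.0],
              [0.0, 1e-4]])
b = np.array([1.0, 1e-4])           # exact data, so that x = [1, 1]
x = np.linalg.solve(A, b)

# Perturb the measurements slightly and solve again.
db = np.array([0.0, 1e-6])          # small error in b
dx = np.linalg.solve(A, b + db) - x

amplification = (np.linalg.norm(dx) / np.linalg.norm(x)) / \
                (np.linalg.norm(db) / np.linalg.norm(b))

print("cond(A)       =", np.linalg.cond(A))   # 1e4
print("amplification =", amplification)       # same order of magnitude as cond(A)
```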
In optimization, the same picture appears when $A$ plays the role of the Hessian of the loss: a small eigenvalue corresponds to a direction in which the gradient is small and the loss is almost flat. The step size must be chosen small enough for the steepest directions, so with a high condition number gradient descent oscillates along the steep directions while crawling along the flat ones, which leads to slow convergence.
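A minimal sketch of this effect, using plain gradient descent on the quadratic $f(x) = \tfrac{1}{2} x^\top A x$ with a hypothetical diagonal $A$ (the numbers are only for illustration):

```python
import numpy as np

# Ill-conditioned quadratic: f(x) = 0.5 * x^T A x, with cond(A) = 1e4.
A = np.diag([100.0, 0.01])
x = np.array([1.0, 1.0])

# The step size must respect the steep direction (eigenvalue 100), so it is
# capped near 2/100; this leaves the flat direction (eigenvalue 0.01) crawling.
lr = 1.9 / 100.0

for _ in range(1000):
    grad = A @ x          # gradient of the quadratic
    x = x - lr * grad

print(x)  # steep coordinate oscillates its way to ~0; flat coordinate has barely moved
```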
This issue has motivated a lot of research on the optimization of neural networks (as you already point out), which has led to techniques like momentum (see *On the importance of initialization and momentum in deep learning*) and early stopping. This blog entry provides a very nice description of the topic.
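For comparison, here is a sketch of classical (heavy-ball) momentum on the same hypothetical quadratic as above; the learning rate and momentum coefficient are illustrative choices, not tuned values. The velocity accumulates along the flat direction while the oscillations in the steep direction partly cancel out:

```python
import numpy as np

A = np.diag([100.0, 0.01])          # same ill-conditioned quadratic as above
lr, beta = 1.9 / 100.0, 0.9         # beta is the momentum coefficient

x, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(1000):
    v = beta * v - lr * (A @ x)     # accumulate a velocity (heavy-ball update)
    x = x + v

print(x)  # the flat coordinate ends up much closer to 0 than with plain gradient descent
```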