
The system in the title has a damping factor $\lambda > 0$, and the matrix $A$ is sparse and rectangular, with a structure I can exploit to compute matrix-vector products very quickly. My current solver, LSMR, solves the normal equations $(A^TA + \lambda I) x = A^T b$ associated with the original problem $\min \|Ax - b\|$.
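
For concreteness, here is a minimal sketch of this setup, assuming SciPy's `scipy.sparse.linalg.lsmr` (the matrix, right-hand side, and damping value below are placeholders, not my actual data); LSMR's `damp` parameter corresponds to $\sqrt{\lambda}$:

```python
# Minimal sketch (SciPy): solve min ||Ax - b||^2 + lmb*||x||^2 without
# forming A^T A + lmb*I explicitly.  LSMR's `damp` parameter is sqrt(lmb).
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsmr

rng = np.random.default_rng(0)
A = sp.random(2000, 300, density=0.01, format="csr", random_state=rng)  # placeholder sparse matrix
b = rng.standard_normal(2000)                                           # placeholder right-hand side
lmb = 1e-2                                                              # placeholder damping factor

x, istop, itn = lsmr(A, b, damp=np.sqrt(lmb), maxiter=500)[:3]
print(f"stop flag {istop}, iterations {itn}")
```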

Although each iteration is very fast, the algorithm always hits the maximum number of iterations. I know this can be fixed with a good preconditioner, and this is where my problem lies.

$A^TA + \lambda I$ is an SPD matrix, which is a good property to have. On the other hand, this matrix is no longer sparse. I don't know how to choose and use a preconditioner for this dense matrix; I suppose someone has already worked this out.

I want to know how to proceed in this case and, if possible, how to use the sparsity of $A$ to obtain a good preconditioner. What are the common approaches?

EDIT: To be more complete, I'll briefly describe how the matrix $A$ is obtained. My problem consists in minimizing the error of a low-rank (rank-$r$) tensor approximation. You can think of a tensor $T$ as a multidimensional array, in this case a 3-D array with entries $T_{ijk}$, for $1 \leq i \leq m$, $1 \leq j \leq n$, $1 \leq k \leq p$. I am considering an approximation $\tilde{T}_{ijk} = \sum_{\ell=1}^r X_{i \ell} \cdot Y_{j \ell} \cdot Z_{k \ell}$. The error of this approximation is given by $$ \frac{1}{2} \sum_{i,j,k} \left( T_{ijk} - \tilde{T}_{ijk} \right)^2 = \frac{1}{2} \sum_{i,j,k} res_{ijk}(X,Y,Z)^2,$$ where $X, Y, Z$ list all the components $X_{i \ell}, Y_{j \ell}, Z_{k \ell}$ and $res_{ijk}$ is the residual of the entry with index $(i,j,k)$.
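
For concreteness, a small sketch (NumPy, with the same small sizes as the figure further below) of how $\tilde{T}$ and the residual tensor are evaluated:

```python
# Sketch (NumPy): evaluate the rank-r approximation and the residual tensor.
# X is m-by-r, Y is n-by-r, Z is p-by-r; T is the m-by-n-by-p data tensor.
import numpy as np

m, n, p, r = 3, 5, 7, 10                    # small illustrative sizes
rng = np.random.default_rng(0)
T = rng.standard_normal((m, n, p))
X, Y, Z = (rng.standard_normal((d, r)) for d in (m, n, p))

T_approx = np.einsum("il,jl,kl->ijk", X, Y, Z)   # tilde T_{ijk} = sum_l X_il * Y_jl * Z_kl
res = T - T_approx                               # res_{ijk}(X, Y, Z)
error = 0.5 * np.sum(res ** 2)
```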

To find the components of $\tilde{T}$ which minimize the error above, it is of interest to find the Jacobian matrix of $res = (res_{111}, res_{112}, \ldots, res_{mnp})$. We have the formulas below for the partial derivatives:

$$\frac{\partial res_{ijk}}{\partial X_{I \ell}} = \left\{ \begin{array}{c} - Y_{j \ell} Z_{k \ell},\quad \text{if } i = I,\\ 0, \quad \text{otherwise} \end{array}\right.$$

$$\frac{\partial res_{ijk}}{\partial Y_{J \ell}} = \left\{ \begin{array}{c} - X_{i \ell} Z_{k \ell},\quad \text{if } j = J,\\ 0, \quad \text{otherwise} \end{array}\right.$$

$$\frac{\partial res_{ijk}}{\partial Z_{K \ell}} = \left\{ \begin{array}{c} - X_{i \ell} Y_{j \ell},\quad \text{if } k = K,\\ 0, \quad \text{otherwise} \end{array}\right.$$

This gives a sparse matrix, which becomes sparser as the dimensions increase. The structure follows a nested for-loop pattern, from left to right. The figure below shows this structure for $m = 3, n = 5, p = 7, r = 10$. I hope this helps someone spot the "right" preconditioner, because at the moment I really don't know how to proceed. Keep in mind that I'm trying to use this structure to find a preconditioner for $A^TA + \lambda I$, where $A$ is the sparse matrix just described.

[Figure: sparsity pattern of the Jacobian for $m = 3, n = 5, p = 7, r = 10$]
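
For anyone who wants to reproduce the pattern, here is a sketch of how this Jacobian can be assembled with SciPy. The row ordering follows the nested loop over $(i, j, k)$; the column ordering within each block (X block, then Y block, then Z block, each stored as (index, $\ell$) pairs) is an assumption made only for illustration:

```python
# Sketch (SciPy): assemble the sparse Jacobian of res from the partial
# derivatives above.  Row = nested loop over (i, j, k); columns = X block,
# then Y block, then Z block (ordering within blocks is an assumption).
import scipy.sparse as sp

def cp_jacobian(X, Y, Z):
    m, r = X.shape
    n, p = Y.shape[0], Z.shape[0]
    rows, cols, vals = [], [], []
    for i in range(m):
        for j in range(n):
            for k in range(p):
                row = (i * n + j) * p + k
                for l in range(r):
                    rows += [row, row, row]
                    cols += [i * r + l,                 # column of X_{il}
                             m * r + j * r + l,         # column of Y_{jl}
                             (m + n) * r + k * r + l]   # column of Z_{kl}
                    vals += [-Y[j, l] * Z[k, l],        # d res_{ijk} / d X_{il}
                             -X[i, l] * Z[k, l],        # d res_{ijk} / d Y_{jl}
                             -X[i, l] * Y[j, l]]        # d res_{ijk} / d Z_{kl}
    return sp.csr_matrix((vals, (rows, cols)),
                         shape=(m * n * p, (m + n + p) * r))
```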

Integral
  • The normal equations square the condition number. Other methods are better if the matrix is ill-conditioned (see: https://eigen.tuxfamily.org/dox/group__LeastSquares.html). I haven't studied how to combine preconditioning with the other methods though. – R zu Oct 25 '18 at 15:10
  • @R zu if the matrix is huge and sparse, using iterative methods is mandatory. One of the best is the conjugate gradient method. Squaring the condition number means nothing if you have a good preconditioner to compensate for it. – Integral Oct 25 '18 at 15:32
  • Your comment is valid, since I didn't mention the size of my matrix and I wrote the original problem as $Ax = b$ when it should be $\min \|Ax - b\|$. Still, I would prefer to have a good preconditioner instead of losing the SPD property, the damping factor, and the CG method. – Integral Oct 25 '18 at 15:39
  • You can't avoid preconditioning at all these days. It is the most important part of your algorithm, assuming you are dealing with a really difficult problem. This is my case. – Integral Oct 25 '18 at 15:41
  • What does the sparsity pattern look like? I was reading Wolfgang Bangerth's answer at https://scicomp.stackexchange.com/questions/20402/choosing-preconditioner-for-unsymmetric-pressure-velocity-coupled-system?rq=1 He says the diagonal pre-conditioner is only good for simple problems ... – R zu Oct 25 '18 at 15:52
  • This looks like a problem coming from regularization (say, Tikhonov with an identity regularization operator), because if you are able to play with $\lambda$, it essentially leads to L-curve methods. (This comment doesn't have a lot to do with preconditioners themselves.) – Anton Menshov Oct 25 '18 at 15:59
  • @AntonMenshov This minimization problem indeed comes from a regularization, the Tikhonov regularization. It is part of a bigger problem which needs to solve $\min \|Ax - b\|$ at each step, and at each step the factor $\lambda$ is updated in order to accelerate the overall convergence. – Integral Oct 25 '18 at 16:06
  • have you tried the "usual suspects", i.e. diagonal preconditioning or incomplete Cholesky? other, less known preconditioning strategies for sparse LS are evaluated in this recent ACM TOMS paper by Gould. – GoHokies Oct 25 '18 at 16:25
  • also, it would be good to include more information about the origin and (block-)structure of $A$. good preconditioners are often built from such domain knowledge. – GoHokies Oct 25 '18 at 16:30
  • @Rzu Note that Integral mentioned they were using LSMR, not CG for the normal equation, which avoids the numerical issues with the squared condition number. (@Integral: LSMR is equivalent to MINRES applied to the normal equations, not CG -- that's a different method.) – Christian Clason Oct 25 '18 at 16:35
  • @GoHokies I just updated the question with more information about the structure of the matrix. Thank you for your suggestions. – Integral Oct 25 '18 at 16:50
  • @ChristianClason Thank you for clarifying. I'll read up on MINRES right now. But I suppose I still need a preconditioner anyway, since the LSMR implementation is using the max number of iterations all the time. – Integral Oct 25 '18 at 16:51
  • 2
    @Integral As someone who works in this field, I would say this is basically an open research question. With a good preconditioner for the trust-region subproblem of an outer tensor decomposition, you will give all the SGD people a run for their money. Good luck! – Richard Zhang Oct 26 '18 at 16:52

1 Answer


Have you tried the preconditioner diag(A'*A) + lmb*speye(size(A,2))? For lmb large enough, I would imagine this preconditioner would accelerate the convergence significantly.

diag(A'*A) can be computed fairly efficiently using $B(i,i) = \sum_j A^T(i,j)\,A(j,i) = \sum_j A(j,i)^2$, i.e., the squared norm of the $i$-th column of $A$.
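
A sketch of how this could look, assuming SciPy (`A`, `b`, and `lmb` are placeholders), applying the diagonal as a preconditioner for CG on the normal equations without ever forming $A^TA$:

```python
# Sketch (SciPy): the diagonal preconditioner diag(A'A) + lmb, built from the
# squared column norms of A, applied as an inverse via a LinearOperator.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def jacobi_preconditioner(A, lmb):
    d = np.asarray(A.power(2).sum(axis=0)).ravel() + lmb   # B(i,i) + lmb
    n = A.shape[1]
    return LinearOperator((n, n), matvec=lambda r: r / d)

def normal_operator(A, lmb):
    n = A.shape[1]
    return LinearOperator((n, n), matvec=lambda x: A.T @ (A @ x) + lmb * x)

# usage with CG on the normal equations, A'A never formed explicitly:
# x, info = cg(normal_operator(A, lmb), A.T @ b, M=jacobi_preconditioner(A, lmb))
```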

Another idea is to use fixed-point iterations as preconditioners. Since your matrix is SPD, 4 sweeps of Gauss-Seidel should be a decent preconditioner. However, avoiding the explicit construction of the lower triangular part of $A^TA$ is challenging. It is doable, though, and I know that a few of my old colleagues have implemented it for personal use.
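
As an illustration only, here is a sketch of that idea, with two assumptions on my part: it forms $N = A^TA + \lambda I$ explicitly (exactly what one wants to avoid at scale), and it uses the symmetric forward/backward Gauss-Seidel variant so the preconditioner stays symmetric for CG or MINRES:

```python
# Sketch (SciPy): a few symmetric Gauss-Seidel sweeps as a preconditioner.
# N = A'A + lmb*I is formed explicitly here only for illustration; the
# matrix-free variant is the hard part mentioned above.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, spsolve_triangular

def sgs_preconditioner(N, sweeps=4):
    DL = sp.tril(N, format="csr")            # D + strictly lower part
    DU = sp.triu(N, format="csr")            # D + strictly upper part
    L = sp.tril(N, k=-1, format="csr")       # strictly lower part
    U = sp.triu(N, k=1, format="csr")        # strictly upper part
    n = N.shape[0]

    def apply(r):
        z = np.zeros(n)
        for _ in range(sweeps):
            z = spsolve_triangular(DL, r - U @ z, lower=True)    # forward sweep
            z = spsolve_triangular(DU, r - L @ z, lower=False)   # backward sweep
        return z

    return LinearOperator((n, n), matvec=apply)

# usage (small problems only):
# N = (A.T @ A + lmb * sp.identity(A.shape[1])).tocsr()
# x, info = cg(N, A.T @ b, M=sgs_preconditioner(N))
```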

Abdullah Ali Sivas
  • I'm not familiar with the specifics of LSMR, but depending on how and where the preconditioner is applied, it may be necessary to preserve the symmetry of the matrix. You can adapt this preconditioner to that case by multiplying from the left and the right by $B^{1/2}+\sqrt{\lambda}I$, where $B$ is as constructed above. This will most likely have a similar effect as a preconditioner while preserving symmetry. – whpowell96 Feb 16 '20 at 01:39