
I have $d\times d$ real-valued matrices $A_1,\ldots,A_k$, $1000<d<4000$, $k\approx 50$, and need to estimate the trace of the following matrix product

$$t=\text{tr}(A_1 A_2\cdots A_k A_k^T \cdots A_2^T A_1^T)$$

$\|u\|^2$ gives me an unbiased estimate of $t$ where $u=A_1 \cdots A_k v$ and $v$ is some random vector with identity covariance matrix. One approach could be to compute 2-100 samples of $\|u\|^2$ and average them.
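As a concrete baseline, here is a minimal MATLAB/Octave sketch of that averaging scheme (the cell array `As` holding $A_1,\ldots,A_k$, the sample count `s`, and the function name are placeholder names, not part of the question):

function t_est = naive_trace_est(As, s)
% Average of ||A_1*...*A_k * v||^2 over s random probes v with identity
% covariance (standard Gaussian probes here); As is a cell array {A_1,...,A_k}.
d = size(As{end}, 2);
vals = zeros(s, 1);
for j = 1:s
    u = randn(d, 1);              % v ~ N(0, I), so E[v*v'] = I
    for i = numel(As):-1:1
        u = As{i} * u;            % apply A_k, ..., A_1 as matrix-vector products
    end
    vals(j) = norm(u)^2;          % one unbiased sample of t
end
t_est = mean(vals);
end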

However:

  1. This tells me nothing about the variance of the estimator. There are many distributions with an identity covariance matrix; which one should I use? And how far off is my estimate?
  2. This ignores the structure of the product. The component matrices $A_i$ are known, but this information is not used.

Is there a way to improve on $\|u\|^2$ for this problem?

Edit summarizing the discussion in the comments

We can get $\text{tr}(AA^T)$ exactly by summing the $d_1$ squared row norms:

$$\text{tr}(AA^T)=\sum_{i=1}^{d_1}\|A_{i,:}\|^2$$

Alternatively, we can sum the $d_2$ squared column norms:

$$\text{tr}(AA^T)=\text{tr}(A^TA)=\sum_{j=1}^{d_2}\|A_{:,j}\|^2$$

Alternatively, we can sample $x$ from a normal distribution and look at the squared norm of the $x$-weighted combination of rows (or columns):

$$\mathbb{E}\,\|x^T A\|^2 = \text{tr}(AA^T),\qquad x\sim\mathcal{N}(0,I_{d_1})$$
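A small MATLAB/Octave sketch contrasting the exact row-norm sum with a single weighted sample (written with $A$ formed explicitly as a stand-in; in the actual problem $x^T A$ would be obtained by sweeping $x$ through $A_1,\ldots,A_k$):

A = randn(50, 80);                  % stand-in for the product A = A_1*...*A_k

% exact: tr(A*A') as the sum of squared row norms, O(d1*d2)
t_exact = sum(sum(A.^2, 2));

% stochastic, single sample (s = 1): weight the rows by a Gaussian vector x
x = randn(size(A, 1), 1);           % x ~ N(0, I)
t_sample = norm(x' * A)^2;          % E[ ||x'*A||^2 ] = tr(A*A')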

Averaging $s$ independent samples of this estimator gets inefficient for large $s$, because at $s=d$ the exact approach above is possible at the same cost.

So it would be nice to have an approach which smoothly interpolates between the exact result for $s=d$ and the existing stochastic estimator for $s=1$.

Also, for the $s=1$ stochastic estimator, can we use knowledge of $A$ to pick a better $x$ than a random one? One example of structure is the matrix product $A=A_1 \cdots A_k$, where some component matrices $A_i$ are diagonal. For instance, if $A_1$ has a large first row and the rest of its rows are small, perhaps we should pick $x=(1,0,0,\ldots)$ and use $\|x^T A\|^2$ as the estimate instead of using a random $x$?

Yaroslav Bulatov
  • Dumb question: why do you need an estimate? Is the direct computation forbidden for some reason? – Robert Bassett Jan 07 '22 at 00:08
  • That's a good question actually... it's conventional wisdom that "the Hessian trace of a neural network is too expensive", and I simplified the problem a bit. Thinking more about why it's hard: the actual dimensions of the matrices for the "Hessian trace" problem occurring in practice are similar, with an extra pair of matrices $A_0$ and $A_0^T$ on the ends; $A_0$ has shape $d^2 \times d$, but only $d^2$ entries are non-zero. – Yaroslav Bulatov Jan 07 '22 at 04:07
  • For 1: there is recent research on accuracy bounds; take a look at https://arxiv.org/abs/2005.10009. – Federico Poloni Jan 07 '22 at 07:43

2 Answers


Edit Jan 12: I was pointed by the author of https://arxiv.org/abs/2010.09649 to this simple estimator of the trace (explanation), which should also be better than the orthogonalization approach in the original post below.

function trace_est=simple_hutchplusplus(A, num_queries)
% Estimates the trace of square matrix A with num_queries many matrix-vector products
% Implements the Hutch++ algorithm. Random sign vectors are used.
% Calculate which matrices get how many queries, and generate random sign matrices
S = 2*randi(2,size(A,1),ceil(num_queries/3))-3;
G = 2*randi(2,size(A,1),floor(num_queries/3))-3;

% Compute only the Q of the QR decomposition
[Q,~] = qr(A*S,0);
G = G - Q*(Q'*G);

trace_est = trace(Q'*A*Q) + 1/size(G,2)*trace(G'*A*G);

end % simple_hutchplusplus
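A quick sanity-check usage of this function (a sketch with made-up sizes; for the problem in the question one would apply it to $BB^T$ with $B=A_1\cdots A_k$, ideally via matrix-vector products through the chain rather than by forming $BB^T$ explicitly):

% Hypothetical sanity check: low-rank-ish PSD test matrix with a known trace.
d = 2000;
M = randn(d, 200); M = M*M';
rel_err = abs(simple_hutchplusplus(M, 30) - trace(M)) / trace(M);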

original post

It seems like the following trick works: after drawing the random probe vectors independently, orthogonalize them. The error drops to 0 once the number of vectors reaches $d$.

(plot: mean squared error vs. number of probe vectors, "gauss" vs. "gauss-normalized")

(* computes trace of BB' efficiently, assuming the matrix is tall *)
trace[B_] := Total[B*B, 2];

(* random probe matrices: plain Gaussian vs. orthogonalized Gaussian rows *)
randomGauss[rank_, n_] := RandomVariate[NormalDistribution[], {rank, n}]/Sqrt[rank];
randomGaussOrthogonal[rank_, n_] := Sqrt[n/rank]*Orthogonalize@RandomVariate[NormalDistribution[], {rank, n}];

(* mean squared error of the estimator using random matrix generator randomFunc *)
expectedError[X_, randomFunc_, rank_, samples_] := (
  estimate := (
    A = randomFunc[rank, npoints];  (* rank x datapoints *)
    B = X . Transpose[A];
    trace[B]
  );
  Mean[Table[(estimate - trueValue)^2, {samples}]]
);

samples = 1000; estimates = {}; npoints = 10; dims = 10;
X = RandomVariate[NormalDistribution[], {dims, npoints}];
X = X/Sqrt[trace[X]];
trueValue = Tr[X . Transpose[X]];

For[k = 1, k <= dims, k += 1,
  vals = {};
  AppendTo[vals, expectedError[X, randomGauss, k, samples]];
  AppendTo[vals, expectedError[X, randomGaussOrthogonal, k, samples]];
  AppendTo[estimates, vals]
];

ListLinePlot[Transpose[estimates], PlotLegends -> {"gauss", "gauss-normalized"}, PlotRange -> All]

Yaroslav Bulatov

Your statistical method is pretty clever. This is less clever, but maybe you can build off the idea.

For any matrix $A$, $(AA^T)_{ii}=\sum_m A_{im}A_{im}$, so $\text{tr}(AA^T)=\sum_i \|A_{i,:}\|^2$, where $\|A_{i,:}\|^2$ is the squared norm of the $i$'th row. With that said, you don't need to explicitly calculate the product $AA^T$ to get the trace. For your problem, $A=A_1 A_2\cdots A_k$. If you can afford to compute the chain of matrix products, requiring $O(kd^3)$ work, then the trace of $AA^T$ is relatively cheap: $O(d^2)$.
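A minimal MATLAB/Octave sketch of this idea (assuming, hypothetically, the factors are stored in a cell array `As`):

% Exact trace via row norms of the full product; As is a hypothetical
% cell array {A_1, ..., A_k}.
A = As{1};
for i = 2:numel(As)
    A = A * As{i};          % chain of matrix products, O(k*d^3) total
end
t = sum(sum(A.^2, 2));      % tr(A*A') = sum of squared row norms, O(d^2)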

Charlie S
  • That's actually an interesting idea -- your approach is to take the $d$ rows of $A_1$, call each one $x$, and for each $x$ compute $xB$ where $B=A_2\cdots A_k$. My approach is to compute a set of $s$ values of $xB$, where each $x$ is a random linear combination of all $d$ rows of $A_1$. This means that when my stochastic estimator uses $s=d$ samples, it costs the same as your exact approach while being worse. Now, what if $s=d/2$? Perhaps there's a way to beat the stochastic estimator by being smarter about the set of $x$'s. Perhaps take the $d/2$ largest rows? – Yaroslav Bulatov Jan 08 '22 at 17:14
  • Updated the post with some diagrams comparing these approaches. – Yaroslav Bulatov Jan 08 '22 at 18:04