
I have $d\times d$ real-valued matrices $A_1,\ldots,A_k$, $1000<d<4000$, $k\approx 50$, and need to estimate the trace of the following matrix product

$$t=\text{tr}(A_1 A_2\cdots A_k A_k^T \cdots A_2^T A_1^T)$$

$\|u\|^2$ gives me an unbiased estimate of $t$ where $u=A_1 \cdots A_k v$ and $v$ is some random vector with identity covariance matrix. One approach could be to compute 2-100 samples of $\|u\|^2$ and average them.
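As a concrete baseline, here is a minimal MATLAB/Octave sketch of that averaging scheme (the cell array `As` holding $A_1,\ldots,A_k$, the sample count `s`, and the function name are placeholder names, not part of the question):

function t_est = naive_trace_est(As, s)
% Average of ||A_1*...*A_k * v||^2 over s random probes v with identity
% covariance (standard Gaussian probes here); As is a cell array {A_1,...,A_k}.
d = size(As{end}, 2);
vals = zeros(s, 1);
for j = 1:s
    u = randn(d, 1);              % v ~ N(0, I), so E[v*v'] = I
    for i = numel(As):-1:1
        u = As{i} * u;            % apply A_k, ..., A_1 as matrix-vector products
    end
    vals(j) = norm(u)^2;          % one unbiased sample of t
end
t_est = mean(vals);
end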

However:

  1. This tells me nothing about the variance of the estimator. There are many distributions with an identity covariance matrix; which one should I use? And how far off is my estimate?
  2. This ignores the structure of the product. The component matrices $A_i$ are known, but this information is not used.

Is there a way to improve on $\|u\|^2$ for this problem?

Edit summarizing the discussion in the comments

We can get $\text{tr}(AA^T)$ exactly by summing the $d_1$ squared row norms:

$$\text{tr}(AA^T)=\sum_{i=1}^{d_1}\|A_{i,:}\|^2$$

Alternatively, we can sum the $d_2$ squared column norms:

$$\text{tr}(AA^T)=\text{tr}(A^TA)=\sum_{j=1}^{d_2}\|A_{:,j}\|^2$$

Alternatively, we can sample $x$ from a normal distribution and look at the squared norm of the $x$-weighted combination of rows (or columns):

$$\mathbb{E}\,\|x^T A\|^2 = \text{tr}(AA^T),\qquad x\sim\mathcal{N}(0,I_{d_1})$$
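A small MATLAB/Octave sketch contrasting the exact row-norm sum with a single weighted sample (written with $A$ formed explicitly as a stand-in; in the actual problem $x^T A$ would be obtained by sweeping $x$ through $A_1,\ldots,A_k$):

A = randn(50, 80);                  % stand-in for the product A = A_1*...*A_k

% exact: tr(A*A') as the sum of squared row norms, O(d1*d2)
t_exact = sum(sum(A.^2, 2));

% stochastic, single sample (s = 1): weight the rows by a Gaussian vector x
x = randn(size(A, 1), 1);           % x ~ N(0, I)
t_sample = norm(x' * A)^2;          % E[ ||x'*A||^2 ] = tr(A*A')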

Averaging $s$ independent samples of this estimator gets inefficient for large $s$, because at $s=d$ the exact approach above is possible at the same cost.

So it would be nice to have an approach which smoothly interpolates between the exact result for $s=d$ and the existing stochastic estimator for $s=1$.

Also, for the $s=1$ stochastic estimator, can we use knowledge of $A$ to pick a better $x$ than a random one? One example of structure is the matrix product $A=A_1 \cdots A_k$, where some component matrices $A_i$ are diagonal. For instance, if $A_1$ has a large first row and the rest of its rows are small, perhaps we should pick $x=(1,0,0,\ldots)$ and use $\|x^T A\|^2$ as the estimate instead of using a random $x$?

Yaroslav Bulatov
  • Dumb question: why do you need an estimate? Is the direct computation forbidden for some reason? – Robert Bassett Jan 07 '22 at 00:08
  • That's a good question actually... it's conventional wisdom that "the Hessian trace of a neural network is too expensive", and I simplified the problem a bit. Thinking more about why it's hard: the actual dimensions of the matrices for the "Hessian trace" problem occurring in practice are similar, with an extra pair of matrices $A_0$ and $A_0^T$ on the ends; $A_0$ has shape $d^2 \times d$, but only $d^2$ entries are non-zero. – Yaroslav Bulatov Jan 07 '22 at 04:07
  • For 1: there is recent research on accuracy bounds; take a look at https://arxiv.org/abs/2005.10009. – Federico Poloni Jan 07 '22 at 07:43

2 Answers


Edit Jan 12: I was pointed by the author of https://arxiv.org/abs/2010.09649 to this simple estimator of the trace (explanation), which should also be better than the orthogonalization approach in the original post below.

function trace_est=simple_hutchplusplus(A, num_queries)
% Estimates the trace of square matrix A with num_queries many matrix-vector products
% Implements the Hutch++ algorithm. Random sign vectors are used.
% Calculate which matrices get how many queries, and generate random sign matrices
S = 2*randi(2,size(A,1),ceil(num_queries/3))-3;
G = 2*randi(2,size(A,1),floor(num_queries/3))-3;

% Compute only the Q of the QR decomposition
[Q,~] = qr(A*S,0);
G = G - Q*(Q'*G);

trace_est = trace(Q'*A*Q) + 1/size(G,2)*trace(G'*A*G);

end % simple_hutchplusplus
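A quick sanity-check usage of this function (a sketch with made-up sizes; for the problem in the question one would apply it to $BB^T$ with $B=A_1\cdots A_k$, ideally via matrix-vector products through the chain rather than by forming $BB^T$ explicitly):

% Hypothetical sanity check: low-rank-ish PSD test matrix with a known trace.
d = 2000;
M = randn(d, 200); M = M*M';
rel_err = abs(simple_hutchplusplus(M, 30) - trace(M)) / trace(M);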

original post

It seems like the following trick works: after drawing the random probe vectors independently, orthogonalize them. The error drops to 0 once the number of vectors reaches $d$.

(plot: mean squared error vs. number of probe vectors, "gauss" vs. "gauss-normalized")

(* computes trace of BB' efficiently, assuming the matrix is tall *)
trace[B_] := Total[B*B, 2];

(* random probe matrices: plain Gaussian vs. orthogonalized Gaussian rows *)
randomGauss[rank_, n_] := RandomVariate[NormalDistribution[], {rank, n}]/Sqrt[rank];
randomGaussOrthogonal[rank_, n_] := Sqrt[n/rank]*Orthogonalize@RandomVariate[NormalDistribution[], {rank, n}];

(* mean squared error of the estimator using random matrix generator randomFunc *)
expectedError[X_, randomFunc_, rank_, samples_] := (
  estimate := (
    A = randomFunc[rank, npoints];  (* rank x datapoints *)
    B = X . Transpose[A];
    trace[B]
  );
  Mean[Table[(estimate - trueValue)^2, {samples}]]
);

samples = 1000; estimates = {}; npoints = 10; dims = 10;
X = RandomVariate[NormalDistribution[], {dims, npoints}];
X = X/Sqrt[trace[X]];
trueValue = Tr[X . Transpose[X]];

For[k = 1, k <= dims, k += 1,
  vals = {};
  AppendTo[vals, expectedError[X, randomGauss, k, samples]];
  AppendTo[vals, expectedError[X, randomGaussOrthogonal, k, samples]];
  AppendTo[estimates, vals]
];

ListLinePlot[Transpose[estimates], PlotLegends -> {"gauss", "gauss-normalized"}, PlotRange -> All]

Yaroslav Bulatov

Your statistical method is pretty clever. This is less clever, but maybe you can build off the idea.

For any matrix $A$, $(AA^T)_{ii}=\sum_m A_{im}A_{im}$, so $\text{tr}(AA^T)=\sum_i \|A_{i,:}\|^2$, where $\|A_{i,:}\|^2$ is the squared norm of the $i$'th row. With that said, you don't need to explicitly calculate the product $AA^T$ to get the trace. For your problem, $A=A_1 A_2\cdots A_k$. If you can afford to compute the chain of matrix products, requiring $O(kd^3)$ work, then the trace of $AA^T$ is relatively cheap: $O(d^2)$.
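A minimal MATLAB/Octave sketch of this idea (assuming, hypothetically, the factors are stored in a cell array `As`):

% Exact trace via row norms of the full product; As is a hypothetical
% cell array {A_1, ..., A_k}.
A = As{1};
for i = 2:numel(As)
    A = A * As{i};          % chain of matrix products, O(k*d^3) total
end
t = sum(sum(A.^2, 2));      % tr(A*A') = sum of squared row norms, O(d^2)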

Charlie S
  • That's actually an interesting idea -- your approach is to take the $d$ rows of $A_1$, call each one $x$, and for each $x$ compute $xB$ where $B=A_2\cdots A_k$. My approach is to compute a set of $s$ values of $xB$, where each $x$ is a random linear combination of all $d$ rows of $A_1$. This means that when my stochastic estimator uses $s=d$ samples, it costs the same as your exact approach while being worse. Now, what if $s=d/2$? Perhaps there's a way to beat the stochastic estimator by being smarter about the set of $x$'s. Perhaps take the $d/2$ largest rows? – Yaroslav Bulatov Jan 08 '22 at 17:14
  • Updated the post with some diagrams comparing these approaches. – Yaroslav Bulatov Jan 08 '22 at 18:04