
Suppose $$C_n=X_1 X_2\cdots X_n,$$ where each $X_i$ is a $d\times d$ matrix with IID entries normally distributed with mean $0$ and variance $\frac{1}{d}$.

The following appears to be true for large $d$. Why?

$$\|C_n C_n^T\|_F^2\approx d(n+1).$$

Here are some numbers from 4 samples with $d=1000$ (Notebook):

n     sample1  sample2  sample3  sample4
----------------------------------------
1     2003.99  1998.66  1999.51  1998.14
2     3029.97  2990.12  3008.21  2999.13
3     3967.81  3995.46  4022.33  4005.20
4     5027.41  5075.39  4941.94  5057.40
5     6143.21  5964.35  5844.76  6015.08
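
A minimal simulation sketch that produces numbers like these (the linked notebook may differ in details):

import numpy as np
np.random.seed(0)

d = 1000
C = np.eye(d)
for n in range(1, 6):
    X = np.random.randn(d, d) / np.sqrt(d)   # iid N(0, 1/d) entries
    C = C @ X                                # C_n = X_1 X_2 ... X_n
    print(n, np.sum((C @ C.T)**2))           # squared Frobenius norm, roughly d*(n+1)
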
amoeba
Yaroslav Bulatov
  • I think a way to proceed would take advantage of the Marchenko–Pastur distribution for the distribution of $C_n C^T_n$. The elements of $C_n$ are inner products of vectors with IID Gaussian entries, for which I assume someone would know the distribution; perhaps relevant are https://en.wikipedia.org/wiki/Distribution_of_the_product_of_two_random_variables and the answer by Ulisses Braga-Neto to the question https://math.stackexchange.com/questions/101062/is-the-product-of-two-gaussian-random-variables-also-a-gaussian – Number Nov 28 '23 at 22:40
  • BTW, this relation also appears to hold if we don't resample the $X$'s, i.e., use the same random matrix for all $X_i$. – Yaroslav Bulatov Nov 29 '23 at 18:39
  • I'd like to emphasize I believe the Marchenko–Pastur distribution is important here, because $||C C^T||^2_F = trace(C C^T) = \Sigma eigenvalues(C C^T) = \Sigma singularvals(C)^2$ – Number Dec 04 '23 at 01:12
  • @number is the square of the Frobenius norm of a matrix equal to the trace of the matrix? – Sextus Empiricus Dec 04 '23 at 05:54
  • @SextusEmpiricus No, I've made a silly typo, sorry, thanks for pointing it out! Cannot edit it to correct it. The correct identity is: $||C||^2_F = trace(C^T C) = trace(C C^T)$. The last equality follows from the cyclical nature of trace. – Number Dec 10 '23 at 18:11
  • @Number ah, that makes sense. So $||C^TC||^2_F = trace(C^TCC^TC)$. – Sextus Empiricus Dec 10 '23 at 21:03
  • @SextusEmpiricus Yes. Or starting as in the original question: $||C C^T||^2_F = trace((CC^T)^T CC^T) = trace(CC^T CC^T) = trace(C^T CC^T C)$ which is equivalent to what you wrote. Therefore: $||C C^T||^2_F = trace(C^T CC^T C) = \Sigma eigvals(C^T CC^T C) = \Sigma eigvals(C^T C)^2 = \Sigma singvals(C)^4$. This is why I believe the Marchenko–Pastur distribution can be useful here. – Number Dec 13 '23 at 17:56

2 Answers


Let $X$ be a $d\times d$ random matrix with iid $\mathcal N(0, 1/d)$ elements. Let us first show that $$\| XX^\top \|_F^2 \approx 2d.$$

Induction base

We want to compute the expectations of the squared elements of $XX^\top$; its $(i,k)$ element is $\sum_j X_{ij}X_{kj}$.

Consider diagonal elements first: those are $\sum_j X_{ij} X_{ij} = \sum_j X_{ij}^2$. In this sum each term has expectation $1/d$ and there are $d$ of them, so the sum has expectation $1$. The variance is $O(1/d)$, so we can ignore it. Then the expectation of the square is approximately $1^2 = 1$, and there are $d$ of them, contributing $d$ to the squared norm.

Now the off-diagonal elements: those are $\sum_j X_{ij}X_{kj}$ with $i\ne k$. Each term has mean 0 and variance $1/d^2$ (product of $1/d$ and $1/d$), so after summation we have mean 0 and variance $1/d$. After squaring, we get expected value $1/d$. There are $d^2-d$ off-diagonal elements, which is approximately $d^2$, so in total this contributes approximately $d$ to the squared norm.

Together, we get $2d$.
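
A quick numerical illustration of these two claims (a minimal sketch): the diagonal of $XX^\top$ concentrates around $1$, the off-diagonal entries have variance $\approx 1/d$, and each part contributes $\approx d$ to the squared norm.

import numpy as np
np.random.seed(0)

d = 1000
X = np.random.randn(d, d) / np.sqrt(d)
F = X @ X.T

off = F[~np.eye(d, dtype=bool)]                # the d^2 - d off-diagonal entries
print(np.diag(F).mean())                       # ~ 1
print(off.var() * d)                           # ~ 1, i.e. off-diagonal variance ~ 1/d
print(np.sum(np.diag(F)**2), np.sum(off**2))   # each ~ d, so the total is ~ 2d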

Induction step

Now we need to show that if $A$ has diagonal elements with mean $1$ and off-diagonal elements with mean $0$ and variance $k/d$, then $XA X^\top$ will have diagonal elements with mean $1$ and off-diagonal variance approximately $(k+1)/d$. It will then follow that $$\| XA X^\top \|_F^2 \approx d + (k+1)d.$$

In coordinates, the $(i,l)$ element of $XA X^\top$ is $\sum_{jk} X_{ij} A_{jk} X_{lk}$. Again, consider the diagonal elements ($i=l$) first: we need their expectation, and only the $j=k$ terms contribute, giving $\sum_j X_{ij}^2 A_{jj}$. Each term has expectation $1/d$, we have $d$ of them, so the total expected value is 1. The variance can be ignored. Done.

Now the off-diagonal elements ($i\ne l$). We want their variance, which (since their mean is 0) is the same as the expectation of their squares. The squares are: $$\big(\sum_{jk} X_{ij} A_{jk} X_{lk}\big)\cdot \big(\sum_{ab} X_{ia} A_{ab} X_{lb}\big) = \sum_{jkab} X_{ij} A_{jk} X_{lk} X_{ia} A_{ab} X_{lb}.$$ We only need the terms that have non-zero expectation. There are two kinds of such terms:

  • if $j=a$ and $k=b$ (with $j\ne k$), then we get $X_{ij}^2 X_{lk}^2 A_{jk}^2$. This has expected value $1/d \cdot 1/d \cdot k/d = k/d^3$ (here $k/d$ is the off-diagonal variance of $A$, not the index $k$). And there are $(d^2-d)$ terms like that, which is approximately $d^2$, so we get expectation approximately $k/d$.
  • if $j=a=k=b$, then we get $X_{ij}^2 X_{lj}^2 A_{jj}^2$. This has expected value $1/d \cdot 1/d \cdot 1 = 1/d^2$. And there are $d$ terms like that, so we get expectation $1/d$.

Together, the expectation is approximately $(k+1)/d$, so the off-diagonal variance of $XA X^\top$ is indeed approximately $(k+1)/d$. This completes the induction step: starting from $XX^\top$ (where $k=1$), after $n$ factors we get $\|C_n C_n^\top\|_F^2 \approx d + nd = (n+1)d$.
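
Here is a quick check of this induction invariant (a minimal sketch; the sanity-check code below verifies the norm itself): with each fresh factor $X$, the diagonal of the product keeps mean $\approx 1$ while the off-diagonal variance grows by $\approx 1/d$.

import numpy as np
np.random.seed(1)

d = 1000
mask = ~np.eye(d, dtype=bool)                  # selects the off-diagonal entries

X = np.random.randn(d, d) / np.sqrt(d)
A = X @ X.T                                    # k = 1: off-diagonal variance ~ 1/d
for k in range(1, 5):
    print(k, np.diag(A).mean(), A[mask].var() * d)   # ~ 1 and ~ k
    X = np.random.randn(d, d) / np.sqrt(d)
    A = X @ A @ X.T                            # off-diagonal variance should become ~ (k+1)/d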


Code for sanity-checking:

import numpy as np
np.random.seed(42)

d = 1000
X = np.random.randn(d, d) / np.sqrt(d)   # iid N(0, 1/d) entries

F = X @ X.T
# n = 1: diagonal contributes ~d, total ~2d
print(f'Diagonal: {np.sum(np.diag(F)**2):.1f}\nTotal: {np.sum(F**2):.1f}\n')

X = np.random.randn(d, d) / np.sqrt(d)
F = X @ F @ X.T
# n = 2: diagonal still ~d, total ~3d
print(f'Diagonal: {np.sum(np.diag(F)**2):.1f}\nTotal: {np.sum(F**2):.1f}')

Output:

Diagonal: 1002.8
Total:    2003.8

Diagonal: 1012.8
Total:    3029.2

amoeba
  • Haha, that's a lot of work. I found this derivation of $\operatorname{Tr}(A^4)$ in Terry Tao's "Topics in Random Matrix Theory" and was looking to figure out how to adapt it to compute $\operatorname{Tr}(AAA^TA^T)$ – Yaroslav Bulatov Dec 01 '23 at 16:17
  • (screenshot linked above gives derivation for symmetric $A$) – Yaroslav Bulatov Dec 01 '23 at 16:43
  • Posted follow-up question on math.SE – Yaroslav Bulatov Dec 01 '23 at 20:47
  • it's something to do with non-commutativity (search for "Disentangling Lemma" in lecture notes). I used Mathematica to guess formulas for various trace expressions, got this table – Yaroslav Bulatov Dec 01 '23 at 21:38
  • "The generic term here has mean 0 and variance 1/d^4, and there are d^3 of them, so we get variance 1/d after summation" You might want to check this step with simulations. Possibly correlations between the d^3 terms mess up the 1/d after summation. – Sextus Empiricus Dec 03 '23 at 17:25
  • @SextusEmpiricus I thought about it, but it's actually not super easy to set up a simulation for that, so I haven't really tried so far. One would need to sum only over the terms with $j\ne k\ne l$... – amoeba Dec 03 '23 at 23:00
  • @amoeba I wrote a similar function and just looked at the histograms of cov(C_n) for a single matrix. What we see is a bimodal distribution: one mode for the diagonal elements and another for the off-diagonal elements. When we increase $n$, the variance of the off-diagonal elements increases. Also, when we make simulations with small $d$, we see that the formula doesn't work for large $n$ and there is a decline. – Sextus Empiricus Dec 04 '23 at 01:07
  • @SextusEmpiricus (1) Yes, diagonal terms and non-diagonal terms have different distributions, that's why I consider them separately. (2) Hmm, it's not a typo. That's an expression for the $(i,p)$ element of the matrix that we are computing the Frobenius norm of. The sum is over $jkl$. (3) Not exactly sure what you mean here, but the 3rd and 4th $X$'s are transposed. That's why the indices are flipped. – amoeba Dec 04 '23 at 07:49
  • @amoeba I was mistaken about the typo – Sextus Empiricus Dec 04 '23 at 08:03
  • @SextusEmpiricus Ah sorry, I consider the situation where all $X$ are the same, see one of the first sentences in my answer. As OP wrote in the comments under his Q, experimentally the answer stays the same if all $X$ are the same matrix. – amoeba Dec 04 '23 at 08:12
  • @SextusEmpiricus Probably the better way to approach all of this would be to directly consider the induction step and just talk about $XAX^\top$ where $A$ has mean 1 elements on the diagonal and some variance $n/d$ off-diagonal. Then it's easy to see that the resulting product will also have mean 1 on the diagonal, and the goal is to show that off-diagonal variance is now $(n+1)/d$. I still have no idea how to show it though. Naive calculation as in my answer gives $n/d$. Probably you are right and it's due to covariances between summation terms. – amoeba Dec 04 '23 at 08:43
  • @SextusEmpiricus Actually I think writing it like that, I have now the correct calculation. Let me try to edit my answer now... – amoeba Dec 04 '23 at 09:01
  • @SextusEmpiricus Edited!! – amoeba Dec 04 '23 at 09:17
  • @YaroslavBulatov I believe I have found the mistake. Now everything works. Such a relief. – amoeba Dec 04 '23 at 09:17
  • @amoeba I haven't yet found time to re-read your post. But, if we do not resample $X$ then shouldn't the process $C_{n+1} = C_n X$ in the long run lead to a matrix with columns that are all multiples of the largest eigenvector of $X$ and then for large $n$ the behaviour becomes a simple multiplication of the largest eigenvalue. – Sextus Empiricus Dec 04 '23 at 10:47
  • @amoeba I like this version of your answer. An inductive type of argument is also what I was considering. I am wondering why your previous answer didn't work. Was it the lack of independence? Also, I wonder when this approximation starts to break down, because in the long run the linear behaviour becomes noisy. – Sextus Empiricus Dec 04 '23 at 11:42
  • @SextusEmpiricus Yes, I believe it was the lack of independence. The wrong calculation in my initial answer basically corresponded to the first bullet point in the induction step calculation in the current answer. What I missed, is the second bullet point. It became easier after I wrote it as $XAX^\top$ instead of $XXX^\top X^\top$ because the expressions became shorter and I managed to write down the square of the sum (which is not the same as sum of squares! which is what I implicitly assumed previously). – amoeba Dec 04 '23 at 14:37
  • @SextusEmpiricus Regarding the power iteration argument: I think it's again about whether $d$ is large enough for a given $n$. For any given $n$, the approximation only holds for sufficiently large $d$ values. E.g. for $n=1$, we have squared norm $2d-1/d$ which is approximately $2d$ for let's say $d>10$. For larger and larger $n$, we may need larger and larger $d$ to have this "approximate" equality. BTW, my induction argument gives a recurrent but exact formula for any $n$ and $d$. – amoeba Dec 04 '23 at 14:41
  • @amoeba the induction argument still uses some sort of independence of the terms. Does the induction argument also work when the columns in $A$ start to become more correlated? – Sextus Empiricus Dec 04 '23 at 15:36
  • @SextusEmpiricus I don't see any assumptions of independence anymore, but maybe I am overlooking smth. – amoeba Dec 04 '23 at 15:48
  • @amoeba I think you still have the dependence, but now it is incorporated into the formulas. It is expressed in the expectations of some of the off-diagonal elements that have a value $k/d$, which increases for larger $k$. – Sextus Empiricus Dec 04 '23 at 16:05
  • @amoeba nice work. I guess for a recursive solution, one could substitute the exact formula for $E[XAX']$ (I don't have that one handy, but a more involved one is derived here) – Yaroslav Bulatov Dec 04 '23 at 22:16
  • @YaroslavBulatov I've just added the recursive formula to my answer, it follows directly from the text above. – amoeba Dec 04 '23 at 22:20
  • @YaroslavBulatov And now I deleted the recursive formula because turns out, it was wrong... – amoeba Dec 06 '23 at 18:39

Fix $n$, and let

$$ \Gamma = [d]^{n+1} = \{ \gamma = (\gamma_0, \gamma_1, \ldots, \gamma_n) : \gamma_i \in [d] \} $$

be the space of length-$n$ paths in $[d] = \{1, \ldots, d\}$. We also define

$$ \Gamma^{⧖} = \{ (\gamma^{11}, \gamma^{12}, \gamma^{21}, \gamma^{22}) \in \Gamma^4 : \gamma^{i1}_0 = \gamma^{i2}_0 \text{ and } \gamma^{1j}_n = \gamma^{2j}_n \}. $$

That is, $\Gamma^{⧖}$ is the space of all quadruples of curves $(\gamma^{11}, \gamma^{12}, \gamma^{21}, \gamma^{22})$ that form a configuration like below:

[figure: configuration]

Then with $d\times d$ matrices $X^{(1)}, \ldots, X^{(n)}$ with IID $\mathcal{N}(0, d^{-1})$ entries and $C_n = X^{(1)} \cdots X^{(n)}$,

$$ \| C_n C_n^T \|_F^2 = \sum_{(\gamma^{11}, \gamma^{12}, \gamma^{21}, \gamma^{22}) \in \Gamma^{⧖}} \prod_{\substack{t \in [n] \\ i, j \in [2]}} X^{(t)}_{\gamma^{ij}_{t-1},\gamma^{ij}_t} $$
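
This identity can be checked by brute force for tiny $d$ and $n$ by enumerating all quadruples of paths (a minimal sketch, not needed for the argument below):

import itertools
import numpy as np

np.random.seed(0)
d, n = 3, 2                                     # small enough to enumerate everything
Xs = [np.random.randn(d, d) / np.sqrt(d) for _ in range(n)]

C = np.linalg.multi_dot(Xs)                     # C_n = X^(1) ... X^(n)
direct = np.sum((C @ C.T)**2)                   # squared Frobenius norm, computed directly

paths = list(itertools.product(range(d), repeat=n + 1))   # Gamma = [d]^(n+1)
total = 0.0
for g11, g12, g21, g22 in itertools.product(paths, repeat=4):
    # membership in Gamma^⧖: shared starting points and shared end points
    if g11[0] == g12[0] and g21[0] == g22[0] and g11[n] == g21[n] and g12[n] == g22[n]:
        prod = 1.0
        for g in (g11, g12, g21, g22):
            for t in range(n):
                prod *= Xs[t][g[t], g[t + 1]]
        total += prod

print(direct, total)                            # the two values agree up to rounding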

Expectation. We observe that, for the product $\prod_{t \in [n]; i, j \in [2]} X^{(t)}_{\gamma^{ij}_{t-1},\gamma^{ij}_t}$ to have non-zero expectation, each factor $X^{(t)}_{pq}$ must appear an even number of times in the product. This forces the paths $\gamma^{11}, \gamma^{12}, \gamma^{21}, \gamma^{22}$ to coalesce. Among such coalesced configurations, the most abundant pattern is of the form

[figure: coalesced configuration]

Since the "quadruple joint" in the above figure can occur at any of $t = 0, 1, \ldots, n$, the number of such configurations is $ (n+1) d^{n+1} (d-1)^n $, yielding

$$ \mathbf{E}[ \| C_n C_n^T \|_F^2 ] \sim (n+1) d^{n+1} (d-1)^n \cdot \mathbf{E}[(X_{pq}^{(t)})^2]^{2n} \sim (n+1)d. $$


Remark 1. Numerical simulation suggests that

$$ \mathbf{E}[ \| C_n C_n^T \|_F^2 ] = (n+1)d + \frac{n(n+1)}{2} + \mathcal{O}(d^{-1}) $$

as $d \to \infty$ for each $n$. When $n = 1, 2$, this claim can be directly verified by the exact formulas

$$ \mathbf{E}[ \| C_1 C_1^T \|_F^2 ] = 2d + 1, \qquad \mathbf{E}[ \| C_2 C_2^T \|_F^2 ] = 3d + 3 + \frac{3}{d}. $$
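
A rough Monte Carlo check of these two formulas at a modest $d$ (a sketch only; sq_frob is just an illustrative helper):

import numpy as np
np.random.seed(0)

d, reps = 100, 2000

def sq_frob(n):
    # one sample of ||C_n C_n^T||_F^2 with fresh iid N(0, 1/d) factors
    C = np.eye(d)
    for _ in range(n):
        C = C @ (np.random.randn(d, d) / np.sqrt(d))
    return np.sum((C @ C.T)**2)

for n in (1, 2):
    print(n, np.mean([sq_frob(n) for _ in range(reps)]))
    # expect roughly 2d + 1 = 201 and 3d + 3 + 3/d = 303.03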

Remark 2. It seems that $\mathbf{Var}[ \| C_n C_n^T \|_F^2 ] = \mathcal{O}_n(1)$ as $d \to \infty$. For example, we have an exact formula

$$ \mathbf{Var}[ \| C_1 C_1^T \|_F^2 ] = 36 + \frac{40}{d} + \frac{20}{d^2}. $$
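
And similarly for the variance at $n = 1$ (again just a rough sketch):

import numpy as np
np.random.seed(0)

d, reps = 100, 10000
samples = []
for _ in range(reps):
    X = np.random.randn(d, d) / np.sqrt(d)
    samples.append(np.sum((X @ X.T)**2))

print(np.var(samples))   # the formula above gives 36 + 40/d + 20/d^2 = 36.402 for d = 100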

I will revisit this idea and try to (1) articulate a more rigorous and detailed estimate of the expectation, and (2) prove the boundedness of the variance as $d \to \infty$. I'm just too exhausted and sleepy right now...

  • Interesting...how did you get the exact formulas, by enumerating cycle types as in your XXX'X' answer? – Yaroslav Bulatov Dec 05 '23 at 04:24
  • @YaroslavBulatov, For the expectation, indeed yes. I enumerated all the possible types of configurations and computed their contributions. For the variance, however, I don't have a nice graphical description yet (although I have a half-baked idea for this), so I had to use software to enumerate all the possible combinations and automatically compute their contributions. I also evaluated the variance for small $d$'s and verified that the formula works in those cases. – Sangchul Lee Dec 05 '23 at 04:58
  • I see...I'm curious about your software approach, I've used a more heuristic approach, forked this discussion to here – Yaroslav Bulatov Dec 05 '23 at 07:01