
Suppose I take $k$ vectors sampled uniformly at random from the surface of the unit sphere in $d$ dimensions: $$v_1, v_2, v_3,\ldots ,v_k$$

I apply Gram-Schmidt orthogonalization (but not orthonormalization) to obtain mutually orthogonal vectors.

$$u_1,u_2,u_3,\ldots,u_k$$

Can we say anything about how the expected value of $\|u_1\|^2+\ldots +\|u_k\|^2$ depends on $k$ and $d$?

One expects a flattening curve as each new observation $v$ provides less and less "orthogonal" information.

Empirically, here's what it looks like for $k\in [1..100]$ and $d=100$:
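
(The figure isn't reproduced here. The following is a minimal R sketch of the experiment that produces a curve of this shape; the names total.sq and curve.hat are mine, not from the linked notebook.)

    set.seed(1)
    total.sq <- function(k, d) {                 # sum of squared lengths after Gram-Schmidt
      V <- matrix(rnorm(d * k), d, k)
      V <- sweep(V, 2, sqrt(colSums(V^2)), `/`)  # columns uniform on the unit sphere
      U <- V
      for (i in seq_len(k)[-1]) {                # classical Gram-Schmidt, no normalization
        B <- U[, 1:(i-1), drop = FALSE]
        U[, i] <- V[, i] - B %*% (crossprod(B, V[, i]) / colSums(B^2))
      }
      sum(U^2)                                   # ||u_1||^2 + ... + ||u_k||^2
    }
    d <- 100
    ks <- 1:100
    curve.hat <- sapply(ks, function(k) mean(replicate(20, total.sq(k, d))))
    plot(ks, curve.hat, xlab = "k", ylab = "Expected total squared length")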

Motivation: this seems related to the "step-size problem" for mini-batch iterative linear least-squares solvers, where the ideal step size grows almost linearly until the batch size reaches $\frac{\text{Tr}(\Sigma)}{\|\Sigma\|}$, called the "intrinsic dimension" by Tropp.

Using equation 3 from that paper suggests a relationship of the form $$\frac{bk}{a+k},$$

where $b$ and $a$ are constants. This form seems to approximately match the expected total squared length after orthogonalization:

[Figure: empirical expected total squared length for $d=100$, with a $\frac{bk}{a+k}$ fit overlaid]
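
As a quick illustration, this form can be fit to the simulated curve.hat from the sketch above with nls; the starting values below are arbitrary, and $a$, $b$ are left as free parameters rather than taken from the paper:

    fit <- nls(curve.hat ~ b * ks / (a + ks), start = list(a = 20, b = 60))
    coef(fit)                          # fitted a and b
    lines(ks, predict(fit), lwd = 2)   # overlay the fit on the previous plot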

notebook

Yaroslav Bulatov

1 Answer


After performing the Gram-Schmidt process on $k\ge 0$ vectors, you have mutually orthogonal vectors spanning a $k$-dimensional subspace $V^k\subset \mathbb{R}^d.$ The next vector is randomly and uniformly drawn from the unit sphere $S^{d-1},$ but only its component orthogonal to $V^k$ is retained.
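
(In code, retaining the orthogonal component is a single projection step. A one-line R sketch, where the name ortho.residual is illustrative and U is assumed to have mutually orthogonal, nonzero columns:)

    # Component of v orthogonal to the column span of U (columns of U must be
    # mutually orthogonal and nonzero):
    ortho.residual <- function(v, U) v - U %*% (crossprod(U, v) / colSums(U^2))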

What is the distribution of the squared length of this component? Consider the situation where $k=d-1;$ that is, there is a single direction orthogonal to $V^k.$ As explained at https://stats.stackexchange.com/a/85977/919, the density of the length $t$ is proportional to $(1-t^2)^{(d-3)/2}.$ The expected squared length therefore is

$$\frac{\int_0^1 x^2 (1-x^2)^{(d-3)/2}\,\mathrm{d}x}{\int_0^1(1-x^2)^{(d-3)/2}\,\mathrm{d}x} = \frac{1}{d}.$$
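
(As a numerical sanity check of this ratio, e.g. for $d=100$ in R:)

    d <- 100
    num <- integrate(function(x) x^2 * (1 - x^2)^((d - 3)/2), 0, 1)$value
    den <- integrate(function(x) (1 - x^2)^((d - 3)/2), 0, 1)$value
    c(num / den, 1 / d)   # both equal 0.01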

Generally, for arbitrary $0\le k \le d-1,$ there are $d-k$ directions orthogonal to $V^k$ and (by additivity of squared lengths in the Pythagorean theorem and linearity of expectation) each one contributes $1/d$ to the expected squared length, for a total value of $(d-k)/d = 1-k/d.$ Consequently, the height of your curve at any such $k$ must be

$$f(k;d) = \sum_{i=0}^k \left(1 - \frac{i}{d}\right) = (k+1) \left(1 - \frac{k}{2d}\right).$$
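
(A quick numerical check, for $d=100$, that the partial sums agree with this closed form:)

    d <- 100
    k <- 0:(d - 1)
    max(abs(cumsum(1 - k/d) - (k + 1) * (1 - k/(2*d))))   # ~ 1e-14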


As a check, here are results of a simulation (of length $500$) for $d=100$ showing excellent agreement with this formula for $f(k;d):$

[Figure: simulation results compared to the theoretical value $f(k;d)$ for $d=100$]

Because there are more efficient ways to carry out the Gram-Schmidt process than shown in the Mathematica code, the R code that produced this figure may be of interest. (BTW, Mathematica has a very flexible Gram-Schmidt procedure implemented in its Orthogonalize function.)

#
# Generate n uniform vectors on the unit sphere in R^d (as columns).
# scale() divides each column by its root mean square, so the extra
# 1/sqrt(d-1) factor leaves each column with unit length.
#
runif.sphere <- function(n, d) 
  scale(matrix(rnorm(n * d), d), center = FALSE) / sqrt(d - 1)
#
# Gram-Schmidt orthogonalization of matrix columns.
#
GramSchmidt. <- function(W) { # Brute force, for checking
  for (i in seq_len(ncol(W) - 1)) 
    W[, i+1] <- residuals(lm(W[, i+1] ~ 0 + W[, 1:i]))
  W
}
GramSchmidt <- function(W) { # More efficient, via the QR decomposition W = QR
  obj <- qr(W); Q <- qr.Q(obj); R <- qr.R(obj)
  for (i in seq_len(ncol(W) - 1)) # Subtract the projection onto the first i columns
    W[, i+1] <- W[, i+1] - Q[, 1:i] %*% R[1:i, i+1, drop = FALSE]
  W
}
#
# Check with a simulation.
#
d <- 100
k <- d
X <- replicate(5e2, colSums(GramSchmidt(runif.sphere(k, d)) ^ 2)) # All squared lengths
# 
# Optional check of the per-vector expectation: the squared length of the
# (k+1)st orthogonalized vector should average 1 - k/d.
# plot(0:(d-1), rowMeans(X))
# curve(1 - k/d, add=TRUE, xname="k")
#
plot(0:(d-1), cumsum(rowMeans(X)), xlab=expression(italic(k)), ylab="", 
     main = "Simulation Results Compared to the Theoretical Value",
     cex.main = 1, cex.axis=1.25, cex.lab=1.5)
curve((k+1) * (1 - k/(2*d)), add = TRUE, lwd = 2, xname = "k")
whuber
  • Wow, thanks for the in-depth explanation! Orthogonalize turned out not to be flexible enough to use for orthogonalization. Is there a way to extend this analysis to a general Gaussian? I.e., if the Gaussian's eigenvalues fall off as $1/i$, how fast do the "orthogonal contributions" decay? – Yaroslav Bulatov Jul 28 '22 at 17:39
  • Obtaining an exact solution to that sounds much harder. Even in the case $d=2,$ $k=0$ you are asking for the expectation of a sum of independent $\chi^2(1)$ distributions with different scales. Although that expectation is easy to compute, with larger $d$ you have to consider the conditional expectations for $k\ge 1$ and integrate over them, which means integrating over a sum of differently-scaled $\chi^2$ distributions. As far as the flexibility of Orthogonalize, you'll find it's amazing once you work it out. I have used it to orthogonalize families of functions, for instance. – whuber Jul 28 '22 at 18:51
  • Posted the $1/i$ question as a follow-up here – Yaroslav Bulatov Jul 28 '22 at 19:56
  • BTW, something interesting happens in high dimensions (i.e., $d>20$): the expected value of 1 over the "average dot product squared over all pairs of vectors in a batch" starts following a simple law -- https://mathoverflow.net/questions/429728/expected-norms-of-wishart-matrices – Yaroslav Bulatov Sep 04 '22 at 17:32