
I want a single metric for how spread out a multidimensional dataset (with a large number of features) is, i.e. how much variance it has. I learned that the determinant (or pseudo-determinant) of the feature covariance matrix can be a good measure (the volume intuition). However, the pseudo-determinant, for example, requires computing all the non-zero eigenvalues (with SVD). For a large covariance matrix the full SVD is usually slow, whereas a partial SVD (e.g. computing only the few largest eigenvalues) is fast. I wonder whether it is fair to use only the few largest eigenvalues to evaluate the total spread/variance, e.g. by taking their product. If so, does that also have a geometric interpretation (such as the volume of a subspace)? After extensive searching, I haven't found any discussion or use of this idea. Thanks.
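
To make the comparison concrete, here is a minimal sketch of the two quantities, assuming NumPy and SciPy are available. The helper names (`log_pseudo_det`, `log_topk_product`), the relative tolerance, and the synthetic rank-5 data are all illustrative assumptions, not an established recipe. One observation on the geometry: the product of the `k` largest eigenvalues equals the determinant of the covariance of the data projected onto its top-`k` principal subspace, so it describes the spread inside that `k`-dimensional projection. Logs are used so that the product of many eigenvalues does not overflow or underflow.

```python
import numpy as np
from scipy.sparse.linalg import eigsh


def log_pseudo_det(cov, rel_tol=1e-10):
    """Log pseudo-determinant from a full eigendecomposition.

    Eigenvalues at or below rel_tol * (largest eigenvalue) are treated as zero.
    The tolerance is an assumption chosen for this illustration.
    """
    eigvals = np.linalg.eigvalsh(cov)
    keep = eigvals > rel_tol * eigvals.max()
    return float(np.sum(np.log(eigvals[keep]))), int(keep.sum())


def log_topk_product(cov, k):
    """Log of the product of the k largest eigenvalues (partial decomposition)."""
    top = eigsh(cov, k=k, which="LA", return_eigenvectors=False)
    return float(np.sum(np.log(top)))


# Hypothetical data: 200 features driven by only 5 latent directions,
# so the covariance matrix has 5 eigenvalues that are clearly non-zero.
rng = np.random.default_rng(0)
n, p, rank = 500, 200, 5
X = rng.normal(size=(n, rank)) @ rng.normal(size=(rank, p))
cov = np.cov(X, rowvar=False)

full_logdet, n_kept = log_pseudo_det(cov)
partial_logdet = log_topk_product(cov, k=rank)
print(n_kept, full_logdet, partial_logdet)
```

In this constructed case the two values agree because `k` matches the numerical rank. With a smaller `k` the partial product drops the remaining factors, so values are only comparable between datasets when the same `k` is used.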

I have already read these related questions, which were very helpful:

A measure of "variance" from the covariance matrix?

Measures of multidimensional spread or variance

qliang
  • It would help to clarify exactly which quantity's variance you mean. Or do you just want to approximate a covariance matrix $\Sigma = O\operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p)O^T$ by $\Sigma' = O\operatorname{diag}(\lambda_1, \lambda_2, 0, \ldots, 0)O^T$ when $p$ is large? – Zhanxiong Dec 29 '22 at 22:30
  • Thanks for the response. By 'variance' I do not mean the well-defined statistical variance; it is mostly another way of saying 'how spread out the data are'. My question is similar to the two earlier ones (linked above), and I accept that the (pseudo-)determinant of the covariance matrix is ideally what I need. The problem is the speed of a full SVD on a large matrix, so I wonder whether a partial SVD is a fair substitute for this purpose. For matrix approximation, yes, keeping the few largest eigenvalues works, but that is not my intention. – qliang Dec 29 '22 at 22:50
  • My intuition is that the product of the few largest eigenvalues can also serve as a measure of 'multidimensional spread', but I have not found earlier examples or theory for doing this. – qliang Dec 29 '22 at 22:54
  • Subspaces don't have finite volumes. When all eigenvalues are nonzero, the level sets of the quadratic form given by the inverse covariance matrix are ellipsoids. The volume each one encloses equals the square root of the product of the eigenvalues (that is, of the determinant) of the covariance matrix times the volume of a sphere whose radius is the square root of the constant value of the quadratic form on the level set. – whuber Dec 30 '22 at 15:28
  • Thanks, @whuber. In fact, my problem is that some eigenvalues are zero, so the pseudo-determinant is preferred. In addition, many of the eigenvalues are non-zero but very close to zero (e.g., 1e-15). That is why I am thinking about using only the few largest eigenvalues. Does that sound reasonable? Or perhaps I should use the trace as the metric? Since the trace is a sum, I can ignore the many small eigenvalues. – qliang Dec 30 '22 at 16:55
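
On the trace idea raised in the last comment: the trace of the covariance matrix equals the sum of its eigenvalues, which is also the sum of the per-feature variances, so it can be computed without any eigendecomposition. The sketch below checks this numerically with NumPy. The synthetic data and the choice `k = 5` are assumptions made only for illustration.

```python
import numpy as np

# Hypothetical data: 50 features with different scales.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 50)) * np.linspace(0.1, 3.0, 50)

# Total variance straight from the data, no covariance matrix needed.
total_var_from_data = X.var(axis=0, ddof=1).sum()

# The same quantity as the trace and as the sum of eigenvalues.
cov = np.cov(X, rowvar=False)
total_var_from_trace = np.trace(cov)
eigs = np.sort(np.linalg.eigvalsh(cov))[::-1]  # descending order
total_var_from_eigs = eigs.sum()

# Share of the total variance carried by the k largest eigenvalues.
k = 5
share_top_k = eigs[:k].sum() / total_var_from_eigs
print(total_var_from_data, total_var_from_trace, share_top_k)
```

Because the trace is a sum, eigenvalues near 1e-15 contribute almost nothing to it, whereas in a product each one changes the result by many orders of magnitude.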

0 Answers