
I am reading page 651 of Elements of Statistical Learning, where it says:

"The simplest form of regularization assumes that the features are independent within each class, that is, the within-class covariance matrix is diagonal. Despite the fact that features will rarely be independent within a class, when p ≫ N we don’t have enough data to estimate their dependencies."

I am struggling to understand why this is.

Take for example a hypothetical case where you have 1000 data points and 5000 predictors. Here $p \gg N$. Surely 1000 points is enough to calculate the correlation between, say, $X_1$ and $X_2$, regardless of whether we have predictors up to $X_{5000}$.
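To make that concrete, here is a quick sketch (my own illustration, assuming NumPy; the independent-Gaussian data are just a placeholder):

```python
# A quick sketch: N = 1000 observations of p = 5000 features.
import numpy as np

rng = np.random.default_rng(0)
N, p = 1000, 5000
X = rng.standard_normal((N, p))   # independent features, true correlation 0

# The correlation between X_1 and X_2 uses all N = 1000 samples,
# so its precision does not depend on how many other predictors exist.
r_12 = np.corrcoef(X[:, 0], X[:, 1])[0, 1]
print(f"estimated corr(X_1, X_2): {r_12:+.3f}")  # close to the true value of 0
```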

Is there something I'm not understanding about high dimensional problems here?

Sean
  • What do you suppose the rank is of such a covariance matrix? – Sycorax Jun 30 '20 at 21:30
  • I understand that the $p \times p$ covariance matrix will have rank of at most $N$, which means it will never be full rank when $p \gg N$, as obviously $N < p$ in this case. However, surely we can still calculate the covariance matrix - it is just the inversion that won't work. I guess I am maybe misunderstanding what is meant here by "their dependencies". Is this not just referring to the covariance between the two predictors? – Sean Jun 30 '20 at 21:38
  • There's a subtlety here: the individual entries in the covariance matrix can be very accurately estimated with large $N,$ regardless of $P:$ see https://stats.stackexchange.com/a/61068/919 for the argument (which extends beyond the Normal-distribution context of that answer). However, when $P\gt N,$ the covariance matrix in some sense is extremely special because it must be singular, confining it to a measure-zero subset of the space of all covariance matrices. cc @Sycorax – whuber Jul 01 '20 at 14:32
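A minimal sketch of the rank point raised in these comments, assuming NumPy (the values of $N$ and $p$ below are arbitrary, not from the thread): each entry of the sample covariance matrix is an ordinary estimate, yet when $p > N$ the matrix as a whole is necessarily singular.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 20, 50
X = rng.standard_normal((N, p))

S = np.cov(X, rowvar=False)        # p x p sample covariance matrix
print(S.shape)                     # (50, 50)
print(np.linalg.matrix_rank(S))    # at most N - 1 = 19, so S is singular
print(np.linalg.det(S))            # effectively 0: S cannot be inverted
```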

0 Answers