I am reading p. 651 of Elements of Statistical Learning, where it says:
"The simplest form of regularization assumes that the features are independent within each class, that is, the within-class covariance matrix is diagonal. Despite the fact that features will rarely be independent within a class, when p ≫ N we don’t have enough data to estimate their dependencies."
I am struggling to understand why this is.
Take for example a hypothetical case where you have 1000 data points and 5000 predictors. Here $p \gg N$. Surely 1000 points is enough to calculate the correlation between, say, $X_1$ and $X_2$, regardless of whether we have predictors up to $X_{5000}$.
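To make my confusion concrete, here is a small NumPy sketch (dimensions scaled down from my hypothetical $N = 1000$, $p = 5000$ for speed). Each pairwise correlation can be estimated from all $N$ samples, yet the full $p \times p$ sample covariance matrix has rank at most $N - 1$ and is therefore singular whenever $p > N$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 200  # scaled-down stand-in for N = 1000, p = 5000
X = rng.standard_normal((N, p))

# Any single pairwise correlation, e.g. between X_1 and X_2,
# is estimated from all N samples.
r12 = np.corrcoef(X[:, 0], X[:, 1])[0, 1]

# But the p x p sample covariance matrix has rank at most N - 1
# (the rank of the mean-centered data matrix), so it is singular
# whenever p > N.
S = np.cov(X, rowvar=False)  # shape (p, p)
rank = np.linalg.matrix_rank(S)
print(rank, rank < p)  # rank is N - 1 = 49, so prints: 49 True
```

So each entry of the covariance matrix seems estimable on its own, but the matrix as a whole cannot be inverted, which is presumably what methods like LDA need.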
Is there something I'm not understanding about high-dimensional problems here?