How to efficiently do PCA/SVD on dataset with thousands of features (both continuous and OHE)

Question

I am currently working with a dataset of about 300,000 records. Several columns contain a wide variety of categories, so one-hot encoding them pushes the number of features into the thousands.

Is there a heuristic for choosing the number of components (n_components) for SVD that minimises the MSE of the linear regression fitted to the data after the decomposition?

I have tried a random search, but after more than 10 hours it is still nowhere near complete.

Since you want to regress the PCA outputs against some target variable, why not do that directly with [tag:partial-least-squares]? — Sycorax, Mar 18 '22 at 01:39
Interesting and thank you for your suggestion, I just had a brief read over this but I'm still not sure what the best method is to find the optimal number for n_components even with PLS. — Ace, Mar 18 '22 at 01:41
The MSE on the training data will be minimal if you include all of the features. You don't even need to do PCA. — Sycorax, Mar 18 '22 at 01:53
Oh wow, okay, I always thought adding features willy nilly has the possibility of making the model perform worse. — Ace, Mar 18 '22 at 02:06
If you're computing a new PCA/SVD for each choice of number of components, you're wasting time because you can re-use the first $k$ each time you want to add an additional component. Or just do PCA/SVD once for the largest number of components you care to compute, and then down-select to the $k$ you want to use. Rinse and repeat. — Sycorax, Mar 18 '22 at 02:07
Let me clarify: I want to minimise MSE on the testing data. Of course it will be minimised on the training data regardless. — Ace, Mar 18 '22 at 02:08
PCA is not guaranteed to improve the model; see https://stats.stackexchange.com/questions/141864/how-can-top-principal-components-retain-the-predictive-power-on-a-dependent-vari Feature selection might or might not be misguided. — Sycorax, Mar 18 '22 at 02:08
Testing data is a different story. You can't know how the model predicts unseen data until you measure it. — Sycorax, Mar 18 '22 at 02:10
Thank you so much for your help and sorry for posting a duplicate, I guess I just didn't know how to phrase the question when I was doing my research. — Ace, Mar 18 '22 at 02:13
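
The suggestion in the comments above (compute the decomposition once, then down-select to the first $k$ components) can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: it assumes a dense matrix that fits in memory, uses plain NumPy rather than any particular library's PCA class, and the function name `pcr_validation_curve`, the held-out validation split, and the synthetic data are all hypothetical. Because the training scores are orthogonal, the least-squares coefficient on each component does not change when more components are added, so the validation MSE for every $k$ comes from a single SVD and a cumulative sum.

```python
import numpy as np

def pcr_validation_curve(X_train, y_train, X_val, y_val, max_k):
    """Validation MSE of regression on the first k principal components,
    for k = 1..max_k, using a single SVD (hypothetical helper)."""
    # Center using training statistics only, so nothing leaks from validation data.
    x_mean = X_train.mean(axis=0)
    y_mean = y_train.mean()
    Xc = X_train - x_mean

    # One thin SVD of the centered training matrix; rows of Vt are the directions.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

    # Drop numerically zero singular values and cap at max_k.
    max_k = min(max_k, int(np.sum(s > 1e-10 * s[0])))
    U, s, Vt = U[:, :max_k], s[:max_k], Vt[:max_k]

    # Training scores are U * s with orthogonal columns, so the OLS coefficient
    # on component j is (u_j . y) / s_j, independent of the other components.
    gamma = (U.T @ (y_train - y_mean)) / s

    # Project validation data onto the same directions.
    Z_val = (X_val - x_mean) @ Vt.T

    # Column k-1 of preds is the prediction using the first k components.
    preds = y_mean + np.cumsum(Z_val * gamma, axis=1)
    return np.mean((preds - y_val[:, None]) ** 2, axis=0)

# Synthetic data purely for illustration (not the dataset from the question).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = X[:, :5] @ rng.normal(size=5) + 0.5 * rng.normal(size=500)

mse = pcr_validation_curve(X[:400], y[:400], X[400:], y[400:], max_k=20)
best_k = int(np.argmin(mse)) + 1  # number of components with lowest validation MSE
print("best k on the validation split:", best_k)
```

As noted in the comments, the error on unseen data can only be measured, so the chosen $k$ is only as trustworthy as the validation split. For a large sparse one-hot matrix, a truncated or randomized solver would likely replace the full dense SVD used here.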

