
I am currently working with a dataset of about 300,000 records. Several columns contain a wide variety of categories, so one-hot encoding them pushes the number of features into the thousands.

Is there a heuristic for choosing n_components for the SVD so that the MSE of the linear regression fitted to the decomposed data is minimised?

I have tried a random search, but after 10 hours and counting it is still nowhere near complete.

Richard Hardy
Ace
  • Since you want to regress the PCA outputs against some target variable, why not do that directly with partial least squares (PLS)? – Sycorax Mar 18 '22 at 01:39
  • Interesting, and thank you for your suggestion. I just had a brief read about this, but I'm still not sure what the best method is for finding the optimal n_components, even with PLS. – Ace Mar 18 '22 at 01:41
  • Optimal for what? – Sycorax Mar 18 '22 at 01:47
  • Minimizing MSE. – Ace Mar 18 '22 at 01:49
  • The MSE on the training data will be minimal if you include all of the features. You don't even need to do PCA. – Sycorax Mar 18 '22 at 01:53
  • Oh wow, okay. I always thought that adding features willy-nilly could make the model perform worse. – Ace Mar 18 '22 at 02:06
  • If you're computing a new PCA/SVD for each choice of number of components, you're wasting time because you can re-use the first $k$ each time you want to add an additional component. Or just do PCA/SVD once for the largest number of components you care to compute, and then down-select to the $k$ you want to use. Rinse and repeat. – Sycorax Mar 18 '22 at 02:07
  • Let me clarify: I want to minimise MSE on the testing data. Of course it will be minimised on the training data regardless. – Ace Mar 18 '22 at 02:08
  • PCA is not guaranteed to improve the model. https://stats.stackexchange.com/questions/141864/how-can-top-principal-components-retain-the-predictive-power-on-a-dependent-vari Feature selection might or might not be misguided. – Sycorax Mar 18 '22 at 02:08
  • Testing data is a different story. You can't know how the model predicts unseen data until you measure it. – Sycorax Mar 18 '22 at 02:10
  • Thank you so much for your help, and sorry for posting a duplicate. I guess I just didn't know how to phrase the question when I was doing my research. – Ace Mar 18 '22 at 02:13
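
The suggestion in the comments to decompose once and then down-select to the first $k$ components can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses plain NumPy on small synthetic data, it centres the features with the training means (so it is PCA regression rather than an uncentred truncated SVD), and the function name `validation_mse_per_k` and the train/validation split are made up for the example. Because the component scores are mutually orthogonal, each regression coefficient can be computed independently, so the held-out MSE for every $k$ comes out of a single decomposition and a cumulative sum instead of one refit per candidate.

```python
import numpy as np


def validation_mse_per_k(X_train, y_train, X_val, y_val, max_k):
    """Held-out MSE of regression on the first k components, for k = 1..max_k.

    Hypothetical helper: one SVD of the centred training matrix is reused
    for every k. Assumes the first max_k singular values are non-zero.
    """
    # Centre with training statistics only, so validation data stays unseen.
    x_mean = X_train.mean(axis=0)
    y_mean = y_train.mean()
    Xc = X_train - x_mean
    yc = y_train - y_mean

    # One decomposition, truncated to the largest k under consideration.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    U, s, Vt = U[:, :max_k], s[:max_k], Vt[:max_k]

    # The scores U * s are orthogonal, so the least-squares coefficient of
    # component j does not depend on the others: gamma_j = (u_j . y) / s_j.
    gamma = (U.T @ yc) / s

    # Project the validation rows onto the same components.
    Z_val = (X_val - x_mean) @ Vt.T          # shape (n_val, max_k)
    contrib = Z_val * gamma                  # column j: effect of component j

    # Column k-1 holds the predictions that use the first k components.
    preds = y_mean + np.cumsum(contrib, axis=1)
    return ((preds - y_val[:, None]) ** 2).mean(axis=0)


# Synthetic stand-in for the real data (sizes chosen only for illustration).
rng = np.random.default_rng(0)
n, p = 600, 40
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)
X_train, X_val = X[:400], X[400:]
y_train, y_val = y[:400], y[400:]

mse = validation_mse_per_k(X_train, y_train, X_val, y_val, max_k=p)
best_k = int(np.argmin(mse)) + 1
print("best k on the validation split:", best_k)
```

On a dataset of the size described in the question, a dense full SVD may be too expensive, so a truncated or randomised solver for the largest $k$ of interest would probably be needed instead. The reuse idea stays the same, and the chosen $k$ should still be confirmed on data that played no part in selecting it.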

0 Answers