
I have a dataset with about 25 continuous variables and a continuous target. Using this dataset, a tutor showed that when PCA scores are used instead of the raw variables as inputs for predicting the target, the MAPE (mean absolute percentage error) came down significantly. My understanding was that PCA should not affect the accuracy, only the interpretation of the variables.

Question: how does PCA improve the accuracy over the raw variables?

  • I think this is happening because PCA uses normalized scores as inputs, while the raw variables might have a scale problem – muni Dec 05 '16 at 14:14
  • See here: http://stats.stackexchange.com/questions/141864 and links in the accepted answer. This topic has been discussed on this forum many times. – amoeba Dec 05 '16 at 14:14
  • No, normalization has nothing to do with it. I assume that your tutor used only a subset of PCs and not all the PCs. This is the key. Also, I assume that they were talking about a test set error, not a training set error. – amoeba Dec 05 '16 at 14:16
  • More specifically, see my answer in http://stats.stackexchange.com/questions/36249 – amoeba Dec 05 '16 at 14:16
  • We are trying to score the training data only. I normalized my raw inputs and that now gives me an even lower MAPE than the PCA components. What could this be pointing towards? – muni Dec 05 '16 at 14:44
  • Towards something being wrong :-) The predictions of linear regression should be exactly the same whether you normalize your inputs or not. – amoeba Dec 05 '16 at 14:45
  • One more interesting thing: when I use all the PCs, the MAPE is equal to the MAPE from using all the raw variables. Also, increasing the number of PCs increases the error. – muni Dec 05 '16 at 14:51
  • I think you may have to make your dataset available to move this one forward; otherwise, as @amoeba hints, there remains the possibility of a programming error somewhere. If you do make the data available, the question may then be thought more suitable for a programming site, so make sure the statistical question is still the most salient one. – mdewey Dec 05 '16 at 15:33
  • Can we post a dataset here? – muni Dec 05 '16 at 17:03
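
Two claims from the comments above can be checked on synthetic data: that linear-regression predictions do not change when the inputs are normalized, and that using all the PCs gives the same fit as using all the raw variables. What follows is only a minimal sketch, assuming ordinary least squares with an intercept and PCA computed on standardized inputs. The data, the sizes, and the helper names `ols_fitted`, `mape`, and `pc_scores` are made up for illustration; this is not the asker's dataset or the tutor's code.

```python
import numpy as np

# Synthetic stand-in for the dataset: 25 continuous inputs on very different scales.
rng = np.random.default_rng(0)
n, p = 200, 25
X = rng.normal(size=(n, p)) * rng.uniform(0.1, 100.0, size=p)
beta = rng.normal(size=p)
# Offset keeps the target well away from zero so that MAPE is well defined.
y = 100.0 + X @ (beta / X.std(axis=0)) + rng.normal(size=n)


def ols_fitted(features, target):
    """Training-set fitted values of least squares with an intercept."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return design @ coef


def mape(target, fitted):
    """Mean absolute percentage error, in percent."""
    return float(np.mean(np.abs((target - fitted) / target)) * 100.0)


def pc_scores(features, k):
    """Scores on the first k principal components of the standardized inputs."""
    centered = features - features.mean(axis=0)
    scaled = centered / centered.std(axis=0)
    _, _, vt = np.linalg.svd(scaled, full_matrices=False)
    return scaled @ vt[:k].T


X_std = (X - X.mean(axis=0)) / X.std(axis=0)

fit_raw = ols_fitted(X, y)                    # all raw variables
fit_std = ols_fitted(X_std, y)                # all standardized variables
fit_all_pcs = ols_fitted(pc_scores(X, p), y)  # all 25 PCs

# Training error as a function of the number of PCs kept.
fits_by_k = [ols_fitted(pc_scores(X, k), y) for k in range(1, p + 1)]
sse_by_k = [float(np.sum((y - f) ** 2)) for f in fits_by_k]
mape_by_k = [mape(y, f) for f in fits_by_k]

print("same fit, raw vs standardized:", np.allclose(fit_raw, fit_std, atol=1e-6))
print("same fit, raw vs all PCs:     ", np.allclose(fit_raw, fit_all_pcs, atol=1e-6))
print("training MAPE, raw variables: ", mape(y, fit_raw))
print("training MAPE, all PCs:       ", mape_by_k[-1])
```

Because the models with k and k+1 PCs are nested, the training residual sum of squares in `sse_by_k` cannot go up as components are added. MAPE is not the quantity least squares minimizes, so it need not fall at every single step, but a training error that rises steadily with more PCs, or that changes after normalizing the inputs, suggests a bug somewhere in the pipeline. A lower error with fewer PCs is something one would expect to see, if at all, on held-out data.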
