
I have high-dimensional spectral data that I want to use to model plant productivity with a supervised model such as a random forest. I want to use the model both for inference and for prediction in a model comparison.

My initial plan was to use PLS to reduce the dimensionality and then, instead of performing a regression on the PLS components, fit a random forest on those components together with some covariates, using the same response variable as in the PLS.

But after giving it some thought, I went with PCA instead, since it is unsupervised and I was worried about data leakage from using the same response both in the PLS step and in the follow-up supervised model. Is this line of thinking correct? Should components resulting from PLS be used in other models at all?

ormr
  • Since you've chosen a random forest, it seems that you are interested in prediction rather than inference. The goal of the analysis makes a difference for whether it's okay to use "y-aware" PCA or not. See the first paragraph of this answer: Principle Component Analysis for pre-grouped variables since I didn't find a better reference. – dipetkov Nov 06 '22 at 22:40
  • Thank you for the helpful link. Actually I want to use the model for inference using SHAP values but also for prediction in a model comparison to determine the best of several preprocessing schemes. I updated the question to clarify. – ormr Nov 08 '22 at 14:47
  • @ormr interesting question, maybe you can use a probabilistic PLS, sample $M$ embeddings, fit $M$ random forests, generate $M$ shap vals and then aggregate? – John Madden Nov 08 '22 at 14:48
  • Inference usually means p-values and confidence intervals. And Shapley values are used to explain predictions (so that's not inference?) Terminology aside, you seem to have a prediction problem. So the challenge is to make sure the model doesn't overfit (which could happen if using the $y$s to make transformations but not necessarily). – dipetkov Nov 08 '22 at 17:57

0 Answers