I have some high-dimensional spectral data that I want to use to model plant productivity with a supervised model such as random forest. I want to use the model for inference as well as for prediction in a model comparison.
I initially wanted to use PLS to reduce the dimensionality and then, instead of performing the PLS regression itself, use the PLS components together with some covariates to fit a random forest model with the same response variable as in the PLS step.
But after giving it some thought, I went for PCA instead, as it is an unsupervised method and I was worried about data leakage when the same response is used both in PLS and in the follow-up supervised model. Is this line of thinking correct? Should the components resulting from PLS be used in other models at all?