I'm aware of the issues with using PCA for feature selection (https://blog.kxy.ai/5-reasons-you-should-never-use-pca-for-feature-selection/ and https://towardsdatascience.com/pca-is-not-feature-selection-3344fb764ae6). However, I need to consider alternative methods of feature selection rather than just looking at correlation, so for now I'm exploring PCA to see whether it might be useful.
My original dataset consists of thousands of proteins (rows = samples, columns = the different proteins), where the values are the proteins' concentrations. Values were centered and log-transformed prior to PCA.
Using PCA as an unsupervised way to "feature select", I selected the first 30 PCs, which account for ~80% of the explained variance. For each feature, I multiplied its loading on each of these PCs by that PC's proportion of variance explained, then took the sum of these weighted loadings across the 30 PCs. I then applied a cut-off and kept only those features whose final loading score is >= that cut-off (supposedly the most important in relation to the PCs).
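For concreteness, here is a rough sketch of that procedure in scikit-learn. The data matrix, the number of PCs, and the cut-off are placeholders (synthetic data stands in for my protein matrix), I treat the components scaled by the square roots of the explained variances as the "loadings", and I take absolute loadings before weighting and summing so that opposite signs don't cancel — those last two choices are assumptions of the sketch, not a fixed part of my question.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for my (samples x proteins) matrix, already
# log-transformed and centered.
rng = np.random.default_rng(0)
X_log = rng.normal(size=(100, 500))

n_pcs = 30
pca = PCA(n_components=n_pcs).fit(X_log)

# "Loadings": eigenvectors scaled by the square roots of the eigenvalues,
# shape (n_proteins, n_pcs).
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

# Weight each PC's loadings by that PC's proportion of variance explained,
# then sum across the retained PCs for each protein. Absolute values are
# taken so positive and negative loadings don't cancel.
weighted = np.abs(loadings) * pca.explained_variance_ratio_
feature_scores = weighted.sum(axis=1)

# Keep features whose score is >= some cut-off (placeholder: upper quartile).
cutoff = np.quantile(feature_scores, 0.75)
selected = np.where(feature_scores >= cutoff)[0]
print(selected.shape)  # indices of the retained proteins
```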
My confusion lies in what exactly a high loading means for the corresponding feature in the original dataset. For example, in this link (https://towardsdatascience.com/pca-is-not-feature-selection-3344fb764ae6), the author explains: "The only way PCA is a valid method of feature selection is if the most important variables are the ones that happen to have the most variation in them". In other words, does this mean that features with high loadings show the most variation in the original dataset?
Ultimately, I would like to select features that do not show large variation across samples in the original dataset, so should I select those with lower loadings?
In case I'm on the wrong track entirely: given the definition of loadings as "the covariances/correlations between the original features and the unit-scaled components", and given that features with low loadings are not informative/important for the PCA, what does that tell me about the variance, or otherwise, of the feature in the original dataset?
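To make sure I have that definition straight, here is a small numerical check on synthetic data (again just a stand-in for my protein matrix): with loadings defined as the eigenvectors scaled by the square roots of the eigenvalues, each loading equals the covariance between the original (centered) feature and the unit-variance component scores, and a feature's sum of squared loadings over all components recovers its variance in the original data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 10 features with different variances, centered as before PCA.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10)) * rng.uniform(0.5, 3.0, size=10)
X = X - X.mean(axis=0)

pca = PCA().fit(X)
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)  # (features, components)

# Unit-scaled component scores: each column rescaled to variance 1.
scores = pca.transform(X) / np.sqrt(pca.explained_variance_)

# 1) Each loading equals the covariance between an original feature and a
#    unit-scaled component.
cov_feature_component = (X.T @ scores) / (X.shape[0] - 1)
print(np.allclose(cov_feature_component, loadings))  # True

# 2) A feature's variance equals the sum of its squared loadings over ALL components.
print(np.allclose((loadings ** 2).sum(axis=1), X.var(axis=0, ddof=1)))  # True
```

If I restrict that second sum to only the 30 retained PCs, I get the part of the feature's variance captured by those PCs, which is partly why I'm unsure how to interpret a low score.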
I'm aware of similar questions on Stack/Cross-Validated, but none that clarify this point.
Using PCA for feature selection?
Using principal component analysis (PCA) for feature selection
Any nudges in the right direction would be appreciated.
"I have multiplied the loadings by the proportion of variance explained." Why would you need to do this? A squared loading already carries the information on the magnitude of variance explained. The variance explained by a component is the sum of its squared loadings. – ttnphns May 15 '23 at 06:41
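(Quick numerical check of the identity mentioned in the comment — the variance explained by a component equals the sum of its squared loadings — again a sketch on synthetic data:)

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10)) * rng.uniform(0.5, 3.0, size=10)

pca = PCA().fit(X)
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

# Sum of squared loadings within each component = that component's explained variance.
print(np.allclose((loadings ** 2).sum(axis=0), pca.explained_variance_))  # True
```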