Does it make sense to run LDA on several principal components and not on all variables?

Question

I am interested in building a linear discriminant function to discriminate between 2 groups, out of 60 variables. (I'm planning to select the most discriminative of the variables for a future diagnostic test.) I have calculated the area under the ROC curve for each of these variables individually and none has an AUC greater than 0.73. I have a fairly small sample of 50 healthy and 50 diseased individuals (these are the two groups).

I have tried to reduce the number of variables using principal component analysis (PCA). There are 3 components accounting for 83% of the variation. Unfortunately, all 60 variables have similar weightings (loadings) in the 3 components, so I can't pick just a few. I would ordinarily pick the highest-weighted variables and then incorporate them in a linear discriminant function, but 60 is too many, especially given the small sample.

I wondered if, rather than use the 60 variables, it is possible to use the 3 principal components themselves in a linear discriminant analysis (LDA)?

The PCA gives a coordinate system into which you can map your data. If you were to choose only one component, then you could describe each of your samples using only a single scalar value. If you used only 2 components, then you would map your 60 dimensional space to a 2-dimensional space.
In these revised spaces sometimes the information comes out more clearly. — EngrStudent, Jun 27 '13 at 18:04
See also this later question: Does it make sense to combine PCA and LDA? — amoeba, Jan 13 '15 at 15:46

score 5 · Answer 1 · edited Apr 13 '17 at 12:44

First of all, do you have an actual indication (external knowledge) that your data consists of a few variates that carry discriminatory information among noise-only variates? There is data that can be assumed to follow such a model (e.g. gene microarray data), while other types of data have the discriminatory information "spread out" over many variates (e.g. spectroscopic data). The choice of dimension reduction technique will depend on this.

I think you may want to take a look at chapter 3.4 (Shrinkage methods) of The Elements of Statistical Learning.

Principal Component Analysis and Partial Least Squares (a supervised regression analogue to PCA) are best fit for the latter type of data.

It is certainly possible to model in the new space spanned by the selected principal components. You just take the scores of those PCs as input for the LDA. This type of model is often referred to as PCA-LDA.
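To make "take the scores of those PCs as input for the LDA" concrete, here is a minimal sketch using only numpy. The synthetic data (50 + 50 cases, 60 correlated variables driven by 3 latent factors), the helper names `pca_fit` and `lda_fit`, and the equal-prior threshold are all assumptions for illustration, not the asker's data or any particular package's implementation.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return the training mean and the first n_components loading vectors."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions of the centred data.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components].T

def lda_fit(T, y):
    """Two-class Fisher LDA on scores T; returns weight vector and threshold."""
    T0, T1 = T[y == 0], T[y == 1]
    m0, m1 = T0.mean(axis=0), T1.mean(axis=0)
    # Pooled within-class covariance of the scores.
    Sw = ((T0 - m0).T @ (T0 - m0) + (T1 - m1).T @ (T1 - m1)) / (len(T) - 2)
    w = np.linalg.solve(Sw, m1 - m0)
    threshold = w @ (m0 + m1) / 2  # midpoint between class means (equal priors assumed)
    return w, threshold

# Synthetic stand-in for the data: 3 latent factors spread over 60 variables.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
latent = rng.normal(size=(100, 3))
latent[:, 0] += 1.5 * y  # class difference lives on one latent factor (assumption)
X = latent @ rng.normal(size=(3, 60)) + 0.5 * rng.normal(size=(100, 60))

mean, P = pca_fit(X, n_components=3)
scores = (X - mean) @ P          # PC scores: 100 cases x 3 components
w, thr = lda_fit(scores, y)      # LDA is fitted on the scores, not on the 60 variables
pred = (scores @ w > thr).astype(int)
print("training accuracy:", (pred == y).mean())
```

The printed figure is a training accuracy and therefore optimistic; it only shows that the two steps chain together.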

I wrote a bit of a comparison between PCA-LDA and PLS-LDA (doing LDA in the PLS scores space) in my answer to "Should PCA be performed before I do classification?". Briefly, I usually prefer PLS as "preprocessing" for the LDA as it is very well adapted to situations with large numbers of (correlated) variates and (unlike PCA) it already emphasizes directions that help to discriminate the groups. PLS-DA (without L) means "abusing" PLS regression by using dummy levels (e.g. 0 and 1, or -1 and +1) for the classes and then putting a threshold on the regression result. In my experience this is often inferior to PLS-LDA: PLS is a regression technique and as such will at some point desperately try to reduce the point clouds around the dummy levels to points (i.e. project all samples of one class to exactly 1 and all of the other to exactly 0), which leads to overfitting. LDA as a proper classification technique helps to avoid this, but it profits from the reduction of variates by the PLS.

As @January pointed out, you need to be careful with the validation of your model. However, this is easy if you keep 2 points in mind:

- Data-driven variable reduction (or selection) such as PCA, PLS, or any picking of variables with the help of measures derived from the data is part of the model. If you do a resampling validation (iterated $k$-fold cross-validation, out-of-bootstrap), which you should do given your restricted sample size, you need to redo this variable reduction for each of the surrogate models.
- The same applies to data-driven (hyper)parameter selection such as determining the number of PCs or latent variables for PLS: redo this for each of the surrogate models (e.g. in an inner resampling validation loop) or fix the hyperparameters in advance. The latter is possible with a bit of experience about the particular type of data, and particularly for the PCA-LDA and PLS-LDA models as they are not too sensitive to the exact number of variates. Fixing also has the advantage of avoiding data-driven optimization, which is rather difficult for classification models: you should use a so-called proper scoring rule for it, and you need rather large numbers of test cases.
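As a sketch of the two points above: putting the PCA inside a scikit-learn pipeline means it is refitted on the training part of every split, so each surrogate model gets its own projection. The synthetic data, the scaling step, the number of components fixed at 3 in advance, and the 5-fold × 20 repetitions are assumptions chosen for illustration. The Brier score serves here as one example of a proper scoring rule.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 50 + 50 cases, 60 correlated variables (assumption).
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 50)
latent = rng.normal(size=(100, 3))
latent[:, 0] += 1.5 * y
X = latent @ rng.normal(size=(3, 60)) + 0.5 * rng.normal(size=(100, 60))

# Scaling, PCA and LDA are one model; the number of PCs is fixed in advance.
model = make_pipeline(
    StandardScaler(),            # assumption: variables are on different scales
    PCA(n_components=3),
    LinearDiscriminantAnalysis(),
)

# Iterated (repeated) stratified k-fold: 5 folds x 20 repetitions.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=20, random_state=0)

# cross_val_score refits the whole pipeline, PCA included, for every split.
brier = -cross_val_score(model, X, y, cv=cv, scoring="neg_brier_score")
print(f"mean Brier score over {len(brier)} surrogate models: {brier.mean():.3f}")
```

If the number of components were instead tuned from the data, that tuning would have to move into an inner loop (for example a `GridSearchCV` wrapped around the pipeline) so that it is also repeated for each surrogate model.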

(I cannot recommend any solution in Stata, but I could give you an R package where I implemented these combined models).

update to answer @doctorate's comment:

Yes, in principle you can treat the PCA or PLS projection as dimensionality-reduction pre-processing and do this before any other kind of classification. IMHO one should give some thought to whether this approach is appropriate for the data at hand.

There's quite some literature about the combination of PLS with generalized linear models such as logistic regression, see e.g.
- Bastien, P.; Vinzi, V. E. & Tenenhaus, M.: PLS generalised linear regression, Computational Statistics & Data Analysis, 48, 17-46 (2005). DOI: 10.1016/j.csda.2004.02.005
- Fort, G. & Lambert-Lacroix, S.: Classification using partial least squares with penalized logistic regression, Bioinformatics, 21, 1104-1111 (2005). DOI: 10.1093/bioinformatics/bti114
- Boulesteix, A.-L. & Strimmer, K.: Partial least squares: a versatile tool for the analysis of high-dimensional genomic data, Brief Bioinform, 8, 32-44 (2007). DOI: 10.1093/bib/bbl016
- R packages plsRglm and plsgenomics have generalized linear models with PLS and PLS with logistic regression.

On the other hand, if you find yourself reducing the data by linear projection to a few latent variables and then applying a highly nonlinear model such as randomForest, you should be able to say why this is the way to go, as opposed to fitting a linear or maybe "slightly non-linear" model on the original data.

+1 Very comprehensive. I italicized your notion it already emphasizes directions..., with your permission... — ttnphns, Jun 27 '13 at 18:15
Thanks. The data involves components of pupil responses to different stimuli, for example the amplitude and latency of pupil constriction to blue, red, green and white light. As there are several measures of amplitude and latency (and other parameters) there is likely to be some redundancy. However, as the individual areas under the curve for each of the 60 variables are similar, I suspect that most of the variables carry some discriminatory information rather than noise. It sounds like PLS-LDA would be a good approach to try. I would be very grateful for an R package that helps with this. — Andrew, Jul 01 '13 at 15:06
@cbeleites, +1, very clear treatment of such data. One question: can one follow this approach with machine-learning classifiers other than LDA, after PLS processing and fixing the components? Is there any restriction on the type of classifier used after PLS? Can we do PLS-RandomForest, PLS-C5.0, PLS-Support Vector Machines, PLS-logistic regression? The aim is to compare these approaches, but I am asking whether there are known contra-indications or violations here. Do you know of such an implementation in R, or of references? — doctorate, Jan 09 '14 at 07:19

score 2 · Answer 2 · edited Jun 27 '13 at 10:10

One way of reducing the dimensionality of your samples might be the so-called "sparse PCA" (SPCA), but I don't know whether it is available for Stata. SPCA limits the number of variables with non-zero weight per component and thus allows you to select the variables much more tightly.

Alternatively, use the top N variables with the largest absolute loading and test how well your model performs then. But be warned: never use the same samples for test and selection procedure; otherwise your results will be worthless.

Another approach that I personally find very useful in such a setting is to use PLS-DA, which is both a dimension reduction technique and a supervised machine learning algorithm. However you have to mind the way you are validating your results (see paper by Westerhuis et al. and van Dorsten, Metabolomics 2008).

Other machine learning algorithms also are suitable for variable selection -- another one that I have experience with is random forests, where variables are given weights and can be selected for a refined model with a limited number of variables.
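A rough sketch of the random-forest route with scikit-learn's `RandomForestClassifier`, under assumptions: the data are synthetic with only the first five variables informative, keeping the 10 top-ranked variables is an arbitrary choice, and a single split stands in for a proper resampling scheme. In line with the warning above, the variables are ranked on the training cases only and the refined model is judged on cases that played no part in the selection.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 60 variables, only the first 5 carry signal (assumption).
rng = np.random.default_rng(3)
y = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 60))
X[:, :5] += 0.8 * y[:, None]

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Rank variables by impurity-based importance, using training cases only.
forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)
top = np.argsort(forest.feature_importances_)[::-1][:10]

# Refit a refined model on the selected variables and test on untouched cases.
refined = RandomForestClassifier(n_estimators=500, random_state=0)
refined.fit(X_tr[:, top], y_tr)
acc = refined.score(X_te[:, top], y_te)
print("selected variables:", sorted(top.tolist()))
print("test accuracy of refined model:", acc)
```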

PLS-DA = Partial Least Squares Discriminant Analysis. (I'll comment more on this and the possibility of combining PLS with linear discriminant analysis in a proper answer) — cbeleites unhappy with SX, Jun 27 '13 at 17:24
