I am trying to do a PCA to reduce the number of variables in my data before performing a cluster analysis. Suppose I extract 3 principal components P1, P2 and P3. On which variables should I then run the clustering? I am not clear whether I should use all the initial variables (but then how does PCA help?) or the 3 extracted components. A detailed answer with an example would be very helpful.
3 Answers
How many features are in your original data? If it is not too many (say, thousands), many clustering algorithms can work on your original data directly.
By using PCA you are losing information. If you do not want to lose too much, you can keep as many PCs as possible (assuming you can afford the computational effort and there is no curse-of-dimensionality problem).
If you want to check how much information you lose, see my answer to this post on how to measure how much information (variance) is preserved by PCA.
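As a quick illustration (the iris example and variable names here are my own), prcomp reports this directly:

```r
# Fit PCA on the four numeric iris columns, scaled to unit variance
pca_out <- prcomp(iris[, 1:4], scale. = TRUE)

# The "Proportion of Variance" row shows the information kept by each PC
summary(pca_out)

# The same thing computed by hand: cumulative fraction of total variance
cumsum(pca_out$sdev^2) / sum(pca_out$sdev^2)
```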
To your comment:
If you really want to use PCA, you can run the clustering algorithm on the transformed data. In R, with the toy iris data, the transformed data is pca_out$x:
pca_out <- prcomp(iris[, 1:3])
head(pca_out$x, 20)
PC1 PC2 PC3
[1,] -2.49088018 -0.320973364 -0.0339745251
[2,] -2.52334286 0.178400622 -0.2329011355
[3,] -2.71114888 0.137820058 -0.0025055723
[4,] -2.55775595 0.315675226 0.0670512306
[5,] -2.53896432 -0.331356903 0.0986154338
[6,] -2.13542015 -0.750523350 0.1367151904
[7,] -2.67669609 0.072944140 0.2311696738
[8,] -2.42912498 -0.162931683 0.0007979233
[9,] -2.70915877 0.572318127 0.0322430634
[10,] -2.44080592 0.123908243 -0.1318158483
[11,] -2.30049402 -0.641538592 -0.0654553841
[12,] -2.41545393 -0.015273540 0.1681603305
[13,] -2.56232620 0.242322950 -0.1666121092
[14,] -3.03215612 0.502494126 0.0604799584
[15,] -2.44677625 -1.179585963 -0.2360617554
[16,] -2.24724960 -1.353446638 0.1997840653
[17,] -2.50197109 -0.829777299 -0.0024222281
[18,] -2.49088018 -0.320973364 -0.0339745251
[19,] -2.00936932 -0.867984466 -0.1284528211
[20,] -2.42654485 -0.524077475 0.1997126274
Note that I am showing the first 20 data points after the transformation. You can use all 3 transformed features without information loss, or you can use only the first 2 columns; your data then becomes 2-dimensional but loses some information.
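Continuing the example, a minimal sketch of the next step (the seed, k, and nstart values are my own choices): run k-means on the score matrix pca_out$x.

```r
# Cluster on the first two PC score columns
scores <- pca_out$x[, 1:2]

set.seed(1)                                 # k-means uses random starts
km <- kmeans(scores, centers = 3, nstart = 25)

# Compare the found clusters with the known species labels
table(km$cluster, iris$Species)
```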
Thanks for the answer. I have around 42,000 observations and 25 variables. So I want to run a PCA on the variables. Let me reframe my question. After PCA, if I extract 'x' principal components, then how am I supposed to use the result in my clustering? Should I use the extracted components? Or if I want to use a subset of the original variables, then how do I choose that subset? – Srewashi Lahiri Sep 20 '16 at 13:49
By doing PCA you are retaining all the important information. If your data exhibit clustering, this will generally be revealed after your PCA analysis: by retaining only the components with the highest variance, the clusters will likely be more visible (as they are most spread out).
What you should do is look at the scatterplot in the plane defined by your three principal components: the data should be clearly grouped into separate clusters. Once you know the number of clusters, you can apply the K-means algorithm to classify your dataset.
Useful links: 1. http://www.cs.colostate.edu/~asa/pdfs/pcachap.pdf 2. http://ranger.uta.edu/~chqding/papers/KmeansPCA1.pdf
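For example, a sketch of that workflow using the iris data as a stand-in (variable names and tuning values are mine):

```r
# PCA first; scale. = TRUE since variables may be on different units
pca <- prcomp(iris[, 1:4], scale. = TRUE)

# Score plot in the plane of the first two PCs: look for separated groups
plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2")

# Once the number of clusters is clear from the plot, run K-means
set.seed(42)
km <- kmeans(pca$x[, 1:2], centers = 3, nstart = 20)
points(pca$x[, 1], pca$x[, 2], col = km$cluster, pch = 19)
```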
"by retaining only the components with the highest variance, the clusters will be clearly visible (as they are most spread out)." The 1st paragraph, and especially its categorical claim, is misleading. PCA retaining a few strong components does not guarantee finding clusters, because clusters might be well separated in dimensions where they (as the total cloud) are not "most spread out". – ttnphns Sep 20 '16 at 15:50
Seconding @ttnphns, it might be helpful to read this: Examples of PCA where PCs with low variance are “useful”. – gung - Reinstate Monica Sep 20 '16 at 22:17
Most of the time, PCA helps in revealing clustering:
"PCA constructs a set of uncorrelated directions that are ordered by their variance. In many cases, directions with the most variance are the most relevant to the clustering. Removing features with low variance acts as a filter that provides a more robust clustering." (link)
"High dimensional data are often transformed into lower dimensional data via the PCA where coherent patterns can be detected more clearly." (link)
– Roland Sep 21 '16 at 07:53
Angy, when addressing a specific commenter you should mention their name in the form @username, otherwise they won't be notified and your reply will be missed. As for the content of your comment: thanks for the links; you might want to expand your answer by adding and considering them in it. – ttnphns Sep 21 '16 at 08:24
"acts as a filter that provides a more robust clustering" This passage is true to an extent. It is, however, about the stability of clusters (as found from sample to sample) and not about the ability to detect them. – ttnphns Sep 21 '16 at 08:34
@ttnphns I apologize, I am new here :) What about the sentence from the other paper: "coherent patterns can be detected more clearly"? If directions with the most variance are the most relevant to the clustering, then clusters should likely be easier to identify. That's the message underlying it, I think. Anyway, I have edited my comment to relax the conclusions. I will add the links as well. – Roland Sep 21 '16 at 08:52
Thank you everyone. I wanted to know whether we use the PCs in clustering analysis and if yes, then how we use them. I figured out the answer that we don't use the PCs directly but make a transformation of the original variables based on the PCs.
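In R this transformation can be checked directly; a small sketch (using iris as a stand-in for the data, with P and K named as in the comments below):

```r
pca <- prcomp(iris[, 1:4])                 # prcomp centers the data by default
P <- pca$rotation                          # p x p matrix of PC eigenvectors
A <- scale(iris[, 1:4], center = TRUE, scale = FALSE)  # centered data matrix

K <- A %*% P                               # the transformed variables (PC scores)
all.equal(unname(K), unname(pca$x))        # identical to what prcomp returns in $x

# The first 3 columns of K are then the clustering inputs
kmeans(K[, 1:3], centers = 3)
```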
This is unclear and possibly wrong. What do you mean by "transformation of the original variables based on the PCs"? – amoeba Sep 20 '16 at 21:51
@amoeba If my original data set A is an n×p matrix and the related PCs P form a p×q matrix (q = 3 as per my initial question of 3 components, and p = the number of original variables), then K = A × P will form an n×3 matrix. I hope I can use these 3 transformed variables in clustering. Please correct me if I am wrong – Srewashi Lahiri Sep 20 '16 at 22:04
Yes, this is correct. The problem is that when you say "PCs" (as in this answer of yours), it is unclear if you refer to matrix P or to matrix K. Personally, when I say "PC" I usually refer to matrix K. If you want to be precise, you can call matrix P "PC eigenvectors" and matrix K "PC scores". To say that for clustering "we don't use PCs directly" sounds wrong; if you say "we don't use PC eigenvectors directly, but we use PC scores", then it's correct & clear. – amoeba Sep 20 '16 at 22:07
Perfect! Thanks a ton. The little confusion I was having regarding this terminology is clear now – Srewashi Lahiri Sep 20 '16 at 22:13
:-) Consider editing this answer of yours to make it clearer for future readers. – amoeba Sep 20 '16 at 22:19
This simply is not an answer. Consider deleting it and editing your question accordingly to reflect your expectations. – ttnphns Sep 21 '16 at 08:37
I think this is an answer, at least the final sentence, but it would benefit from editing as suggested. – Silverfish Sep 21 '16 at 08:57
PCA cluster analysis. – ttnphns Sep 20 '16 at 15:44