
I am about to cluster a dataset with about 15 columns, all of them numeric (some are categorical codes, others are measurements). Some columns also have missing values. Can you give me some pointers on how to go about it?

I previously explored clustering with Weka, but I am not sure how Weka implements it, so I am going the R route.

What I know: I already know about principal component analysis (PCA), at least in theory. But is it necessary whenever you cluster on multiple columns? It would go a long way if anyone could provide a link to a tutorial on this, because the Quick-R example covers just 2 variables.

A sample of my dataset is listed below:

1,64,9,30,33,2,3,1,6,1,5,-3.62,-3.71,-2.73,1
2,61,4,30,33,2,3,2,7,4,4,-3.62,-3.71,-2.00,1
3,49,4,18,21,2,3,2,8,17,18,-3.68,-3.88,-2.00,1
4,40,4,10,12,2,2,2,24,20,23,-3.32,-3.42,-2.00,1
5,43,9,10,12,2,2,1,2,1,29,-3.12,-3.19,-2.73,1
6,52,9,16,19,2,3,2,35,34,35,-3.33,-3.26,-2.95,1
7,46,4,15,18,2,3,2,8,40,42,-3.59,-3.50,-2.00,1
8,40,4,10,12,2,2,2,24,20,46,-2.45,-2.69,-2.00,1

2 Answers


The first thing you'll want to do is deal with your missing data. Some stats packages will "deal with it" for you, but they usually don't tell you when or how it's being done. A common approach is to replace missing values with the mean of that column (or the mode for categorical data), or to eliminate the data point altogether.
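As a rough illustration of that simple imputation, here is a minimal sketch. It is written in Python rather than R purely to show the logic, it assumes missing values are stored as `None`, and the function name `impute_column` and the toy columns are made up for this example.

```python
from statistics import mean, mode


def impute_column(values, categorical=False):
    """Fill None entries with the column mode (categorical) or mean (numeric)."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("column has no observed values to impute from")
    fill = mode(observed) if categorical else mean(observed)
    return [fill if v is None else v for v in values]


# Toy columns (hypothetical): one measurement and one categorical code.
measurement = [-3.62, None, -3.68, -3.32]
category = [2, 2, None, 1]

print(impute_column(measurement))
print(impute_column(category, categorical=True))
```

Keep in mind that filling in the mean shrinks the variance of that column, so it is worth checking how many values were imputed before trusting the clusters.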

Your intuition is right: PCA is not the only way to go, but it is probably the best approach here. If you happen to have a weighting scheme for your features (e.g., treating them all equally, or weighting them according to some model), you can compute a multivariate index for each item and run k-means clustering on it. The real advantage of this is that the computations are much simpler. If computation is not an issue, I'd opt for PCA instead, using the correlation matrix of your data rather than the covariance matrix, so that columns measured on different scales contribute comparably.
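The index-plus-k-means idea could look roughly like the sketch below. It is Python rather than R and only meant to show the steps: standardize each column, take a weighted sum per row, then run a plain one-dimensional k-means on the result. The function names, the equal weights, the choice of two clusters, and the starting centers are all assumptions for illustration; the three columns are taken from the sample rows in the question.

```python
from statistics import mean, pstdev


def zscores(column):
    """Standardize one column to mean 0 and standard deviation 1.

    Assumes the column is not constant (a zero standard deviation would
    cause a division by zero).
    """
    mu, sigma = mean(column), pstdev(column)
    return [(x - mu) / sigma for x in column]


def weighted_index(rows, weights):
    """Weighted sum of standardized features, one value per row."""
    columns = [zscores(list(col)) for col in zip(*rows)]
    return [sum(w * z for w, z in zip(weights, zs)) for zs in zip(*columns)]


def kmeans_1d(values, centers, iterations=20):
    """Plain Lloyd-style k-means on a single variable."""
    centers = list(centers)  # do not modify the caller's list
    labels = []
    for _ in range(iterations):
        # Assign each value to its nearest center.
        labels = [min(range(len(centers)), key=lambda k: abs(v - centers[k]))
                  for v in values]
        # Move each center to the mean of its members.
        for k in range(len(centers)):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = mean(members)
    return labels, centers


# Columns 2, 4 and 5 of the sample rows, weighted equally (an assumption).
rows = [(64, 30, 33), (61, 30, 33), (49, 18, 21), (40, 10, 12),
        (43, 10, 12), (52, 16, 19), (46, 15, 18), (40, 10, 12)]
index = weighted_index(rows, [1 / 3, 1 / 3, 1 / 3])
labels, centers = kmeans_1d(index, [min(index), max(index)])
print(labels)
```

Collapsing everything to one index throws away any structure that is not aligned with the chosen weights, which is one reason to prefer PCA when computation is not a concern.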

Here's a tutorial I found with a quick Google search, but full disclosure: I haven't done PCA in R.

R tutorial on PCA

Jeff
  • Some care is advised in doing PCA with the categorical variables! If they are purely categorical (and not ordered), perhaps they can be re-expressed as indicators. – whuber Aug 21 '11 at 20:57
  • @whuber Very good point. In my field (psychology), factor analysis is commonly done with Likert scales, so if you can arrange the categories as if they were ordinal, I think it should be OK. Otherwise, I'm not sure what the solution is. – Jeff Aug 21 '11 at 21:05
  • Thanks for the answer. What do you mean when you say I can arrange the categories as if they were ordinal? How can that be done? – persistence911 Aug 21 '11 at 21:52
  • @persistence I think what whuber is saying is that if you have multiple qualitative categories that are denoted by numbers, your correlations will be spurious, because the numeric codes imply an order and spacing that the categories do not have. An example would be assigning the numbers 1, 2, 3 to the categories 'blue', 'yellow', 'pink'. However, if your categories are things like 'small', 'medium', 'big', then they can be naturally represented ordinally using 1, 2, 3. – Jeff Aug 21 '11 at 22:01
  • Thanks, Jeff. What about missing values: should I replace them with row means? Or what would be the best way to handle missing values? – persistence911 Aug 22 '11 at 11:56
  • @persistence Missing data is a topic all of its own, but from the looks of your data I might try replacing missing values with the column mean. If a row is missing multiple values, I would consider getting rid of it altogether. There may be a better approach; see http://stats.stackexchange.com/q/1385/1977 and related questions. – Jeff Aug 22 '11 at 15:35
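To make the distinction discussed in these comments concrete, here is a small sketch (in Python rather than R, just to show the idea) of the two encodings: 0/1 indicator columns for unordered categories, and integer codes for ordered ones. The helper names and the `SIZE_ORDER` mapping are invented for this example.

```python
def indicator_columns(values):
    """Re-express a nominal variable as one 0/1 indicator column per level."""
    levels = sorted(set(values))
    return {level: [1 if v == level else 0 for v in values] for level in levels}


# Ordered categories: integer codes that respect the natural order.
SIZE_ORDER = {"small": 1, "medium": 2, "big": 3}


def ordinal_codes(values, order):
    """Map each ordered category to its integer rank."""
    return [order[v] for v in values]


colours = ["blue", "yellow", "pink", "blue"]
sizes = ["small", "big", "medium", "small"]

print(indicator_columns(colours))
print(ordinal_codes(sizes, SIZE_ORDER))
```

The indicator columns of one variable always sum to 1 in every row, so one of them is redundant; dropping one level is a common way to avoid that exact collinearity.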

That depends on your data:

  • Do the missing data occur randomly, or is "missing" related to some other variable? For example, in one of my problems I deal with student success. If a student is dismissed after the first semester, then data for the following semesters are missing. However, it is very unlikely that the average grades of these students in the following semesters, had they been allowed to continue, would have been the same as the average grades of the students who passed the first semester. Thus, the data are missing non-randomly and need extra careful handling to avoid bias.
  • How much collinearity is there in your data? In other words, what is the rank of the data matrix? PCA will tell you that, and it can also be used to reduce dimensionality if needed. For example, in a recent problem I had 170 variables, but only 24 significant eigenvalues. In such cases, it makes sense to do further analysis with the scores on these 24 components.
  • PCA is in my experience totally unsuitable for nominal data, but works with ordinal and binary, and of course shines with cardinal.
  • For cluster analysis, mixed data types can be handled using Gower's general similarity coefficient $S_g$ (DOI:10.2307/2528823). That works even with nominal data. Podani has published a modification of $S_g$ that makes better use of the information in ordinal data (DOI:10.2307/1224438). This coefficient also handles missing data (in effect by pairwise deletion, so that all available information is used). It has been shown that $S_g$ handles up to 25% missing data without too much bias (https://www.raco.cat/index.php/Questiio/article/viewFile/143135/194807). Imputation by regression from existing data has also been tried with success (https://www.researchgate.net/profile/Olivier_Gauthier2/publication/225554456_Missing_data_in_craniometrics_A_simulation_study/links/0c96052e276f03b78a000000.pdf).
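A minimal sketch of the basic Gower coefficient with pairwise deletion is shown below. It is in Python only to illustrate the computation; in R, `daisy(x, metric = "gower")` from the `cluster` package computes the corresponding dissimilarity. The sketch assumes equal variable weights, marks missing values with `None`, treats each variable as either numeric (`"num"`) or nominal (`"nom"`), and does not include Podani's ordinal modification. The function names and toy records are made up.

```python
def numeric_ranges(records, kinds):
    """Observed range of each numeric variable (None for nominal ones).

    Assumes every numeric variable has at least one observed value.
    """
    ranges = []
    for k, kind in enumerate(kinds):
        if kind == "nom":
            ranges.append(None)
            continue
        observed = [r[k] for r in records if r[k] is not None]
        ranges.append(max(observed) - min(observed))
    return ranges


def gower_similarity(a, b, kinds, ranges):
    """Gower similarity between two records; None marks a missing value."""
    total, used = 0.0, 0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if x is None or y is None:
            continue  # pairwise deletion: skip variables missing in either record
        if kind == "nom":
            total += 1.0 if x == y else 0.0
        elif rng > 0:
            total += 1.0 - abs(x - y) / rng
        else:
            total += 1.0  # constant variable: all observed values are identical
        used += 1
    if used == 0:
        raise ValueError("records share no observed variables")
    return total / used


# Toy records (hypothetical): a measurement, a nominal code, a measurement.
kinds = ["num", "nom", "num"]
records = [(0.0, "a", 1.0), (10.0, "a", None), (5.0, "b", 3.0)]
ranges = numeric_ranges(records, kinds)

for i in range(len(records)):
    for j in range(i + 1, len(records)):
        print(i, j, gower_similarity(records[i], records[j], kinds, ranges))
```

For clustering, $1 - S_g$ is commonly used as the dissimilarity, for example as input to hierarchical clustering.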