
I am working on clustering a medium-sized, high-dimensional data set (200k rows; 120 columns). Once I have produced (multiple) cluster solutions, I would like to profile the clusters and understand them.

Previously, I calculated descriptive statistics (mean, mode, median, standard deviation). I also tried parallel coordinates plots, but these don't help much with a large number of variables.

I was wondering if there are some other ways for profiling and understanding clusters.

  • I would recommend re-titling your question, because it is not about clustering but about visualizing group statistics for many variables; the groups could be anything: classes, clusters, etc. Also, searching the data-visualization Q&As on this site may turn up an old answer that suits you. – ttnphns Feb 20 '16 at 13:12

2 Answers


I understand that you're interested in visual approaches to cluster insight.

In running your descriptive stats, did you employ an index of each cluster's value relative to the total-sample value for that statistic? For the 120 features in your data, compute the statistic in total and by cluster to get a (k+1)x120 matrix, with k = number of clusters, then simply divide the cluster values by the grand mean (median, whatever) for each feature, multiply by 100 and round off the decimals. The resulting index reads like an IQ score: values of 80 or less, or 120 or more, flag features that are under- or over-represented in that cluster. It's really simple, but useful for quick-and-dirty insight.
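For what it's worth, here is a minimal sketch of that index in Python, assuming (my assumption, not from the post) that the data sit in a pandas DataFrame with a hypothetical `cluster` column holding the cluster assignments:

```python
import pandas as pd

def cluster_index(df, cluster_col="cluster", stat="mean"):
    """(k+1) x n_features matrix of per-cluster stats indexed to the grand statistic (= 100)."""
    features = df.drop(columns=[cluster_col])
    grand = features.agg(stat)                       # grand mean/median per feature
    by_cluster = df.groupby(cluster_col).agg(stat)   # k x n_features
    by_cluster.loc["total"] = grand                  # append the total row -> (k+1) x n_features
    return (by_cluster / grand * 100).round(0)       # index to 100, drop the decimals

# Features indexing at <= 80 or >= 120 are the ones worth a closer look per cluster.
```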

Once you have the indices, you can create a heat map of the features that highlights the deviations. Here's a link to a fairly clear introduction to heat mapping:

http://www.fusioncharts.com/dev/chart-guide/heat-map-chart/introduction.html
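As a rough follow-up to the index sketch above (again just an illustration, with seaborn/matplotlib as assumed tooling), the deviations can be heat-mapped directly, centering the colour scale at 100 so over- and under-indexed features stand out:

```python
import matplotlib.pyplot as plt
import seaborn as sns

def plot_index_heatmap(index_matrix):
    plt.figure(figsize=(16, 6))
    sns.heatmap(index_matrix, center=100, cmap="RdBu_r",
                cbar_kws={"label": "index vs. total (100 = average)"})
    plt.xlabel("feature")
    plt.ylabel("cluster")
    plt.tight_layout()
    plt.show()

# plot_index_heatmap(cluster_index(df))  # using the sketch above
```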

Joint-space maps would provide a visualization of the clusters relative to a canonical discriminant function of the features. The canonical variates summarize the features in a low-dimensional component space while also producing average values for the clusters. By locating each feature in this new coordinate space, a cluster-by-feature proximity matrix can be created, which is easy to visualize. Here's a link to a paper which discusses mapping approaches like this; the key point is that any dimension-reduction method can be leveraged:

http://web.mit.edu/hauser/www/Papers/Alternative_Perceptual_Mapping_Techniques.pdf
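One hedged way to sketch this in Python is to use linear discriminant analysis from scikit-learn as the canonical-variate step (any dimension-reduction method would do, as noted above). `X`, `labels`, and `feature_names` are placeholders for your feature matrix, cluster assignments, and column names, and a 2-D solution needs at least three clusters:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

y = np.asarray(labels)                                # cluster assignments as an array
lda = LinearDiscriminantAnalysis(n_components=2)
Z = lda.fit_transform(X, y)                           # observations in the canonical space

# Cluster centroids in the canonical space
centroids = np.array([Z[y == k].mean(axis=0) for k in np.unique(y)])
plt.scatter(centroids[:, 0], centroids[:, 1], c="red", marker="x")

# Feature weights on the first two canonical variates (scalings_);
# arrow lengths may need rescaling to be readable in your data.
for j, name in enumerate(feature_names):
    fx, fy = lda.scalings_[j, 0], lda.scalings_[j, 1]
    plt.arrow(0, 0, fx, fy, alpha=0.3)
    plt.annotate(name, (fx, fy), fontsize=6)
plt.show()
```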

Topologists have developed an approach to the analysis and visualization of complex data, described in the paper "Extracting insights from the shape of complex data using topology". Here's their Nature paper, as well as some R code that they've created:

http://www.nature.com/articles/srep01236#f1

http://arxiv.org/abs/1411.1830

Here's a link to an article that has a multitude of visuals for clusters:

http://shabal.in/visuals.html

Evaluating cluster quality can also provide useful insight. There are lots of approaches to this, but here's a link to a chapter that covers four external evaluation metrics: purity, normalized mutual information, the Rand index and the F-measure:

http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html#fig:clustfg3
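Note these are external measures, so they need a reference partition; if you have no ground-truth labels, one hedged use is to compare two of your alternative cluster solutions for agreement (`labels_a` and `labels_b` below are placeholders, and the F-measure is omitted for brevity). Purity is not in scikit-learn, but it falls straight out of the contingency matrix:

```python
from sklearn.metrics import normalized_mutual_info_score, rand_score
from sklearn.metrics.cluster import contingency_matrix

def purity(reference, predicted):
    # Fraction of points falling in the majority reference class of their cluster.
    cm = contingency_matrix(reference, predicted)
    return cm.max(axis=0).sum() / cm.sum()

print("purity:", purity(labels_a, labels_b))
print("NMI:   ", normalized_mutual_info_score(labels_a, labels_b))
print("Rand:  ", rand_score(labels_a, labels_b))   # adjusted_rand_score in older scikit-learn
```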

Hope these help.

user78229
  • +1. However, I admit that the OP was asking not about "visual approaches to [do] cluster[ing]" but simply to profile my clusters, that is, to plot relevant summary univariate statistics which would help to describe the differences between the clusters. – ttnphns Feb 20 '16 at 13:23
  • @DJohnson thanks ever so much for the detailed, very helpful answer. Just waiting for more answers to pour in, if they happen to. :) – info_seeker Feb 20 '16 at 17:23

Beware that with 120 dimensions, it is really hard to make clustering work well at all. Most likely, some dimensions dominate the result, and others were not taken into account at all.

Assuming you are using e.g. k-means or another distance-based clustering algorithm, you have 120-1=119 degrees of freedom just in the linear normalization of your data: roughly, every dimension gets a scalar weight $\omega_i$. If you choose a weight too large, that dimension is over-represented; if it is too small, it is under-represented.
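To make the point concrete, here is a small sketch (with `X` standing in for your 200k x 120 matrix and k = 8 chosen arbitrarily): run the same algorithm on raw and on standardized data and check which features actually separate the resulting centroids. On raw data, the columns with the largest scales typically win.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

for name, data in [("raw", X), ("standardized", StandardScaler().fit_transform(X))]:
    km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(data)
    spread = km.cluster_centers_.std(axis=0)   # per-feature spread of the centroids
    print(name, "-> most influential features:", np.argsort(spread)[::-1][:5])
```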

So you really need to study your clusters, because they might not take everything into account well.

Probably the standard approach to understanding the differences between clusters is decision trees. Use the cluster labels as classes, then train an interpretable classifier; the resulting tree can be used to explain what each cluster contains. It may be worth building multiple classifiers, for example one-vs-rest for each cluster. It is also worth trying a random-forest approach and analyzing all the resulting trees. One possible analysis: for every tree, count the features appearing in the root node, or in the first n levels. If a feature appears too often, it probably dominates your output too much; if a feature is hardly mentioned, either it is just garbage or it received too little influence.
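A minimal sketch of that workflow with scikit-learn, assuming `X`, `labels`, and `feature_names` hold your features, cluster assignments, and column names (all placeholders, and the depth/size settings are arbitrary):

```python
from collections import Counter
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

y = np.asarray(labels)

# One shallow one-vs-rest tree per cluster, printed as readable rules.
for k in np.unique(y):
    tree = DecisionTreeClassifier(max_depth=3).fit(X, y == k)
    print(f"--- cluster {k} ---")
    print(export_text(tree, feature_names=list(feature_names)))

# Random-forest variant: count which features get picked as the root split.
forest = RandomForestClassifier(n_estimators=200, max_depth=3).fit(X, y)
root_features = Counter(feature_names[t.tree_.feature[0]] for t in forest.estimators_)
print(root_features.most_common(10))
```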

  • In my opinion, this could be a nice answer if you made it more detailed or provided corresponding links. Both parts (why 120 dimensions risk domination; how to get the best out of the decision trees) deserve to be expanded. (If you agree.) – ttnphns Feb 20 '16 at 13:30
  • That requires more effort, and it is not clear whether the asker or other viewers will really appreciate it. I think the answer here is redundant to other questions (probably the question even has close duplicates), but I'm using the Android app, and searching/linking is a pain in the app. – Has QUIT--Anony-Mousse Feb 20 '16 at 13:37
  • @Anony-Mousse actually, I would be very thankful if you could expand your answer. I was indeed applying decision trees to interpret the clusters, but I noticed that only a few variables were being used by the DT to explain the outcome/cluster variable. From this I assumed that I wouldn't get much help from the DT, and then asked this question to find ways of (meticulously) describing/profiling each cluster. – info_seeker Feb 20 '16 at 15:35
  • @Anony-Mousse The stakeholders of this activity have asked to include 120 variables. They expect this to run fine, and have been doing something similar in SAS. Initially, we had 200 variables (!), but I convinced them to reduce this number. – info_seeker Feb 20 '16 at 15:38
  • As mentioned, I'd expect a careful analysis to reveal that not all variables played a role (sometimes rightfully, sometimes because of poor weights). It will appear to "work" even when variables go unnoticed (it's not as if this causes the app to crash!), and that is the big risk. What good are 200 variables when the result is only due to the 2 of them with the largest variance? – Has QUIT--Anony-Mousse Feb 20 '16 at 16:01
  • @Anony-Mousse sorry, just a quick question regarding your answer: doesn't standardising, normalising, or whitening solve the problem of a few variables exhibiting high variance and thus getting a greater weight? – info_seeker Feb 21 '16 at 20:54
  • Normalization/standardization is global. What if different parts (let's call them clusters) are different, because your data is not uniform? Normalization is a heuristic that lessens this effect, but in real data some variables should have more weight (at least for some parts of your data) and some should have less (e.g. because they are noise). This cannot be automated. – Has QUIT--Anony-Mousse Feb 21 '16 at 23:10
  • In particular, if you compare your results on not normalized, normalized, and standardized data, I'd expect you to get quite different results. I'd expect the latter to take more variables into account. – Has QUIT--Anony-Mousse Feb 21 '16 at 23:13
  • @Anony-Mousse that is actually true. I did try the raw, normalised, and standardised data, and I did get different results. Only with the standardised data was I able to see some shapes/clusters after projecting the data to a low-dimensional space. The clustering also took more variables into account with the latter. Thanks ever so much for your comments. – info_seeker Feb 22 '16 at 07:29
  • It may be a good idea to randomly choose features, standardize, cluster, and analyze, to prevent certain strong features (such as binary variables) from always masking other aspects. Also consider splitting the features by type (e.g. binary features), checking for skewed distributions (e.g. monetary values tend to be skewed), etc., and consider multiple clusterings to be correct even if they appear to contradict each other. – Has QUIT--Anony-Mousse Feb 22 '16 at 07:47
  • @Anony-Mousse agreed, although about considering multiple solutions: it may be hard to justify/explain such to the business user (this is a market segmentation task, which I didn't point out for sake of brevity). They're looking for easily understandable clusters. Thanks for the valuable contribution. – info_seeker Feb 22 '16 at 13:43
  • Well, none of them is more correct than the others... that is the reality. What good is having a single solution if it is not reliable? If the segments are only as good as random? That happens a lot when business users use clustering their way. – Has QUIT--Anony-Mousse Feb 22 '16 at 20:22