Finding patterns in data

Question

I am probably looking for a definition.

Imagine we have 10 variables, but we are not interested in a linear relation (nor a quadratic one, nor any other curve). What I would like is a way to find "clusters", patterns, or combinations (whatever you want to call them). For instance, given 10 variables, say two (or more) people have extremely similar scores, though those scores are not necessarily high or low. I'm under the impression that this information is lost to us, while in fact it could be an interesting finding.

Is there a name for trying to distinguish such interesting data patterns?

Any suggestions are welcome (also for the title).

score 3 · Accepted Answer · answered Mar 19 '13 at 17:27

You definitely want to do some kind of clustering, but there are so many algorithms nowadays that it's hard to suggest one without knowing more about the data (the types of variables and the number of records, for example). Can you give some more information, such as the data structure or the kind of patterns you are looking for? An example of how scores can be similar without being high or low would help; do you mean similar variance?

I don't think PCA is a good choice, as it only finds linear relationships (which you specifically mentioned you aren't looking for), and it doesn't deal well with multicollinearity if it is present. It seems like the question asker is looking for a more robust method than using the eigenvalues of a correlation/covariance matrix.
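To make the question's example concrete (two people whose scores are very similar but neither high nor low), here is a minimal Python sketch that ranks every pair of people by the Euclidean distance between their score profiles. The data, the names, and the choice of Euclidean distance are assumptions made for illustration; variables measured on different scales would need to be standardized first.

```python
import math
from itertools import combinations

# Hypothetical data: 4 people scored on 10 variables (values invented for illustration).
scores = {
    "person1": [1, 9, 2, 8, 3, 7, 4, 6, 5, 5],
    "person2": [9, 1, 8, 2, 7, 3, 6, 4, 5, 5],
    "person3": [5, 5, 6, 4, 5, 6, 5, 4, 5, 6],
    "person4": [5, 5, 6, 4, 5, 5, 5, 4, 5, 6],
}

def euclidean(a, b):
    """Straight-line distance between two score profiles of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Distance for every pair of people, sorted so the most similar pair comes first.
pairs = sorted(
    (euclidean(scores[p], scores[q]), p, q)
    for p, q in combinations(scores, 2)
)
closest = pairs[0]
print(closest)
```

Most clustering algorithms start from a pairwise distance (or similarity) table like this one and differ mainly in how they group the rows afterwards.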

TLJ


I don't have a specific example, I would just like to know what is out there. – PascalVKooten Mar 19 '13 at 18:08
Are you interested in categorical (nominal or ordinal) or continuous variables? If ordinal or continuous, you can use a Spearman correlation matrix as a basis for many clustering algorithms. Also, methods such as random forests might be possible. It's hard to say without knowing what you mean by 'patterns': every single clustering algorithm will find patterns, and almost none of them are based on binning into 'high' or 'low' extremes. I would suggest picking up a book or doing a Google search for 'clustering algorithms'. – TLJ Mar 19 '13 at 18:14
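A minimal Python sketch of the Spearman suggestion above: compute rank correlations between variables and turn them into a dissimilarity table that a clustering algorithm could consume. The variable names and values are hypothetical, and using 1 - rho as the dissimilarity is one common choice, not the only one. A constant variable would cause a division by zero here.

```python
import math

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical variables measured on the same 5 people.
variables = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 5, 9, 20],   # increases with "a", but not linearly
    "c": [5, 4, 3, 2, 1],    # decreases as "a" increases
}

# Dissimilarity 1 - rho: 0 for identical orderings, 2 for reversed ones.
dissimilarity = {
    u: {v: 1 - spearman(variables[u], variables[v]) for v in variables}
    for u in variables
}
print(dissimilarity["a"])
```

Because it works on ranks, the measure picks up any monotone relation, which fits the question's wish to look beyond linear relationships.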
Could you give me a starting point on random forests? – PascalVKooten Mar 19 '13 at 21:03
Random forests require a classifier output (dependent variable), so after rereading your question, they are probably not appropriate for the given situation. If you want to look for which variables differ between groups, then I would use random forests. Your situation could probably be best modeled by classification trees. You could then see which binary decisions separate them. If you use R, look into the rpart package. – TLJ Mar 19 '13 at 21:23
I use R and I will check it out. Thank you for elaborating. – PascalVKooten Mar 19 '13 at 21:42
No problem. Feel free to give me a +1 if you are so inclined; I'm trying to get to the commenting threshold. – TLJ Mar 19 '13 at 21:43
Knowing now that you use R, see this brief intro to better see what is out there. – learner Mar 20 '13 at 00:30

score 0 · Answer 2 · answered Mar 24 '18 at 14:32

I have been comparing kcluster, PCA, and t-SNE for a while now. I would suggest t-SNE: it's a great algorithm that beats PCA on at least one research benchmark on a real dataset (a Twitter feed), and likewise on the MNIST dataset.

Summary of process: First, t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects have a high probability of being picked, whilst dissimilar points have an extremely small probability of being picked. Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the Kullback–Leibler divergence between the two distributions with respect to the locations of the points in the map. Note that whilst the original algorithm uses the Euclidean distance between objects as the base of its similarity metric, this should be changed as appropriate.
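The two distributions described above can be sketched in a few lines of Python. This is only an illustration of the objective, under loud assumptions: a single fixed Gaussian bandwidth (sigma) is used for every point, whereas t-SNE tunes one per point from a perplexity setting, and no optimization is performed. The sketch merely compares the Kullback-Leibler divergence of two hand-made candidate maps. All data and function names are hypothetical.

```python
import math

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def high_dim_affinities(points, sigma=1.0):
    """Symmetrized Gaussian similarities p_ij over pairs of original points."""
    n = len(points)
    cond = []
    for i in range(n):
        weights = [
            0.0 if j == i
            else math.exp(-sq_dist(points[i], points[j]) / (2 * sigma ** 2))
            for j in range(n)
        ]
        total = sum(weights)
        cond.append([w / total for w in weights])  # conditional p_{j|i}
    return [[(cond[i][j] + cond[j][i]) / (2 * n) for j in range(n)]
            for i in range(n)]

def low_dim_affinities(embedding):
    """Student-t (one degree of freedom) similarities q_ij over map points."""
    n = len(embedding)
    weights = [
        [0.0 if i == j else 1.0 / (1.0 + sq_dist(embedding[i], embedding[j]))
         for j in range(n)]
        for i in range(n)
    ]
    total = sum(map(sum, weights))
    return [[w / total for w in row] for row in weights]

def kl_divergence(p, q):
    """KL(P || Q), skipping the zero diagonal entries."""
    return sum(
        pij * math.log(pij / qij)
        for prow, qrow in zip(p, q)
        for pij, qij in zip(prow, qrow)
        if pij > 0
    )

# Hypothetical data: two tight pairs of points in 3 dimensions.
points = [[0, 0, 0], [0.1, 0, 0], [3, 3, 3], [3.1, 3, 3]]
P = high_dim_affinities(points)

good_map = [[0, 0], [0.1, 0], [5, 5], [5.1, 5]]  # keeps neighbours together
bad_map = [[0, 0], [5, 5], [0.1, 0], [5.1, 5]]   # splits the neighbours

kl_good = kl_divergence(P, low_dim_affinities(good_map))
kl_bad = kl_divergence(P, low_dim_affinities(bad_map))
print(kl_good, kl_bad)
```

The map that keeps similar points close yields the lower divergence, which is the quantity t-SNE reduces by moving the map points.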

Comparison to kcluster: Unlike kcluster, t-SNE does not require you to set the number of clusters as a hyperparameter; groupings emerge in the low-dimensional map on their own (t-SNE itself does not assign cluster labels). The only downside is that it is slow, as it computes similarities for all pairs of points.

Preview/References:
Here's a web demo: https://cs.stanford.edu/people/karpathy/tsnejs/
Here's the blog version: http://karpathy.github.io/2014/07/02/visualizing-top-tweeps-with-t-sne-in-Javascript/

score 0 · Answer 3 · answered Mar 19 '13 at 16:25

It sounds like you are trying to cluster your data. Principal Component Analysis is an easy way to look for clusters within your data, regardless of whether the scores are relatively high or low (there are many R packages for this). k-means clustering is an established algorithm as well.

You might look at the general Wikipedia article.
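For reference, a minimal Python sketch of k-means (Lloyd's algorithm). The data are hypothetical, and the starting centroids are simply the first k points so that the result is reproducible; that initialization is an assumption of this sketch, and real implementations usually use random or smarter starting points.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=100):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    centroids = [list(p) for p in points[:k]]  # assumption: first k points
    labels = [None] * len(points)
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical scores of 6 people on 2 variables.
points = [[1, 1], [9, 9], [1, 2], [8, 9], [2, 1], [9, 8]]
labels, centroids = kmeans(points, k=2)
print(labels)
```

Unlike a distance table, k-means needs the number of clusters up front, so it is usually run for several values of k.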

learner


I am aware of PCA, but having used it only in SPSS, I find it difficult to "visualize" how exactly this shows a pattern (when I used it, I usually got a few factors with eigenvalue > 1). How can we translate those variance components back into being able to identify "person 3 and person 4 look really similar"? – PascalVKooten Mar 19 '13 at 16:31
@Dualinity You can do that by plotting individuals on the factorial plane of the first 2 factors, for example, as shown in the link learner is pointing to, or this one on PCA – Antoine Vernet Mar 19 '13 at 16:53
@AntoineVernet This is often a good way but it depends on the first two components explaining a substantial share of the total variation. – Erik Mar 19 '13 at 16:55
@Erik, yes, of course, I should have been more precise and said that. This makes sense only if a large proportion of the total variation is explained by the first two components. What constitutes "large" is not obvious, though. – Antoine Vernet Mar 19 '13 at 16:58
Hmmm... I've never used SPSS, so I've no idea. However, what you want is a biplot in which samples are plotted in the principal component space (usually PCs 1 and 2). Perhaps you could export PC1 and 2 values for each sample and then plot those in a standard 2D plotting function? – learner Mar 19 '13 at 16:58
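The comments above suggest exporting each person's PC1 and PC2 values and checking how much variation those two components explain. A minimal Python sketch of that idea follows, using power iteration with deflation on the covariance matrix. The data and all function names are hypothetical, and a numerical library would normally do this in one call; the sketch also assumes the leading eigenvalues are well separated, which power iteration needs in order to converge.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def center(rows):
    """Subtract each variable's mean."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(centered):
    n, d = len(centered), len(centered[0])
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
            for a in range(d)]

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def leading_eigen(m, iterations=500):
    """Power iteration: largest eigenvalue of m and its unit eigenvector."""
    v = normalize([1.0] * len(m))
    for _ in range(iterations):
        v = normalize([dot(row, v) for row in m])
    return dot(v, [dot(row, v) for row in m]), v

def first_two_components(rows):
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    l1, v1 = leading_eigen(cov)
    # Deflation: remove the first component, then find the next one.
    deflated = [[cov[a][b] - l1 * v1[a] * v1[b] for b in range(d)]
                for a in range(d)]
    l2, v2 = leading_eigen(deflated)
    scores = [(dot(r, v1), dot(r, v2)) for r in centered]
    explained = (l1 + l2) / sum(cov[i][i] for i in range(d))
    return scores, explained

# Hypothetical scores of 5 people on 3 variables.
rows = [
    [2.0, 1.9, 0.5],
    [4.0, 4.2, 0.4],
    [6.0, 5.8, 1.5],
    [8.0, 8.1, 1.4],
    [5.0, 5.0, 3.0],
]
scores, explained = first_two_components(rows)
for person, (pc1, pc2) in enumerate(scores, start=1):
    print(person, round(pc1, 3), round(pc2, 3))
print("share of variation explained:", round(explained, 4))
```

People who land close together in the (PC1, PC2) plot have similar profiles, but only to the extent that the explained share is high, which is the caveat raised in the comments.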
