Questions tagged [clustering]

Cluster analysis is the task of partitioning data into subsets of objects according to their mutual "similarity," without using preexisting knowledge such as class labels. [Clustered-standard-errors and/or cluster-samples should be tagged as such; do NOT use the "clustering" tag for them.]

Tag Usage

Clustered-standard-errors and/or cluster-samples should be tagged as such; do not use the "clustering" tag for them. Both these methodologies take clusters as given, rather than discovered.

Overview

Clustering, or cluster analysis, is a statistical technique for uncovering groups of units in multivariate data. It differs from classification (clustering could be called "classification without a teacher") in that no units have known labels, and even the number of clusters is usually unknown and must be estimated. Clustering is a key challenge of data mining, in particular when done in large databases.

Although there are many clustering techniques, they fall into several broad classes: hierarchical clustering (in which a hierarchy is built, from each unit forming its own cluster up to the whole sample forming one single cluster), centroid-based clustering (in which each unit is assigned to the cluster whose centroid is nearest), distribution- or model-based clustering (in which clusters are assumed to follow a specific distribution, such as a multivariate Gaussian), and density-based clustering (in which clusters are obtained as the areas of highest estimated density).
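
As an illustration of the centroid-based class only, here is a minimal sketch of Lloyd-style k-means in plain Python. The function names (`nearest_centroid`, `k_means`), the toy data, and the choice of starting centroids are all assumptions made for this example; they are not taken from any particular library, and real implementations add better initialization and convergence handling.

```python
def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def k_means(points, init_centroids, n_iter=100):
    """Alternate assignment and centroid-update steps until the labels stop changing."""
    centroids = [tuple(c) for c in init_centroids]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each unit joins the cluster of its nearest centroid.
        new_labels = [nearest_centroid(p, centroids) for p in points]
        if new_labels == labels:
            break  # no unit changed cluster, so the procedure has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its current members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # leave a centroid in place if its cluster is empty
                centroids[j] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return labels, centroids


# Hypothetical toy data: two well-separated groups in the plane.
points = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
# Assumption: start from one point of each group to keep the run deterministic.
labels, centroids = k_means(points, init_centroids=[points[0], points[3]])
print(labels)
print(centroids)
```

Note that the number of clusters is fixed here by the number of starting centroids; as the overview says, in practice it is usually unknown and has to be estimated separately.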

References

Consult the following questions for resources on clustering:

4021 questions
36 votes, 5 answers

Clustering methods that do not require pre-specifying the number of clusters

Are there any "non-parametric" clustering methods for which we don't need to specify the number of clusters? And other parameters like the number of points per cluster, etc.
Learn_and_Share
35 votes, 3 answers

What stop-criteria for agglomerative hierarchical clustering are used in practice?

I have found extensive literature proposing all sorts of criteria (e.g. Glenn et al. 1985 and Jung et al. 2002). However, most of these are not that easy to implement (at least from my perspective). I am using scipy.cluster.hierarchy to…
26 votes, 8 answers

Clustering quality measure

I have a clustering algorithm (not k-means) with input parameter $k$ (number of clusters). After performing clustering I'd like to get some quantitative measure of quality of this clustering. The clustering algorithm has one important property. For…
Max
24 votes, 1 answer

How to calculate purity?

In cluster analysis, how do we calculate purity? What's the equation? I'm not looking for code to do it for me. Let $\omega_k$ be cluster k, and $c_j$ be class j. So is purity practically accuracy? It looks like we're summing the amount of truly…
Iancovici
23 votes, 5 answers

Clustering 1D data

I have a dataset, I want to create clusters on that data based on only one variable (there are no missing values). I want to create 3 clusters based on that one variable. Which clustering algorithm to use, k-means, EM, DBSCAN etc.? My main question…
Ali
19 votes, 10 answers

Rand index calculation

I'm trying to figure out how to calculate the Rand Index of a clustering algorithm, but I'm stuck on how to calculate the true and false negatives. At the moment I'm using the example from the book An Introduction into Information Retrieval…
Pakspul
14 votes, 1 answer

How should I interpret GAP statistic?

I used the GAP statistic to estimate k clusters in R. However, I'm not sure I am interpreting it correctly. From the plot above I assume that I should use 3 clusters. From the second plot I should choose 6 clusters. Is this the correct interpretation of GAP…
14 votes, 5 answers

What is the difference between graph clustering and community detection methods?

Basically, the goal of both graph clustering and community detection methods is to compute clusters. Is there any difference between them?
13 votes, 1 answer

What does total ss and between ss mean in k-means clustering?

I'm very new to cluster analysis. I'm using R for k-means clustering and I wonder what those things are. And is it better if their ratio is smaller or larger?
kanbhold
13 votes, 3 answers

How to cluster longitudinal variables?

I have a bunch of variables which contain longitudinal data from day 0 to day 7. I am looking for an appropriate clustering approach which can cluster these longitudinal variables (not cases) into different groups. I tried to analyze this data set…
cchien
12 votes, 4 answers

Predicting cluster of a new object with kmeans in R

I used my training dataset to fit clusters using the kmeans function: fit <- kmeans(ca.data, 2). How can I use the fit object to predict cluster membership in a new dataset? Thanks
user333
9 votes, 1 answer

Can someone explain the C-Index in the context of hierarchical clustering?

This is a followup to this question. I am currently trying to implement the C-Index in order to find a near-optimal number of clusters from a hierarchy of clusters. I do this by calculating the C-Index for every step of the (agglomerative)…
9 votes, 4 answers

How to tell quantitatively whether 1D data is clustered around 1 or 3 values?

I've got some data on the time between heart beats of a human. One indication of ectopic (extra) beats is that these intervals are clustered around three values instead of one. How can I obtain a quantitative measure of this? I'm looking to compare…
9 votes, 1 answer

How do I algorithmically determine values of T1 & T2 for canopy clustering?

I am trying to use canopy clustering to provide initial clusters for KMeans in Mahout. Is there a way to determine or approximate the values of the distance thresholds T1 & T2 algorithmically? Right now I have T1 = 100 and T2 = 1 which doesn't seem…
8 votes, 1 answer

Analyze a football match: similar players with DBSCAN and similar trajectories with TRACLUS

I'm trying to analyze a dataset that originates from sensors located near players' shoes in a match (http://www.orgs.ttu.edu/debs2013/index.php?goto=cfchallengedetails). I decided to look at clustering to identify: Similar trajectories of players…
denadai2