Questions tagged [clustering]

Cluster analysis is the task of partitioning data into subsets of objects according to their mutual "similarity," without using preexisting knowledge such as class labels. [Clustered-standard-errors and/or cluster-samples should be tagged as such; do NOT use the "clustering" tag for them.]

Tag Usage

Clustered-standard-errors and/or cluster-samples should be tagged as such; do not use the "clustering" tag for them. Both these methodologies take clusters as given, rather than discovered.

Overview

Clustering, or cluster analysis, is a statistical technique for uncovering groups of units in multivariate data. It differs from classification (clustering could be called "classification without a teacher") in that no units have known labels, and even the number of clusters is usually unknown and must be estimated. Clustering is a key challenge in data mining, particularly when applied to large databases.

Although there are many clustering techniques, they fall into several broad classes: hierarchical clustering (in which a hierarchy is built, from each unit forming its own cluster up to the whole sample forming one single cluster), centroid-based clustering (in which each unit is assigned to the cluster with the nearest centroid), distribution- or model-based clustering (in which clusters are assumed to follow a specific distribution, such as a multivariate Gaussian), and density-based clustering (in which clusters are obtained as the regions of highest estimated density).
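
As an illustration of the centroid-based class, the following is a minimal sketch of Lloyd's algorithm (the usual k-means iteration) on one-dimensional data. The function name kmeans_1d, its parameters, and the toy data are assumptions made for this example; it is not a reference implementation, and real analyses would normally use an established library.

```python
import random


def kmeans_1d(values, k, n_iter=100, seed=0):
    """Lloyd's algorithm on 1-D data; returns (centroids, labels)."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and the number of values")
    rng = random.Random(seed)
    # Initialise the centroids from k randomly chosen data points.
    centroids = rng.sample(values, k)
    labels = [0] * len(values)
    for _ in range(n_iter):
        # Assignment step: each unit joins the cluster with the nearest centroid.
        labels = [min(range(k), key=lambda j: abs(v - centroids[j])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            # An empty cluster keeps its previous centroid.
            new_centroids.append(sum(members) / len(members) if members else centroids[j])
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, labels


# Toy data (hypothetical): two well-separated groups.
data = [1.0, 1.1, 0.9, 10.0, 10.1, 9.9]
centroids, labels = kmeans_1d(data, k=2)
print(sorted(centroids), labels)
```

Note that k must still be supplied by the user here; the result also depends on the random initialisation, which is why the sketch fixes a seed.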

References

Consult the following questions for resources on clustering:

4021 questions

5 answers

Clustering methods that do not require pre-specifying the number of clusters

Are there any "non-parametric" clustering methods for which we don't need to specify the number of clusters? And other parameters like the number of points per cluster, etc.

clustering

asked Oct 20 '16 at 13:51

Learn_and_Share

3 answers

What stop-criteria for agglomerative hierarchical clustering are used in practice?

I have found extensive literature proposing all sorts of criteria (e.g. Glenn et al. 1985 (pdf) and Jung et al. 2002 (pdf)). However, most of these are not that easy to implement (at least from my perspective). I am using scipy.cluster.hierarchy to…

clustering

asked Sep 12 '10 at 19:49

Björn Pollex

8 answers

Clustering quality measure

I have a clustering algorithm (not k-means) with input parameter $k$ (number of clusters). After performing clustering I'd like to get some quantitative measure of quality of this clustering. The clustering algorithm has one important property. For…

clustering

asked Jan 14 '11 at 14:06

Max

1 answer

How to calculate purity?

In cluster analysis, how do we calculate purity? What's the equation? I'm not looking for code to do it for me. Let $\omega_k$ be cluster k, and $c_j$ be class j. So is purity practically accuracy? It looks like we're summing the amount of truly…

clustering

asked Apr 29 '14 at 23:05

Iancovici
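
For the purity question above, a commonly used definition (stated here as background, not taken from the truncated excerpt) is: with $N$ objects, clusters $\omega_k$ and classes $c_j$, purity is $\frac{1}{N}\sum_k \max_j |\omega_k \cap c_j|$. In words, each cluster is credited with the size of its largest class, so purity behaves like accuracy after every cluster has been mapped to its majority class. The sketch below only illustrates that formula; the function name purity and the toy labels are assumptions.

```python
from collections import Counter, defaultdict


def purity(cluster_labels, class_labels):
    """Purity = (1/N) * sum over clusters of the largest class count in that cluster."""
    if len(cluster_labels) != len(class_labels) or not cluster_labels:
        raise ValueError("need two non-empty label sequences of equal length")
    # For every cluster, count how many members fall into each class.
    counts_by_cluster = defaultdict(Counter)
    for cluster, cls in zip(cluster_labels, class_labels):
        counts_by_cluster[cluster][cls] += 1
    # Credit each cluster with the size of its majority class.
    majority_total = sum(max(c.values()) for c in counts_by_cluster.values())
    return majority_total / len(cluster_labels)


# Toy example (hypothetical labels): two clusters, two classes.
clusters = [0, 0, 0, 1, 1, 1]
classes = ["a", "a", "b", "b", "b", "a"]
print(purity(clusters, classes))  # majority counts are 2 and 2, so 4 of 6 objects
```

Purity reaches 1 trivially when every object is its own cluster, so it is usually read alongside the number of clusters.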

5 answers

Clustering 1D data

I have a dataset, I want to create clusters on that data based on only one variable (there are no missing values). I want to create 3 clusters based on that one variable. Which clustering algorithm to use, k-means, EM, DBSCAN etc.? My main question…

clustering

asked Aug 03 '11 at 02:36

Ali

10 answers

Rand index calculation

I'm trying to figure out how to calculate the Rand Index of a clustering algorithm, but I'm stuck on how to calculate the true and false negatives. At the moment I'm using the example from the book An Introduction into Information Retrieval…

clustering

asked Mar 06 '14 at 14:04

Pakspul
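
For the Rand index question above, one common way to obtain all four counts is to enumerate every pair of objects and compare the two labelings: a pair is a true positive if it shares both cluster and class, a false positive if it shares only the cluster, a false negative if it shares only the class, and a true negative if it shares neither. The Rand index is then (TP + TN) divided by the number of pairs. The sketch below assumes that definition; the function names and toy labels are hypothetical, and the quadratic loop is meant for small examples only.

```python
from itertools import combinations


def pair_counts(cluster_labels, class_labels):
    """Return (tp, fp, fn, tn) counted over all unordered pairs of objects."""
    if len(cluster_labels) != len(class_labels):
        raise ValueError("label sequences must have equal length")
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(cluster_labels)), 2):
        same_cluster = cluster_labels[i] == cluster_labels[j]
        same_class = class_labels[i] == class_labels[j]
        if same_cluster and same_class:
            tp += 1  # grouped together, and should be
        elif same_cluster:
            fp += 1  # grouped together, but should not be
        elif same_class:
            fn += 1  # separated, but should be together
        else:
            tn += 1  # separated, and should be
    return tp, fp, fn, tn


def rand_index(cluster_labels, class_labels):
    """Rand index = (TP + TN) / (number of pairs)."""
    tp, fp, fn, tn = pair_counts(cluster_labels, class_labels)
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("need at least two objects")
    return (tp + tn) / total


# Toy example (hypothetical labels).
predicted = [0, 0, 1, 1]
reference = [0, 0, 0, 1]
print(pair_counts(predicted, reference), rand_index(predicted, reference))
```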

1 answer

How should I interpret GAP statistic?

I used the GAP statistic to estimate k clusters in R. However, I'm not sure if I am interpreting it correctly. From the plot above I assume that I should use 3 clusters. From the second plot I should choose 6 clusters. Is it correct interpretation of GAP…

clustering

asked Apr 26 '14 at 11:29

peterpeter

5 answers

What is the difference between graph clustering and community detection methods?

Basically, the goal of both graph clustering and community detection methods is to compute clusters. Is there any difference between them?

clustering

asked Sep 21 '11 at 04:12

Jovice King

1 answer

What does total ss and between ss mean in k-means clustering?

I'm very new to cluster analysis. I'm using R for k-means clustering and I wonder what those things are. And is it better if their ratio is smaller or larger?

clustering

asked Jan 19 '14 at 23:29

kanbhold
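
For the question above, the usual decomposition is: the total sum of squares (SS) is the sum of squared distances from every point to the grand mean, the within SS uses each point's own cluster mean instead, and the between SS is the cluster-size-weighted squared distance from each cluster mean to the grand mean, so that total SS = within SS + between SS. A larger between/total ratio means the clusters account for more of the variation, although the ratio also tends to grow simply because k increases. The sketch below illustrates the arithmetic in Python rather than R; the function names and toy data are assumptions.

```python
def mean_point(points):
    """Coordinate-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return [sum(coord) / n for coord in zip(*points)]


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def sum_of_squares(points, labels):
    """Return (total_ss, within_ss, between_ss) for a given partition."""
    grand_mean = mean_point(points)
    total_ss = sum(sq_dist(p, grand_mean) for p in points)
    within_ss = 0.0
    between_ss = 0.0
    for k in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == k]
        centre = mean_point(members)
        # Spread of the members around their own cluster mean.
        within_ss += sum(sq_dist(p, centre) for p in members)
        # Spread of the cluster mean around the grand mean, weighted by size.
        between_ss += len(members) * sq_dist(centre, grand_mean)
    return total_ss, within_ss, between_ss


# Toy data (hypothetical): two tight groups far apart.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
labels = [0, 0, 1, 1]
total_ss, within_ss, between_ss = sum_of_squares(points, labels)
print(total_ss, within_ss, between_ss, between_ss / total_ss)
```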

3 answers

How to cluster longitudinal variables?

I have a bunch of variables which contain longitudinal data from day 0 to day 7. I am looking for an appropriate clustering approach which can cluster these longitudinal variables (not cases) into different groups. I tried to analyze this data set…

clustering

asked Oct 31 '11 at 20:27

cchien

4 answers

Predicting cluster of a new object with kmeans in R

I used my training dataset to fit clusters using the kmeans function: fit <- kmeans(ca.data, 2); How can I use the fit object to predict cluster membership in a new dataset? Thanks

clustering

asked Jul 04 '11 at 14:32

user333

1 answer

Can someone explain the C-Index in the context of hierarchical clustering?

This is a followup to this question. I am currently trying to implement the C-Index in order to find a near-optimal number of clusters from a hierarchy of clusters. I do this by calculating the C-Index for every step of the (agglomerative)…

clustering

asked Sep 13 '10 at 20:20

Björn Pollex

4 answers

How to tell quantitatively whether 1D data is clustered around 1 or 3 values?

I've got some data on the time between heart beats of a human. One indication of ectopic (extra) beats is that these intervals are clustered around three values instead of one. How can I obtain a quantitative measure of this? I'm looking to compare…

clustering

asked Dec 21 '11 at 15:49

Nikolaus

1 answer

How do I algorithmically determine values of T1 & T2 for canopy clustering?

I am trying to use canopy clustering to provide initial clusters for KMeans in mahout. Is there a way to determine / approximate the values of the distance thresholds T1 & T2 algorithmically? Right now I have T1 = 100 and T2 = 1 which doesn't seem…

clustering

asked Aug 05 '11 at 09:06

Rohan Monga

1 answer

Analyze a football match: similar players with DBSCAN and similar trajectories with TRACLUS

I'm trying to analyze a dataset that originates from sensors located near players' shoes in a match (http://www.orgs.ttu.edu/debs2013/index.php?goto=cfchallengedetails). I decided to look at clustering to identify: Similar trajectories of players…

clustering

asked Jun 10 '13 at 11:59

denadai2
