I have the following data:
    type  distance
0   X     12572
1   X     11229
2   Y     14144
3   A     15781
4   A     15486
5   B     461
6   X     328
7   X     23
8   X     50
9   A     45
10  A     231
11  A     10779
12  X     11433
... .....
type refers to the data point's category. distance is the gap between a data point and the next one. That is, the distance between the points at index 0 and index 1 is 12572, the distance between the points at index 1 and index 2 is 11229, etc.
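To make that convention concrete, here is a minimal sketch that turns the gaps into positions along a line with a running sum. The function name `positions_from_gaps` and the choice of 0 as the origin are my own assumptions for illustration; only the distances come from the table.

```python
from itertools import accumulate

# Gaps copied from the table: distances[i] is the gap between point i and point i + 1.
distances = [12572, 11229, 14144, 15781, 15486, 461, 328, 23, 50, 45, 231, 10779, 11433]

def positions_from_gaps(gaps, origin=0):
    """Return 1-D coordinates: point 0 sits at `origin` (an arbitrary choice),
    and each later point is the previous one plus the gap between them."""
    return list(accumulate(gaps, initial=origin))  # `initial` needs Python 3.8+

positions = positions_from_gaps(distances)
print(positions[:3])
```

Note that n gaps describe n + 1 points, so the 13 rows shown imply 14 positions.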
One can think of this set of data points as lying along one dimension. The identity (i.e. type) of a data point is irrelevant to this problem. I am interested in somehow inferring the "clusters" of data points that occur when data points are spaced closely together. In this case, it seems clear that the data points at indices 5-11 form one grouping.
One-dimensional clustering algorithms come to mind. However, there is a natural structure to this dataset: if consecutive distances are less than 10,000, there is normally a cluster. Simply binning by hand might be more appropriate.
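The "binning by hand" rule could be sketched roughly as follows. This is only an illustration of the threshold idea, not a probabilistic method; the function name `split_on_gaps` is hypothetical, and the 10,000 cutoff is just the rule of thumb mentioned above.

```python
# Gaps copied from the table: distances[i] is the gap between point i and point i + 1.
distances = [12572, 11229, 14144, 15781, 15486, 461, 328, 23, 50, 45, 231, 10779, 11433]

def split_on_gaps(gaps, threshold=10_000):
    """Group consecutive point indices, starting a new group whenever the gap
    to the next point is at least `threshold` (assumed cutoff, not inferred)."""
    clusters = [[0]]  # point 0 always opens the first group
    for i, gap in enumerate(gaps):
        if gap < threshold:
            clusters[-1].append(i + 1)  # point i + 1 is close to point i
        else:
            clusters.append([i + 1])    # large gap: point i + 1 starts a new group
    return clusters

clusters = split_on_gaps(distances)
dense = [c for c in clusters if len(c) > 1]  # keep only groups with several points
print(dense)  # [[5, 6, 7, 8, 9, 10, 11]]
```

On the rows shown, this recovers the grouping of indices 5-11 and leaves every other point on its own, but the result depends entirely on the hand-picked threshold.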
Is there a method for this problem based on probabilistic inference? Either there could be a way to infer the "natural" clustering within a given dataset (though that's ill-defined), or perhaps one could use part of the dataset as a training set?
distance variable represent; what does the "difference" between two points mean; and how do you measure the condition of being "spaced closely"? – whuber May 30 '17 at 14:52