Classifying histograms in N-dimensional space - is my approach correct?

Question

I have a problem which consists of classifying N-dimensional histograms. The salient points are as follows:

For ALL dimensions:

Each histogram has the same number of bins (say 500, for argument's sake)
Each bin has the same range

A high-level view of my (layman's) approach to the problem would be to do the following:

Calculate points P in 500-D space using Pythagoras' theorem (pairwise combination)
Calculate the Mahalanobis distance of each point P, and use that to categorize the points.
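
As a rough illustration of the two steps above: a histogram with B bins is already a point in B-dimensional space (its bin counts are the coordinates), so the points themselves need no pairwise construction. The sketch below assumes NumPy, synthetic data, and only 5 bins instead of 500 so that the covariance matrix is well behaved; the names `histograms` and `mahalanobis_to_mean` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 200 histograms with 5 bins each.
# Each row is one histogram, i.e. one point in 5-dimensional space.
histograms = rng.poisson(lam=[20, 35, 50, 35, 20], size=(200, 5)).astype(float)

def mahalanobis_to_mean(points):
    """Mahalanobis distance of every row from the sample mean."""
    centered = points - points.mean(axis=0)
    cov = np.cov(points, rowvar=False)
    # pinv rather than inv: with 500 bins and no more than 500 histograms
    # the sample covariance is singular and cannot be inverted directly.
    cov_inv = np.linalg.pinv(cov)
    squared = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)
    return np.sqrt(np.maximum(squared, 0.0))

# Plain Euclidean (Pythagorean) distance between the first two histograms.
euclidean = np.linalg.norm(histograms[0] - histograms[1])

# One distance per histogram, measured from the overall mean.
distances = mahalanobis_to_mean(histograms)
```

A single Mahalanobis distance measures how far each histogram lies from one overall mean, so on its own it tends to flag unusual points rather than separate several groups.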

I'm not a practising statistician, but this seems to be an intuitive way to solve the problem.

Am I missing anything fundamental here (or are there any assumptions I am making that I may be unaware of)?

How many classes do you need to classify? Given the high dimensionality of your feature space (500), I think linear discriminant analysis is a good start, or random forest. — Gumeo, Sep 26 '15 at 12:53
@GuðmundurEinarsson. I am not a statistician by profession, I find a lot of the model formulae difficult to intuit - hence trying to "roll my own", in a way which I understand, and can actually code myself. Could you please explain why you think LDA or random trees are suited to what I'm trying to do? — Homunculus Reticulli, Sep 27 '15 at 19:00
@GuðmundurEinarsson. To answer your first question, I do not have any a priori knowledge as to what the possible number of categories will be (in fact, I would be very interested in testing my assumption that there are only a few categories - i.e. fewer than 20). — Homunculus Reticulli, Sep 27 '15 at 19:07
Ok, so you do not have a response variable? I.e. you do not have a label for each data point? This sounds more like a clustering problem. What is the underlying problem you are solving? Where do these histograms come from? — Gumeo, Sep 28 '15 at 06:03
Have you solved your problem? Otherwise I can write up a suggested approach. — Gumeo, Oct 05 '15 at 13:54
@GuðmundurEinarsson: Sorry I couldn't get back to you earlier. No, I have not yet found a solution. I am generating the data from a system that I suspect to be a FSM. I am trying to use the histogram to 'partition' the states - so yes, you're right, it's a clustering problem.
I would prefer however, to code the solution by hand (well, using python, pandas etc), so that at least, I understand what is going on - as I don't want to use formulae/models I don't really understand - intuitively. HTH — Homunculus Reticulli, Oct 10 '15 at 06:58

score 1 · Answer 1 · answered Sep 26 '15 at 12:46

Don't calculate all the pairwise distances. Use K-Means or AGNES to do the clustering.
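
For anyone who prefers to code it by hand, K-Means can be sketched in a few lines of NumPy. This is a minimal version of Lloyd's algorithm under stated assumptions, not a tuned implementation: the data is synthetic, the function name `kmeans` and its parameters are hypothetical, and the initial centroids are simply k random data points.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # Distance from every point to every centroid: shape (n, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        # Assign each point to its nearest centroid.
        labels = dists.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return centroids, labels

# Synthetic stand-in data (assumption): two groups of 10-bin histograms.
rng = np.random.default_rng(1)
group_a = rng.poisson(lam=5, size=(50, 10)).astype(float)
group_b = rng.poisson(lam=50, size=(50, 10)).astype(float)
histograms = np.vstack([group_a, group_b])

centroids, labels = kmeans(histograms, k=2)
```

Each iteration computes n × k point-to-centroid distances rather than all n × n pairwise distances. The number of clusters k has to be chosen up front; when it is unknown, one option is to run the function for several values of k and compare the within-cluster spread.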

rwinkel2000

Thanks for your answer. As I said earlier, in my response to Guðmundur, I'm not a statistician by profession, so I try to avoid models that I don't understand intuitively...
Could you please explain why you think K-Means would be a suitable way of achieving what I want to do?
– Homunculus Reticulli Sep 27 '15 at 19:02

K-Means is easy to understand. Just go to YouTube, search for "k means", pick a video and then fast forward to the part where they show a dot plot. It is a very likable method and you can easily understand it. – rwinkel2000 Sep 28 '15 at 12:31
