I have a problem which consists of classifying N-dimensional histograms. The salient points are as follows:

For ALL dimensions:

  1. Each histogram has the same number of bins (say 500, for argument's sake)
  2. Each bin has the same range

A high-level view of my (layman's) approach to the problem would be to do the following:

  1. Calculate points P in 500-D space using Pythagoras' theorem (pairwise combination)
  2. Calculate the Mahalanobis distance of each point P, and use that to categorize the points.
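To make step 2 concrete, here is a minimal sketch of what I have in mind, not a finished implementation. It assumes each histogram is flattened into one row of bin counts and that the distance is measured from the sample mean of all histograms; the function name and the random data are made up for illustration. It uses a pseudo-inverse because the sample covariance is singular whenever there are fewer histograms than bins, which is likely with 500 bins.

```python
import numpy as np

def mahalanobis_from_mean(X):
    """Mahalanobis distance of each row of X from the sample mean.

    X has shape (n_histograms, n_bins); each row is one flattened histogram
    (an assumption made for this sketch).
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Sample covariance between bins (columns are the variables).
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    # Pseudo-inverse: still defined when the covariance is singular.
    cov_inv = np.linalg.pinv(cov)
    # Quadratic form c_i^T * cov_inv * c_i for every row i.
    d2 = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)
    return np.sqrt(np.maximum(d2, 0.0))

# Made-up data: 200 histograms with 20 bins each (20 instead of 500 for brevity).
rng = np.random.default_rng(0)
histograms = rng.poisson(lam=5.0, size=(200, 20)).astype(float)
distances = mahalanobis_from_mean(histograms)
print(distances[:5])
```

Note that this gives one number per histogram (how unusual it is relative to the overall mean), so two very different histograms can end up with similar distances.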

I'm not a practising statistician, but this seems to be an intuitive way to solve the problem.

Am I missing anything fundamental here (or are there any assumptions I am making that I may be unaware of)?

  • How many classes do you need to classify? Given the high dimensionality of your feature space (500), I think linear discriminant analysis is a good start, or random forest. – Gumeo Sep 26 '15 at 12:53
  • @GuðmundurEinarsson. I am not a statistician by profession, and I find a lot of the model formulae difficult to intuit, hence trying to "roll my own" in a way which I understand and can actually code myself. Could you please explain why you think LDA or random forests are suited to what I'm trying to do? – Homunculus Reticulli Sep 27 '15 at 19:00
  • @GuðmundurEinarsson. To answer your first question, I do not have any a priori knowledge as to what the possible number of categories will be (in fact, I would be very interested in testing my assumption that there are only a few categories, i.e. fewer than 20). – Homunculus Reticulli Sep 27 '15 at 19:07
  • Ok, so you do not have a response variable? I.e. you do not have a label for each data point? This sounds more like a clustering problem. What is the underlying problem you are solving? Where do these histograms come from? – Gumeo Sep 28 '15 at 06:03
  • Have you solved your problem? Otherwise I can write up a suggested approach. – Gumeo Oct 05 '15 at 13:54
  • @GuðmundurEinarsson: Sorry I couldn't get back to you earlier. No, I have not yet found a solution. I am generating the data from a system that I suspect to be a finite-state machine (FSM). I am trying to use the histogram to 'partition' the states, so yes, you're right, it's a clustering problem.

    I would, however, prefer to code the solution by hand (well, using Python, pandas, etc.), so that I at least understand what is going on, as I don't want to use formulae/models that I don't really understand intuitively. HTH

    – Homunculus Reticulli Oct 10 '15 at 06:58

1 Answer

Don't calculate all the pairwise distances. Use K-Means or AGNES (agglomerative nesting, a hierarchical clustering method) to do the clustering.
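Since the asker wants something that can be coded by hand, here is a rough sketch of K-Means (plain Lloyd's algorithm with Euclidean distance) in NumPy. It is only an illustration under some assumptions: each histogram is flattened into one row, the number of clusters `k` has to be chosen by you, and the function name, parameters, and toy data are made up for this example. AGNES is not shown.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm. Returns (labels, centroids)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    def assign(c):
        # Squared Euclidean distance from every point to every centroid.
        d2 = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centroids)
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return assign(centroids), centroids

# Toy data: two groups of 50 histograms (10 bins each) with very different counts.
rng = np.random.default_rng(1)
low = rng.poisson(lam=2.0, size=(50, 10))
high = rng.poisson(lam=40.0, size=(50, 10))
labels, centroids = kmeans(np.vstack([low, high]), k=2)
print(labels)
```

Each iteration only compares the points with the `k` centroids, so the full table of pairwise distances between points is never built. Because the number of states is unknown here, one would have to run this for several values of `k` and compare the results.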

  • Thanks for your answer. As I said earlier, in my response to Guðmundur, I'm not a statistician by profession, so I try to avoid models that I don't understand intuitively...

    Could you please explain why you think K-Means would be a suitable way of achieving what I want to do?

    – Homunculus Reticulli Sep 27 '15 at 19:02
  • K-Means is easy to understand. Just get on YouTube, search for "k means", pick a video and then fast forward to the part where they show a dot plot. It is a very likable method and you can easily understand it. – rwinkel2000 Sep 28 '15 at 12:31