
I have training samples that I project onto an eigenspace via PCA. What is a reasonable threshold on the Mahalanobis distance (to the mean) for rejecting invalid input data?

The paper here states that a distance of 3 standard deviations would be reasonable. However, the example here has training data up to 10 standard deviations away. What threshold should I set for my face recognition application? I have found that the distances of my training samples to the mean can reach 8-9 standard deviations.

Is there a rule of thumb for setting a threshold on the Mahalanobis distance? Thank you.

RuiQi
    Welcome to CV. That two different papers use different thresholds tells you something, namely, that there is no "rule of thumb" and the decision is up to the analyst. – user78229 Mar 27 '16 at 17:09
  • Awesome, thanks! I had the impression that most of the samples would lie within 3 standard deviations. I guess I'm wrong and that it depends on the quality of the training data. – RuiQi Mar 27 '16 at 17:36
  • ±3 standard deviations holds under the assumption of a normal distribution. N.B. the assumptions you bring to your analysis play an important role in its results. For normally distributed data, the sample size also matters: it should be large enough, in the sense of the law of large numbers. – JeeyCi Jan 12 '24 at 07:06
  • Besides, you may be dealing with heavy-tailed distributions; for detecting outliers, the choice of estimator also matters. Thus it all depends on your assumptions! – JeeyCi Jan 12 '24 at 07:06
  • You can also use DBSCAN; the only parameter you need to assume is eps, which can be chosen with an elbow chart (see the sketch after these comments). – JeeyCi Jan 12 '24 at 07:19
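
Regarding the DBSCAN suggestion above, here is a minimal sketch of the k-distance "elbow" heuristic for choosing eps. This is an illustration, not code from the thread; scikit-learn is assumed, the synthetic data stands in for PCA projections, and the percentile used in place of visually reading the elbow is an assumption.

```python
# Sketch of the k-distance "elbow" heuristic for choosing DBSCAN's eps.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))         # stand-in for PCA-projected samples

min_samples = 5
# Sorted distance of each point to its min_samples-th nearest neighbour;
# the "elbow" of this curve is a common choice for eps.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
dists, _ = nn.kneighbors(X)
k_dist = np.sort(dists[:, -1])
eps = k_dist[int(0.95 * len(k_dist))]  # crude stand-in for reading the elbow

labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
print(f"eps = {eps:.2f}, {np.sum(labels == -1)} points flagged as noise")
```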

1 Answer


In order to detect outliers, we must specify a threshold. One way is to set it from the distribution of the Mahalanobis distances themselves: take their mean plus k times their standard deviation, where the extremeness degree k = 2.0 flags extreme values and k = 3.0 flags very extreme values, per the 68-95-99.7 rule.
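
A minimal sketch of that rule, assuming NumPy and illustrative names (this is an illustration, not code from the answer): compute each sample's Mahalanobis distance to the mean of the PCA-projected data, then cut at mean + k * std of those distances.

```python
import numpy as np

def mahalanobis_threshold(X, k=3.0):
    """Per-sample Mahalanobis distances and a mean + k*std cutoff.

    X : (n_samples, n_features) array, e.g. PCA-projected training data.
    k : extremeness degree (2.0 = extreme, 3.0 = very extreme).
    """
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    cov_inv = np.linalg.pinv(cov)      # pseudo-inverse for numerical safety
    diff = X - mean
    # Quadratic form diff @ cov_inv @ diff.T, one value per row
    d = np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))
    return d, d.mean() + k * d.std()

# Usage on synthetic data:
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))         # stand-in for PCA projections
d, thr = mahalanobis_threshold(X, k=3.0)
print(f"threshold = {thr:.2f}, flagged {np.sum(d > thr)} of {len(d)} samples")
```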

  • That rule is a rather poor approximation to the distribution of Mahalanobis distances (which will have strong positive skew in lower dimensions) and is far too generous for flagging outliers; a chi-squared-based alternative is sketched below. – whuber Dec 27 '19 at 14:24
  • Thank you, but this is the Mahalanobis distance within which most of the data fit, I guess. I don't think this is practical at all; it is very weak indeed. – Avv Jul 31 '21 at 00:57
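
Following up on whuber's comment: under multivariate normality, the squared Mahalanobis distance is approximately chi-squared distributed with d degrees of freedom (d = number of retained components), so a quantile of that distribution gives a less ad hoc cutoff than mean + k * std. A hedged sketch, assuming SciPy and an illustrative dimension:

```python
import numpy as np
from scipy.stats import chi2

d_features = 10                # dimension of the PCA space (illustrative)
alpha = 0.001                  # tolerated false-reject rate under normality
cutoff = np.sqrt(chi2.ppf(1 - alpha, df=d_features))
print(f"reject samples with Mahalanobis distance > {cutoff:.2f}")
# For d = 10 and alpha = 0.001 this gives roughly 5.4; quite different
# from a blanket "3 standard deviations", echoing the comments above.
```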