Mahalanobis distance distribution of multivariate normally distributed points

Question

I'm trying to understand the properties of Mahalanobis distance of multivariate random points (my final goal is to use Mahalanobis distance for outlier detection). My calculations are in python. I miss some basics here and will be glad if someone will explain me my mistake.

Here is my code:

First I generate 1000 multivariate (2D) random points around the origin

#imports and definitions
import numpy as np
import scipy.stats as stats
import scipy.spatial.distance as distance
chi2 = stats.chi2
np.random.seed(111)


#covariance matrix: X and Y are normally distributed with std of 1
#and are independent one of another
covCircle = array([[1, 0.], [0., 1.]])
circle = multivariate_normal([0, 0], covCircle, 1000) #1000 points around [0, 0]
mahalanobis = lambda p: distance.mahalanobis(p, [0, 0], covCircle.T)
d = np.array(map(mahalanobis, circle)) #Mahalanobis distance values for the 1000 points
d2 = d ** 2 #MD squared

At this point I expect that d2 (Mahalanobis distance values squared) to follow chi2 distribution with 2 degrees of freedom. I would like to calculate the corresponding probability distribution function (PDF) values and use them for outlier detection. To illustrate my problem, plot the calculated PDF values for values between 0 and 10 for chi-squared distribution with 2 degrees of freedom and with location and scale values equal to the mean and standard deviation of d2, respectively:

theMean = mean(d2)
theStd = std(d2)
x = linspace(0, 10)
pdf = chi2.pdf(x, 2, theMean, theStd)
plot(x, pdf, '-b')

Here is the plot:

enter image description here As we can see, the PDF values that correspond to X values below ~2.04 are neglegible low, suggesting that there are no points that are closer than 2.04 from the origin, which is obviously not true. What am I missing?

EDIT Why am I using theMean and theStd as arguments to the pdf function? As far as I understand, the distances are expected to have chi2 distribution with the given mean and stdev. That is why I use these as arguments to PDF. From a comment to this question I understand that this is not true. So what are the correct parameters of the theoretical chi2 distribution?

I mean that your question is obviously incomplete; it is hard to answer it in this form. — , May 16 '12 at 11:44
@mbq Yep you are right. Missed the question during copy/paste :) — David D, May 16 '12 at 11:54
It looks like you're shifting the $\chi^2$ density over by theMean, which explains why there's no mass to the left of theMean. Why are you making that shift? — Macro, May 16 '12 at 13:05
@Macro As far as I understand, the distances are expected to have chi2 distribution with the given mean and stdev. That is why I use these as arguments to PDF. From your comment I understand that this is not true. So what are the correct parameters of the theoretical chi2 distribution? — David D, May 16 '12 at 13:20
What exactly do these arguments do? Shift and scale, I guess? (I'm not familiar with these functions in python). From your statement "At this point I expect that d2 (Mahalanobis distance values squared) to follow chi2 distribution with 2 degrees of freedom.", it seems that no shifting or scaling should be necessary. — Macro, May 16 '12 at 13:29

score 5 · Accepted Answer · answered Nov 26 '12 at 23:34

Your call to stats.chi2 is indeed incorrect. When you map your data using the mahalanobis distance, it is theoretically $\chi^2_2$ data, so you do not need to play with the loc, scale parameters in the stats.chi2 function (but do keep df=2, like you did). Here's my modified code, plus a pretty visualization of outlier detection.

I should mention this is somewhat cheating. We used the true mean and true covariance matrix to find outliers. In practice, we only have estimates (which depended on the outiers themselves!), hence we require more robust mean and covariance estimates, see this great paper.

The data that is above the 99.5% line is shown in red below $The $\chi^2$ test for outliers$

observed data. Red points are considered outliers

#imports and definitions
import numpy as np
import scipy.stats as stats
import scipy.spatial.distance as distance
import matplotlib.pyplot as plt
chi2 = stats.chi2
np.random.seed(111)


#covariance matrix: X and Y are normally distributed with std of 1
#and are independent one of another
covCircle = np.array([[1, 0.], [0., 1.]])
circle = np.random.multivariate_normal([0, 0], covCircle, 1000) #1000 points around [0, 0]
mahalanobis = lambda p: distance.mahalanobis(p, [0, 0], covCircle.T)
d = np.array(map(mahalanobis, circle)) #Mahalanobis distance values for the 1000 points
d2 = d ** 2 #MD squared

degrees_of_freedom = 2

x = range( len( d2 ))

plt.subplot(111)

plt.scatter( x, d2 )

plt.hlines( chi2.ppf(0.95, degrees_of_freedom), 0, len(d2), label ="95% $\chi^2$ quantile", linestyles = "solid" ) 
plt.hlines( chi2.ppf(0.975, degrees_of_freedom), 0, len(d2), label ="97.5% $\chi^2$ quantile", linestyles="dashed" ) 
plt.hlines( chi2.ppf(0.99, degrees_of_freedom), 0, len(d2), label ="99.5% $\chi^2$ quantile", linestyles = "dotted" )

plt.legend()
plt.ylabel("recorded value")
plt.xlabel("observation")
plt.title( 'Detection of outliers at different $\chi^2$ quantiles' )

plt.show()

I would add a pointer to this paper too. – user603 Nov 26 '12 at 23:43 — user603, Nov 26 '12 at 23:43

Mahalanobis distance distribution of multivariate normally distributed points

1 Answers1

Linked