I'm trying to understand the properties of Mahalanobis distance of multivariate random points (my final goal is to use Mahalanobis distance for outlier detection). My calculations are in python. I miss some basics here and will be glad if someone will explain me my mistake.
Here is my code:
First I generate 1000 multivariate (2D) random points around the origin
#imports and definitions
import numpy as np
import scipy.stats as stats
import scipy.spatial.distance as distance
chi2 = stats.chi2
np.random.seed(111)
#covariance matrix: X and Y are normally distributed with std of 1
#and are independent one of another
covCircle = array([[1, 0.], [0., 1.]])
circle = multivariate_normal([0, 0], covCircle, 1000) #1000 points around [0, 0]
mahalanobis = lambda p: distance.mahalanobis(p, [0, 0], covCircle.T)
d = np.array(map(mahalanobis, circle)) #Mahalanobis distance values for the 1000 points
d2 = d ** 2 #MD squared
At this point I expect that d2 (Mahalanobis distance values squared) to follow chi2 distribution with 2 degrees of freedom.
I would like to calculate the corresponding probability distribution function (PDF) values and use them for outlier detection.
To illustrate my problem, plot the calculated PDF values for values between 0 and 10 for chi-squared distribution with 2
degrees of freedom and with location and scale values equal to the mean and standard deviation of d2, respectively:
theMean = mean(d2)
theStd = std(d2)
x = linspace(0, 10)
pdf = chi2.pdf(x, 2, theMean, theStd)
plot(x, pdf, '-b')
Here is the plot:
As we can see, the PDF values that correspond to X values below ~2.04 are neglegible low, suggesting that there are no points that are closer than 2.04 from the origin, which is obviously not true. What am I missing?
EDIT
Why am I using theMean and theStd as arguments to the pdf function? As far as I understand, the distances are expected to have chi2 distribution with the given mean and stdev. That is why I use these as arguments to PDF. From a comment to this question I understand that this is not true. So what are the correct parameters of the theoretical chi2 distribution?


theMean, which explains why there's no mass to the left oftheMean. Why are you making that shift? – Macro May 16 '12 at 13:05