Hypothesis testing to determine cluster outliers

Question

I have a cluster of $n$ samples of $p$-dimensional data, assumed to follow a multivariate Gaussian distribution, with sample mean ${\bar{\mu}}$ and sample covariance matrix ${\bar{S}}$.

Given a new sample $x_i$ I want to test whether or not the sample belongs to this cluster. I am doing the following:

My null hypothesis, $H_o$, is: $x_i$ does belong to the cluster defined by the multivariate Gaussian with sample mean ${\bar\mu}$ and sample covariance $\bar{S}$

My alternate hypothesis, $H_a$, is: $x_i$ does not belong to the cluster

My test statistic, $T(\bar{\mu},\bar{S})$, is the Mahalanobis distance which, according to this post and the paper "Gaussian mixture modeling by exploiting the Mahalanobis distance" (page 13, Appendix), follows a scaled Beta distribution:

$$ \frac{n}{(n-1)^2}T(\bar{\mu},\bar{S}) \sim \beta\left(\frac{p}{2}, \frac{(n-p-1)}{2}\right) $$
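The left-hand side of the relation above can be computed with a few lines of NumPy. This is a minimal sketch, not a definitive implementation: it assumes that $T$ denotes the squared Mahalanobis distance and that $\bar{S}$ is the sample covariance with the $n-1$ denominator (the `np.cov` default). The function name `scaled_mahalanobis` and the simulated data are hypothetical and only serve as illustration.

```python
import numpy as np

def scaled_mahalanobis(X, x):
    """Return n / (n - 1)^2 * d^2 for the point x.

    d^2 is the squared Mahalanobis distance of x from the sample mean of X,
    computed with the sample covariance of X (n - 1 denominator).
    Assumption: T in the question is this squared distance.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    mean = X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))  # sample covariance, shape (p, p)
    diff = np.asarray(x, dtype=float) - mean
    # Solve cov @ z = diff instead of forming the inverse explicitly.
    d2 = float(diff @ np.linalg.solve(cov, diff))
    return n / (n - 1) ** 2 * d2

# Illustrative data only: 200 simulated samples in 3 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
stat = scaled_mahalanobis(X, X[0])
print(stat)
```

The example scores a row of `X` itself; for such a point the scaled value always lies in $[0, 1]$, which matches the support of the Beta distribution.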

If I want to use a significance level ${\alpha} = 0.05$ then I believe I need to calculate my $p$-value which, in this example, is the probability of observing a sample at least as far from the underlying multivariate Gaussian as $x_i$. With this $p$-value I will then check:

if $p$-value $< {\alpha}$:

then reject $H_o$ i.e. this sample does not belong to this cluster

else accept $H_o$ i.e. this sample does belong to the cluster

Is this the correct way to perform this test? And if so, how do I calculate the $p$-value?

EDIT: (Adding more information and an idea for a solution)

It is my understanding that the $p$-value represents the probability of obtaining a Mahalanobis distance ($d$) at least as extreme as $d(x_i)$.

This is calculated as $P(X \geq x_i)$, which is $1 - P(X \leq x_i)$, i.e. $1$ minus the CDF of the Beta distribution $\beta\left(\frac{p}{2}, \frac{(n-p-1)}{2}\right)$.

So the $p$-value $= 1 - I_{x_i}\left(\frac{p}{2}, \frac{(n-p-1)}{2}\right)$, where $I_{x_i}\left(\frac{p}{2}, \frac{(n-p-1)}{2}\right)$ is the regularized incomplete beta function.
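The upper-tail probability described here can be evaluated with SciPy's Beta distribution, whose survival function `beta.sf` equals one minus the CDF (the regularized incomplete beta function). This is a sketch under one assumption: the function is evaluated at the scaled statistic $\frac{n}{(n-1)^2}T$, not at the raw sample. The helper name `beta_upper_tail` and the example numbers are hypothetical.

```python
from scipy.stats import beta

def beta_upper_tail(stat, n, p):
    """Upper-tail probability P(B >= stat) for B ~ Beta(p/2, (n-p-1)/2).

    `stat` is assumed to be the scaled statistic n / (n - 1)^2 * T,
    so it should lie in [0, 1].
    """
    a = p / 2.0
    b = (n - p - 1) / 2.0
    # sf(x) = 1 - cdf(x); using sf avoids round-off for tiny tail probabilities.
    return float(beta.sf(stat, a, b))

# Illustrative values only: n = 200 samples, p = 3 dimensions.
print(beta_upper_tail(0.05, n=200, p=3))
```

Using `beta.sf` directly, instead of computing `1 - beta.cdf(...)`, keeps precision when the tail probability is very small.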

Given that this is a two-tailed hypothesis test with a confidence level of $95\%$, my test should be:

$p$-value $= 2 \times \left(1 - I_{x_i}\left(\frac{p}{2}, \frac{(n-p-1)}{2}\right)\right)$

if $p$-value $\leq 0.05$

then: reject the null hypothesis (i.e. declare the sample as an outlier)

else: accept the null hypothesis (i.e. sample is not an outlier)

I think this is correct, but could someone more knowledgeable please confirm?

@user603 do you mean that ${\bar{\mu}}$ and ${\bar{S}}$ were calculated without ${x_i}$? As in, ${x_i}$ was not one of the $n$ samples used to calculate the sample mean and covariance? If so, then in this case yes, $x_i$ is independent of ${\bar{\mu}}$ and ${\bar{S}}$ – Aly, Feb 20 '13 at 18:11
yes (the rest of this comment is there because otherwise it is too short) – user603, Feb 20 '13 at 18:19
Simply get your software to spit out the 99 percent quantile of the beta distribution. Then compare the adjusted Mahalanobis distance of each $x_i$ (i.e. with the n/(n-1)**2 factor applied) to this cutoff: if it is larger, the observation is an outlier; otherwise it is not. You will not gain anything by converting this to p-values. – user603, Feb 21 '13 at 14:25
@user603 Thanks, please put this as an answer and I will accept – Aly, Feb 21 '13 at 16:29
@user603 do I have to check if $d(x_i) > 2 \times \mathrm{betainv}(0.95, a, b)$ as it is a two-tailed test, or just against $1 \times \mathrm{betainv}(0.95, a, b)$? – Aly, Feb 21 '13 at 17:31
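
The cutoff approach suggested in the comments can be sketched as follows. It takes a quantile of the Beta distribution with `scipy.stats.beta.ppf` and compares the scaled statistic against it. The function names `beta_cutoff` and `is_outlier` are hypothetical, the 0.99 level follows the comment, and the statistic passed in is assumed to already include the n/(n-1)**2 factor.

```python
from scipy.stats import beta

def beta_cutoff(n, p, level=0.99):
    """Quantile of Beta(p/2, (n-p-1)/2) at the given level (0.99 by default)."""
    return float(beta.ppf(level, p / 2.0, (n - p - 1) / 2.0))

def is_outlier(stat, n, p, level=0.99):
    """Flag a point whose scaled statistic exceeds the Beta quantile.

    `stat` is assumed to already carry the n / (n - 1)^2 factor.
    """
    return stat > beta_cutoff(n, p, level)

# Illustrative values only: n = 200 samples, p = 3 dimensions.
cutoff = beta_cutoff(200, 3)
print(cutoff, is_outlier(0.5, 200, 3))
```

Comparing against a fixed cutoff gives the same decision as comparing the upper-tail probability with one minus the level, since both use the same Beta distribution.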
