
I have a set (cluster) of vectors in dimension d. From these I have calculated the sample mean and covariance matrix (I make the assumption that they come from a multivariate Gaussian).

My question is: given a new vector (also in dimension d), how do I decide whether it belongs to this cluster by checking whether its distance from the mean is less than 2 standard deviations?

In the one-dimensional case I would simply check whether |x - x_bar| > 2*sigma.

How does this extend to the multivariate case?
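For concreteness, this is the kind of one-dimensional check I have in mind (a minimal sketch in Python with NumPy; the function name and simulated data are just for illustration):

```python
import numpy as np

def in_cluster_1d(x_new, sample):
    """Naive 1-D rule: accept x_new if it lies within 2 sample
    standard deviations of the sample mean."""
    mu_hat = sample.mean()
    sigma_hat = sample.std(ddof=1)  # unbiased sample standard deviation
    return abs(x_new - mu_hat) <= 2 * sigma_hat

rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=100)
print(in_cluster_1d(10.5, sample))   # a point near the sample mean
print(in_cluster_1d(25.0, sample))   # a point far from the sample mean
```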

Thanks

Aly

1 Answer


First of all, in the univariate case (when $d=1$, i.e. the one you already know the decision rule for), assuming you have a vector of $n$ univariate measurements $x$ (so that $x$ is an $n\times 1$ matrix whose entries $x_i$ are scalars), the decision rule you describe is really:

$$\left(\frac{n(n-1)}{(n-1)(n+1)}\frac{\left(x_i-\hat{\mu}_x\right)^2}{\hat{\sigma}^2_x}\right) > F_{0.95}(1, n-1)$$

where $F_{0.95}(1, n-1)$ is the 95th percentile of an $F$ (Fisher) distribution with $1$ and $n-1$ degrees of freedom (you consider that $x_i$ is too far from $\hat{\mu}_x$ in the metric $\hat{\sigma}_x$ to belong to the cluster with mean $\hat{\mu}_x$ and scale $\hat{\sigma}_x$). This is the correct version of your rule of thumb when $p=1$ (I denote by $p$ what you write as $d$; sorry for the confusion, but if I change my notation now, my answers to your comments below will become meaningless).
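As a sketch of how this univariate rule could be applied (assuming Python with NumPy/SciPy; the helper name and simulated data are mine, and the constant $n(n-1)/((n-1)(n+1))$ is simplified to $n/(n+1)$):

```python
import numpy as np
from scipy import stats

def outside_cluster_1d(x_i, x, level=0.95):
    """Univariate decision rule: flag x_i when
    n/(n+1) * (x_i - mean)^2 / s^2  >  F_level(1, n-1)."""
    n = len(x)
    mu_hat = x.mean()
    s2_hat = x.var(ddof=1)  # unbiased sample variance
    stat = n / (n + 1) * (x_i - mu_hat) ** 2 / s2_hat
    return stat > stats.f.ppf(level, 1, n - 1)

rng = np.random.default_rng(42)
x = rng.normal(loc=0.0, scale=1.0, size=50)
print(outside_cluster_1d(0.1, x))   # close to the cluster mean
print(outside_cluster_1d(5.0, x))   # roughly 5 sigma away
```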

In the multivariate case (where $p>1$, i.e. the one you are really interested in), this becomes the following: assuming $X$ is of dimensions $n\times p$ (so that each row $X_i$ of $X$ is a $p$-vector) and $\hat{\sigma}_X^{-1}$ (the inverse of the sample variance-covariance matrix of the $X_i$'s) exists:

$$\left(\frac{n(n-p)}{p(n-1)(n+1)}\left(X_i-\hat{\mu}_X\right)'\hat{\sigma}_X^{-1}\left(X_i-\hat{\mu}_X\right)\right) > F_{0.95}(p, n-p)$$

denoting $\hat{\mu}_X$ the $p$-vector of means of $X$. Here $\left(X_i-\hat{\mu}_X\right)'\hat{\sigma}_X^{-1}\left(X_i-\hat{\mu}_X\right)$ is the squared Mahalanobis distance of $X_i$ w.r.t. $(\hat{\mu}_X,\hat{\sigma}_X)$.
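A minimal sketch of this multivariate rule (again assuming Python with NumPy/SciPy; the function name and simulated data are illustrative, not part of the original answer):

```python
import numpy as np
from scipy import stats

def outside_cluster(x_new, X, level=0.95):
    """Multivariate decision rule: flag x_new when the scaled squared
    Mahalanobis distance exceeds the F_level(p, n-p) quantile."""
    n, p = X.shape
    mu_hat = X.mean(axis=0)
    sigma_hat = np.cov(X, rowvar=False)           # p x p sample covariance
    diff = x_new - mu_hat
    d2 = diff @ np.linalg.solve(sigma_hat, diff)  # squared Mahalanobis distance
    stat = n * (n - p) / (p * (n - 1) * (n + 1)) * d2
    return stat > stats.f.ppf(level, p, n - p)

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 3))                 # n = 200 draws of a 3-vector
print(outside_cluster(np.zeros(3), X))        # near the cluster centre
print(outside_cluster(np.full(3, 10.0), X))   # far outside the cluster
```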

user603
  • I thought in the univariate case the vectors are of length 1? – Aly Feb 11 '13 at 17:17
  • it's a notation convention thing. I've made the dimensions explicit in all cases. – user603 Feb 11 '13 at 17:27
  • But can you use this formula for a multi-dimension vector? For my purpose I have a cluster of n-dimension vectors and I wish to compute a mean and std deviation so that given another n-dimension vector I can calculate the likelihood that it belongs to this cluster. Should I be using a multivariate Gaussian, or have I misunderstood something and can just use the univariate? – Aly Feb 11 '13 at 17:30
  • If your $x_i$ is a scalar, use the first inequality. If your $X_i$ is a vector (of $p$ measurements) use the second inequality. Can you explain where my formulation is confusing? I can edit the answer – user603 Feb 11 '13 at 17:37
  • Sorry, in the first formula, $x_i$ refers to a scalar, so what is $n$? For the univariate case, let's say for example I have had M samples and calculated the mean and variance; given another sample x, if I want to test that its distance from the mean is less than two standard deviations I should use the first formula? If so, what does the subscript x mean on mu and sigma, and also what is n? – Aly Feb 11 '13 at 17:39
  • If my M samples were of dimension n, then I construct an n-dimensional mean and an n×n covariance matrix. If I then have a new n-dimension sample and I want to figure out if its distance from the mean is less than two standard deviations I should use the second formula? If so, I think I am getting confused by the notion of the n×p dimension covariance matrix as I thought it had to be square. Also, the subscripts on mu and sigma. Thanks – Aly Feb 11 '13 at 17:41
  • Additionally, why are we using the beta distribution and not Z scores with the phi distribution? – Aly Feb 11 '13 at 18:02
  • Where do these formulas come from? For large $n$, and because clearly the right hand sides cannot exceed $1$, the formulas would indicate that any $x_i$ for which $|x_i - \hat{\mu}_x| \gt \sigma_x$ is (or perhaps is not?) a member of the cluster. I cannot find any way to interpret the question that makes this correct. What am I missing? And why does this answer deal with a vector of observations rather than a single observation as posed in the question? – whuber Feb 11 '13 at 18:15
  • @Aly: i've provided a link for the beta distribution, my $n$ is your $M$ (and your $n$ is my $p$). $X$ is not a covariance matrix (it is your dataset: you have n observations, each a p-variate vector). $\hat{\mu}_x$ ($\hat{\mu}_X$) is the mean of the entries of $x$ ($X$). In other words, for example in the univariate case, $x$ is the dataset you have used to compute "x_bar" (my $\hat{\mu}_x$) and your "sigma" (my $\hat{\sigma}_x$). – user603 Feb 11 '13 at 18:53
  • Thank you! That was exactly it--I missed the fact that the denominator was squared but the numerator not. I feel much better about the situation now :-). – whuber Feb 11 '13 at 19:51
  • @user603 Looking at the link you provided, and the fact that my observation is independent of the samples used to estimate the distribution, it would appear I should be using Fisher's F-ratio distribution. If this is correct, please update your answer and I will accept – Aly Feb 12 '13 at 13:37
  • Also, in my case I am just checking one sample so n=1? If so then the distance metric used above n/(n-1)^2 * d will give a divide by zero – Aly Feb 12 '13 at 13:42
  • @Aly: the $n$ refers to the sample size used to estimate the parameters ('x_bar' and 'sigma').... – user603 Feb 12 '13 at 13:55
  • @user603 thanks, I have accepted. Can you tell me how/where I can look up values of $F_{0.95}(p, n-p)$? – Aly Feb 12 '13 at 14:38
  • @Aly: these are implemented in many statistical packages (for example, in R via the function qf) – user603 Feb 12 '13 at 15:12
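For instance, assuming SciPy is available, the quantile $F_{0.95}(p, n-p)$ can be looked up with `scipy.stats.f.ppf` (the Python analogue of R's `qf`; the values of p and n below are made up):

```python
from scipy import stats

# 95th percentile of F(p, n - p), e.g. p = 3 variables, n = 200 observations
p, n = 3, 200
threshold = stats.f.ppf(0.95, p, n - p)
print(threshold)
```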