
I'm looking for the asymptotic ($n\rightarrow \infty$) value of (the log of the determinant of) the covariance of the fraction $\alpha$ of observations with smallest Euclidean distance to the origin in a sample of size $n$ drawn from, say, a bivariate standard Gaussian.

--The hyper-volume of an ellipsoid is proportional to the square root of the determinant of its covariance matrix, hence the title.--

--By standard bivariate Gaussian, I mean $\mathcal{N}_2(0_2,\pmb I_2)$ where $0_2$ is the zero vector of length 2 and $\pmb I_2$ is the $2\times 2$ identity matrix.--

It's easy to see by simulation that when $\alpha=52/70$ the value is approximately $-1.28$:

library(MASS)
n <- 10000
p <- 2
alpha <- 52/70  # fraction of observations kept, matching the text
x <- mvrnorm(n, rep(0, p), diag(p))
h <- ceiling(alpha * n)
# With the identity as covariance, Mahalanobis distances are squared Euclidean distances
w <- mahalanobis(x, rep(0, p), diag(p), inverted = TRUE)
s <- order(w)[1:h]  # indices of the h observations closest to the origin
log(det(cov(x[s, ])))

but I don't recall how to obtain an exact expression (or failing that, a better approximation) for this.

user603
  • 1
    In your text, you say nothing about the parameters of the bivariate distribution. Also, it looks like your code is about Mahalanobis d, not Euclidean d. – ttnphns Jul 18 '14 at 09:06
  • 1
    By standard Gaussian I mean the one centered at the origin and with identity covariance (I will edit this in). Mahalanobis distance wrt the identity covariance matrix == Euclidean distance. – user603 Jul 18 '14 at 09:08
  • 1
    If you are using code, or seeking help with code, please state what language or program you are using. – wolfies Jul 18 '14 at 16:34

2 Answers


Ok, this question seems to come up from time to time, so I thought I'd give a general answer.

In [1], the authors show that if $\pmb x_i\sim \mathcal{N}_p(\pmb \mu,\pmb \varSigma),\;i=1,\ldots,n$, with $\pmb\varSigma$ symmetric positive definite, and the index set $S_{\alpha}$ is defined as

$$S_{\alpha}=\{i: (\pmb x_i-\pmb\mu)'\varSigma^{-1}(\pmb x_i-\pmb\mu)\leqslant q_{\alpha}\}\label{a}\tag{0}$$

where $q_{\alpha}=\chi^2_{p}(\alpha)$ is the $\alpha$-quantile of the $\chi^2_p$ distribution, $0<\alpha\leqslant 1$, and

$$C_{\alpha}=\mbox{cov}_{i\in S_{\alpha}}\pmb x_i\label{b}\tag{1}$$

Then, asymptotically, $C_{\alpha}$ converges to $l_{\alpha}\varSigma$ where

$$l_{\alpha}=\frac{ F_{\chi^2_{p+2}}(q_{\alpha}) }{\alpha}\label{c}\tag{2}$$

This approximation is really good (here for $\alpha=60/70$):

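library(MASS)
alpha <- 60/70
p <- 2
n <- 1000000

# Empirical covariance of the points inside the ball of probability alpha
radius <- sqrt(qchisq(alpha, df = p))
x0 <- mvrnorm(n, rep(0, p), diag(p), empirical = TRUE)
Id <- which(rowSums(x0 * x0) <= radius^2)
cov(x0[Id, ])

# Theoretical limit l_alpha * I_p from equation (2)
qalpha <- qchisq(alpha, df = p)
diag(pchisq(qalpha, df = p + 2) / alpha, p)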
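
To look at other values of $\alpha$ and $p$, the theoretical factor in (2) can be wrapped in a small helper. This is only a sketch: `consistency_factor` is a name made up here for illustration (it is not from [1] or from any package), and it assumes nothing beyond base R's `qchisq` and `pchisq`.

```r
# Hypothetical helper: the factor l_alpha from equation (2), i.e. the
# limit of C_alpha is l_alpha * Sigma under the Gaussian model.
consistency_factor <- function(alpha, p) {
  stopifnot(alpha > 0, alpha <= 1, p >= 1)
  q_alpha <- qchisq(alpha, df = p)        # alpha-quantile of chi^2_p
  pchisq(q_alpha, df = p + 2) / alpha     # F_{chi^2_{p+2}}(q_alpha) / alpha
}

# The factor shrinks as more of the tail is trimmed and equals 1 at alpha = 1
sapply(c(0.5, 60/70, 1), consistency_factor, p = 2)
```

With `alpha = 1` nothing is trimmed, so the factor is 1 and the ordinary covariance is recovered, which is a quick sanity check on the formula.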

So, finally, to answer the question, the $\log$ determinant of the covariance matrix of the $[\alpha n]$ observations with smallest Euclidean distance to the origin (this is the particular case where $\varSigma=\pmb I_p$ and $\pmb \mu=\pmb 0_p$) is given by:

$$p\log F_{\chi^2_{p+2}}(q_{\alpha})-p\log\alpha\label{d}\tag{3}$$
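
As a worked special case (my own specialisation of the formula above to $p=2$, not something taken from [1]): the $\chi^2_2$ and $\chi^2_4$ distribution functions have closed forms, $F_{\chi^2_2}(x)=1-e^{-x/2}$ and $F_{\chi^2_4}(x)=1-e^{-x/2}(1+x/2)$, so that

$$q_{\alpha}=-2\log(1-\alpha),\qquad F_{\chi^2_{4}}(q_{\alpha})=\alpha+(1-\alpha)\log(1-\alpha)$$

and the log determinant for the bivariate case becomes

$$2\log\left(1+\frac{(1-\alpha)\log(1-\alpha)}{\alpha}\right)$$

For $\alpha=52/70$ this evaluates to approximately $-1.27$, in line with the simulated value of about $-1.28$ reported in the question.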

  1. Croux C., Haesbroeck G. (1999). Influence function and efficiency of the minimum covariance determinant scatter matrix estimator. Journal of Multivariate Analysis. 71. 161--190.
user603

Say $X\sim N_n(0,\Sigma)$, where $\Sigma$ is positive definite with $n$ eigenvalues $\lambda_1,\lambda_2,\dots,\lambda_n$. Then the constant-density contours are ellipsoids whose $i$th semi-axis has length $r_i = \sqrt{\chi^2_{\alpha,n}\lambda_i}$, and therefore the volume of the hyper-ellipsoid can be found as

$$ \big( \prod_{i=1}^nr_i \big)\pi^{n/2} \Big/ \Gamma\big(1+\dfrac{n}{2}\big). $$
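
To make the link with the determinant explicit (a short added step, using only $\det\Sigma=\prod_{i=1}^n\lambda_i$), the product of the semi-axes factors as

$$ \prod_{i=1}^n r_i = \big(\chi^2_{\alpha,n}\big)^{n/2}\prod_{i=1}^n\sqrt{\lambda_i} = \big(\chi^2_{\alpha,n}\big)^{n/2}\sqrt{\det\Sigma}, $$

so for fixed $\alpha$ and $n$ the volume is proportional to $\sqrt{\det\Sigma}$.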

Car Loz