4

I have data that I can assume will be multivariate normal with a known mean vector mu and known covariance matrix sigma, and I'm looking to identify points that fall outside the 95% ellipse. I think the proper way to do this is to apply a transform to the data to a standard multivariate normal, compute the euclidean distances to zero, and compare against the value derived from running mvtnorm::qmvnorm(.95,tail='both',mean=mu,sigma=sigma). However, if this is correct (and feel free to correct me if there's a simpler solution to labelling the points outside the 95% ellipse), I'm struggling with the transform part. I know I'd subtract the mean vector mu from the observed data matrix, and I need to do something involving the resulting matrix and sigma, but I'm lost on what. Some code below showing some data and ending where I'm stuck:

mu = c(100,50)
sigma = matrix(c(100,210,210,900),nrow=2)
obs_data = MASS::mvrnorm(1e2,mu=mu,Sigma=sigma)
transformed = t(t(obs_data) - mu)
#what now?
Mike Lawrence
  • 13,793
  • 1
    See https://stats.stackexchange.com/questions/62092. – whuber Aug 19 '20 at 18:44
  • 1
    Ah! So the Mahalanobis distance is the euclidean distance-from-zero after the transform, so I can just use the stats::mahalanobis() function. Neat! Quick follow-up @whuber: do compare these distances to the 95% ellipse (or hypersphere) would I be getting the comparison value via mvtnorm::qmvnorm(.95,tail='both',mean=mu,sigma=sigma) or mvtnorm::qmvnorm(.95,tail='both',sigma=diag(2))? (i.e. do I need to tell qmvrnorm the original mean & covariance?) I'm getting very different results between these two. – Mike Lawrence Aug 19 '20 at 20:03
  • Oh, looking at the help page for stats::mahalanobis() and it shows that the distances should follow the same quantile function as a chi-square distribution with df=ncol(x), so I don't even need qmvnorm. When I use qchisq(.95,df=ncol(dat)), I do indeed observe that 95% of the distances exceed this value. So I wonder what qmvnorm() is doing or is supposed to be used for? – Mike Lawrence Aug 19 '20 at 20:57
  • 1
    qmvnorm gives you the half-width of a mean-centered square (or, generally, hypercube) in the original coordinates. In the transformed coordinates that region would correspond to a generalized rhombus. – whuber Aug 19 '20 at 21:02

1 Answers1

7

With the help of the content here, I was able to work this out and thought I'd post the resulting code and verification:

#define a function to transform a standard multivariate normal (zero mean, 
# identity covariance) to a non-standard multivariate normal with specified
# means and covariance.
std_to_mvn = function(Z,mu,sigma){
    tchol_sigma = t(chol(sigma))
    X_0centered = t(tchol_sigma %*% t(Z))
    X = t(t(X_0centered)+mu) #add means
    return(X)
}

#define a function to transform a multivariate normal with mean mu and covariance

sigma to a standard multivariate normal with zero mean and identity covariance

mvn_to_std = function(X,mu,sigma){ tchol_sigma = t(chol(sigma)) X_0centered = t(t(X)-mu) Z = t(solve(tchol_sigma,t(X_0centered))) }

#Check that we can induce the non-standard multivariate normal

#first,sample from a "standard" multivariate normal z = MASS::mvrnorm( n = 1e5 , mu = rep(0,times=3) , Sigma = diag(3) )

#next, specify the parameters (arbitrary; taken from https://jamesmccaffrey.wordpress.com/2017/11/03/example-of-calculating-a-covariance-matrix/) mu = c(68, 600, 40) sigma = matrix( c( 11.5 , 50, 34.75 , 50, 1250, 205 , 34.75, 205, 110 ) , nrow = 3 , ncol = 3 )

#now, transform and check that we get what we expected x = std_to_mvn(z,mu,sigma) max(abs(colMeans(x)-mu)) # <1 max(abs(cov(x)-sigma)) # <1

#Check that we can transform a non-standard multivariate normal to standard x = MASS::mvrnorm( n = 1e5 , mu = mu , Sigma = sigma )

#now, transform and check that we get what we expected z = mvn_to_std(x,mu,sigma) max(abs(colMeans(z))) # <1 max(abs(cov(z)-diag(3))) # <1

Mike Lawrence
  • 13,793