
I am applying the Bhattacharyya distance to multivariate normal distributions,

$$D_{B}={1 \over 8}({\boldsymbol \mu }_{1}-{\boldsymbol \mu }_{2})^{T}{\boldsymbol \Sigma }^{{-1}}({\boldsymbol \mu }_{1}-{\boldsymbol \mu }_{2})+{1 \over 2}\ln \,\left({\det {\boldsymbol \Sigma } \over {\sqrt {\det {\boldsymbol \Sigma }_{1}\,\det {\boldsymbol \Sigma }_{2}}}}\right),$$

where ${\boldsymbol \mu }_{i}$ and ${\boldsymbol \Sigma }_{i}$ are the means and covariances of the distributions, and

$${\boldsymbol \Sigma }={{\boldsymbol \Sigma }_{1}+{\boldsymbol \Sigma }_{2} \over 2}.$$

The determinant of the covariance matrix turns out to be approximately $0.$
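
For illustration (this is not my actual data), one way the determinant can end up numerically zero is simple underflow: the product of many modest variances drops below the smallest representable double, e.g. in numpy:

```python
import numpy as np

d = 400
Sigma = np.diag(np.full(d, 0.01))   # variance 0.01 in each of 400 dimensions

print(np.linalg.det(Sigma))         # 0.0 -- the product 0.01**400 underflows
print(np.linalg.slogdet(Sigma))     # sign 1.0, log-determinant about -1842 -- still finite
```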

What is the right way to handle this situation?

  • All help is appreciated – Faiz Kidwai May 22 '19 at 17:46
  • Please also see the closely related thread at https://stats.stackexchange.com/questions/106325/jeffries-matusita-distance-for-14-variables/106352#106352. – whuber Oct 14 '19 at 12:23
  • It appears to be solely a numerical issue. In theory, given $\Sigma_1 > 0, \Sigma_2 > 0$, then it is guaranteed that $\Sigma > 0$. In other words, if you observed $\det(\Sigma) \approx 0$, then both of the inputs $\det(\Sigma_1)$ and $\det(\Sigma_2)$ should be near $0$, in view of the inequality $\det(A + B) \geq \det(A) + \det(B)$ when $A \geq 0, B \geq 0$. – Zhanxiong Feb 20 '23 at 21:09

3 Answers


This distance will be infinite whenever either of the distributions is singular with respect to the other. This means the only cases where it will not be infinite are where the distributions have a common support, which will necessarily be an affine subspace of $\mathbb{R}^d$ on which both $\mu_1$ and $\mu_2$ lie. When that is a proper subspace, all three determinants $|\Sigma|,$ $|\Sigma_1|,$ and $|\Sigma_2|$ will be zero.

The key is to find this common subspace. It will lie within the span of $\Sigma.$ Writing its Singular Value Decomposition

$$\Sigma = U D V^\prime$$

for orthogonal matrices $U$ and $V$ and diagonal matrix $D$ with positive elements (the singular values),

  1. Identify the singular values that are essentially zero (compared to the others up to the precision of your arithmetic).

  2. Set $U^{*}$ to be the matrix formed by columns of $U$ corresponding to the nonzero singular values, $D^{*}$ to be the diagonal matrix of nonzero singular values, and $V^{*}$ to be the matrix formed by the corresponding columns of $V$.

  3. Change the basis in which all three matrices are represented by forming $$\Sigma_i^{*} = {U^{*}}^\prime \Sigma_i V^{*}.$$

  4. In this basis, $\Sigma = D^{*}$ and so its determinant is the product of the nonzero singular values. Use the determinants of the $\Sigma_i^{*}$ in the Bhattacharyya distance formula.

Of course you need to check that $\mu_1-\mu_2$ lies in the span of $\Sigma.$ That's now easy: all the components of $U^\prime (\mu_1-\mu_2)$ corresponding to the zero singular values must themselves be zero.


To illustrate, here is R code implementing this strategy. First, though, let me show an example of its use where both distributions are singular but, because they have a common support on the line $y=x-1,$ their distance is not infinite:

```r
distance.Battacharya(c(1,0), c(2,1), matrix(1,2,2), matrix(1,2,2))
# [1] 0.125
```

Here, $\Sigma_1 = \Sigma_2 = \begin{pmatrix}1&1\\1&1\end{pmatrix}.$

```r
distance.Battacharya <- function(mu1, mu2, Sigma1, Sigma2) {
  d <- length(mu1)
  if (d != length(mu2)) return(Inf)

  Sigma <- (Sigma1 + Sigma2) / 2
  fit <- svd(Sigma)
  i <- zapsmall(fit$d) != 0
  #
  # Compute the log determinant terms.
  #
  logdet <- function(X) determinant(t(fit$u[, i]) %*% X %*% fit$v[, i], logarithm = TRUE)$modulus
  q <- c(sum(log(fit$d[i])), -logdet(Sigma1)/2, -logdet(Sigma2)/2)
  #
  # Compute the Mahalanobis distance term.
  #
  dmu <- mu1 - mu2
  if (any(t(fit$u[, !i]) %*% dmu != 0)) return(Inf)
  y <- fit$v[, i] %*% diag(1/fit$d[i], sum(i), sum(i)) %*% t(fit$u[, i]) %*% dmu

  sum(dmu * y)/8 + sum(q)/2
}
```
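
The same strategy carries over directly to numpy; here is a minimal sketch of such a translation (the tolerance `tol` is arbitrary and may need adjusting for your data):

```python
import numpy as np

def bhattacharyya_singular(mu1, mu2, Sigma1, Sigma2, tol=1e-12):
    """Bhattacharyya distance that tolerates singular covariances by working
    in the column space of Sigma = (Sigma1 + Sigma2)/2."""
    mu1 = np.asarray(mu1, dtype=float)
    mu2 = np.asarray(mu2, dtype=float)
    Sigma1 = np.asarray(Sigma1, dtype=float)
    Sigma2 = np.asarray(Sigma2, dtype=float)
    Sigma = (Sigma1 + Sigma2) / 2

    U, s, Vt = np.linalg.svd(Sigma)
    keep = s > tol * s.max()          # singular values that are essentially nonzero
    dmu = mu1 - mu2

    # If mu1 - mu2 has a component outside the span of Sigma, the supports
    # are disjoint and the distance is infinite.
    if np.any(np.abs(U[:, ~keep].T @ dmu) > tol):
        return np.inf

    # Log-determinants computed in the reduced basis.
    def logdet(X):
        return np.linalg.slogdet(U[:, keep].T @ X @ Vt.T[:, keep])[1]

    log_det_Sigma = np.sum(np.log(s[keep]))

    # Mahalanobis-type term via the pseudo-inverse restricted to the span.
    y = Vt.T[:, keep] @ ((U[:, keep].T @ dmu) / s[keep])
    term1 = dmu @ y / 8
    term2 = 0.5 * (log_det_Sigma - 0.5 * logdet(Sigma1) - 0.5 * logdet(Sigma2))
    return term1 + term2

# Reproduces the example above: 0.125
print(bhattacharyya_singular([1, 0], [2, 1], np.ones((2, 2)), np.ones((2, 2))))
```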

– whuber

I have not studied mathematics formally, but I have worked with data sets as an engineer.

When the determinant of your covariance matrix is zero:

1) either the data lie on parallel lines, parallel planes, etc., or 2) they all lie on the same point, line, plane, etc.

More generally, if the determinant of a matrix (not necessarily a covariance matrix) is zero and you multiply it by a data matrix, that singular matrix collapses all of your data onto a line, a plane, or onto zero.

I am not sure whether the following approaches are sound: 1) Pre-processing step: sometimes a small amount of random noise is added to the data before any further analysis; the covariance matrix of the jittered data is then no longer singular.

2) Alternatively, you can add very small values to the elements of your covariance matrix (especially its diagonal elements); if the values are small enough, nothing substantive should change (a minimal sketch of this follows below). But because the determinant of your matrix is almost zero, when you multiply it by another matrix you have to make sure the result is not dominated by rounding error. That depends on how large or small your data values are: multiplying a very small number by a very large one often introduces numerical error, and there are ways to handle that as well.
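
To illustrate that second suggestion, here is a minimal numpy sketch; the ridge size `eps` is only a placeholder and would have to be scaled to the magnitude of your data:

```python
import numpy as np

def regularize_covariance(Sigma, eps=1e-9):
    """Add a small multiple of the identity to the diagonal so the covariance
    matrix is no longer (numerically) singular. eps is a placeholder value."""
    Sigma = np.asarray(Sigma, dtype=float)
    d = Sigma.shape[0]
    # Scale the ridge by the average variance so it stays very small
    # relative to the data themselves.
    ridge = eps * np.trace(Sigma) / d
    return Sigma + ridge * np.eye(d)

# A singular 2x2 covariance (all mass lies on one line) becomes invertible:
Sigma = np.array([[1.0, 1.0], [1.0, 1.0]])
print(np.linalg.det(regularize_covariance(Sigma)))  # tiny but nonzero (about 2e-9)
```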


It's a precision issue. If you are using a diagonal covariance matrix, the determinant is just the product of the diagonal entries. If these entries are less than 1, the product shrinks with every factor until it underflows the floating-point range, giving a value of 0.0.

To avoid this you can develop the equation further by expanding the log term, which converts the product of the diagonal entries into a sum of logarithms.
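
Concretely, writing $\bar\sigma_i^2 = (\sigma_{1,i}^2 + \sigma_{2,i}^2)/2$ for the averaged diagonal variances, the log term becomes

$$\ln\frac{\prod_i \bar\sigma_i^2}{\sqrt{\prod_i \sigma_{1,i}^2\,\prod_i \sigma_{2,i}^2}}=\sum_i \ln\bar\sigma_i^2-\frac{1}{2}\sum_i \ln\sigma_{1,i}^2-\frac{1}{2}\sum_i \ln\sigma_{2,i}^2,$$

so no product of hundreds of small numbers is ever formed.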

The code below defines a Gaussian class holding a mean vector and the diagonal of the covariance matrix, and calculates the Bhattacharyya distance with the D_Bhatt() and D_Bhatt_log() functions, the latter computing the solution in log-space.

```python
import numpy as np


class Gaussian:
    """Gaussian with diagonal covariance matrix."""

    def __init__(
        self,
        mean: np.ndarray = None,
        cov: np.ndarray = None,
        mean_inv_cov: np.ndarray = None,
        inv_cov: np.ndarray = None,
    ):
        if mean is not None and cov is not None:
            assert mean.shape == cov.shape

            self.mean = mean.reshape(-1, 1)
            self.cov = cov.reshape(-1, 1)
            self.inv_cov = 1 / self.cov

        self.d = self.mean.shape[0]
        self.det = np.prod(self.cov)
        self.log_det = np.sum(np.log(self.cov))

    def __key(self):
        return (self.mean.data.tobytes(), self.cov.data.tobytes())

    def __hash__(self):
        return hash(self.__key())

    def __eq__(self, other):
        return np.all(self.mean == other.mean) and np.all(self.cov == other.cov)

    @staticmethod
    @np.vectorize
    def D_Bhatt(g1: "Gaussian", g2: "Gaussian") -> float:
        """Bhattacharyya distance."""
        mean_cov = 0.5 * (g1.cov + g2.cov)
        mean_inv_cov = 1 / mean_cov
        term_1 = 0.125 * (g1.mean - g2.mean).T.dot(mean_inv_cov * (g1.mean - g2.mean))
        term_2 = 0.5 * np.log(np.prod(mean_cov) / np.sqrt(g1.det * g2.det))
        return term_1 + term_2

    @staticmethod
    @np.vectorize
    def D_Bhatt_log(g1: "Gaussian", g2: "Gaussian") -> float:
        """Bhattacharyya distance using log."""
        mean_cov = 0.5 * (g1.cov + g2.cov)
        mean_inv_cov = 1 / mean_cov
        term_1 = 0.125 * (g1.mean - g2.mean).T.dot(mean_inv_cov * (g1.mean - g2.mean))
        term_2 = 0.5 * (
            np.sum(np.log(mean_cov))
            - (np.sum(np.log(np.sqrt(g1.cov))) + np.sum(np.log(np.sqrt(g2.cov))))
        )
        return term_1 + term_2
```
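
As a quick check of the difference (an illustration with made-up 500-dimensional Gaussians, not part of the class itself):

```python
d = 500
g1 = Gaussian(mean=np.zeros(d), cov=np.full(d, 1e-3))
g2 = Gaussian(mean=np.full(d, 0.01), cov=np.full(d, 1e-3))

# np.prod over 500 variances underflows to 0.0, so the ratio inside the log is 0/0 -> nan.
print(Gaussian.D_Bhatt(g1, g2))
# The log-space version replaces the products with sums of logs and returns a finite 6.25.
print(Gaussian.D_Bhatt_log(g1, g2))
```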