I'm working on what I think is a fairly simple, straightforward explanation of why it's really hard to accurately approximate fat-tailed distributions in the tails. I started with an example (which turned out to be harder than I expected): the KL divergence between a Cauchy RV and an MLE approximation of that Cauchy RV. I realized I'm not sure how to do this analytically, and it seems really messy. (Suggestions for the analytic approach would be appreciated, but that's a more math-oriented problem; if you want to help with it, see the question here: https://math.stackexchange.com/questions/2733994/finding-kl-divergence-between-a-distribution-and-the-mle-estimate-of-that-distri)
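For reference, the target quantity is D_KL(p || q) = ∫ p(x) log(p(x)/q(x)) dx. Even without a closed form, it can at least be checked numerically for two Cauchy densities; a minimal sketch (the location/scale values here are illustrative stand-ins for what an MLE fit might return):

```r
# Numeric D_KL(p || q) between a standard Cauchy p and a Cauchy q
# with hypothetical fitted parameters (loc, scale).
kl_cauchy_numeric <- function(loc, scale) {
  integrand <- function(x) {
    p <- dcauchy(x, 0, 1)
    q <- dcauchy(x, loc, scale)
    p * log(p / q)
  }
  integrate(integrand, -Inf, Inf)$value
}

kl_cauchy_numeric(0.1, 1.1)  # small positive number
```

The integrand decays like 1/x^2 in the tails (the log-ratio tends to a constant), so `integrate` over (-Inf, Inf) behaves well here.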
In any case, to get a handle on this, I built a simulation in R, and the histogram of the simulated KL divergences looks really strange; I think something is wrong, or I'm creating an artifact through how I'm simulating. (I have tried changing the number of breaks for the histogram and the number of samples; neither helps.)
So first, I wrote a function to simulate a single KL divergence:
library(fitdistrplus)  # for fitdist()

simulate_cauchy_KLD <- function() {
  a <- rcauchy(5000, location = 0, scale = 1)
  coefp <- coef(fitdist(a, 'cauchy', start = list(location = 0, scale = 1)))
  # Bin the sample; with plot = TRUE I'd also pass freq = F, ylim = c(0, .35), xlim = c(-10, 10)
  h <- hist(a, breaks = 10000, plot = FALSE)
  ppar <- dcauchy(h$mids, coefp[1], coefp[2])  # fitted density at bin midpoints
  trus <- h$density                            # empirical density
  rm(h)
  keep <- which(trus > 0)  # drop empty bins so the log is finite
  ppar <- ppar[keep]
  trus <- trus[keep]
  return(sum(trus * log10(trus / ppar)))
}
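One thing worth checking along the way (a sketch of a diagnostic, not a conclusion): `breaks = 10000` in `hist()` is only a suggestion. The actual breakpoints are chosen to be "pretty" values based on `range(a)`, so with heavy-tailed draws the realized bin count and bin width can jump between a few discrete values from one simulation to the next:

```r
set.seed(42)  # arbitrary seed, just to make this check reproducible
a <- rcauchy(5000, 0, 1)
h <- hist(a, breaks = 10000, plot = FALSE)
length(h$breaks) - 1  # realized number of bins, generally not 10000
diff(h$breaks)[1]     # realized bin width, driven by range(a)
```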
Now, I simulate it a bunch of times:
trials <- 3000  # takes about 1 minute on my system; seems like a good number
KLD_cauchy <- rep(0, trials)
for (i in 1:trials) {
  KLD_cauchy[i] <- simulate_cauchy_KLD()
}
hist(KLD_cauchy)
I get a histogram with these really unexpected groupings:

[histogram of KLD_cauchy showing the unexpected groupings]
I'm currently at a loss to understand this. I'd really like either an explanation of what I'm doing wrong, or some idea why this would occur.
After discussion, it was suggested that I try the same process with a normal distribution, and I got the results below.

[histogram of the KL divergences for the normal-distribution version]
But despite the inaccuracy of this method, the simulation produces a really weird and unexpected multimodality, and I don't understand why it exists, or whether it's a simulation artifact.
– David Manheim Apr 12 '18 at 16:34