
I'd like to fit a Gaussian to some experimental data that is binned (the binning is a result of the physical limits of the device). Importantly, the bin size is large enough that the Gaussian cannot be considered flat within a bin (see the illustration below). The data is actually 3D, but let's consider the 1D case to start. How does one write a likelihood function for the goodness of fit?

My intuition is to treat each bin as independent and compare the bin density with the Gaussian density integrated over the bin window: $$ \begin{align} p(D|\Theta) &= \prod_{i=1}^N p(d_i|\Theta) \\ &= \prod_{i=1}^N f\left(d_i - \int_{x_i}^{x_{i+1}}\phi(x|\mu,\sigma)\,dx\right) \end{align} $$ where $N$ is the number of bins, $d_i$ is the bin height for bin $i$, $\phi(x|\mu,\sigma)$ is the Gaussian PDF, and the integral runs over the bin width. My question is: what should I use for $f$? In other words, how is the discrepancy between $d_i$ and the integrated Gaussian density distributed?

Key additional questions:

  • How does this likelihood function change for higher dimensions?
  • The integration of a Gaussian over a finite bin is fairly expensive to compute. Since my problem is 3D, I'm going to have to do this numerical integration for millions of bins. Is there a faster way to do it? (See the sketch below.)
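One possible shortcut, sketched below under the assumption of axis-aligned bins and a diagonal covariance (which may not match the real device or model): in that case the integral of the Gaussian over a 3D box factorizes into a product of three 1D CDF differences, so millions of bin probabilities reduce to three short CDF evaluations and one outer product. All names in the code are illustrative.

```python
import numpy as np
from scipy.stats import norm

# Sketch only: assumes axis-aligned bins and a *diagonal* covariance,
# so the 3D box integral factorizes into three 1D CDF differences.
edges = [np.linspace(-5.0, 5.0, 101) for _ in range(3)]  # bin edges per axis (placeholder)
mu    = np.array([0.3, -0.7, 1.1])                       # placeholder parameters
sigma = np.array([1.2, 0.8, 1.5])

# 1D bin probabilities along each axis: differences of the Gaussian CDF at the edges.
p1d = [np.diff(norm.cdf(e, loc=m, scale=s)) for e, m, s in zip(edges, mu, sigma)]

# Probability mass of every 3D bin at once: a single outer product, no cubature.
p3d = np.einsum('i,j,k->ijk', *p1d)   # shape (100, 100, 100), sums to ~1
```

For a general (non-diagonal) covariance the box probability no longer factorizes, and Genz-type algorithms for multivariate normal rectangle probabilities (which is what I believe `scipy.stats.multivariate_normal.cdf` uses internally) would be needed instead.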

[Figure: illustration of the problem]

cgreen
  • Although you discuss goodness of fit, it sounds like what you really want to do is the fitting itself--that is, to estimate the 3D means and covariance matrices. Would this in fact be the case? – whuber Nov 14 '14 at 03:40
  • Indeed the goal is to do the fitting. The reason I'm asking for help with the likelihood function is that my eventual goal is to incorporate priors, so everything needs to be probabilistic. If that makes any sense :) – cgreen Nov 14 '14 at 03:44
  • Sure, it makes sense. I also think things might not be as bleak as you make out: you likely can get an excellent initial fit by sampling each bin at its center (and computing the density there), and then improve on that fit if you like by slightly intensifying that sampling, such as moving to $8$ points per bin, etc. This actually promises to be faster than if you had all the unbinned data. – whuber Nov 14 '14 at 03:47
  • I've thought about sampling points from the histogram, but doesn't that affect the probability weight? Meaning, doesn't the likelihood value change between comparing to, say, $10^7$ vs $10^6$ points? – cgreen Nov 14 '14 at 03:52
  • You've lost me there. If you have a bin of volume $V$ with $k$ values in it, the center of the bin is located at $\mathbf x$, and the parameter values are $\mu,\Sigma$, then the contribution of that bin to the likelihood can be approximated as $(\phi(\mathbf x; \mu, \Sigma)V)^k$ where $\phi$ is the density function. When all bins have the same volume you can ignore the factor of $V^k$ altogether (which does not vary with $\mu$ or $\Sigma$), in effect treating all the data in a bin as if they were equal to the bin's center. – whuber Nov 14 '14 at 03:58
  • @whuber: Are you sure that you can do that? If you're fitting a Gaussian, you want the mean and second moment, but the second moment of an average point drawn from a bin won't be the second moment of the center of the bin for most bins. Shouldn't you use the average second moment? – Neil G Nov 14 '14 at 04:29
  • I think your product of differences can be negative, which can't work as a likelihood. – Neil G Nov 14 '14 at 11:44
  • @Neil Yes, I'm sure. Remember, I'm talking about using an approximation to achieve an initial fit quickly and inexpensively and then "polishing" that fit based on the full and correct likelihood. Worked examples of how that polishing step would go are presented at http://stats.stackexchange.com/a/12491 and http://stats.stackexchange.com/a/68238. – whuber Nov 14 '14 at 16:24
  • @whuber: Ah, in your second link you list something called Sheppard's corrections, which is what I was saying: Using the middle of the intervals alters the variance, which you should account for. – Neil G Nov 14 '14 at 17:40
  • @Neil That is correct--but in finding the initial ML approximation it is an unnecessary complication, especially in more than one dimension (where the analog of Sheppard's corrections can be expected to be appropriate only for uncorrelated Gaussians). The idea here is to sneak up on a correct ML solution in two steps: the first step essentially ignores the binning, pretending all data are located either in the bin centers, randomly within their bins, or equally spaced within them; after initial estimates of the parameters are obtained, then a better calculation is performed. – whuber Nov 14 '14 at 17:43
  • @whuber: fantastic, thanks for your help! I also came across this paper for fitting GMMs to binned data (my eventual goal), and it suggests sampling a random point from each bin, with probability in proportion to the distance between the Gaussian and the bin height. And I guess drawing a new point each update: http://www.sciencedirect.com/science/article/pii/S0167947309003739 – cgreen Nov 14 '14 at 18:39
  • @whuber quick follow-up. The likelihood you suggested, how is it related to other metrics like the cross-correlation coefficient (CCC)? One could imagine simulating a histogram from the Gaussian and using the CCC (or another metric) to compare it to the data. – cgreen Nov 17 '14 at 18:31
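For reference, here is a minimal sketch of the bin-center approximation whuber describes in the comments above, assuming equal-volume bins (so the $V^k$ factor drops out). For a single Gaussian, maximizing this approximate likelihood has a closed form: the count-weighted mean and covariance of the bin centers. Array names are illustrative.

```python
import numpy as np

def initial_fit(centers, counts):
    """Approximate ML fit treating every point in a bin as sitting at its center.

    centers: (n_bins, 3) array of bin centers; counts: (n_bins,) occupancies.
    Maximizing prod_i phi(center_i; mu, Sigma)**counts_i gives the
    count-weighted mean and covariance in closed form.
    """
    w = counts / counts.sum()
    mu0 = w @ centers                      # weighted mean of the centers
    d = centers - mu0
    cov0 = (w[:, None] * d).T @ d          # weighted (biased) covariance
    return mu0, cov0
```

These estimates can then serve as the starting point for a fit under the exact bin-integrated likelihood.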

3 Answers


If you know that $y_i \in [x_j, x_{j+1})$, where the $x_j$'s are the bin cut points, then you can treat this as interval-censored data. In other words, for your case, you can define your likelihood function as

$\displaystyle \prod_{i = 1}^n (\Phi(r_i|\mu, \sigma) - \Phi(l_i|\mu, \sigma) )$

where $l_i$ and $r_i$ are the lower and upper limits of the bin in which the exact value lies.

Note that the log-likelihood is not strictly concave for many interval-censored data models, but in practice this is of little consequence.
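A minimal 1D sketch of evaluating this likelihood in log form, assuming `edges` holds the cut points $x_j$ in increasing order and `counts` holds the number of observations falling in each bin (variable names are illustrative):

```python
import numpy as np
from scipy.stats import norm

def binned_loglik(mu, sigma, edges, counts):
    # P(l_i < Y <= r_i) for every bin, via differences of the Gaussian CDF.
    p = np.diff(norm.cdf(edges, loc=mu, scale=sigma))
    p = np.clip(p, 1e-300, None)          # guard against log(0) for empty tails
    # Each observation in bin i contributes log(Phi(r_i) - Phi(l_i)).
    return np.sum(counts * np.log(p))
```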

Cliff AB

You should treat each bin as if it were generating random points uniformly within its bounds. For each bin $(x_l, x_h]$ this gives $E(x) = \frac{x_h + x_l}{2}$ and $E(x^2) = \frac{x_h^2 + x_l x_h + x_l^2}{3}$; take the count-weighted average of these moments over all bins. This weighted average determines a Gaussian.

You can incorporate a prior by treating this Gaussian as a likelihood.
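A short sketch of this recipe, assuming the 1D bins are described by arrays `left`, `right`, and `counts` (illustrative names):

```python
import numpy as np

def moment_fit(left, right, counts):
    w = counts / counts.sum()
    Ex  = w @ ((left + right) / 2)                        # count-weighted E(x)
    Ex2 = w @ ((right**2 + left * right + left**2) / 3)   # count-weighted E(x^2)
    return Ex, Ex2 - Ex**2        # mean and variance of the implied Gaussian
```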

Neil G
  • That's a little confusing, can you elaborate? I understand the idea of each bin generating uniform points, but what do you mean they determine a Gaussian? – cgreen Nov 14 '14 at 05:00
  • The mean and second moment are one way to parametrize a Gaussian. You could have done mean and variance if you like. – Neil G Nov 14 '14 at 05:03
  • Ah right of course. Then how do you compare these bin gaussians to the fitting function? – cgreen Nov 14 '14 at 05:05
  • @GreatM31: What's a fitting function? – Neil G Nov 14 '14 at 05:07
  • Sorry I mean the Gaussian that I'm trying to fit to the data.... – cgreen Nov 14 '14 at 05:07
  • @GreatM31 I suggested that you take the weighted average over all bins to produce one Gaussian. Alternatively, this is equivalent to finding a Gaussian for each bin and then taking a weighted average of "expectation parameters", which are precisely the mean and second moment. – Neil G Nov 14 '14 at 05:08
  • OK I see. I was hoping to have a likelihood in the form that I put above, because I wanted it to generalize to Gaussian mixture models in higher dimensions... – cgreen Nov 14 '14 at 09:19
  • It is inconsistent to assume the distribution is Gaussian and is uniform within each bin. That inconsistency is going to cause more and more trouble as the dimension increases. – whuber Mar 01 '15 at 21:37
  • @whuber: you're right. I guess this could be a starting point of an iterative approach. – Neil G Mar 01 '15 at 23:36

Given whuber's comment on my last answer, I suggest you use that answer to find a mean and variance $\mu, \sigma^2$ as a starting point. Then calculate the log-likelihood $\ell$ of having observed the bin counts you got. Finally, optimize the mean and variance by gradient descent; it should be easy to calculate the gradients of the log-likelihood with respect to the parameters. This log-likelihood seems to me to be convex.
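A sketch of that refinement, assuming 1D bins described by `left`, `right`, and `counts`; it writes out the analytic gradients of the interval-censored log-likelihood but hands them to a quasi-Newton optimizer rather than hand-rolled gradient descent (all names illustrative):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def refine_fit(left, right, counts, mu0, sigma0):
    def negloglik_and_grad(theta):
        mu, sigma = theta
        zl, zr = (left - mu) / sigma, (right - mu) / sigma
        p = np.clip(norm.cdf(zr) - norm.cdf(zl), 1e-300, None)   # bin probabilities
        # d/dmu    log p_i = (phi(zl) - phi(zr)) / (sigma * p_i)
        # d/dsigma log p_i = (zl*phi(zl) - zr*phi(zr)) / (sigma * p_i)
        g_mu    = np.sum(counts * (norm.pdf(zl) - norm.pdf(zr)) / (sigma * p))
        g_sigma = np.sum(counts * (zl * norm.pdf(zl) - zr * norm.pdf(zr)) / (sigma * p))
        return -np.sum(counts * np.log(p)), -np.array([g_mu, g_sigma])

    res = minimize(negloglik_and_grad, x0=[mu0, sigma0], jac=True,
                   method='L-BFGS-B', bounds=[(None, None), (1e-9, None)])
    return res.x   # refined (mu, sigma)
```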

Neil G