I have data aggregated as a histogram

$$ (m_1, c_1), (m_2, c_2), \dots, (m_k, c_k) $$

where $m_1 < m_2 < \dots < m_k$ are the midpoints of the histogram bins and $c_i$ are the counts, which sum to the total number of raw data points, $\sum_i c_i = N$. The count $c_i$ covers the points in the range $[m_i-d, m_i+d)$. The histogram bins are evenly spaced and have width $2d$, so $m_{i+1} = m_i + 2d$.

Such a histogram can be used to create a weighted kernel density estimate

$$ f(x) = \sum_i w_i K_h(x - m_i) $$

with weights $w_i = c_i / \sum_j c_j$. The problem is finding an appropriate bandwidth $h$ for the kernel $K_h$. What would be the most promising way of doing this?
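To make the setup concrete, here is a minimal sketch of the estimator $f(x)$ above. It assumes a Gaussian kernel for $K_h$, which the question does not fix, and the function names are hypothetical. The helper `normal_reference_bandwidth` applies the standard normal reference rule $h = 1.06\,\hat\sigma N^{-1/5}$ to the binned data by treating every point in a bin as if it sat at the midpoint. That is only one possible baseline under that assumption, not a claim that it is the right choice for aggregated data.

```python
import numpy as np

def weighted_kde(x, midpoints, counts, h):
    """Evaluate f(x) = sum_i w_i K_h(x - m_i) with a Gaussian kernel (assumed)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    m = np.asarray(midpoints, dtype=float)
    w = np.asarray(counts, dtype=float)
    w = w / w.sum()                      # w_i = c_i / sum_j c_j
    z = (x[:, None] - m[None, :]) / h    # shape (len(x), k)
    kernel = np.exp(-0.5 * z**2) / (h * np.sqrt(2.0 * np.pi))
    return kernel @ w

def normal_reference_bandwidth(midpoints, counts):
    """Normal reference rule, with the weighted std of the midpoints as sigma.

    Assumption: every raw point is placed at its bin midpoint, so the
    within-bin spread is ignored.
    """
    m = np.asarray(midpoints, dtype=float)
    c = np.asarray(counts, dtype=float)
    n = c.sum()
    mean = np.sum(c * m) / n
    sd = np.sqrt(np.sum(c * (m - mean) ** 2) / n)
    return 1.06 * sd * n ** (-0.2)

# Illustrative data only: four bins of width 2d = 1.
midpoints = np.array([0.0, 1.0, 2.0, 3.0])
counts = np.array([5, 20, 15, 10])
h = normal_reference_bandwidth(midpoints, counts)
grid = np.linspace(-10.0, 13.0, 4001)
density = weighted_kde(grid, midpoints, counts, h)
```

Because $N$ is large for most aggregated data, this rule can return an $h$ smaller than the bin width $2d$, in which case the estimate mostly reproduces the binning rather than the underlying density.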

How would the answer change if instead of a histogram I had an approximate histogram? In the approximate case, $m_i$ are the bin averages, but they are not evenly spaced. The bins do not have clear-cut boundaries and the $c_i$ counts are proportional to the number of points surrounding the $m_i$ centers.

Given the imprecise and approximate nature of such data, I would be perfectly fine with a decent rule of thumb for bandwidth selection. My main aim is to visualize the histogram in the least misleading way possible.

By an approximate histogram without clear-cut boundaries I mean, for example, a histogram created by taking several histograms of the kind described at the start of the question and merging them. This could be done by combining the closest bins across the histograms, where the bins of different histograms are centered on different values. In such a case, we would be averaging bins that cover different ranges, so the merged histogram would not have exact boundaries.
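As an illustration only, one possible reading of this merging procedure is sketched below. It pools the bins of two histograms and repeatedly fuses the two closest centers into their count-weighted average until a target number of bins remains. The function name, the `max_bins` parameter, and the stopping rule are assumptions made for the sketch; the description above does not specify them.

```python
import numpy as np

def merge_histograms(centers_a, counts_a, centers_b, counts_b, max_bins):
    """Pool two histograms and fuse closest centers until max_bins remain.

    Assumes all counts are positive. Each fused center is the
    count-weighted average of the two centers it replaces.
    """
    centers = np.concatenate([centers_a, centers_b]).astype(float)
    counts = np.concatenate([counts_a, counts_b]).astype(float)
    order = np.argsort(centers)
    centers = list(centers[order])
    counts = list(counts[order])

    while len(centers) > max_bins:
        i = int(np.argmin(np.diff(centers)))   # closest adjacent pair
        total = counts[i] + counts[i + 1]
        fused = (centers[i] * counts[i] + centers[i + 1] * counts[i + 1]) / total
        centers[i:i + 2] = [fused]
        counts[i:i + 2] = [total]

    return np.array(centers), np.array(counts)

# Illustrative data only: two histograms with offset bin centers.
merged_centers, merged_counts = merge_histograms(
    [0.0, 2.0, 4.0], [1, 2, 1],
    [1.0, 3.0, 5.0], [2, 2, 2],
    max_bins=3,
)
```

The resulting centers are bin averages with uneven spacing, and no interval boundaries survive the merge, which is the situation described above.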

Tim
  • Use my observation at https://stats.stackexchange.com/a/41520/919 that convolution of a distribution with a uniform distribution on an interval has a density given by a finite difference of the distribution's CDF. If you take the histogram bars literally, as representing uniform distributions, and the bars have constant width, then all you need to do is replace the kernel by the finite difference of its CDF. Thus, the KDE of the histogram is merely the KDE of its midpoints with a slightly different kernel shape -- and often the choice of shape is immaterial, anyway. – whuber Apr 19 '23 at 19:33
  • The same idea applies to variable-width bins but now the convolution is less-efficiently carried out. Finally, could you explain what it means for histogram bins not to have "clear-cut boundaries"? That's not what's usually meant by a "histogram" and could benefit from elaboration. – whuber Apr 19 '23 at 19:35
  • @whuber I agree that the KDE would be just the KDE of the midpoints, but how do I find a "good" bandwidth? Bandwidth selection methods usually assume raw data, and I am afraid that they would not work well for aggregated data. I also edited the question to explain the approximate histograms, as you suggested. – Tim Apr 20 '23 at 07:08
  • There exists an equivalent theoretical analysis for the optimal bandwidth of histograms when seen as density estimators. Cf. https://books.google.fr/books/about/Th%C3%A9orie_de_l_estimation_fonctionnelle.html?id=I-tUAAAAYAAJ&redir_esc=y – Xi'an Apr 20 '23 at 07:32
  • If you adopt the premise that the bin widths of the histogram were appropriately chosen to begin with, then a reasonable choice of bandwidth for the kernel would be of the same order of magnitude--perhaps the bin width itself. There's little point to getting really precise about this in light of all the choices and compromises made in creating the histogram in the first place. – whuber Apr 20 '23 at 13:10
  • @whuber bandwidth = bin width is a starting point, but I wonder if there is anything better. With the bin width, one can easily end up with a density that has a separate peak for each bin, so my guess would be that making it slightly larger would be a more appealing solution. I tried looking for literature on this but without much success, as it does not seem to be a popular use case. – Tim Apr 20 '23 at 13:29
  • If you are using a Gaussian kernel with a half-width equal to the histogram bin width, you will not get individual peaks in the bins generally. If you're concerned about that, use a larger kernel width! The usual trade-offs apply, of course, especially concerning extending the density beyond the histogram's limits. – whuber Apr 20 '23 at 13:34
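
The suggestion in the comments can be sketched as follows, under stated assumptions. If each bar is read literally as a uniform distribution on $[m_i - d, m_i + d)$ and the kernel is Gaussian with CDF $\Phi$, convolving the two replaces the kernel by the finite difference $\left[\Phi\!\left(\frac{x - m_i + d}{h}\right) - \Phi\!\left(\frac{x - m_i - d}{h}\right)\right] / (2d)$. The function names are hypothetical, the Gaussian kernel is an assumed choice, and $h = 2d$ (the bin width) is used only as the starting point mentioned in the comments.

```python
import math
import numpy as np

# Standard normal CDF built from math.erf, vectorized for arrays.
_erf = np.vectorize(math.erf)

def normal_cdf(z):
    return 0.5 * (1.0 + _erf(np.asarray(z, dtype=float) / math.sqrt(2.0)))

def box_gaussian_kde(x, midpoints, counts, d, h):
    """KDE of a histogram whose bars are uniform on [m_i - d, m_i + d).

    The kernel is a Gaussian of scale h convolved with that uniform,
    i.e. a finite difference of the Gaussian CDF divided by 2d.
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    m = np.asarray(midpoints, dtype=float)
    w = np.asarray(counts, dtype=float)
    w = w / w.sum()
    diff = x[:, None] - m[None, :]
    kernel = (normal_cdf((diff + d) / h) - normal_cdf((diff - d) / h)) / (2.0 * d)
    return kernel @ w

# Illustrative data only: bins of half-width d = 0.5, bandwidth = bin width.
bin_midpoints = np.array([0.0, 1.0, 2.0, 3.0])
bin_counts = np.array([5, 20, 15, 10])
half_width = 0.5
bandwidth = 2.0 * half_width
xs = np.linspace(-12.0, 15.0, 2701)
smooth = box_gaussian_kde(xs, bin_midpoints, bin_counts, half_width, bandwidth)
```

Increasing `bandwidth` smooths further at the cost of spreading more mass beyond the outermost bins, which is the trade-off noted in the last comment.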

0 Answers