
I'm calculating a KDE of one parameter (y, particle density) in bins of another parameter (x, distance from the origin). At small x the data are poorly sampled (tens to thousands of points per x-bin), while at large x they are very well sampled (millions of points per x-bin). At small radii, using a KDE seems very important/effective, while at large radii the result is effectively identical to a histogram but is extremely slow to compute (at least using scipy in Python) *[1]. Ultimately I don't need the KDE per se, I just need the smoothed/sampled PDF it produces (i.e. on a regular grid).

It seems like a hybrid approach should be possible, in which the KDE is used when the sampling is sparse but simple binning is used when it is very well sampled. Is there a standard procedure for hybridizing these approaches? Or are there techniques for adaptively choosing the bandwidth, so that I can use a kernel with finite support that shrinks as the sampling becomes denser?

*[1] I assume this is because I'm using Gaussian kernels with infinite support, which requires N*M kernel evaluations (for N particles in the given bin, with the KDE sampled onto a grid of M points).
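For concreteness, here is a rough sketch of the hybrid approach I have in mind; the switch-over threshold and the use of scipy's default bandwidth rule are arbitrary placeholders, not recommendations:

```python
import numpy as np
from scipy.stats import gaussian_kde

def smoothed_pdf(y, grid, n_switch=10_000):
    """PDF of `y` evaluated at the regularly spaced points in `grid`.

    Uses a Gaussian KDE when the sample is small, and a density-normalized
    histogram on the same grid when it is large. `n_switch` is an arbitrary
    placeholder threshold.
    """
    if len(y) < n_switch:
        return gaussian_kde(y)(grid)                 # sparse bin: smooth estimate
    dx = grid[1] - grid[0]
    edges = np.append(grid - dx / 2, grid[-1] + dx / 2)  # bins centered on grid points
    pdf, _ = np.histogram(y, bins=edges, density=True)
    return pdf                                       # dense bin: histogram is effectively identical
```

Each x-bin's sample of y would go through something like this, so the expensive KDE is only computed where the sampling is actually sparse.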

  • A simple procedure would be to compute the KDE of a nonlinear transform of the parameter and then convert it back. Assuming anything close to a slowly-varying density near the origin, a good choice would be the $n^\text{th}$ root of the distance in $n$ dimensions. – whuber Feb 01 '19 at 22:41
  • @whuber can you expand a bit? I don't see how a transform helps the computational costs. Or is the idea that you reshape the parameter space, to mimic the effect of changing the kernel (/ bandwidth)? Do you know any references/tutorials for this, offhand? – DilithiumMatrix Feb 02 '19 at 15:09
  • An appropriate transformation will place approximately the same numbers of data in each one of a regularly-spaced set of bins. Some examples appear at https://stats.stackexchange.com/questions/65866 – whuber Feb 02 '19 at 16:21
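A minimal sketch of one reading of the suggestion in the comments above: do the KDE of a transformed variable in which the data are spread roughly uniformly, then convert back with the change-of-variables formula. The function names, the particular transform, and the toy data below are illustrative assumptions, not an exact prescription:

```python
import numpy as np
from scipy.stats import gaussian_kde

def transformed_kde(data, grid, forward, jacobian):
    """KDE of 1-D `data` evaluated at `grid`, computed in a transformed
    space and converted back via f_X(x) = f_U(g(x)) * |g'(x)|."""
    kde_u = gaussian_kde(forward(data))              # KDE in the transformed space
    return kde_u(forward(grid)) * jacobian(grid)     # change of variables back to x

# Toy illustration: radii of points with roughly uniform spatial density in a
# 3-D ball, so u = r**3 is roughly uniform and easy to estimate.
n = 3
rng = np.random.default_rng(0)
r = rng.uniform(0.0, 1.0, 5_000) ** (1.0 / n)        # radii with density ∝ r**(n-1)
r_grid = np.linspace(0.05, 1.0, 200)
pdf = transformed_kde(r, r_grid,
                      forward=lambda x: x**n,
                      jacobian=lambda x: n * x**(n - 1))
```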

2 Answers


Maybe a bit late, but this package, KDEpy, implements a convolution-based FFTKDE which is much faster than the SciPy implementations.

It does have some limitations, notably a fixed bin size, which matters for your case, but maybe the speed-up is enough?
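Basic usage looks roughly like this (the bandwidth rule and grid size are just example choices, not recommendations):

```python
import numpy as np
from KDEpy import FFTKDE

rng = np.random.default_rng(0)
y = rng.normal(size=1_000_000)   # stand-in for one densely sampled x-bin

# evaluate() with an integer returns an automatically chosen equidistant grid
grid, pdf = FFTKDE(kernel="gaussian", bw="silverman").fit(y).evaluate(2**10)
```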

(Sorry, new here... cannot comment.)

Helmut
  • R uses FFT by default https://github.com/wch/r-source/blob/cbae812b2414df898e522ca6dd5266ad0fee2e3a/src/library/stats/R/density.R#L170 – Tim May 02 '22 at 06:09

In case you have coarse bins, I would consider using the method described in the following paper: Efficient Estimation of Smooth Distributions From Coarsely Grouped Data. In Appendix 2 you will find R code illustrating the method; it should be easy to translate into Python. I hope it helps.

Gi_F.