
I'm calculating a KDE of one parameter (y, particle density) in bins of another parameter (x, distance from the origin). At small x the data are poorly sampled (tens to thousands of points per x-bin), while at large x they are very well sampled (millions of points per x-bin). At small radii, using a KDE seems very important/effective, while at large radii the result is effectively identical to a histogram but is extremely slow to compute (at least using scipy in Python) *[1]. Ultimately I don't need the KDE per se, I just need the smoothed/sampled PDF it produces (i.e. on a regular grid).

It seems like a hybrid approach should be possible, in which the KDE is used when the sampling is sparse but simple binning is used when it is very well sampled. Is there a standard procedure for hybridizing these approaches? Or are there techniques for adaptively choosing the bandwidth, so that I can use a kernel with finite support that shrinks as the sampling becomes denser?

*[1] I assume this is because I'm using Gaussian kernels with infinite support, which requires N*M kernel evaluations (for N particles in the given bin, with the KDE sampled onto a grid of M points).
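For concreteness, here is a rough sketch of the hybrid approach I have in mind; the switch-over threshold and the use of scipy's default bandwidth rule are arbitrary placeholders, not recommendations:

```python
import numpy as np
from scipy.stats import gaussian_kde

def smoothed_pdf(y, grid, n_switch=10_000):
    """PDF of `y` evaluated at the regularly spaced points in `grid`.

    Uses a Gaussian KDE when the sample is small, and a density-normalized
    histogram on the same grid when it is large. `n_switch` is an arbitrary
    placeholder threshold.
    """
    if len(y) < n_switch:
        return gaussian_kde(y)(grid)                 # sparse bin: smooth estimate
    dx = grid[1] - grid[0]
    edges = np.append(grid - dx / 2, grid[-1] + dx / 2)  # bins centered on grid points
    pdf, _ = np.histogram(y, bins=edges, density=True)
    return pdf                                       # dense bin: histogram is effectively identical
```

Each x-bin's sample of y would go through something like this, so the expensive KDE is only computed where the sampling is actually sparse.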

  • A simple procedure would be to compute the KDE of a nonlinear transform of the parameter and then convert it back. Assuming anything close to a slowly-varying density near the origin, a good choice would be the $n^\text{th}$ root of the distance in $n$ dimensions. – whuber Feb 01 '19 at 22:41
  • @whuber can you expand a bit? I don't see how a transform helps the computational costs. Or is the idea that you reshape the parameter space, to mimic the effect of changing the kernel (/ bandwidth)? Do you know any references/tutorials for this, offhand? – DilithiumMatrix Feb 02 '19 at 15:09
  • An appropriate transformation will place approximately the same numbers of data in each one of a regularly-spaced set of bins. Some examples appear at https://stats.stackexchange.com/questions/65866 – whuber Feb 02 '19 at 16:21
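A minimal sketch of one reading of the suggestion in the comments above: do the KDE of a transformed variable in which the data are spread roughly uniformly, then convert back with the change-of-variables formula. The function names, the particular transform, and the toy data below are illustrative assumptions, not an exact prescription:

```python
import numpy as np
from scipy.stats import gaussian_kde

def transformed_kde(data, grid, forward, jacobian):
    """KDE of 1-D `data` evaluated at `grid`, computed in a transformed
    space and converted back via f_X(x) = f_U(g(x)) * |g'(x)|."""
    kde_u = gaussian_kde(forward(data))              # KDE in the transformed space
    return kde_u(forward(grid)) * jacobian(grid)     # change of variables back to x

# Toy illustration: radii of points with roughly uniform spatial density in a
# 3-D ball, so u = r**3 is roughly uniform and easy to estimate.
n = 3
rng = np.random.default_rng(0)
r = rng.uniform(0.0, 1.0, 5_000) ** (1.0 / n)        # radii with density ∝ r**(n-1)
r_grid = np.linspace(0.05, 1.0, 200)
pdf = transformed_kde(r, r_grid,
                      forward=lambda x: x**n,
                      jacobian=lambda x: n * x**(n - 1))
```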

2 Answers


Maybe a bit late, but this package, KDEpy, implements a convolution-based FFTKDE which is much faster than the SciPy implementations.

It does have some limitations, notably a fixed bin size, which matters for your case, but maybe the speed-up is enough?
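Basic usage looks roughly like this (the bandwidth rule and grid size are just example choices, not recommendations):

```python
import numpy as np
from KDEpy import FFTKDE

rng = np.random.default_rng(0)
y = rng.normal(size=1_000_000)   # stand-in for one densely sampled x-bin

# evaluate() with an integer returns an automatically chosen equidistant grid
grid, pdf = FFTKDE(kernel="gaussian", bw="silverman").fit(y).evaluate(2**10)
```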

(Sorry, new here... cannot comment.)

Helmut
  • R uses FFT by default https://github.com/wch/r-source/blob/cbae812b2414df898e522ca6dd5266ad0fee2e3a/src/library/stats/R/density.R#L170 – Tim May 02 '22 at 06:09

In case you have coarse bins, I would consider using the method described in the following paper: Efficient Estimation of Smooth Distributions From Coarsely Grouped Data. In Appendix 2 you will find R code illustrating the method; it should be easy to translate into Python. I hope it helps.

Gi_F.