
I would like to estimate the probability density function of a data set with a very large number of samples (50,000+) and a large number of continuous variables (2,048).

Compute efficiency is somewhat important, so I would like to avoid approaches based on artificial neural networks.

Considering the high-dimensional setting, is kernel density estimation still an appropriate method? Are there any alternatives?
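For concreteness, this is roughly the kind of estimator I have in mind, sketched with scikit-learn's KernelDensity; the array X below is only a stand-in for the real data and the bandwidth is not tuned:

```python
# Naive KDE sketch: fit a Gaussian-kernel density estimate directly on the
# 2,048-dimensional samples and evaluate the log-density at a few points.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = rng.standard_normal((50_000, 2_048))  # stand-in for the real data

kde = KernelDensity(kernel="gaussian", bandwidth=1.0)  # bandwidth untuned
kde.fit(X)
log_density = kde.score_samples(X[:10])  # log-density at the first 10 samples
```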

  • What form would this estimate take? It cannot possibly be a discretized representation--even with just two bins per dimension you would need to specify $2^{2048} - 2047$ values! – whuber Aug 15 '22 at 16:53
  • Sorry, forgot to mention, these are continuous variables. – Sebastian Berns Aug 16 '22 at 18:42
  • Of course. But KDEs necessarily bin the variables and, to accomplish anything, require at least two bins per dimension. That's why it's crucial for you to specify what form any answer might possibly take. – whuber Aug 16 '22 at 19:04
  • (+1 to whuber's comments) To add: sorry, but there is a next-to-zero chance for this to work out unless we are lucky enough that those 2K variables somehow have an exceptionally dense lower-dimensional representation. In general, KDEs above 10 or so dimensions are not well-behaved; simply put, there are not enough neighbours. Maybe (and that's a huge maybe) we are able to use copula density estimators, but that's exceptionally computationally intensive. (It will effectively mimic how one stress-tests investment portfolio strategies) – usεr11852 Aug 16 '22 at 19:05
  • @SebastianBerns would fitting a parametric family possibly do the trick? Hard for us to suggest one without understanding the motivation and data, though – John Madden Aug 16 '22 at 21:04
  • Thanks for all your comments. To add some context: the data I am dealing with are 2048-dimensional representations of RGB images. I understand that there are limitations to using KDE, and high-dimensional cases are especially difficult. This is exactly why I am asking for alternatives. So, @JohnMadden, happy to hear suggestions. Many thanks! – Sebastian Berns Sep 02 '22 at 14:00
  • @SebastianBerns Ah I see, in this case a simple parametric family (like "multivariate Normal") as I was thinking will not do the trick. If you're comfortable with neural networks, using a normalizing flow with convolutional structure might strike the right balance between good inductive bias and flexibility. See e.g. https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial11/NF_image_modeling.html – John Madden Sep 08 '22 at 14:35
  • Thank you, that tutorial is a great resource! – Sebastian Berns Sep 16 '22 at 11:11
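
To make the lower-dimensional-representation direction from the comments concrete, here is a minimal sketch, assuming scikit-learn's PCA and KernelDensity; the component count and bandwidth grid are placeholders rather than tuned values:

```python
# Sketch of the "reduce dimensionality first" idea: project the
# 2,048-dimensional features to a small PCA subspace, then fit a KDE there
# with a cross-validated bandwidth.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = rng.standard_normal((50_000, 2_048))  # stand-in for the real features

pca = PCA(n_components=10)  # placeholder component count
Z = pca.fit_transform(X)

# Pick the bandwidth by cross-validated log-likelihood.
grid = GridSearchCV(KernelDensity(kernel="gaussian"),
                    {"bandwidth": np.logspace(-1, 1, 10)}, cv=3)
grid.fit(Z)
kde = grid.best_estimator_

# Densities are evaluated in the reduced space, not the original 2,048 dims.
log_density = kde.score_samples(pca.transform(X[:10]))
```

Whether a handful of principal components retains enough structure of the image features is exactly the open question raised in the comments; a Gaussian mixture fitted on the reduced representation would be a cheaper parametric alternative in the same spirit.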

0 Answers