
By "better" I mean smaller density error against the true PDF. Say that $X$ is a random variable whose true PDF $f_X$ we wish to approximate by an estimate $\hat f_X$. My goal is then to find the $\hat f_X$ that minimizes:

$$ \mathbb{E}[(\hat f_X(X) - f_X(X))^2] $$
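
To make this criterion concrete, here is a minimal Monte Carlo sketch of how it could be evaluated. The standard normal used as the true PDF is a hypothetical stand-in for benchmarking only, since $f_X$ is unknown in the actual problem:

```python
import numpy as np
from scipy import stats

def squared_density_error(est_pdf, true_pdf, sampler, n_mc=100_000, seed=0):
    """Monte Carlo estimate of E[(fhat(X) - f(X))^2], with X drawn from f."""
    rng = np.random.default_rng(seed)
    x = sampler(rng, n_mc)                      # draws from the (assumed) true f_X
    return np.mean((est_pdf(x) - true_pdf(x)) ** 2)

# Hypothetical benchmark: pretend f_X is standard normal (in practice it is
# unknown, which is the whole problem) and score a deliberately shifted estimate.
true_pdf = stats.norm.pdf
sampler = lambda rng, n: rng.standard_normal(n)
wrong_pdf = lambda x: stats.norm.pdf(x, loc=0.5)
print(squared_density_error(wrong_pdf, true_pdf, sampler))
```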

Now that I have defined "better", here is my question:

  1. We have a random variable $X$ that takes values in $\mathbb{R}^n$.
  2. We have a finite set of samples $\mathcal{L} = \{x_1,x_2,\ldots,x_k\}$ that contains some outcomes of the variable $X$ as obtained in the past (think of $\mathcal{L}$ as a learning set).
  3. Question: what is the best method to estimate $f_X$ by empirically analyzing the samples in $\mathcal{L}$? That is, we must not assume any known distributional form (I guess this makes the problem non-parametric).

Permitted assumptions:

  • Samples in $\mathcal{L}$ are drawn independently and uniformly at random from the population with PDF $f_X$.

My train of thought:

I can empirically measure the CDF from $\mathcal{L}$, which will be a step function (a staircase). Then I smooth this CDF so that, when I differentiate it, I get a continuous PDF. I find this easy to understand, and I can see that my assumptions are confined to the interpolation/smoothing of the steps in the CDF.
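
To illustrate, here is a minimal sketch of this route, assuming scipy's smoothing splines. The smoothing factor is an ad hoc choice and plays the same role as a bandwidth:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def pdf_from_smoothed_ecdf(samples, s=0.2):
    """Fit a smoothing spline to the empirical CDF, then differentiate it.

    s is the spline's smoothing factor (total squared residual allowed);
    it plays the role of a bandwidth and is chosen ad hoc here.
    """
    x = np.sort(np.asarray(samples))            # assumes no ties (continuous data)
    ecdf = np.arange(1, len(x) + 1) / len(x)    # staircase CDF heights
    spline = UnivariateSpline(x, ecdf, k=4, s=s)
    return spline.derivative()                  # callable PDF estimate

# Caveat: the derivative is not guaranteed to be non-negative or to
# integrate to exactly one; that is part of the smoothing error.
rng = np.random.default_rng(0)
pdf_hat = pdf_from_smoothed_ecdf(rng.standard_normal(500))
print(pdf_hat(0.0))   # roughly 1/sqrt(2*pi) ≈ 0.40 for standard normal data
```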

However, when I try to find the PDF directly, I just cannot think of a method to estimate it non-parametrically by looking at the samples in $\mathcal{L}$. I can only think of estimating the PDF parametrically, after assuming a certain distributional form.
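
For contrast, this is the parametric route I mean (a sketch assuming, arbitrarily, a Gaussian family):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
samples = rng.standard_normal(500)

# Parametric route: assume a family (here, arbitrarily, Gaussian), fit its
# parameters by maximum likelihood, and read off the fitted PDF.
mu, sigma = stats.norm.fit(samples)
pdf_hat = lambda x: stats.norm.pdf(x, loc=mu, scale=sigma)
print(pdf_hat(0.0))
```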

To be honest, I am not even sure why estimating non-parametric PDFs via the CDF looks approachable, while estimating non-parametric PDFs directly does not. But this is exactly why I am asking this question. Why is this?

  • Are my feelings due to some facts that I am unable to articulate explicitly?
  • Or are my feelings merely an accident of how these methods happened to evolve, such that scientifically neither method is actually any better than the other?
– caveman
  • Considerations related to Turing machines are scarcely of statistical interest. To see why not: there are trivially infinitely many easiest estimators in your sense: pick your favorite PDF supported on $\mathbb{R}^n$ and assume $f$ is it. That's an $O(1)$ operation. The point is that statistics is interested in the quality of the estimator, not in its properties as an abstract algorithm. – whuber May 26 '16 at 01:33
  • @whuber thank you. I edited my question by removing "easier" and emphasizing the definition of "better". – caveman May 26 '16 at 01:44
  • An easy way to construct a PDF in a non-parametric manner is to make a histogram of your sample and then smooth that histogram (and normalise it to integrate to unity). You make no assumption about the underlying PDF or CDF. This smoothing approach is quite common: see for example Jones et al., "A Brief Survey of Bandwidth Selection for Density Estimation". In general, the choice of bandwidth is far more important than the choice of the kernel. – usεr11852 May 27 '16 at 20:09
  • Also, in your technique of "smoothing the CDF and then differentiating" you have two sources of error: 1. the error from the CDF estimation (OK sure, we get that from PDF estimation too), but 2. we also introduce further error by numerically differentiating the CDF to get the PDF. Numerical differentiation (even fancy schemes like Richardson extrapolation) ultimately differentiates on a discrete grid and smooths. – usεr11852 May 27 '16 at 20:28
  • Kernel density estimators are nonparametric. The reason is that the number of parameters grows with the size of the dataset (the growing parameter set being the kernel locations). In the limit of infinite data, a KDE with the proper bandwidth (which may approach zero) could properly represent any distribution. – user20160 Jun 21 '16 at 05:47
  • @user20160 in the limit of infinite data, does the choice of the kernel matter? – caveman Jun 21 '16 at 10:48
  • @caveman Yes. For example, if your kernel is too wide, you'll always oversmooth the density, even with infinite data. But the kernel shape doesn't matter asymptotically. – user20160 Jun 21 '16 at 10:57
  • Interesting. For common cases (not asymptotically), would it matter whether the bandwidth is constant or variable across the x axis? So far it seems to me that the bandwidth stays constant for all samples along the x axis. I wonder if it would make any difference to have, say, a wide bandwidth over some region of the x axis (e.g. $0.5 \le x \le 5$) and a narrower bandwidth over another region (e.g. $x > 5$)? – caveman Jun 22 '16 at 00:55 (see the KDE sketch after these comments)
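
To make the bandwidth discussion in these comments concrete, here is a minimal kernel density estimation sketch; the bandwidth factors are arbitrary illustrations:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
samples = rng.standard_normal(500)

# A KDE places one kernel at every sample point; the bandwidth controls
# how far each kernel spreads. Asymptotically the kernel shape matters
# little, but the bandwidth matters a lot (see the comments above).
for bw in (0.05, 0.3, 2.0):            # under-, moderately, over-smoothed
    kde = stats.gaussian_kde(samples, bw_method=bw)
    print(f"bandwidth factor {bw}: fhat(0) = {kde(0.0)[0]:.3f}")
```

A location-dependent bandwidth, as asked about in the last comment, is known in the literature as adaptive (variable-bandwidth) kernel density estimation; the fixed-bandwidth sketch above does not implement it.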

0 Answers