
Intro

Let $Y$ be a random variable whose PDF is $p_Y(\cdot)$. Say that $Y$ is a function $g(\cdot)$ of another random variable $X$ whose PDF $p_X(\cdot)$ is given. You carry out the calculation and end up with an analytic expression for the PDF $p_Y(\cdot)$.

My general question is: how do you verify that your result for $p_Y(\cdot)$ is correct?

I think there are plenty of methods that answer my question. First of all, there are two necessary conditions to verify:

  1. $p_Y(y)\geq 0$ for all $y$;
  2. $\int p_Y(y)\,\mathrm{d}y = 1$.
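These two necessary conditions can be checked numerically before anything else. A minimal sketch, assuming a toy example of my own choosing (not from the post): $X\sim N(0,1)$, $g(x)=e^x$, so the correct $p_Y$ is the standard lognormal density.

```python
# Numerical check of the two necessary conditions for a claimed p_Y.
# Hypothetical example: X ~ N(0,1), Y = g(X) = exp(X), so the correct
# p_Y is the standard lognormal density.
import math
from scipy import integrate

def p_Y(y):
    # Claimed analytic density (standard lognormal)
    if y <= 0:
        return 0.0
    return math.exp(-math.log(y) ** 2 / 2) / (y * math.sqrt(2 * math.pi))

# Condition 1: non-negativity on a grid
assert all(p_Y(0.01 * i) >= 0 for i in range(1, 10_000))

# Condition 2: normalization over the full support
total, abserr = integrate.quad(p_Y, 0, math.inf)
print(f"integral = {total:.6f}")  # should be 1 up to quadrature error
```

If either check fails, the derivation is certainly wrong; if both pass, the result may still be incorrect.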

If either condition fails, then for sure there is an error in the derivation of $p_Y$. But what about sufficient conditions that allow you to be confident in your result?

My approach

So far my favourite method is Monte Carlo: with my lovely PC I generate a large sample $\{x_i\}_{i=1}^N$ according to $p_X(\cdot)$ and then make a plot overlaying the histogram of $\{y_i=g(x_i)\}_{i=1}^N$ with the curve of the analytically obtained PDF $p_Y(\cdot)$. If the histogram "resembles" $p_Y(\cdot)$, I conclude that $p_Y(\cdot)$ is correct. To be even more confident, I repeat the procedure for increasing values of $N$, and if the histogram seems to converge towards $p_Y(\cdot)$ then I accept the result.
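The same histogram-versus-PDF comparison can be made quantitative instead of by eye, by comparing the normalized bin heights with the claimed density at the bin centers. A sketch under the same hypothetical assumptions as before ($X\sim N(0,1)$, $g(x)=e^x$, claimed $p_Y$ = standard lognormal):

```python
# Monte Carlo check: compare the normalized histogram of y_i = g(x_i)
# with the claimed analytic p_Y at the bin centers.
# Hypothetical example: X ~ N(0,1), g(x) = exp(x), p_Y = lognormal(s=1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N = 1_000_000
x = rng.standard_normal(N)        # sample from p_X
y = np.exp(x)                     # y_i = g(x_i)

# Normalize by the full N (not just in-range counts) so heights estimate p_Y
edges = np.linspace(0.0, 8.0, 41)
counts, _ = np.histogram(y, bins=edges)
heights = counts / (N * np.diff(edges))
centers = 0.5 * (edges[:-1] + edges[1:])

claimed = stats.lognorm.pdf(centers, s=1)   # the analytic p_Y being validated
max_err = np.max(np.abs(heights - claimed))
print(f"max |histogram - claimed pdf| = {max_err:.4f}")
```

The maximum deviation should shrink roughly like $1/\sqrt{N}$ as $N$ grows, which gives a number to track instead of a visual impression.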

But although this method is very intuitive (and not 100% rigorous, since it is not quantitative: everything is done by eye), there is a catch: it works only if $Y\in\mathbb{R}^n$ with $n=1,2$. Unfortunately, I now have to study cases where $n=3$ and $n=6$. How do you deal with that?

At first glance I thought of the trivial extension: make a separate plot for each component of $Y$, i.e. repeat the procedure on each scalar marginal PDF. The problem is that agreement between the histograms and their corresponding marginal PDFs verifies only a necessary condition (the marginal PDFs are checked, not the joint PDF).
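The per-component marginal check can be sketched as follows, assuming a hypothetical vector example (my choice, not from the post): $X\sim N(0, I_2)$ and $g(x) = (x_1+x_2,\; x_1-x_2)$, whose claimed marginals are both $N(0,2)$. As noted, agreement here is only necessary: it says nothing about the joint dependence structure.

```python
# Marginal (per-component) Monte Carlo check for vector-valued Y.
# Hypothetical example: X ~ N(0, I_2), g(x) = (x1 + x2, x1 - x2),
# claimed marginals: both N(0, 2). This checks marginals only, not the joint.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N = 200_000
x = rng.standard_normal((N, 2))
y = np.column_stack([x[:, 0] + x[:, 1], x[:, 0] - x[:, 1]])  # y_i = g(x_i)

claimed_marginals = [stats.norm(0, np.sqrt(2)), stats.norm(0, np.sqrt(2))]

errs = []
for k, marg in enumerate(claimed_marginals):
    counts, edges = np.histogram(y[:, k], bins=60, range=(-5, 5))
    heights = counts / (N * np.diff(edges))      # normalize by the full N
    centers = 0.5 * (edges[:-1] + edges[1:])
    max_err = np.max(np.abs(heights - marg.pdf(centers)))
    errs.append(max_err)
    print(f"component {k}: max |hist - marginal pdf| = {max_err:.4f}")
```

Even if every component passes, a wrong joint density with the right marginals would slip through, which is exactly the limitation described above.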

As a second, more rigorous approach, I was thinking about a goodness-of-fit method. So far I have studied only the Chi-squared and Kolmogorov-Smirnov tests in the simple scalar case (i.e. $n=1$); hence I was thinking of adapting such tests to my cases.
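In the scalar case the Kolmogorov-Smirnov idea is a one-liner with `scipy.stats.kstest`, comparing the simulated $y_i$ against the claimed CDF. A sketch with the same hypothetical lognormal example as before; for $n>1$, the same test could be applied to scalar projections $a^\top Y$ as a necessary (not sufficient) check.

```python
# One-sample Kolmogorov-Smirnov test of simulated y_i against the claimed CDF.
# Hypothetical scalar example: X ~ N(0,1), g(x) = exp(x),
# claimed distribution of Y = standard lognormal.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y = np.exp(rng.standard_normal(10_000))          # y_i = g(x_i)

res = stats.kstest(y, stats.lognorm(s=1).cdf)    # claimed CDF being validated
print(f"KS statistic = {res.statistic:.4f}, p-value = {res.pvalue:.3f}")
# A tiny p-value would flag a discrepancy between the sample and the claim.
```

This replaces the by-eye comparison with a p-value, though a non-rejection is of course still not proof of correctness.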

Question

Before falling down the rabbit hole, I would like an opinion from someone who has already faced this type of problem, since I believe such problems are very common in the community of statisticians. In particular, I would like a precise study direction:

Is there a dominant validation method? That is, is there a method that is especially well suited to my validation problem?

  • While it doesn't answer your question, it's related https://stats.stackexchange.com/questions/201116/best-way-to-check-implementation-of-density-distribution-function-and-random-ge/272256#272256 – Tim Jan 19 '23 at 10:21
  • Ooh, thank you so much Tim! I totally missed that post, I'll give it a closer look – matteogost Jan 19 '23 at 10:24
  • Various problems lend themselves to different approaches. E.g., if you think the distribution is multivariate Normal, you can check random univariate linear combinations for Normality. There are myriad ways to double-check mathematical derivations. You can check special cases that reduce to tractable calculations, etc. I have answered many questions here using such calculations and endeavored in most cases to check them. I believe those answers exhibit a wide range of verification techniques. Checking the univariate and bivariate marginals through simulation likely predominates. – whuber Jan 19 '23 at 15:35
  • Thank you whuber, I'll check your past posts about this topic – matteogost Jan 19 '23 at 16:07
  • A very principled way of comparing samples to distributions is Maximum Mean Discrepancy. See this answer – g g Jan 19 '23 at 17:00
  • Very interesting, thank you so much – matteogost Jan 21 '23 at 08:49

0 Answers