How can I visualize a dataset with $n$ samples and $p$ variables to check whether it comes from a specific, known distribution?

Question

Background

My statistical computing lecturer asked the question in the title. To be specific, the population density is

$$ f(x_1, \ldots, x_p) = \left(x_1^{p-1} + \cdots + x_p^{p-1}\right)I(0<x_i<1,\ \forall\, 1\le i\le p) $$

The idea for drawing a sample is the conditional distribution method, which factorizes the joint density as

$$ f(x_1, \ldots, x_p) = f_1(x_1)\, f_2(x_2 | x_1)\, f_3(x_3|x_2, x_1) \cdots f_p(x_p | x_{p-1}, \ldots, x_1) $$

To derive each factor we need the marginal densities, which are, for $0<x_i<1$,

$$ f(x_1, \ldots, x_j) = \left(x_1^{p-1} + \cdots + x_j^{p-1}\right) + \frac{p-j}{p} $$

which implies that for every $j \ge 2$,

$$ f_j(x_j|x_{j-1},\ldots,x_1) = \frac{f(x_1,\ldots, x_j)}{f(x_1, \ldots, x_{j-1})} $$

Let's skip the tedious calculations and give the sampling procedure directly.

Generate $U_1, R_1 \sim U(0,1)$ independently; let $x_1 = U_1^{\frac1p}$ if $R_1 \le \frac1p$, else $x_1 = U_1$.
Given $x_1$, generate $U_2, R_2 \sim U(0,1)$ independently; let $x_2 = U_2^{\frac1p}$ if $R_2 \le \frac1{px_1^{p-1}+p-1}$, else $x_2 = U_2$.
Given $(x_1, \ldots, x_{j-1})$, generate $U_j, R_j \sim U(0,1)$ independently; let $x_j = U_j^{\frac1p}$ if $R_j \le \frac1{p\sum_{i=1}^{j-1}x_{i}^{p-1}+p-j+1}$, else $x_j= U_j$, for $3\le j \le p$.
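
For reference, the skipped calculation can be sketched as follows. The shorthand $S_{j-1} = \sum_{i=1}^{j-1} x_i^{p-1}$ (with $S_0 = 0$) and the weight $w_j$ are notation introduced here, not part of the original post. Dividing the two marginal densities gives a two-component mixture:

$$ f_j(x_j|x_{j-1},\ldots,x_1) = \frac{S_{j-1} + x_j^{p-1} + \frac{p-j}{p}}{S_{j-1} + \frac{p-j+1}{p}} = w_j \cdot p\,x_j^{p-1} + (1-w_j)\cdot 1, \qquad w_j = \frac{1}{p\,S_{j-1} + p - j + 1} $$

The first component, $p\,x_j^{p-1}$ on $(0,1)$, is the density of $U^{1/p}$ because $P(U^{1/p}\le t) = t^p$; the second is the $U(0,1)$ density. Choosing the first component with probability $w_j$ is what the comparison with $R_j$ does in each step above.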

Simulation

Let $p=5$,

## R code
p <- 5
n <- 1e4
set.seed(1)
## generate a vector ~ F with dim p
generateVector = function(p) {
  vec = c()
  for (i in 1:p) {
    point = 1 / (sum(p * (vec ^ (p - 1))) + p - i + 1)
    threshold = runif(1)
    if (threshold < point) {
      vec = c(vec, (runif(1) ^ (1 / p)))
    } else{
      vec = c(vec, (runif(1)))
    }
  }
  return(vec)
}
dta <- data.frame()
for (i in 1:n) {
  dta <- rbind(dta, generateVector(p))
}
colnames(dta) <- paste0('x', 1:p)
head(dta)

       x1        x2        x3        x4        x5
0.3721239 0.9082078 0.8983897 0.6607978 0.0617863
0.1765568 0.3841037 0.4976992 0.9919061 0.7774452
0.2121425 0.1255551 0.8266908 0.8250891 0.3403490
0.5995658 0.1862176 0.6684667 0.1079436 0.4112744
0.6470602 0.5530363 0.7893562 0.8624731 0.6927316
0.8612095 0.2447973 0.6302822 0.5186343 0.4068302

My question is how to verify, by visualization, that the data dta is indeed a sample from $f$ when $p=5$. Or is there a hypothesis test that would help?

If $p=1$, we can do this by plotting a histogram with the density curve overlaid, and by applying a $\chi^2$ test or a Kolmogorov test.
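
As a partial check in the spirit of the $p=1$ case, the same idea could be applied to each coordinate separately. By symmetry every $x_j$ has marginal density $x^{p-1} + \frac{p-1}{p}$ and CDF $\frac{x^p}{p} + \frac{(p-1)x}{p}$ on $(0,1)$. The sketch below is only an illustration: `r_f`, `marg_pdf` and `marg_cdf` are hypothetical helper names, and `r_f` is a vectorised rewrite of the sampling steps rather than the code above. Matching marginals is necessary but not sufficient, since it says nothing about the dependence between coordinates.

```r
# Hypothetical helper (not from the post): vectorised form of the sampling steps
r_f <- function(n, p) {
  x <- matrix(NA_real_, n, p)
  s <- numeric(n)                          # running sum of x_i^(p-1), per row
  for (j in seq_len(p)) {
    w <- 1 / (p * s + p - j + 1)           # weight of the U^(1/p) component
    u <- runif(n)
    x[, j] <- ifelse(runif(n) <= w, u^(1 / p), u)
    s <- s + x[, j]^(p - 1)
  }
  x
}

set.seed(1)
p <- 5
n <- 1e4
x <- r_f(n, p)

# Marginal density and CDF of a single coordinate on [0, 1]
marg_pdf <- function(t) t^(p - 1) + (p - 1) / p
marg_cdf <- function(t) t^p / p + (p - 1) * t / p

# One Kolmogorov-Smirnov test per coordinate against the marginal CDF
ks_p <- apply(x, 2, function(col) ks.test(col, marg_cdf)$p.value)
print(round(ks_p, 3))

# Histograms with the marginal density overlaid (interactive sessions only)
if (interactive()) {
  op <- par(mfrow = c(1, p))
  for (j in seq_len(p)) {
    hist(x[, j], breaks = 20, freq = FALSE, main = paste0("x", j), xlab = "")
    curve(marg_pdf, from = 0, to = 1, add = TRUE, col = "red")
  }
  par(op)
}
```

Five tests are run here, so the individual p-values should be read with multiple testing in mind.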

Note that generally statistical model assumptions are arguably never precisely fulfilled (particularly not for pseudo-random numbers generated by a computer; these are called "pseudo random" for a reason). So there is no way to "verify" such an assumption. The best you can ever achieve is to come up with something that can distinguish data fitted clearly badly by your model from data that, in certain respects, look similar to data generated by the model. — Christian Hennig, Mar 17 '23 at 16:31
@ChristianHennig, so it suffices to check that the sample data may follow the desired distribution with a relatively high probability. On that basis, can we continue with other statistical inference under the assumption that the data follow the distribution (say $f$ in the context of my post)? — Chia, Mar 22 '23 at 06:52
"so it suffices to check that the sample data may follow the desired distribution with a relatively high probability" - I'd say this probability is zero. Probability models such as distributions are idealisations, they never hold precisely in reality. To say that "theoretical assumptions need to be fulfilled in reality" is misleading. We do statistical inference all the time in situations in which assumptions are not fulfilled. What is important is that they are not violated in such a way that results are misleading. — Christian Hennig, Mar 22 '23 at 10:14
I wrote more on this here: https://stats.stackexchange.com/questions/538561/relevance-of-assumption-of-normality-ways-to-check-and-reading-recommendations/538566#538566 — Christian Hennig, Mar 22 '23 at 10:14

Sextus Empiricus · Answer 1 · 2023-03-17T15:28:51.040

Alternative tests that work in general are:

Using statistics about the distribution of the nearest neighbour distances.

Bickel, Peter J., and Leo Breiman. "Sums of functions of nearest neighbor distances, moment bounds, limit theorems and a goodness of fit test." The Annals of Probability (1983): 185-214.
Discretizing the data into bins/intervals and performing a $\chi^2$-test or G-test on the bin counts.

Possibly you could do something clever with rescaling the data or working with conditional distributions, but I don't directly see how.
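
A minimal sketch of the binning approach, under assumptions that are not from the answer: two bins per axis (so $2^p = 32$ cells for $p = 5$), a hypothetical vectorised sampler `r_f` standing in for the question's code, and a hypothetical helper `cell_prob`. For a cell $\prod_i [a_i, b_i]$ the probability under $f$ is $\sum_i \frac{b_i^p - a_i^p}{p}\prod_{k\ne i}(b_k - a_k)$, which is what `cell_prob` computes.

```r
# Hypothetical helper (not from the post): vectorised form of the sampling steps
r_f <- function(n, p) {
  x <- matrix(NA_real_, n, p)
  s <- numeric(n)                          # running sum of x_i^(p-1), per row
  for (j in seq_len(p)) {
    w <- 1 / (p * s + p - j + 1)           # weight of the U^(1/p) component
    u <- runif(n)
    x[, j] <- ifelse(runif(n) <= w, u^(1 / p), u)
    s <- s + x[, j]^(p - 1)
  }
  x
}

set.seed(1)
p <- 5
n <- 1e4
x <- r_f(n, p)

breaks <- c(0, 0.5, 1)                     # assumed binning: 2 bins per axis
k <- length(breaks) - 1

# Bin index (1..k) of every coordinate of every observation
idx <- matrix(findInterval(x, breaks, rightmost.closed = TRUE), n, p)

# All k^p cells, in expand.grid order (first coordinate varies fastest)
cells <- as.matrix(expand.grid(rep(list(seq_len(k)), p)))

# Probability of one cell under f: integral of sum_i x_i^(p-1) over the box
cell_prob <- function(cell) {
  a <- breaks[cell]
  b <- breaks[cell + 1]
  d <- b - a
  prod(d) * sum((b^p - a^p) / (p * d))
}
probs <- apply(cells, 1, cell_prob)

# Observed counts per cell, using the same linear ordering as `cells`
id <- 1 + as.vector((idx - 1) %*% k^(0:(p - 1)))
counts <- tabulate(id, nbins = k^p)

res <- chisq.test(counts, p = probs)
print(res)
```

Finer bins give more power against local deviations, but the number of cells grows as $k^p$, so the expected counts per cell shrink quickly.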

edited Mar 17 '23 at 15:28

answered Mar 17 '23 at 15:22


Possibly you could use transformations like $y_i = x_i^p$ or $x_i = y_i^{1/p}$ and add a scaling based on $\sum y_i$, which will give a uniform distribution for the $y_i$ (only the boundaries of the distribution will not be so clear). – Sextus Empiricus Mar 17 '23 at 15:52

JimB · Answer 2 · 2023-03-17T23:18:05.487

If the most important aspect is "visual" (rather than "verify"), then thinking about what kinds of deviations might exist (and what might cause the deviations) and how one might display the data that would show those deviations is required.

Unless you're on acid, 3D plots might be the limit of what you can display (other than changes in 3D plots over time if there was a time element). Below I increased your sample size to n <- 1e5 and created 3D histograms (with estimated probability density as the vertical axis) along with the bivariate pdf for each pair of variables (using Mathematica).

(* Import the sample; drop the first row and first column (presumably header and row names) *)
data = Import["pairs.csv"];
data = data[[2 ;;]];
data = data[[All, 2 ;;]];
labels = {"\!\(\*SubscriptBox[\(x\), \(1\)]\)", 
   "\!\(\*SubscriptBox[\(x\), \(2\)]\)", 
   "\!\(\*SubscriptBox[\(x\), \(3\)]\)", 
   "\!\(\*SubscriptBox[\(x\), \(4\)]\)", 
   "\!\(\*SubscriptBox[\(x\), \(5\)]\)"};
p = 5;
(* For each pair (i1, i2): 3D histogram of that pair overlaid with the bivariate marginal pdf *)
figures = Table[Show[Histogram3D[data[[All, {i1, i2}]], Automatic, "PDF",
    RotationAction -> "Clip", SphericalRegion -> True,
    AxesLabel -> (Style[#, Bold, 18] &) /@ {labels[[i1]], labels[[i2]], ""}],
   Plot3D[(p - 2)/p + x[1]^(p - 1) + x[2]^(p - 1), {x[1], 0, 1}, {x[2], 0, 1},
    PlotStyle -> Green]],
  {i1, 2, 5}, {i2, 1, i1 - 1}]

The over- and under-estimates of density seem to occur without any pattern and none appear to be large.
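
For readers working in R, a rough analogue of the pairwise comparison might look like the sketch below. It is not a translation of the Mathematica code: it uses flat heat maps of (histogram density minus true cell-average density) instead of 3D histograms, a 10 x 10 grid chosen arbitrarily, and a hypothetical vectorised sampler `r_f` in place of the question's generator. The bivariate marginal density $\frac{p-2}{p} + x_i^{p-1} + x_j^{p-1}$ is the one plotted above.

```r
# Hypothetical helper (not from the post): vectorised form of the sampling steps
r_f <- function(n, p) {
  x <- matrix(NA_real_, n, p)
  s <- numeric(n)                          # running sum of x_i^(p-1), per row
  for (j in seq_len(p)) {
    w <- 1 / (p * s + p - j + 1)           # weight of the U^(1/p) component
    u <- runif(n)
    x[, j] <- ifelse(runif(n) <= w, u^(1 / p), u)
    s <- s + x[, j]^(p - 1)
  }
  x
}

set.seed(1)
p <- 5
n <- 1e5
x <- r_f(n, p)

breaks <- seq(0, 1, by = 0.1)              # assumed grid: 10 bins per axis
h <- diff(breaks)
mid <- head(breaks, -1) + h / 2

# Cell average of t^(p-1) over each bin, then of the bivariate marginal density
avg1 <- diff(breaks^p) / (p * h)
expected <- (p - 2) / p + outer(avg1, avg1, "+")

pairs_idx <- t(combn(p, 2))                # all choose(p, 2) pairs of coordinates

# Histogram density estimate minus expected density, one matrix per pair
dev <- lapply(seq_len(nrow(pairs_idx)), function(r) {
  bi <- cut(x[, pairs_idx[r, 1]], breaks, include.lowest = TRUE)
  bj <- cut(x[, pairs_idx[r, 2]], breaks, include.lowest = TRUE)
  est <- unclass(table(bi, bj)) / (n * outer(h, h))
  est - expected
})
max_dev <- sapply(dev, function(m) max(abs(m)))
print(round(max_dev, 3))

# Heat maps of the deviations (interactive sessions only)
if (interactive()) {
  op <- par(mfrow = c(2, 5))
  for (r in seq_along(dev)) {
    image(mid, mid, dev[[r]],
          xlab = paste0("x", pairs_idx[r, 1]), ylab = paste0("x", pairs_idx[r, 2]))
  }
  par(op)
}
```

A summary such as `max_dev` scales to larger $p$ more easily than inspecting every panel, although it only looks at two-dimensional margins.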

What if the dimension goes higher? We have to plot ${p \choose 2}$ graphs? — Chia, Mar 22 '23 at 06:47
Of course there's a limit. Your question mentioned $p=5$ which is doable. So far I see no responses addressing your "visual" aspect. But maybe that's because you haven't stated what kinds of departures might occur or the consequences of any types of departures. — JimB, Mar 22 '23 at 14:27
