I have a sample of around 100 observations. Each observation is a vector of 30 real numbers.

I also have a pdf that I think may describe the source distribution of my sample. The pdf has a well-defined functional form, so I can use Gibbs sampling to generate a few thousand simulated observations.

I can use the same pdf to compute two sets of probabilities: one for my original sample and one for the generated sample.

Then I run a Wilcoxon rank-sum test on those two sets of probabilities. My null hypothesis is that my original sample and the newly generated sample are drawn from the same distribution.

Does this approach have any potential problems? Is there a better way to test the hypothesis that a multivariate sample comes from a given source distribution?

mdewey
lowtech

1 Answer

It's possible to 'fool' your method when two distributions have similar central tendencies but different dispersions. For instance, the R code below generates observations from a normal and a uniform distribution that 'pass' the Wilcoxon test (in which case we would erroneously conclude that the two sets of observations were generated from the same pdf).

# Create 100 observations each from the normal and the uniform distribution
obs1 <- rnorm(100, 5, 1)
obs2 <- runif(100, 0, 5)

# Evaluate the uniform CDF at each observation
# (parameters estimated from each sample's own min and max)
punif_obs1 <- punif(obs1, min(obs1), max(obs1))
punif_obs2 <- punif(obs2, min(obs2), max(obs2))

# Evaluate the normal CDF at each observation
# (parameters estimated from each sample's own mean and sd)
pnorm_obs1 <- pnorm(obs1, mean(obs1), sd(obs1))
pnorm_obs2 <- pnorm(obs2, mean(obs2), sd(obs2))

# Wilcoxon tests. The null hypothesis is typically not rejected, even though
# the two sets of observations were sampled from different distributions.
wilcox.test(punif_obs1, punif_obs2)
wilcox.test(pnorm_obs1, pnorm_obs2)
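
The same weakness can be shown more directly with a rough sketch that is not part of the example above: two samples with the same centre but very different spread. The sample sizes, standard deviations and seed are arbitrary choices of mine. The rank-sum test mainly targets a location shift, whereas the Kolmogorov-Smirnov test compares the whole empirical CDFs and so reacts to the difference in dispersion.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Two samples with the same centre (0) but very different spread
x <- rnorm(200, mean = 0, sd = 1)
y <- rnorm(200, mean = 0, sd = 5)

# Rank-sum test: sensitive to a shift in location, not to a change in spread
w <- wilcox.test(x, y)

# Kolmogorov-Smirnov test: compares the two empirical CDFs as a whole
k <- ks.test(x, y)

# Compare the two p-values side by side
print(c(wilcoxon = w$p.value, ks = k$p.value))
```

This is univariate, so it only illustrates why a location-type test on a one-dimensional summary can miss a real difference; it is not a replacement for a multivariate test.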

The most straightforward way to test whether your observations come from your pdf is to generate a set of data from the pdf and follow the procedure described here: How to test whether two multivariate distributions are sampled from the same underlying population?
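
As a hedged illustration of what a multivariate two-sample comparison can look like, here is a minimal sketch of a permutation test based on the energy statistic. This is my own choice of statistic and not necessarily the procedure from the linked question. The function name `energy_perm_test`, the number of permutations, and the Gaussian stand-ins for "observed" and "simulated" data are all assumptions for the example; in practice `sim` would be the draws from your Gibbs sampler.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Permutation test on the energy statistic (hypothetical helper, base R only).
# x, y: numeric matrices with one observation per row and the same column count.
energy_perm_test <- function(x, y, n_perm = 199) {
  n <- nrow(x)
  N <- n + nrow(y)
  d <- as.matrix(dist(rbind(x, y)))  # pairwise Euclidean distances, computed once

  # Energy statistic for a given assignment of pooled rows to the two groups
  stat <- function(idx) {
    a <- idx[seq_len(n)]
    b <- idx[(n + 1):N]
    2 * mean(d[a, b]) - mean(d[a, a]) - mean(d[b, b])
  }

  obs  <- stat(seq_len(N))                    # statistic for the real labels
  perm <- replicate(n_perm, stat(sample(N)))  # statistics for shuffled labels
  (1 + sum(perm >= obs)) / (n_perm + 1)       # permutation p-value
}

# Stand-in data: 100 "observed" vectors of length 30
obs_data <- matrix(rnorm(100 * 30), nrow = 100)

# Stand-ins for simulated draws: one from the same distribution, one with larger spread
sim_same <- matrix(rnorm(300 * 30), nrow = 300)
sim_diff <- matrix(rnorm(300 * 30, sd = 2), nrow = 300)

p_same <- energy_perm_test(obs_data, sim_same)
p_diff <- energy_perm_test(obs_data, sim_diff)
print(c(same = p_same, different = p_diff))
```

Because the test works on distances between whole observation vectors rather than on a one-dimensional summary, it can pick up differences in dispersion as well as in location. Note that Gibbs draws are autocorrelated, so thinning the chain before using it as the second sample is probably sensible.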

keithing