Compare distribution to given shapes to find the most similar

Question

I have a multiple-comparison problem between several thousand correlation distributions and a set of given shapes.

I have computed pairwise correlations for a given set of gene expression data, and I have generated correlation matrices with a correlation distribution for each analyzed gene (range -1 to 1).

I want to compare the similarity between each correlation distribution and each of the given shapes I have generated.

A visual example:

I have the correlation distribution for gene X, whose histogram visually resembles shape4 more than shape2 or shape3.

I want to perform a test that gives the probability that x is equivalent to the shape4 distribution. Say, the result would tell me that x follows a distribution more similar to shape4 than to shape2 or shape3, expressed as a p-value or a probability (or both).

The range of the distributions goes from -1 to 1 (correlations).

I have also counted the number of occurrences in 8 25-percentile bins, and these 8-value vectors for each distribution also allow a visual inspection of the similarity, but I need a test that yields a p-value and/or a probability. The ultimate goal is to assign each gene distribution to one of the given shapes, based on their similarity.

I have read a lot about different measures of distance between distributions, and about chisq.test, ks.test and so on, but I would like a reasoned answer about the best approach to this problem. I have also considered using a machine learning classifier.

Can someone provide a good solution (R based if possible)?

You may see https://cran.r-project.org/web/packages/fitdistrplus/vignettes/paper2JSS.pdf or refer to stats.exchange — , Jul 23 '19 at 08:41
Note that a standard statistical test will never give you the "probability that x will follow/be equivalent to distribution x". Rather it will only tell you whether x could be compatible with that distribution, and it may actually be compatible with many distributions, not only the best. If you really want a "probability that x was generated by distribution shape4", you need to go Bayesian and set up a prior over your distributions. — Christian Hennig, Sep 27 '20 at 14:09
If you actually want to run a standard test, doing this on the same data after selecting the best distribution is problematic and I'm not sure whether you can justify anything better than Bonferroni here, meaning that if you compare x to the best out of, say, 2,000 distributions with a suitable test, you need to multiply your p-value by 2,000 in order to make it valid. You may not like this; and it may suggest that testing on the same data that you used for selecting the best distribution may not be the best idea. — Christian Hennig, Sep 27 '20 at 14:12
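The Bayesian suggestion in the comments above can be sketched roughly as follows. This is only an illustration, written in Python rather than the R the question asks for, and it rests on several assumptions that are not in the thread: each gene is summarised by its 8 bin counts, each shape is summarised by 8 bin probabilities, the counts are treated as multinomial draws, and the prior over shapes is uniform. The function name `shape_posteriors` and all numbers are hypothetical.

```python
import math

def shape_posteriors(counts, shapes, prior=None, eps=1e-9):
    """Posterior probability of each candidate shape given binned counts.

    counts: observed counts per bin for one gene.
    shapes: dict mapping shape name -> bin probabilities.
    prior:  dict mapping shape name -> prior probability (uniform if None).
    eps:    assumed floor on bin probabilities, to avoid log(0).
    """
    names = list(shapes)
    if prior is None:
        prior = {name: 1.0 / len(names) for name in names}

    log_post = {}
    for name in names:
        probs = [max(p, eps) for p in shapes[name]]
        total = sum(probs)  # renormalise after flooring
        # Multinomial log-likelihood, dropping the constant shared by all shapes.
        loglik = sum(c * math.log(p / total) for c, p in zip(counts, probs))
        log_post[name] = math.log(prior[name]) + loglik

    # Normalise with the log-sum-exp trick for numerical stability.
    top = max(log_post.values())
    weights = {name: math.exp(v - top) for name, v in log_post.items()}
    z = sum(weights.values())
    return {name: w / z for name, w in weights.items()}

# Hypothetical example data: 8 bins spanning [-1, 1].
shapes = {
    "shape2": [0.125] * 8,
    "shape3": [0.30, 0.25, 0.20, 0.10, 0.06, 0.04, 0.03, 0.02],
    "shape4": [0.02, 0.05, 0.10, 0.20, 0.30, 0.20, 0.10, 0.03],
}
counts = [2, 5, 10, 20, 30, 20, 10, 3]  # made-up counts for gene X

post = shape_posteriors(counts, shapes)
best = max(post, key=post.get)
```

The resulting probabilities are conditional on the set of candidate shapes and on the prior: they say which of the listed shapes is most plausible, not whether any of them fits well.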

score 0 · Answer 1 · answered Aug 05 '19 at 09:30

Have you looked into cross-entropy-based methods? This question could be a starting point for an intuitive understanding of the concept:

Intuitively, why is cross entropy a measure of distance of two probability distributions?

And this paper http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.462.80&rep=rep1&type=pdf if you want to go deeper.
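As a minimal sketch of what a cross-entropy comparison could look like for this problem, the snippet below scores one gene's binned distribution against each candidate shape and picks the shape with the lowest cross-entropy. It is written in Python for illustration (the same arithmetic is straightforward in R), and the helper names, the `eps` floor and the example numbers are all assumptions, not something taken from the question or the linked paper.

```python
import math

def cross_entropy(p, q, eps=1e-9):
    """H(p, q) = -sum_i p_i * log(q_i), in nats.

    p: empirical bin proportions for one gene.
    q: bin probabilities of a candidate shape.
    eps: assumed floor on q_i so that empty shape bins do not give log(0).
    """
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def closest_shape(counts, shapes):
    """Return the shape with the lowest cross-entropy, plus all scores."""
    total = sum(counts)
    p = [c / total for c in counts]
    scores = {name: cross_entropy(p, q) for name, q in shapes.items()}
    best = min(scores, key=scores.get)
    return best, scores

# Hypothetical example data: 8 bins spanning [-1, 1].
candidate_shapes = {
    "shape2": [0.125] * 8,
    "shape3": [0.30, 0.25, 0.20, 0.10, 0.06, 0.04, 0.03, 0.02],
    "shape4": [0.02, 0.05, 0.10, 0.20, 0.30, 0.20, 0.10, 0.03],
}
gene_counts = [3, 6, 9, 21, 28, 19, 11, 3]  # made-up counts for gene X

best_shape, ce_scores = closest_shape(gene_counts, candidate_shapes)
```

Because the gene's own distribution p is fixed, ranking shapes by cross-entropy is the same as ranking them by the Kullback-Leibler divergence KL(p || q). The score is a distance-like measure for assigning genes to shapes, not a p-value.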
