
I was wondering if it might be possible to generate two correlated Beta random variables.

In other words, I want to generate two Beta random variables that can be said to have come from two Beta distributions whose correlation is known to be "$\rho =$ some number".

In R, I tried a simple linear transformation approach to achieve this:

set.seed(0)
X1 = rbeta(1e4, 5, 1) ; X2 = rbeta(1e4, 6, 1) ;  X3 = rbeta(1e4, 7, 1) ; a = -.5

Y1 = X1 + (a*X2) ; Y2 = X2 + (a*X3) ## Y1 and Y2 are meant to be correlated

cor(Y1, Y2)

plot(Y1~Y2, col = densCols(Y1, Y2) ) ; abline(lm(Y1~Y2), lty = 3, lwd = 2)

But the linear-transformation approach above is entirely unprincipled. If you change a to .5, the supposedly Beta random variables exceed 1:

[plot: Y1 against Y2 for a = .5, with values exceeding 1]

P.S. There is apparently a principled way to do this as shown HERE.

rnorouzian
  • Do you want to know the correlation between them, or you just want them to be correlated in some way? – Greenparker Jun 13 '17 at 15:14
  • There is nothing preventing Y1 or Y2 from being outside $(0,1)$, so they cannot be beta distributed? – Christoph Hanck Jun 13 '17 at 15:17
  • No, I haven't. Please summarize what you take away from that paper! – Christoph Hanck Jun 13 '17 at 15:21
  • @rnorouzian Can you explain why you think the method is not stable? Where are you seeing the problem? – Greenparker Jun 13 '17 at 15:23
  • @rnorouzian If you want people to solve this problem with minimal effort for you, then show them this behavior. Display the plots, and clearly state the problems you are facing. – Greenparker Jun 13 '17 at 15:27
  • Exactly how do you want the variables to be correlated? There are myriad ways to construct them, so we need further restrictions. A general method is to use a copula; another method would be to select a suitable Dirichlet distribution. – whuber Jun 13 '17 at 15:53
  • @whuber, I just need to generate two Beta random variables that can be said to have come from two Beta distributions whose correlation is known to be "$\rho =$ some number". – rnorouzian Jun 13 '17 at 15:58
  • Do you care about the parameters of those variables or not? What ranges of $\rho$ do you need to handle? Are you concerned only about two Beta variables or will you be working with more than two? – whuber Jun 13 '17 at 16:00
  • @whuber, (1) Regarding shape parameters, I want to obtain the mean and variance of each mother Beta distribution. So, this means the shape parameters are only important in computing these two moments. More specifically, I want the two variances of the mother Beta distributions to be equal but the means to be different. (2) I just need to handle $0.3 \le \rho \le 0.9$. – rnorouzian Jun 13 '17 at 16:07
  • 4
    The interesting paper at https://arxiv.org/pdf/1406.5881.pdf develops a bivariate distribution with beta marginals, that admits of all correlations in $[-1,1]$. They do not give an simulation algorithm but that shouldn't be to difficult. – kjetil b halvorsen Jun 13 '17 at 16:08
  • 1
    @Kjetil That's a nice paper: it outlines several different solutions (and the simulation method will vary with the nature of the solution). But please note that when the Beta distributions have different shapes, it will not be possible to achieve correlations arbitrarily close to $+1$. – whuber Jun 13 '17 at 16:29
  • @whuber, what I need the above for is a virtual (simulated) paired-samples t-test study in my field. I need to produce $n$ proportions for the pre-test stage and $n$ proportions for the post-test stage. But I want to assume that the proportions in the pre-test and the post-test are positively correlated to some known degree in the mother populations. – rnorouzian Jun 13 '17 at 16:39
  • 4
    Proportions don't necessarily have Beta distributions. In fact, for a fixed denominator that would be a mathematical impossibility: the proportion is discrete while a Beta variable is continuous. You could immediately adopt the solution I posted yesterday about generating correlated Binomial variates, because those provide nice models of the counts that you really need for the t-test: you can't (validly) do a t-test on proportions alone. – whuber Jun 13 '17 at 16:54
  • @whuber, I appreciate your comment. In the actual study that I'm trying to simulate, researchers use a measurement tool that only produces proportions, so are you suggesting that two sets of, say, 30 proportions obtained from a pre-test and a post-test cannot be compared with each other using a paired-samples t-test? If so, what alternative test do you recommend? – rnorouzian Jun 13 '17 at 17:08
  • 2
    If those proportions are based on substantially varying denominators, or if some of them approach closely to $0$ or $1$, then the t-test will be suspect because such data constitute prima facie evidence that one or more underlying assumptions of the t-test do not hold. In any event it is far better to base the tests on the actual counts using (say) a generalized linear model. – whuber Jun 13 '17 at 17:13
  • @whuber, wow, you helped me tremendously with this comment. So, generating correlated random data for this situation would be what, binomial? To be clear, the measurement tool in my case is a test comprised of $n$ dichotomously scored items? – rnorouzian Jun 13 '17 at 17:23
  • 3
    Indeed, the method I proposed for generating correlated binomials seems to fit your situation perfectly, because it constructs them as sums of correlated binary ("dichotomously scored") random variables. Thus, all you have to do is decide how much correlation you want among the pre- and post-responses, item by item, and you're on your way. – whuber Jun 13 '17 at 17:41
  • @whuber, thank you so much. Just quickly, which GLM in particular are you suggesting for comparing the two sets of pre- and post-test counts? – rnorouzian Jun 13 '17 at 17:45
  • It sounds like a good setting for applying logistic regression. – whuber Jun 13 '17 at 17:49
  • @whuber, I just posted a question HERE, do you think there is anything I can improve in my question to add more clarity? – rnorouzian Jun 13 '17 at 21:28
  • Related to https://stats.stackexchange.com/questions/441924. – Royi Dec 15 '21 at 08:22
  • @rnorouzian, Your link in the "p. s." remark doesn't work. – Royi Dec 15 '21 at 10:25
  • Related to https://stackoverflow.com/questions/31459291. – Royi Dec 15 '21 at 10:31
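
Following up on whuber's comment above about building correlated Binomial counts from correlated dichotomous items, here is a minimal R sketch of one such construction (my reading of the idea, not necessarily the method in the linked answer). The values of n_items, p, and rho are illustrative assumptions, not numbers taken from this thread; with equal marginal success probabilities, both the item-level and the count-level correlations come out at roughly rho.

## correlated pre/post counts as sums of correlated 0/1 items
## (n_items, p, rho are illustrative assumptions)
set.seed(1)
n_items <- 20    # dichotomously scored items per test
p       <- 0.6   # probability of a correct response
rho     <- 0.5   # target correlation between pre- and post-test counts

one_examinee <- function() {
  shared <- rbinom(n_items, 1, rho)              # 1 = item answered identically pre and post
  b      <- rbinom(n_items, 1, p)                # the shared responses
  pre    <- ifelse(shared == 1, b, rbinom(n_items, 1, p))
  post   <- ifelse(shared == 1, b, rbinom(n_items, 1, p))
  c(pre = sum(pre), post = sum(post))            # correct counts on the two tests
}

counts <- t(replicate(1e4, one_examinee()))
cor(counts[, "pre"], counts[, "post"])           # approximately rho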

1 Answer


As others have said, the number of possible answers here is endless, so let's set aside the question of exactly which joint distribution you end up with. For shift-and-scale (location–scale) families, the regression approach is the easiest way to generate correlated data, but it is not a universal tool. Beta random variables are not a shift/scale family: as you have noticed, if you add a constant to a Beta random variable, it is no longer Beta distributed (and can fall outside $(0,1)$). The copula method is the most theoretically sound approach, and there are several papers, code examples, and R packages for it.
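
For instance, here is a minimal Gaussian-copula sketch in base R (no packages needed). It is only an illustration: the Beta shapes (5, 1) and (6, 1) are borrowed from the question's code, and the latent normal correlation r is an assumed tuning value, because the Pearson correlation of the resulting Beta variables will be close to, but not exactly, r.

## Gaussian-copula sketch; r is an assumed latent correlation to be tuned
set.seed(1)
n  <- 1e4
r  <- 0.7                                # correlation of the latent normals (assumption)
z1 <- rnorm(n)
z2 <- r * z1 + sqrt(1 - r^2) * rnorm(n)  # bivariate normal with correlation r
u1 <- pnorm(z1); u2 <- pnorm(z2)         # Uniform(0, 1) margins that remain dependent
b1 <- qbeta(u1, 5, 1)                    # marginally exactly Beta(5, 1)
b2 <- qbeta(u2, 6, 1)                    # marginally exactly Beta(6, 1)
cor(b1, b2)                              # near, but not exactly, r

Because pnorm of a standard normal variate is Uniform(0, 1), the qbeta step reproduces the Beta margins exactly; only the dependence structure is inherited from the normals.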

In R, a very simple and interesting alternative is just to sort. Back to your case: simulate independent Beta random variables and store them in a matrix; then, for a desired correlation, choose a number of rows roughly proportional to that correlation and jointly sort those rows, column by column. To get a closer match, iteratively sort more or fewer rows until convergence. Because sorting only re-orders the values within each column, the univariate (marginal) distributions are preserved.

The example below is by no means an efficient implementation; it is intended purely as an illustration.

## initial setup
set.seed(123)
n <- 100
x <- matrix(rbeta(n * 2, 2, 2), n, 2)

## user-supplied desired correlation
p <- 0.5

## np is the number of rows to jointly sort; npprev and npnext track convergence.
## Starting np at round(p * n) is just one reasonable guess (an assumption here);
## any starting value between 1 and n would do.
np     <- round(p * n)
npprev <- np
xnew   <- x
xnew[1:np, ] <- apply(xnew[1:np, ], 2, sort)
pest   <- cor(xnew[, 1], xnew[, 2])

## add 1 row to sort if under-correlated, subtract 1 row if over-correlated
npnext <- np + sign(p - pest)

## iterate until an exact match OR an alternating pattern is reached
while (pest != p & npnext != npprev) {
  npprev <- np
  np     <- npnext
  xnew   <- x
  xnew[1:np, ] <- apply(xnew[1:np, ], 2, sort)
  pest   <- cor(xnew[, 1], xnew[, 2])
  npnext <- np + sign(p - pest)
}
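
As a quick sanity check (my addition, not part of the original snippet), after the loop you can verify that the achieved correlation is close to the target and that sorting only re-ordered the draws within each column, leaving both Beta margins untouched:

pest                                   # achieved correlation, close to the target p
all(sort(xnew[, 1]) == sort(x[, 1]))   # TRUE: column 1 is just a permutation of the original draws
all(sort(xnew[, 2]) == sort(x[, 2]))   # TRUE: column 2 is just a permutation of the original draws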

AdamO