3

I need to generate data from a generalized Dirichlet distribution in Python to test my model, but I have no idea how can I proceed with that, can anyone guide me?

enter image description here

Glorfindel
  • 1,118
  • 2
  • 12
  • 18
  • 2
    The Wikipedia article answers this in its introductory paragraph: in its notation, generate the $z_i$ as independent Beta$(\alpha_i,\beta_i)$ variables and recursively compute the $p_i$ (=$x_i$) from them as $p_1=z_1$ and $p_{i+1}=z_{i+1}(1-p_1-p_2-\cdots-p_i).$ – whuber Jul 13 '21 at 15:21
  • Thanks was really helpful. just to make sure z_i is Beta(alpha_i, beta_i), right? basically z is Beta function not Beta distribution, am I right? – user13612169 Jul 13 '21 at 15:56
  • 1
    The $z_i$ are random variables following Beta distributions, as explained in the Wikipedia article. – whuber Jul 13 '21 at 16:13

1 Answers1

2

Because there are several conventions for describing this distribution, let's begin by examining the one used in the question. And since the formula looks a little confusing, it might be clearer to use a specific example.

Taking $n=3$ ought to reveal the pattern. In this case the variable is $(X_1,X_2,X_3).$ It is limited to the simplex where $X_i\ge 0$ and $X_1+X_2+X_3 \le 1.$ There are three alphas and three betas for the parameters. The density is

$$\left[\frac{x_1^{\alpha_1-1}\left(1 - x_1\right)^{\beta_1-(\alpha_2+\beta_2)}}{B(\alpha_1,\beta_1)}\right] \left[\frac{x_2^{\alpha_2-1}\left(1 - x_1-x_2\right)^{\beta_2-(\alpha_3+\beta_3)}}{B(\alpha_2,\beta_2)}\right] \left[\frac{x_3^{\alpha_3-1}\left(1 - x_1-x_2-x_3\right)^{\beta_3-1}}{B(\alpha_3,\beta_3)}\right]$$

Each expression in brackets looks much like a Beta density, suggesting we attempt a change of variables where

$$Y_1 = X_1;\ Y_2(1-X_1)=X_2;\ Y_3(1-(X_1+X_2)) = X_3.$$

It follows

$$\mathrm{d}x_1\,\mathrm{d}x_2\,\mathrm{d}x_3 = (1-y_1)^2(1-y_2)\,\mathrm{d}y_1\,\mathrm{d}y_2\,\mathrm{d}y_3$$

and thus, in terms of the $y_i,$ the density is

$$\left[\frac{y_1^{\alpha_1-1}(1-y_1)^{\beta_1-1}}{B(\alpha_1,\beta_1)}\right] \left[\frac{y_2^{\alpha_2-1}(1-y_2)^{\beta_2-1}}{B(\alpha_2,\beta_2)}\right] \left[\frac{y_3^{\alpha_3-1}(1-y_3)^{\beta_3-1}}{B(\alpha_3,\beta_3)}\right].$$

Because this factors into $n=3$ Beta densities,

$(Y_1,Y_2, \ldots, Y_n)$ is independent with marginal Beta distributions having parameters $(\alpha_1,\beta_1),$ $(\alpha_2,\beta_2), \ldots,$ and $(\alpha_n,\beta_n).$

From such a collection of $(Y_i)$ we can recover a Generalized Dirichlet variable $(X_i)$ recursively via

$$X_1 = Y_1;\ X_2 = Y_2(1-X_1);\ X_3 = Y_3(1-X_1-X_2);\text{ etc.}$$

Thus, the problem is solved by generating univariate Beta values and combining them as shown.

An R implementation is shown below. As a check, it compares several low-order multivariate moments of a large sample (with, in this example, $n=4$) to the theoretical moments. The figure shows good agreement, indicating the implementation is correct. The code will work efficiently with values of $n$ into the thousands (at least) if you suppress the moment calculations (the number of them increases as $n^m$ where $m$ is the order of the moment).

Figure

#
# Generation of `n` random variates.
#
rGDirichlet <- function(n, alpha, beta) {
  d <- max(length(alpha), length(beta))
  Y <- matrix(rbeta(d*n, alpha, beta), d) # Each sample is in a column of Y
  q <- rep(0, n)
  for (i in seq_len(d)) {                 # Convert `Y` to `X` (in place)
    Y[i, ] <- Y[i, ] * (1 - q)
    q <- q + Y[i, ]
  }
  return(Y)
}
#
# Multivariate moments.
#
mGDirichlet <- function(r, a, b, use.log=TRUE) {
  d <- sum(r) - cumsum(r)
  z <- sum(lgamma(a+b) + lgamma(a+r) + lgamma(b+d) - 
              (lgamma(a) + lgamma(b) + lgamma(a+b+r+d)))
  if(!isTRUE(use.log)) z <- exp(z) # The `r` moment of a GDirichlet(a, b) variate
  z 
}
#------------------------------------------------------------------------------#
#
# Random draws.
#
set.seed(17)
alpha <- 0.1 + round(rexp(4, 1/3), 1)            # alpha parameters
beta <- 0.1 + round(rexp(length(alpha), 1/3), 1) # beta parameters
X <- rGDirichlet(1e4, alpha, beta)
#
# Check some low moments for agreement with theory.
#
m <- 3 # Maximum total order of the moments
R <- as.matrix(do.call(expand.grid, lapply(seq_along(alpha), function(i) 0:m)))
R <- R[rowSums(R) <= m, ]
rownames(R) <- apply(R, 1, paste0, collapse="")
Log.Moments <- apply(R, 1, function(r) {
  c(Theoretical=mGDirichlet(r, alpha, beta),
    Empirical=log(mean(apply(X, 2, function(x) prod(x^r)))))
})
plot(t(Log.Moments), 
     pch=21, bg=hsv(rowSums(R)/(m+1), .8, 1, .25), cex=1.5,
     main=paste0("Logs of Moments Up to Total Order ", m))
abline(0:1, col="Gray", lwd=2)
mtext(paste0("alpha = (", paste(signif(alpha, 2), collapse= ", "),
            ")  beta = (", paste(signif(beta, 2), collapse=", "), ")"), side=3)

Edit: Here's what @whuber's answer looks like in Python:

def GDirichlet(n, alpha, beta):
    d = max(len(alpha), len(beta))
    my_generator = np.random.default_rng()
    Y = my_generator.beta(np.array(alpha), np.array(beta), (n, d)).reshape((d, n))
    q = 0
    for i in range(0, d):
        Y[i, :] = Y[i, :] * (1 - q)
        q = q + Y[i, :]
    return Y 
whuber
  • 322,774