4

I need to generate random numbers from 3 correlated distributions. First two of them are lognormal and the final one is normal, i.e. for X, Y and Z I need X and Y to be lognormal, and Z is normal. The correlation matrix is given:

1, -0.8, 0.5
-0.8, 1,   -0.5
0.5,-0.5,  1

For example, the std of X and Y and Z are 0.5, 0.3 and 20. The mean of X and Y can be 1, and 0 for Z. How can I do that?

I know I can use https://stackoverflow.com/questions/16024677/generate-correlated-data-in-python-3-3 to generate 3 normally distributed. If all 3 are lognormal distributions, just simply use np.exp. But how can I apply it to my case?

Now my solution is: we can generate multivariate normal distribution of log(X), log(Y) and Z since these 3 variables are normal. However, we need to carefully calculate the elements of the covariance matrix and the vector of mean to ensure the requirements of the mean and std of X and Y and Z.

For mean vector, it is simply [-np.log(1 + std_X ** 2)/2, -np.log(1 + std_Y ** 2)/2, 0] since EX=EY=1. For covariance matrix, based on my calculation, the covariance between \ln X and \ln Y comes from $cov(\ln X, \ln Y)=\ln (1 + \frac{cov(X,Y)}{EXEY}) $. The covariance between \ln X and Z comes from https://stats.stackexchange.com/a/470146/303835.

But is there any simpler method?

-----advanced version

what if I want X and Y and Z all be AR(1) process? I am running EnKF by perturbing atmospheric forcing. Papers just said they made perturbations like this, but I don't know how do it. The data is hourly precipitation, however they said "The time series correlation (temporal correlation) is applied to a first-order autoregressive model." Table 2 shows that they apply a temporal correlation of 24 hours...(https://doi.org/10.1175/JHM-D-15-0037.1)

Thanks!

Xu Shan
  • 163
  • 2
    I believe that this is a useful question even quite apart from it being asked in terms of Python, since we don't seem to have a question tagged all three of "lognormal-distribution", "random-generation" and "correlation", so close-voters, please hold back. – Stephan Kolassa Nov 27 '23 at 10:52
  • There are already a lot of questions that address using copulas to go from multivariate normal to arbitrary correlated distributions (e.g. here, this one on Python specifically). The Python-specific implementation makes this a programming question in my opinion. – PBulls Nov 27 '23 at 11:41
  • @PBulls The complication is that it's difficult to determine a copula that will produce a desired correlation matrix. That usually requires approximation, a (not very easy) numerical procedure, or even guesswork. Moreover, not all valid correlation matrices can even arise in this situation. For instance, it is impossible for any lognormal distribution to have a correlation of $\pm 1$ with any normal distribution. In the present instance, the flexibility to modify the lognormal shape parameter shows (a) the question is underspecified and (b) there's a relatively simple solution. – whuber Nov 27 '23 at 15:22
  • (My ultimate remarks are predicated on interpreting the "can bes" of the question as mere possibilities rather than requirements. If they are requirements, the question is not underspecified but it's then unclear whether any solution exists at all.) – whuber Nov 27 '23 at 15:24
  • @whuber, what if they are requirements? I am running EnKF by perturbing atmospheric forcing. Papers just said they made perturbations like this, but I don't know how do it (they also omit this technical details... – Xu Shan Nov 27 '23 at 16:25
  • Please edit your question so that it asks what you need to know. Generally, you have specified all the marginal distributions and they are continuous. When $F$ and $G$ are two such distributions, their maximum possible correlation is attained by letting $U$ be a uniform$(0,1)$ variable and defining $X=F^{-1}(U),$ $Y=G^{-1}(U).$ You need to compute the integrals to find that value. Similarly, with $Y=G^{-1}(-U)$ the minimum correlation is attained. For instance, for the $X$ and $Y$ you have specified, the minimum possible correlation is approximately $-0.9448.$ – whuber Nov 27 '23 at 16:36
  • Perhaps the easiest solution would be to run R from Python and use the copula package where you can specify the parameters of a Gaussian copula and arbitrary margins. I don't ofhand know which package runs R from Python, but one definitely exists. $//$ Caveat: depending on the margins, the parameters of a Gaussian copula are not necessarily Pearson correlation. – Dave Nov 27 '23 at 16:40
  • @Dave, Right: that's exactly the problem. But one can perform a numerical search by varying the parameters of the Gaussian copula until the desired correlations are attained. One example of this approach is given at https://stats.stackexchange.com/questions/322841. – whuber Nov 27 '23 at 16:41
  • @whuber, thanks for your comment. Now I edit it. But I have another problem, which is, applying AR1 process to the random numbers...especially the paper said "24 hours" autocorrelation... – Xu Shan Nov 27 '23 at 16:50
  • I'm not posting an answer because I believe you haven't asked the right question. But, briefly, there is a closed formula for the correlation of the exponentials of bivariate Normal variates, enabling you easily to generate $(X,Y)$. From those, using the method I give at https://stats.stackexchange.com/a/313138/919, you can generate Normally-distributed $Z.$ – whuber Nov 28 '23 at 19:59

3 Answers3

2

Strategy:

  1. Calculate the normal mean and variance for the lognormal variables, then simulate the normal variables and calculate $e^\cdot$.
  2. An AR-process with diagonal $\phi$ and noise cov $C_a$ will have variance $C_a / (1-e)$ where $e$ is given below. Therefore, simulate the AR-process with target cov $C(1-e)$ so that the realized AR-cov becomes $C$.
import numpy as np
from numpy import diag, log, exp, ones
G = 10**6
ms  = [1,1,0]
s   = [0.5,0.3,20]
mx  = ms[0]
my = ms[1]
rho = np.array([[1, -0.8, 0.5], [-0.8, 1, -0.5], [0.5, -0.5, 1] ])
C   = diag(s) @ rho @ diag(s)   # Target cov
# AR model: a' = phi a + U[t]
phi = np.array([0.2, 0.30, 0.2]) # Set this to zero for non-ar case
# (a',a)_xy = (phi a_x + X, phi a_y + Y) = [..] = C[X,Y]/(1-phi_x*phi_y)
e = diag(phi) @ ones([len(phi)]*2) @ diag(phi)
Ca = C * (1 - e)      # Target cov for AR
s2w = log(Ca[0,0] / mx**2 +1) # V[W] = V[log(X)]
s2v = log(Ca[1,1] / my**2 +1) # V[V] = V[log(Y)]
m = ms.copy()                 # new mean
m[0] = log(mx) - s2w/2        # E[W] = E[log(X)]
m[1] = log(my) - s2v/2        # E[V] = E[log(Y)]

CC = Ca.copy() # new cov CC[:,0] = CC[:,0] / mx # C[X,Z] = C[W,Z] * mx so div by mx gives new cov CC[0,:] = CC[0,:] / mx CC[0,0] = log(CC[0,0] + 1) CC[:,1] = CC[:,1] / my # C[Y,W] = C[V,W] * my so div by my gives new cov CC[1,:] = CC[1,:] / my CC[1,1] = log(CC[1,1] + 1) CC[0,1] = CC[1,0] = Ca[1,0] / mx / my # C[X,Y] = C[W,Y] * mx = C[W,V] * mx * my U = np.random.randn(G,len(m)) @ np.linalg.cholesky(CC).T + m U[:,0] = exp(U[:,0]) # if phi = 0, U has mean ms and cov C U[:,1] = exp(U[:,1]) # if phi = 0, U has mean ms and cov C

Part 2: Sim AR

a = np.zeros((G,len(m))) # sim AR: for t in range(1, G): a[t] = phi * a[t-1] + (U[t] - ms) a = a + ms # add mean

print(' Simulated Cov / target Cov') print(np.cov(a.T) / C) print(' Simulated mean - target mean') print(a.mean(0) - ms)

The parameters impose restrictions on $\phi$. They must take values so that $C(1 - e(\phi))$ is positive definite.

Edit: Y is now also log normal per the posters request.

If you, for instance, want a temporal correlation such that $Z$ 24 hours ago is correlated to $X,Y$, then you could simply take the simulation above and lag $Z$:

L = 24
ag = a.copy()
ag[:-L, 2] = ag[L:, 2]   # Send simulated z back in time
print(np.corrcoef(ag[L:,0], ag[:-L,2])[1,0]) # corr(ax[t], az[t-24])

This has zero correlation on all horizons except L=24. For better control, and perhaps a more natural decay of correlation, you can specify $\phi$ as a matrix. This makes things slightly more complicated and requires you to calculate the $\phi$-matrix needed for the desired correlation.

The value for $\phi$ depends on what you are trying to simulate. I would make a rough estimate (either by looking at data or numerically estimating) set $\phi$ to similar values. If these values are creating non-PSD Covs, you would have to change your assumptions about the covariance.

Hunaphu
  • 2,212
  • Hi @Hunaphu, thanks for your answer! But could you please add more details on your codes? like how to setup the value of the phi? Btw, what can I do if I want to "apply a temporal correlation for 24 hours"...? shall I make an AR(24)? or to be general how can I do it for AR(n)? Thanks! – Xu Shan Nov 28 '23 at 12:50
  • Btw I found there is a slight typo in the first few lines in my problem...it should be X and Y are of lognormal distributions... – Xu Shan Nov 28 '23 at 13:42
  • Thanks a lot! It works. But is there any way to tune the s[0] and phi[0] as large as possible? I think basically phi[0] determines the correlations between x_[i] and x_[j]. I figured it out "temporal correlation length is 24 hours" means the data is somehow correlated with past 24 data points. I think it's ok to keep phi[1]=phi[2]. I think the only way is to tune the rho matrix. The sign of the rho matrix should be [[1, -, +], [-, 1, -], [+, -, 1] ] but not sure whether is there any way to make phi[1] goes above 0.5 while s[0] is 0.5~1? Thanks! – Xu Shan Nov 29 '23 at 16:28
  • And one more issue, if I tried with s = [2,0.7,20] and phi = np.array([0.9, 0.9, 0.9]), there are some negative values in a[:,0]...I think it is supposed that the AR1 of the lognormal should be positive? – Xu Shan Nov 29 '23 at 16:44
1

To generate random numbers from correlated distributions where two are lognormal and one is normal, and then extend it to an AR(1) process, you can follow these steps:

Step 1: Define Parameters and Correlation Matrix

First, set up the means, standard deviations, and the correlation matrix.

import numpy as np
from scipy.stats import multivariate_normal

Define the means and standard deviations

mean = [1, 1, 0] std_dev = [0.5, 0.3, 20]

Define the correlation matrix

corr_matrix = np.array([[1, -0.8, 0.5], [-0.8, 1, -0.5], [0.5, -0.5, 1]])

Calculate the covariance matrix from the correlation matrix

cov_matrix = np.diag(std_dev).dot(corr_matrix).dot(np.diag(std_dev))

Step 2: Generate Multivariate Normal Distribution

Generate a multivariate normal distribution using these parameters. This will be transformed for the lognormal distributions.

# Generate samples from multivariate normal distribution
mvn_samples = multivariate_normal(mean, cov_matrix).rvs(size=1000)

Step 3: Transform to Lognormal Distributions for X and Y

Transform the first two variables into lognormal distributions.

# Convert to lognormal for X and Y
mvn_samples[:, 0] = np.exp(mvn_samples[:, 0])
mvn_samples[:, 1] = np.exp(mvn_samples[:, 1])

X, Y, Z now contain the desired distributions

X, Y, Z = mvn_samples.T

Step 4: Generate AR(1) Processes for X, Y, and Z

Define a function to generate an AR(1) process and apply it to X, Y, and Z.

def generate_ar1_process(alpha, noise_series):
    series = [noise_series[0]]
    for i in range(1, len(noise_series)):
        series.append(alpha * series[i-1] + noise_series[i])
    return series

alpha = 0.8 # Autoregression coefficient

X_ar1 = generate_ar1_process(alpha, X) Y_ar1 = generate_ar1_process(alpha, Y) Z_ar1 = generate_ar1_process(alpha, Z)

Visualize the distributions and the AR(1) processes.

import matplotlib.pyplot as plt

Plotting the transformed distributions

plt.figure(figsize=(12, 4)) plt.subplot(1, 3, 1) plt.hist(X, bins=30, color='blue', alpha=0.7) plt.title('Transformed X (Lognormal)')

plt.subplot(1, 3, 2) plt.hist(Y, bins=30, color='green', alpha=0.7) plt.title('Transformed Y (Lognormal)')

plt.subplot(1, 3, 3) plt.hist(Z, bins=30, color='red', alpha=0.7) plt.title('Unchanged Z (Normal)')

plt.tight_layout() plt.show()

enter image description here

# Visualization of the AR(1) Processes
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1) plt.plot(X_ar1, color='blue') plt.title('AR(1) Process for X')

plt.subplot(1, 3, 2) plt.plot(Y_ar1, color='green') plt.title('AR(1) Process for Y')

plt.subplot(1, 3, 3) plt.plot(Z_ar1, color='red') plt.title('AR(1) Process for Z')

plt.tight_layout() plt.show()

enter image description here

Note

I'm not 100% sure I got all your requirements right. An optional step you could add at the start of the code is to adjust the mean and covariance matrix calculations. You could do it like so:

def lognormal_params(mean, std_dev):
    var = std_dev ** 2
    mean_log = np.log(mean ** 2 / np.sqrt(var + mean ** 2))
    std_dev_log = np.sqrt(np.log(var / mean ** 2 + 1))
    return mean_log, std_dev_log

Mean and std dev for X and Y (lognormal) and Z (normal)

mean_X, std_X = 1, 0.5 mean_Y, std_Y = 1, 0.3 mean_Z, std_Z = 0, 20

Adjusted mean and std dev for underlying normal distributions of X and Y

mean_X_log, std_X_log = lognormal_params(mean_X, std_X) mean_Y_log, std_Y_log = lognormal_params(mean_Y, std_Y)

Updated means and standard deviations

mean = [mean_X_log, mean_Y_log, mean_Z] std_dev = [std_X_log, std_Y_log, std_Z]

...

Hopefully it gives you some material to solve your problem.

  • Thanks! But will the value of alpha change the mean and std of the X and Y after applying the AR1 process? – Xu Shan Nov 27 '23 at 18:34
  • 1
    The problem, as explained in comments to the question, is that this method does not reproduce the intended correlation matrix, because the transformation involved does not preserve correlations. – whuber Nov 27 '23 at 19:03
0

Based on @Hunaphu's answer, I made a light change and here is a modified version:

import numpy as np
from numpy import diag, log, exp, ones
G = 10**6
ms  = [1,1,0]
s   = [0.5,0.3,20]
mx  = ms[0]
my = ms[1]
rho = np.array([[1, -0.8, 0.5], [-0.8, 1, -0.5], [0.5, -0.5, 1] ])
C   = diag(s) @ rho @ diag(s)   # Target cov
# AR model: a' = phi a + U[t]
phi = np.array([0.9, 0.9, 0.9]) # Set this to zero for non-ar case
# (a',a)_xy = (phi a_x + X, phi a_y + Y) = [..] = C[X,Y]/(1-phi_x*phi_y)
e = diag(phi) @ ones([len(phi)]*2) @ diag(phi)
# Ca = C * (1 - e)      # Target cov for AR
Ca = C * (1)      # Target cov for AR
s2w = log(Ca[0,0] / mx**2 +1) # V[W] = V[log(X)]
s2v = log(Ca[1,1] / my**2 +1) # V[V] = V[log(Y)]
m = ms.copy()                 # new mean
m[0] = log(mx) - s2w/2        # E[W] = E[log(X)]
m[1] = log(my) - s2v/2        # E[V] = E[log(Y)]

CC = Ca.copy() # new cov CC[:,0] = CC[:,0] / mx # C[X,Z] = C[W,Z] * mx so div by mx gives new cov CC[0,:] = CC[0,:] / mx CC[0,0] = log(CC[0,0] + 1) CC[:,1] = CC[:,1] / my # C[Y,W] = C[V,W] * my so div by my gives new cov CC[1,:] = CC[1,:] / my CC[1,1] = log(CC[1,1] + 1) CC[0,1] = CC[1,0] = Ca[1,0] / mx / my # C[X,Y] = C[W,Y] * mx = C[W,V] * mx * my CCa = CC * (1 - e) # Target cov for AR U = np.random.randn(G,len(m)) @ np.linalg.cholesky(CCa).T + m* (1 - phi)

Part 2: Sim AR

a = np.zeros((G,len(m))) # sim AR: for t in range(1, G): a[t] = phi * a[t-1] + (U[t]) a_tmp = a a_tmp[:,0] = np.exp(a[:,0]) a_tmp[:,1] = np.exp(a[:,1])

print(np.min(a_tmp, axis=0))

print(' Simulated Cov / target Cov') print(np.cov(a_tmp.T) / C) print(' Simulated mean - target mean') print(a_tmp.mean(0) - ms)

So basically define the AR1 process of lognormal distribution as:$\ln Y_{t} = \phi * \ln Y_{t-1}+N(\mu, \sigma^2)$ where the $Y_{t}$ follows a lognormal distribution and the $N(\mu, \sigma^2)$ is a normal distribution. So our strategy is to sample from a multivariate normal distribution, then convert it to an AR1 process, finally use np.exp to make it lognormally distributed. The core part is to generate random noise of $N(\mu, \sigma^2)$ to satisfy the certain values of the mean and std of the $Y_{t}$. There are two transformations:

The first one is to convert $\mu_{normal}$ and $std_{normal}$ to the $\mu_{lognormal}$ and $std_{lognormal}$. Then using $\mu_{lognormal}$ and $std_{lognormal}$ we can get the mean and std for $N(\mu, \sigma^2)$.

Xu Shan
  • 163