
I am trying to create a dataset where columns 2, 3, and 4 are correlated (0.98, 0.97, 0.96, respectively) with column 1. Right now I have this code:

    library(MASS)

    X <- mvrnorm(20, mu = c(5, 6), Sigma = matrix(c(1, 0.98, 0.98, 1), ncol = 2), empirical = TRUE)
    cor(X)

    Y <- mvrnorm(20, mu = c(5, 6), Sigma = matrix(c(1, 0.97, 0.97, 1), ncol = 2), empirical = TRUE)
    cor(Y)
    Y <- Y[, 2]

    Z <- mvrnorm(20, mu = c(5, 6), Sigma = matrix(c(1, 0.97, 0.97, 1), ncol = 2), empirical = TRUE)
    cor(Z)
    Z <- Z[, 2]

    data <- cbind(X, Y, Z)
    cor(data)

and it produces this matrix:

                              Y          Z
    1.00000000  0.9800000000 0.0826655886 -0.4293286
    0.98000000  1.0000000000 0.0009559618 -0.5221029
 Y  0.08266559  0.0009559618 1.0000000000  0.1847887
 Z -0.42932859 -0.5221029358 0.1847886713  1.0000000

I would like the final outcome to look like this. It doesn't matter how columns 2, 3, and 4 are correlated with each other (i.e., the X entries below), as long as each has the right correlation with the first column.

  1    0.98   0.97   0.96
 0.98    1      X     X
 0.97    X      1     X 
 0.96    X      X     1

Thanks for your help!

Caeli
  • Your simulations for Y and Z don't take X into account – Firebug Apr 11 '23 at 14:52
    The last part of my post https://stats.stackexchange.com/a/313138/919 in the duplicate thread gives a fully general solution. It doesn't require the variables to be normally distributed. – whuber Apr 11 '23 at 16:00

1 Answer


Method 1

This is essentially an abridged version of the approach presented by @whuber in this answer.

Sample from $X_1 \sim \mathcal N(0, 1)$, then have

$$ \require{cancel} \begin{cases} X_2=f_2X_1+\sqrt{1-f_2^2}\epsilon_2\\ X_3=f_3X_1+\sqrt{1-f_3^2}\epsilon_3\\ X_4=f_4X_1+\sqrt{1-f_4^2}\epsilon_4\\ \end{cases} $$

where the coefficients are $f=(f_2, f_3, f_4) = (0.98,0.97,0.96)$ and the $\epsilon_i$ are white-noise samples from $\mathcal N(0,1)$, independent of $X_1$ and of each other.

It's easy to see that, for $i \in \{2,3,4\}$:

$$\operatorname{Var}(X_i)=f_i^2\cancelto{1}{\operatorname{Var}(X_1)}+(1-f_i^2)\cancelto{1}{\operatorname{Var}(\epsilon_i)}+2f_i\sqrt{1-f_i^2}\cancelto{0}{\operatorname{Cov}(X_1,\epsilon_i)}=\\ f_i^2+(1-f_i^2)=1$$

$$ \operatorname{Cor}(X_1, X_i) = \operatorname{Cov}(X_1,X_i) = E[X_1X_i] - \cancelto{0}{ E[X_1]}\cancelto{0}{E[X_i]}= f_i\cancelto{1}{E[X_1^2]}+\sqrt{1-f_i^2}\cancelto{0}{E[X_1\epsilon_i]}\\ =f_i$$
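A minimal R sketch of Method 1 (variable names are mine). Note that with a finite sample the realized correlations only match the targets in expectation; to hit them exactly you can residualize each $\epsilon_i$ against $X_1$ and rescale, as in whuber's linked answer.

```r
set.seed(42)
n <- 20
f <- c(0.98, 0.97, 0.96)  # target correlations with the first column

x1 <- rnorm(n)

# Each remaining column mixes x1 with independent unit-variance noise,
# weighted so the result still has unit variance
others <- sapply(f, function(fi) fi * x1 + sqrt(1 - fi^2) * rnorm(n))

X <- cbind(x1, others)
cor(X)[1, ]  # approximately (1, 0.98, 0.97, 0.96), up to sampling error
```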

Method 2 (requires specifying the whole correlation matrix):

As @whuber pointed out in the comments, your question only specifies the three correlations involving the first variable. You would need to choose the remaining three correlations in such a manner that $R$ is positive definite. A solution without this caveat is Method 1 above, where you first sample the first variable in isolation and build the others from it.


Given your correlation matrix $R$, you can simulate data from a multivariate normal distribution by:

$$X=Z \cdot R^{1/2},$$

where $Z \sim \text{MVN}(\mathbb 0, \mathbb I)$.

It's easy to see that the covariance of the $X$ is given by

$$\operatorname{Cov}(X)=E[X^\top X] - E[X]^\top \cdot E[X]\\ ={R^{1/2}}^\top E[Z^\top Z]R^{1/2}-R^{1/2}E[Z]^\top \cdot E[Z]R^{1/2}$$

But $E[Z] = \mathbb 0$, while $E[Z^\top Z] = \mathbb I$, so

$$\operatorname{Cov}(X) = {R^{1/2}}^\top\cdot R^{1/2} = R.$$


$R^{1/2}$ is any decomposition of $R$ such that $R={R^{1/2}}^\top R^{1/2}$
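A minimal R sketch of Method 2, using the Cholesky factor as $R^{1/2}$: R's `chol()` returns the upper-triangular $U$ with $R = U^\top U$, which is exactly the convention above. Since the question leaves the correlations among columns 2–4 unspecified, I fill them in with $f_if_j$ (the values Method 1 implies), an assumption that conveniently guarantees $R$ is positive definite.

```r
f <- c(0.98, 0.97, 0.96)

# Fill the unspecified off-diagonal correlations with f_i * f_j
# (Method 1's implied values), which keeps R positive definite
R <- rbind(c(1, f), cbind(f, outer(f, f)))
diag(R) <- 1

U <- chol(R)  # upper triangular, with R = t(U) %*% U

n <- 20
Z <- matrix(rnorm(n * 4), ncol = 4)  # iid standard normal draws
X <- Z %*% U
cor(X)  # close to R; equal to R only in expectation
```

For correlations that are exact in-sample rather than in expectation, `MASS::mvrnorm(n, rep(0, 4), Sigma = R, empirical = TRUE)` does the same job.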

Firebug
    Isn't part of the problem the fact that $R$ is not fully given? Indeed, some possible values for the missing correlations are mathematically invalid, so one cannot just plug them in arbitrarily. – whuber Apr 11 '23 at 16:01
  • Absolutely true @whuber, I skimmed that part of the question – Firebug Apr 12 '23 at 08:19