5

I'm trying to generate correlated data (preferably multinormal) with predefined correlations (e.g. 0.35 or 0.9). Any idea how I can do it? I'm using R and I did find a way to generate this (using mvrnorm), but you need to supply a covariance matrix. I have a covariance matrix with correlations around 0.9; however, I don't know how I can modify its entries to change the correlation. If I can do that, I'll be able to generate correlated data with the correlations I need.

Regards,

Jawad
  • You just need to play with the values in the covariance matrix in mvrnorm and relate them with the definition of correlation matrix. –  Apr 23 '12 at 12:00
  • If you post the code for your covariance matrix we can tell you how to modify it to get other correlations. – MånsT Apr 23 '12 at 12:11
  • Procrastinator, I can't just change the values in the matrix to whatever I want; changing any number has an effect on other entries in the matrix, and I must know how the other entries change (increase or decrease) before changing anything. For example, changing the variance of any variable will change its covariance with the other variables. – Jawad Apr 23 '12 at 12:30
  • MansT, there is no code for the covariance matrix, I have a table of correlated data which I read into R and pass it to mvrnorm to calculate the means and the covariance matrix in order to generate more correlated data based on the original one. I can post the matrix, but I'm not sure how that affects the method that I should use to create a covariance matrix with pre-determined correlations. There should be a way to do this regardless of what I already have. – Jawad Apr 23 '12 at 12:34
  • So is your question how you can obtain a correlation matrix given a covariance matrix? – MånsT Apr 23 '12 at 13:02
  • No, the question is still the same: how to obtain correlated data with pre-defined correlations? The fact that I have a covariance matrix at hand has nothing to do with it. If I can find a way to modify this matrix correctly, I can use the mvrnorm function in R to obtain the correlated data. – Jawad Apr 23 '12 at 13:18
  • The correlation between $X_i$ and $X_j$ is given by $$ Cor(X_i,X_j) = \frac{Cov(X_i, X_j)}{sd(X_i)sd(X_j)}. $$ If your covariance matrix is $V$, this is $$ Cor(X_i,X_j) = \frac{V_{ij}}{\sqrt{V_{ii}}\sqrt{V_{jj}}}. $$ Maybe this can help you set up your covariance matrix, especially if you are able to simplify your problem by standardizing each variable. – Erik Apr 23 '12 at 13:25
  • @MånsT is right. You want correlated data; that means you want data with a specified correlation matrix. The function for generating those data requires you to input a covariance matrix. Thus, what you need to know is how to get the covariance matrix that corresponds to the correlation matrix you're interested in. To do this, you use Erik's formulas. Start w/ what SD's you want for each variable; square them to get the variances; given that you know the correlation you want & now you have the variances, elementary algebra lets you solve for the covariances & you're done. – gung - Reinstate Monica Apr 23 '12 at 13:42
  • Sorry, the way I phrased part of that last bit might be misleading. W/ @Erik's formulas you solve for the covariances using the correlations you want & the SD's you want--you only use the variances to plug in the diagonal elements of the covariance matrix. – gung - Reinstate Monica Apr 23 '12 at 13:51
  • Thanks Erik and Gung, I'm already aware of the correlation formula, I thought there is another way to do this without working backwards from the formula. – Jawad Apr 23 '12 at 14:01
  • This question has been discussed on here before. For example, look here: http://stats.stackexchange.com/questions/13382/how-to-define-a-distribution-that-correlates-with-a-draw-from-another-distributi/13384#13384 – Macro Apr 23 '12 at 14:52
  • This has been answered in https://stackoverflow.com/a/44930649/1297830. The trick is to use MASS::mvrnorm(..., empirical=TRUE) – Jonas Lindeløv Aug 30 '18 at 08:08
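
The recipe described in the comments above (pick standard deviations, pick correlations, solve Erik's formula for the covariances) can be sketched in R as follows. This is only an illustration: the values in `sds`, `R_target` and `mu` are made up, and it assumes the MASS package is installed (it ships with standard R distributions). The last step uses `empirical = TRUE`, as suggested in the final comment, so that the sample correlations match the targets rather than merely approximating them.

```r
library(MASS)  # provides mvrnorm()

# Hypothetical targets: standard deviations and correlations for 3 variables
sds <- c(2, 5, 0.5)
R_target <- matrix(c(1.00, 0.35, 0.90,
                     0.35, 1.00, 0.35,
                     0.90, 0.35, 1.00), nrow = 3, byrow = TRUE)

# Cov(X_i, X_j) = Cor(X_i, X_j) * sd(X_i) * sd(X_j);
# the diagonal becomes sd(X_i)^2, i.e. the variances
Sigma <- outer(sds, sds) * R_target

# Sanity check: converting back recovers the target correlations
stopifnot(isTRUE(all.equal(cov2cor(Sigma), R_target)))

# empirical = TRUE makes mu and Sigma the sample mean and covariance
mu <- c(0, 10, -1)  # hypothetical means
x <- mvrnorm(n = 200, mu = mu, Sigma = Sigma, empirical = TRUE)
round(cor(x), 2)
```

To keep the variances of an existing covariance matrix `S` and change only the correlations, the same formula applies with `sds <- sqrt(diag(S))`.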

2 Answers

3

The MASS package has a function called mvrnorm() that can generate a group of random numbers with a specified level of correlation. An example of the setup can be found at the beginning of the example here: http://menugget.blogspot.de/2011/11/propagation-of-error.html

  • Sorry, didn't see that Jawad had already pointed you to the same function. In any case, the example posted might help you understand how to set it up. – Marc in the box Apr 23 '12 at 12:47
  • Thanks Marc, from the page I understand that all I have to do is set the diagonal elements of my covariance matrix to rho and the off-diagonal elements to 1 and I should get the data I need correlated by rho? – Jawad Apr 23 '12 at 13:16
  • Not exactly - the covariance matrix will depend on your defined standard deviations. If sd=1 for all series, then you are correct. Otherwise, you will need to define your std. devs for each series. – Marc in the box Apr 23 '12 at 13:26
  • No. The variances of the variables should be along the diagonal and the off-diagonal elements should be rho (if $\sigma^2=1$). – MånsT Apr 23 '12 at 13:38
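
As a rough illustration of the unit-variance case settled in the comments above (ones on the diagonal, rho off the diagonal), the following sketch draws from mvrnorm() and inspects the sample correlations. The choices of `rho`, `p`, `n` and the seed are arbitrary assumptions; without `empirical = TRUE` the sample correlations only approximate rho.

```r
library(MASS)  # provides mvrnorm()

set.seed(1)   # arbitrary seed, for reproducibility only
rho <- 0.35   # target correlation between every pair of variables
p   <- 3      # number of variables

# Unit variances on the diagonal, rho everywhere else
Sigma <- matrix(rho, nrow = p, ncol = p)
diag(Sigma) <- 1

x <- mvrnorm(n = 10000, mu = rep(0, p), Sigma = Sigma)

# Sample correlations should be close to rho, up to sampling error
round(cor(x), 2)
```

With standard deviations other than 1, the diagonal would hold the variances and each off-diagonal entry would be rho times the product of the two standard deviations.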
3

Actually this is a trap question: it sounds easy but it is not (+1). The short answer to your question is you can't.

I will give an example. Imagine you have 3 Gaussian variables $X_1, X_2$ and $X_3$. You want the correlation between $X_1$ and $X_2$ to be 0, and all correlations with $X_3$ to be 1. This is obviously impossible because $X_1 = X_3$ and $X_2 = X_3$ together imply that $X_1 = X_2$ (up to shifting and scaling), which contradicts the assumption that $X_1$ and $X_2$ are uncorrelated!

You would have the same situation if you replaced 0 by "close to 0" and 1 by "close to 1" in the previous example. The issue here is that not every matrix is a correlation matrix. A correlation matrix must be symmetric and positive semidefinite, with ones on the diagonal.

You cannot choose arbitrary correlation values, but you can check whether they define a valid correlation matrix. Say that you have a symmetric square matrix mat with the required correlation coefficients. You can test that it is positive semidefinite as shown below.

all(eigen(mat)$values >= 0)

For symmetric real matrices, being positive semidefinite is equivalent to having all eigenvalues nonnegative (and being positive definite to having all eigenvalues strictly positive).
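
To make the check concrete, here is a small sketch that applies it to the impossible three-variable example from this answer and to a matrix with all correlations equal to 0.9. The helper `is_valid_cor` and its tolerance `tol` are hypothetical names introduced only for illustration; the tolerance is there because computed eigenvalues of a singular matrix can come out very slightly negative.

```r
# Hypothetical helper: TRUE if mat is symmetric, has a unit diagonal
# and has no eigenvalue below -tol
is_valid_cor <- function(mat, tol = 1e-8) {
  is.matrix(mat) && nrow(mat) == ncol(mat) &&
    isSymmetric(unname(mat)) &&
    all(abs(diag(mat) - 1) < tol) &&
    all(eigen(mat, symmetric = TRUE, only.values = TRUE)$values >= -tol)
}

# Cor(X1, X2) = 0 while both correlate perfectly with X3: not achievable
bad <- matrix(c(1, 0, 1,
                0, 1, 1,
                1, 1, 1), nrow = 3, byrow = TRUE)

# All pairwise correlations equal to 0.9: achievable
good <- matrix(0.9, nrow = 3, ncol = 3)
diag(good) <- 1

is_valid_cor(bad)   # FALSE: the eigenvalues are 1 + sqrt(2), 1 and 1 - sqrt(2)
is_valid_cor(good)  # TRUE: the eigenvalues are 2.8, 0.1 and 0.1
```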

gui11aume
  • It might be good to make the inequality in the code nonstrict to allow for perfect correlations between linear combinations of variables. – cardinal Jun 03 '12 at 14:41
  • @cardinal Done. But that is purely for demonstration purposes. Testing strict equality of real numbers is something R cannot do as (.3-.2) == (.2-.1) shows. – gui11aume Jun 03 '12 at 14:47
  • Good point; it was actually the larger conceptual point I was trying to address. That "limitation" has more to do with floating point representation, than R itself, though. Testing against zero is a bit special. Some related routines in R will truncate small values to zero if they fall below a tolerance. – cardinal Jun 03 '12 at 14:54
  • @cardinal 'That "limitation" has more to do with floating point representation, than R itself, though' Yes of course. Apologies to the R team :-) – gui11aume Jun 03 '12 at 15:01
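
The floating point issue raised in these comments can be reproduced with the expression quoted above. The sketch below contrasts exact comparison with a tolerance-based one using `all.equal()` from base R; the variable names `a` and `b` are introduced here only for illustration.

```r
a <- 0.3 - 0.2
b <- 0.2 - 0.1

# Exact comparison fails because neither difference is exactly 0.1 in binary
a == b                    # FALSE

# Comparison up to a small numerical tolerance succeeds
isTRUE(all.equal(a, b))   # TRUE
```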