Generating and Working with Random Vectors in R

Question

I've been trying to understand random vectors and to generate them in R so that I can reproduce their properties. I recently asked a similar question, which was rightly put on hold for being too general. Thanks to this excellent video: https://youtu.be/uPRatm70noI I've been able to get the following results:

Narrowing the task down to the bivariate normal case: each component $X_i$ of the random vector $\textbf{V}= \begin{bmatrix}X_1, X_2\end{bmatrix}^{T}$ follows a $N(\mu_i,\sigma_i^2)$ marginal distribution, and the joint distribution has the pdf $f(V)=\frac{1}{2\pi\sqrt{|C|}}\,e^{-\frac{1}{2}[V - \mu]^{T} C^{-1}[V - \mu]}$. Here $C$ is a covariance matrix chosen a priori and decomposed as $C=U\Sigma U^{T}$, where $U$ is the matrix of eigenvectors and $\Sigma$ the diagonal matrix of eigenvalues (for a symmetric positive semi-definite matrix, the singular value decomposition coincides with this eigendecomposition). With this decomposition we can obtain random vectors starting from standard normal random samples.

Establishing $C$ for example as:

C
     [,1] [,2]
[1,]   20    8
[2,]    8   20

and $\mu$ as c(3, -5), and starting with the random vector $N(0,I)$ simulated by two random samples of $10^5$ observations of the standard normal:

vec1 <- rnorm(1e5)
vec2 <- rnorm(1e5)
Sv <- rbind(vec1, vec2) # Sv ~ "standard random vector"

the desired simulated random vector can be obtained utilizing the identities:

$C=U\Sigma U^{T}=(U\Sigma^{1/2})( \Sigma^{1/2}U^{T}) =AA^{T}$, and

$V = A S_{v}+\mu$.
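To see why these identities give the desired distribution, here is a short check of the mean and covariance of $V$. It assumes only that $S_v$ has mean $0$ and covariance $I$, as is the case for independent standard normal components:

$$E[V] = A\,E[S_v] + \mu = \mu, \qquad \operatorname{Cov}(V) = A\,\operatorname{Cov}(S_v)\,A^{T} = A\,I\,A^{T} = AA^{T} = C.$$

Because $V$ is an affine transformation of a Gaussian vector, it is itself Gaussian, so $V \sim N(\mu, C)$.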

The initial multiplication $\Sigma^{1/2}S_{v}$ results in stretching of the bivariate standard distribution, followed by a rotation caused by $U$ in $U\Sigma^{1/2}S_{v}$, and ending in a translation as a result of the addition of $\mu$. Graphically:

[Figure: the stretching, rotation, and translation steps; image not reproduced]

Ultimately we end up with a large $2 \times 10^5$ matrix of about 1.5 MB representing the random vector. This matrix corresponds to $\textbf{V}= \begin{bmatrix}X_1, X_2\end{bmatrix}^{T}$, with the first row being $X_1 \sim N(3,20)$ and the second $X_2 \sim N(-5, 20)$.
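The whole manual procedure can be sketched in base R as follows. This is a minimal illustration of the steps described above, not the exact code used for the original post; the seed is arbitrary, and the variable names `e`, `A`, and `V` are my own choices.

```r
set.seed(1)                       # arbitrary seed, only for reproducibility

C  <- matrix(c(20, 8, 8, 20), 2)  # target covariance matrix
mu <- c(3, -5)                    # target mean vector
n  <- 1e5

# Standard normal "input" vector: a 2 x n matrix, one column per draw
Sv <- rbind(rnorm(n), rnorm(n))

# Eigendecomposition C = U Sigma U^T, then A = U Sigma^(1/2)
e <- eigen(C, symmetric = TRUE)
A <- e$vectors %*% diag(sqrt(e$values))

# V = A Sv + mu; mu is recycled down each column of the 2 x n matrix
V <- A %*% Sv + mu

rowMeans(V)   # should be close to c(3, -5)
cov(t(V))     # should be close to C
```

Note that `V` is stored with one draw per column, so it has to be transposed before calling `cov()`.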

Evidently, this is a cumbersome process just to simulate a sample from $\textbf{V}$ in the minimal setting of only two variables.

So the question is whether there is an easier, less unwieldy way to generate random vectors in R (with this particular joint distribution, or in general).

EDIT: As reflected below, the computation time is not an issue, unless you intercalate plotting code as I did. I have, hence, erased the parts of the initial post that made reference to time.

How much faster do you want it? I just tried this in R and all of the calculations were nearly instantaneous. — shadowtalker, May 02 '15 at 07:22
I made a mistake intermixing plotting code between different mathematical steps. Unaware of packages that Glen and @ssdecontrol nicely point out, I falsely equated time with code complexity. I will be happy to edit my question to reflect all this, or erase it if Glen, as the moderator, feels it's best. — Antoni Parellada, May 02 '15 at 15:49

score 5 · Accepted Answer · answered May 02 '15 at 07:30

The mvtnorm package in R has the rmvnorm function (analogous to rnorm) that produces arbitrary-dimensional Gaussian random variables. It also provides the option to use three different algorithms. A quick comparison using your exact setup:

library(mvtnorm)
library(microbenchmark)
sigma <- matrix(c(20, 8, 8, 20), 2)
mu <- c(3, -5)
microbenchmark(v1 <- rmvnorm(1e5, mu, sigma, "eigen"),
               v2 <- rmvnorm(1e5, mu, sigma, "svd"),
               v3 <- rmvnorm(1e5, mu, sigma, "chol"))
# Unit: milliseconds
#                                      expr      min       lq     mean   median
#  v1 <- rmvnorm(1e+05, mu, sigma, "eigen") 19.95751 21.31730 28.14967 21.57772
#    v2 <- rmvnorm(1e+05, mu, sigma, "svd") 19.98124 21.29868 30.23727 21.74448
#   v3 <- rmvnorm(1e+05, mu, sigma, "chol") 19.92971 21.31440 32.01633 21.77176
#        uq      max neval cld
#  22.84293 91.37796   100   a
#  23.23654 89.43729   100   a
#  24.03474 91.22031   100   a

The timings are all about the same, and they're all pretty darn fast. If you need to generate these simulations in significantly less time than that, you'll have to look for a solution in C++ (to interface with Rcpp), C, or Fortran.
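For intuition about what a Cholesky-based method such as "chol" does, here is a rough base-R sketch. It is an illustration under my own assumptions, not the actual mvtnorm source, and `rmvn_chol` is a hypothetical helper name.

```r
# Hypothetical helper (not part of mvtnorm): draw n samples from N(mu, sigma)
# using the Cholesky factor of sigma.
rmvn_chol <- function(n, mu, sigma) {
  R <- chol(sigma)                              # upper triangular, t(R) %*% R == sigma
  Z <- matrix(rnorm(n * length(mu)), nrow = n)  # n x p matrix of standard normals
  sweep(Z %*% R, 2, mu, "+")                    # add mu[j] to column j
}

set.seed(1)                                     # arbitrary seed
sigma <- matrix(c(20, 8, 8, 20), 2)
mu <- c(3, -5)
v <- rmvn_chol(1e5, mu, sigma)

dim(v)        # one row per draw, one column per component
colMeans(v)   # should be close to mu
cov(v)        # should be close to sigma
```

The result has one draw per row, which is also the layout rmvnorm returns.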

@AntoniParellada update: take a look at http://stats.stackexchange.com/q/66610/36229 — shadowtalker, May 02 '15 at 21:23
I closed the post accepting your answer, because clearly you brought to my attention the time issue, and addressed the specific question about an automated package for bivariate normals. Later I posted an answer of my own with mvtnorm that includes all distributions. Take a peek, see what you think. Thanks, again for your help! — Antoni Parellada, May 03 '15 at 05:37

score 4 · Answer 2 · edited Aug 21 '15 at 20:14

In addition to @ssdecontrol's answer, I've been using the MASS package's mvrnorm. Adding to @ssdecontrol's code (with my slower compy):

library(mvtnorm)   
library(MASS)
library(microbenchmark)
sigma <- matrix(c(20, 8, 8, 20), 2)
mu <- c(3, -5)
microbenchmark(v1 <- rmvnorm(1e5, mu, sigma, "eigen"),
               v2 <- rmvnorm(1e5, mu, sigma, "svd"),
               v3 <- rmvnorm(1e5, mu, sigma, "chol"),
               v4 <- mvrnorm(1e5, mu, sigma))


# Unit: milliseconds
# expr                                          min       lq     mean   median       uq       max neval
# v1 <- rmvnorm(1e+05, mu, sigma, "eigen") 37.49799 40.23405 43.08878 42.20849 45.07547  76.43984   100
# v2 <- rmvnorm(1e+05, mu, sigma, "svd")   37.51092 39.18271 44.08090 41.82957 44.20879 206.87745   100
# v3 <- rmvnorm(1e+05, mu, sigma, "chol")  37.40030 39.74741 41.96467 40.84335 43.63740  50.37007   100
# v4 <- mvrnorm(1e+05, mu, sigma)          36.78208 37.73462 40.67353 39.23602 41.96271  89.75172   100

Note that mvrnorm only uses the eigendecomposition.
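A short usage sketch for mvrnorm, in case it helps. It assumes the MASS package is available (it is normally installed with R as a recommended package); the seed and the names `v4` and `v5` are arbitrary.

```r
library(MASS)

set.seed(1)                          # arbitrary seed
sigma <- matrix(c(20, 8, 8, 20), 2)
mu <- c(3, -5)

v4 <- mvrnorm(1e5, mu, sigma)
dim(v4)         # rows are draws, columns are X1 and X2
cov(v4)         # close to sigma, up to sampling error

# empirical = TRUE rescales the sample so that its mean and covariance
# match mu and sigma exactly (up to floating point), not just on average
v5 <- mvrnorm(1000, mu, sigma, empirical = TRUE)
colMeans(v5)
cov(v5)
```

The `empirical = TRUE` option can be handy when a simulation needs the sample moments to hit the targets exactly, at the cost of the draws no longer being independent.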

I would recommend against this, but only because I personally hate "omnibus" packages like MASS. Also, the inconsistent naming of mvrnorm bugs me. — shadowtalker, May 02 '15 at 21:19

score 3 · Answer 3 · edited Aug 21 '15 at 20:15

For completeness' sake, here's a follow-up note on how to generate random vectors regardless of the marginal distribution of the individual components. I'm going to stick with the bivariate case:

1. Generate a bivariate vector from a standard normal distribution with a predetermined correlation. I'll stick with the case initially posted, which used a covariance of 8 as an example. Once the final vector $\textbf{V}=[X_1,X_2]^{T}$ was obtained, the correlation between $X_1$ and $X_2$ was found to be cor(X1,X2) = 0.4015484 (and the covariance, as set up initially, cov(X1,X2) = 8.066535); no seed was set.

We now set.seed(0) and, sticking with a correlation of 0.4, code a correlation matrix as C <- matrix(c(1,0.4,0.4,1), nrow = 2). We are then ready for mvtnorm:

SN <- rmvnorm(mean = c(0,0), sigma = C, n = 1e5) produces two vectors distributed as ~ $N(0, 1)$ with cor(SN[,1],SN[,2]) = 0.3993723 ~ 0.4. Here's the plot with regression line:

2. Use the Probability Integral Transform to obtain a bivariate random vector with marginal distributions ~ $U(0, 1)$ and a similar (slightly attenuated) correlation:

U <- pnorm(SN), so we feed the SN matrix into pnorm to find $\Phi$(SN), the standard normal CDF evaluated at each element. Here cor(U[,1], U[,2]) = 0.3828065 ~ 0.4. And here's the scatterplot with marginal distributions at the edges:

3. Apply the inverse transform sampling method to finally obtain the bivariate vector of similarly correlated points belonging to whichever distribution family is desired.

We can replicate the initial post and end up with two correlated samples from $N(3, 20)$ and $N(-5, 20)$, respectively:

X1 <- qnorm(U[,1], mean = 3, sd = 4.47) and X2 <- qnorm(U[,2], mean = -5, sd = 4.47) (with sd = 4.47 ~ $\sqrt{20}$), which will show a cor(X1,X2) = 0.3993723 ~ 0.4.

However, if the chosen distributions are more dissimilar, the correlation may not be preserved as closely. For instance, let's make the first column of $U$ (U[,1]) follow a Student's $t$ distribution with 3 d.f., and the second an Exponential with $\lambda = 1$:

X1 <- qt(U[,1], df = 3) and X2 <- qexp(U[,2], rate = 1)

The cor(X1,X2) = 0.333598 < 0.4. Here are the respective histograms:
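The three steps can be put together in a short base-R sketch. This is only an illustration: it draws the correlated normals with a Cholesky factor instead of rmvnorm, so the numbers will not match the ones quoted above, and `Rmat` and `rho` are names I made up. It also prints the Spearman rank correlation, which is unchanged by strictly increasing transformations such as pnorm, qt, and qexp, whereas the Pearson correlation is not. For a bivariate normal with correlation $\rho$, the Pearson correlation of the uniform-transformed pair is $(6/\pi)\arcsin(\rho/2)$, about 0.385 for $\rho = 0.4$.

```r
set.seed(0)
n    <- 1e5
rho  <- 0.4
Rmat <- matrix(c(1, rho, rho, 1), nrow = 2)   # target correlation matrix

# Step 1: correlated standard normals (n x 2), via the Cholesky factor of Rmat
SN <- matrix(rnorm(2 * n), ncol = 2) %*% chol(Rmat)

# Step 2: probability integral transform -> U(0, 1) marginals
U <- pnorm(SN)

# Step 3: inverse transform sampling -> arbitrary marginals
X1 <- qt(U[, 1], df = 3)        # Student's t with 3 d.f.
X2 <- qexp(U[, 2], rate = 1)    # Exponential with rate 1

cor(SN[, 1], SN[, 2])                        # Pearson, close to rho
cor(U[, 1], U[, 2])                          # Pearson, close to (6/pi) * asin(rho/2)
cor(X1, X2)                                  # Pearson, not preserved in general
cor(SN[, 1], SN[, 2], method = "spearman")   # rank correlation ...
cor(X1, X2, method = "spearman")             # ... unchanged by the monotone transforms
```

If the goal is to control dependence across very different marginals, targeting a rank correlation may therefore be more natural than targeting the Pearson correlation.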

@gung Not sure what you edited exactly but have no doubt it's for the better. I was just going over my posts, and the figures on this one looked disproportionately big... I didn't mean to recycle the question. Cheers! — Antoni Parellada, Aug 22 '15 at 00:59
If you put a single space before the link to the figure, it nests the figure under the numbered list. Otherwise, they stick out from the list & look incongruous. You can see what was edited by clicking the "edited ____ ago" link above the editor's name / identicon. — gung - Reinstate Monica, Aug 22 '15 at 01:37
@gung Thank you. I will fix on another one if I can find it. — Antoni Parellada, Aug 22 '15 at 01:39
