@whuber 's generation of a random variable with fixed covariance structure

Question

The question refers to @whuber's algorithm to draw a random variable with a given covariance structure to a given set of random variables: https://stats.stackexchange.com/a/313138/3277

The algorithm does exactly what it is supposed to do.

However, I don't grasp yet how and why it does so. Can anyone please clarify WHY the proposal by @whuber actually works.

In particular,

1) one line of his code reads

y.dual <- with(svd(y), (n-1)*u %*% diag(ifelse(d > 0, 1/d, 0)) %*% t(v))

where the inner matrix is a diagonal matrix containing the inverses of all singular values of y.

u contains the left-singular vectors of y. v contains the right-singular values of y.

So what is y.dual ?

2) the final step in calculating z is

z <- y.dual %*% rho + sigma*e

Why do we have to add sigma*e here? And, frankly, why does y.dual %*% rho + sigma*e produce what we are looking for?

I would be grateful for a clarification / explanation of this procedure.

Both questions are (sketchily) answered in my post: y.dual is (proportional to) a pseudo-inverse of y (which you can check by inspecting zapsmall(t(y.dual) %*% y) provided the columns of y are linearly independent) and (2) implements the generalization of the formula for $X_{Y;\rho}$ from ordinary to multiple regression. — whuber, Jan 09 '20 at 17:06
I think we need center = TRUE in y <- scale(y, center=FALSE). In the original answer, the example codes generate Y with rnorm, which has zero means so it works. When I generate Y with rbeta, which follows a beta distribution with a mean greater than 0, the codes fail. After some debugging, using center=TRUE will solve the problem. — abc1m2x3c, Oct 11 '22 at 07:45

score 21 · Answer 1 · answered Jan 09 '20 at 19:13

As a point for further elaboration, here is the explanation in the thread you reference:

If it's mathematically possible, [this method] will find an $X_{Y_1,Y_2,\ldots,Y_k;\rho_1,\rho_2,\ldots,\rho_k}$ having specified correlations [between $X$ and] an entire set of $Y_i$. Just use ordinary least squares to take out the effects of all the $Y_i$ from $X$ and form a suitable linear combination of the $Y_i$ and the residuals. (It helps to do this in terms of a dual basis for $Y$, which is obtained by computing a pseudo-inverse. The following code uses the SVD of $Y$ to accomplish that.)

(I apologize for the reversal from the usual roles of $X$ (which here is treated as a response variable in a multiple regression) and $Y$ (which here is a collection of explanatory or regressor variables). This notation was adopted out of respect for the original question.)

The R code given to specify this procedure is sufficiently compact to be reproduced here:

y <- scale(y)             # Makes computations simpler
e <- residuals(lm(x ~ y)) # Take out the columns of matrix `y`
y.dual <- with(svd(y), (n-1)*u %*% diag(ifelse(d > 0, 1/d, 0)) %*% t(v))
sigma2 <- c((1 - rho %*% cov(y.dual) %*% rho) / var(e))
return(y.dual %*% rho + sqrt(sigma2)*e)

Let's go through this recipe step by step.

Taking out the effects of $Y$. Ordinary Least Squares finds a vector $\beta=(\beta_i)$ of $k$ coefficients for which the residuals $$e = X - Y\beta$$ are orthogonal to (have zero covariance with) the vector space spanned by the columns of $Y = \pmatrix{Y_1,&Y_2,&\cdots, & Y_k}.$ (These orthogonality relations are known as the Normal Equations.) The line of code
```
e <- residuals(lm(x ~ y))
```
carries out this calculation.
A linear combination. By definition, a linear combination of $Y$ and the residuals $e$ can be written in terms of a number $\sigma $ and $k$-vector $\alpha$ as $$Z=\sigma e + Y\alpha =\sigma e + \alpha_1 Y_1 + \cdots + \alpha_k Y_k.$$ By virtue of the Normal Equations, the covariance between any $Y_i$ and such a combination is $$\operatorname{Cov}(Z, Y_i) = \sigma \operatorname{Cov}(e,Y_i) + \operatorname{Cov}\left(\sum_{j=1}^k \alpha_j Y_j, Y_i\right) = \operatorname{Cov}\left(\sum_{j=1}^k \alpha_j Y_j, Y_i\right).\tag{1}$$ Ultimately, we want to find coefficients $\alpha_j$ that give the right hand side the right values to achieve the desired correlations $\rho = (\rho_i).$
A dual basis. This problem of finding the $\alpha_j$ is easily solved with a dual basis. Without loss of generality, re-index the $Y_i$ if necessary so that the first $d$ of them are linearly independent and span the same space spanned by all the $Y_i:$ that is, they form a basis. We seek a collection of $d$ vectors $W_j,$ each of which is a linear combination of the $Y_j, j\le d,$ for which $$\operatorname{Cov}(W_i,Y_j) = 0\tag{2}$$ whenever $i\ne j$ and $j \le d.$ We may scale the $W_j$ so that $$\operatorname{Var}(W_j)=1.\tag{3}$$ $(W_j)$ is a dual basis associated with the $Y_i.$

As described in the text, such a matrix $W$ can be found using the Singular Value Decomposition (SVD) $$Y = U\, D\, V^\prime$$ by inverting the diagonal matrix $D$ (leaving any diagonal zeros alone) to create a pseudo-inverse $D^{-}$ whose first $d$ diagonal values are nonzero and forming $$Y^{-} = U\, D^{-}\, V^\prime.$$ You may check that the $d\times d$ block of $$(Y^{-})^\prime Y = (V\, D^{-}\, U^\prime)\, (U\, D\, V^\prime) = V\, I^{-}\, V^\prime = \pmatrix{\mathbb{I}_d & * \\ * & *}$$ is the identity matrix, which is just a compact way of expressing formulas $(2)$ and $(3)$ simultaneously. This is what the code
```
y.dual <- with(svd(y), (n-1)*u %*% diag(ifelse(d > threshold, 1/d, 0)) %*% t(v))
```
does. It turns out that this formula for $W$ works even when the first $d$ columns of $Y$ are not linearly independent. (Thinking of SVD in terms of geometric operations makes this apparent.) This obviates any need to execute the preliminary re-indexing step of the $Y_i.$
A suitable linear combination. Returning to $(1),$ express the linear combination $Y\alpha$ in terms of the dual basis, $$Y\alpha = \sum_{i=1}^d \gamma_jW_j.$$ This re-expresses $(1)$ as $$\operatorname{Cov}\left(\sum_{j=1}^k \alpha_j Y_j, Y_i\right) = \operatorname{Cov}\left(\sum_{j=1}^d \gamma_jW_j, Y_i\right) = \sum_{j=1}^d \gamma_j \operatorname{Cov}\left(W_j, Y_i\right) = \gamma_i.$$ Consequently, using a standard formula for correlations, the $d$ equations we need to solve are $$\rho_i = \operatorname{Cor}(Z, Y_i) = \frac{\operatorname{Cov}(Z, Y_i)}{\operatorname{sd}(Z)\operatorname{sd}(Y_i)} = \frac{\gamma_i}{\operatorname{sd}(Z)\operatorname{sd}(Y_i)}.\tag{4}$$
The final step. As a preliminary step, we may as well have scaled the $Y_i$ to give them unit standard deviations. (The line of code y <- scale(y, center=FALSE) does that.) We're going to make sure $Z $ also has a unit standard deviation, so that the previous formula $(4)$ actually gives the correlation. That is, the solution by all rights ought to be $\gamma_i = \rho_i.$ If that's possible, then $$1 = \operatorname{Var}(Z) = \sigma ^2 \operatorname{Var}(e) + \operatorname{Var}(W\rho).$$ Consequently, assuming the residual vector $e$ is nonzero, $$\sigma ^2 = \frac{1 - \operatorname{Var}(W\rho)}{\operatorname{Var}(e)} = \frac{1 - \rho^\prime \operatorname{Cov}(W)\rho}{\operatorname{Var}(e)}.$$ This formula (omitted in the present question) is implemented in the code as
```
sigma2 <- c((1 - rho %*% cov(y.dual) %*% rho) / var(e))
```
Finally, the solution $Z = \sigma e + Y\alpha = \sigma e + W\gamma = W\rho + \sigma e$ is implemented in the code as
```
y.dual %*% rho + sqrt(sigma2)*e
```

Thank you very much for expanding on this answer! A small question though, in the sketch or compact form of the code you are scaling the matrix y by the standard deviation whereas in the full implementation you are scaling by the root mean squared value (by adding option center = FALSE). Is there a reason for the inconsistency (which is also present in the original question)? — Kuku, Aug 18 '22 at 13:53
Thanks! Lastly, I will cite this in a work and temporarily calling the approach the 'Huber's Dual Basis' approach. Is there a more general and conventional family of methods that this is an example of? — Kuku, Aug 19 '22 at 09:07
@Kuku There is some literature on the geometry of multiple regression that would cover such results. See https://stats.stackexchange.com/a/533996/919. — whuber, Aug 19 '22 at 13:43

@whuber 's generation of a random variable with fixed covariance structure

1 Answers1

Linked