How to create a matrix with a known number of signals for use in PCA testing

Question

I'm looking for a way to create a matrix with a known number of signals plus background error for use in PCA. The following example attempts this by combining signals of known amplitude and then adding error, also of prescribed amplitude. I'm unsure whether, given these definitions, I can say definitively that only X PCs should be significant.

Ideally, I would like to know this number precisely. Do I need to ensure that the true signals are completely orthogonal to one another? I'm also unsure whether I have correctly prescribed the signal amplitudes that the PCA should identify.

# make data ---------------------------------------------------------------
set.seed(123456)
m <- 200 # row dimension
n <- 50 # column dimension
s <- 15 # number of signals
d <- rev(seq(s)) # amplitude of signals
e <- 8 # amplitude of error

# make row signals (standardized random walks, one column per signal)
x <- do.call("cbind", lapply(d, function(x){scale(cumsum(rnorm(m)))}))
image(x)
plot(x[,2])

# make column signals
y <- do.call("cbind", lapply(seq(s), function(x){scale(cumsum(rnorm(n)))}))
image(y)
plot(y[,2])

# combine into field (sum of rank-one outer products, weighted by amplitude)
Z <- matrix(0, nrow = m, ncol = n)
for(i in seq(s)){
  tmp <- as.matrix(x[,i]) %*% t(as.matrix(y[,i])) * d[i]
  Z <- Z + tmp
}

# signal variance
sig.var <- apply(scale(Z, center = TRUE, scale = FALSE), 2, function(x){sum((x^2)/(length(x)-1))})

# add error
err <- array(rnorm(length(Z), sd = e), dim = dim(Z))

# error variance
err.var <- apply(scale(err, center = TRUE, scale = FALSE), 2, function(x){sum((x^2)/(length(x)-1))})

# full field
Z <- Z + err
image(Z)

# total variance
tot.var <- apply(scale(Z, center = TRUE, scale = FALSE), 2, function(x){sum((x^2)/(length(x)-1))})
sum(err.var) / sum(tot.var) # ratio of error variance to total variance

# pca ---------------------------------------------------------------------
P <- prcomp(Z)
plot(P$sdev[1:n]^2, log = "y")
sum(P$sdev^2); sum(tot.var) # these two totals should match
abline(v = e-1+0.5, lty = 2, col = 2) # expected signal cut-off
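
On the orthogonality question, here is a minimal sketch of one way to build a noise-free field whose PC variances are known exactly. It is not the construction used above: it swaps the standardized random walks for orthonormal vectors obtained from a QR decomposition, so that `d` becomes the vector of singular values directly. The names `U`, `V`, `Z0`, `sv` and `pc.var` are introduced here only for illustration.

```r
set.seed(1)
m <- 200  # row dimension
n <- 50   # column dimension
s <- 15   # number of signals
d <- rev(seq(s))  # prescribed singular values (assumption: used as amplitudes)

# Orthonormal row signals. Centering the random matrix first keeps every
# column of U orthogonal to the constant vector, so the column centering
# done by prcomp() leaves the field unchanged.
A <- scale(matrix(rnorm(m * s), m, s), center = TRUE, scale = FALSE)
U <- qr.Q(qr(A))

# Orthonormal column signals
V <- qr.Q(qr(matrix(rnorm(n * s), n, s)))

# Noise-free field of rank s
Z0 <- U %*% diag(d) %*% t(V)

sv <- svd(Z0)$d              # singular values of the field
pc.var <- prcomp(Z0)$sdev^2  # PC variances

# The first s singular values should reproduce d, and the first s PC
# variances should equal d^2 / (m - 1); the remaining ones are
# numerically zero.
isTRUE(all.equal(sv[1:s], as.numeric(d)))
isTRUE(all.equal(pc.var[1:s], d^2 / (m - 1)))
```

Adding `err` to `Z0` as in the code above would perturb these values, so the exact correspondence holds only for the noise-free field. Note also that the amplitudes here are on a different scale from the original example, where each signal is standardized to unit standard deviation rather than unit norm.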

How would a "significant" PC be determined? PCA is not ordinarily employed as a hypothesis testing procedure, nor is it generally suitable to use hypothesis tests when carrying it out. If you wish to generate a set of variables ("signals") with specified correlations or covariances, a full solution (with code) is given at https://stats.stackexchange.com/a/313138/919. - whuber, Sep 26 '22 at 15:07
@whuber - Thank you for your comment. I became interested in this topic because of methods like PRESS (link), which appear to do a good job of distinguishing between PCs that carry signal and those that carry noise. In my experience, this is a valuable tool for defining truncation levels compared with more subjective approaches (e.g. a scree plot). I will take a closer look at the data generation code that you linked - thank you. - Marc in the box, Sep 27 '22 at 05:20
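
As a rough illustration of generating data with a specified covariance (this is not the code from the linked answer), the sketch below uses `MASS::mvrnorm` with `empirical = TRUE`, which makes the sample covariance match the target exactly. The target spectrum `lambda`, with `s` signal variances sitting above a flat noise floor of `e^2`, is a hypothetical choice made only for this example, as are the names `Q` and `Sigma`.

```r
library(MASS)  # provides mvrnorm()

set.seed(1)
m <- 200  # observations (rows)
n <- 50   # variables (columns)
s <- 15   # number of signal directions
e <- 8    # noise standard deviation

# Hypothetical target eigenvalues: s signal variances above a flat noise floor
lambda <- c(e^2 + rev(seq(s))^2, rep(e^2, n - s))

# Random orthonormal basis used as the eigenvectors of the target covariance
Q <- qr.Q(qr(matrix(rnorm(n * n), n, n)))
Sigma <- Q %*% diag(lambda) %*% t(Q)

# empirical = TRUE: mu and Sigma describe the sample, not the population
Z <- mvrnorm(m, mu = rep(0, n), Sigma = Sigma, empirical = TRUE)

P <- prcomp(Z)

# The PC variances should reproduce the target eigenvalues
isTRUE(all.equal(P$sdev^2, lambda, tolerance = 1e-6))
```

With `empirical = FALSE` the sample eigenvalues scatter around the target values instead of matching them, which is closer to what a real data set would show and is probably the more relevant case when testing truncation rules.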
