
I'm looking for a way to create a matrix with a known number of signals and a known level of background error for use in PCA. The following example attempts this by combining signals of known amplitude and then adding error, also of prescribed amplitude. I'm unsure whether, given these definitions, I can say definitively that only X PCs are expected to be significant.

Ideally, I would like to know this precisely, but to do so I may need to ensure that the true signals are completely orthogonal to one another. I'm also unsure whether I have correctly prescribed the signal amplitude that the PCA should identify.
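To make the orthogonality idea concrete, here is a minimal sketch of one way the signals might be built so that they are exactly orthonormal, using the QR decomposition of random matrices. The names `U`, `V`, `Z0`, and `sv` are placeholders introduced only for this illustration, and treating `d` as the singular values of the noise-free field is an assumption about what "amplitude" should mean here.

```r
# Sketch (assumption: "amplitude" = singular value of the noise-free field)
set.seed(1)

m <- 200            # row dimension
n <- 50             # column dimension
s <- 15             # number of signals
d <- rev(seq(s))    # prescribed singular values, 15 down to 1

# Centre the random matrices first so the resulting signals have zero mean,
# then take the Q factor of a QR decomposition to get orthonormal columns.
U <- qr.Q(qr(scale(matrix(rnorm(m * s), m, s), scale = FALSE)))  # m x s
V <- qr.Q(qr(matrix(rnorm(n * s), n, s)))                        # n x s

# Noise-free field: Z0 = U diag(d) V'
Z0 <- U %*% diag(d) %*% t(V)

# The first s singular values of Z0 recover d; the remaining ones are ~0
sv <- svd(Z0)$d
all.equal(sv[1:s], as.numeric(d))

# Because the columns of Z0 are already centred, prcomp() gives the same
# values after rescaling: sdev * sqrt(m - 1) matches d for the first s PCs
P0 <- prcomp(Z0)
all.equal(P0$sdev[1:s] * sqrt(m - 1), as.numeric(d))
```

In this construction the noise-free field has exactly `s` non-zero singular values, so any further structure in the PCA of the noisy field would have to come from the added error. Whether a given signal still stands out after the error is added is a separate question that this sketch does not answer.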

# make data ---------------------------------------------------------------
set.seed(123456)

m <- 200 # row dimension
n <- 50 # column dimension
s <- 15 # number of signals
d <- rev(seq(s)) # amplitude of signals
e <- 8 # amplitude of error

# make row signals
x <- do.call("cbind", lapply(d, function(x){scale(cumsum(rnorm(m)))}))
image(x)
plot(x[,2])

# make column signals
y <- do.call("cbind", lapply(seq(s), function(x){scale(cumsum(rnorm(n)))}))
image(y)
plot(y[,2])

# combine into field (sum of rank-one outer products, each weighted by d[i])
Z <- matrix(0, nrow = m, ncol = n)
for(i in seq(s)){
  tmp <- as.matrix(x[,i]) %*% t(as.matrix(y[,i])) * d[i]
  Z <- Z + tmp
}

# signal variance
sig.var <- apply(scale(Z, center = TRUE, scale = FALSE), 2, function(x){sum((x^2)/(length(x)-1))})

# add error

err <- array(rnorm(length(Z), sd = e), dim = dim(Z))

# error variance
err.var <- apply(scale(err, center = TRUE, scale = FALSE), 2, function(x){sum((x^2)/(length(x)-1))})

# full field
Z <- Z + err
image(Z)

# total variance
tot.var <- apply(scale(Z, center = TRUE, scale = FALSE), 2, function(x){sum((x^2)/(length(x)-1))})
sum(err.var) / sum(tot.var)

# pca ---------------------------------------------------------------------
P <- prcomp(Z)
plot(P$sdev[1:n]^2, log = "y")
sum(P$sdev^2); sum(tot.var)
abline(v = e-1+0.5, lty = 2, col = 2) # expected signal cut-off

[Figure: variance of each PC on a log scale, with a dashed vertical line at the expected signal cut-off]

  • How would a "significant" PC be determined? PCA is not ordinarily employed as a hypothesis testing procedure, nor is it generally suitable to use hypothesis tests when carrying it out. If you wish to generate a set of variables ("signals") with specified correlations or covariances, a full solution (with code) is given at https://stats.stackexchange.com/a/313138/919. – whuber Sep 26 '22 at 15:07
  • @whuber - Thank you for your comment. I became interested in this topic because of methods like PRESS (link), which appear to do a good job of distinguishing between PCs that carry signal and those that carry noise. In my experience, this is a valuable tool for defining truncation levels compared with more subjective approaches (e.g. a scree plot). I will take a closer look at the data generation code that you linked - thank you. – Marc in the box Sep 27 '22 at 05:20
