I used the prcomp() function to perform a PCA (principal component analysis) in R. However, the na.action parameter does not seem to work as I expected: it appears to be honored only when prcomp() is called through its formula interface. I asked for help on Stack Overflow, and two users there offered two different ways of dealing with NA values. The problem with both solutions is that whenever a row contains an NA value, the whole row is dropped and not considered in the PCA. My real data set is a 100 x 100 matrix, and I do not want to lose a whole row just because it contains a single NA value.
The following example shows that prcomp() does not return any principal component scores for row 5, because that row contains an NA value.
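d <- data.frame(V1 = sample(1:100, 10), V2 = sample(1:100, 10),
                V3 = sample(1:100, 10))
# Complete data: na.action is ignored by the default (non-formula) method
result <- prcomp(d, center = TRUE, scale. = TRUE, na.action = na.omit)
result$x
# Introduce a single missing value in row 5
d$V1[5] <- NA
# Formula interface: na.omit drops row 5 before the PCA is computed
result <- prcomp(~ V1 + V2, data = d, center = TRUE, scale. = TRUE, na.action = na.omit)
result$x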
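As a quick check of what happens to the incomplete row, here is a minimal sketch comparing na.omit with na.exclude in the formula interface. The seed and the names d_check, res_omit and res_excl are only illustrative. Neither option actually uses row 5 in the PCA; na.exclude just keeps its position in the output and fills it with NA.

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible
d_check <- data.frame(V1 = sample(1:100, 10), V2 = sample(1:100, 10))
d_check$V1[5] <- NA

# na.omit: the incomplete row is removed from the scores altogether
res_omit <- prcomp(~ V1 + V2, data = d_check, center = TRUE, scale. = TRUE,
                   na.action = na.omit)
nrow(res_omit$x)

# na.exclude: the row is still left out of the fit, but its position
# is kept in the scores and padded with NA
res_excl <- prcomp(~ V1 + V2, data = d_check, center = TRUE, scale. = TRUE,
                   na.action = na.exclude)
nrow(res_excl$x)
res_excl$x[5, ]
```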
I was wondering whether I can set the NA values to a specific numerical value, with center and scale set to TRUE, so that prcomp() works and does not remove rows containing NAs, but without influencing the outcome of the PCA.
I thought about replacing each NA value with the median of its column, or with a value very close to 0. However, I am not sure how that would influence the PCA results.
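To make the median idea concrete, here is a minimal sketch of what I have in mind. The helper impute_median and the names d_imp, d_filled and res_imp are hypothetical and only for illustration; I am not claiming this is a statistically sound way to handle the missing values.

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible
d_imp <- data.frame(V1 = sample(1:100, 10), V2 = sample(1:100, 10),
                    V3 = sample(1:100, 10))
d_imp$V1[5] <- NA

# Hypothetical helper: replace NAs in a vector with the median of its observed values
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Apply the helper column by column, then run the PCA on the filled-in data
d_filled <- as.data.frame(lapply(d_imp, impute_median))
res_imp <- prcomp(d_filled, center = TRUE, scale. = TRUE)
nrow(res_imp$x)  # all rows are kept, including row 5
```

Replacing with the column mean instead would put the filled-in entry at exactly 0 after centering, although the standard deviation used for scaling would still be computed with the filled-in value included.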
Can anybody think of a good way of solving that problem?
What do the NA values mean: what is the cause of the "missingness"? – whuber Sep 02 '12 at 18:11