Imputation of missing values for PCA

Question

I used the prcomp() function to perform a PCA (principal component analysis) in R. However, there's a bug in that function such that the na.action parameter does not work. I asked for help on stackoverflow; two users there offered two different ways of dealing with NA values. However, the problem with both solutions is that when there is an NA value, that row is dropped and not considered in the PCA analysis. My real data set is a matrix of 100 x 100 and I do not want to lose a whole row just because it contains a single NA value.

The following example shows that the prcomp() function does not return any principal components for row 5 as it contains a NA value.

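# Complete data: scores are returned for all 10 rows
d       <- data.frame(V1 = sample(1:100, 10), V2 = sample(1:100, 10), 
                      V3 = sample(1:100, 10))
result  <- prcomp(d, center = TRUE, scale = TRUE, na.action = na.omit)
result$x
# Introduce a single missing value in row 5
d$V1[5] <- NA
# Formula interface with na.omit: row 5 is dropped from the scores
result  <- prcomp(~V1+V2, data=d, center = TRUE, scale = TRUE, na.action = na.omit)
result$x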

I was wondering if I can set the NA values to a specific numerical value when center and scale are set to TRUE so that the prcomp() function works and does not remove rows containing NA's, but also does not influence the outcome of the PCA analysis.

I thought about replacing NA values with the median value across a single column, or with a value very close to 0. However, I am not sure how that influences the PCA analysis.

Can anybody think of a good way of solving that problem?

Your problem is not a PCA problem but a wider missing-values treatment problem. If you're not familiar with it, please read up on it. You have many options: (1) delete cases listwise or (2) pairwise, or (3) replace missing values by the mean or median. Or (4) replace them by randomly chosen valid values (hot-deck approach). Or impute missing values by (5) a mutual regression approach (with or without noise addition) or, better, by (6) an EM approach. — ttnphns, Sep 02 '12 at 11:16
As the comments and answers are showing, the key to getting a good answer is to explain what the NA values mean: what is the cause of the "missingness"? — whuber, Sep 02 '12 at 18:11
I think the "pcaMethods" package can solve your problem — ToNoY, Jan 13 '14 at 04:31

Marc in the box · Answer 1 · 2015-05-08T06:26:07.943

32

There is in fact a well-documented way to deal with gappy matrices - you can decompose a covariance matrix $\textbf{C}$ constructed from your data $\textbf{X}$, which is scaled by the number of shared values $n$:
$$ \textbf{C}=\frac{1}{n} \textbf{X} ^ {\text{T}} \textbf{X}, \qquad C_{jl} = \overline{X_{.j}X_{.l}} $$

and then expand the principal coefficients via a least squares fit (as @user969113 mentions). Here's an example.

However, there are several problems with this method relating to the fact that the covariance matrix is no longer positive semidefinite and the eigen/singular values tend to be inflated. A nice review of these problems can be found in Beckers and Rixen (2003), where they also propose a method of optimally interpolating the missing gaps - DINEOF (Data Interpolating Empirical Orthogonal Functions). I have recently written a function that performs DINEOF, and it really seems to be a much better way to go. You could perform DINEOF on your dataset $\textbf{X}$ directly, and then use the interpolated dataset as input into prcomp.
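The pairwise-complete covariance described above can be sketched in base R as follows. This is only an illustration on made-up data (the matrix `X`, its size, and the number of gaps are assumptions, not taken from the answer), and it is not the linked example or the DINEOF function. Each entry of `C` is averaged over the rows where both columns are observed, so the eigenvalues should be inspected for negative values.

```r
set.seed(1)
# Hypothetical toy data: 30 rows, 4 correlated columns, 18 cells set to NA
n <- 30; p <- 4
X <- matrix(rnorm(n * p), n, p) %*% matrix(runif(p * p), p, p)
X[sample(length(X), 18)] <- NA

# Centre each column using its observed values only
Xc <- sweep(X, 2, colMeans(X, na.rm = TRUE))

# W marks observed cells; X0 is the centred data with gaps set to zero
W  <- array(as.numeric(!is.na(Xc)), dim(Xc))
X0 <- replace(Xc, is.na(Xc), 0)

# shared[j, l] = number of rows where columns j and l are both observed
shared <- crossprod(W)
# C[j, l] = mean of Xc[, j] * Xc[, l] over those shared rows
C <- crossprod(X0) / shared

e <- eigen(C, symmetric = TRUE)
e$values   # any negative value means C is not positive semidefinite
```

Because every entry of `C` is based on a different subset of rows, nothing guarantees a valid covariance matrix, which is the problem the answer goes on to describe.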

Update

Another option for conducting PCA on a gappy dataset is "Recursively Subtracted Empirical Orthogonal Functions" (Taylor et al. 2013). It also corrects for some of the problems in the least squares approach, and is computationally much faster than DINEOF. This post compares all three approaches in terms of the accuracy of the data reconstruction using the PCs.

References

Beckers, J.-M., & Rixen, M. (2003). EOF calculations and data filling from incomplete oceanographic datasets. Journal of Atmospheric and Oceanic Technology, 20(12), 1839-1856.

Taylor, M., Losch, M., Wenzel, M., & Schröter, J. (2013). On the sensitivity of field reconstruction and prediction using Empirical Orthogonal Functions derived from gappy data. Journal of Climate, 26(22), 9194-9205.

edited May 08 '15 at 06:26

answered Nov 08 '12 at 08:19


(+1) This looks like a valuable contribution to me, because it is a novel idea. I asked a strikingly similar question long ago: how do you estimate a covariance matrix when data are censored (instead of missing)? If you have any thoughts about that situation, I would be glad of a reply! – whuber Nov 08 '12 at 15:06
Thanks @whuber - I believe this method has a lot of merit as well. Even if you're not interested in the interpolated values, the method is much better at describing EOFs/PCs for a dataset - e.g. the error between the reconstructed data and the original is minimized through the algorithm. – Marc in the box Nov 09 '12 at 07:13
@whuber - Regarding censored data - This is out of my area of expertise and, interestingly, I asked a question in this direction a few weeks ago (which you commented on!). My hunch is that one should fill the zeros with random values below the detection limit, which approximate the distribution of the observed values. I'll be looking into some of the cited literature from your post - this is a very interesting topic indeed. – Marc in the box Nov 09 '12 at 07:18
@whuber - You may be interested in the following paper describing a similar iterative covariance matrix fitting procedure for sparse data: Bien, Jacob, and Robert J. Tibshirani. "Sparse estimation of a covariance matrix." Biometrika 98.4 (2011): 807-820. – Marc in the box Oct 28 '13 at 15:56
Thanks @Marc. Unfortunately censoring and sparseness are two different things with different concerns. – whuber Oct 28 '13 at 17:48
@Marc. Thanks for this summary. Does either of your recent codes deal with complex data? This seems to be envisioned by DINEOF at least. – Eli S Apr 21 '17 at 20:52

Tom Wenseleers · Answer 2 · 2023-06-22T05:27:02.177

A recent paper that reviews approaches for dealing with missing values in PCA is "Principal component analysis with missing values: a comparative survey of methods" by Dray & Josse (2015). Some of the best-known PCA methods that allow for missing values are (1) the NIPALS algorithm (implemented in the pca function of the pcaMethods package with method="nipals" and in the nipals function of the ade4 package), (2) iterative PCA (Ipca or EM-PCA; implemented in the pca function of the pcaMethods package with method="svdImpute" and in the imputePCA function of the missMDA package) and (3) probabilistic PCA (PPCA), a variant of PCA that uses a probabilistic latent variable model and can be fit using the pca function of the pcaMethods package with method="ppca". The paper concluded that the Ipca / EM-PCA method performed best under the widest range of conditions.

For your example, the syntax is:

For NIPALS (you can also use library(pcaMethods) and the pca function with method="nipals"):

library(ade4)
nipals(d[,c(1,2)])

For Ipca / EM-PCA (you can also use library(pcaMethods) and the pca function with method="svdImpute"):

library(missMDA)
imputePCA(d[,c(1,2)],method="EM",ncp=1)
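To make the iterative (EM-style) PCA idea more concrete, here is a rough base-R sketch of the loop: fill the gaps with column means, fit a low-rank PCA, replace only the missing cells with their reconstruction, and repeat until the filled values stop changing. The helper `em_pca_impute` and its arguments are hypothetical names used for illustration. This is not the code inside missMDA or pcaMethods, and it omits the column scaling and regularisation those packages may apply.

```r
# Hypothetical helper, NOT part of missMDA or pcaMethods
em_pca_impute <- function(X, ncp = 1, max_iter = 500, tol = 1e-8) {
  X    <- as.matrix(X)
  miss <- is.na(X)
  # Initial guess: column means of the observed values
  Xhat <- X
  Xhat[miss] <- colMeans(X, na.rm = TRUE)[col(X)[miss]]
  k <- seq_len(ncp)
  for (i in seq_len(max_iter)) {
    mu  <- colMeans(Xhat)
    s   <- svd(sweep(Xhat, 2, mu))
    # Rank-ncp reconstruction of the centred data, shifted back by the means
    fit <- s$u[, k, drop = FALSE] %*% diag(s$d[k], ncp) %*%
           t(s$v[, k, drop = FALSE])
    fit <- sweep(fit, 2, mu, "+")
    change <- sum((Xhat[miss] - fit[miss])^2)
    Xhat[miss] <- fit[miss]   # observed cells are never altered
    if (change < tol) break
  }
  Xhat
}

# Usage on data shaped like the question's example
set.seed(42)
d <- data.frame(V1 = sample(1:100, 10), V2 = sample(1:100, 10),
                V3 = sample(1:100, 10))
d$V1[5] <- NA
completed <- em_pca_impute(d, ncp = 1)
result    <- prcomp(completed, center = TRUE, scale. = TRUE)
nrow(result$x)   # 10: every row, including row 5, now has scores
```

The completed matrix can be passed to prcomp like any other data, but the imputed cells carry no information of their own, so the choice of ncp matters.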

score 8 · Answer 3 · answered Sep 02 '12 at 11:46

My suggestion depends on how much data is missing and why it is missing. But this has nothing to do with PCA, really. If there is very little data missing, then it won't much matter what you do. Replacing with the median isn't ideal, but if there is not much missing, it won't be much different from a better solution. You could try doing PCA with both median replacement and listwise deletion and see if there are major differences in the results.
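The comparison suggested above could look something like the following sketch. The data frame `d` here is simulated purely for illustration (its size, structure, and the positions of the NA values are assumptions), and the two summaries at the end are just one possible way to judge whether the results differ.

```r
set.seed(7)
# Hypothetical toy data with two missing values in V1
d <- data.frame(V1 = rnorm(50), V2 = rnorm(50), V3 = rnorm(50))
d$V2 <- d$V1 + 0.5 * d$V2          # give the data some structure
d$V1[c(5, 17)] <- NA

# Option A: listwise deletion (rows with any NA are dropped)
pca_del <- prcomp(na.omit(d), center = TRUE, scale. = TRUE)

# Option B: replace each NA with the median of its column
d_med <- d
for (j in names(d_med)) {
  d_med[[j]][is.na(d_med[[j]])] <- median(d_med[[j]], na.rm = TRUE)
}
pca_med <- prcomp(d_med, center = TRUE, scale. = TRUE)

# Loadings are unit vectors, so this is |cosine| between matching components;
# values near 1 mean the two approaches found similar directions
loading_agreement <- abs(colSums(pca_del$rotation * pca_med$rotation))
loading_agreement

# Share of variance explained under each approach
rbind(deletion = pca_del$sdev^2 / sum(pca_del$sdev^2),
      median   = pca_med$sdev^2 / sum(pca_med$sdev^2))
```

The absolute value is taken because component signs are arbitrary, so a flipped sign alone should not count as a difference.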

Next, if there is more data missing, you should consider whether it is missing completely at random, missing at random, or not missing at random. I would suggest multiple imputation in the first two cases and some of the time in the third case - unless the data is highly distorted by its NMAR status, I think multiple imputation will be better than listwise deletion (Joe Schafer of Penn State has done a lot of work on missing data - I recall some work of his showing that multiple imputation worked pretty well even in some NMAR cases). However, if the data are MCAR or MAR, the properties of multiple imputation can be proven.

If you do decide to go with MI, one note is to be careful because the signs of the components in PCA are arbitrary, and a small change in the data can flip a sign. Then when you do the PCA you will get nonsense. A long time ago I worked out a solution in SAS - it isn't hard, but it's something to be careful about.
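One possible way to guard against sign flips, sketched here in R rather than SAS and not claimed to be the solution mentioned above, is to align every fit to a reference before combining anything. The helper `align_signs` is a hypothetical name, and the "imputed" datasets below are simply small perturbations of simulated data standing in for real multiple-imputation output.

```r
# Hypothetical helper: flip each component so it points the same way as a reference
align_signs <- function(fit, ref_rotation) {
  flip <- sign(colSums(fit$rotation * ref_rotation))
  flip[flip == 0] <- 1
  fit$rotation <- sweep(fit$rotation, 2, flip, "*")
  fit$x        <- sweep(fit$x, 2, flip, "*")
  fit
}

set.seed(3)
base <- matrix(rnorm(200), 50, 4)
base[, 2] <- base[, 1] + 0.3 * base[, 2]
# Stand-ins for m = 5 imputed datasets (an assumption for illustration only)
imputed <- lapply(1:5, function(i) base + matrix(rnorm(200, sd = 0.05), 50, 4))

fits <- lapply(imputed, prcomp, center = TRUE, scale. = TRUE)
ref  <- fits[[1]]$rotation
fits <- lapply(fits, align_signs, ref_rotation = ref)

# Only after alignment does averaging loadings across imputations make sense
pooled_loadings <- Reduce(`+`, lapply(fits, `[[`, "rotation")) / length(fits)
pooled_loadings
```

This handles sign flips only. If two components have nearly equal variance, their order can also swap between imputations, which would need a separate matching step.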

score 2 · Answer 4 · answered May 02 '20 at 11:11

You could solve the problem of missing values in different ways. I illustrate them below.

You could use the mean of the variable that contains NA values, or impute the missing values with a linear regression.

Alternatively, you could use missMDA followed by FactoMineR, or use pcaMethods. An example follows.

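library(missMDA)
library(FactoMineR)                 # provides PCA()
# Estimate the number of components (sleep data from the VIM package)
nPCs <- estim_ncpPCA(VIM::sleep)
nPCs$ncp
# Output: 3

# Impute the missing values, then run the PCA on the completed data
completed_sleep <- imputePCA(VIM::sleep, ncp = nPCs$ncp, scale = TRUE)
PCA(completed_sleep$completeObs)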

The other example is:

library(pcaMethods)
sleep_pca_methods <- pca(VIM::sleep, nPcs=2, method="ppca", center = TRUE)
imp_air_pcamethods <- completeObs(sleep_pca_methods)

If you'd like to dig deeper into PCA or the FactoMineR package, you should visit its website http://factominer.free.fr/

score 0 · Answer 5 · answered Sep 02 '12 at 10:54


There is no correct solution to the problem. Every coordinate in the vector has to be specified to get the correct set of principal components. If a coordinate is missing and replaced by some imputed value, you will get a result, but it will depend on the imputed value. So if there are two reasonable choices for the imputed value, the different choices will give different answers.

Michael R. Chernick

I just googled for PCA and missing data and found this: "4.2 How does SIMCA cope with missing data? Put simply, the NIPALS algorithm interpolates the missing point using a least squares fit but gives the missing data no influence on the model. Successive iterations refine the missing value by simply multiplying the score and the loading for that point. Many different methods exist for missing data, such as estimation, but they generally converge to the same solution. Missing data is acceptable if they are randomly distributed. Systematic blocks of missing data are problematic." – user969113 Sep 02 '12 at 11:04

I don't know what you mean by no influence on the model. Any choice of missing value for the coordinate will affect the principal components. – Michael R. Chernick Sep 02 '12 at 11:11
