What is the difference between
- normalizing the variables and doing PCA;
- using the scale=TRUE option (without normalizing the variables) in the prcomp function in R?
No difference. Type debug(prcomp) before running prcomp. The third line of the function reads: x <- scale(x, center = center, scale = scale.); i.e. the data are scaled either inside the function, if you set scale = TRUE in the call, or beforehand by you.
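As a quick check of the point above, here is a minimal sketch comparing the two routes on simulated data. The data and the object names (pca_manual, pca_builtin) are made up for illustration. Note that the formal argument of prcomp is spelled `scale.`; writing `scale = TRUE` reaches it through R's partial argument matching.

```r
# Simulated data, for illustration only
set.seed(1)
df <- data.frame(x = rnorm(10, 50, 4), y = rnorm(10, 50, 7))

# Route 1: standardize by hand, then run PCA with no further centering/scaling
pca_manual <- prcomp(scale(df), center = FALSE, scale. = FALSE)

# Route 2: let prcomp standardize internally
pca_builtin <- prcomp(df, scale. = TRUE)

# Both comparisons should report TRUE
all.equal(pca_manual$sdev, pca_builtin$sdev)
all.equal(pca_manual$rotation, pca_builtin$rotation)
```

Both routes pass the same standardized matrix to the underlying decomposition, so the standard deviations and loadings should agree.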
Having said that, when applying PCA it is generally a good idea to scale your variables. Otherwise the magnitude of certain variables dominates the associations between the variables in the sample. Unless all your variables are recorded on the same scale and/or the differences in variable magnitudes are of interest, I would suggest you normalise your data prior to PCA. This issue has been revisited multiple times within CV, e.g. 1, 2, 3.
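To make the "magnitude dominates" point concrete, here is a small sketch with two hypothetical variables, small and large, that carry a related signal but are recorded in units that differ by a factor of 1000. The names and the factor are assumptions chosen purely for illustration.

```r
# Two related variables on very different scales (illustrative data)
set.seed(1)
small <- rnorm(100)
large <- 1000 * (small + rnorm(100))  # related signal, units 1000x larger
dat <- data.frame(small, large)

# Loadings of the first principal component, without and with scaling
load_raw    <- prcomp(dat, scale. = FALSE)$rotation[, "PC1"]
load_scaled <- prcomp(dat, scale. = TRUE)$rotation[, "PC1"]

print(round(abs(load_raw), 3))     # weight sits almost entirely on 'large'
print(round(abs(load_scaled), 3))  # both variables contribute equally
```

Without scaling, the first component is essentially the high-variance variable on its own. With scaling, a two-variable PCA always splits the first component's weight evenly between the two variables, so the units no longer decide the outcome.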
Using the correlation matrix is equivalent to standardizing each of the variables (to mean 0 and standard deviation 1). In general, PCA with and without standardizing will give different results, especially when the scales are different.
In prcomp, scale=TRUE bases the PCA on the correlation matrix and scale=FALSE on the covariance matrix.
For example:
#my data
set.seed(1)
x<-rnorm(10,50,4)
y<-rnorm(10,50,7)
df<-data.frame(x,y)
#PCA based on covariance matrix and on correlation matrix
PCA_df.cov <- prcomp(df, scale=FALSE)
PCA_df.corr <- prcomp(df, scale=TRUE)
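The link to the covariance and correlation matrices can also be checked directly. The sketch below rebuilds the same simulated data and compares each prcomp fit with an eigendecomposition of cov(df) or cor(df). The object names (pca_cov, eig_cov and so on) are illustrative, and loadings are compared in absolute value because the sign of an eigenvector is arbitrary.

```r
# Same simulated data as in the example above
set.seed(1)
df <- data.frame(x = rnorm(10, 50, 4), y = rnorm(10, 50, 7))

pca_cov  <- prcomp(df, scale. = FALSE)
pca_corr <- prcomp(df, scale. = TRUE)

# Eigendecompositions of the covariance and correlation matrices
eig_cov  <- eigen(cov(df))
eig_corr <- eigen(cor(df))

# Component variances equal the eigenvalues
all.equal(pca_cov$sdev^2,  eig_cov$values)
all.equal(pca_corr$sdev^2, eig_corr$values)

# Loadings equal the eigenvectors, up to sign
all.equal(abs(unname(pca_cov$rotation)),  abs(eig_cov$vectors))
all.equal(abs(unname(pca_corr$rotation)), abs(eig_corr$vectors))
```

All four comparisons should report TRUE: scale=FALSE reproduces the eigenstructure of the covariance matrix and scale=TRUE that of the correlation matrix.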
scale=TRUE. Your code example just shows that toggling scale=TRUE and scale=FALSE produces different results, which doesn't do much to explain why those results are different. I think the code example would be clearer if you used it to demonstrate that scaling the data and setting scale=TRUE produce the same result, and to show that this is the same as performing PCA on the covariance matrix. In other words, use code to demonstrate the claims you make in text.
– Sycorax
Mar 29 '21 at 15:48