3

I'm using R iris data to show my point:

pca = prcomp(iris[,-5],scale=F)
png('irisNS_biplot.png',600,600)
biplot(pca)
dev.off()

PCA biplot

fviz_pca_biplot(pca, habillage=as.factor(iris$Species), addEllipses=TRUE, ellipse.level=0.95,
                label = "var", col.var = "red", col.ind = "#696969", alpha.var ="cos2", repel = TRUE) +
  theme_minimal() + theme_gray(base_size =12) + labs(title="", x ="PCA1", y = "PCA2")
ggsave("irisNS_ggplot.png", width = 8, height = 6, dpi = 300)

PCA ggplot

Look how the angle between “Petal Length” and “Sepal Length” differs between the two graphs. In the first graph, “Sepal Length” arrow points toward flower #119, while in the second graph, it points toward flower #132. Isn't it simply wrong? How can angles so different mean the same thing?

EDIT

Scaling the inputs:

pca = prcomp(iris[,-5],scale=T)
png('iris_biplot.png',600,600)
biplot(pca)
dev.off()

PCA biplot scaled

fviz_pca_biplot(pca, habillage=as.factor(iris$Species), addEllipses=TRUE, ellipse.level=0.95,
                label = "var", col.var = "red", col.ind = "#696969", alpha.var ="cos2", repel = TRUE) +
  theme_minimal() + theme_gray(base_size =12) + labs(title="", x ="PCA1", y = "PCA2")
ggsave("iris_ggplot.png", width = 8, height = 6, dpi = 300)

PCA ggplot scaled

Now the difference between the angles is smaller, but in the biplot the “Sepal Length” arrow is between 123 and 106/136, while in the ggplot it is between 103/136 and 110.

I also noticed that biplot shows a different scale for the arrows in the top and right axes, though both seem to use an aspect ratio of 1. Ggplot had a different aspect ratio in the first graph, but close (or equal?) to 1 in the second. However, its aspect ratio for the arrows is hidden. Does that make this specific ggplot graphic less reliable?

EDIT2

Added coord_fixed() to ggplot, to make sure its aspect ratio is 1.

fviz_pca_biplot(pca, habillage=as.factor(iris$Species), addEllipses=TRUE, ellipse.level=0.95,
                label = "var", col.var = "red", col.ind = "#696969", alpha.var ="cos2", repel = TRUE) +
  theme_minimal() + theme_gray(base_size =12) + labs(title="", x ="PCA1", y = "PCA2") +
  coord_fixed()
ggsave("iris_ggplotASP1.png", width = 8, height = 6, dpi = 300)

PCA ggplot scaled, aspect ratio = 1

Still, the arrows don't pass in the same position relative to the points, as in the biplot function. I see the numbers in the left/bottom axes are different between functions, but scaling the values equally along both axes shouldn't change the angle of the arrows, relative to the angle of the points (in relation to the origin (0,0) of the graph). So I guess the arrows use a different coordinate system, and this is not respecting the aspect ratio of 1, as (I thought) it should. Or am I missing something?

Rodrigo
  • 339
  • How do the plots compare if you scale the inputs -- as you should when doing PCA -- with scale = TRUE in prcomp? – dipetkov Jun 29 '22 at 22:20
  • You cannot make a visual comparison of angles until you draw your plots with identical aspect ratios--which these do not have. – whuber Jun 29 '22 at 22:35
  • @dipetkov Just added those plots. – Rodrigo Jun 29 '22 at 23:10
  • @whuber Added plots with scale = T, and some comments in the end. Also, didn't understand what you meant with "tch". – Rodrigo Jun 29 '22 at 23:11
  • Ah, so this is how you are eyeballing the angle. Don't plot symbols because it's not clear where the point they represent is, esp. when the label is multiple characters long. Plot dots in both versions instead. – dipetkov Jun 29 '22 at 23:32
  • @dipetkov I don't think that's the issue here. Numbers are pretty small (up to 3 digits long), and their centers are where the dots would be. The positions of the arrows in the scaled versions are still wrong. Have you read my last remarks in the post? – Rodrigo Jun 29 '22 at 23:38
  • No, I have to admit, not really. Instead I spent 1 min googling and came across this github issue. [In general I prefer using the standard packages so I don't know about factoextra and its reliability.] – dipetkov Jun 30 '22 at 00:01
  • Just realized that fixing the aspect ratio might not be enough to solve it. The ggplot versions are also not centered at (0,0); that also makes visual comparisons hard. You might be right that it's better to not use the factoextra package. – dipetkov Jun 30 '22 at 00:17
  • @dipetkov I prefer the standard packages too. Never liked ggplot. That's why I was doing my own graph, and needed to know which variables to extract. The difference from the ggplot graph sent by a friend intrigued me, I thought I was extracting the wrong variables... – Rodrigo Jun 30 '22 at 11:22
  • Sorry about the "tch": something went wrong when saving an edit to the comment. Based on your new graphics, it looks like there is no difference in angles or relative lengths, so all you're asking about is a difference in the normalization of the lengths. If you agree, please update your question (and title) accordingly. – whuber Jun 30 '22 at 12:09
  • @whuber How "there is no difference in angles", if one arrow passes over the two points (103/136) and the other passes below? I expected identical figures, since the data is the same, and still want to find the reason for the difference. – Rodrigo Jun 30 '22 at 15:34
  • I truly do not understand what you mean about the angles: the vectors in the two plots look identically situated to me. Be aware that each plot is the superposition of two separate plots (whence the two different sets of scales in the upper plot), so any discrepancy in comparing arrows to points will be due to a different convention for normalizing each plot. Ordinarily, one consults the documentation for such information. – whuber Jun 30 '22 at 16:13
  • @whuber Edited the question. Now both biplot and ggplot are using aspect ratio of 1. So, no matter what the scale is, the angles of both arrows and points (relative to the origin 0,0) should remain the same. I think the graphs should reflect precisely the data, and it obviously is not happening here. I want to know why. If I knew where in the documentation to find such thing, I wouldn't be here. – Rodrigo Jun 30 '22 at 16:38
  • Agreed--I'm not saying you shouldn't be here. But consulting the documentation via ?biplot produces the advice, "There are many variations on biplots (see the references)..." which at least heads you in a productive direction. Indeed, one of the references is an entire book on this plot (!), "J.C. Gower and D. J. Hand (1996). Biplots. Chapman & Hall." A search for it turns up another more recent book, https://onlinelibrary.wiley.com/doi/book/10.1002/9780470973196. – whuber Jun 30 '22 at 16:48
  • I found pieces of the Gower & Hand book at https://www.google.com/books/edition/Biplots/lTxiedIxRpgC?hl=en&gbpv=1&dq=Gower+Hand+Biplots&pg=PR15&printsec=frontcover. Start with section 2.6, "Some practical considerations concerned with plotting variables and scales." You might enjoy the remarks about users being at the "mercy of graphical software" that "does violence to the aspect ratio." – whuber Jun 30 '22 at 16:54

1 Answers1

5

It all started with a comment to always scale the input variables before doing principal components analysis....

The question asks why the PCA biplots generated with stats::biplot.prcomp (in base R) and factoextra::fviz_pca_biplot (built on ggplot2) "look different". It turns out that the plots differ in two ways:

  • biplot.prcomp plots the principal components while fviz_pca_biplot plots the principal components scaled to have unit variance.
  • biplot.prcomp and fviz_pca_biplot use different x:y aspect ratio to overlay principal components (points) and loadings (arrows) in an aesthetically pleasing way.

Okay, let's do principle component analysis. Don't forget to scale the data beforehand or to set scale = TRUE in the prcomp call. None of what follows holds if the data $\mathbf{X}$ is not standardized.

X <- iris[, 1:4]
X <- scale(X)
n <- nrow(X)

pca <- prcomp(X, scale = TRUE)

Recall that principal component analysis decomposes the matrix $\mathbf{X}$ into three components:

$$ \mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{V}^\top $$ where $\mathbf{V}$ are the principal axes (the eigenvectors), $\mathbf{S}$ are the standard deviations of the principal axes (the square roots of the eigenvalues) and $\mathbf{U}\mathbf{S}$ are the principal components (aka scores). The standardized principal components $\mathbf{U}$ have unit variance. The loadings are defined as $\mathbf{V}\mathbf{S}/\sqrt{n-1}$.

# the standard deviations of the principal components (the square roots of the eigenvalues)
S <- pca$sdev * sqrt(n - 1) # same as sqrt(eigen(t(X) %*% X, symmetric = TRUE)$values)

the principal axes (the eigenvectors)

V <- pca$rotation # same as eigen(t(X) %*% X, symmetric = TRUE)$vectors

the principal components (the observations projected on the principal axes)

US <- pca$x # same as X %*% V

scores <- US scores_unit_var <- divide(US, S) * sqrt(n-1) loadings <- multiply(V, S) / sqrt(n - 1)

@amoeba explains all this in great detail in Positioning the arrows on a PCA biplot

biplot.prcomp plots the "raw" principal components.

# Compute the x:y ratio to overlay PCs and loadings into one plot
# See source code for how biplot.prcomp computes the ratio
# https://github.com/SurajGupta/r-source/blob/master/src/library/stats/R/biplot.R
xy_ratio <- default_xy_ratio(scores, loadings)

biplot_by_hand(scores, loadings, xy_ratio)

enter image description here

fviz_pca_biplot plots the standardized principal components.

# Compute the x:y ratio to overlay standardized PCs and loadings into one plot
# See source code for how factominer computes the ratio
# https://rdrr.io/cran/factoextra/src/R/fviz_pca.R
xy_ratio <- factominer_xy_ratio(scores_unit_var, loadings)

biplot_by_hand(scores_unit_var, loadings, xy_ratio / sqrt(n - 1))

enter image description here


Complete R code listing to reproduce all figures. I use ggplot2.


library("factoextra")
library("tidyverse")

plot_pcs <- function() { ggplot( mapping = aes(PC1, PC2) ) }

add_points <- function(p, data) { p + geom_point( shape = "o", size = 3, data = as_tibble(data) ) }

add_arrows <- function(p, data) { p + geom_segment( aes( x = 0, xend = PC1, y = 0, yend = PC2 ), inherit.aes = FALSE, data = as_tibble(data), color = "red", arrow = arrow(length = unit(0.1, "cm")) ) }

fix_aspect_ratio <- function(p) { layer <- p$layers[[1]] xy_limits <- range(layer$data[, unlist(p$labels)]) p + scale_x_continuous( limits = xy_limits ) + scale_y_continuous( limits = xy_limits ) + coord_fixed() }

biplot_by_hand <- function(x, y, xy_ratio) { plot_pcs() %>% add_points(x) %>% add_arrows(y / xy_ratio) %>% fix_aspect_ratio() }

multiply <- function(mat, vec) { x <- mat %*% diag(vec) dimnames(x) <- dimnames(mat) x }

divide <- function(mat, vec) { multiply(mat, 1 / vec) }

X <- iris[, 1:4] X <- scale(X) n <- nrow(X)

pca <- prcomp(X, scale = TRUE)

the standard deviations of the principal components (the square roots of the eigenvalues)

S <- pca$sdev * sqrt(n - 1) # same as sqrt(eigen(t(X) %*% X, symmetric = TRUE)$values)

the principal axes (the eigenvectors)

V <- pca$rotation # same as eigen(t(X) %*% X, symmetric = TRUE)$vectors

the principal components (the observations projected on the principal axes)

US <- pca$x # same as X %*% V

scores <- US scores_unit_var <- divide(US, S) * sqrt(n - 1) loadings <- multiply(V, S) / sqrt(n - 1)

Compute the x:y ratio to overlay PCs and loadings in the same plot

See source code for how biplot.prcomp computes the ratio

https://github.com/SurajGupta/r-source/blob/master/src/library/stats/R/biplot.R

default_xy_ratio <- function(x, y) { unsigned.range <- function(x) { c(-abs(min(x, na.rm = TRUE)), abs(max(x, na.rm = TRUE))) }

rangx1 <- unsigned.range(x[, 1L]) rangx2 <- unsigned.range(x[, 2L]) rangy1 <- unsigned.range(y[, 1L]) rangy2 <- unsigned.range(y[, 2L])

max(rangy1 / rangx1, rangy2 / rangx2) }

xy_ratio <- default_xy_ratio(scores, loadings) biplot_by_hand(scores, loadings, xy_ratio)

Compute the x:y ratio to overlay standardized PCs and loadings in the same plot

See source code for how factominer computes the ratio

https://rdrr.io/cran/factoextra/src/R/fviz_pca.R

factominer_xy_ratio <- function(scores, loadings, scale. = 0.5) { r <- min( (max(scores[, 1L]) - min(scores[, 1L]) / (max(loadings[, 1L]) - min(loadings[, 1L]))), (max(scores[, 2L]) - min(scores[, 2L]) / (max(loadings[, 2L]) - min(loadings[, 2L]))) ) r / scale. }

xy_ratio <- factominer_xy_ratio(scores_unit_var, loadings) biplot_by_hand(scores_unit_var, loadings, xy_ratio / sqrt(n - 1))

dipetkov
  • 9,805