
In some disciplines, PCA (principal component analysis) is systematically used without any justification, and PCA and EFA (exploratory factor analysis) are treated as synonyms.

I therefore recently used PCA to analyse the results of a scale validation study (21 items on a 7-point Likert scale, assumed to form 3 factors of 7 items each), and a reviewer asked me why I chose PCA instead of EFA. I read about the differences between the two techniques, and it seems that EFA is favored over PCA in a majority of the answers here.

Do you have any good reasons why PCA would be a better choice? What benefits could it provide, and why might it be a wise choice in my case?

Carine
  • Great question. I tend to disagree with ttnphns's answer, and will try to provide an alternative view later today. – amoeba Nov 07 '14 at 16:06
  • @amoeba I am rooting for you in advance. PCA is just a transformation technique that may be (sometimes, very) helpful. There is no need to demonise it or attribute to it spurious or inappropriate intention. You might as well excoriate a logarithm. – Nick Cox Nov 07 '14 at 16:57
  • It doesn't seem to be that ttnphns' answer demonizes PCA. To me he just seems to be arguing that PCA isn't based on the assumption of latent variables generating your data, so if that's what you are trying to do, FA is a better choice. – gung - Reinstate Monica Nov 07 '14 at 20:01
  • FWIW, I wasn't commenting specifically on ttnphns's answer, but on comments and criticisms I often encounter which amount to charges that PCA doesn't do something for which it was never intended or is not suited. – Nick Cox Nov 12 '14 at 10:31
  • @gung: But PCA is a generative model that does assume that latent variables are generating your data: one latent variable for each component whose varying activation explains the variance along its eigenvector. You can calculate these feature activations by taking dot products with the eigenvectors. – Neil G Nov 12 '14 at 11:52
  • 3
    @NeilG: PCA is not a [probabilistic] generative model, because it does not include a noise term and so there is no likelihood associated with it. There is a probabilistic generalization though (PPCA), and it is very closely related to PCA, see my answer here. – amoeba Nov 12 '14 at 17:15
  • @amoeba: Can't you just express PCA as PPCA with fixed $\sigma$? Just as you can write least squares regression as trying to fit the data to a Gaussian with unknown mean (that is an affine map of the input) and fixed variance. Minimizing the log-loss of such a model is equivalent to solving the linear regression. Similarly, doesn't minimizing the log-loss with fixed $\sigma$ make PPCA collapse to PCA? – Neil G Nov 12 '14 at 18:10
  • @NeilG: I think that for any fixed value of $\sigma \ne 0$ minimizing PPCA loss over $\mathbf W$ will result in $\mathbf W$ that is different from vanilla PCA loadings. One can recover standard PCA in the limit of $\sigma \to 0$. Actually, if one takes the EM update equations for PPCA and sets $\sigma = 0$, the equations continue to make sense and can be used as an iterative method to find the PCA solution (see Roweis 1998, section 2.2 and below; very nice paper btw). Not sure how it fits to your OLS analogy (which I did not fully understand). – amoeba Nov 12 '14 at 19:19
  • @amoeba: What I'm saying is that changing a fixed $\sigma$ when minimizing the log-loss of the PPCA model you described would only scale the log-loss for any given data set. So it shouldn't matter what you set $\sigma$ to: if something is a solution for $\sigma=5$, it should be the same solution for $\sigma=1$. That's because in your generative equation for PPCA, the error term implies a quadratic log-loss. If you're off by 1.2 in a component, you pay $\left(\frac{1.2}{\sigma}\right)^2$. Doubling $\sigma$ halves the log-loss of every data point, and so on… – Neil G Nov 12 '14 at 19:31
  • In short, I'm suggesting $\mathrm{PCA}: \:\:\: \mathbf x = \mathbf W \mathbf z + \boldsymbol \mu + \boldsymbol \epsilon, \; \boldsymbol \epsilon \sim \mathcal N(0, \mathbf I)$. Although I'm not sure… – Neil G Nov 12 '14 at 19:33
  • @NeilG: This definitely cannot be true, and it seems to me that even the starting point of your argument (that changing $\sigma$ results in scaling of the log-loss) is not correct: see e.g. the original Tipping & Bishop 1999 PPCA paper, equation 10 for the log-likelihood of the data. It depends on $\sigma$ in a quite complicated way, via the $\mathrm{ln}|\mathbf C|$ term. – amoeba Nov 12 '14 at 21:11
  • It depends on the factor extraction procedure of EFA. The differences can be severe. There is no universal answer with the statements you gave. – MaHo Nov 16 '15 at 17:21

6 Answers


Disclaimer: @ttnphns is very knowledgeable about both PCA and FA, and I respect his opinion and have learned a lot from many of his great answers on the topic. However, I tend to disagree with his reply here, as well as with other (numerous) posts on this topic here on CV, not only his; or rather, I think they have limited applicability.


I think that the difference between PCA and FA is overrated.

Look at it this way: both methods attempt to provide a low-rank approximation of a given covariance (or correlation) matrix. "Low-rank" means that only a limited (low) number of latent factors or principal components is used. If the $n \times n$ covariance matrix of the data is $\mathbf C$, then the models are:

\begin{align} \mathrm{PCA:} &\:\:\: \mathbf C \approx \mathbf W \mathbf W^\top \\ \mathrm{PPCA:} &\:\:\: \mathbf C \approx \mathbf W \mathbf W^\top + \sigma^2 \mathbf I \\ \mathrm{FA:} &\:\:\: \mathbf C \approx \mathbf W \mathbf W^\top + \boldsymbol \Psi \end{align}

Here $\mathbf W$ is a matrix with $k$ columns (where $k$ is usually chosen to be a small number, $k<n$), representing $k$ principal components or factors, $\mathbf I$ is an identity matrix, and $\boldsymbol \Psi$ is a diagonal matrix. Each method can be formulated as finding $\mathbf W$ (and the rest) minimizing the [norm of the] difference between left-hand and right-hand sides.
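
As a concrete toy illustration of these approximations, here is a minimal sketch in Python (NumPy and scikit-learn; the wine dataset, the standardization, and the choice $k=3$ are arbitrary choices for the example, not part of the argument):

```python
# Minimal sketch of the low-rank covariance approximations above.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import FactorAnalysis

X = load_wine().data                      # 13 variables, arbitrary example data
X = (X - X.mean(0)) / X.std(0)            # standardize, so C is a correlation matrix
C = X.T @ X / len(X)                      # n x n correlation matrix
k = 3                                     # number of components / factors (arbitrary)

# PCA: loadings are eigenvectors of C scaled by sqrt(eigenvalues), so C ~ W W^T
evals, evecs = np.linalg.eigh(C)
idx = np.argsort(evals)[::-1][:k]
W_pca = evecs[:, idx] * np.sqrt(evals[idx])

# FA: C ~ W W^T + Psi with diagonal Psi ("uniquenesses"), fitted iteratively (EM)
fa = FactorAnalysis(n_components=k).fit(X)
W_fa = fa.components_.T                   # 13 x k loading matrix
Psi = np.diag(fa.noise_variance_)

# Compare how well each approximation reproduces C (norm of the residual)
print(np.linalg.norm(C - W_pca @ W_pca.T))
print(np.linalg.norm(C - W_fa @ W_fa.T - Psi))
```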

PPCA stands for probabilistic PCA, and if you don't know what that is, it does not matter so much for now. I wanted to mention it, because it neatly fits between PCA and FA, having intermediate model complexity. It also puts the allegedly large difference between PCA and FA into perspective: even though it is a probabilistic model (exactly like FA), it actually turns out to be almost equivalent to PCA ($\mathbf W$ spans the same subspace).

Most importantly, note that the models only differ in how they treat the diagonal of $\mathbf C$. As the dimensionality $n$ increases, the diagonal becomes in a way less and less important (because there are only $n$ elements on the diagonal and $n(n-1)/2 = \mathcal O (n^2)$ elements off the diagonal). As a result, for large $n$ there is usually not much of a difference between PCA and FA at all, an observation that is rarely appreciated. For small $n$ they can indeed differ a lot.

Now to answer your main question as to why people in some disciplines seem to prefer PCA. I guess it boils down to the fact that it is mathematically a lot easier than FA (this is not obvious from the above formulas, so you have to believe me here):

  1. PCA -- as well as PPCA, which is only slightly different -- has an analytic solution, whereas FA does not. So FA needs to be fit numerically; there exist various algorithms for doing it, giving possibly different answers and operating under different assumptions, etc. In some cases some algorithms can get stuck (see e.g. "Heywood cases"). For PCA you perform an eigen-decomposition and you are done; FA is a lot messier.

    Technically, PCA simply rotates the variables, and that is why one can refer to it as a mere transformation, as @NickCox did in his comment above.

  2. The PCA solution does not depend on $k$: you can find the first three PCs ($k=3$) and the first two of those are going to be identical to the ones you would find if you initially set $k=2$. That is not true for FA: the solution for $k=2$ is not necessarily contained inside the solution for $k=3$. This is counter-intuitive and confusing (see the small numerical check just below).
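
To make point 2 concrete, here is that small check on random example data (scikit-learn's PCA and FactorAnalysis; the exact numbers depend on the data and on the FA fitting algorithm):

```python
# Nesting check: PCA solutions are nested in k, FA solutions generally are not.
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

rng = np.random.default_rng(0)
X = rng.standard_normal((400, 10)) @ rng.standard_normal((10, 10))  # correlated toy data

pca2 = PCA(n_components=2, svd_solver="full").fit(X)
pca3 = PCA(n_components=3, svd_solver="full").fit(X)
print(np.allclose(pca2.components_, pca3.components_[:2]))   # True: first two PCs identical

fa2 = FactorAnalysis(n_components=2).fit(X)
fa3 = FactorAnalysis(n_components=3).fit(X)
print(np.allclose(fa2.components_, fa3.components_[:2]))     # typically False: not nested
```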

Of course FA is a more flexible model than PCA (after all, it has more parameters) and can often be more useful. I am not arguing against that. What I am arguing against is the claim that they are conceptually very different, with PCA being about "describing the data" and FA being about "finding latent variables". I just do not see this as true [almost] at all.

To comment on some specific points mentioned above and in the linked answers:

  • "in PCA the number of dimensions to extract/retain is fundamentally subjective, while in EFA the number is fixed, and you usually have to check several solutions" -- well, the choice of the solution is still subjective, so I don't see any conceptual difference here. In both cases, $k$ is (subjectively or objectively) chosen to optimize the trade-off between model fit and model complexity.

  • "FA is able to explain pairwise correlations (covariances). PCA generally cannot do it" -- not really, both of them explain correlations better and better as $k$ grows.

  • Sometimes extra confusion arises (but not in @ttnphns's answers!) due to the different practices in the disciplines using PCA and FA. For example, it is a common practice to rotate factors in FA to improve interpretability. This is rarely done after PCA, but in principle nothing is preventing it (see the sketch just after this list). So people often tend to think that FA gives you something "interpretable" and PCA does not, but this is often an illusion.
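
Here is that sketch: a varimax rotation applied to PCA loadings. The varimax routine below is the standard textbook recipe written out by hand (it is not taken from any particular library), and the data are random, purely for illustration:

```python
# Varimax-rotating PCA loadings; rotation does not change W W^T.
import numpy as np

def varimax(W, n_iter=50, tol=1e-6):
    """Orthogonal varimax rotation of a loading matrix W (variables x components)."""
    p, k = W.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(n_iter):
        L = W @ R
        u, s, vt = np.linalg.svd(
            W.T @ (L**3 - (L @ np.diag(np.sum(L**2, axis=0))) / p))
        R = u @ vt
        d_old, d = d, np.sum(s)
        if d_old != 0 and d / d_old < 1 + tol:
            break
    return W @ R

# PCA loadings of some random example data ...
rng = np.random.default_rng(3)
X = rng.standard_normal((300, 8)) @ rng.standard_normal((8, 8))
X = (X - X.mean(0)) / X.std(0)
C = X.T @ X / len(X)
evals, evecs = np.linalg.eigh(C)
W_pca = evecs[:, ::-1][:, :3] * np.sqrt(evals[::-1][:3])

# ... rotated for interpretability; since the rotation is orthogonal,
# W_rot @ W_rot.T equals W_pca @ W_pca.T, so the reconstruction of C is unchanged.
W_rot = varimax(W_pca)
print(np.allclose(W_rot @ W_rot.T, W_pca @ W_pca.T))   # True
```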

Finally, let me stress again that for very small $n$ the differences between PCA and FA can indeed be large, and maybe some of the claims in favour of FA are made with small $n$ in mind. As an extreme example, for $n=2$ a single factor can always perfectly explain the correlation, but one PC can fail quite badly to do so.
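
Here is that $n=2$ example in code, with an arbitrary correlation $r=0.3$ (any $0<r<1$ behaves the same way):

```python
# n=2 extreme case: one factor reproduces the off-diagonal exactly, one PC does not.
import numpy as np

r = 0.3
C = np.array([[1.0, r], [r, 1.0]])

# One-factor FA by hand: loadings (sqrt(r), sqrt(r)) give off-diagonal exactly r;
# the uniquenesses 1 - r absorb the rest of the diagonal.
w_fa = np.array([np.sqrt(r), np.sqrt(r)])
print(np.outer(w_fa, w_fa)[0, 1])          # = r, exact

# First PC: eigenvector of C scaled by sqrt of the largest eigenvalue (1 + r).
evals, evecs = np.linalg.eigh(C)
w_pca = evecs[:, -1] * np.sqrt(evals[-1])
print(np.outer(w_pca, w_pca)[0, 1])        # = (1 + r)/2, not r unless r = 1
```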


Update 1: generative models of the data

You can see from the number of comments that what I am saying is taken to be controversial. At the risk of flooding the comment section even further, here are some remarks regarding "models" (see comments by @ttnphns and @gung). @ttnphns does not like that I used the word "model" [of the covariance matrix] to refer to the approximations above; it is an issue of terminology, but what he calls "models" are probabilistic/generative models of the data:

\begin{align} \mathrm{PPCA}: &\:\:\: \mathbf x = \mathbf W \mathbf z + \boldsymbol \mu + \boldsymbol \epsilon, \; \boldsymbol \epsilon \sim \mathcal N(0, \sigma^2 \mathbf I) \\ \mathrm{FA}: &\:\:\: \mathbf x = \mathbf W \mathbf z + \boldsymbol \mu + \boldsymbol \epsilon, \; \boldsymbol \epsilon \sim \mathcal N(0, \boldsymbol \Psi) \end{align}

Note that PCA is not a probabilistic model, and cannot be formulated in this way.

The difference between PPCA and FA is in the noise term: PPCA assumes the same noise variance $\sigma^2$ for each variable, whereas FA assumes different variances $\Psi_{ii}$ ("uniquenesses"). This minor difference has important consequences. Both models can be fit with a general expectation-maximization algorithm. For FA no analytic solution is known, but for PPCA one can analytically derive the solution that EM will converge to (both $\sigma^2$ and $\mathbf W$). It turns out that $\mathbf W_\mathrm{PPCA}$ has columns pointing in the same directions as, but with smaller lengths than, the standard PCA loadings $\mathbf W_\mathrm{PCA}$ (I omit the exact formulas). For that reason I think of PPCA as "almost" PCA: in both cases $\mathbf W$ spans the same "principal subspace".

The proof (Tipping and Bishop 1999) is a bit technical; the intuitive reason for why homogeneous noise variance leads to a much simpler solution is that $\mathbf C - \sigma^2 \mathbf I$ has the same eigenvectors as $\mathbf C$ for any value of $\sigma^2$, but this is not true for $\mathbf C - \boldsymbol \Psi$.
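
For the curious, here is a sketch of that closed-form PPCA solution, following the Tipping & Bishop formulas ($\sigma^2$ is the ML estimate, i.e. the mean of the discarded eigenvalues, and the PPCA loadings are $\mathbf V_k(\boldsymbol\Lambda_k-\sigma^2\mathbf I)^{1/2}$ up to a rotation), compared with PCA loadings on random example data:

```python
# Closed-form PPCA loadings vs PCA loadings: same directions, shorter columns.
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 8)) @ rng.standard_normal((8, 8))   # random correlated data
X = X - X.mean(0)
C = X.T @ X / len(X)
k = 2

evals, evecs = np.linalg.eigh(C)
evals, evecs = evals[::-1], evecs[:, ::-1]            # sort eigenvalues descending

W_pca = evecs[:, :k] * np.sqrt(evals[:k])             # standard PCA loadings
sigma2 = evals[k:].mean()                             # ML estimate of the noise variance
W_ppca = evecs[:, :k] * np.sqrt(evals[:k] - sigma2)   # PPCA loadings (up to rotation)

# Columns are parallel (cosine ~1) but shorter for PPCA:
cos = np.sum(W_pca * W_ppca, axis=0) / (
    np.linalg.norm(W_pca, axis=0) * np.linalg.norm(W_ppca, axis=0))
print(np.round(cos, 6))                                               # ~ [1., 1.]
print(np.linalg.norm(W_ppca, axis=0) / np.linalg.norm(W_pca, axis=0)) # each ratio < 1
```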

So yes, @gung and @ttnphns are right in that FA is based on a generative model and PCA is not, but I think it is important to add that PPCA is also based on a generative model, but is "almost" equivalent to PCA. Then it ceases to seem such an important difference.


Update 2: how come PCA provides the best approximation to the covariance matrix, when it is well known to be looking for maximal variance?

PCA has two equivalent formulations: e.g. the first PC is (a) the one maximizing the variance of the projection and (b) the one providing minimal reconstruction error. More abstractly, the equivalence between maximizing variance and minimizing reconstruction error can be seen via the Eckart-Young theorem.

If $\mathbf X$ is the data matrix (with observations as rows and variables as columns, and columns assumed to be centered) and its SVD is $\mathbf X=\mathbf U\mathbf S\mathbf V^\top$, then it is well known that the columns of $\mathbf V$ are eigenvectors of the scatter matrix (or covariance matrix, if divided by the number of observations) $\mathbf C=\mathbf X^\top \mathbf X=\mathbf V\mathbf S^2\mathbf V^\top$, and so they are the axes maximizing the variance (i.e. principal axes). But by the Eckart-Young theorem, the first $k$ PCs provide the best rank-$k$ approximation to $\mathbf X$: $\mathbf X_k=\mathbf U_k\mathbf S_k \mathbf V^\top_k$ (this notation means taking only the $k$ largest singular values/vectors) minimizes $\|\mathbf X-\mathbf X_k\|^2$.

The first $k$ PCs provide not only the best rank-$k$ approximation to $\mathbf X$, but also to the covariance matrix $\mathbf C$. Indeed, $\mathbf C=\mathbf X^\top \mathbf X=\mathbf V\mathbf S^2\mathbf V^\top$, and the last expression provides the SVD of $\mathbf C$ (because $\mathbf V$ is orthogonal and $\mathbf S^2$ is diagonal). So the Eckart-Young theorem tells us that the best rank-$k$ approximation to $\mathbf C$ is given by $\mathbf C_k = \mathbf V_k\mathbf S_k^2\mathbf V_k^\top$. This can be transformed by noticing that $\mathbf W = \mathbf V\mathbf S$ are the PCA loadings, and so $$\mathbf C_k=\mathbf V_k\mathbf S_k^2\mathbf V^\top_k=(\mathbf V\mathbf S)_k(\mathbf V\mathbf S)_k^\top=\mathbf W_k\mathbf W^\top_k.$$
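
A quick numerical check of this identity, on random example data:

```python
# Check that C_k = V_k S_k^2 V_k^T equals W_k W_k^T with W = V S (PCA loadings).
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((300, 6))
X = X - X.mean(0)                               # centred data matrix
k = 2

U, S, Vt = np.linalg.svd(X, full_matrices=False)
C = X.T @ X                                     # scatter matrix, equals V S^2 V^T
W = Vt.T * S                                    # PCA loadings W = V S

C_k_svd = (Vt.T[:, :k] * S[:k]**2) @ Vt[:k]     # best rank-k approximation V_k S_k^2 V_k^T
C_k_loadings = W[:, :k] @ W[:, :k].T            # W_k W_k^T
print(np.allclose(C_k_svd, C_k_loadings))       # True
```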

The bottom-line here is that $$ \mathrm{minimizing} \; \left\{\begin{array}{ll} \|\mathbf C-\mathbf W\mathbf W^\top\|^2 \\ \|\mathbf C-\mathbf W\mathbf W^\top-\sigma^2\mathbf I\|^2 \\ \|\mathbf C-\mathbf W\mathbf W^\top-\boldsymbol\Psi\|^2\end{array}\right\} \; \mathrm{leads \: to} \; \left\{\begin{array}{cc} \mathrm{PCA}\\ \mathrm{PPCA} \\ \mathrm{FA} \end{array}\right\} \; \mathrm{loadings},$$ as stated in the beginning.


Update 3: numerical demonstration that PCA$\to$FA when $n \to \infty$

I was encouraged by @ttnphns to provide a numerical demonstration of my claim that as dimensionality grows, PCA solution approaches FA solution. Here it goes.

I generated a $200\times 200$ random correlation matrix with some strong off-diagonal correlations. I then took the upper-left $n \times n$ square block $\mathbf C$ of this matrix with $n=25, 50, \dots 200$ variables to investigate the effect of the dimensionality. For each $n$, I performed PCA and FA with number of components/factors $k=1\dots 5$, and for each $k$ I computed the off-diagonal reconstruction error $$\sum_{i\ne j}\left[\mathbf C - \mathbf W \mathbf W^\top\right]^2_{ij}$$ (note that on the diagonal, FA reconstructs $\mathbf C$ perfectly, due to the $\boldsymbol \Psi$ term, whereas PCA does not; but the diagonal is ignored here). Then for each $n$ and $k$, I computed the ratio of the PCA off-diagonal error to the FA off-diagonal error. This ratio has to be above $1$, because FA provides the best possible reconstruction.
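
For readers who want to reproduce something along these lines, here is a rough sketch. The FA step is a simple hand-coded principal-axis iteration, and the random correlation matrix below is built from a rank-30 "signal" plus diagonal noise; this is only one of many possible constructions and not necessarily the one used for the figure, so the exact numbers will differ, but the ratios should stay above $1$ and shrink towards $1$ as $n$ grows.

```python
# Sketch of the simulation: PCA vs FA off-diagonal reconstruction error ratio.
import numpy as np

rng = np.random.default_rng(0)

def random_corr(n_full=200, signal_rank=30):
    # full-rank matrix with off-diagonal correlations: low-rank "signal"
    # plus diagonal noise, rescaled to have a unit diagonal
    B = rng.standard_normal((n_full, signal_rank))
    A = B @ B.T + np.diag(rng.uniform(0.5, 1.5, n_full) * signal_rank)
    d = np.sqrt(np.diag(A))
    return A / np.outer(d, d)

def pca_loadings(C, k):
    evals, evecs = np.linalg.eigh(C)
    evals, evecs = evals[::-1], evecs[:, ::-1]
    return evecs[:, :k] * np.sqrt(np.maximum(evals[:k], 0.0))

def pa_loadings(C, k, n_iter=300):
    # principal-axis factoring: PCA on C with iteratively updated communalities
    Ch = C.copy()
    for _ in range(n_iter):
        W = pca_loadings(Ch, k)
        np.fill_diagonal(Ch, np.sum(W**2, axis=1))
    return W

def offdiag_err(C, W):
    R = C - W @ W.T
    np.fill_diagonal(R, 0.0)
    return np.sum(R**2)

C_full = random_corr()
for n in (25, 50, 100, 200):
    C = C_full[:n, :n]
    for k in (1, 3, 5):
        ratio = offdiag_err(C, pca_loadings(C, k)) / offdiag_err(C, pa_loadings(C, k))
        print(f"n={n:3d}  k={k}  PCA/FA off-diagonal error ratio = {ratio:.2f}")
```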

[Figure: PCA vs FA off-diagonal reconstruction error]

On the right, different lines correspond to different values of $k$, and $n$ is shown on the horizontal axis. Note that as $n$ grows, ratios (for all $k$) approach $1$, meaning that PCA and FA yield approximately the same loadings, PCA$\approx$FA. With relatively small $n$, e.g. when $n=25$, PCA performs [expectedly] worse, but the difference is not that strong for small $k$, and even for $k=5$ the ratio is below $1.2$.

The ratio can become large when the number of factors $k$ becomes comparable with the number of variables $n$. In the example I gave above with $n=2$ and $k=1$, FA achieves $0$ reconstruction error, whereas PCA does not, i.e. the ratio would be infinite. But getting back to the original question, when $n=21$ and $k=3$, PCA will only moderately lose to FA in explaining the off-diagonal part of $\mathbf C$.

For an illustrated example of PCA and FA applied to a real dataset (wine dataset with $n=13$), see my answers here:

amoeba
  • I was just about to ask a question about the mathematical difference between the techniques, since most of the (otherwise excellent) answers on the topic here don't make explicit mathematical comparisons. This answer is exactly what I was looking for. – shadowtalker Nov 07 '14 at 21:25
  • This is a highly valuable, unfolded account with a fresh perspective. The putting of PPCA as an in-between technique is crucial - it is from where your opinion grows. May I ask you to leave more lines about PPCA? - What is $\sigma^2$, how is it estimated (briefly), and what makes it different from $\Psi$, so that PPCs (unlike factors) fill in the subspace of the variables and a PPC does not depend on $k$. – ttnphns Nov 08 '14 at 00:34
  • One tiny notion. The so-called "Heywood case" (an unreasonable communality $\psi$ value appears and breaks FA), although it may have various causes, usually appears when you set $k$ higher than "optimal". Actually it might be a testimony for a statement that there exists an optimal or "true" $k$, and hence your statement that $k$ is always subjective, because we always explain correlations better and better as $k$ grows, is questioned. The Heywood case might be that "natural plague" which baffles one's overfitting optimism in FA. If your above stance is vulnerable, the difference between PCA and FA is resuscitated. – ttnphns Nov 08 '14 at 02:16
  • And another one. FA does iterations and aims to fit correlations on each one. That means that correlation values in the matrix get fit relatively evenly in the end - it is the purpose of FA, whatever $k$ is. We don't see such concern in PCA: as $k$ grows, correlations may get fitted better in an uneven and unpredictable way. So, the "bigger k - better fit" dictum only disguises the difference between FA and PCA. – ttnphns Nov 08 '14 at 02:38
  • Also: why should the (non-hierarchical form of) dependency of the factor solution on $k$ be counter-intuitive and confusing? Factors are not hierarchical by nature (I'm not speaking of the so-called 2nd-order factors here). It is unclear why I must expect the 1st factor to be the same in 1-factor and 5-factor solutions. It is comfortable, for sure, but why should it be a law? – ttnphns Nov 08 '14 at 02:51
  • What seems to be the principal discrepancy with your opinion is your formulation of the model of PCA (the 1st formula). It is superficially correct but it hides the fact that PCA won't stir a finger to explain the off-diagonals of $\bf C$; it is concerned only with reproducing the trace, by components. – ttnphns Nov 08 '14 at 04:00
  • 6
    I continue to agree w/ ttnphns here, & the distinction that FA is based on latent variables whereas PCA is just a transformation of the data. However, this is very well reasoned & a useful contrary position. It contributes to the quality of this thread. +1 – gung - Reinstate Monica Nov 08 '14 at 04:28
  • To unite with @gung and to cross the t's: those formulas in your answer which you call "models" are not models at all. They are called the component/factor theorems and are the consequences of the models. Factor/PCA models are described in the first paragraph here. – ttnphns Nov 08 '14 at 05:09
  • ... and, because you are not correct saying that FA and PCA only differ in how they treat the diagonal (for, on the contrary, they differ in how they care about the off-diagonal), you are wrong when you state that their results become similar as $n$ grows because the diagonal becomes relatively lighter and lighter. – ttnphns Nov 08 '14 at 06:58
  • Hi @ttnphns, thanks for your comments; I did not have much time yesterday, so my answer was a bit hastily written. I will update it with some extra comments on PPCA and about "models". But for now: why do you say that PCA "won't stir a finger" to explain off-diagonal elements of $\mathbf C$? This sounds very weird to me (and might be one of the reasons behind our disagreements here); PCA finds $\mathbf W$ such that $|\mathbf C-\mathbf W\mathbf W^\top|^2$ is minimized. This is [one possible formulation of] what PCA is! The norm is given by the sum over all elements of $\mathbf C$. – amoeba Nov 08 '14 at 09:52
  • Hi there, amoeba. Again, I find your answer very thoughtful and original. For me, please don't modify it - issue an new answer, if you need. We have to leave this one for it got too many comments already. Though I myself prefer to think that you are wrong in some of your points, other people might call it the difference in opinions. – ttnphns Nov 08 '14 at 10:18
  • @ttnphns, no, I believe it is you who are mistaken. Let's try to clarify this point, as it is crucial for my answer. I can prove that minimizing $|\mathbf C-\mathbf W \mathbf W^\top|^2$ will result in $\mathbf W$ being the PCA loadings. So I insist that PCA loadings aim to reproduce the covariance matrix as close as possible, diagonal and off-diagonal elements alike. What you are saying (I think), is that $\mathrm{trace}(\mathbf W^\top \mathbf W)$ (total variance of PCs) is maximized [not minimized!] among all the projections of the data (right?). Correct! It is mathematically equivalent. – amoeba Nov 08 '14 at 11:33
  • I certainly meant trace "maximized". I've added one last paragraph about it here. – ttnphns Nov 08 '14 at 11:44
  • Wait, @ttnphns, this is important. What you wrote there is correct but misleading. Do you disagree that PCA's $\mathbf W$ minimizes the reconstruction error $|\mathbf C-\mathbf W \mathbf W^\top|$? PCA has two formulations: one in terms of maximizing variance, and another in terms of minimizing reconstruction error (usually of the data, but also of the covariance matrix). They are mathematically equivalent. So "FA aims at minimizing differences between corresponding off-diagonal elements" -- yes, but PCA does the same if you remove the word "off-diagonal". Agree, disagree, not convinced? – amoeba Nov 08 '14 at 11:52
  • Not convinced. FA minimizes (seeks to do) the error for every off-diagonal element, not just the "sum" of errors over the matrix. Can you show PCA does the same thing? 2) Please, show the equivalence between the variance-formulation and the error-formulation, for PCA. I'll be very thankful (it may well be so indeed, I didn't think of that). Do it in a comment (or new answer) to this one. – ttnphns Nov 08 '14 at 12:14
  • @amoeba YOUR ANSWER IS GREAT. It is so clear and gratifying. Thanks for sharing your vision. – Nov 08 '14 at 15:41
  • Introducing PPCA consolidates the belief that all the three procedures are targeting at minimizing $|\mathbf C-\mathbf W \mathbf W^\top|$, and only it. However, FA also pursues to leave the above residuals "evenly spread" over the matrix, that is, reasonably random-like. Neither PCA nor PPCA do it. – ttnphns Nov 10 '14 at 08:54
  • @ttnphns: No, not exactly, see my second update (I also provided similar reasoning in the comments to the linked post). PCA minimizes $|\mathbf C-\mathbf W \mathbf W^\top|$, PPCA minimizes $|\mathbf C-\mathbf W \mathbf W^\top-\sigma^2 \mathbf I|$, and FA minimizes $|\mathbf C-\mathbf W \mathbf W^\top-\boldsymbol \Psi|$. I am not sure what you mean by "evenly spread residuals"... FA simply does not care about the diagonal of $\mathbf C$, because whatever values $\mathbf W \mathbf W^\top$ has on the diagonal, $\boldsymbol\Psi$ can compensate for it. So in FA error on the diagonal is zero. – amoeba Nov 10 '14 at 21:52
  • Amoeba, I dared to just slightly edit your answer. Please review if I did it right. – ttnphns Nov 11 '14 at 08:25
  • Hmm. I'm sorry, but I couldn't trace all of your logic (I'm not a mathematician!). You have to show that if $\|\mathbf X-\mathbf X_k\|^2$ is minimized, then $\|\mathbf X^\top\mathbf X-\mathbf X_k^\top\mathbf X_k\|^2$ is also minimized. To me, it's not obvious. – ttnphns Nov 11 '14 at 09:57
  • I did write it a bit too short, @ttnphns; I have updated this paragraph, take a look if it makes sense to you now (btw, thanks for the edits!). In addition to what I wrote above, I should say that minimizing $|\mathbf C - \mathbf W\mathbf W^\top|$ has many solutions: if loadings $\mathbf W$ are rotated in the latent space, it will not alter the product $\mathbf W\mathbf W^\top$ (as you know well, because it is routinely done in FA!). So when I say that PCA loadings minimize reconstruction error of the covariance matrix, what I really mean is "possibly rotated PCA loadings". – amoeba Nov 11 '14 at 13:08
  • Hi, amoeba, thank you a lot for the Update 3 demonstration. May I ask you - 1) why did you (strangely) make only 40 observations in X? This configuration is singular, which is a problem for FA. Although some implementations can handle it, it is generally considered inappropriate for FA. 2) Also, what FA extraction method did you use? 3) Didn't you encounter a Heywood case sometimes? – ttnphns Nov 17 '14 at 15:55
  • @ttnphns: (1) If I take a lot of observations to make $C$ non-singular, then it becomes pretty much diagonal with all correlations almost zero (all green on my figure). This makes diagonal really "stand out" of the rest and so makes PCA perform worse on the off-diagonal part (i.e. my "ratios" increase). Also, it is not a typical correlation matrix to run PCA or FA. Ideally, I would take a random $C$ which is positive-definite and has many strong correlations, but I don't know how to generate it. (2) Self-coded principal axis, iterated until convergence. (3) Not 100% sure, but I think not... – amoeba Nov 17 '14 at 17:46
  • @ttnphns: I updated my Update 3 and used a properly generated random correlation matrix (so that it is full-rank and not singular). I am writing it here mainly to point out for possible future readers of these comments that your question (1) from above does not apply anymore. – amoeba Nov 25 '14 at 14:15
  • Dear amoeba, I've just appended a comment concerning the problem of the estimation of $\Psi$, but because of the length I made it another answer. Please see there. – Gottfried Helms Dec 04 '14 at 15:03
  • It might help to clarify that when you say ${\mathbf C}$ you mean the sample covariance matrix that is estimated from the observed data. As distinct from the (true) covariance matrix (under the generative models). ${\mathbf C}$ is currently introduced as the "covariance matrix", but I guess it should be "sample covariance matrix". Excellent post by the way! – Aaron McDaid Oct 27 '16 at 15:06
  • @amoeba, (+1) it isn't clear to me how the objective function you write for FA (or PPCA) in update 2 relates to the model you write in update 1. In fact, they seem to me to be totally unrelated. Could you perhaps help me clarify this? (My qualms with the objective function are that $C$ appears gaussian (connecting to the euclidean error being used) and that $\Phi$ appearing additively seems obviously wrong to me.) – user795305 May 16 '18 at 06:56
  • if it's further assumed that $z$ is random with cov $I$ and is independent of $\epsilon$, then we see that $E C = W W^T + \Phi$, where $C = (X - \mu) (X - \mu)^T$. Of course, though, $C$ isn't normal with covariance proportional to the identity (I'm identifying $C$ with its vec'ed version when I write that previous sentence), so that the above objective for FA is not the MLE. What is it exactly? I may be misunderstanding, but this FA objective (which is the standard one, I think) seems to be a case of many simplifying assumptions and handwaving at what the estimating equation should look like – user795305 May 16 '18 at 07:12
  • @user795305 Apologies, I forgot to reply. The FA model written in Update 1 is correct. The latent $z$ is indeed supposed to be from $\mathcal N(0,I)$ and independent of $\epsilon$. The ML solution for $W$ and $\Psi$ are indeed NOT minimizing the norm of $C-WW^\top-\Psi$ as I wrote in Update 2; that was sloppy and incorrect. I should fix it, thanks. However, I think it's okay to say that the ML solution is such that $C\approx WW^\top+\Psi$; it's just that the loss function here is not the norm of the difference but a more complicated expression (likelihood of $C$ given $WW^\top+\Psi$). – amoeba May 22 '18 at 20:04
  • +1 nice answer. Ignoring the mathematical formulations (and the math) comes at the cost of confusion and unnecessarily verbose explanations. – SiXUlm Apr 04 '19 at 09:59