For this question, I begin by explaining my factor model, and then ask how to theoretically motivate factor scores given my setup. I would also be very grateful for clarification about any mistakes I may be making in setting up my factor model or explaining factor scores, as I'm not an expert in this area.

Model Setup

I have data where $i$ indexes people and $q$ indexes questions. Performance of person $i$ on question $q$ is $X_{iq}$. I posit a model where performance is generated by two latent factors, $F_{ik}$, $k=1,2$, which are properties of people, through the following equation

$$X_{iq}=\beta_{q1}F_{i1} + \beta_{q2}F_{i2} + E_{iq}$$

where $\beta_{qk}$ measures how factor $k$ loads on question $q$, and $E_{iq}$ is the error term. I assume that $Cor(F_{i1},F_{i2})=0$, and $Cor(F_{ik},E_{iq})=0$ for every $k$ and $q$. In addition, estimation proceeds by maximum likelihood under the assumption that $F_{i1}$ and $F_{i2}$ are bivariate normal (across people $i$), each with mean zero and variance one, and that the $E_{iq}$ are multivariate normal, each with mean zero and variance (uniqueness) $\psi_q = 1 - \beta_{q1}^2 - \beta_{q2}^2$. Together, these assumptions imply that each $X_{iq}$ is standard normal, which is why I standardize these variables in pre-processing.
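
To spell out the variance part of that claim (a short derivation using only the assumptions above; the cross terms vanish because the factors are uncorrelated with each other and with the error):

$$\operatorname{Var}(X_{iq}) = \beta_{q1}^2 \operatorname{Var}(F_{i1}) + \beta_{q2}^2 \operatorname{Var}(F_{i2}) + \operatorname{Var}(E_{iq}) = \beta_{q1}^2 + \beta_{q2}^2 + \psi_q = 1$$

The mean is zero because every term on the right-hand side of the model has mean zero.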

Factor Scores

After estimating the factor model by maximum likelihood, I have estimates for $\beta_{q1}$ and $\beta_{q2}$, which I call $\hat{\beta}_{q1}$ and $\hat{\beta}_{q2}$. For further analysis, I would like to predict the factor values of individual people $i$, that is, to obtain something like $\hat{F}_{i1}$ and $\hat{F}_{i2}$. This leads me to consider factor scores.

ttnphns writes (https://stats.stackexchange.com/a/126985/137620) that, in general, factor scores are given by $\bf\hat{F} = XB$, where $\bf X$ is the data (the analyzed variables, which are test questions in my case) and $\bf B$ is a weight matrix. Several different weight matrices are in use. The same answer notes that a "popular/traditional approach, sometimes called Cattell's, is simply averaging (or summing up) values of items which are loaded by the same factor." In non-matrix notation, I interpret this as

$$\hat{F}_{i1} = \sum_{q} \hat{\beta}_{q1} X_{iq}$$

and

$$\hat{F}_{i2} = \sum_{q} \hat{\beta}_{q2} X_{iq}$$

Thus, if the factor "loads heavily" in predicting whether the person gets a high score on the question (which is my interpretation of a large $\beta_{qk}$ in the original factor model), then the question score "loads heavily" in predicting whether the person has a high level of the factor in this factor score calculation.
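
To make the calculation concrete, here is a minimal sketch in Python (NumPy) of these loading-weighted sums on simulated data. The loadings, sample size, and seed are made up for illustration, and the true loadings stand in for the ML estimates $\hat{\beta}_{qk}$. That is an assumption of the sketch, not part of my actual analysis.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

n_people, n_questions = 5000, 6

# Hypothetical loadings (rows: questions, columns: factors), for illustration only.
# They stand in for the estimated loadings beta_hat_qk.
loadings = np.array([
    [0.8, 0.1],
    [0.7, 0.2],
    [0.6, 0.0],
    [0.1, 0.8],
    [0.2, 0.7],
    [0.0, 0.6],
])

# Uniquenesses psi_q = 1 - beta_q1^2 - beta_q2^2, so each X_iq has unit variance.
psi = 1.0 - (loadings ** 2).sum(axis=1)

# Simulate the model: uncorrelated standard-normal factors plus normal errors.
F = rng.standard_normal((n_people, 2))
E = rng.standard_normal((n_people, n_questions)) * np.sqrt(psi)
X = F @ loadings.T + E

# Coarse scores: F_hat[i, k] = sum_q beta_qk * X[i, q], i.e. F_hat = X B with B = loadings.
F_hat = X @ loadings

# Correlation of each coarse score with the simulated factor it is meant to track.
corr = np.array([np.corrcoef(F_hat[:, k], F[:, k])[0, 1] for k in range(2)])
print(corr)
```

Because the simulation keeps the true $F_{ik}$, these correlations give one rough way to gauge how well the weighted sums track the factors. Note that the raw sums are not rescaled, so they do not have unit variance in general.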

This seems intuitive, but is there a formal, theoretical justification for this approach? Can it be seen as a "best predictor" of the individual-level factors $F_{ik}$ in any sense? (Or is there an alternative approach that can be formally justified?)

  • You have probably noticed that in my answer you link to, factor loadings (which you notate $b$) are notated $a$ (A, in matrix form) while $b$ (B) is kept for factor score coefficients. My notation is fairly well established in the literature, and I would recommend following it. – ttnphns Apr 04 '22 at 00:10
  • Now, Cattell's or the "coarse" method of computing factor scores proclaims: let B = A or = some trivial f(A). That is a quite intuitive, straightforward decision. The question is: is it good enough? – ttnphns Apr 04 '22 at 00:15
  • p.s. The notation in my answer actually reserves the letter P for the loadings, but it also says that when factors are kept orthogonal, P=S=A. This is also a well-known fact. – ttnphns Apr 04 '22 at 00:20
  • p.p.s. There is a link in my answer to another answer https://stats.stackexchange.com/a/191332/3277 where I tried to express my feeling about why the "coarse method" (B=A) is inferior to a refined method (such as the regression method). – ttnphns Apr 04 '22 at 00:37
