5

Is it necessary in structural equation modeling (SEM) to incorporate all potential independent variables that could affect the dependent variable? Or is it acceptable to examine the influence of only a select few independent variables on the dependent variable?

Marjaan
  • 51
  • This can help: https://stats.stackexchange.com/questions/63417/difference-between-simultaneous-equation-model-and-structural-equation-model/400279#400279 – markowitz Sep 13 '23 at 07:51

5 Answers

7

Multiple regression is an example of structural equation modeling. It's often easier to think about what you would do in multiple regression, because the same reasoning carries over to SEM.

So, it is as necessary to incorporate all predictor variables in SEM as it is in multiple regression.

Which is to say that it depends on the question that you are asking, and the conclusions that you want to draw. But 'all potential independent [predictor?] variables' includes everything in the universe, so no.

Richard Hardy
  • 67,272
Jeremy Miles
  • 17,812
  • Thank you for your response. My current project involves investigating the influence of various inventory management issues on consumer shopping behavior. As we know, there are various other factors in addition to inventory management issues that can impact consumer shopping behavior. Therefore, I am trying to understand whether it's appropriate to utilize SEM solely to model various aspects of inventory management and their impact on consumer shopping behavior, while not including other factors that might impact the dependent variable. – Marjaan Sep 12 '23 at 18:35
  • McElreath has cautioned against putting as many potential predictors as possible into scientific models. It is an example of what he calls "causal salad". See this talk for an introduction. – Galen Sep 12 '23 at 20:18
  • Yeah, you should think hard about your predictors - Willem Saris wrote years ago (I completely forget where) that if your R^2 isn't 0.9 or above you can't interpret your regression, because maybe you forgot an important predictor. I don't necessarily agree, but it's an interesting thought (IMHO). – Jeremy Miles Sep 13 '23 at 00:00
  • It seems to me that you are conflating SEMs with regression. I suggest not doing this and embracing another perspective. Read here: https://stats.stackexchange.com/questions/63417/difference-between-simultaneous-equation-model-and-structural-equation-model/400279#400279 – markowitz Sep 13 '23 at 07:51
  • SEMs are a form of regression. They're historically motivated by causal thinking, but they are just statistical models by themselves. If you don't put causal assumptions in, then you won't get causal inferences out. – Galen Sep 13 '23 at 14:52
  • Regression is SEM, just like ANOVA is regression and a t-test is regression. You can do more with SEM than regression, but when you have problems with SEMs, it can help to think about them as a regression problem. – Jeremy Miles Sep 13 '23 at 15:25
  • @JeremyMiles If SEM were regression, we could use SEM even for purely correlational analysis, pure forecasting, etc. This is not true. – markowitz Sep 14 '23 at 14:57
  • @Galen, your perspective is like: SEM = regression with causal assumptions therein. In this way all sorts of ambiguity can emerge. The right way, it seems to me, is to consider SEMs as something different from regression (see the link). – markowitz Sep 14 '23 at 15:03
  • @markowitz You missed my point about no-causes-in-no-causes-out. I am recommending to keep our thinking about the causality a priori to the statistical model we are developing. A SEM isn't a regression with causal assumptions per se; it is just a family of statistical models. The historical motivation for them is causal, and I recommend guiding the model development using causal assumptions, but that doesn't make SEM causal by itself. – Galen Sep 14 '23 at 15:18
  • The word "regression" is an imprecise term that does not apply specifically to simple/multiple linear regression (see functional regression for an example). Likewise, deep learning models are usually considered a regression problem. Pearl and some others aside, using the broad usage of the term, SEM is a form of regression. – Galen Sep 14 '23 at 15:18
  • @Galen, comments are not for long discussion, so I will give some clues and stop here. As I suspected, your view is vague and opens the door to ambiguities. Indeed, they emerge even in your comments. I see your point about "no-causes-in-no-causes-out"; this is right but it is not enough. You start with: “SEMs are a form of regression” … then … “A SEM isn't a regression with causal assumptions per se; it is just a family of statistical models”, therefore regression is a larger group of statistical models … but what definition do we give to statistical models? – markowitz Sep 15 '23 at 08:12
  • Then “[causal assumption] doesn't make SEM causal by itself” … so what makes the SEM causal? Moreover, can SEM be non-causal? Moreover again, “the word "regression" is an imprecise term”: indeed, an unclear definition of structural concepts often comes from an unclear definition of regression. … So you conclude “Pearl and some others aside, using the broad usage of the term, SEM is a form of regression”. No, in Pearl’s opinion SEM and regression are clearly different tools; he has written a great deal about it, and I think he is right. – markowitz Sep 15 '23 at 08:12
7

In addition to what Jeremy Miles wrote, we typically include an error (residual) term in a structural model for each dependent (endogenous) variable, thereby acknowledging that we do not know (or haven't measured/don't have access to) all possible predictors and/or causes of our dependent variables in the model. Of course, not including certain relevant independent variables can lead to bias in the estimated regression (path) coefficients for those independent variables that are included in the model.
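To make the bias point concrete, here is a minimal simulation sketch in plain Python (it is not part of the original answer). The variable names, the true coefficients (1.0 and 2.0), the 0.6 path linking the two predictors, and the sample size are all assumptions chosen purely for illustration. It compares the slope on `x1` when a relevant predictor is included, when it is omitted but correlated with `x1`, and when it is omitted but independent of `x1`.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

def slope_one(x, y):
    """OLS slope of y on a single predictor x (intercept implied)."""
    return cov(x, y) / cov(x, x)

def slopes_two(x1, x2, y):
    """OLS slopes of y on x1 and x2, solving the 2x2 normal equations."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

rng = random.Random(0)  # fixed seed so the run is reproducible
n = 20000

# Assumed data-generating process (illustrative only).
x2 = [rng.gauss(0, 1) for _ in range(n)]
x1 = [0.6 * z + rng.gauss(0, 1) for z in x2]   # x1 is correlated with x2
u = [rng.gauss(0, 1) for _ in range(n)]        # independent of x1
y = [1.0 * a + 2.0 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
y_indep = [1.0 * a + 2.0 * c + rng.gauss(0, 1) for a, c in zip(x1, u)]

b_full = slopes_two(x1, x2, y)     # both predictors included
b_short = slope_one(x1, y)         # x2 omitted, correlated with x1
b_indep = slope_one(x1, y_indep)   # u omitted, independent of x1

print("full model slopes:       ", b_full)
print("x2 omitted (correlated): ", b_short)
print("u omitted (independent): ", b_indep)
```

Under these assumptions, the full model should recover slopes near 1.0 and 2.0, the slope with the correlated predictor omitted should drift toward about 1.88 (1 + 2 × 0.6 / 1.36), and the slope with the independent predictor omitted should stay near 1.0. In other words, the residual term can absorb an omitted cause without biasing the included paths only when that cause is unrelated to the included predictors.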

  • Regression has an error term too. You just don't show it or do anything with it. (I would even argue that it's latent.) – Jeremy Miles Sep 13 '23 at 15:26
4

I think choosing a "correct" SEM model boils down to these important questions:

  • Does my model make sense? I often see cross-lagged models that become giant cobwebs with no real interpretability, and it makes me question what the end goal of such an analysis is. The important part is that your model has some actual utility outside of AMOS or lavaan. As some others have alluded to already, building an interpretable design using something like DAGs helps narrow down which variables are actually important to your question and which are not.
  • Can my data actually support this? It is well known that a large part of SEM is fitting an implied relationship to the actual data, and this is contingent on the data actually behaving in the way you are considering. Loading up a ton of predictors without considering how important they are can lead directly to poorly fitting models and issues with model convergence, and there are a variety of reasons this happens which are directly relevant to which predictors you include.
  • Even if my data supports this, is it necessary? Related to concerns already noted here about whether or not you are overfitting your paths, another issue is whether or not you are wasting energy on including as many paths as possible. If, for example, we know that $X_1$, $X_2$, and $X_3$ have substantial loadings onto a latent factor $X$, but we also have variables $X_4$ through $X_{30}$ which are statistically significant but have weak loadings overall, we may not really need to include them in a model, given that they don't explain much anyway.
  • Are there better models than mine? I think it's not always great to think of your specified model in isolation, but rather as one of an infinite continuum of possible models. This is sometimes called the issue of alternative or equivalent models. Related to my earlier point about DAGs, you should consider constructing other models to see what the weak points of your model are, what the strong points are, and what may alternatively explain the phenomena you are after. This will help determine which predictors are ultimately important.

If these points could be summarized into one sentence, I would say that your goal is to "explain phenomena in a way which is best explained by the data."

2

Basically, yes, you can consider adding all the variables that influence the output variable(s).

Most assumptions in that regard are no different than in linear regression: too many inputs will lead to a Neyman-Scott kind of bias. Additionally, you need to be sure of their causal relationship. Many "kitchen sink" models fail to consider which "inputs" might actually be colliders or mediators, which shouldn't be included as predictors.

The advantage of SEM is that, unlike OLS, you can specify this kind of structure in the modeling and avoid bias and perhaps even boost efficiency. It's just quite a bit harder to extract the correct inference.
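As a rough illustration of the collider problem (my own sketch, not from the answer), the plain-Python simulation below assumes a true effect of 0.5 from `x` to `y` and a hypothetical variable `c` that is caused by both `x` and `y`. All names, coefficients, and the sample size are assumptions made for the example. It compares the slope on `x` with and without the collider in the regression.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (len(a) - 1)

def slope_one(x, y):
    """OLS slope of y on a single predictor x (intercept implied)."""
    return cov(x, y) / cov(x, x)

def slopes_two(x1, x2, y):
    """OLS slopes of y on x1 and x2, solving the 2x2 normal equations."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

rng = random.Random(1)  # fixed seed so the run is reproducible
n = 20000

# Assumed causal structure (illustrative only): x -> y, and x -> c <- y.
x = [rng.gauss(0, 1) for _ in range(n)]
y = [0.5 * a + rng.gauss(0, 1) for a in x]
c = [a + b + rng.gauss(0, 1) for a, b in zip(x, y)]  # collider

b_plain = slope_one(x, y)            # collider left out
b_adjusted = slopes_two(x, c, y)[0]  # collider wrongly used as a predictor

print("slope on x, collider excluded:", b_plain)
print("slope on x, collider included:", b_adjusted)
```

Under these assumed coefficients, the slope without the collider should land near the true 0.5, while "controlling" for the collider should push it toward about -0.25, flipping the sign. This is the sense in which adding more variables can make an estimate worse rather than better.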

AdamO
  • 62,637
0

You're asking, I assume, because you want to know how much one variable explains another, that is, to estimate a regression coefficient. Different predictors can "steal" weight from one another depending on which ones are in the model. This reflects no change in the actual data, only a different interpretation of the same data. So yes, it can be a problem: if you use an incomplete model, you're going to end up with results that could be interpreted differently from those of a complete model.
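For the two-predictor case, the standard omitted-variable result shows how much weight gets "stolen". The notation here is my own, not the answer's, and it assumes a linear model whose error $\varepsilon$ is uncorrelated with both predictors. Suppose the complete model is

$$y = \beta_1 x_1 + \beta_2 x_2 + \varepsilon,$$

and the incomplete model regresses $y$ on $x_1$ alone. In large samples, the slope from the incomplete model tends to

$$\tilde{\beta}_1 = \beta_1 + \beta_2 \, \frac{\operatorname{Cov}(x_1, x_2)}{\operatorname{Var}(x_1)}.$$

So the coefficient on $x_1$ absorbs part of the effect of $x_2$ only when $x_2$ matters ($\beta_2 \neq 0$) and the two predictors are correlated; if either condition fails, the incomplete and complete models agree on $\beta_1$.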

As others have suggested in comments directly on your question, this means you need to question the causal relationships between the variables. As they have also pointed out, this is no different than in a regression, except that you're going to specify certain residual characteristics depending on whether you're analyzing latent or observed variables.

Keeping with our usual unscrupulous regression examples, imagine that you create a SEM model that explains surfing accidents on one day by the number of soda cans found on the beach the next day. The model is likely to return some level of explanation. However, once you add the number of

No statistical model is able to tell anything about causality. Some are better than others at helping you formulate a theory about it, but it's the theory that has to do the heavy lifting. So no one can expect the SEM model to be explaining causality. As long as you explain what you're trying to show with the model and the downsides to doing it that way, using an incomplete model is not a big problem. Most often, this intermediate step towards a complete model is worth it.