
We want to verify the association between the type of pregnancy, artificial (via artificial insemination) versus natural, and a series of pregnancy conditions (hypertension, diabetes, etc.).

We first assessed the effect of the reproduction type on each single pregnancy condition, adjusting for subject-level confounders such as age and BMI, producing a series of regressions.

Regression 1:

$$condition_1 \sim preg\_type + age + BMI\\ condition_2 \sim preg\_type + age + BMI\\ ...\\ condition_n \sim preg\_type + age + BMI$$

We then thought that we could do the opposite: that is, use a logistic model to assess the independent effect of having one of those conditions (modeled as covariates, together with the aforementioned confounders) on the type of pregnancy.

Regression 2:

$$preg\_type \sim age + BMI + condition_1 + ... + condition_n$$ This would allow us to identify the independent effect of each condition. Obviously we know that these conditions cannot "cause" the use of an artificial reproduction technique. But regression methods assess only association, not causation, so conceptually the model should be valid.
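To make the contrast concrete, here is a minimal sketch of "regression 2" as a logistic model, fitted by plain gradient ascent on the log-likelihood using only the Python standard library (a stats package would normally be used instead). The data are synthetic: the sample size, the single condition, and the data-generating coefficients are all made up for illustration.

```python
import math
import random

# Synthetic data: preg_type (1 = artificial) is assumed to depend
# only on age; BMI and condition_1 are noise covariates.
random.seed(0)
rows = []
for _ in range(200):
    age = random.uniform(20, 45)
    bmi = random.uniform(18, 35)
    condition_1 = random.randint(0, 1)
    z = -8.0 + 0.25 * age  # assumed true log-odds of artificial pregnancy
    p = 1 / (1 + math.exp(-z))
    preg_type = 1 if random.random() < p else 0
    rows.append(([1.0, age, bmi, condition_1], preg_type))

# Fit preg_type ~ age + BMI + condition_1 by gradient ascent.
w = [0.0] * 4  # intercept, age, BMI, condition_1
for _ in range(2000):
    grad = [0.0] * 4
    for x, y in rows:
        p = 1 / (1 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
        for j in range(4):
            grad[j] += (y - p) * x[j]
    w = [wi + 0.001 * g / len(rows) for wi, g in zip(w, grad)]

print(w)  # the age coefficient w[1] should come out positive
```

Each fitted coefficient is a conditional (log-odds) association with the pregnancy type, holding the other covariates fixed, which is exactly the quantity regression 2 estimates.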

Our questions are:

  • What is the difference in interpretation between the two kinds of analysis? Which questions do they answer?
  • Is regression 2 a proper analysis design, or is it fundamentally flawed?
  • Is it a valid interpretation to state that using the conditions as covariates provides insight into the independent association between each condition and the type of pregnancy?
Bakaburg

2 Answers

preg_type ~ age + BMI + condition_1 + ... + condition_n

The second regression can be used in cases where you don't have any information about whether the pregnancy was natural or artificial: you are trying to guess the type of pregnancy from the conditions. That is an entirely different question. You would build the model with the data in hand, and then predict the type of pregnancy for cases where you only have information on the conditions (not the pregnancy type).

Let's say the dependent variable takes the value 1 when the pregnancy is natural and 0 otherwise. Here is how I would interpret the coefficients in the second regression:

The coefficient on condition 1 is the log odds ratio of the pregnancy being natural associated with having condition 1, holding the other covariates fixed.

Similarly, the coefficient on condition 2 is the log odds ratio of the pregnancy being natural associated with having condition 2.
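Since the raw coefficient is on the log-odds scale, exponentiating it gives an odds ratio, which is usually easier to report. A minimal sketch, assuming a made-up fitted coefficient value of 0.7:

```python
import math

# Hypothetical fitted log-odds coefficient for condition_1
# (the value 0.7 is made up for illustration).
beta_condition_1 = 0.7

# Exponentiating converts it to an odds ratio: the odds of a
# natural pregnancy for subjects with condition 1 are about twice
# the odds for subjects without it, other covariates held fixed.
odds_ratio = math.exp(beta_condition_1)
print(round(odds_ratio, 2))  # 2.01
```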

subra
I understand your view. In this case I'm interested in inference more than prediction. Regarding the coefficients, I'm troubled by the biological interpretation of the effect of the various conditions on preg_type when I treat them as covariates (regression 2), compared with evaluating the effect of preg_type on the odds of every condition separately (regression 1). I'll edit my question. – Bakaburg Sep 01 '15 at 08:57
  • @Bakaburg: I get your question now. One obvious problem with the second regression:

    If your first model is true, then age and BMI will be correlated with condition 1, condition 2, etc. This is called multicollinearity. Multicollinearity can inflate the variance of the coefficient estimates, so your estimates in the second regression will be unstable. Since you are interested in inference, the second regression is not going to help.

    – subra Sep 01 '15 at 17:04
  • Uhm, it's not to be taken for granted that age and the other covariates in regression 1 are always correlated with the various conditions. I put them in the models because they are obvious probable confounders that I have to test, and most of the time the correlation is not significant. Anyway, it is right to test for multicollinearity, so I ran a variance inflation analysis on regression 2 and verified that the VIFs are low. – Bakaburg Sep 02 '15 at 14:38
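The VIF check mentioned in the comments can be sketched for the two-predictor case, where the R² from regressing one predictor on the other is simply the squared Pearson correlation. The age/BMI values below are made up, and deliberately nearly collinear:

```python
import math

# Toy VIF computation: VIF = 1 / (1 - R^2), where R^2 comes from
# regressing one predictor on the remaining predictors. With a
# single other predictor, R^2 is the squared Pearson correlation.
age = [25.0, 30.0, 35.0, 40.0, 28.0, 33.0, 38.0, 26.0]
bmi = [21.0, 24.5, 27.0, 29.5, 22.0, 25.5, 28.0, 21.5]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r2 = pearson(age, bmi) ** 2
vif = 1 / (1 - r2)
print(round(vif, 2))  # large here, since the toy data are nearly collinear
```

A common rule of thumb treats VIF values above roughly 5 or 10 as a warning sign of problematic multicollinearity.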

While it is certainly true that regression does not assess causation, I see two problems with your second model.

First, switching the types of variables (dependent and independent; outcome and predictor; whatever names you want to give them) leads to conceptual confusion along the causation line. To take a simple linear regression example, we regress weight~height and not the other way round, because height "causes" weight (I put cause in quotations because it's not exactly causal, but I don't know of a good word for what is going on here).

Second, your second model is clearly an attempt to make one equation instead of several. But ... the model is now controlling for the "effect" of the other variables. I don't see how this makes sense in this situation.

Peter Flom
  • Depends on what the goal is, I think. If the goal is simply to have one test if there are differences between the groups at all, logistic seems sensible. See https://stats.stackexchange.com/questions/190156/t-tests-manova-or-logistic-regression-how-to-compare-two-groups and https://stats.stackexchange.com/questions/129442/using-the-hotelling-package-in-r/487803#487803 – kjetil b halvorsen Jan 04 '24 at 13:02
    Dear Peter, thank you for the answer. It's been 8 years since then, and now I would definitely never consider Model 2, since it would be a quintessential causal salad with hardly interpretable coefficients. – Bakaburg Jan 04 '24 at 16:28