We want to verify the association between artificial pregnancy (artificial insemination) versus natural pregnancy and a series of pregnancy conditions (hypertension, diabetes, etc.).
We first assessed the effect of the reproduction type, adjusted for subject-level confounders such as age and BMI, on each single pregnancy condition, producing a series of regressions.
Regression 1:
$$condition_1 \sim preg\_type + age + BMI\\ condition_2 \sim preg\_type + age + BMI\\ \vdots\\ condition_n \sim preg\_type + age + BMI$$
We thought that we could actually do the opposite: using a logistic model, we assess the independent effect of having one of those conditions (modeled as covariates, together with the aforementioned confounders) on the type of pregnancy.
Regression 2:
$$preg\_type \sim age + BMI + condition_1 + \dots + condition_n$$ This would allow us to identify the independent effect of each condition. Obviously, we know that these conditions cannot "cause" the use of an artificial reproduction technique. But regression methods assess only association, not causation, so conceptually the model should be valid.
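To make the contrast between the two designs concrete, here is a minimal sketch with simulated data (the effect sizes and the data-generating process are made up for illustration, and a hand-rolled gradient-ascent fit stands in for standard logistic-regression software):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Simulated confounders (standardized, so the fit is well conditioned)
age = rng.normal(size=n)
bmi = rng.normal(size=n)

# Hypothetical data-generating process: pregnancy type depends on age,
# and one condition depends on pregnancy type and BMI.
sigmoid = lambda z: 1 / (1 + np.exp(-z))
preg_type = (rng.random(n) < sigmoid(0.5 * age)).astype(float)
condition1 = (rng.random(n) < sigmoid(-1.0 + 0.8 * preg_type + 0.3 * bmi)).astype(float)

def fit_logistic(X, y, steps=5000, lr=0.1):
    """Logistic regression by plain gradient ascent; column 0 is the intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        beta += lr * X.T @ (y - sigmoid(X @ beta)) / len(y)
    return beta

# Regression 1: condition_1 ~ preg_type + age + BMI
b1 = fit_logistic(np.column_stack([preg_type, age, bmi]), condition1)

# Regression 2: preg_type ~ age + BMI + condition_1
b2 = fit_logistic(np.column_stack([age, bmi, condition1]), preg_type)

print("Regression 1 coefficients (intercept, preg_type, age, bmi):", b1)
print("Regression 2 coefficients (intercept, age, bmi, condition1):", b2)
```

Both fits recover a positive association between `preg_type` and `condition1`, but the coefficients answer different questions: `b1[1]` is a log-odds ratio for the condition given pregnancy type, while `b2[3]` is a log-odds ratio for pregnancy type given the condition.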
Our questions are:
- What is the difference in interpretation between the two kinds of analysis? Which questions do they answer?
- Is regression 2 a proper analysis design, or is it fundamentally flawed?
- Is it valid to state that using the conditions as covariates would provide insight into the independent association between each condition and the type of pregnancy?
If your first model is true, then age and BMI will be correlated with condition_1, condition_2, etc. This is called multicollinearity. Multicollinearity can increase the variance of the coefficient estimates, so your estimates will be unstable in your second regression. Because you are interested in inference, the second regression is not going to help.
– subra Sep 01 '15 at 17:04
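The variance inflation mentioned in the comment is easy to check numerically. A minimal OLS sketch (simulated data, not from the question) compares the sampling variance of a coefficient when its predictor is uncorrelated versus highly correlated with another covariate:

```python
import numpy as np

def coef_variance(rho, n=500, sigma2=1.0):
    """Theoretical OLS variance of the first coefficient, sigma^2 * (X'X)^{-1}[0, 0],
    with two simulated predictors whose correlation is approximately rho."""
    rng = np.random.default_rng(1)
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
    X = np.column_stack([x1, x2])
    return sigma2 * np.linalg.inv(X.T @ X)[0, 0]

v_low, v_high = coef_variance(0.0), coef_variance(0.95)
print(f"variance at rho=0.00: {v_low:.5f}")
print(f"variance at rho=0.95: {v_high:.5f}")
```

In theory the variance is inflated by a factor of $1/(1-\rho^2)$, roughly 10 at $\rho = 0.95$, which is exactly the instability the comment warns about for regression 2.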