Imagine you are modelling an outcome based on two discrete factors, factor1 and factor2, each of which can be either yes or no.

You could do something like:

model <- lm(outcome ~ factor1 + factor2)

Now suppose that, of the four combinations of values for the two factors, one can never happen. For instance, if factor1 is true, factor2 can be either true or false, but if factor1 is false, factor2 can only be false.

Does one need to account for that in the model? And if so, how?

nico

2 Answers

Methodologically, there is no reason why this should be an issue. Each model coefficient is interpreted as an "associated difference in outcome for factor1 holding factor2 constant". For instance, if the outcome were serum testosterone (mg/ml), factor1 were gender, and factor2 were pregnancy status, the estimated coefficient for factor2 would be exactly the same if you restricted the data to women and dropped gender from the model.

If you used the model to generate a post-estimation contrast comparing serum testosterone in pregnant males with pregnant females, that estimate would be incorrect, because the data contain no pregnant males to inform it. This is an issue of extrapolation. However, the model coefficients are still correct, in that they borrow information across groups where such information is available.
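As a rough illustration of the two points above, here is a minimal sketch in R on simulated data. The data frame `d`, the effect sizes, and the group size of 20 are all made up for the example; only the `outcome ~ factor1 + factor2` formula comes from the question.

```r
# Simulated data (hypothetical values): the combination
# factor1 = "no", factor2 = "yes" never occurs.
set.seed(42)
d <- data.frame(
  factor1 = factor(rep(c("no", "yes", "yes"), each = 20), levels = c("no", "yes")),
  factor2 = factor(rep(c("no", "no", "yes"), each = 20), levels = c("no", "yes"))
)
# Assumed true effects: +2 for factor1, +3 for factor2, plus noise.
d$outcome <- 10 + 2 * (d$factor1 == "yes") + 3 * (d$factor2 == "yes") +
  rnorm(nrow(d))

# Model from the question, fitted to all three observed combinations.
full <- lm(outcome ~ factor1 + factor2, data = d)

# Same outcome, restricted to factor1 == "yes", with factor1 dropped.
restricted <- lm(outcome ~ factor2, data = subset(d, factor1 == "yes"))

# The factor2 estimate is identical in both fits.
b2_full <- unname(coef(full)["factor2yes"])
b2_restricted <- unname(coef(restricted)["factor2yes"])
print(all.equal(b2_full, b2_restricted))

# predict() still returns a number for the combination that never occurs;
# that number is pure extrapolation from the additive structure.
impossible <- data.frame(
  factor1 = factor("no", levels = c("no", "yes")),
  factor2 = factor("yes", levels = c("no", "yes"))
)
extrapolated <- unname(predict(full, newdata = impossible))
print(extrapolated)
```

The point estimates agree, but the standard errors generally will not, because the full model pools the residual variance over all three groups.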

AdamO

Look at it this way: you could have a single categorical variable with three levels: “both false”, “only factor 1 true”, and “both true”. If you use dummy coding with “both false” as the reference category, you end up with the same model, mathematically speaking. The only difference is that you are not tempted to include an interaction or to interpret what happens when only factor 2 is true.
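The equivalence can be checked numerically. The following is a sketch on simulated data; the data frame `d`, the variable `combined`, its level names, and the effect sizes are hypothetical choices for illustration.

```r
# Simulated data (hypothetical values) with only three of the four
# combinations present: no/no, yes/no, yes/yes.
set.seed(42)
d <- data.frame(
  factor1 = factor(rep(c("no", "yes", "yes"), each = 20), levels = c("no", "yes")),
  factor2 = factor(rep(c("no", "no", "yes"), each = 20), levels = c("no", "yes"))
)
d$outcome <- 10 + 2 * (d$factor1 == "yes") + 3 * (d$factor2 == "yes") +
  rnorm(nrow(d))

# Single three-level variable; "both_no" is the reference category.
d$combined <- factor(
  ifelse(d$factor1 == "no", "both_no",
         ifelse(d$factor2 == "no", "only_factor1", "both_yes")),
  levels = c("both_no", "only_factor1", "both_yes")
)

two_factor <- lm(outcome ~ factor1 + factor2, data = d)
one_factor <- lm(outcome ~ combined, data = d)

# Both parameterisations give the same fitted values.
print(all.equal(fitted(two_factor), fitted(one_factor)))

# The coefficients map onto each other:
#   combinedonly_factor1 = factor1yes
#   combinedboth_yes     = factor1yes + factor2yes
b <- coef(two_factor)
g <- coef(one_factor)
print(all.equal(unname(g["combinedonly_factor1"]), unname(b["factor1yes"])))
print(all.equal(unname(g["combinedboth_yes"]),
                unname(b["factor1yes"] + b["factor2yes"])))
```

With three observed cells and three parameters, both models simply reproduce the three cell means, which is why the fits coincide.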

Gala