
Consider a general problem where we try to model an output variable $Y$ with several independent variables $X_1$, $X_2$, $X_3$, etc. that are binary or continuous. From a previous study, we know that the values of the continuous variable $X_1$ are affected by a binary variable $Z$, but $Z$ itself has no effect on the output. How should I model this in R?

  1. Y ~ X1:Z + X2 + X3
  2. Y ~ X1:Z + X1 + X2 + X3
  3. Y ~ X1:Z + Z + X1 + X2 + X3
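
For concreteness, the three candidates could be fit like this (a sketch assuming a hypothetical data frame `dat` with columns `Y`, `X1`, `X2`, `X3`, `Z`; for a binary $Y$, `glm(..., family = binomial)` would replace `lm`):

```r
# Hypothetical data frame `dat`; Z coded 0/1
m1 <- lm(Y ~ X1:Z + X2 + X3,          data = dat)  # interaction only
m2 <- lm(Y ~ X1:Z + X1 + X2 + X3,     data = dat)  # + main effect of X1
m3 <- lm(Y ~ X1:Z + Z + X1 + X2 + X3, data = dat)  # + both main effects
# Option 3 is equivalent to the shorthand: lm(Y ~ X1*Z + X2 + X3, data = dat)
```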

Here is my concrete example, as it might help: $X_1$, $X_2$, $X_3$ are features extracted from medical imaging data, such as, for each patient, the mean or maximum of the values in a region of interest. $Y$ could be either a binary output describing whether the tumor is aggressive, or survival data such as overall survival. $Z$ is a binary variable indicating whether the patient received premedication before the image acquisition. We know from a previous study that if premedication is given, we will observe higher $X_1$ values than in its absence.

My instinct tells me to use option 1 because $Z$ has no impact on $Y$. Whether premedication was given depends only on the date of the image acquisition (the protocol changed over time), so in my opinion we can assume it is random. But from what I read in an older post, it doesn't sound like a good idea to omit the main effect term.

    If indeed "$Z$ has no effect on the output," then what's the point of including it at all?? – whuber Mar 07 '23 at 19:49
  • I'm confused as to why you need to include the variable $Z$ at all if it does not affect the outcome $Y$? What is your goal? Is it inference on the $X$ variables? Is it to predict $Y$ accurately? – Yashaswi Mohanty Mar 07 '23 at 19:50
  • The point is to include the fact that $X_1$ values will depend on $Z$ and to try to mitigate this effect: to be sure that $X_1$ is an independent predictor of $Y$ and that this observation is not due to the perturbation by $Z$. Am I wrong to frame the problem like this? The problem arises because $X_1$, without interaction with $Z$, is apparently an independent predictor of $Y$. But we know that $Z$ has an effect on the values of $X_1$. We would like to take that into account, even though we know that the classes of the binary output $Y$ have the same proportions across the $Z$ classes. Is that check enough? – Timothée ZARAGORI Mar 07 '23 at 20:36
  • I am now doubly confused, because in your post you definitely state $Z$ has no effect, but in your comment you just as definitely state you know it has an effect! – whuber Mar 07 '23 at 21:15
  • It has no impact on $Y$ but an impact on $X_1$. It is a premedication before a medical imaging acquisition that will affect the image values ($X_1$), but having this premedication won't affect whether the tumor is aggressive or whether the patient survives longer ($Y$). We just would like to be sure that when we see $X_1$ as a significant predictor, it is not merely because of the effect of $Z$ on $X_1$ – Timothée ZARAGORI Mar 07 '23 at 21:34
  • You seem to be confusing a regression model, which assesses numerical relationships among variables, with a causality model. Regression does not directly address causality, nor does significance of any predictor imply anything (in itself) about causality. – whuber Mar 07 '23 at 22:24
  • I do want a regression model to assess the numerical relationships of my $X$ variables with my output $Y$, but isn't it possible to take into account, in the same model, that $Z$ has an impact on $X_1$ values, i.e. that the $X_1$ value observed for a patient is higher with $Z=1$ than it would have been with $Z=0$? Isn't that the purpose of interaction terms, to handle the fact that one variable modifies the effect of another? – Timothée ZARAGORI Mar 07 '23 at 22:37

1 Answer


The issue here is that the values of X1 change depending on whether or not treatment Z was applied before imaging. Thus the association of observed X1 values with outcome depends on whether Z was applied. The regression must take that into account.

If Z is coded as 0/1 for absence/presence, then Model 1 won't evaluate the association of X1 with outcome at all when Z=0. The interaction term is just the product of the individual predictor values, so the first term in your model will be 0 for Z=0 regardless of the value of X1.
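
You can see this directly from the model matrix (a minimal sketch with made-up values):

```r
# Toy data: two patients without premedication (Z = 0), two with (Z = 1)
d <- data.frame(X1 = c(1.2, 3.4, 2.1, 4.3), Z = c(0, 0, 1, 1))
model.matrix(~ X1:Z, data = d)
# The X1:Z column is 0 for both Z = 0 rows, so under Model 1 those
# patients' X1 values cannot contribute to the fit at all.
```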

Model 2 will provide an individual coefficient for X1 that represents its association with outcome for Z=0, and an interaction coefficient that represents the change in that association if X1 is measured following Z. You might get away with that if the model remains that simple and X1 values are affected multiplicatively by Z.

In general, Model 3 is the safest. See this page and its links for extensive discussion. The individual coefficient for Z in Model 3 will represent the apparent additive association of Z with outcome when X1=0, and the interaction coefficient allows for a proportional change in X1 values as a function of Z. You might find that the interaction term in that model isn't large. For example, if Z just has an additive effect on X1 values across the entire range, then you might find a corresponding coefficient for Z and an insignificant interaction coefficient.
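
A quick simulation along these lines (purely illustrative; the sample size and effect sizes are invented):

```r
set.seed(1)
n  <- 200
Z  <- rbinom(n, 1, 0.5)
X1 <- rnorm(n, mean = 5) + 2 * Z   # Z shifts X1 additively upward
Y  <- X1 + rnorm(n)                # Z has no direct effect on Y
m3 <- lm(Y ~ X1 * Z)               # expands to X1 + Z + X1:Z
summary(m3)
# With a purely additive shift of X1, the X1:Z interaction coefficient
# should come out near 0 while the X1 coefficient stays near 1.
```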

All of your models implicitly assume a direct linear association between your continuous X values and (a possible transformation of) outcome. That's often not the case, and you should consider more flexible modeling with regression splines or a generalized additive model. That's particularly the case if Z has a complicated non-additive or non-proportional effect on X1 values.
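
For example, with the mgcv package (a sketch only; the smooth terms and family are placeholders to adapt to your data):

```r
library(mgcv)
# Penalized spline terms relax the linearity assumption for the continuous
# predictors; `by = factor(Z)` lets the X1 smooth differ between Z groups.
fit <- gam(Y ~ s(X1, by = factor(Z)) + factor(Z) + s(X2) + s(X3),
           family = binomial,  # for a binary Y; drop for a continuous Y
           data = dat)
```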

EdM
  • Thank you for your detailed answer. I read a bit about linear mixed models (LMMs), as I think my interaction here leads to non-independence in the data. However, I don't know whether an LMM can handle non-independence affecting only $X_1$. Is lmer(Y ~ X1 + X2 + X3 + (X1|Z)) a better way to handle my interaction than lm(Y ~ X1 + X2 + X3 + Z + X1:Z)? – Timothée ZARAGORI Mar 09 '23 at 17:01
  • @TimothéeZARAGORI with only 2 levels of Z it would not make sense to treat it as a random effect in a mixed model. The "non-independence" that random effects help to model is typically something like intra-individual or intra-group correlation of outcomes due to factors that aren't included as predictors in the model. That doesn't describe the role of Z here, which changes the observed X1 predictor value even when the underlying biological status is the same. Random effects are generally most useful when there are about 6 or more separate groupings; you have only 2 groups based on Z. – EdM Mar 09 '23 at 19:47
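
In other words, with only two premedication groups, Z belongs in the fixed-effects part of the formula (a sketch, again assuming a data frame `dat` as above):

```r
# Fixed-effect interaction (appropriate here):
lm(Y ~ X1 * Z + X2 + X3, data = dat)
# A random slope of X1 by Z, as in lme4::lmer(Y ~ X1 + X2 + X3 + (X1 | Z)),
# would try to estimate a variance component from just 2 levels of Z,
# which is not sensible.
```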