Does it make sense to include a predictor that is by definition related to the response variable in a regression model?

Question

I want to use a multiple logistic regression to model the relationship between two experimental groups (test and control) and accuracy of a procedure, controlling for the experience (in years) of the participants.

outcome ~ group + experience

The design I am using is paired in the sense that every participant is tested twice, so there are no differences in baseline characteristics between groups (since they are the same individuals). If I were only testing for differences in time, a paired t-test would suffice, but I need to control for experience, hence a regression model is being built.

Time is measured in seconds until the procedure is completed, and accuracy is defined as completing it within a pre-specified threshold (the outcome is 1 if less than or equal to 6 minutes and 0 otherwise). It is expected that time and experience are negatively correlated; that is, experienced practitioners are expected to take less time to complete the procedure.

I would like to test for interactions in this model, but I don't think it makes much sense to interact the group with experience.

outcome ~ group * experience

I am considering including time in the model and testing for an interaction with experience.

outcome ~ group + experience*time

Since time is used in the definition of the response of the logistic model, I expect it to be significant even with a small sample size. However, it seems to me that including time in this model would be circular reasoning.

outcome ~ group*time + experience

Q1: Is this a correct interpretation?

Q2: If I try interactions between time and the group instead, would that tell me that time is modifying the effect attributed to the group?

Q3: Does it make sense to test for interactions between experience and group in this setting?

EDIT: I understand Douglas Altman's point that, while unnecessary dichotomization of a continuous variable is prevalent in medical research, it leads to a loss of estimate precision (at the very least). I was able to make the case for a linear model of time ~ group + experience for this experiment as a secondary endpoint, but the main goal needs to remain accuracy, which is why the outcome is a dichotomization of time. This practice is prevalent for a reason :)

Could you include the formula/specification for the different models you are considering? — mkt, May 11 '23 at 11:07
Can you explain time and outcome in a little more detail? Also, the question in your title is perhaps too much of an oversimplification. It can be very useful to include predictors that are known to be related to the response variable in some situations, and less so in others, depending on the causal pathways. — mkt, May 11 '23 at 11:48
@mkt Time is measured in seconds, and the outcome is 1 if less than or equal to 6 minutes and 0 otherwise. Is that sufficient? — philsf, May 11 '23 at 11:55
It seems to me that you could just model time then - what's the reason for taking the continuous value and turning it into a binary? It throws away information. — mkt, May 11 '23 at 11:56
But I agree that if outcome is defined completely by time, it doesn't make sense to use time as a predictor for outcome (or vice versa, for that matter). — mkt, May 11 '23 at 11:58
@mkt Time will be assessed as a secondary outcome (with a linear model), but the main goal is to assess accuracy. Thank you for your response, would you expand on it for an answer? — philsf, May 11 '23 at 12:04
With an interaction, coefficients reported by a model summary() depend on predictor coding. For evaluating a predictor, use a measure that includes all terms involving it. The Anova() function in the R car package does that in a way that (unlike the basic anova() function) doesn't depend on the order of entry of variables in the model. Use post-modeling tools like those in the emmeans package to illustrate specific scenarios properly. — EdM, May 11 '23 at 14:58

mkt · Accepted Answer · 2023-05-11T18:02:33.020

7

outcome in this situation is fully determined by time. Another way to say this is that it is simply a re-expression of time on a binary scale. So if you were to model outcome and use time as a predictor, the other predictors would not matter - all the variation in outcome would be fully explained by time.

Actually, that's an oversimplification - it would be worse than this. You would presumably be using a logistic regression, and would encounter the problem of perfect separation.
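To make the separation problem concrete, here is a minimal sketch in plain Python. The completion times are made up for illustration, and the one-predictor model logit P(outcome = 1) = -slope * (time - cut) is a simplification of the models in the question. Because outcome is computed from time, the log-likelihood keeps rising toward 0 as the slope grows, so there is no finite maximum likelihood estimate.

```python
import math

THRESHOLD_S = 360  # 6 minutes, as in the question

# Hypothetical completion times in seconds (illustrative, not real data).
times = [212, 287, 301, 344, 360, 371, 402, 455, 518, 640]
outcome = [1 if t <= THRESHOLD_S else 0 for t in times]

def log_sigmoid(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def log_likelihood(slope, cut):
    """Log-likelihood of logit P(outcome = 1) = -slope * (time - cut)."""
    total = 0.0
    for t, y in zip(times, outcome):
        z = -slope * (t - cut)
        total += log_sigmoid(z) if y == 1 else log_sigmoid(-z)
    return total

# Any cut between the slowest "success" and the fastest "failure"
# separates the two outcome classes perfectly.
slowest_success = max(t for t, y in zip(times, outcome) if y == 1)
fastest_failure = min(t for t, y in zip(times, outcome) if y == 0)
cut = (slowest_success + fastest_failure) / 2

slopes = [0.01, 0.1, 1.0, 10.0]
log_liks = [log_likelihood(b, cut) for b in slopes]
for b, ll in zip(slopes, log_liks):
    print(f"slope = {b:>5}: log-likelihood = {ll:.6g}")
```

Each larger slope fits better than the last, which is the behaviour that makes standard logistic regression software report huge coefficients, huge standard errors, or convergence warnings in this situation.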

I would add that outcome does not seem very useful as a response variable. You could just model time as the response; taking a continuous value and turning it into a binary one throws away useful information. It's very unlikely that the binary outcome variable is a better metric of 'accuracy' than time.
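As a small illustration of the information loss, consider four made-up completion times (hypothetical values, not data from the study) coded with the 6-minute threshold from the question:

```python
THRESHOLD_S = 360  # 6 minutes, as in the question

# Hypothetical times in seconds for four attempts (illustrative only).
times = [120, 355, 365, 900]
outcome = [1 if t <= THRESHOLD_S else 0 for t in times]

for t, y in zip(times, outcome):
    print(f"time = {t:>3} s -> outcome = {y}")
```

The attempts at 120 s and 355 s become indistinguishable, while the attempts at 355 s and 365 s, only 10 seconds apart, end up in opposite categories.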

EDIT:

Yes, including group*experience makes sense. As EdM says, it would tell you whether the effect of the treatment (group) on the outcome changes with experience. This is easier to understand if you plot the model output.
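Written out, and assuming group is coded 0 for control and 1 for test (a coding chosen here for illustration), the model outcome ~ group * experience is:

$$
\operatorname{logit} P(\text{outcome} = 1) = \beta_0 + \beta_1\,\text{group} + \beta_2\,\text{experience} + \beta_3\,(\text{group} \times \text{experience})
$$

Under that coding, the log-odds difference between test and control for a participant with $x$ years of experience is $\beta_1 + \beta_3 x$. If $\beta_3 = 0$, this reduces to the additive model, in which the group effect is the same at every level of experience.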

Also, if you've measured each participant more than once, you will need to model the non-independence of data points. A mixed model (random intercept and perhaps random slope for participant) would help address this.
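As a rough sketch of what a random-intercept version could look like (the notation is an assumption for illustration, with $i$ indexing participants and $j = 1, 2$ indexing the two measurements):

$$
\operatorname{logit} P(\text{outcome}_{ij} = 1) = \beta_0 + u_i + \beta_1\,\text{group}_{ij} + \beta_2\,\text{experience}_i + \beta_3\,\text{group}_{ij}\,\text{experience}_i, \qquad u_i \sim \mathcal{N}(0, \sigma_u^2)
$$

Here $u_i$ is the participant-specific intercept that accounts for the two measurements on the same person being correlated.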

edited May 11 '23 at 18:02

answered May 11 '23 at 12:13

mkt


thanks, that's helpful. What about the consideration of group*experience interaction? Does it help explain anything? How would you interpret this, if it turns out to be significant? – philsf May 11 '23 at 12:19
No, experience is measured in years, as stated: for how many years has the participant performed the procedure? – philsf May 11 '23 at 12:29
@philsf if the effect of the experimental group might differ as a function of prior experience in years, then that calls for an interaction between them. The additive model without the interaction assumes no difference in experimental effects as a function of prior experience. I agree with this answer: you would be much better off modeling the time to completion as continuous rather than all-or-none. If you cease all trials at 6 minutes you would code 6-minute trial durations as right-censored and use a "tobit" type of continuous regression. – EdM May 11 '23 at 12:32
The design I am using is paired in the sense that every participant is tested twice, so there are no differences in baseline characteristics between groups (since they are the same individuals).
If I was only testing for differences in time, a paired t-test would suffice, but I need to control for experience, hence a regression model is being built.

Does the interaction between experience and group explain anything in this setting?
– philsf May 11 '23 at 12:33
@EdM If I understand your comment correctly, I could formulate a hypothesis that inexperienced participants are more likely to have a different average time (or accuracy) than experts, and test it with the interaction I proposed. Is this a correct interpretation of your comment? – philsf May 11 '23 at 12:38
@philsf not quite. The interaction would examine whether the effect of the experimental manipulation differed as a function of prior experience. The simple additive model would handle "inexperienced participants ... more likely to have a different average time (or accuracy) than experts" on its own, if the effect of the experimental manipulation doesn't depend on prior experience. If you have a large enough data set, it makes sense to include an interaction to check that. Also consider whether your simple linear model of experience is reasonable; a flexible spline fit is often better. – EdM May 11 '23 at 12:43
@EdM thanks for your +1 comment. I will add a note to the question acknowledging that a linear model will be preferable to a logistic model for this data. Unfortunately, we do not always get to make all the design decisions. – philsf May 11 '23 at 12:52
@EdM would you mind expanding on this point for the interaction as an answer addressing Q3? I am still trying to wrap my head around whether it makes sense to consider this interaction. – philsf May 11 '23 at 13:07
@philsf I've updated my answer to address that. I agree with all EdM's comments. – mkt May 11 '23 at 13:21
@philsf the addition to this answer says what I would have said: there's no downside to including the interaction. For further study, this answer and this answer show plots of situations in which an interaction term between a categorical predictor (group here) and a continuous predictor (experience here) would be important. Harrell's Regression Modeling Strategies is a useful resource on choosing predictors and flexible modeling of continuous predictors. – EdM May 11 '23 at 13:40
