
In my linear regression model, baseline BMI is a significant predictor of BMI change following a weight loss intervention and including it improves fit (e.g. AIC). But it is strongly associated with gender, so the relationship is not linear (see the two clusters in the scatterplot below). Is it therefore a bad idea to include baseline BMI as a covariate, as it violates one of the key assumptions of linear models, even though it improves fit?

I have tried squaring it, and interacting it with gender, but it made little difference to fit (e.g. AIC) or specification tests (e.g. RESET).

[Scatterplot: BMI change against baseline BMI, showing two clusters by gender]

mkt
  • Think about why you seem to have so little apparent overlap between males and females in baseline BMI. In general populations there's much more overlap. See the extensive table for BMI percentiles as functions of age and sex for the US in Wikipedia. Your males seem mostly to be below median male BMI, while your females are mostly at or above median female BMI. – EdM Apr 23 '22 at 18:22
  • Also on the topic of unusual observations, you describe the treatment as a weight loss intervention. But both men and women in the treatment group have higher BMI after the treatment than before as BMI change = post-BMI - pre-BMI is positive for all but one subject. – dipetkov Apr 23 '22 at 20:48

1 Answer


It is a bad idea not to include baseline BMI, because omitting it implicitly makes a strong assumption about the relationship between pre- and post-treatment BMI.

$$ Y_{change} = Y_{post} - Y_{pre} = \alpha + \beta\text{Female} + \theta\text{Treatment} + \text{Error} $$

is equivalent to

$$ Y_{post} = \alpha + \beta\text{Female} + Y_{pre} + \theta\text{Treatment} + \text{Error} $$

If the outcome is the change from baseline but pre-treatment BMI is not among the predictors, you are assuming that the coefficient of pre-treatment BMI in the model for post-treatment BMI is fixed at 1. The regression can handle estimating one more coefficient, so you don't have to make this unnecessary assumption.

Including pre-treatment BMI as a predictor also adjusts for any differences in BMI between the treatment and control groups, which can occur by chance even in a randomized study. Since you are working with observational data, it is even more important to adjust for possible confounders. (A confounder is a covariate that is associated with both the treatment and the outcome. For example, diet and exercise are potential confounders for weight loss.)

After we add the effect of pre-treatment BMI the model becomes:

$$ \text{(1)} \quad Y_{change} = \alpha + \beta\text{Female} + \gamma Y_{pre} + \theta\text{Treatment} + \text{Error} $$

In fact, consider modeling post-treatment BMI rather than the change in BMI:

$$ \text{(2)} \quad Y_{post} = \alpha + \beta\text{Female} + \gamma Y_{pre} + \theta\text{Treatment} + \text{Error} $$

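The treatment effect $\theta$ is the same in models (1) and (2). Model (2) is just model (1) with $Y_{pre}$ added to both sides, so only the coefficient of $Y_{pre}$ differs: it is larger by exactly 1 in (2). But the interpretation is more straightforward in (2).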
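
The relationship between the two models can be checked numerically. The following is a minimal sketch on simulated data (the sample size, coefficients, and variable names are assumptions made for illustration, not the questioner's data); it fits both models by ordinary least squares with NumPy and compares the estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated (hypothetical) data; these are not the data from the question.
female = rng.integers(0, 2, size=n)
treatment = rng.integers(0, 2, size=n)
y_pre = rng.normal(28.0, 2.0, size=n)
y_post = (3.0 + 0.5 * female + 0.9 * y_pre - 1.0 * treatment
          + rng.normal(0.0, 1.0, size=n))
y_change = y_post - y_pre

# Design matrix shared by both models: intercept, Female, Y_pre, Treatment.
X = np.column_stack([np.ones(n), female, y_pre, treatment])

coef_change, *_ = np.linalg.lstsq(X, y_change, rcond=None)  # model (1)
coef_post, *_ = np.linalg.lstsq(X, y_post, rcond=None)      # model (2)

print("model (1):", np.round(coef_change, 3))
print("model (2):", np.round(coef_post, 3))
print("difference (2) - (1):", np.round(coef_post - coef_change, 3))
```

The printed difference should be zero for the intercept, Female, and Treatment coefficients and 1 for the $Y_{pre}$ coefficient, up to floating-point error.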

Finally, the plots suggest that you should include an interaction between Gender and Treatment. This is the most interesting feature in the data: the response is very different between males and females in the treatment and control groups. (I guess this is what you mean by non-linear relationship.) Re-doing these plots with post-BMI on the y-axis might be even more interesting.
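
As a sketch of what such an interaction could look like in the notation of model (2) (the coefficient $\delta$ is introduced here for illustration and does not appear in the models above):

$$ Y_{post} = \alpha + \beta\text{Female} + \gamma Y_{pre} + \theta\text{Treatment} + \delta(\text{Female} \times \text{Treatment}) + \text{Error} $$

Under this specification, and assuming Female is coded 0/1, the treatment effect is $\theta$ for males and $\theta + \delta$ for females.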


You can read more about adjusting for pre-treatment measurements in Chapter 19, Section 3 of Regression and Other Stories [1] and in the BBR course notes, which argue strongly against modeling change from baseline and in favor of modeling post-treatment outcome [2].

[1] A. Gelman, J. Hill, and A. Vehtari. Regression and Other Stories. Cambridge University Press, 2020. See Chapter 19, Section 3 for a discussion of pre-treatment predictors.

[2] Biostatistics for Biomedical Research course notes. Available online.


Previous CV posts discuss change from baseline in lots more detail. Thank you to @EdM for the references.

Is it valid to include a baseline measure as control variable when testing the effect of an independent variable on change scores?

Best practice when analysing pre-post treatment-control designs

dipetkov
  • Thanks. The problem with using endline BMI as the outcome is that it isn't the clinically relevant figure. We want to know how much people "improve", not just where they end up. – Judderman88 Apr 23 '22 at 16:41
  • Also, how do I justify using baseline BMI if it isn't a linear predictor? Doesn't that violate the main assumption of linear models, and if so, what are the implications of that? – Judderman88 Apr 23 '22 at 16:43
  • That's why I point you to two references. Both discuss the pros and cons of change from baseline as the outcome. – dipetkov Apr 23 '22 at 16:43
  • Not sure what you mean by BMI not being a linear predictor. In your figures I mainly see a strong interaction between treatment and gender. – dipetkov Apr 23 '22 at 16:46
  • There's a very strong argument that the suggestion to model outcome is preferable to modeling the change; this page and this page provide much discussion. Frank Harrell notes: "change is heavily dependent on getting the transformation of Y correct." You can model BMI flexibly with regression splines if you think it has a nonlinear relationship with outcome; linearity is in the regression coefficients. – EdM Apr 23 '22 at 16:49
  • @Judderman88 you can easily get an estimate of the change in BMI as a function of initial BMI from a model with final BMI as the outcome. Specify a set of initial BMI values and express the predictions as changes (with confidence intervals) from those initial values. The modeling itself poses fewer risks when you model that way, versus modeling the change score. See this answer for why change scores tend to be correlated with baseline values. – EdM Apr 23 '22 at 16:58
  • Thanks @EdM. Can you please explain how to do this? "Specify a set of initial BMI values and express the predictions as changes (with confidence intervals) from those initial values." – Judderman88 Apr 23 '22 at 17:11
  • (I guess) If you have a model E(Y_post) = f(Y_pre, Gender, Treatment) you can use your model to predict Y_post for a range of Y_pre's and then plot E(Y_post) - Y_pre as function of Y_pre. Do this for all Genders and Treatment combinations. – dipetkov Apr 23 '22 at 17:19
  • @Judderman88 what dipetkov suggests in the above comment is exactly what I had in mind. You start with the predicted BMI_post values and their confidence intervals over a range of specified BMI_pre values, then just subtract the corresponding BMI_pre values while keeping the confidence interval around each difference unchanged (as subtracting a constant from a value having an associated variance doesn't change the variance). – EdM Apr 23 '22 at 17:44
  • I'm using Stata (for the first time) and have a Monday deadline. Is this all I have to do? `regress endlinebmi age male bmi0 treat`, then `predict pred_ebmi`, then `generate dbmi = pred_ebmi - bmi0` – Judderman88 Apr 23 '22 at 18:54
  • I've never used Stata. I'm not sure about @EdM. Perhaps he could answer this? PS: I receive a notification about all comments since I wrote the answer. Anyone else receives a notification only if you tag them with @ username. – dipetkov Apr 23 '22 at 18:57
  • @Judderman88 I don't use Stata either, but the manual suggests that will give you predictions for all your cases and their actual covariate values. To do what dipetkov and I suggest, you would specify a newdata argument with values at the 4 combinations of sex and treat and a range of BMI values (say, from 25 - 31 in steps of 0.1) within each of those combinations. The stdp option gives you corresponding standard errors. With this size data set, 95% confidence intervals would be $\pm$ 1.96 times the standard error. – EdM Apr 23 '22 at 20:05
  • And I just noticed that a fourth predictor, unmentioned so far, has appeared: age. So you have to choose a fixed value for age as well. The mean or mode value for each gender perhaps. – dipetkov Apr 23 '22 at 20:11
  • The excellent answer by @dipetkov could be extended a step further. First of all since BMI is a ratio of two exponential quantities it is usually better to analyze its logarithm. But even better is to analyze weight adjusted for height (and sex, ...). Regression can do direct adjustment. Predicting weight from baseline weight and height (or their logarithms) is likely to fit better and to be more interpretable. – Frank Harrell Apr 24 '22 at 12:04
  • Thanks @FrankHarrell. I agree but don't have the height info. Will mention it in my limitations/further research. – Judderman88 Apr 24 '22 at 13:37
  • If you have weight you can backsolve for height given BMI. – Frank Harrell Apr 25 '22 at 03:01
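
The suggestion in the comments (predict post-treatment BMI over a grid of baseline values, then express the predictions as changes) can be sketched as follows. This is a minimal illustration in Python with NumPy on simulated data, not the Stata workflow asked about; the data, coefficients, and the helper `predicted_change` are hypothetical, age is left out for brevity, and the intervals use the $\pm$ 1.96 standard-error approximation mentioned above.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Simulated (hypothetical) data standing in for the real study.
female = rng.integers(0, 2, size=n)
treatment = rng.integers(0, 2, size=n)
bmi_pre = rng.normal(28.0, 1.5, size=n)
bmi_post = (2.0 + 0.4 * female + 0.92 * bmi_pre - 0.8 * treatment
            + rng.normal(0.0, 0.7, size=n))

# Fit E(BMI_post) = a + b*Female + g*BMI_pre + t*Treatment by least squares.
X = np.column_stack([np.ones(n), female, bmi_pre, treatment])
beta, *_ = np.linalg.lstsq(X, bmi_post, rcond=None)
resid = bmi_post - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])   # residual variance
cov_beta = sigma2 * np.linalg.inv(X.T @ X)  # covariance of the coefficients


def predicted_change(pre_grid, female_value, treatment_value):
    """Predicted change (post minus pre) with a 95% CI for the mean."""
    pre_grid = np.asarray(pre_grid, dtype=float)
    X_new = np.column_stack([
        np.ones_like(pre_grid),
        np.full_like(pre_grid, female_value),
        pre_grid,
        np.full_like(pre_grid, treatment_value),
    ])
    pred_post = X_new @ beta
    # Standard error of each predicted mean: sqrt(x' Cov(beta) x).
    se = np.sqrt(np.einsum("ij,jk,ik->i", X_new, cov_beta, X_new))
    # Subtracting the fixed baseline value leaves the standard error unchanged.
    change = pred_post - pre_grid
    return change, change - 1.96 * se, change + 1.96 * se


grid = np.arange(25.0, 31.0 + 1e-9, 0.1)  # baseline BMI from 25 to 31
for f in (0, 1):
    for t in (0, 1):
        change, lower, upper = predicted_change(grid, f, t)
        print(f"female={f} treatment={t}: change at BMI 25 = "
              f"{change[0]:.2f} [{lower[0]:.2f}, {upper[0]:.2f}]")
```

Plotting `change`, `lower`, and `upper` against `grid` for each of the four sex-by-treatment combinations would give the predicted change as a function of baseline BMI. If the model includes further predictors such as age, they would need to be fixed at chosen values in `X_new`.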