
I have data on investment preferences from one year before Covid and from during the Covid lockdown.

Some changes appear when using a simple t-test. I want to assess whether these changes are particularly strong for specific demographics (e.g., older individuals ($X_1$), individuals with lower income ($X_2$), etc.).

Should I use the initial level of my dependent variable in the regressions? Basically, if I want to use OLS regressions to investigate which independent variables correlate with the change in my dependent variable, which model is preferable?

Model 1 (apparently called the Change Score Method): $(Y_2 - Y_1) = \beta_1 X_1 + \beta_2 X_2$

Model 2 (apparently called the Regressor Variable Method): $Y_2 = \beta_1 X_1 + \beta_2 X_2 + \beta_3 Y_1$

Thank you so much for your help - Any reference would also be much appreciated!

L. M.

1 Answer


Both methods have been used. See here for example. Which one to choose depends on the question you want to answer. If you want to talk mostly about "change", you can use

(Y2-Y1) ~ X1 + X2            # (1)

Baseline (Y1) should not be added to the above equation, because it will generally be correlated with the difference (Y2-Y1); see the comments below by @EdM and here.

On the other hand, if you want to discuss factors affecting "final value", you can use

Y2 ~ X1 + X2 + Y1            # (2)
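To make the relationship between (1) and (2) concrete, here is a minimal sketch in plain Python (standard library only). The data are simulated, the small `ols` helper is hypothetical and written just for this illustration, and an intercept is included, as an R-style formula would do by default. It fits (1), fits (2), and then fits (1) with Y1 added as a predictor. That last fit returns the same coefficients on X1 and X2 as (2), with the Y1 coefficient shifted down by exactly 1, so adding baseline to (1) does not produce a separate model.

```python
import random

def ols(X, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        A[c] = [v / A[c][c] for v in A[c]]
        for r in range(k):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    return [A[i][k] for i in range(k)]

# Simulated (made-up) data: standardized covariates, baseline y1, follow-up y2.
random.seed(0)
n = 200
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
y1 = [random.gauss(10, 2) for _ in range(n)]
y2 = [4 + 0.6 * a + 0.5 * b - 0.3 * c + random.gauss(0, 1)
      for a, b, c in zip(y1, x1, x2)]
diff = [b - a for a, b in zip(y1, y2)]

X_change = [[1.0, a, b] for a, b in zip(x1, x2)]             # intercept, X1, X2
X_ancova = [[1.0, a, b, c] for a, b, c in zip(x1, x2, y1)]   # intercept, X1, X2, Y1

m1 = ols(X_change, diff)       # (1): (Y2-Y1) ~ X1 + X2
m2 = ols(X_ancova, y2)         # (2): Y2 ~ X1 + X2 + Y1
m1_plus = ols(X_ancova, diff)  # (1) with Y1 added as a predictor

print("model 1        :", [round(b, 3) for b in m1])
print("model 2        :", [round(b, 3) for b in m2])
print("model 1 plus Y1:", [round(b, 3) for b in m1_plus])
```

In practice one would fit these with a regression library rather than a hand-written solver; the solver is only there to keep the sketch dependency-free.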

However, since repeated measurements (Y1 and Y2, at two time points) were taken on the same subject, a mixed model is also often used (including interactions, as commented by @dbwilson below):

Y ~ X1 + X2 + time + X1*time + X2*time + (1|subject)

The following shorter formula is effectively the same as the one above, because in R formula syntax X1*time already expands to X1 + time + X1:time:

Y ~ X1*time + X2*time + (1|subject)            # (3)

There is another method commonly used, especially in biomedical literature: "Percent change", i.e.

(100*(Y2-Y1)/Y1) ~ X1 + X2            # (4)

It is not correct to keep Y1 as a predictor variable in this last method, as there will be a strong correlation between baseline and percent change.
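The correlation between baseline and change can be seen in a small simulation. The sketch below is plain Python with made-up data, and it assumes that Y1 and Y2 are drawn independently from the same distribution, so there is no real effect at all. Even so, Y1 is negatively correlated with both the difference used in (1) and the percent change used in (4). For the difference, the theoretical value under these assumptions is -1/sqrt(2), about -0.71. The `pearson` helper is hypothetical and written only for this illustration.

```python
import random

def pearson(a, b):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5

random.seed(1)
n = 5000
# Assumption: baseline and follow-up are independent draws (no true change).
y1 = [random.gauss(10, 2) for _ in range(n)]
y2 = [random.gauss(10, 2) for _ in range(n)]

diff = [b - a for a, b in zip(y1, y2)]            # outcome of (1)
pct = [100 * (b - a) / a for a, b in zip(y1, y2)]  # outcome of (4)

r_diff = pearson(y1, diff)
r_pct = pearson(y1, pct)
print(f"corr(Y1, Y2 - Y1)        = {r_diff:.2f}")
print(f"corr(Y1, percent change) = {r_pct:.2f}")
```

Note that percent change is undefined when Y1 is 0 and unstable when Y1 is close to 0; the simulated baseline here stays well away from zero.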

I think this last method (percent change) is the most understandable.

See here for more information on this topic.

Edit: For equation 3, the data should be in long form, with columns subject, x1, x2, time and y. Hence, y1 and y2 will be in two different rows (with the same subject, x1 and x2 but different y and time values). For the other equations, the data should be in wide form, with columns subject, x1, x2, y1, y2 (one row per subject; the subject column is ignored here).
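As an illustration of the two layouts, here is a minimal sketch in plain Python that turns one-row-per-subject records into the two-rows-per-subject layout needed for equation 3. The values are invented, and coding time as 0 (before) and 1 (during) is an assumption made for this example.

```python
# Hypothetical data, one row per subject (layout for equations 1, 2 and 4).
wide = [
    {"subject": 1, "x1": 63, "x2": 2.1, "y1": 10.0, "y2": 12.5},
    {"subject": 2, "x1": 35, "x2": 4.0, "y1": 8.0, "y2": 7.5},
]

# Two rows per subject (layout for equation 3): same subject, x1 and x2,
# but a different time and y value in each row.
long = []
for row in wide:
    for time, col in ((0, "y1"), (1, "y2")):
        long.append({
            "subject": row["subject"],
            "x1": row["x1"],
            "x2": row["x2"],
            "time": time,
            "y": row[col],
        })

for r in long:
    print(r)
```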

rnso