
In a (panel) regression with income as the dependent variable, I would like to estimate the effect of a treatment on the relative change in income. I found two mathematically equivalent ways to do this. Either

  • by calculating relative change = (post-treatment income - pre-treatment income) / pre-treatment income and then regressing it on treatment,
  • or by taking the natural logarithm of income, i.e. lninc = ln(income), then regressing it on treatment and, finally, calculating exp(Beta_T)-1

However, the results are not the same! Here is a stylized example replicating the problem:

    clear
    set seed 111
    set obs 10000
    gen id = _n
    expand 2 // two observations per individual
    bysort id: gen t = _n // time
    bysort id (t): gen T = (_n==2) // treatment
    gen inc = rnormal(10+50000*T,1) // dependent variable
    assert inc > 0 // all values > 0

    bysort id: gen relinc = ((inc[_n] - inc[_n-1])/inc[_n-1]) // relative change
    replace relinc = 0 if t==1

    gen lninc = ln(inc)

    bysort id: gen lndiff = exp(lninc[2] - lninc[1])-1
    sum lndiff relinc if relinc != 0 // the relative changes using these two approaches are in fact the same

    xtset id t
    qui xtreg relinc T, fe
    margins, dydx(T) // 5061

    qui xtreg lninc T, fe
    margins, expression(exp(_b[T])-1) // 5035

On real data, the differences can be quite large and sometimes even the sign differs.

How come Stata comes to different conclusions here?

Ben
  • Since you are using different numbers to represent your data, you ought to be deeply surprised if Stata did give the same results: that would be grounds to suspect an error. What kinds of differences, then, do you want to draw our attention to? – whuber Aug 13 '20 at 19:34
  • Could you be more specific? Different numbers of observations? Both regressions run on the same sample. Different numbers as in two different ways to calculate the relative change? As I show in the code snippet, the approaches are mathematically equivalent. – Ben Aug 13 '20 at 20:11
  • On the contrary, they are not mathematically equivalent: the relative change is not the same as the logarithm, even though they will approximately agree for small changes. And exponentiating the parameter estimate simply is incorrect. – whuber Aug 13 '20 at 20:14

1 Answer


The log difference is an approximation that works for small changes and quickly degrades, as @whuber already pointed out in the comments. Your simulated change is enormous (income jumps from about 10 to about 50,010), so the discrepancy is no surprise. With a smaller change, things look much better, as I show below.
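One way to see where the remaining gap comes from is sketched here for a balanced two-period panel like the simulation. The notation $d_i = \ln(\text{inc}_{i2}) - \ln(\text{inc}_{i1})$ is introduced only for this illustration. With `relinc` set to 0 in the first period, the fixed-effects regression of `relinc` on `T` returns the average individual relative change, while the regression of `lninc` on `T` returns the average log difference:

$$
\hat\beta_{\text{rel}} = \frac{1}{N}\sum_{i=1}^{N}\left(e^{d_i}-1\right),
\qquad
\hat\beta_{\ln} = \frac{1}{N}\sum_{i=1}^{N} d_i .
$$

Because the exponential function is convex, Jensen's inequality gives

$$
\frac{1}{N}\sum_{i=1}^{N} e^{d_i} \;\ge\; \exp\!\left(\frac{1}{N}\sum_{i=1}^{N} d_i\right)
\quad\Longrightarrow\quad
\hat\beta_{\text{rel}} \;\ge\; e^{\hat\beta_{\ln}} - 1 ,
$$

with equality only when every $d_i$ is the same. The two measures agree observation by observation but not after averaging, and the gap grows with the spread of the $d_i$. This is also how the sign can differ on real data: the average log difference can be negative while the average relative change is positive.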

He is also correct on the exponentiation. You can read this post by David Giles for details while I blushingly edit some old answers. I have implemented a less biased solution using nlcom. It assumes that once you log the outcome, the errors become normal.
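As a rough motivation for the adjustment in the `nlcom` line, suppose the estimate $b$ (Stata's `_b[T]`) is approximately normal around the true $\beta$ with standard error $se$ (Stata's `_se[T]`). This is the normality assumption mentioned above; the symbols are introduced only for this sketch. Then $e^{b}$ is log-normal and overshoots on average:

$$
E\!\left[e^{b}\right] = e^{\beta + \frac{1}{2}\,se^{2}} \;>\; e^{\beta},
$$

which suggests subtracting half the estimated variance before exponentiating:

$$
\hat g = \exp\!\left(b - \tfrac{1}{2}\,\widehat{se}^{\,2}\right) - 1 .
$$

With a standard error near 0.0014, as in this simulation, the term $\tfrac{1}{2}\widehat{se}^{\,2}$ is on the order of $10^{-6}$, so the adjustment moves the estimate only around the sixth decimal place. It addresses sampling noise in $b$, not the difference between averaging relative changes and averaging log differences.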

I also tweaked your code in a couple places to use time-series operators, since this is so much better than using relative position.
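To illustrate the point about time-series operators, here is a minimal sketch on made-up toy data (three individuals, two periods; the variable names `rel_sub` and `rel_ts` are hypothetical and not part of the original code). It compares explicit subscripts with `D.` and `L.`:

```stata
* Toy data: 3 individuals, 2 periods each (values are made up)
clear
set obs 3
gen id = _n
expand 2
bysort id: gen t = _n
gen inc = 10*id + t
xtset id t

* Explicit subscripts: [_n-1] is whatever row happens to sit above,
* so this is only right when the data are sorted by id and t
bysort id (t): gen rel_sub = (inc - inc[_n-1]) / inc[_n-1]

* Time-series operators: the lag is defined by the value of t,
* so a gap in t yields a missing value rather than the wrong neighbor
gen rel_ts = D.inc / L.inc

* Both are missing in the first period and identical in the second
list id t inc rel_sub rel_ts, sepby(id)
```

On complete, correctly sorted data the two versions coincide. The operator version is the safer habit once panels have gaps or get re-sorted.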

    . clear

    . set seed 111

    . set obs 10000
    number of observations (_N) was 0, now 10,000

    . gen id = _n

    . expand 2 // two observations per individual
    (10,000 observations created)

    . bysort id: gen t = _n // time

    . bysort id (t): gen T = (_n==2) // treatment

    . gen inc = rnormal(10+.5*T,1) // dependent variable

    . assert inc > 0 // all values > 0

    . xtset id T
           panel variable:  id (strongly balanced)
            time variable:  T, 0 to 1
                    delta:  1 unit

    . gen relinc = D.inc/L.inc // relative change
    (10,000 missing values generated)

    . replace relinc = 0 if t==1
    (10,000 real changes made)

    . gen lninc = ln(inc)

    . bysort id: gen lndiff = exp(D.lninc)-1
    (10,000 missing values generated)

    . sum lndiff relinc if relinc != 0 // the relative changes using these two approaches are in fact the same

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
          lndiff |     10,000    .0631367    .1494297  -.4465929   .9864048
          relinc |     10,000    .0631367    .1494297  -.4465929    .986405

    . qui xtreg relinc T, fe

    . margins, dydx(T) // 5061

    Average marginal effects                        Number of obs     =     20,000
    Model VCE    : Conventional

    Expression   : Linear prediction, predict()
    dy/dx w.r.t. : T

    ------------------------------------------------------------------------------
                 |            Delta-method
                 |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
               T |   .0631367   .0014943    42.25   0.000     .0602079    .0660655
    ------------------------------------------------------------------------------

    . xtreg lninc T, fe

    Fixed-effects (within) regression               Number of obs     =     20,000
    Group variable: id                              Number of groups  =     10,000

    R-sq:                                           Obs per group:
         within  = 0.1196                                         min =          2
         between =      .                                         avg =        2.0
         overall = 0.0634                                         max =          2

                                                    F(1,9999)         =    1357.76
    corr(u_i, Xb)  = 0.0000                         Prob > F          =     0.0000

    ------------------------------------------------------------------------------
           lninc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
               T |   .0514681   .0013968    36.85   0.000     .0487301    .0542061
           _cons |   2.295573   .0009877  2324.23   0.000     2.293637    2.297509
    -------------+----------------------------------------------------------------
         sigma_u |  .07009358
         sigma_e |  .09876703
             rho |  .33495349   (fraction of variance due to u_i)
    ------------------------------------------------------------------------------
    F test that all u_i=0: F(9999, 9999) = 1.01                  Prob > F = 0.3579

    . nlcom (e_assuming_normal_errors:exp(_b[T] - 0.5*_se[T]^2)-1)

    e_assuming~s:  exp(_b[T] - 0.5*_se[T]^2)-1

    ------------------------------------------------------------------------------------------
                       lninc |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------------------+----------------------------------------------------------------
    e_assuming_normal_errors |   .0528146   .0014705    35.91   0.000     .0499323    .0556968
    ------------------------------------------------------------------------------------------

    . xtreg inc T, fe

    Fixed-effects (within) regression               Number of obs     =     20,000
    Group variable: id                              Number of groups  =     10,000

    R-sq:                                           Obs per group:
         within  = 0.1209                                         min =          2
         between =      .                                         avg =        2.0
         overall = 0.0641                                         max =          2

                                                    F(1,9999)         =    1375.61
    corr(u_i, Xb)  = 0.0000                         Prob > F          =     0.0000

    ------------------------------------------------------------------------------
             inc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
               T |   .5231742   .0141059    37.09   0.000     .4955239    .5508245
           _cons |   9.980207   .0099743  1000.59   0.000     9.960655    9.999759
    -------------+----------------------------------------------------------------
         sigma_u |  .70835751
         sigma_e |  .99743422
             rho |  .33526336   (fraction of variance due to u_i)
    ------------------------------------------------------------------------------
    F test that all u_i=0: F(9999, 9999) = 1.01                  Prob > F = 0.3323

    . margins, eydx(T)

    Average marginal effects                        Number of obs     =     20,000
    Model VCE    : Conventional

    Expression   : Linear prediction, predict()
    ey/dx w.r.t. : T

    ------------------------------------------------------------------------------
                 |            Delta-method
                 |      ey/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
               T |   .0511156   .0013804    37.03   0.000       .04841    .0538212
    ------------------------------------------------------------------------------

I also added a third way to calculate an elasticity.
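For what it is worth, the `eydx` figure can be roughly reproduced by hand. This is a back-of-the-envelope sketch, assuming `margins` averages $\partial \ln \hat y / \partial T = b_T / \hat y$ over the estimation sample, with the linear prediction $\hat y = b_0 + b_T\,T$ and `T` treated as continuous. Half the observations have $T=0$ and half have $T=1$, so with the coefficients from the `xtreg inc T, fe` output:

$$
\frac{1}{2}\left(\frac{0.5231742}{9.980207} + \frac{0.5231742}{9.980207 + 0.5231742}\right)
\approx \frac{1}{2}\left(0.05242 + 0.04981\right) \approx 0.0511 ,
$$

which lines up with the reported `ey/dx`. Strictly speaking this is a semi-elasticity: the proportional change in the prediction per unit change in `T`.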

Finally, you may want to review some questions on re-transformation bias. This is something that comes up sooner or later with logged outcomes. I don't want you to have to learn this stuff on the street the hard way.
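The core of the re-transformation issue can be seen in a few lines. The sketch below uses simulated data, and the variable and scalar names (`lny`, `y`, `naive`, `arith`) are hypothetical. It exponentiates the mean of a logged outcome and compares that with the mean on the original scale:

```stata
* Simulated log-normal outcome (all parameter values are made up)
clear
set seed 111
set obs 10000
gen lny = rnormal(2, 0.5)
gen y = exp(lny)

* Exponentiating the mean of the logs recovers the geometric mean of y
quietly summarize lny
scalar naive = exp(r(mean))

* The arithmetic mean on the original scale is larger
quietly summarize y
scalar arith = r(mean)

display "exp(mean of logs) = " naive "   mean of y = " arith
```

The geometric mean never exceeds the arithmetic mean, so predictions built by simply exponentiating a log-scale fit understate the mean of the outcome unless a correction is applied.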

dimitriy