Interpreting linear regression with endogenous treatment effects

Question

Using Stata (with a weighted survey design), I ran the following model, where logwage is the log of wage. The log was taken because wage was not normally distributed. The data also include the workers' demographics, such as race/ethnicity, gender, and prior education, and whether or not they participated in voluntary training (binary variable: yes = 1, no = 0).

svy: etregress logwage i.race gender, treat(training = i.education gender)

Because the dependent variable is in log form and the treatment as well as all the independent variables are NOT in log form, I'm not sure how to interpret the reported coefficients.

--------------------------------------------------------------------------------------------------
                                 |             Linearized
                                 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------------------------------+----------------------------------------------------------------
logwage                          |
                            race |
                African American |   .3891554   .0031105    12.20   0.000     .2000000    .8474752
                  Asian American |   .1487310   .0002843    04.11   0.000     .027113     .8765290
                                 |
                          gender |
                          female |  -.0230411    .010445    -6.85   0.000    -.115341   -.0107295
                                 |
                      1.training |   .3703371   .0451778    10.61   0.000     .2018037    .4186134
---------------------------------+----------------------------------------------------------------
training                         |
                     i.education |
                      Highschool |  -.0715731   .0490565     1.28   0.098    -.1106579    .1291781
                         College |   .1271380   .0401052     3.95   0.003     .0329516    .2107563
                     Grad School |   .8522143   .0085337     8.99   0.000     .8271381    .9573284
                                 |
                          gender |
                          female |   .0127444   .0100058     5.33   0.041     .0100558    .0866312
                           _cons |  -1.260083   .0327235   -26.12   0.000    -1.531405   -1.098524
---------------------------------+----------------------------------------------------------------
                         /athrho |   .0051552    .031410     0.17   0.827    -.0722533    .0810246
                        /lnsigma |  -1.872551   .0166818   -73.50   0.000    -1.928624   -1.278064
---------------------------------+----------------------------------------------------------------
                             rho |   .0084120   .0421116                     -.0649947    .0888529
                           sigma |   .4000831   .0038170                      .1925127    .5067780
                          lambda |   .0012673   .0226365                     -.0324029     .016937
--------------------------------------------------------------------------------------------------

Like, what is the interpretation of the gender coefficient for the first and second entry?

Edit: My thinking is that the 'female' coefficient in the logwage component is interpreted the same way as %Δy=100⋅β1⋅Δx. So being female results in a -2.30% change in wage. But it is not clear what 'female' in the 'training' section means. Is it also %Δy=100⋅β1⋅Δx? Or no? And if it is a % change (i.e. a 1.27% change), then is that for the training or the wage, as in women being more likely to have the training?

This has come up before. See this, for example. You can use margins of an expression or nlcom to calculate SEs. Also, note that the rationale behind the log(y) transformation is not about the distribution of wage itself, but about the distribution of the errors conditional on x. - dimitriy, Jul 28 '20 at 19:48
@Dimitriy perhaps we are talking past one another here... but what I'm asking is what is the interpretation of the female coefficient -.0230 in the logwage component and the interpretation of the female coefficient .0127 in the training component? Also, nlcom doesn't seem to work with survey data categorical variables. As for margins, the same problem occurs. Are the margins values that need exponentiation to be interpreted? Here is the best I am able to find: https://www.stata.com/stata-news/news34-2/spotlight/ - iPlexipen, Jul 28 '20 at 20:00
As the link I shared tells you, the change is 100*(exp(-.0230411)-1) = -2.28%, which means that women earn about 2.28% less than men according to your model. The first-stage probit coefficient is harder to interpret. The fact that it is positive and significant means that women are more likely to seek out training. To translate it into something more meaningful (like a change in pr(training)), you will need to calculate the marginal effect somehow. The formula is here. - dimitriy, Jul 28 '20 at 20:26

dimitriy · Accepted Answer · 2020-07-28T22:11:36.127

Here's a replicable (but completely nonsensical) example where the outcome is the log of blood lead levels and the treatment is diabetes. We will interpret the female coefficient from both equations.

The treatment probit equation implies that, on average, the probability of having diabetes is roughly 0.8 percentage points higher for women than for men (.0078 on the [0,1] scale is about 8/10ths of a percentage point on the [0,100] scale). The outcome equation shows a 30.65% decrease in lead for females relative to males (ATE). This is called a semi-elasticity, and some care must be taken since female is a binary variable. We will use finite differences for both.
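To see why care is needed with a binary regressor, here is a minimal sketch comparing the 100*b rule of thumb with the exact percent change 100*(exp(b)-1). The two coefficients are copied from the tables in this thread; the scalar names b_example and b_question are mine, chosen only for illustration.

```stata
* Coefficient on 1.female in the loglead equation, copied from the example output
scalar b_example = -.365953
* Coefficient on female in the logwage equation, copied from the question
scalar b_question = -.0230411

* Rule of thumb 100*b versus the exact percent change 100*(exp(b) - 1)
display "Example,  100*b:            " %7.2f 100*b_example
display "Example,  100*(exp(b)-1):   " %7.2f 100*(exp(b_example) - 1)
display "Question, 100*b:            " %7.2f 100*b_question
display "Question, 100*(exp(b)-1):   " %7.2f 100*(exp(b_question) - 1)
```

For a coefficient as small as the one in the question, the two numbers are nearly identical (about -2.30 versus -2.28). For the example coefficient, the rule of thumb gives about -36.6 while the exact figure is about -30.65, so the exact formula is the safer default.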

We first calculate these estimates using margins and nlcom, which will not work with svy. Then we do it by hand using svy: mean to show that the point estimates agree.
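For reference, the two by-hand quantities can be written out as follows. This is a sketch in notation that does not appear elsewhere in this post: $z_i$ collects the treatment-equation covariates other than female (including the constant), $\hat\gamma$ and $\hat\gamma_f$ are the probit coefficients, $w_i$ is the survey weight, $\hat\beta_f$ is the female coefficient in the outcome equation, and $\Phi$ is the standard normal CDF.

```latex
% Average finite difference in Pr(treatment) when female switches from 0 to 1
\widehat{\Delta\Pr}
  = \frac{\sum_i w_i \left[ \Phi\!\left(z_i'\hat\gamma + \hat\gamma_f\right)
                          - \Phi\!\left(z_i'\hat\gamma\right) \right]}
         {\sum_i w_i}

% The outcome equation is linear in female, so the log difference is the
% same for every observation and converts to a percent change directly
\ln y_i^{(1)} - \ln y_i^{(0)} = \hat\beta_f
\quad\Longrightarrow\quad
\%\Delta y = 100\left(\exp(\hat\beta_f) - 1\right)
```

The first expression varies across observations because $\Phi$ is nonlinear, which is why it has to be averaged; the second does not, which is why diff_lny has no standard error in the svy: mean output.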

The code alone is at the very bottom; the code with output is below:

. webuse nhanes2f, clear
. svyset psuid [pweight=finalwgt], strata(stratid)
      pweight: finalwgt
          VCE: linearized
  Single unit: missing
     Strata 1: stratid
         SU 1: psuid
        FPC 1: <zero>
. svy: etregress loglead i.female i.diabetes, treat(diabetes = weight age height i.female) // coefl
(running etregress on estimation sample)
Survey: Linear regression with endogenous treatment
Estimator: maximum likelihood
Number of strata   =        31                 Number of obs     =       4,940
Number of PSUs     =        62                 Population size   =  56,316,764
                                               Design df         =          31
                                               F(   2,     30)   =      575.75
                                               Prob > F          =      0.0000

             |             Linearized
             |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
loglead      |
    1.female |   -.365953   .0106445   -34.38   0.000    -.3876626   -.3442434
  1.diabetes |   .2187191   .0579993     3.77   0.001     .1004288    .3370095
       _cons |   2.760332   .0180171   153.21   0.000     2.723586    2.797078
-------------+----------------------------------------------------------------
diabetes     |
      weight |   .0120452   .0025572     4.71   0.000     .0068297    .0172606
         age |   .0227368   .0029366     7.74   0.000     .0167476    .0287259
      height |  -.0143508   .0051924    -2.76   0.010    -.0249408   -.0037608
    1.female |   .1143353   .0862421     1.33   0.195    -.0615567    .2902273
       _cons |  -1.459728    .861842    -1.69   0.100    -3.217466    .2980107
-------------+----------------------------------------------------------------
     /athrho |  -.3346261   .0729646    -4.59   0.000    -.4834384   -.1858138
    /lnsigma |   -.973891   .0302057   -32.24   0.000    -1.035496    -.912286
-------------+----------------------------------------------------------------
         rho |  -.3226714   .0653678                      -.448993   -.1837044
       sigma |   .3776109    .011406                      .3550502    .4016051
      lambda |  -.1218442   .0253314                     -.1735079   -.0701805

. display "Percent Change ln(lead) = " 100*( exp(_b[loglead:1.female]) - 1)
Percent Change ln(lead) = -30.646461
. 
. /* (1) using commands that don't work with svy */
. margins, predict(ptrt) at(female=(0 1))
Predictive margins
Number of strata   =        31                 Number of obs     =       4,940
Number of PSUs     =        62                 Population size   =  56,316,764
Model VCE    : Linearized                      Design df         =          31
Expression   : Pr(diabetes), predict(ptrt)
1._at        : female          =           0
2._at        : female          =           1

             |            Delta-method
             |     Margin   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         _at |
          1  |   .0293652   .0037914     7.75   0.000     .0216325    .0370979
          2  |   .0371157   .0041162     9.02   0.000     .0287207    .0455106

. margins r.female, predict(ptrt)
Contrasts of predictive margins
Number of strata   =        31                 Number of obs     =       4,940
Number of PSUs     =        62                 Population size   =  56,316,764
Model VCE    : Linearized                      Design df         =          31
Expression   : Pr(diabetes), predict(ptrt)

             |         df           F        P>F
-------------+----------------------------------
      female |          1        1.72     0.1989
      Design |         31

Note: F statistics are adjusted for the survey
      design.

             |            Delta-method
             |   Contrast   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
      female |
   (1 vs 0)  |   .0077504   .0059037     -.0042902    .0197911

. nlcom pct_eff:(100*(exp(_b[loglead:1.female])-1))
     pct_eff:  (100*(exp(_b[loglead:1.female])-1))

             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     pct_eff |  -30.64646   .7382346   -41.51   0.000    -32.09337   -29.19955

. 
. /* (2) Both AMEs by hand using predict */
. replace female = 1
(4,909 real changes made)
. predict d1, ptrt
. predict lny1, xb
(2 missing values generated)
. replace female = 0
(10,337 real changes made)
. predict d0, ptrt
. predict lny0, xb
(2 missing values generated)
. gen double diff_pr = d1-d0
. gen double diff_lny = lny1 - lny0
(2 missing values generated)
. 
. svy: mean d1 d0 diff_pr diff_lny
(running mean on estimation sample)
Survey: Mean estimation
Number of strata =      31      Number of obs   =       10,335
Number of PSUs   =      62      Population size =  116,997,257
                                Design df       =           31

             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
          d1 |   .0376153   .0005965      .0363988    .0388317
          d0 |   .0297683   .0004914      .0287661    .0307705
     diff_pr |    .007847   .0001054       .007632    .0080619
    diff_lny |   -.365953          .             .           .

. display "Average ln(lead) difference as a semi-elasticity = " (100*(exp(-.365953)-1))
Average ln(lead) difference as a semi-elasticity = -30.64646

Code:

cls
webuse nhanes2f, clear
svyset psuid [pweight=finalwgt], strata(stratid)
svy: etregress loglead i.female i.diabetes, treat(diabetes = weight age height i.female) // coefl
display "Percent Change ln(lead) = " 100*( exp(_b[loglead:1.female]) - 1)
/* (1) using commands that don't work with svy */
margins, predict(ptrt) at(female=(0 1))
margins r.female, predict(ptrt)
nlcom pct_eff:(100*(exp(_b[loglead:1.female])-1))
/* (2) Both AMEs by hand using predict */
replace female = 1
predict d1, ptrt
predict lny1, xb
replace female = 0
predict d0, ptrt
predict lny0, xb
gen double diff_pr = d1-d0
gen double diff_lny = lny1 - lny0
svy: mean d1 d0 diff_pr diff_lny
display "Average ln(lead) difference as a semi-elasticity = " (100*(exp(-.365953)-1))
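One thing worth noticing in the output: the by-hand svy: mean ran on 10,335 observations, while etregress used 4,940, which may account for the small gap between .007847 and the margins contrast of .0077504. A possible variant of step (2) that averages over the estimation sample only is sketched below. The names insample, p1, p0, and dpr are mine, and I have not verified that the result matches margins to every digit.

```stata
webuse nhanes2f, clear
svyset psuid [pweight=finalwgt], strata(stratid)
svy: etregress loglead i.female i.diabetes, treat(diabetes = weight age height i.female)

* Flag the observations that etregress actually used
gen byte insample = e(sample)

* preserve/restore brings back the real values of female afterwards
preserve
replace female = 1
predict double p1, ptrt        // Pr(diabetes) with everyone set to female
replace female = 0
predict double p0, ptrt        // Pr(diabetes) with everyone set to male
gen double dpr = p1 - p0       // finite difference for each observation

* Weighted average of the finite difference over the estimation sample only
svy, subpop(insample): mean dpr
restore
```

Note that the standard error reported by svy: mean treats the predictions as fixed data, so it ignores the sampling variability of the coefficients. That is why the by-hand SE for diff_pr (.0001054) is so much smaller than the delta-method SE from margins (.0059037); only the point estimates are meant to be compared.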
