
I'm trying to analyze results from two areas at two different time points. In area A we performed an intervention, and in area B we did not. Then we asked customers whether they liked a particular part of our service, both before and after the intervention. The customers were different in the two periods, so we don't have pairs. The numbers of surveyed people who gave us the top score are:

Area A, time=0: 64 out of 130

Area A, time=1: 82 out of 118

Area B, time=0: 44 out of 100

Area B, time=1: 60 out of 100

Which in turn gives the following proportions:

A0 = 64/130 = 0.49

A1 = 82/118 = 0.69

B0 = 44/100 = 0.44

B1 = 60/100 = 0.6

According to the difference-in-differences analysis, the estimated effect is:

E = [(A1-A0) - (B1-B0)] = 0.04

But how do I perform a statistical test to see if this difference is significant?

  • @Gordon Smyth: I performed a simple diff-in-diff analysis by comparing treatment and control pre-exposure and post-exposure. The pre-exposure difference was not statistically significant, while the post-exposure difference was. May I assume the non-significant pre-exposure difference between treatment and control to be 0, and hence conclude that the significant post-exposure lift between test and control shows that the effect is real? However, when I use the method described above, the effect comes out not statistically significant. Am I missing something? My data and code inline: Trust_Score <- c(209,1024,2 – user3386377 Nov 24 '18 at 02:50

3 Answers


The difference in differences is what is called an interaction in statistics (as Dimitriy Masterov has already pointed out). You want to test whether the time effect is different when you intervene compared with when you don't.

Your data is most naturally modelled as binomial, i.e., the number of top scores out of total people surveyed in each area at each time follows a binomial distribution, assuming that all the customers respond independently. The standard statistical method for testing for interaction with binomial data is to run a binomial logistic regression.

In the R language, the code for this is as follows. First input the data:

> NTopScore <- c(64,82,44,60)
> N <- c(130,118,100,100)
> Area <- factor(c("A","A","B","B"))
> Time <- factor(c(0,1,0,1))
> Proportion <- NTopScore / N

Then fit the logistic regression. In R this is done by running a generalized linear model, and telling R that the data should be treated as binomial:

> fit <- glm(Proportion~Area*Time, family=binomial, weights=N)
> summary(fit)

Call: glm(formula = Proportion ~ Area * Time, family = binomial, weights = N)

Deviance Residuals: [1] 0 0 0 0

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept) -0.03077    0.17543  -0.175  0.86076   
AreaB       -0.21039    0.26713  -0.788  0.43093   
Time1        0.85397    0.26599   3.211  0.00132 **
AreaB:Time1 -0.20734    0.39115  -0.530  0.59606   


Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.7868e+01  on 3  degrees of freedom
Residual deviance: 7.1054e-15  on 0  degrees of freedom
AIC: 28.454

Number of Fisher Scoring iterations: 3

We see that the p-value for the interaction (difference in differences) is $P=0.596$, clearly not significant.

The model is fitted on a log-odds (logit) scale. The AreaB parameter shows that Area B gives a lower proportion than Area A at Time 0. The Time1 parameter shows that Time 1 gives a higher proportion than Time 0 in Area A. The AreaB:Time1 parameter is the difference in differences.
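Because the model is saturated (four parameters for four cells), the interaction and its Wald test can also be worked out in closed form: each cell's empirical logit has variance $1/y + 1/(n-y)$, and the four cells are independent. As a cross-check, here is a minimal Python sketch of that calculation using the question's counts (the sign convention is the Area A time effect minus the Area B time effect, i.e. the negative of the AreaB:Time1 coefficient):

```python
import math

# top-score counts and totals from the question: (successes, total surveyed)
cells = {"A0": (64, 130), "A1": (82, 118), "B0": (44, 100), "B1": (60, 100)}

def logit(y, n):
    """Empirical log-odds of a cell with y successes out of n."""
    return math.log(y / (n - y))

# difference in differences on the log-odds scale:
# (time effect in area A) minus (time effect in area B)
est = (logit(*cells["A1"]) - logit(*cells["A0"])) - (
    logit(*cells["B1"]) - logit(*cells["B0"]))

# Wald standard error: each cell's empirical logit has variance 1/y + 1/(n-y),
# and the four cells are independent, so the variances add
se = math.sqrt(sum(1 / y + 1 / (n - y) for y, n in cells.values()))

z = est / se
p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value

print(f"estimate {est:.5f}, SE {se:.5f}, z {z:.3f}, p {p:.3f}")
```

This reproduces the glm interaction estimate and p-value without fitting anything, precisely because the saturated model fits each cell exactly.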

Another way to fit the logistic regression is to estimate the before-after time effect separately for areas A and B. This shows that the time effect is positive in both areas, regardless of whether you had an intervention or not:

> fit <- glm(Proportion~Area+Area:Time, family=binomial, weights=N)
> summary(fit)

Call: glm(formula = Proportion ~ Area + Area:Time, family = binomial, weights = N)

Deviance Residuals: [1] 0 0 0 0

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept) -0.03077    0.17543  -0.175  0.86076   
AreaB       -0.21039    0.26713  -0.788  0.43093   
AreaA:Time1  0.85397    0.26599   3.211  0.00132 **
AreaB:Time1  0.64663    0.28679   2.255  0.02415 * 


Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.7868e+01  on 3  degrees of freedom
Residual deviance: -5.3291e-15  on 0  degrees of freedom
AIC: 28.454

Number of Fisher Scoring iterations: 3

The time effect in Area A is 0.85397 and that in Area B is 0.64663. The difference in the time effects is $0.64663 - 0.85397 = -0.20734$, which is equal to the interaction estimate we saw in the first regression.

Gordon Smyth

The easiest solution is to use the linear regression formulation of DID:

  1. Regress the binary customer ratings on a constant, a post dummy, an area A dummy, and the interaction of the last two. It may be appropriate to add other regressors measuring characteristics that are time-invariant at the individual level but whose distribution changes over time at the group level. This may help with the significance issue below if it soaks up some residual variance.
  2. The DID is the coefficient on the interaction of post and group A. You can conduct the hypothesis test that it is zero or just look at the p-value or t-stat.
  3. Since treatment does not vary within area, the usual standard errors will be off (usually too small, but sometimes too large, when the within cluster error correlation is negative). The typical solution is to cluster the standard errors by area or cross-section, but with only 2-4 clusters, that will not work well since that is not enough clusters for the asymptotics to kick in. I don't really have a great solution for you here. Stata does compute something below, but it is not likely to be reliable (even with the small number of clusters adjustment). I do suspect that the correct adjustment will not yield significance since the conventional standard error on the DID coefficient is so large. People will often use simulation in cases like this to gauge how far off the SEs are.
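In a fully saturated linear model like the one in step 1, the coefficient on the interaction of post and group A is exactly the cross difference of the four cell means, so the point estimate can be verified by hand. A minimal Python sketch using the question's counts (it matches the `.0426076` interaction coefficient in the Stata output below only because the model is saturated):

```python
# cell proportions of top scores from the question
p_a0, p_a1 = 64 / 130, 82 / 118   # area A, before and after
p_b0, p_b1 = 44 / 100, 60 / 100   # area B, before and after

# In a saturated OLS regression of the rating on area, time and their
# interaction, the interaction coefficient is exactly this cross difference
did = (p_a1 - p_a0) - (p_b1 - p_b0)

print(f"DID estimate: {did:.7f}")
```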

Here's this analysis for your data:

. clear

. input area_a time noobs rating

        area_a       time      noobs     rating
  1. 1 0 64 1 
  2. 1 0 66 0
  3. 1 1 82 1 
  4. 1 1 36 0
  5. 0 0 44 1
  6. 0 0 56 0
  7. 0 1 60 1
  8. 0 1 40 0
  9. end

. egen cs = group(area_a time)

. reg rating i.area_a##i.time [fw=noobs]

      Source |       SS           df       MS      Number of obs   =       448
-------------+----------------------------------   F(3, 444)       =      6.05
       Model |  4.34181458         3  1.44727153   Prob > F        =    0.0005
    Residual |  106.149257       444  .239074903   R-squared       =    0.0393
-------------+----------------------------------   Adj R-squared   =    0.0328
       Total |  110.491071       447  .247183605   Root MSE        =    .48895

------------------------------------------------------------------------------
      rating |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    1.area_a |   .0523077   .0650368     0.80   0.422    -.0755105    .1801259
      1.time |        .16   .0691484     2.31   0.021     .0241012    .2958988
             |
 area_a#time |
        1 1  |   .0426076   .0929871     0.46   0.647    -.1401419     .225357
             |
       _cons |        .44   .0488953     9.00   0.000     .3439051    .5360949
------------------------------------------------------------------------------

. reg rating i.area_a##i.time [fw=noobs], vce(cluster area) // or vce(cluster cs)

Linear regression                               Number of obs     =        448
                                                F(0, 1)           =          .
                                                Prob > F          =          .
                                                R-squared         =     0.0393
                                                Root MSE          =     .48895

                                 (Std. Err. adjusted for 2 clusters in area_a)
------------------------------------------------------------------------------
             |               Robust
      rating |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    1.area_a |   .0523077   1.08e-16  4.8e+14   0.000     .0523077    .0523077
      1.time |        .16   1.01e-16  1.6e+15   0.000          .16         .16
             |
 area_a#time |
        1 1  |   .0426076   1.30e-16  3.3e+14   0.000     .0426076    .0426076
             |
       _cons |        .44   1.01e-16  4.4e+15   0.000          .44         .44
------------------------------------------------------------------------------

The interpretation of the interaction coefficient in both specifications is a 4.3 percentage point increase in liking your service post-intervention.

The linear model has the advantage that it estimates an additive effect on a probability, which is easier to interpret than a multiplicative effect on the log odds. Moreover, in a non-linear logit model, clustering is problematic since the coefficients are identified only up to scale, the interpretation of interactions and the identifying assumptions can get tricky, and the DID is no longer just the cross difference of the four means (see the Puhani paper cited below on the last point). Finally, in a fully saturated model with all the interactions, the logit and the linear model give identical point estimates of the cross-difference marginal effect (though that is not the parameter you care about):

. /* Cross difference to mimic OLS, but wrong */
. logit rating i.area_a##i.time [fw=noobs], nolog

Logistic regression                             Number of obs     =        448
                                                LR chi2(3)        =      17.87
                                                Prob > chi2       =     0.0005
Log likelihood = -298.57102                     Pseudo R2         =     0.0291

------------------------------------------------------------------------------
      rating |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    1.area_a |   .2103904   .2671347     0.79   0.431    -.3131839    .7339647
      1.time |   .6466272   .2867945     2.25   0.024     .0845203    1.208734
             |
 area_a#time |
        1 1  |   .2073448   .3911528     0.53   0.596    -.5593006    .9739902
             |
       _cons |  -.2411621   .2014557    -1.20   0.231    -.6360081    .1536839
------------------------------------------------------------------------------

. margins r.area_a#r.time 

Contrasts of adjusted predictions
Model VCE    : OIM

Expression   : Pr(rating), predict()

------------------------------------------------
             |         df        chi2     P>chi2
-------------+----------------------------------
 area_a#time |          1        0.21     0.6456
------------------------------------------------

--------------------------------------------------------------------
                   |            Delta-method
                   |   Contrast   Std. Err.     [95% Conf. Interval]
-------------------+------------------------------------------------
       area_a#time |
(1 vs 0) (1 vs 0)  |   .0426076   .0926461     -.1389755    .2241906
--------------------------------------------------------------------
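That equality with OLS can be checked by hand: a saturated logit reproduces the four cell proportions exactly, so the cross difference of its predicted probabilities is the same number as the OLS interaction coefficient and the `.0426076` contrast above. A minimal Python sketch (using the question's counts):

```python
import math

def logit(y, n):
    """Empirical log-odds of a cell with y successes out of n."""
    return math.log(y / (n - y))

def inv_logit(x):
    """Inverse of the logit: map log-odds back to a probability."""
    return 1 / (1 + math.exp(-x))

# top-score counts per cell from the question
cells = {"A0": (64, 130), "A1": (82, 118), "B0": (44, 100), "B1": (60, 100)}

# A saturated logit fits each cell's log-odds exactly, so its predicted
# probabilities are just the observed proportions back-transformed
pred = {k: inv_logit(logit(y, n)) for k, (y, n) in cells.items()}

# cross difference of predicted probabilities = OLS interaction coefficient
cross_diff = (pred["A1"] - pred["A0"]) - (pred["B1"] - pred["B0"])

print(f"cross difference: {cross_diff:.7f}")
```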

I believe the correct marginal effect of 4.6 percentage points is given by this:

/* Puhani's DID Estimator */
gen at = area_a*time
logit rating i.area_a i.time i.at [fw=noobs], nolog
margins, at(area_a==1 time==1 at==1) at(area_a==1 time==1 at==0) contrast(atcontrast(a._at) wald)
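Because the logit above is still saturated, the quantity the margins call targets can be sketched in closed form: take the observed probability in the treated cell and subtract the counterfactual probability obtained by removing only the interaction term from that cell's linear predictor. A Python sketch of that contrast (my reading of Puhani's estimator, using the question's counts; it lands near the 4.6 percentage points quoted above):

```python
import math

def logit(y, n):
    """Empirical log-odds of a cell with y successes out of n."""
    return math.log(y / (n - y))

def inv_logit(x):
    """Inverse of the logit: map log-odds back to a probability."""
    return 1 / (1 + math.exp(-x))

# top-score counts per cell from the question
a0, a1 = (64, 130), (82, 118)   # area A (treated), before and after
b0, b1 = (44, 100), (60, 100)   # area B (control), before and after

# interaction term of the saturated logit on the log-odds scale
interaction = (logit(*a1) - logit(*a0)) - (logit(*b1) - logit(*b0))

# treated-cell probability minus the counterfactual obtained by switching
# off only the interaction term on the log-odds scale
p_treated = inv_logit(logit(*a1))                 # observed 82/118
p_counter = inv_logit(logit(*a1) - interaction)   # interaction removed
effect = p_treated - p_counter

print(f"treatment effect: {effect:.4f}")
```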

"The Treatment Effect, the Cross Difference, and the Interaction Term in Nonlinear “Difference-in-Differences” Models by Patrick A. Puhani, Economics Letters, 2012, 115 (1), 85-87.

dimitriy

Thanks for the question and response. I have the exact same issue, except that instead of 4 groups I have 8, as below:

N_for_KPI <- c(683,538,2225,1458,294,307,922,781)
N <- c(1951,1564,5683,4507,819,862,2479,2511)
Wave <- factor(c("A","A","B","B","C","C"))
Brand <- factor(c(0,1,0,1,0,1))
data = data.frame(N_for_KPI,N)
Proportion <-N_for_KPI / N
Proportion

fit <- glm(Proportion~Wave*Brand, family=binomial, weights=N)
summary(fit)

Although I got the significance results, I also got an error:

> N_for_KPI <- c(683,538,2225,1458,294,307,922,781)
> N <- c(1951,1564,5683,4507,819,862,2479,2511)
> Wave <- factor(c("A","A","B","B","C","C"))
> Brand <- factor(c(0,1,0,1,0,1))
> Proportion <-N_for_KPI / N
> Proportion
[1] 0.3500769 0.3439898 0.3915186 0.3234968 0.3589744 0.3561485 
0.3719242 0.3110315
> fit <- glm(Proportion~Wave*Brand, family=binomial, weights=N)
Error in model.frame.default(formula = Proportion ~ Wave * Brand, 
weights = N,  : 
variable lengths differ (found for 'Wave')
> summary(fit)
Call:
glm(formula = Proportion ~ Wave * Brand, family = binomial, weights 
= N)

Deviance Residuals: [1] 0 0 0 0

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -2.9422     0.1047 -28.096   <2e-16 ***
WaveB          0.0394     0.1203   0.328    0.743    
Brand1        -0.1574     0.1507  -1.045    0.296    
WaveB:Brand1  -0.4487     0.1786  -2.512    0.012 *  


Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance:  4.5383e+01  on 3  degrees of freedom
Residual deviance: -4.9938e-13  on 0  degrees of freedom
AIC: 35.137

Number of Fisher Scoring iterations: 3

In addition to the error, when I changed to new sample sizes for N_for_KPI and N, the significance results didn't change.

Can you please advise how to fit 8 groups in this model? Thank you so much in advance!

Trang