1

This seems like it should be a really simple thing, so I apologize if this has been answered elsewhere, but I've browsed dozens of existing questions and none seem to fit exactly.

Suppose I have two cohort groups, control and target. I apply a treatment to the target, and want to use DiD to compare conversion deltas between the two groups; pre and post treatment.

I believe the following r code correctly proves that in our instance, the effect is truly non-zero (i.e. we reject the NULL hypothesis). However, because it's a binomial GLM, the confidence intervals are bounding the probability that it's non-zero, not the actual effect itself:

Converted <- c(2526,2500,2818,2268)
Offered <- c(3992,3956,4484,3996)
Cohort <- factor(c("Target","Target","Control","Control"))
Period <- factor(c(0,1,0,1))
ConvRate <- Converted / Offered

fit <- glm(ConvRate~Cohort*Period, family=binomial, weights=Offered) summary(fit)

confint(fit)

(Summarized) Output:

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)           0.52560    0.03090  17.007  < 2e-16 ***
CohortTarget          0.01850    0.04509   0.410 0.681633    
Period1              -0.25367    0.04444  -5.708 1.14e-08 ***
CohortTarget:Period1  0.25017    0.06434   3.888 0.000101 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance:  5.1064e+01  on 3  degrees of freedom

Residual deviance: -7.9270e-13 on 0 degrees of freedom AIC: 42.851

Number of Fisher Scoring iterations: 2

Waiting for profiling to be done... 2.5 % 97.5 % (Intercept) 0.4651823 0.5863361 CohortTarget -0.0698583 0.1069002 Period1 -0.3407991 -0.1665982 CohortTarget:Period1 0.1240757 0.3762874

In order to get the actual effect size, I ran this through a LM, but it's refusing to provide me with confidence intervals;

lfit <- lm(ConvRate~Cohort*Period, weights=Offered)
summary(lfit)

confint(lfit)

Output:

Call:
lm(formula = ConvRate ~ Cohort * Period, weights = Offered)

Weighted Residuals: ALL 4 residuals are 0: no residual degrees of freedom!

Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.628457 NA NA NA CohortTarget 0.004309 NA NA NA Period1 -0.060889 NA NA NA CohortTarget:Period1 0.060075 NA NA NA

Residual standard error: NaN on 0 degrees of freedom Multiple R-squared: 1, Adjusted R-squared: NaN F-statistic: NaN on 3 and 0 DF, p-value: NA

                 2.5 % 97.5 %

(Intercept) NaN NaN CohortTarget NaN NaN Period1 NaN NaN CohortTarget:Period1 NaN NaN Warning message: In qt(a, object$df.residual) : NaNs produced

A similar issue results with family=gaussian when using GLM again:

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)           0.628457         NA      NA       NA
CohortTarget          0.004309         NA      NA       NA
Period1              -0.060889         NA      NA       NA
CohortTarget:Period1  0.060075         NA      NA       NA

(Dispersion parameter for gaussian family taken to be NaN)

Null deviance: 1.2194e+01  on 3  degrees of freedom

Residual deviance: 2.0249e-28 on 0 degrees of freedom AIC: -272.54

Number of Fisher Scoring iterations: 1

Waiting for profiling to be done... Error in while ((step <- step + 1) < maxsteps && abs(z) < zmax) { : missing value where TRUE/FALSE needed Calls: confint -> confint.glm -> profile -> profile.glm In addition: Warning message: In qf(1 - alpha, 1, n - p) : NaNs produced Execution halted

I'm wondering how I can get the effect size confidence intervals? The difference in differences is ~6% but I'm looking for statistically significant intervals.

dumbledorf
  • 11
  • 1
  • Perhaps I'm misunderstanding your code, but it seems you arranged the data you feed to your regression such that you only have four data points when in reality you have several thousand data points. – num_39 Mar 07 '23 at 21:43
  • I accidentally posted this under a new account that can't comment, I'll comment from this one. Yes, I pre-aggregated the data as y is just a binary yes/no. – Darrrrrren Mar 07 '23 at 21:47
  • 1
    That's not going to work b/c you're regression won't be able to calculate the variance. You need a dataset with a row for each observation. Your case is simpler but see how it's done here in this answer: https://stats.stackexchange.com/questions/560795/is-this-the-correct-way-to-estimate-a-difference-in-differences-model-in-r-mult – num_39 Mar 07 '23 at 21:58
  • Thank you! I'll give it a try. – Darrrrrren Mar 08 '23 at 13:28

1 Answers1

1

As I noted in the comments, the problem is that you're not providing the full dataset to the regression function.

For a toy example that involves only 20 observations

period <- c(rep(0, 10), rep(1, 10))
treatment <- c(rep(1, 5), rep(0, 5), rep(1, 5), rep(0, 5))
converted <- c(1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0)

dat <- data.frame(period, treatment, conversion) head(dat)

summary(lm(converted ~ period + treatment + period * treatment))

Now, you get a standard error for the interaction because it's possible to estimate the variance.

num_39
  • 1,454