
I am trying to estimate a very simple model in the form:

$Y_{it} = a'D_{event} + b'D_{i} + c'D_{t} + e_{it}$

With the outcome $Y$, the individual fixed effects $D_{i}$, calendar year fixed effects $D_{t}$, and a set of event leads $D_{event}$ ($D$ denotes a vector of dummy variables). My data looks like this (in R):

dftest <- data.frame(id = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(4, 4)),
                     y = sample.int(n = 50, size = 16, replace = T),
                     event = c(rep(1:4, 4)),
                     year = c(rep(2010:2013, 2), rep(2011:2014, 2)))
dftest

   id  y event year
1   1 31     1 2010
2   1 15     2 2011
3   1 14     3 2012
4   1  3     4 2013
5   2 42     1 2010
6   2 50     2 2011
7   2 43     3 2012
8   2 37     4 2013
9   3 14     1 2011
10  3 25     2 2012
11  3 26     3 2013
12  3 27     4 2014
13  4  5     1 2011
14  4 27     2 2012
15  4 28     3 2013
16  4  9     4 2014

I have a certain number of cohorts that experience the start of the event at different, successive times. For example, for the 2010 cohort the event starts in 2010, and I have data on the four post-event years.

table(dftest$event, dftest$year)
    2010 2011 2012 2013 2014
  1    2    2    0    0    0
  2    0    2    2    0    0
  3    0    0    2    2    0
  4    0    0    0    2    2

To recover the effect of the event $X$ years after it occurs on the outcome $Y$, I estimate the model specified above with either felm or plm:

summary(plm(y ~ factor(event), dftest, effect = "twoways", model = "within", index = c("id", "year")))

summary(felm(y ~ factor(event) | id + year | 0 | id, dftest))

However, plm and felm both drop one event time dummy (in addition to the reference dummy), and felm additionally issues a rank-deficiency warning. So the model is apparently collinear? It is very hard for me to see how this can be.

  • Given the staggered nature of the treatment, you have to define these event time dummies in a different way. Also, do you really have one unit that starts treatment in 2010 and stays treated for the whole panel? In this setting, your model won't be able to distinguish the event time indicators from the time fixed effects. – Thomas Bilach Aug 29 '22 at 16:11
  • To add some context: I have two groups that got randomised into different education schemes - I want to compare their earnings for each year post-graduation, running this sort of model for each subgroup separately. So yes, since having received either form of teaching is time-invariant, they stay treated. However, since I have control and treatment groups finishing school in several consecutive years (class of 2010, 2011, ...), I assumed I could filter out calendar-year effects with the time dummies. How should I define the event time dummies differently? – hierb Aug 30 '22 at 17:34
  • I find your particular study rather curious. Usually, evaluators focus on the periods around some event of interest. In your setting, however, you're only assessing post-treatment effects. These models are only estimable because the start times vary. If they didn't, then each event time indicator would be collinear with the fixed effects. I still think my answer should help you understand why some of the post-treatment effects were dropped. – Thomas Bilach Aug 31 '22 at 01:45

1 Answer


I am going to assume all units eventually become treated.

Upon closer inspection of your R code, I don't see how factor(event) is going to give you the event time indicators you're after. To do this, you need a column that delineates the relative periods around the onset of treatment. First, create a variable denoting each unit's first exposure year, call it first_treat. Then define a second variable, centered, by subtracting first_treat from your year variable (a short construction sketch follows the data frame). Take a look at the data frame below, which centers the periods around the immediate treatment year. Note that it is a modified version of your dftest: treatment onset is staggered by one year across the four units, and the y values are fresh draws since your sample.int() call isn't seeded:

# id          = unit identifier
# year        = time identifier
# first_treat = year of exposure
# centered    = relative periods (year - first_treat)

   id year event  y first_treat centered
1   1 2010     1  8        2010        0
2   1 2011     2  1        2010        1
3   1 2012     3 14        2010        2
4   1 2013     4 10        2010        3
5   2 2010     1 34        2011       -1
6   2 2011     2 27        2011        0
7   2 2012     3 31        2011        1
8   2 2013     4 24        2011        2
9   3 2010     1 39        2012       -2
10  3 2011     2 37        2012       -1
11  3 2012     3 41        2012        0
12  3 2013     4 12        2012        1
13  4 2010     1 38        2013       -3
14  4 2011     2 13        2013       -2
15  4 2012     3 24        2013       -1
16  4 2013     4 24        2013        0
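
In case it helps, here is one way to build those two columns in base R. This is only a minimal sketch: it assumes a 0/1 column treated marking treated unit-years, which is hypothetical here, so substitute however treatment onset is recorded in your data.

# first year in which each unit is observed as treated
# (never-treated units would get Inf here; I assume everyone is eventually treated)
dftest$first_treat <- ave(
  ifelse(dftest$treated == 1, dftest$year, NA),
  dftest$id,
  FUN = function(x) min(x, na.rm = TRUE)
)

# relative event time: 0 in the adoption year, negative before, positive after
dftest$centered <- dftest$year - dftest$first_treat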

Period 0 is the year a unit starts treatment. Now that centered is in the data frame, enter it into your model as a factor and R will 'dummy out' the relative periods for you. In this setting, you end up with, at most, three relative periods before and three after the immediate adoption year. R will have to make some compromises at the extremes. Let's see how a few packages handle your abridged data frame.

Package: plm

# library(plm)
summary( plm(y ~ factor(centered), model = "within", effect = "twoways", index = c("id", "year"), data = dftest) )
Coefficients: (1 dropped because of singularities)
                   Estimate Std. Error t-value Pr(>|t|)
factor(centered)-2  12.5833    14.5543  0.8646   0.4360
factor(centered)-1   6.6667    13.1682  0.5063   0.6393
factor(centered)0    5.7500    11.5329  0.4986   0.6442
factor(centered)1   21.3333    13.1682  1.6201   0.1805
factor(centered)2   19.9167    14.5543  1.3684   0.2430

Package: lfe

# library(lfe)
summary(felm(y ~ factor(centered) | id + year, data = dftest))
Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
factor(centered)-2   12.583     14.554   0.865    0.436
factor(centered)-1    6.667     13.168   0.506    0.639
factor(centered)0     5.750     11.533   0.499    0.644
factor(centered)1    21.333     13.168   1.620    0.181
factor(centered)2    19.917     14.554   1.368    0.243
factor(centered)3       NaN         NA     NaN      NaN

The warning message is simply telling you that collinearity is present and at least one of the variables cannot be estimated, hence the NaN for the third lag.

Package: fixest

# library(fixest)
summary(feols(y ~ factor(centered) | id + year, panel.id = c('id', 'year'), data = dftest))
OLS estimation, Dep. Var.: y
Observations: 16 
Fixed-effects: id: 4,  year: 4
Standard-errors: Clustered (id) 
                   Estimate Std. Error  t value Pr(>|t|)    
factor(centered)-2 12.58333    9.25728 1.359291 0.267210    
factor(centered)-1  6.66667   10.15480 0.656504 0.558364    
factor(centered)0   5.75000    8.08455 0.711233 0.528253    
factor(centered)1  21.33333    4.74603 4.494985 0.020552 *  
factor(centered)2  19.91667    9.57183 2.080759 0.128909    
... 1 variable was removed because of collinearity (factor(centered)3)

In fixest, the default behavior is to cluster the standard errors by the first fixed effect (here, id), so ignore the different uncertainty estimates.
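
If you want conventional (iid) standard errors for a like-for-like comparison with the plm and felm output above, you can request them explicitly in summary(). This is a sketch, and the argument name depends on your fixest version:

# recent fixest versions
summary(feols(y ~ factor(centered) | id + year, data = dftest), vcov = "iid")

# older versions use `se` instead
# summary(feols(y ~ factor(centered) | id + year, data = dftest), se = "standard")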


In three of the most popular packages for this kind of analysis, R drops two relative period indicators: the reference level (here, -3) and the most extreme lag (3). R is doing what it's supposed to do; the software cannot distinguish the unit and/or time fixed effects from the relative period indicators at the extremes. As for why, it's easier to see once we create the dummies manually. Note that, by virtue of starting treatment early, a unit has more post-treatment periods (i.e., lags); similarly, by starting treatment late, a unit has more pre-treatment periods (i.e., leads). This is true at least for balanced panels.
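
To make the asymmetry concrete, cross-tabulate the relative periods against the calendar years; the counts below follow directly from the centered data frame shown earlier:

table(dftest$centered, dftest$year)

     2010 2011 2012 2013
  -3    1    0    0    0
  -2    1    1    0    0
  -1    1    1    1    0
  0     1    1    1    1
  1     0    1    1    1
  2     0    0    1    1
  3     0    0    0    1

The extreme leads and lags each show up in exactly one calendar year, for exactly one unit, which is where the trouble at the endpoints comes from.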

In the data frame below, I created the event time indicators manually and appended them to the data frame:

## Event Time Dummies

# lead_3, lead_2, lead_1 = relative periods -3, -2, -1 (pre-treatment)
# lag_0                  = first year of treatment (relative period 0)
# lag_1, lag_2, lag_3    = relative periods 1, 2, 3 (post-treatment)

   id year event  y first_treat centered lead_3 lead_2 lead_1 lag_0 lag_1 lag_2 lag_3
1   1 2010     1  8        2010        0      0      0      0     1     0     0     0
2   1 2011     2  1        2010        1      0      0      0     0     1     0     0
3   1 2012     3 14        2010        2      0      0      0     0     0     1     0
4   1 2013     4 10        2010        3      0      0      0     0     0     0     1
5   2 2010     1 34        2011       -1      0      0      1     0     0     0     0
6   2 2011     2 27        2011        0      0      0      0     1     0     0     0
7   2 2012     3 31        2011        1      0      0      0     0     1     0     0
8   2 2013     4 24        2011        2      0      0      0     0     0     1     0
9   3 2010     1 39        2012       -2      0      1      0     0     0     0     0
10  3 2011     2 37        2012       -1      0      0      1     0     0     0     0
11  3 2012     3 41        2012        0      0      0      0     1     0     0     0
12  3 2013     4 12        2012        1      0      0      0     0     1     0     0
13  4 2010     1 38        2013       -3      1      0      0     0     0     0     0
14  4 2011     2 13        2013       -2      0      1      0     0     0     0     0
15  4 2012     3 24        2013       -1      0      0      1     0     0     0     0
16  4 2013     4 24        2013        0      0      0      0     1     0     0     0
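
If you want to build these indicators yourself, here is a minimal base R sketch; it assumes the centered column constructed above:

# one 0/1 indicator per relative period; negative values are leads,
# positive values are lags
for (k in 1:3) {
  dftest[[paste0("lead_", k)]] <- as.integer(dftest$centered == -k)
  dftest[[paste0("lag_", k)]]  <- as.integer(dftest$centered ==  k)
}
dftest$lag_0 <- as.integer(dftest$centered == 0)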

R defaults to dropping the first id, but that is the only unit with enough post-periods to estimate a third lag (it is treated in 2010 and observed through 2013). Similarly, R's default behavior chooses 2010 as the reference year for the year fixed effects, but only the last unit (i.e., id = 4) contributes a third lead, and its third lead falls in 2010. In short, the dummies at the extremes cannot be estimated. The data grow sparser as we move farther away from the immediate adoption period, hence the larger standard errors at the extremes. Again, if you saturate the model with event time dummies and a full series of year indicators, you'll find at least one year indicator is collinear with a lead and/or lag at the endpoints; a sketch of such a saturated regression follows.
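
To see the collinearity directly, you could run a saturated specification with plain lm() and look for NA coefficients. This is only a sketch; it uses the manually built lead/lag columns above and leaves out lead_3 as the reference:

# unit dummies, year dummies, and all event time indicators except the
# reference (lead_3); lm() returns NA for any coefficient that is an
# exact linear combination of the others
summary(lm(y ~ factor(id) + factor(year) +
             lead_2 + lead_1 + lag_0 + lag_1 + lag_2 + lag_3,
           data = dftest))

With this balanced, fully staggered panel, exactly one of the event time coefficients comes back NA (most likely lag_3, the last one entered), mirroring what plm, lfe, and fixest dropped above.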


A quick word on your reference period. In most real-world event study settings, it's common practice to omit a period before the "event" of interest. The period immediately before the event (i.e., -1) is the usual choice, though a more distant pre-period is also justifiable if you're particularly interested in the outcome variation right before the onset of treatment.
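
For example, to use -1 (the year before adoption) as the omitted period instead of R's default (the smallest level, -3), relevel the factor. A minimal sketch:

# make -1 the reference category
dftest$centered_f <- relevel(factor(dftest$centered), ref = "-1")
summary(felm(y ~ centered_f | id + year, data = dftest))

# fixest's i() can set the reference level directly:
# summary(feols(y ~ i(centered, ref = -1) | id + year, data = dftest))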

Also, if you have a control group, be very careful about how you define the control units. Since treatment happens at different times, we can't standardize time around any one year. In a staggered adoption design, the event time indicators for never-treated control units remain 0 in all periods.

In short, I don't really think you have a problem. Rather, R is behaving appropriately given what you're feeding the model.
