Data points for some control variables missing in regression - still feasible?

Question

I'm currently working on an event study to examine abnormal returns.

In the first step, I've calculated abnormal returns around a certain type of company event, covering roughly 13,000 events and more than 4,000 firms.

In the second step, I intend to run a regression analysis with several (control?) variables to see whether some of the effect stems from certain aspects of the event.

So far so good. My issue is that I want to control for 5-6 factors such as market capitalization and total enterprise value, but I don't have every data point for each of the 13,000 events. For example, for Event 1 I'm missing market capitalization, for Event 2 the total enterprise value, for Event 3 the M/B ratio, and so on.

Question: Can I still run a meaningful regression even though I have a significant number of NAs, or am I required to delete every event with incomplete data? Given the poor data availability for some variables (which I would still love to include), that would mean deleting a very large number of events (>7,000).

You could try to impute the missing values. See for example Multiple Imputation by Chained Equations (MICE) Explained. — dipetkov, Jul 08 '22 at 15:11
Thanks for your answer, and sorry for the late reply. I've had a look into it. The only problem is that I have several binary control variables (e.g., paid in stock or cash, target company is public or private). In such cases, it seems to me that imputation doesn't make sense. — LeCV, Jul 19 '22 at 08:27
There are three options: drop rows/events with missing values; do single value imputation (replace with mean, median, mode, etc.); do multiple imputation. Whether your regression analysis is meaningful if you use only complete data points depends on the reasons the data is missing. Take a look at van Buuren's Flexible Imputation of Missing Data. — dipetkov, Jul 19 '22 at 10:09
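The first two options in the comment above can be sketched in a few lines. This is only an illustration using Python's standard library: the toy events, the column names (`car`, `market_cap`, `paid_in_stock`), and the helper functions are all hypothetical and not taken from the question's data.

```python
from statistics import median, mode

# Hypothetical toy events; None marks a missing value.
events = [
    {"car": 0.021, "market_cap": 1200.0, "paid_in_stock": 1},
    {"car": -0.004, "market_cap": None, "paid_in_stock": 0},
    {"car": 0.013, "market_cap": 310.0, "paid_in_stock": None},
    {"car": 0.030, "market_cap": 875.0, "paid_in_stock": 0},
    {"car": -0.011, "market_cap": None, "paid_in_stock": 0},
]


def complete_cases(rows):
    """Option 1: keep only rows with no missing value (listwise deletion)."""
    return [r for r in rows if all(v is not None for v in r.values())]


def single_impute(rows, column, summary):
    """Option 2: fill missing entries of one column with a summary
    (e.g. median or mode) of that column's observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = summary(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


kept = complete_cases(events)
filled = single_impute(events, "market_cap", median)   # continuous: median
filled = single_impute(filled, "paid_in_stock", mode)  # binary: most common value

print(f"{len(kept)} of {len(events)} events survive listwise deletion")
```

Single-value imputation keeps every event but treats the filled-in values as if they had been observed, so standard errors from a regression on `filled` tend to be too small. That is one reason the third option, multiple imputation, is usually preferred.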

score 0 · Answer 1 · answered Jul 23 '22 at 07:05

As dipetkov says, multiple imputation is the best solution here. Imputing binary and categorical variables is both reasonable and quite feasible. There is a vignette that shows how to do this in the R package Amelia, and a Google Scholar search turns up a number of papers on multiple imputation of binary variables, in case you would like to look at the primary literature on the subject.
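With multiple imputation, the regression is run once on each of the m completed datasets and the results are then combined, typically with Rubin's rules. Below is a minimal sketch of that pooling step for a single coefficient, using only Python's standard library. The function name `pool_rubin` and the five estimates and standard errors are made up for illustration. In practice they would come from the imputation software (for example Amelia or mice in R).

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Pool one coefficient across m imputed datasets with Rubin's rules.

    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    pooled = mean(estimates)                     # average of the m estimates
    within = mean(se ** 2 for se in std_errors)  # average sampling variance
    between = variance(estimates)                # variance across imputations (m - 1 divisor)
    total = within + (1 + 1 / m) * between       # total variance
    return pooled, sqrt(total)


# Hypothetical coefficient on a binary control from m = 5 imputed datasets.
estimates = [0.10, 0.12, 0.08, 0.11, 0.09]
std_errors = [0.02, 0.02, 0.02, 0.02, 0.02]

coef, se = pool_rubin(estimates, std_errors)
print(f"pooled coefficient = {coef:.3f}, pooled SE = {se:.4f}")
```

The between-imputation term is what carries the uncertainty about the missing values into the final standard error, which single-value imputation ignores.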
