I have data about the occurrence of a data breach at certain companies for the period 2005-2018. Now I have a question about the model I should use. I have two options:

Probit/Logit: I set the dependent variable to 1 if there has been a data breach in a certain year. For example:

  • Company A does not suffer a data breach in the period 2005-2007 --> variable Data Breach takes the value 0 for all the years 2005-2007
  • Company A suffers a data breach in 2008 --> variable Data Breach takes the value 1 in 2008
  • Company A does not suffer a data breach in the period 2009-2017 --> variable Data Breach takes the value 0 for all the years 2009-2017
  • Company A suffers a data breach again in 2018 --> variable Data Breach takes the value 1 in 2018

And with this data I run a probit/logit regression.
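The coding described above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the names `breach_years` and `build_panel` and the dictionary layout are hypothetical and were chosen for the example.

```python
# Hypothetical input: the years in which each company suffered a breach.
breach_years = {"Company A": {2008, 2018}}

def build_panel(breach_years, first_year=2005, last_year=2018):
    """Return one row per company-year with a 0/1 breach indicator."""
    rows = []
    for company, years in breach_years.items():
        for year in range(first_year, last_year + 1):
            rows.append({
                "company": company,
                "year": year,
                "data_breach": 1 if year in years else 0,
            })
    return rows

panel = build_panel(breach_years)
```

Each company contributes 14 rows (2005-2018), and the yearly regressors can be attached to the matching company-year row.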

OLS:

I count the number of times each company has been the victim of a data breach during the sample period. So in the example above, the variable Data Breaches will take the value 2 for Company A. And with this data I run an OLS regression.
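The count version of the dependent variable could be built roughly like this. It is a sketch only: `breach_events`, `count_breaches`, and the extra companies B and C are made up for illustration.

```python
from collections import Counter

# Hypothetical breach events as (company, year) pairs; Company B and C are invented.
breach_events = [("Company A", 2008), ("Company A", 2018), ("Company B", 2012)]
companies = ["Company A", "Company B", "Company C"]

def count_breaches(events, companies):
    """Return the total number of breaches per company, including zeros."""
    counts = Counter(company for company, _ in events)
    return {company: counts[company] for company in companies}

breach_counts = count_breaches(breach_events, companies)
```

Note that this leaves one row per company, so regressors that change from year to year would have to be summarised somehow (for example averaged over the period) before they can enter the regression.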

Notes:

It is important to know that my independent variables are all numerical, taking values from 0 to 100, and that their values may differ from year to year. For example:

  • Company A does not have a Cyber Committee in place for the period 2005-2008 --> Variable Cyber Committee takes 0 for all those years.
  • Company A does have a Cyber Committee in place for the period 2009-2018 --> Variable Cyber Committee takes 1 for all those years.

This example uses an independent variable that takes only the values 0 and 1, but there are also independent variables such as the number of cyber-related jobs someone has held in the past.

I also want to include industry and year effects in the regression, because the chance of a data breach may be higher in certain years or industries.

Intuitively, I would choose the probit/logit model, but for this model it is difficult to implement the year and industry fixed effects.
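For reference, one common way to put year and industry effects into a logit or probit is to add a 0/1 dummy column for each year and each industry, leaving one category out as the reference. The sketch below uses only the standard library; the helper `add_dummies` and the example rows are assumptions made for illustration.

```python
def add_dummies(rows, key):
    """Add a 0/1 column per level of row[key], omitting the first level as reference."""
    levels = sorted({row[key] for row in rows})
    out = []
    for row in rows:
        new_row = dict(row)
        for level in levels[1:]:  # the first level is the omitted reference category
            new_row[f"{key}_{level}"] = 1 if row[key] == level else 0
        out.append(new_row)
    return out

# Hypothetical company-year rows.
rows = [
    {"company": "Company A", "year": 2005, "industry": "finance"},
    {"company": "Company A", "year": 2006, "industry": "finance"},
    {"company": "Company B", "year": 2005, "industry": "retail"},
]
rows = add_dummies(add_dummies(rows, "year"), "industry")
```

The resulting dummy columns are then passed to the model as ordinary regressors. Packages with a formula interface usually build them automatically, for example via `C(year) + C(industry)` in a statsmodels formula.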

1 Answer

I believe fairly sophisticated statistical modelling is required to capture this dependence adequately. Your data are so-called panel data (a combination of time-series and cross-sectional data). What is certain is that logit/probit is definitely a bad choice here, because one cannot put different years of the same company into one sample as if they were as independent as different years of different companies from your dataset. The big problem with OLS is that the relationship between the number of breaches and the regressors is almost certainly not linear.
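To illustrate the point about OLS, here is a small sketch with invented numbers (the scores and counts are made up for the example and do not come from any real data). A straight line fitted to breach counts can predict a negative number of breaches, which is impossible for a count.

```python
# Invented toy data: one regressor on the 0-100 scale and breach counts per company.
scores = [0, 20, 40, 60, 80, 100]
counts = [5, 2, 1, 0, 0, 0]

def fit_ols(x, y):
    """One-regressor OLS by the closed-form formulas; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

intercept, slope = fit_ols(scores, counts)
# The fitted line drops below zero at the top of the scale.
predicted_at_100 = intercept + slope * 100
```

In this toy case the fitted line falls below zero for high scores, which is one symptom of forcing a linear relationship onto a non-negative count.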

Alex