I'm new to statistics so hoping for an ELI5 explanation! I need to use a hurdle (or zero-inflated) model to try to replicate someone else's methodology on a newer dataset for my undergraduate dissertation.
I ran the model using the pscl package in R and the plots look like what I'd expect, though I'm cautious because admittedly I barely understand this stuff.
I'm researching whether there is a link between a UK member of parliament's margin of victory at the previous election (majority) and the number of motions they proposed (count).
Here's what my data looks like:
The variable with an excess of zeros is "count": 186 of its 371 values are zero. There are zeros in the majority column too, but majority is a predictor, so I don't want those treated as inflated, if that makes sense. There is a considerable difference in the number of motions proposed between the parties.
Here's the model:
m1 <- hurdle(count ~ majority | politicalParty, data = zinb, dist = "negbin")
I understand that there is a count component (count ~ majority) and a "logit" component (politicalParty), but I don't really understand which predictors to include on either side of the "|". For instance, should I also include count and majority in the logit model?
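To make the question concrete, here is a sketch of the specifications I think I'm choosing between, as far as I understand the pscl formula syntax (left of "|" is the count part, right of "|" is the zero part, and with no "|" the same predictors go into both). The data frame `sim` is simulated purely for illustration and is not my real data; the party labels and all values are made up.

```r
library(pscl)

# Simulated stand-in for my data (NOT the real MP data): 371 rows,
# a numeric majority and a three-level party factor. All values are made up.
set.seed(1)
n <- 371
sim <- data.frame(
  majority = runif(n, 0, 40),
  politicalParty = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)

# About half the counts are zero; the positives are negative binomial draws
# with the zeros thrown away (i.e. a zero-truncated negative binomial).
pos <- rnbinom(5 * n, mu = 3, size = 1.5)
pos <- pos[pos > 0][seq_len(n)]
sim$count <- rbinom(n, 1, 0.5) * pos

# (a) my current model: majority in the count part, party in the zero part
m_a <- hurdle(count ~ majority | politicalParty, data = sim, dist = "negbin")

# (b) no "|": the same predictors are used in both parts
m_b <- hurdle(count ~ majority + politicalParty, data = sim, dist = "negbin")

# (c) majority in both parts, party only in the zero part
m_c <- hurdle(count ~ majority | majority + politicalParty,
              data = sim, dist = "negbin")

# Coefficient names are prefixed with count_ or zero_, which shows
# which part of the model each predictor ended up in.
names(coef(m_a))
names(coef(m_c))
```

In all three versions the response (count) only ever appears on the left of the "~", which is part of why I'm unsure what "including count in the logit model" would even mean.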
I also don't really understand this line from UCLA's example: "Since zero inflated negative binomial has both a count model and a logit model, each of the two models should have good predictors. The two models do not necessarily need to use the same predictors." How do I know which model's predicted values are being used?
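For what it's worth, this is how I've been trying to inspect the predictions. As far as I can tell from ?predict.hurdle, the `type` argument selects which part is returned: "count" for the count part on its own, "zero" for the hurdle part's adjustment, and "response" (the default) for the overall expected count that combines the two. This is only a sketch on simulated data (the `sim` data frame and its values are made up, not my real data), and my reading of the help page may be wrong.

```r
library(pscl)

# Simulated stand-in for my data (NOT the real MP data); values are made up.
set.seed(2)
n <- 371
sim <- data.frame(
  majority = runif(n, 0, 40),
  politicalParty = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
pos <- rnbinom(5 * n, mu = 3, size = 1.5)
pos <- pos[pos > 0][seq_len(n)]          # keep only positive draws
sim$count <- rbinom(n, 1, 0.5) * pos     # roughly half the rows become zero

m <- hurdle(count ~ majority | politicalParty, data = sim, dist = "negbin")

# Predictions from each part separately, and from the combined model
p_count <- predict(m, type = "count")     # count part only (before the hurdle)
p_zero  <- predict(m, type = "zero")      # adjustment coming from the zero part
p_resp  <- predict(m, type = "response")  # overall expected count (the default)

# Side by side for the first few rows
head(cbind(p_count, p_zero, p_resp))

# My understanding is that the overall prediction is the product of the two parts
all.equal(unname(p_resp), unname(p_count * p_zero))
```

If that reading is right, then the default predictions use both models at once, and my question becomes how to tell what each part contributes.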
