I am trying to build a model for predicting Olympic medal counts.
My data looks like this:
  year     country    gdp   pop athletes gold tot tot_gold avmedals muslim host comm soviet oneparty
1 2000 Afghanistan   5316 20094        0    0   0      298      915      1    0    0      0        0
2 2000     Algeria  54790 31184       47    1   5      298      915      1    0    0      0        0
3 2000   Argentina 284204 37057      143    0   4      298      915      0    0    0      0        0
4 2000     Armenia   1912  3070       25    0   1      298      915      0    0    1      1        0
5 2000   Australia 415034 19153      617   16  58      298      915      0    1    0      0        0
6 2000     Austria 196800  8012       92    2   3      298      915      0    0    0      0        0
  altitude
1   1790.0
2      1.0
3     10.5
4    989.0
5    605.0
6    170.0
I have tried fitting Poisson and negative binomial models to the data. However, I have noticed that many countries won no medals at all, and that the variance of the total medal count is much larger than its mean.
> summary(df$tot)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.000 3.000 8.921 8.000 121.000
> var(df$tot)
[1] 291.4698
Therefore I thought a zero-inflated model would be more appropriate for my data. However, I am encountering problems when fitting the model with these variables, and I am not sure what I should be doing, or whether there is some rule for zero-inflated models that I am not aware of.
> zip_model <- zeroinfl(tot ~ log(gdp) + muslim + soviet + oneparty + host + athletes+ gold +
+ pop, data = train_set, dist = "poisson")
Error: (converted from warning) glm.fit: fitted probabilities numerically 0 or 1 occurred
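This warning usually points to (quasi-)complete separation in the binary zero/non-zero part of the model: some regressor predicts perfectly whether tot is zero. A minimal base-R sketch of how one might check for this, using only the six rows printed above as toy data (separates() is a hypothetical helper written for this illustration, not a function from pscl):

```r
# Toy data: a few columns from the six rows printed above (year 2000 only).
toy <- data.frame(
  country  = c("Afghanistan", "Algeria", "Argentina",
               "Armenia", "Australia", "Austria"),
  athletes = c(0, 47, 143, 25, 617, 92),
  host     = c(0, 0, 0, 0, 1, 0),
  tot      = c(0, 5, 4, 1, 58, 3)
)

# Hypothetical helper: TRUE if a single numeric regressor x splits the
# zero and non-zero outcomes perfectly (complete separation on x alone).
# It does not detect quasi-complete separation or separation that only
# arises from combinations of regressors.
separates <- function(x, y) {
  zero <- y == 0
  if (all(zero) || !any(zero)) return(FALSE)
  max(x[zero]) < min(x[!zero]) || min(x[zero]) > max(x[!zero])
}

sep <- sapply(toy[c("athletes", "host")], separates, y = toy$tot)
print(sep)

# For binary regressors, an empty cell in the cross-table hints at
# quasi-complete separation (here: no host country with zero medals).
print(table(zero = toy$tot == 0, host = toy$host))
```

In these six rows, athletes separates the outcomes completely (the only country with no medals sent no athletes), which is the kind of pattern that can trigger the warning on the full data.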

hurdle() would lead to the same problems as above. But as the two parts from hurdle() can be reproduced via glm() and zerotrunc() (from package countreg on R-Forge), it would also be possible to fit a bias-reduced brglm2 for the glm(). Then one can go on with the zerotrunc() using only those variables that have any variation for tot > 0. – Achim Zeileis Jun 13 '23 at 02:00
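The two-part approach described in this comment could be sketched roughly as follows. This is only an illustration under several assumptions: the data are simulated (not the Olympic data), the variable subset is arbitrary, log_gdp stands in for log(gdp), and both brglm2 (CRAN) and countreg (R-Forge) are assumed to be installed; the model-fitting step is skipped if they are not.

```r
set.seed(1)
n <- 300

# Simulated stand-in for train_set (NOT the real Olympic data).
d <- data.frame(log_gdp  = rnorm(n, 10, 1.5),
                athletes = rpois(n, 30),
                oneparty = 0)
d$tot <- rpois(n, exp(-1 + 0.03 * d$athletes))

# Make oneparty = 1 occur only among zero-medal rows, so that it
# (quasi-)separates the binary part and never varies when tot > 0.
d$oneparty[which(d$tot == 0)[1:5]] <- 1

vars <- c("log_gdp", "athletes", "oneparty")
pos  <- d[d$tot > 0, ]

# Keep only regressors that still vary among the positive counts.
varies     <- sapply(pos[vars], function(v) length(unique(v)) > 1)
count_vars <- vars[varies]
print(count_vars)

if (requireNamespace("brglm2", quietly = TRUE) &&
    requireNamespace("countreg", quietly = TRUE)) {
  library(brglm2)

  # Binary part: is tot > 0? Bias-reduced fit instead of plain ML.
  zero_part <- glm(I(tot > 0) ~ log_gdp + athletes + oneparty,
                   family = binomial, data = d, method = "brglmFit")

  # Count part: zero-truncated Poisson on the positive counts only.
  count_part <- countreg::zerotrunc(
    reformulate(count_vars, response = "tot"),
    data = pos, dist = "poisson")

  print(coef(zero_part))
  print(coef(count_part))
}
```

Here oneparty is dropped from the count part because it is constant among rows with tot > 0, while the bias-reduced binary fit can still return finite coefficients for it.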