
I am trying to build a model for predicting Olympic medal counts.

My data looks like this:

year     country    gdp   pop athletes gold tot tot_gold avmedals muslim host comm soviet oneparty
1 2000 Afghanistan   5316 20094        0    0   0      298      915      1    0    0      0        0
2 2000     Algeria  54790 31184       47    1   5      298      915      1    0    0      0        0
3 2000   Argentina 284204 37057      143    0   4      298      915      0    0    0      0        0
4 2000     Armenia   1912  3070       25    0   1      298      915      0    0    1      1        0
5 2000   Australia 415034 19153      617   16  58      298      915      0    1    0      0        0
6 2000     Austria 196800  8012       92    2   3      298      915      0    0    0      0        0
  altitude
1   1790.0
2      1.0
3     10.5
4    989.0
5    605.0
6    170.0

I have tried fitting Poisson and negative binomial models to the data. However, I have noticed that a lot of countries won no medals at all, and that the mean of the total medal counts is much smaller than the variance.

> summary(df$tot)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   3.000   8.921   8.000 121.000 
> var(df$tot)
[1] 291.4698
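
The summary above compares the marginal mean and variance of the counts. A check that conditions on regressors could look like the following minimal sketch. It runs on simulated data: the data frame `toy` and its columns `x` and `y` are hypothetical stand-ins, not the Olympic data. It computes the Pearson dispersion statistic from a Poisson fit and then fits a negative binomial model with `MASS::glm.nb()` for comparison.

```r
library(MASS)  # provides glm.nb(); MASS ships with standard R installations

set.seed(1)
n <- 300
# Hypothetical stand-in data: one regressor and an overdispersed count response
toy <- data.frame(x = rnorm(n))
toy$y <- rnbinom(n, mu = exp(1 + 0.5 * toy$x), size = 0.8)

# Poisson fit, then the Pearson dispersion statistic:
# sum of squared Pearson residuals divided by the residual degrees of freedom
pois_fit <- glm(y ~ x, family = poisson, data = toy)
dispersion <- sum(residuals(pois_fit, type = "pearson")^2) / df.residual(pois_fit)
dispersion  # values well above 1 suggest overdispersion given x

# Negative binomial fit on the same formula, compared by AIC
nb_fit <- glm.nb(y ~ x, data = toy)
AIC(pois_fit, nb_fit)
```

A dispersion statistic close to 1 would be consistent with the Poisson mean-variance assumption given the regressors; how large is "too large" remains a judgment call.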


Therefore I thought a zero-inflated model would be more appropriate for my data. However, I am encountering some problems fitting the model with these variables, and I am not sure what I should be doing or whether there is some rule for zero-inflated models that I am not aware of.

> zip_model <- zeroinfl(tot ~ log(gdp) + muslim + soviet + oneparty + host + athletes+ gold +
+                         pop, data = train_set, dist = "poisson")
Error: (converted from warning) glm.fit: fitted probabilities numerically 0 or 1 occurred
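
One possible cause of this message is separation in the zero part of the model, where some regressor predicts "zero vs. positive" perfectly. The following sketch shows one way to look for it, assuming simulated data in which a binary regressor is constructed to separate. The data frame `toy` and its columns are hypothetical and only loosely mimic the variables above.

```r
set.seed(2)
n <- 200
# Hypothetical stand-in data, not the Olympic data
toy <- data.frame(host = rbinom(n, 1, 0.1), x = rnorm(n))
toy$y <- rpois(n, exp(0.2 + 0.3 * toy$x))
# Force quasi-complete separation: every row with host == 1 has a positive count
toy$y[toy$host == 1 & toy$y == 0] <- 1L

toy$any_medal <- as.integer(toy$y > 0)

# An empty cell in the cross-tabulation flags a binary regressor that separates
tab <- table(host = toy$host, any_medal = toy$any_medal)
tab

# Logistic regression for zero vs. positive; suppressWarnings() lets the sketch
# run through even if glm() warns about the fit
fit <- suppressWarnings(
  glm(any_medal ~ host + x, family = binomial, data = toy)
)
coef(summary(fit))  # a very large estimate and standard error for host is the usual symptom
```

On real data the same cross-tabulation can be repeated for each binary regressor. Continuous regressors need a different check, for example plotting them against the zero indicator.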
Dome
  • Just FYI: the mean-variance assumption is conditional on the X variables; it cannot be assessed by looking at the marginal variance/mean of the counts. Now regarding your error message: my guess is that there is perfect separation going on. What happens if you just fit a logistic regression predicting whether a country has ever won a medal or not? – John Madden Jun 12 '23 at 21:45
  • Hi @JohnMadden thanks for letting me know. I have just tried fitting a logistic regression and I got the same error as before – Dome Jun 12 '23 at 21:50
  • See if anything from this thread helps: https://stats.stackexchange.com/questions/11109/how-to-deal-with-perfect-separation-in-logistic-regression (most of the answers generalize to count likelihoods). But to make life easy, you may wish to start with the Firth correction, which in the simplest cases amounts to incrementing all of your counts by 0.5. (https://stats.stackexchange.com/questions/415485/seeking-to-understand-using-the-firth-correction-in-generalized-estimating-equat) – John Madden Jun 12 '23 at 21:51
  • Rather than aiming for a zero-inflation model, I would probably try a hurdle two-part model. Just using hurdle() would lead to the same problems as above. But as the two parts from hurdle() can be reproduced via glm() and zerotrunc() (from package countreg on R-Forge), it would also be possible to fit a bias-reduced brglm2 for the glm(). Then one can go on with the zerotrunc() using only those variables that have any variation for tot > 0. – Achim Zeileis Jun 13 '23 at 02:00
  • Also, I would try to fit a negative binomial model. Even if the comment about the mean-variance assessment, conditional on the regressors, is correct, I would expect that the ratio in your data is far too large to be explained by regressors. Of course, the negative binomial model could also be affected by separation... – Achim Zeileis Jun 13 '23 at 02:03
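
The two-part approach suggested in the comments could be sketched roughly as follows. This is a sketch under assumptions, not a definitive implementation. The data are simulated, and `toy`, `host`, `x`, and `y` are hypothetical names. It also needs the packages brglm2 (CRAN) and countreg (R-Forge), so each part is skipped if its package is not installed.

```r
set.seed(3)
n <- 200
# Hypothetical stand-in data with a separating binary regressor
toy <- data.frame(host = rbinom(n, 1, 0.1), x = rnorm(n))
toy$y <- rpois(n, exp(0.2 + 0.3 * toy$x))
toy$y[toy$host == 1 & toy$y == 0] <- 1L  # host == 1 always has y > 0
toy$any_pos <- as.integer(toy$y > 0)

# Part 1: zero vs. positive, estimated with bias reduction so that the
# coefficient of the separating regressor stays finite
if (requireNamespace("brglm2", quietly = TRUE)) {
  bin_fit <- glm(any_pos ~ host + x, family = binomial, data = toy,
                 method = brglm2::brglmFit)
  print(coef(summary(bin_fit)))
}

# Part 2: zero-truncated count model on the positive counts only, keeping
# only regressors that still vary within this subset.
# countreg is on R-Forge:
#   install.packages("countreg", repos = "https://R-Forge.R-project.org")
pos <- subset(toy, y > 0)
if (requireNamespace("countreg", quietly = TRUE)) {
  cnt_fit <- countreg::zerotrunc(y ~ host + x, data = pos, dist = "poisson")
  print(summary(cnt_fit))
}
```

Setting `dist = "negbin"` in `zerotrunc()` would be the analogous way to allow for overdispersion in the positive counts.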
