I assume that the numbers of home corners and away corners each follow a Negative Binomial distribution. I also expect the two variables to share the same parameter p, so that their sum is again Negative Binomial.
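For reference, the closure property this expectation relies on can be written out as below. It is a standard result, stated here in the parameterisation used by R's `rnbinom(n, size, prob)` (number of failures before the r-th success). The symbols r_1, r_2, X and Y are notation introduced for this sketch and do not appear in the code. Note that the result needs independence of the two counts in addition to a common p.

```latex
X \sim \mathrm{NB}(r_1, p), \quad Y \sim \mathrm{NB}(r_2, p), \quad X \perp Y
\;\Longrightarrow\;
X + Y \sim \mathrm{NB}(r_1 + r_2,\; p)

% Via probability generating functions:
G_X(s)\, G_Y(s)
  = \left( \frac{p}{1 - (1-p)s} \right)^{r_1}
    \left( \frac{p}{1 - (1-p)s} \right)^{r_2}
  = \left( \frac{p}{1 - (1-p)s} \right)^{r_1 + r_2}
```

If either the common p or the independence fails, the product of generating functions no longer collapses to a single Negative Binomial form.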

Take the number of corners in football as an example

### code for dataset
eng = read.csv('https://www.football-data.co.uk/mmz4281/2223/E0.csv')[, c('HC', 'AC')]
spa = read.csv('https://www.football-data.co.uk/mmz4281/2223/SP1.csv')[, c('HC', 'AC')]
ita = read.csv('https://www.football-data.co.uk/mmz4281/2223/I1.csv')[, c('HC', 'AC')]
ger = read.csv('https://www.football-data.co.uk/mmz4281/2223/D1.csv')[, c('HC', 'AC')]
fra = read.csv('https://www.football-data.co.uk/mmz4281/2223/F1.csv')[, c('HC', 'AC')]

cor_dat = rbind(eng, spa, ita, ger, fra)
cor_dat$totalC = cor_dat$HC + cor_dat$AC
cor_dat$diffC = cor_dat$HC - cor_dat$AC

Check out my assumption

### home

home_mu = mean(cor_dat$HC)
home_var = var(cor_dat$HC)

Pois

sim_hc_pois = rpois(1e5, home_mu)

NegBin

home_p = home_mu / home_var
home_r = home_mu^2 / (home_var - home_mu)

sim_hc_neg = rnbinom(1e5, home_r, home_p)
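The two parameter formulas are method-of-moments estimates. A short derivation, assuming the `size`/`prob` parameterisation of `rnbinom` and writing μ and σ² for the sample mean and variance (`home_mu` and `home_var` in the code), might look like this:

```latex
\mathbb{E}[X] = \mu = \frac{r(1-p)}{p},
\qquad
\operatorname{Var}(X) = \sigma^2 = \frac{r(1-p)}{p^2}

\frac{\mu}{\sigma^2} = p
\quad\Longrightarrow\quad
\hat{p} = \frac{\mu}{\sigma^2}

r = \frac{\mu p}{1 - p}
  = \frac{\mu \cdot \mu / \sigma^2}{1 - \mu / \sigma^2}
\quad\Longrightarrow\quad
\hat{r} = \frac{\mu^2}{\sigma^2 - \mu}
```

The estimates are only valid when σ² > μ (overdispersion); otherwise the estimated r is negative or undefined.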

Plot

plot(prop.table(table(cor_dat$HC)), type = 'l', col = 'red', ylim = c(0, 0.17),
     xlab = 'home corners', ylab = 'proportion')
lines(prop.table(table(sim_hc_pois)), type = 'l', col = 'yellow')
lines(prop.table(table(sim_hc_neg)), type = 'l', col = 'green')
legend(14, 0.11, legend = c("actual", "Pois", "NegBin"),
       col = c("red", "yellow", "green"), lty = 1)

[Figure: proportion of matches by number of home corners, actual vs. Pois and NegBin simulations]

### away

away_mu = mean(cor_dat$AC)
away_var = var(cor_dat$AC)

Pois

sim_ac_pois = rpois(1e5, away_mu)

NegBin

away_p = away_mu / away_var
away_r = away_mu^2 / (away_var - away_mu)

sim_ac_neg = rnbinom(1e5, away_r, away_p)

Plot

plot(prop.table(table(cor_dat$AC)), type = 'l', col = 'red', ylim = c(0, 0.19))
lines(prop.table(table(sim_ac_pois)), type = 'l', col = 'yellow')
lines(prop.table(table(sim_ac_neg)), type = 'l', col = 'green')
legend(14, 0.11, legend = c("actual", "Pois", "NegBin"),
       col = c("red", "yellow", "green"), lty = 1)

[Figure: proportion of matches by number of away corners, actual vs. Pois and NegBin simulations]

Now I have two ways to simulate the total number of corners. The first is to estimate the parameters from the total corners column and draw samples from that distribution. The second is to sum the simulated home and away corners above. I expected the two to give the same result, but they don't (although home_p is almost equal to away_p).

### the sum

totalC_mu = mean(cor_dat$totalC)
variance = var(cor_dat$totalC)
p = totalC_mu / variance
r = totalC_mu^2 / (variance - totalC_mu)

sim_totalC_neg = rnbinom(1e5, r, p)
sim_totalC_neg_sep = sim_hc_neg + sim_ac_neg

plot(prop.table(table(cor_dat$totalC)), type = 'l', col = 'red', ylim = c(0, 0.14))
lines(prop.table(table(sim_totalC_neg)), type = 'l', col = 'green')
lines(prop.table(table(sim_totalC_neg_sep)), type = 'l', col = 'blue')
legend(12, 0.14,
       legend = c("actual", "directly from the total corners", "sum of home and away corners"),
       col = c("red", "green", "blue"), lty = 1)

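[Figure: proportion of matches by total corners, actual vs. NegBin fitted to the total vs. sum of simulated home and away corners]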

So I'm wondering: does this indicate there is another distribution more appropriate than the NegBin? If there is, what could it be?

I'd also like to model the corner difference, but I don't know what distribution describes the difference of two NegBin variables.
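For two independent Poisson counts the difference follows the Skellam distribution. For two NegBin counts, whatever closed form one prefers, the PMF of the difference can at least be computed numerically by convolution. The sketch below assumes independence between the two counts (which the data may not satisfy); the function name `dnb_diff`, the truncation point `max_count`, and the parameter values are all hypothetical choices made for illustration, not estimates from the corner data.

```r
# PMF of D = X - Y for independent X ~ NB(r1, p1) and Y ~ NB(r2, p2),
# using R's size/prob parameterisation and a truncated convolution:
# P(D = d) = sum over j of P(X = j + d) * P(Y = j)
dnb_diff <- function(k, r1, p1, r2, p2, max_count = 200) {
  j <- 0:max_count
  sapply(k, function(d) {
    x <- j + d                 # value X must take when Y = j and D = d
    keep <- x >= 0             # counts cannot be negative
    sum(dnbinom(x[keep], size = r1, prob = p1) *
        dnbinom(j[keep], size = r2, prob = p2))
  })
}

# Hypothetical parameter values, for illustration only
r1 <- 8; p1 <- 0.6
r2 <- 7; p2 <- 0.6

k <- -40:40
pmf <- dnb_diff(k, r1, p1, r2, p2)

sum(pmf)      # should be very close to 1
sum(k * pmf)  # mean difference, close to r1*(1-p1)/p1 - r2*(1-p2)/p2
```

With real data one would plug in `home_r`, `home_p`, `away_r` and `away_p`, and compare `pmf` against `prop.table(table(cor_dat$diffC))` to see how much the independence assumption costs.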

Juan
  • There's at least a mild suggestion in the plots there that there may be some small home-away dependence. – Glen_b Dec 10 '23 at 02:20
  • I checked and found that P(home = i) * P(away = j) is not equal to P(home = i, away = j), so I think there is dependence. What do you suggest? – Juan Dec 12 '23 at 04:09
  • “what do you suggest” Suggestions for reaching what goal? – Sextus Empiricus Dec 12 '23 at 09:33
  • Is there a more appropriate distribution that makes P(home = i) x P(away = j) equal to P(home = i, away = j)? I think the Bivariate Poisson allows us to capture the correlation between two dependent Poisson variables; is there a distribution like that for Negative Binomial variables? – Juan Dec 12 '23 at 10:27
  • "Is there a more appropriate distribution that makes P(home = i) x P(away = j) equal to P(home = i, away = j)?" That's literally the definition of independence; you just said that this is not the case. Choosing a different distribution won't change the data, will it? – Glen_b Dec 12 '23 at 15:54
  • Why do you want to find a closed-form distribution to fit the total corner kicks? Depending on what you want to do, a closed-form distribution is not necessary. – LmnICE Dec 15 '23 at 13:22
  • Also, see https://stats.stackexchange.com/questions/286883/sum-of-correlated-negative-binomials – LmnICE Dec 15 '23 at 14:00
  • This was my thought: let's say I'm a bettor who wants to bet on the total number of corner kicks. I plot the corners of each team, and each one looks like it follows the NegBin distribution, but there's a problem with the sum, so I thought I had chosen an inappropriate distribution. – Juan Dec 16 '23 at 23:33
  • Now it seems likely that dependence causes the problem. I was thinking of a distribution that could capture the correlation between the two variables, but I believe my ideas are quite wrong. – Juan Dec 16 '23 at 23:41

1 Answer

Now I have two ways to simulate the total number of corners. The first is to estimate the parameters from the total corners column and draw samples from that distribution. The second is to sum the simulated home and away corners above. I expected the two to give the same result, but they don't (although home_p is almost equal to away_p).

The second way doesn't work because the two variables are correlated in your data, whereas your simulations draw home and away corners independently (as Glen_b already hinted in the comments).
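A minimal sketch of why this matters, using a synthetic stand-in for `cor_dat` so that it runs without downloading anything. The data frame `toy`, the Gaussian-copula construction, the correlation `rho = -0.3`, and the NegBin parameters are all assumptions made for illustration; they are not estimates from the corner data, and the sign of the real correlation should be checked with `cor(cor_dat$HC, cor_dat$AC)`.

```r
set.seed(1)

# Hypothetical data: two NegBin counts made dependent through a Gaussian copula
n   <- 5000
rho <- -0.3                                  # assumed correlation on the normal scale
z1  <- rnorm(n)
z2  <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)
toy <- data.frame(
  HC = qnbinom(pnorm(z1), size = 10, prob = 0.65),
  AC = qnbinom(pnorm(z2), size = 8,  prob = 0.65)
)
toy$totalC <- toy$HC + toy$AC

# Var(HC + AC) = Var(HC) + Var(AC) + 2 * Cov(HC, AC)
var_total <- var(toy$totalC)
var_indep <- var(toy$HC) + var(toy$AC)       # what independence would imply
cov_term  <- 2 * cov(toy$HC, toy$AC)         # the part independent draws ignore

# Summing separately simulated columns behaves like shuffling one column:
# the marginals are unchanged but the pairing between them is destroyed
sim_indep <- toy$HC + sample(toy$AC)

# Dependence-preserving alternative: resample whole matches (rows)
idx        <- sample(nrow(toy), 1e5, replace = TRUE)
sim_paired <- toy$HC[idx] + toy$AC[idx]

c(actual = var_total, independent = var(sim_indep), paired = var(sim_paired))
```

In this toy setup the independent version has a larger variance than the actual total, while resampling whole rows stays close to it. Resampling rows is only one simple way to keep the dependence; a parametric bivariate model would be another.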