I assume that the number of home corners and away corners would follow the Negative Binomial Distribution and I expect that these two variables have the same parameter p and the sum of these two would also be a Negative Binomial Distribution.
Take the number of corners in football as an example
### code for dataset
eng = read.csv('https://www.football-data.co.uk/mmz4281/2223/E0.csv')[, c('HC', 'AC')]
spa = read.csv('https://www.football-data.co.uk/mmz4281/2223/SP1.csv')[, c('HC', 'AC')]
ita = read.csv('https://www.football-data.co.uk/mmz4281/2223/I1.csv')[, c('HC', 'AC')]
ger = read.csv('https://www.football-data.co.uk/mmz4281/2223/D1.csv')[, c('HC', 'AC')]
fra = read.csv('https://www.football-data.co.uk/mmz4281/2223/F1.csv')[, c('HC', 'AC')]
cor_dat = rbind(eng, spa, ita, ger, fra)
cor_dat$totalC = cor_dat$HC + cor_dat$AC
cor_dat$diffC = cor_dat$HC - cor_dat$AC
Check out my assumption
### home
home_mu = mean(cor_dat$HC)
home_var = var(cor_dat$HC)
Pois
sim_hc_pois = rpois(1e5, home_mu)
NegBin
sim_hc_pois = rpois(1e5, home_mu)
home_p = home_mu / home_var
home_r = home_mu**2 / (home_var - home_mu)
sim_hc_neg = rnbinom(1e5, home_r, home_p)
Plot
plot(prop.table(table(cor_dat$HC)), type = 'l', col = 'red', ylim = c(0, 0.17),
xlab = 'home corners', ylab = 'proportion')
lines(prop.table(table(sim_hc_pois)), type = 'l', col = 'yellow')
lines(prop.table(table(sim_hc_neg)), type = 'l', col = 'green')
legend(14, 0.11, legend=c("actual", "Pois", "NegBin"),
col=c("red", "yellow", 'green'), lty = 1)
### away
away_mu = mean(cor_dat$AC)
away_var = var(cor_dat$AC)
Pois
sim_ac_pois = rpois(1e5, away_mu)
NegBin
sim_ac_pois = rpois(1e5, away_mu)
away_p = away_mu / away_var
away_r = away_mu**2 / (away_var - away_mu)
sim_ac_neg = rnbinom(1e5, away_r, away_p)
Plot
plot(prop.table(table(cor_dat$AC)), type = 'l', col = 'red', ylim = c(0, 0.19))
lines(prop.table(table(sim_ac_pois)), type = 'l', col = 'yellow')
lines(prop.table(table(sim_ac_neg)), type = 'l', col = 'green')
legend(14, 0.11, legend=c("actual", "Pois", "NegBin"),
col=c("red", "yellow", 'green'), lty = 1)
Now I have two ways to simulate the total number of corners: one is to get the parameters from the total corners columns and draw samples from there. The second way is to sum up the simulated home and away corners above. I expected they would give me the same result but in fact it doesn't (although home_p is almost equal to away_p)
### the sum
totalC_mu = mean(cor_dat$TotalC)
variance = var(cor_dat$TotalC)
p = totalC_mu / variance
r = totalC_mu**2 / (variance - totalC_mu)
sim_totalC_neg = rnbinom(1e5, r, p)
sim_totalC_neg_sep = sim_hc_neg + sim_ac_neg
plot(prop.table(table(cor_dat$TotalC)), type = 'l', col = 'red', ylim = c(0, 0.14))
lines(prop.table(table(sim_totalC_neg)), type = 'l', col = 'green')
lines(prop.table(table(sim_totalC_neg_sep)), type = 'l', col = 'blue')
legend(12, 0.14, legend=c("actual", "directly from the total coners",
"sum of home and away corners"),
col=c("red", "green", 'blue'), lty = 1)
So I'm wondering: does this indicate there is another distribution more appropriate than the NegBin? If there is, what could it be?
I also want to try out for the corners difference but I don't know what distribution describes the difference of two NegBin variables.


