The sum of two Negative Binomial variables doesn't follow a Negative Binomial

Question

I assume that the number of home corners and the number of away corners each follow a Negative Binomial distribution. I expect the two variables to share the same parameter p, so that their sum would also follow a Negative Binomial distribution.

Take the number of corners in football as an example

### code for dataset
eng = read.csv('https://www.football-data.co.uk/mmz4281/2223/E0.csv')[, c('HC', 'AC')]
spa = read.csv('https://www.football-data.co.uk/mmz4281/2223/SP1.csv')[, c('HC', 'AC')]
ita = read.csv('https://www.football-data.co.uk/mmz4281/2223/I1.csv')[, c('HC', 'AC')]
ger = read.csv('https://www.football-data.co.uk/mmz4281/2223/D1.csv')[, c('HC', 'AC')]
fra = read.csv('https://www.football-data.co.uk/mmz4281/2223/F1.csv')[, c('HC', 'AC')]
cor_dat = rbind(eng, spa, ita, ger, fra)
cor_dat$totalC = cor_dat$HC + cor_dat$AC
cor_dat$diffC = cor_dat$HC - cor_dat$AC

Checking my assumption:

### home
home_mu = mean(cor_dat$HC)
home_var = var(cor_dat$HC)
# Pois
sim_hc_pois = rpois(1e5, home_mu)
# NegBin
home_p = home_mu / home_var
home_r = home_mu**2 / (home_var - home_mu)
sim_hc_neg = rnbinom(1e5, home_r, home_p)
# Plot
plot(prop.table(table(cor_dat$HC)), type = 'l', col = 'red', ylim = c(0, 0.17),
     xlab = 'home corners', ylab = 'proportion')
lines(prop.table(table(sim_hc_pois)), type = 'l', col = 'yellow')
lines(prop.table(table(sim_hc_neg)), type = 'l', col = 'green')
legend(14, 0.11, legend=c("actual", "Pois", "NegBin"),
       col=c("red", "yellow", 'green'), lty = 1)

### away
away_mu = mean(cor_dat$AC)
away_var = var(cor_dat$AC)
# Pois
sim_ac_pois = rpois(1e5, away_mu)
# NegBin
away_p = away_mu / away_var
away_r = away_mu**2 / (away_var - away_mu)
sim_ac_neg = rnbinom(1e5, away_r, away_p)
# Plot
plot(prop.table(table(cor_dat$AC)), type = 'l', col = 'red', ylim = c(0, 0.19))
lines(prop.table(table(sim_ac_pois)), type = 'l', col = 'yellow')
lines(prop.table(table(sim_ac_neg)), type = 'l', col = 'green')
legend(14, 0.11, legend=c("actual", "Pois", "NegBin"),
       col=c("red", "yellow", 'green'), lty = 1)

Now I have two ways to simulate the total number of corners. The first is to estimate the parameters from the total corners column and draw samples from that distribution. The second is to sum the simulated home and away corners above. I expected them to give the same result, but in fact they don't (although home_p is almost equal to away_p).

### the sum
# the column was created as totalC (R names are case-sensitive)
totalC_mu = mean(cor_dat$totalC)
variance = var(cor_dat$totalC)
p = totalC_mu / variance
r = totalC_mu**2 / (variance - totalC_mu)
sim_totalC_neg = rnbinom(1e5, r, p)
sim_totalC_neg_sep = sim_hc_neg + sim_ac_neg
plot(prop.table(table(cor_dat$totalC)), type = 'l', col = 'red', ylim = c(0, 0.14))
lines(prop.table(table(sim_totalC_neg)), type = 'l', col = 'green')
lines(prop.table(table(sim_totalC_neg_sep)), type = 'l', col = 'blue')
legend(12, 0.14, legend=c("actual", "directly from the total corners",
                          "sum of home and away corners"),
       col=c("red", "green", 'blue'), lty = 1)

So I'm wondering: does this indicate there is another distribution more appropriate than the NegBin? If there is, what could it be?

I would also like to model the corner difference, but I don't know what distribution describes the difference of two NegBin variables.

There's at least a mild suggestion in the plots there that there may be some small home-away dependence. — Glen_b, Dec 10 '23 at 02:20
I checked and found that P(home = i) * P(away = j) is not equal to P(home = i + away = j), so I think there is dependence. What do you suggest? — Juan, Dec 12 '23 at 04:09
“what do you suggest” Suggestions for reaching what goal? — Sextus Empiricus, Dec 12 '23 at 09:33
Is there a more appropriate distribution that makes P(home = i) x P(away = j) equal to P(home = i + away = j)? I think the Bivariate Poisson allows us to capture the correlation between two dependent Poisson variables, is there a distribution like that for Negative Binomial variables? — Juan, Dec 12 '23 at 10:27
"Is there a more appropriate distribution that makes P(home = i) x P(away = j) equal to P(home = i + away = j)?" That's literally the definition of independence; you just said that this is not the case. Choosing a different distribution won't change the data, will it? — Glen_b, Dec 12 '23 at 15:54
Why do you want to find a closed-form distribution to fit the total corner kicks? Depending on what you want to do, a closed-form distribution is not necessary. — LmnICE, Dec 15 '23 at 13:22
Also, see https://stats.stackexchange.com/questions/286883/sum-of-correlated-negative-binomials — LmnICE, Dec 15 '23 at 14:00
This was my thought: let's say I'm a bettor who wants to bet on the total number of corner kicks. I plot the corners of each team, and each one looks like it follows the NegBin distribution, but there's a problem with the sum, so I think I chose an inappropriate distribution. — Juan, Dec 16 '23 at 23:33
Now it seems likely that dependence causes the problem. I was thinking of a distribution able to capture the correlation between the two variables, but I believe my ideas are quite wrong. — Juan, Dec 16 '23 at 23:41

score 2 · Accepted Answer · answered Dec 12 '23 at 09:31

Now I have two ways to simulate the total number of corners. The first is to estimate the parameters from the total corners column and draw samples from that distribution. The second is to sum the simulated home and away corners above. I expected them to give the same result, but in fact they don't (although home_p is almost equal to away_p).

The second way doesn't work because the two variables are correlated in your data, whereas your simulations assume independence (as already hinted by Glen_b in the comments).
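A minimal sketch of this point in R, using synthetic data instead of the question's CSV files. The shared match-level factor, its parameters, and the names `shared`, `sim_total_paired` and `sim_total_indep` are assumptions made for illustration, and the positive sign of the dependence here is arbitrary (the real data may differ). The sketch checks the identity Var(H + A) = Var(H) + Var(A) + 2 Cov(H, A), then compares a simulation that resamples whole matches with one that draws home and away counts independently. Resampling rows is one possible workaround, not something proposed in the thread.

```r
set.seed(1)

# Synthetic stand-in for cor_dat (assumption: NOT the real football data).
# A hypothetical match-level factor `shared` with mean 1 makes HC and AC dependent.
n = 5000
shared = rgamma(n, shape = 20, rate = 20)
cor_dat = data.frame(
  HC = rnbinom(n, size = 12, mu = 5.5 * shared),
  AC = rnbinom(n, size = 12, mu = 4.5 * shared)
)
cor_dat$totalC = cor_dat$HC + cor_dat$AC

# 1. Dependence check: sample correlation and covariance of home and away counts
cor_ha = cor(cor_dat$HC, cor_dat$AC)
cov_ha = cov(cor_dat$HC, cor_dat$AC)

# 2. Variance of the total = sum of the marginal variances + 2 * covariance
v_indep = var(cor_dat$HC) + var(cor_dat$AC)   # what an independent simulation targets
v_total = var(cor_dat$totalC)                 # what the data actually show
print(all.equal(v_total, v_indep + 2 * cov_ha))

# 3a. Resample whole matches: each home count stays paired with its away count
idx = sample(nrow(cor_dat), 1e5, replace = TRUE)
sim_total_paired = cor_dat$HC[idx] + cor_dat$AC[idx]

# 3b. Draw home and away counts separately: the pairing (and covariance) is lost
sim_total_indep = sample(cor_dat$HC, 1e5, replace = TRUE) +
  sample(cor_dat$AC, 1e5, replace = TRUE)

print(c(correlation = cor_ha,
        actual      = v_total,
        paired      = var(sim_total_paired),
        independent = var(sim_total_indep)))
```

With the real data, the same checks could be run on `cor_dat` as built in the question. If `cov_ha` is clearly non-zero, a simulation that draws home and away corners separately will misstate the spread of the total, whichever marginal distribution is used.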
