Probability that number of heads exceeds sum of die rolls

Question

Let $X$ denote the sum of dots we see in $100$ die rolls, and let $Y$ denote the number of heads in $600$ coin flips. How can I compute $P(X > Y)?$

Intuitively, I don't think there's a nice way to compute the probability; however, I think that we can say $P(X > Y) \approx 1$ since $E(X) = 350$, $E(Y) = 300$, $\text{Var}(X) \approx 292$, $\text{Var}(Y) = 150$, which means that the standard deviations are pretty small.
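The moments quoted above can be checked in a few lines. This is only a sketch of the back-of-the-envelope step, assuming fair dice and coins and independence between them; the variable names are illustrative and not from the original post.

```python
from fractions import Fraction

# Single fair die: faces 1..6, each with probability 1/6.
faces = range(1, 7)
die_mean = Fraction(sum(faces), 6)                                # 7/2
die_var = Fraction(sum(f * f for f in faces), 6) - die_mean ** 2  # 35/12

n_rolls, n_flips = 100, 600

# For sums of independent variables, means add and variances add.
mean_x, var_x = n_rolls * die_mean, n_rolls * die_var
mean_y, var_y = n_flips * Fraction(1, 2), n_flips * Fraction(1, 4)

# How many standard deviations of X - Y separate the two means?
gap_in_sds = float(mean_x - mean_y) / float(var_x + var_y) ** 0.5

print(float(mean_x), float(var_x))  # 350.0 and roughly 291.67
print(float(mean_y), float(var_y))  # 300.0 and 150.0
print(round(gap_in_sds, 2))         # roughly 2.38
```

A gap of a bit over two standard deviations suggests the probability is high but not quite 1, which is why the more careful calculations in the answers are worth doing.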

Is there a better way to approach this problem? My explanation seems pretty hand-wavy, and I'd like to understand a better approach.

One way would be to use normal approximations to $X$ and $Y,$ then, by independence, to $X-Y$ — BruceET, Aug 26 '20 at 00:26
I'd just use a normal approximation unless I needed an exact answer. — Glen_b, Aug 26 '20 at 01:38
Your explanation is hand-wavy, and that is a great approach. Such quick and simple back-of-the-envelope calculations allow you to sanity check whether some other complicated calculation or model fit can even make sense. They are essentially the probability equivalent of Fermi problems. If I were interviewing you, I would be very happy indeed with your ideas. (Even happier if you came up with other approaches as well, like a simulation in any software package.) — Stephan Kolassa, Aug 26 '20 at 22:05
Could you ask your inquisitor to be more realistic?
"Everyone knows" the sum of dots we should see in 100 dice rolls and that isn't going to happen; half the reason dice games exist.

When I was about 12, a teacher got the class to throw hundreds of dice and the result was very clear.

Numbers two and five were twice as likely as statistics said they should be. Before you deny that, try it!

Wait, though… Nos two and five? Don't you know several dice games that depend on sevens? Isn't that to say, on twos and fives? — Robbie Goodwin, Aug 27 '20 at 21:05

Henry · Answer 1 · 2020-08-27T08:09:44.340

19

It is possible to do exact calculations. For example in R

rolls <- 100
flips <- 600
ddice <- rep(1/6, 6)
for (n in 2:rolls){
  ddice <- (c(0,ddice,0,0,0,0,0)+c(0,0,ddice,0,0,0,0)+c(0,0,0,ddice,0,0,0)+
            c(0,0,0,0,ddice,0,0)+c(0,0,0,0,0,ddice,0)+c(0,0,0,0,0,0,ddice))/6}
sum(ddice * (1-pbinom(1:flips, flips, 1/2))) # probability coins more
# 0.00809003
sum(ddice * dbinom(1:flips, flips, 1/2))     # probability equality
# 0.00111972
sum(ddice * pbinom(0:(flips-1), flips, 1/2)) # probability dice more
# 0.99079025

with this last figure matching BruceET's simulation
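For readers who find the R loop opaque: each pass appears to add one more die by shifting the current distribution by 1, 2, ..., 6 positions and averaging the six shifted copies. A minimal Python sketch of that single step, under the assumption that position `k` of the list holds the probability of a total of `k + 1` (the helper name `add_one_die` is made up for illustration):

```python
from fractions import Fraction

def add_one_die(pmf):
    """pmf[k] = P(total == k + 1); return the PMF after adding one fair die."""
    new = [Fraction(0)] * (len(pmf) + 6)
    for shift in range(1, 7):            # the new die shows 1, 2, ..., 6
        for k, p in enumerate(pmf):
            new[k + shift] += p / 6      # total k + 1 becomes k + 1 + shift
    return new

one_die = [Fraction(1, 6)] * 6           # totals 1..6
two_dice = add_one_die(one_die)          # totals 1..12 (total 1 has probability 0)

print(two_dice[6])   # P(total == 7) for two dice
print(two_dice[1])   # P(total == 2) for two dice
```

Repeating the step 99 times starting from one die would give a 600-entry vector for totals 1 to 600, which lines up element by element with `pbinom(1:flips, ...)` in the R code.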

The interesting parts of the probability mass functions look like this (coin flips in red, dice totals in blue)

Henry · answered Aug 26 '20 at 09:34 · edited Aug 27 '20 at 08:09

2

Love this (and not just because I'm an R-evangelist). Sad that the simulation answers got so many more upvotes. – Carl Witthoft Aug 26 '20 at 11:47
4

@CarlWitthoft: one reason why the simulation answer gets more upvotes may be that it is easier to understand and to code, and less prone to errors. I believe I'm reasonably fluent in R, but I do not understand what's happening here. I still upvoted. Why? Because the results match the simulation, that's why I'm confident they are fine. – Stephan Kolassa Aug 26 '20 at 22:10

score 17 · Accepted Answer · edited Oct 04 '22 at 12:44


Another way is by simulating a million match-offs between $X$ and $Y$ to approximate $P(X > Y) = 0.9907\pm 0.0002.$ [Simulation in R.]

set.seed(825)
d = replicate(10^6, sum(sample(1:6,100,rep=T))- 
       rbinom(1,600,.5))
mean(d > 0)
[1] 0.990736
2*sd(d > 0)/1000
[1] 0.0001916057   # aprx 95% margin of simulation error

Notes per @AntoniParellada's Comment:

In R, the function sample(1:6, 100, rep=T) simulates 100 rolls of a fair die; the sum of this simulates $X$. Also rbinom is R code for simulating a binomial random variable; here it's $Y.$ The difference is $D = X - Y.$ The procedure replicate makes a vector of a million differences d. Then (d > 0) is a logical vector of a million TRUEs and FALSEs, the mean of which is its proportion of TRUEs--our Answer. Finally, the last statement gives the margin of error of a 95% confidence interval of the proportion of TRUEs (using 2 instead of 1.96), as a reality check on the accuracy of the simulated Answer. [With a million iterations one ordinarily expects 2 or 3 decimal places of accuracy for probabilities--sometimes more for probabilities so far from 1/2.]
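For readers who don't use R, roughly the same simulation can be sketched with only the Python standard library. This is not a translation of the code above: it uses fewer trials to keep the run short, counts heads by counting the 1-bits of a random 600-bit integer, and computes the margin of error from the binomial formula instead of `sd`. The function name `simulate` and its defaults are assumptions for illustration.

```python
import random

def simulate(n_trials=20_000, seed=825):
    """Estimate P(X > Y) and an approximate 95% margin of simulation error."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        x = sum(rng.choices(range(1, 7), k=100))   # total of 100 fair die rolls
        y = bin(rng.getrandbits(600)).count("1")   # heads in 600 fair coin flips
        wins += x > y
    p_hat = wins / n_trials
    # Two standard errors of a proportion (2 used in place of 1.96).
    margin = 2 * (p_hat * (1 - p_hat) / n_trials) ** 0.5
    return p_hat, margin

p_hat, margin = simulate()
print(p_hat, margin)
```

With 20,000 trials the margin is on the order of 0.001, so only the first two or three decimals of the estimate should be trusted.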

BruceET · answered Aug 26 '20 at 00:38 · edited Oct 04 '22 at 12:44 by kjetil b halvorsen

1

Can you explain the code? I use R, so it's clear to me, but I think your post would be more useful if the code was explained, as well as the use of the binomial, and the calculation of the standard error. – Antoni Parellada Aug 26 '20 at 00:43
2

The normal approximation has round-ish numbers, so may be feasible during an interview. After all it's likely your method they'd want to see. And you already have a good start in your Question. – BruceET Aug 26 '20 at 00:54
5

Nothing wrong with politely constructive suggestions. // Usually, I get the basic answer posted as soon as possible and then fuss with it or make addenda in response to questions from OP. (Not to mention fixing typos.) – BruceET Aug 26 '20 at 01:36
5

Simulation is not proof. This is, to dig up old rivalries, an engineering solution, not a mathematical solution – Carl Witthoft Aug 26 '20 at 11:46
2

A job interview question usually seeks to find out how prospective employee approaches problems. OP's initial very approx approach seems fine. Might also mention normal approx if prompted for more exact soln. (Also not proof.) For some types of jobs it might score points to mention simulation could be used, perhaps to get a more accurate answ than by normal approx. Floundering around for exact deriv of dist'n of D might not give good impression. // @CarlWitthoft's comment maybe more appropriate on the 'math' site. Doubt many users here mistake sim for formal proof--or question its utility. – BruceET Aug 26 '20 at 16:53
@BruceET the OP did ask how to compute, not estimate – Carl Witthoft Aug 26 '20 at 19:23
5

Unless the interview was for a tenure track position in statistics, I'd say a simulation that takes five minutes to code up is more useful than a computation that takes three hours to do and double-check. (And people like me would still double-check the computation with precisely this kind of simulation.) +1 from me. – Stephan Kolassa Aug 26 '20 at 22:08
1

@StephanKolassa Here's a story (true) related to me by a Tufts professor back when I was taking a coding/modelling course in prehistoric times. Two colleagues tried to estimate something, one by calculating the analytic series expansion & summing terms; the other with a finite-element grid model. Huge mismatch in results until they realized the series expansion converged so slowly that thousands of terms were required. So -- you do need to demonstrate that your "five-minute simulation" is a valid approximation of the real system. – Carl Witthoft Aug 27 '20 at 11:58
2

@CarlWitthoft Of course you need to be able to demonstrate that the approximation is valid. But normal approximations to a binomial process aren't exactly uncharted territory. – Frans Rodenburg Aug 28 '20 at 00:05

score 16 · Answer 3 · answered Aug 26 '20 at 00:50

A bit more precise:

The variance of a sum or difference of two independent random variables is the sum of their variances. So, you have a distribution with a mean equal to $50$ and standard deviation $\sqrt{292 + 150} \approx 21$. If we want to know how often we expect this variable to be below 0, we can try to approximate our difference by a normal distribution, and we need to look up the $z$-score for $z = \frac{50}{21} \approx 2.38$. Of course, our actual distribution will be a bit wider (since we convolve a binomial pdf with a uniform distribution pdf), but hopefully this will not be too inaccurate. The probability that our total will be positive, according to a $z$-score table, is about $0.991$.
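The table lookup can also be done in code. A small sketch using only the standard library, assuming the exact variance $100 \cdot 35/12 + 600/4$ instead of the rounded $292 + 150$; the `continuity` flag is a hypothetical option that moves the cut-off from $0$ to $0.5$, which is a common adjustment when approximating an integer-valued difference.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal CDF written in terms of the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def prob_x_exceeds_y(continuity=False):
    """Normal approximation to P(X - Y > 0) for 100 dice and 600 coins."""
    mean_d = 350 - 300
    sd_d = sqrt(100 * 35 / 12 + 600 / 4)
    # D is integer valued, so D > 0 means D >= 1; the corrected cut sits at 0.5.
    cut = 0.5 if continuity else 0.0
    return normal_cdf((mean_d - cut) / sd_d)

plain = prob_x_exceeds_y()
corrected = prob_x_exceeds_y(continuity=True)
print(round(plain, 4), round(corrected, 4))
```

The uncorrected value comes out a little above 0.991 and the corrected one a little above 0.9907, so the correction shifts the result by less than 0.001.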

I ran a quick experiment in Python, running 10000 iterations, and I got $\frac{9923}{10000}$ positives. Not too far off.

My code:

import numpy as np
c = np.random.randint(0, 2, size = (10000, 100, 6)).sum(axis=-1)
d = np.random.randint(1, 7, size = (10000, 100))
(d.sum(axis=-1) > c.sum(axis=-1)).sum()
--> 9923

It might be worth a continuity correction so $\Phi\left(\frac{50-0.5}{\sqrt{292+150}}\right) \approx 0.9907256$ which compares better to the exact answer of $0.99079025$ — Henry, Aug 26 '20 at 09:38

Silverfish · Answer 4 · 2020-08-28T19:21:11.033

The following answer is a bit boring but seems to be the only one to date that contains the genuinely exact answer! Normal approximation or simulation or even just crunching the exact answer numerically to a reasonable level of accuracy, which doesn't take long, are probably the better way to go - but if you want the "mathematical" way of getting the exact answer, then:

Let $X$ denote the sum of dots we see in $100$ die rolls, with probability mass function $p_X(x)$.

Let $Y$ denote the number of heads in $600$ coin flips, with probability mass function $p_Y(y)$.

We seek $P(X > Y) = P(X - Y > 0) = P(D > 0)$ where $D = X - Y$ is the difference between sum of dots and number of heads.

Let $Z = -Y$, with probability mass function $p_Z(z) = p_Y(-z)$. Then the difference $D = X - Y$ can be rewritten as a sum $D = X + Z$ which means, since $X$ and $Z$ are independent, we can find the probability mass function of $D$ by taking the discrete convolution of the PMFs of $X$ and $Z$:

$$p_D(d) = \Pr(X + Z = d) = \sum_{k =-\infty}^{\infty} \Pr(X = k \cap Z = d - k) = \sum_{k =-\infty}^{\infty} p_X(k) p_Z(d-k) $$

In practice the sum only needs to be done over values of $k$ for which the probabilities are non-zero, of course. The idea here is exactly what @IlmariKaronen has done, I just wanted to write up the mathematical basis for it.
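To make the convolution formula concrete, here is a toy version with one die and one coin, small enough to check by hand. It is a sketch only: PMFs are stored as dictionaries mapping value to probability, so the sum automatically runs over the non-zero terms, and the names (`convolve_pmf`, `neg_coin`) are illustrative.

```python
from collections import defaultdict
from fractions import Fraction

def convolve_pmf(p_x, p_z):
    """PMF of X + Z for independent X and Z given as {value: probability}."""
    p_d = defaultdict(Fraction)
    for k, pk in p_x.items():
        for z, pz in p_z.items():
            p_d[k + z] += pk * pz   # one term p_X(k) * p_Z(d - k) with d = k + z
    return dict(p_d)

die = {v: Fraction(1, 6) for v in range(1, 7)}    # X: one fair die
coin = {0: Fraction(1, 2), 1: Fraction(1, 2)}     # Y: heads in one flip
neg_coin = {-y: p for y, p in coin.items()}       # Z = -Y, so p_Z(z) = p_Y(-z)

diff = convolve_pmf(die, neg_coin)                # D = X + Z = X - Y
p_positive = sum(p for d, p in diff.items() if d > 0)
print(p_positive)
```

By hand: the die beats a tail always and beats a head unless it shows 1, so the probability is $\frac12 \cdot 1 + \frac12 \cdot \frac56 = \frac{11}{12}$.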

Now I haven't said how to find the PMF of $X$, which is left as an exercise, but note that if $X_1, X_2, \dots, X_{100}$ are the number of dots on each of 100 independent dice rolls, each with discrete uniform PMFs on $\{1, 2, 3, 4, 5, 6\}$, then $X = X_1 + X_2 + \dots + X_{100}$ and so...

# Store the PMFs of variables as dataframes with "value" and "prob" columns.
# Important the values are consecutive and ascending for consistency when convolving,
# so include intermediate values with probability 0 if needed!

# Function to check if dataframe conforms to above definition of PMF
# Use message_intro to explain what check is failing
is.pmf <- function(x, message_intro = "") {
  if(!is.data.frame(x)) {stop(paste0(message_intro, "Not a dataframe"))}
  if(!nrow(x) > 0) {stop(paste0(message_intro, "Dataframe has no rows"))}
  if(!"value" %in% colnames(x)) {stop(paste0(message_intro, "No 'value' column"))}
  if(!"prob" %in% colnames(x)) {stop(paste0(message_intro, "No 'prob' column"))}
  if(!is.numeric(x$value)) {stop(paste0(message_intro, "'value' column not numeric"))}
  if(!all(is.finite(x$value))) {stop(paste0(message_intro, "Does 'value' contain NA, Inf, NaN etc?"))}
  if(!all(diff(x$value) == 1)) {stop(paste0(message_intro, "'value' not consecutive and ascending"))}
  if(!is.numeric(x$prob)) {stop(paste0(message_intro, "'prob' column not numeric"))}
  if(!all(is.finite(x$prob))) {stop(paste0(message_intro, "Does 'prob' contain NA, Inf, NaN etc?"))}
  if(!isTRUE(all.equal(sum(x$prob), 1))) {stop(paste0(message_intro, "'prob' column does not sum to 1"))}
  return(TRUE)
}

# Function to convolve PMFs of x and y
# Note that to convolve in R we need to reverse the second vector
# name1 and name2 are used in error reporting for the two inputs
convolve.pmf <- function(x, y, name1 = "x", name2 = "y") {
  is.pmf(x, message_intro = paste0("Checking ", name1, " is valid PMF: "))
  is.pmf(y, message_intro = paste0("Checking ", name2, " is valid PMF: "))
  x_plus_y <- data.frame(
    value = seq(from = min(x$value) + min(y$value),
                to = max(x$value) + max(y$value),
                by = 1),
    prob = convolve(x$prob, rev(y$prob), type = "open")
  )
  return(x_plus_y)
}

# Let x_i be the score on individual dice throw i
# (Note PMF of x_i is the same for each i = 1 to i = 100)
x_i <- data.frame(
  value = 1:6,         
  prob = rep(1/6, 6)   
)

# Let t_i be the total of x_1, x_2, ..., x_i
# We'll store the PMFs of t_1, t_2... in a list
t_i <- list()
t_i[[1]] <- x_i #t_1 is just x_1 so has same PMF
# PMF of t_i is convolution of PMFs of t_(i-1) and x_i 
for (i in 2:100) {
  t_i[[i]] <- convolve.pmf(t_i[[i-1]], x_i, 
        name1 = paste0("t_i[[", i-1, "]]"), name2 = "x_i")
}

# Let x be the sum of the scores of all 100 independent dice rolls
x <- t_i[[100]]
is.pmf(x, message_intro = "Checking x is valid PMF: ")

# Let y be the number of heads in 600 coin flips, so has Binomial(600, 0.5) distribution:
y <- data.frame(value = 0:600)
y$prob <- dbinom(y$value, size = 600, prob = 0.5)
is.pmf(y, message_intro = "Checking y is valid PMF: ")

# Let z be the negative of y (note we reverse the order to keep the values ascending)
z <- data.frame(value = -rev(y$value), prob = rev(y$prob))
is.pmf(z, message_intro = "Checking z is valid PMF: ")

# Let d be the difference, d = x - y = x + z
d <- convolve.pmf(x, z, name1 = "x", name2 = "z")
is.pmf(d, message_intro = "Checking d is valid PMF: ")

# Prob(X > Y) = Prob(D > 0)
sum(d[d$value > 0, "prob"])
# [1] 0.9907902


Not that it matters practically if you're just after reasonable accuracy, since the above code runs in a fraction of a second anyway, but there is a shortcut to do the convolutions for the sum of 100 independent identically distributed variables: since 100 = 64 + 32 + 4 when expressed as the sum of powers of 2, you can keep convolving your intermediate answers with themselves as much as possible. Writing the subtotals for the first $i$ dice rolls as $T_i = \sum_{k=1}^{k=i}X_k$ we can obtain the PMFs of $T_2 = X_1 + X_2$, $T_4 = T_2 + T_2'$ (where $T_2'$ is independent of $T_2$ but has the same PMF), and similarly $T_8 = T_4 + T_4'$, $T_{16} = T_8 + T_8'$, $T_{32} = T_{16} + T_{16}'$ and $T_{64} = T_{32} + T_{32}'$. We need two more convolutions to find the total score of all 100 dice as the sum of three independent variables, $X = T_{100} = ( T_{64} + T_{32}'' ) + T_4''$, and a final convolution for $D = X + Z$. So I think you only need nine convolutions in all - and for the final one, you can just restrict yourself to the parts of the convolution giving a positive value for $D$. Or if it's less hassle, the parts that give the non-positive values for $D$ and then take the complement. Provided you pick the most efficient way, I reckon that means your worst case is effectively eight-and-a-half convolutions. EDIT: and as @whuber suggests, this isn't necessarily optimal either!

Using the nine-convolution method I identified, with the gmp package so I could work with bigq objects and writing a not-at-all-optimised loop to do the convolutions (since R's built-in method doesn't deal with bigq inputs), it took just a couple of seconds to work out the exact simplified fraction:

1342994286789364913259466589226414913145071640552263974478047652925028002001448330257335942966819418087658458889485712017471984746983053946540181650207455490497876104509955761041797420425037042000821811370562452822223052224332163891926447848261758144860052289/1355477899826721990460331878897812400287035152117007099242967137806414779868504848322476153909567683818236244909105993544861767898849017476783551366983047536680132501682168520276732248143444078295080865383592365060506205489222306287318639217916612944423026688

which does indeed round to 0.9907902. Now for the exact answer, I wouldn't have wanted to do that with too many more convolutions, I could feel the gears of my laptop starting to creak!
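The exact-fraction idea can be sketched in Python, where arbitrary-precision integers are built in. This is not the gmp code described above: it assumes ordinary square-and-multiply on the number of dice, stores integer counts of outcomes instead of rational probabilities, and divides by the common denominator $6^{100} \cdot 2^{600}$ once at the end. All function names are made up for illustration.

```python
from fractions import Fraction
from math import comb

def convolve_counts(a, b):
    """Convolve two lists of integer outcome counts."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def dice_counts(n_dice):
    """counts[k] = number of ways n_dice fair dice can total n_dice + k."""
    result = [1]        # zero dice: one way to total 0
    power = [1] * 6     # one die: one way for each face 1..6
    while n_dice:
        if n_dice & 1:
            result = convolve_counts(result, power)
        n_dice >>= 1
        if n_dice:
            power = convolve_counts(power, power)   # double the number of dice
    return result

def exact_probability(n_dice=100, n_flips=600):
    """Exact P(dice total > number of heads) as a reduced fraction."""
    numerator, below, y = 0, 0, 0
    for k, ways in enumerate(dice_counts(n_dice)):
        x = n_dice + k
        # below = number of flip sequences with fewer than x heads
        while y < x and y <= n_flips:
            below += comb(n_flips, y)
            y += 1
        numerator += ways * below
    return Fraction(numerator, 6 ** n_dice * 2 ** n_flips)

prob = exact_probability()
print(float(prob))
```

Working with counts avoids reducing a fraction after every addition, which is the saving suggested by keeping big-integer numerators over a shared denominator.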

Although it's intuitively obvious that such binary decompositions produce efficiencies, it may interest you that they do not necessarily produce the most efficient method. A small example is 15 = 1111B = 1+2+4+8. Following the binary decomposition, we might compute convolutions of order 2=1+1, 4=2+2, 8=4+4, and then 15=1+2+4+8, requiring 6 operations; but this can be achieved in just 5 operations by computing 2, 4, 5=1+4, 10=5+5, 15=5+10. I recall Donald Knuth discusses this problem in Art of Computer Programming. https://stats.stackexchange.com/questions/5347 is closely related. — whuber, Aug 28 '20 at 13:54
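The 15-dice example can be explored with a small search. The sketch below looks for a shortest "star" addition chain, meaning every new term is the previous term plus some earlier term. That restriction is an assumption made to keep the search simple (it is known to give optimal chains for small targets such as 15 and 100, but not for all integers), and the function name is hypothetical.

```python
def shortest_star_chain(n):
    """Return a shortest star addition chain 1 = a0 < a1 < ... = n."""
    def extend(chain, steps_left):
        last = chain[-1]
        if last == n:
            return chain
        # Prune: even doubling at every remaining step cannot reach n.
        if steps_left == 0 or (last << steps_left) < n:
            return None
        for earlier in reversed(chain):
            nxt = last + earlier
            if nxt <= n:
                found = extend(chain + [nxt], steps_left - 1)
                if found:
                    return found
        return None

    steps = 0
    while True:                      # iterative deepening on the chain length
        found = extend([1], steps)
        if found:
            return found
        steps += 1

chain_15 = shortest_star_chain(15)
chain_100 = shortest_star_chain(100)
print(len(chain_15) - 1, chain_15)    # number of additions, then the chain
print(len(chain_100) - 1, chain_100)
```

Each addition in a chain corresponds to one convolution, so the chain length is the number of convolutions needed to reach the PMF of the full sum. For 15 the search needs 5 additions, matching the count in the comment; for 100 it needs 8, the same as the doubling scheme for the dice described in the answer.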
re the code: A few years ago I found it convenient to take the same approach by overloading the basic arithmetic operators: https://stats.stackexchange.com/a/116913/919. — whuber, Aug 28 '20 at 14:56
@whuber Nice! I think I was inspired by a bit of code like that I'd seen somewhere else actually, either from you or possibly wolfies. Re the most efficient way to do the convolutions: really interesting! It was noticeable how much slower the final convolutions were, which makes me wonder if there's any difference between minimising the number of convolutions and (better yet) minimising no of operations. I think convolving vectors of lengths $m$ and $n$ takes $mn$ multiplications and $mn-m-n+1$ additions (common denominators calculations for the exact fractions won't be negligibly cheap)... — Silverfish, Aug 28 '20 at 20:12
...It looks like big $mn$ is the problem either way, so the goal is to keep the product small while still building the sum up quickly. (The sum of $k$ dice can range from $k$ to $6k$ so the vector for its PMF is going to have to contain $m = 5k+1$ probabilities. Hence the sum $m+n$ is going to be near-proportional to the sum of the numbers of dice you're trying to build up to 100.) Doing 1+9 is an easier way to make 10 than doing 5+5, for example - better to keep the sum "lopsided". Which suggests my "doubling up" approach perhaps wasn't such a clever idea after all! — Silverfish, Aug 28 '20 at 20:23
(And more as a note to self: with the "big rationals" I suspect I could have saved a lot of time just storing the numerators as "big integers", and keeping track of a common denominator for each vector of probabilities...) — Silverfish, Aug 28 '20 at 20:28
During a process of successive convolutions, you will be working directly with the Fourier Transforms. Thus, each convolution requires $\max(m,n)$ multiplications and no additions. One solution I have advocated is to use approximations at the outset to estimate the range of final values that will have meaningfully nonzero probabilities and then limit all computation to values relevant to that final range: this puts an upper bound on how large $\max(m,n)$ will be. See https://stats.stackexchange.com/a/5482/919 for details and https://stats.stackexchange.com/a/291571/919 for even more. — whuber, Aug 29 '20 at 13:50
@whuber Yes ideally I would be using FFT - my comment only applied to the "big rationals" calculation I was using to get the exact answer, so sadly no approximations for me, and as far as I can tell the R package for gmp doesn't support convolutions/FFT for bigq objects :-( For more "practical" purposes the approximations given in your posts are very handy! — Silverfish, Aug 29 '20 at 16:38

score 2 · Answer 5 · answered Aug 27 '20 at 10:05

The exact answer is easy enough to compute numerically — no simulation needed. For educational purposes, here's an elementary Python 3 script to do so, using no premade statistical libraries.

from collections import defaultdict

# define the distributions of a single coin and die
coin = tuple((i, 1/2) for i in (0, 1))
die = tuple((i, 1/6) for i in (1, 2, 3, 4, 5, 6))

# a simple function to compute the sum of two random variables
def add_rv(a, b):
  sum = defaultdict(float)
  for i, p in a:
    for j, q in b:
      sum[i + j] += p * q
  return tuple(sum.items())

# compute the sums of 600 coins and 100 dice
coin_sum = dice_sum = ((0, 1),)
for _ in range(600): coin_sum = add_rv(coin_sum, coin)
for _ in range(100): dice_sum = add_rv(dice_sum, die)

# calculate the probability of the dice sum being higher
prob = 0
for i, p in dice_sum:
  for j, q in coin_sum:
    if i > j: prob += p * q
print("probability of 100 dice summing to more than 600 coins = %.10f" % prob)


The script above represents a discrete probability distribution as a list of (value, probability) pairs, and uses a simple pair of nested loops to compute the distribution of the sum of two random variables (iterating over all possible values of each of the summands). This is not necessarily the most efficient possible representation, but it's easy to work with and more than fast enough for this purpose.

(FWIW, this representation of probability distributions is also compatible with the collection of utility functions for modelling more complex dice rolls that I wrote for a post on our sister site a while ago.)

Of course, there are also domain-specific libraries and even entire programming languages for calculations like this. Using one such online tool, called AnyDice, the same calculation can be written much more compactly:

X: 100d6
Y: 600d{0,1}
output X > Y named "1 if X > Y, else 0"

Under the hood, I believe AnyDice calculates the result pretty much like my Python script does, except maybe with slightly more optimizations. In any case, both give the same probability of 0.9907902497 for the sum of the dice being greater than the number of heads.

If you want, AnyDice can also plot the distributions of the two sums for you. To get similar plots out of the Python code, you'd have to feed the dice_sum and coin_sum lists into a graph plotting library like pyplot.
