Birthday paradox: How to estimate the probability of two or more people in a group of 30 sharing a birthday?

Question

I might be overthinking this. I generated the output in R and 5 of my 10 samples were successful, so that's 50%. Given that, if I am to estimate the probability of two or more people in a group of 30 sharing a birthday, what is my total sample? Should I be using combinations?

I'd say you're underthinking it. https://en.wikipedia.org/wiki/Birthday_problem — Mark L. Stone, Feb 08 '18 at 20:14
What do you mean by "total sample"? Wouldn't that be your "group of 30"? — whuber, Feb 08 '18 at 20:29

reisner · Answer 1 · 2018-02-08T23:11:49.777

2

How are you generating your birthdays? To generate 23 birthdays:

dates = sample(1:365, 23, replace = TRUE)

To see if 2 or more share the same birthday:

length(dates) != length(unique(dates)) # TRUE if there are duplicates

How often is the above TRUE?

dupe_count = 0
runs = 1000000
for (i in 1:runs) {
  dates = sample(1:365, 23, replace = TRUE)
  if (length(dates) != length(unique(dates))) {
    dupe_count = dupe_count + 1
  }
}
print(dupe_count / runs)

[1] 0.508158

This closely matches the theoretical value of 50.7% in the wikipedia page

edited Feb 08 '18 at 23:11

answered Feb 08 '18 at 20:47

reisner

548

1

With the expression round(runif(23, 1, 365)) you do not generate dates uniformly: the chances of $1$ or $365$ are only half the chances of any other numbers between them. This error won't noticeably affect the answer in this particular example, but repeating it in other situations with smaller ranges of numbers could introduce serious errors that might be difficult to detect. Interestingly, a two-sided test of your result yields p=3.6%, intimating it might be wrong: it's a tiny bit too large. – whuber Feb 08 '18 at 22:53
@whuber Good catch. I replaced it with sample(1:365, 23, replace = TRUE) – reisner Feb 08 '18 at 23:12
p = 1-cumprod(seq(365,266,by=-1) / 365) plot(1:100,p,type='o') abline(h=0.5,col='red') – HEITZ Feb 08 '18 at 23:55

BruceET · Answer 2 · 2019-10-12T08:47:47.407

Here is slightly simplified R code for finding the probability of at least one birthday match and the expected number of matches in a room with 23 randomly chosen people.

The number of matches is the total number of 'redundant' birthdays. So if A and B share a birthday and C and D share a birthday, that is two matches. It is also two matches if E, F, and G all share the same birthday. [At the end of the code nr.mat > 0 is a logical vector with a million TRUEs and FALSEs; its mean is the proportion of its TRUEs.]

set.seed(1210)
m = 10^6;  n=23
nr.unq = replicate(m, length(unique(sample(1:365,n,rep=T))))
nr.mat = n - nr.unq
mean(nr.mat > 0)
[1] 0.507083      # aprx P(At Least 1 Match)
mean(nr.mat)
[1] 0.679527      # aprx E(Nr of Matches)
table(nr.mat)/m
nr.mat
       0        1        2        3        4        5        6        7 
0.492917 0.363493 0.118073 0.022461 0.002798 0.000237 0.000019 0.000002 

hist(nr.mat, prob=T, br=(-1:max(nr.mat))+.5, col="skyblue2")

Note: Demonstration of simulating the first 'room' out of the million to be simulated. Any such simulation requires code for performing the experiment (picking 23 birthdays) and code for quantifying the result (counting matches).

The 'sample' function samples a given number of objects (second argument) from a population (first argument). Sampling is without replacement by default. Because birthday matches are possible we are sampling with replacement. so we need the argument rep=T in order to activate sampling with replacement.

set.seed(1210)
b = sample(1:365, 23, rep=T)
b
 [1] 342 166 265  24  34 270  71 268 111 230 186 100
[13] 239 106 350  27 345 268  33 135  33  32 178

Notice that 33 appears twice: one match. Also,268 appears twice: a second match. Function unique removes the two 'redundant' values. When we know how to choose birthdays and count matches for one 'room', we can easily loop though many rooms.

unique(b)
 [1] 342 166 265  24  34 270  71 268 111 230 186 100
[13] 239 106 350  27 345  33 135  32 178

Function 'length` counts unique birthdays. There are 21. So at the end $n - 21 = 23 - 21 = 2$ matches.

length(unique(b))
[1] 21

This page has additional discussion about matching birthdays.

Birthday paradox: How to estimate the probability of two or more people in a group of 30 sharing a birthday?

2 Answers2

Linked