I might be overthinking this. I generated the output in R and 5 of my 10 samples were successful, so that's 50%. Given that, if I am to estimate the probability of two or more people in a group of 30 sharing a birthday, what is my total sample? Should I be using combinations?
-
4I'd say you're underthinking it. https://en.wikipedia.org/wiki/Birthday_problem – Mark L. Stone Feb 08 '18 at 20:14
-
4What do you mean by "total sample"? Wouldn't that be your "group of 30"? – whuber Feb 08 '18 at 20:29
2 Answers
How are you generating your birthdays? To generate 23 birthdays:
dates = sample(1:365, 23, replace = TRUE)
To see if 2 or more share the same birthday:
length(dates) != length(unique(dates)) # TRUE if there are duplicates
How often is the above TRUE?
dupe_count = 0
runs = 1000000
for (i in 1:runs) {
dates = sample(1:365, 23, replace = TRUE)
if (length(dates) != length(unique(dates))) {
dupe_count = dupe_count + 1
}
}
print(dupe_count / runs)
[1] 0.508158
This closely matches the theoretical value of 50.7% in the wikipedia page
- 548
-
1With the expression
round(runif(23, 1, 365))you do not generate dates uniformly: the chances of $1$ or $365$ are only half the chances of any other numbers between them. This error won't noticeably affect the answer in this particular example, but repeating it in other situations with smaller ranges of numbers could introduce serious errors that might be difficult to detect. Interestingly, a two-sided test of your result yields p=3.6%, intimating it might be wrong: it's a tiny bit too large. – whuber Feb 08 '18 at 22:53 -
@whuber Good catch. I replaced it with
sample(1:365, 23, replace = TRUE)– reisner Feb 08 '18 at 23:12 -
p = 1-cumprod(seq(365,266,by=-1) / 365) plot(1:100,p,type='o') abline(h=0.5,col='red') – HEITZ Feb 08 '18 at 23:55
Here is slightly simplified R code for finding the probability of at least one birthday match and the expected number of matches in a room with 23 randomly chosen people.
The number of matches is the total number of 'redundant' birthdays. So if A and B
share a birthday and C and D share a birthday, that is two matches. It is also
two matches if E, F, and G all share the same birthday. [At the end of the code nr.mat > 0 is a logical vector with a million TRUEs and FALSEs; its mean is the proportion of its TRUEs.]
set.seed(1210)
m = 10^6; n=23
nr.unq = replicate(m, length(unique(sample(1:365,n,rep=T))))
nr.mat = n - nr.unq
mean(nr.mat > 0)
[1] 0.507083 # aprx P(At Least 1 Match)
mean(nr.mat)
[1] 0.679527 # aprx E(Nr of Matches)
table(nr.mat)/m
nr.mat
0 1 2 3 4 5 6 7
0.492917 0.363493 0.118073 0.022461 0.002798 0.000237 0.000019 0.000002
hist(nr.mat, prob=T, br=(-1:max(nr.mat))+.5, col="skyblue2")
Note: Demonstration of simulating the first 'room' out of the million to be simulated. Any such simulation requires code for performing the experiment (picking 23 birthdays) and code for quantifying the result (counting matches).
The 'sample' function samples a given number of objects (second
argument) from a population (first argument). Sampling is
without replacement by default. Because birthday matches are
possible we are sampling with replacement. so we need
the argument rep=T in order to activate sampling with replacement.
set.seed(1210)
b = sample(1:365, 23, rep=T)
b
[1] 342 166 265 24 34 270 71 268 111 230 186 100
[13] 239 106 350 27 345 268 33 135 33 32 178
Notice that 33 appears twice: one match. Also,268 appears twice: a second match. Function
unique removes the two 'redundant' values. When we know how to
choose birthdays and count matches for one 'room', we can easily
loop though many rooms.
unique(b)
[1] 342 166 265 24 34 270 71 268 111 230 186 100
[13] 239 106 350 27 345 33 135 32 178
Function 'length` counts unique birthdays. There are 21. So at the end $n - 21 = 23 - 21 = 2$ matches.
length(unique(b))
[1] 21
This page has additional discussion about matching birthdays.
- 56,185
