23

An interesting thought experiment involving flipping a fair coin is going around X/Twitter:

Flip a fair coin 100 times—it gives a sequence of heads (H) and tails (T). For each HH in the sequence of flips, Alice gets a point; for each HT, Bob does, so e.g. for the sequence THHHT Alice gets 2 points and Bob gets 1 point. Who is most likely to win?

The answer, that Bob is more likely to win, seems counterintuitive. I can of course brute-force the problem to show that 'Bob' is the correct answer, and it does seem to be somehow related to Bob's score having a lower variance than Alice's. But I'm having a hard time seeing why the variances would differ.

library(dplyr)

#Simulate X coin flips (0 = heads, 1 = tails) and count adjacent pairs:
#Alice scores a point for each "00" (HH), Bob for each "01" (HT)
sim.flip = function(X){
  s2 = sample(x = c(0, 1), size = X, replace = T) %>% as.character() %>% as.matrix()

  #Pairs at odd offsets (flips 1-2, 3-4, ...) and even offsets (flips 2-3, 4-5, ...)
  s3 = s2 %>% matrix(ncol = 2, byrow = T)
  s4 = s2[-c(1, X)] %>% matrix(ncol = 2, byrow = T)

  s5 = apply(s3, 1, function(X){ paste(X, collapse = "") })
  s6 = apply(s4, 1, function(X){ paste(X, collapse = "") })

  A = length(grep(x = s5, pattern = "00")) + length(grep(x = s6, pattern = "00"))
  B = length(grep(x = s5, pattern = "01")) + length(grep(x = s6, pattern = "01"))

  return(data.frame(A = A, B = B))
}

set.seed(12345)
sims = sapply(c(1:10000), function(X){ sim.flip(X = 100) }) %>%
  unlist() %>%
  matrix(ncol = 2, byrow = T) %>%
  as.data.frame()
colnames(sims) = c("Alice", "Bob")

sims$winner = ifelse(sims$Alice > sims$Bob, yes = "Alice", no = "Bob")
sims$winner[sims$Alice == sims$Bob] = "Tie"

table(sims$winner)

Alice   Bob   Tie 
 4626  4791   583 

The expected values are essentially the same:

mean(sims$Alice)
[1] 24.7837
mean(sims$Bob)
[1] 24.7572

The variances differ:

var(sims$Alice)
[1] 30.36575
var(sims$Bob)
[1] 6.229471

David B
  • I felt the coin tossing theory developed by William Feller in his masterpiece An Introduction to Probability Theory and Its Applications, Chapter 4 might help. – Zhanxiong Mar 20 '24 at 20:07
  • 2
    See https://stats.stackexchange.com/questions/12174 for some intuition and techniques. For other closely related threads see the hits when searching for coin HT. – whuber Mar 20 '24 at 20:17
  • 1
  • @whuber While the problem you cited is similar to this one in many respects, I don't agree that they are "duplicates" -- this one clearly requires much more technical analysis, as brute-force enumeration becomes infeasible (because the length of the HT sequence is 100 instead of 4). – Zhanxiong Mar 20 '24 at 21:16
  • @whuber: I was initially going to close this question as a duplicate but decided that it is actually a variant of other questions which is asking a substantively different thing (see answer below, which focuses on skewness of the string-count). Personally, I recommend reopening, but happy to defer to your judgment. – Ben Mar 20 '24 at 21:20
  • @Ben Thank you -- I did not appreciate the "variances differ" part of the question. – whuber Mar 20 '24 at 21:28
  • @whuber: Actually, the real kicker here is the difference in skewness, but OP also asks about the difference in variance, so I address that too. – Ben Mar 20 '24 at 21:30
  • 2
    It may be worth looking at 3 rather than 100 tosses for the intuition: Bob wins with probability $\frac38$ (HTH, HTT, THT each by 1 point) while Alice wins with probability $\frac28$ (HHH by 2 points or THH by 1 point) so Bob is more likely to win despite the expected gap being 0, balanced by Alice sometimes winning by a large number of points. This (im)balance continues with more flips. – Henry Mar 21 '24 at 23:18
  • Martin Gardner did an analysis of this in one of his Mathematical Games columns in Scientific American -- in the 1970s or early 1980s, I think. He came up with a simple procedure which, given an opponent's target sequence of N coin flips, would yield the optimum sequence to bet against them. I have no idea how many people promptly started using this as a bar-bet game with biased odds (if one party knows the trick and the other doesn't, and the person who knows it at least sometimes gets to wait until the opponent's sequence has been announced before committing to theirs). – keshlam Mar 22 '24 at 02:09
  • The obvious difference to me is that Alice can score again immediately after having scored, because just one additional H is needed, whereas Bob cannot score immediately after having scored. That is, Alice's scoring string can begin with the end of her previous one, while Bob's scoring string ends in a way that no new scoring string can start with. – U. Windl Mar 27 '24 at 12:06

3 Answers

31

This result occurs due to higher positive skewness in the string-count for Alice

This type of problem relates to "string-counts", which in turn relate to deterministic finite automata (DFAs) for strings. Your question is closely related to other string-count questions for coin-flipping here and here. The simplest way to understand these problems is to construct a DFA for the "language" of interest in the problem (i.e., the set of strings being counted) and then use this to construct a Markov chain tracking the evolution of the string-states. In the figure below I show the minimised DFA for your problem, with heads represented by the blue arrows and tails represented by the red arrows. There are four different states in the minimised DFA, and the final states of interest for Alice and Bob are at the bottom of the figure. As you can see from the picture, the transitions in the DFA are not symmetric with respect to these final states. I recommend you have a look at this figure and satisfy yourself that it represents the minimised DFA for the relevant string-states and transitions in the problem.

[Figure: minimised DFA for the problem, with four states (NULL, H, HH, HT); blue arrows are heads, red arrows are tails]

Formal analysis of string-counts can be conducted here by looking at the Markov chain with these four states and counting the number of times the chain enters the relevant final states (also called "counting states") for the two strings (i.e., the states $\text{HH}$ and $\text{HT}$). Assuming that this is a fair coin (i.e., tosses have independent outcomes with equal chance of a head or tail) the transition probability matrix for the string-states in the Markov chain (including state labels) is:

$$\mathbf{P} = \begin{bmatrix} & \text{NULL} & \text{H} & \text{HH} & \text{HT} \\ \text{NULL} & \tfrac{1}{2} & \tfrac{1}{2} & 0 & 0 \\ \text{H} & 0 & 0 & \tfrac{1}{2} & \tfrac{1}{2} \\ \text{HH} & 0 & 0 & \tfrac{1}{2} & \tfrac{1}{2} \\ \text{HT} & \tfrac{1}{2} & \tfrac{1}{2} & 0 & 0 \\ \end{bmatrix}.$$
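As a quick numerical sanity check (my addition, not part of the original answer), we can enter $\mathbf{P}$ in R, confirm that it is a valid transition matrix, and compute its stationary distribution. Since $\mathbf{P}$ is doubly stochastic, the stationary distribution is uniform, so in the long run the chain spends a quarter of its time in each of the two counting states.

#Enter the transition matrix and confirm each row sums to one
P <- matrix(c(1/2, 1/2,   0,   0,
                0,   0, 1/2, 1/2,
                0,   0, 1/2, 1/2,
              1/2, 1/2,   0,   0),
            nrow = 4, byrow = TRUE,
            dimnames = list(c('NULL', 'H', 'HH', 'HT'),
                            c('NULL', 'H', 'HH', 'HT')))
stopifnot(all(rowSums(P) == 1))

#Stationary distribution = normalised left eigenvector for eigenvalue 1
statdist <- Re(eigen(t(P))$vectors[, 1])
statdist <- statdist/sum(statdist)
statdist
#[1] 0.25 0.25 0.25 0.25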

The problem here is to find the probability that the string-count for state $\text{HH}$ is higher/equal/lower than the string-count for state $\text{HT}$ after $n=100$ coin tosses. To determine this we need to find the joint distribution of the string-counts for these two states (i.e., the number of times the chain enters these states) over the stipulated number of transitions in the chain. This can be computed analytically (see below) or we can simulate from the Markov chain to obtain the relevant string-counts (i.e., number of times the chain enters each counting state) over a large number of simulations. It turns out that the expected string-count for each string is the same for both $\text{HH}$ and $\text{HT}$ but their distributions are different, with the former having higher variance and higher positive skewness.
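For completeness (this short derivation is mine, not part of the original answer), the equality of the expected string-counts follows directly from linearity of expectation: each of the $n-1 = 99$ adjacent pairs of flips is $\text{HH}$ with probability $\tfrac{1}{4}$, and likewise for $\text{HT}$, so

$$\mathbb{E}[\text{count}] = \frac{n-1}{4} = \frac{99}{4} = 24.75$$

for both players, matching the exact moments computed below.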

By looking at the transitions in the DFA, we can already see why we would get a higher variance in the string-count for the state $\text{HH}$ (Alice) than for the string-count for $\text{HT}$ (Bob). The reason is that the former string has a direct recurring path, which means it can occur several times in a row, but if it transitions away to another state it has a longer path, on average, to come back. Conversely, the latter string has no direct recurring path allowing it to occur in consecutive coin-flips, but it transitions away to states that are, on average, closer to it as a return state. In fact, this same phenomenon also leads to a higher skewness in the string-count for the state $\text{HH}$ (Alice) than for the string-count for $\text{HT}$ (Bob). (You can easily confirm this in your own simulations --- the positive skewness for Alice is about ten times the skewness for Bob.) This occurs because the former state tends to produce some extreme cases with a high number of recurring consecutive outcomes, whereas the latter state does not. This higher skewness for $\text{HH}$, coupled with having the same expected value, means that Alice tends to win slightly less often than Bob, but when she wins, she tends to win by more points.
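If you want to check the skewness claim against the simulation in the question (a small sketch of mine, assuming the OP's `sims` data frame from above is in scope):

#Sample skewness: third central moment divided by the cubed standard deviation
skew <- function(x){ mean((x - mean(x))^3)/sd(x)^3 }
skew(sims$Alice)   #strongly positive
skew(sims$Bob)     #close to zero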


Exact computation of outcome probabilities for the game: We can compute the exact joint distribution of the string-counts using recursive computation for the specified Markov chain in R. To avoid arithmetic underflow it is best to conduct computations in log-space for the Markov chain and then convert back to probabilities for the final computations of the distributions and outcome probabilities. In the code below we compute the log-probabilities for all relevant joint states (i.e., the state of the Markov chain and the states of the two string-counts) for each iteration $n=0,1,...,100$ and we also compute the joint and marginal probability distributions for the points.

#Set log-probability array
N <- 100
LOGPROBS <- array(-Inf, 
                  dim = c(4, N+1, N+1, N+1),
                  dimnames = list(c('NULL', 'H', 'HH', 'HT'), 
                                  sprintf('Alice[%s]', 0:N), 
                                  sprintf('Bob[%s]', 0:N), 
                                  sprintf('n[%s]', 0:N)))

#Compute log-probabilities for joint states starting in NULL state
LOGPROBS[1, 1, 1, 1] <- 0
for (n in 1:N) {
for (a in 0:n) {
for (b in 0:n) {
  LOGPROBS[1, a+1, b+1, n+1] <- matrixStats::logSumExp(c(LOGPROBS[1, a+1, b+1, n] - log(2),
                                                         LOGPROBS[4, a+1, b+1, n] - log(2)))
  LOGPROBS[2, a+1, b+1, n+1] <- LOGPROBS[1, a+1, b+1, n+1]
  if (a > 0) {
    LOGPROBS[3, a+1, b+1, n+1] <- matrixStats::logSumExp(c(LOGPROBS[2, a, b+1, n] - log(2),
                                                           LOGPROBS[3, a, b+1, n] - log(2)))
  }
  if (b > 0) {
    LOGPROBS[4, a+1, b+1, n+1] <- matrixStats::logSumExp(c(LOGPROBS[2, a+1, b, n] - log(2),
                                                           LOGPROBS[3, a+1, b, n] - log(2)))
  }
} } }

#Compute log-probabilities for string-counts at n = 100
LOGPROBS.FINAL <- array(-Inf, dim = c(N+1, N+1),
                        dimnames = list(sprintf('Alice[%s]', 0:N),
                                        sprintf('Bob[%s]', 0:N)))
for (a in 0:N) {
for (b in 0:N) {
  LOGPROBS.FINAL[a+1, b+1] <- matrixStats::logSumExp(LOGPROBS[, a+1, b+1, N+1])
} }

#Compute joint probability distribution of points
PROBS.JOINT <- exp(LOGPROBS.FINAL)

#Compute marginal probability distributions of points
PROBS.MARG <- matrix(0, nrow = 2, ncol = N+1)
rownames(PROBS.MARG) <- c('Alice', 'Bob')
colnames(PROBS.MARG) <- 0:N
for (a in 0:N) { PROBS.MARG[1, a+1] <- exp(matrixStats::logSumExp(LOGPROBS.FINAL[a+1, ])) }
for (b in 0:N) { PROBS.MARG[2, b+1] <- exp(matrixStats::logSumExp(LOGPROBS.FINAL[, b+1])) }

This now allows us to compute the exact probabilities of the three possible outcomes of the game at $n=100$ (where either Alice wins, Bob wins, or we have a tied game).

#Compute outcome probabilities
LOGPROBS.WIN <- c(-Inf, -Inf, -Inf)
names(LOGPROBS.WIN) <- c('Alice Wins', 'Tie', 'Bob Wins')
for (a in 0:N) {
for (b in 0:N) {
  if (a > b)  { LOGPROBS.WIN[1] <- matrixStats::logSumExp(c(LOGPROBS.WIN[1], LOGPROBS.FINAL[a+1, b+1])) }
  if (a == b) { LOGPROBS.WIN[2] <- matrixStats::logSumExp(c(LOGPROBS.WIN[2], LOGPROBS.FINAL[a+1, b+1])) }
  if (a < b)  { LOGPROBS.WIN[3] <- matrixStats::logSumExp(c(LOGPROBS.WIN[3], LOGPROBS.FINAL[a+1, b+1])) } } }

#Plot outcome probabilities
PROBS.WIN <- exp(LOGPROBS.WIN)
barplot(PROBS.WIN, ylim = c(0, 1), col = 'darkblue',
        main = 'Outcome Probabilities (n = 100)', ylab = 'Probability')

#Print outcome probabilities
PROBS.WIN

Alice Wins        Tie   Bob Wins 
0.45764026 0.05652694 0.48583280 

[Figure: bar plot of the outcome probabilities at n = 100]

We see from this computation that there is a 45.76% chance that Alice wins the game, a 48.58% chance that Bob wins the game, and a 5.65% chance that the game ends in a tie. This confirms that Bob is more likely to win the game than Alice. We can also compute the moments of each player's points distribution and plot the marginal distributions to see the differences.

#Compute moments
PA <- unname(PROBS.MARG[1, ])
PB <- unname(PROBS.MARG[2, ])
MEAN.A <- sum(PA*(0:N))
MEAN.B <- sum(PB*(0:N))
VAR.A  <- sum(PA*(0:N - MEAN.A)^2)
VAR.B  <- sum(PB*(0:N - MEAN.B)^2)
SKEW.A <- sum(PA*(0:N - MEAN.A)^3)/VAR.A^(3/2)
SKEW.B <- sum(PB*(0:N - MEAN.B)^3)/VAR.B^(3/2)
KURT.A <- sum(PA*(0:N - MEAN.A)^4)/VAR.A^2
KURT.B <- sum(PB*(0:N - MEAN.B)^4)/VAR.B^2
MOMENTS <- data.frame(Mean      = c(MEAN.A, MEAN.B),
                      Var       = c(VAR.A,  VAR.B),
                      Skew      = c(SKEW.A, SKEW.B),
                      Kurt      = c(KURT.A, KURT.B),
                      Ex.Kurt   = c(KURT.A, KURT.B) - 3,
                      row.names = c('Alice', 'Bob'))

#Print moments
MOMENTS

       Mean     Var         Skew     Kurt     Ex.Kurt
Alice 24.75 30.8125 2.148656e-01 3.024579  0.02457941
Bob   24.75  6.3125 1.185231e-13 2.980198 -0.01980198
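These exact variances can also be obtained by a direct covariance calculation (a derivation I am adding for completeness; it is not part of the original answer). Write Alice's score as a sum of pair-indicators $I_i = \mathbb{1}\{X_i X_{i+1} = \text{HH}\}$ for $i = 1, \ldots, 99$; each indicator has variance $\tfrac{1}{4} - \tfrac{1}{16} = \tfrac{3}{16}$, and only overlapping (adjacent) indicators are correlated. For $\text{HH}$, overlapping pairs can both occur (via $\text{HHH}$), so $\operatorname{Cov}(I_i, I_{i+1}) = \tfrac{1}{8} - \tfrac{1}{16} = \tfrac{1}{16}$; for $\text{HT}$, overlapping pairs are mutually exclusive, so $\operatorname{Cov}(I_i, I_{i+1}) = 0 - \tfrac{1}{16} = -\tfrac{1}{16}$. Hence

$$\operatorname{Var}(\text{Alice}) = 99 \cdot \tfrac{3}{16} + 2 \cdot 98 \cdot \tfrac{1}{16} = \tfrac{493}{16} = 30.8125, \qquad \operatorname{Var}(\text{Bob}) = 99 \cdot \tfrac{3}{16} - 2 \cdot 98 \cdot \tfrac{1}{16} = \tfrac{101}{16} = 6.3125,$$

matching the table above. The sign of the overlap covariance is precisely the asymmetry that the DFA picture displays.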

#Plot marginal points distributions
PA <- unname(PROBS.MARG[1, ])
PB <- unname(PROBS.MARG[2, ])
for (a in 0:N) { if (PA[a+1] < 1e-4) { PA[a+1] <- NA } }
for (b in 0:N) { if (PB[b+1] < 1e-4) { PB[b+1] <- NA } }
plot(0, xlim = c(0, N), ylim = c(-0.16, 0.16), type = 'n', yaxt = 'n',
     main = 'Points Distribution (n = 100)', xlab = 'Points', ylab = '')
barplot(height = PA, col = 'red', add = TRUE, axes = FALSE)
barplot(height = -PB, col = 'blue', add = TRUE, axes = FALSE)
text(x = 60, y = 0.030, labels = 'Alice', col = 'darkred', cex = 1.4)
text(x = 60, y = -0.028, labels = 'Bob', col = 'darkblue', cex = 1.4)

[Figure: mirrored bar plot of the marginal points distributions, Alice (top, red) and Bob (bottom, blue), at n = 100]

Ben
  • 6
    +1. This is also an example of a more general phenomenon that win probabilities, which are functions of joint distributions, behave much less intuitively than expectations, which are functions of marginal distributions. – Thomas Lumley Mar 20 '24 at 22:01
  • For those less familiar with R it would be great to write the recursion in formulas. Your code basically computes the probability of each possible event P(NULL=a, H=b, HT=c, HH=d) for integers $a,b,c,d$? – Julian Mar 21 '24 at 08:50
  • 3
    What does the null state represent? Is it just a name for all the irrelevant states, i.e. T, TH etc.? – AccidentalTaylorExpansion Mar 21 '24 at 10:35
  • 1
    @AccidentalTaylorExpansion, yes, except that TH is a relevant state, because on the next flip somebody gets a point. But TH is included in the H state (there's no difference between TH and getting H as the first flip). – cjm Mar 21 '24 at 17:01
  • 3
    I don't believe your argument rigorously holds. If we're looking at the distribution of Alice's points minus Bob's points, a positive skew or a long right tail away from the mean does not always imply the median is less than the mean - for many counterexamples see https://www.tandfonline.com/doi/pdf/10.1080/10691898.2005.11910556. These counterexamples are certainly rare in some sense, so I agree that this argument still provides a good amount of intuition. – Alexander51413 Mar 21 '24 at 18:26
  • 5
    @Alexander51413: The rigour here is the exact calculation; the stuff about skewness is to aid the intuition. Also note that part of my intuitive argument here is that the skewness is higher in one case but the means are the same. Moments obscure the exact distribution, so this cannot be made to be an exact argument, but it aids intuition. – Ben Mar 21 '24 at 22:48
  • 2
    Did anyone find an actual rigorous proof? (other than direct computation) – Thomas Ahle Mar 30 '24 at 07:10
6

The key insight is that although both players can expect to get the same number of points over a set of games, Alice's wins (i.e., where she got the most points) tend to be bigger but rarer. Bob's wins have a smaller margin, and happen more often. Looking at all the possibilities in a 3-flip game, we see:

Game    Score (A vs B)    Winner
$HHH$   2 vs 0            A
$HHT$   1 vs 1            -
$HTH$   0 vs 1            B
$HTT$   0 vs 1            B
$THH$   1 vs 0            A
$THT$   0 vs 1            B
$TTH$   0 vs 0            -
$TTT$   0 vs 0            -

In this case, each player got 4 points over all the games, but Bob wins 3 games to 2.
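This tally is easy to reproduce, and to extend to any small $n$, by brute-force enumeration. The sketch below is my addition rather than part of the original answer (the helper name `count.wins` is mine):

#Enumerate all 2^n flip sequences and tally wins from adjacent-pair scores
count.wins <- function(n) {
  flips <- expand.grid(rep(list(c('H', 'T')), n))   #all 2^n sequences
  score <- function(seq, pat) {
    pairs <- paste0(seq[-n], seq[-1])               #the n-1 adjacent pairs
    sum(pairs == pat)
  }
  A <- apply(flips, 1, score, pat = 'HH')           #Alice's points per game
  B <- apply(flips, 1, score, pat = 'HT')           #Bob's points per game
  c(Alice = sum(A > B), Bob = sum(B > A), Tie = sum(A == B))
}
count.wins(3)   #Alice 2, Bob 3, Tie 3
count.wins(4)   #Alice 4, Bob 6, Tie 6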

Essentially, Alice's points are squandered by being thrown at games she's already won (just the $HHH$ one in this case). It's similar to any game where you play by winning intermediate sub-games (games/sets in tennis, constituencies in FPTP elections etc), as each sub-game win is a non-linear decision function that ignores the magnitude of the win.

And that's why it's counter-intuitive: it takes effort to realize the consequences that arise when equally-likely points are not spread evenly over the possible games.

SusanW
-3

It's even: 50/50. I simulated it myself. They are confusing this riddle with the HHT vs. HTT one, which is not even. An easy way to check is to look at 4 flips and write out all 16 permutations. They each get 12 points.

Jason
  • 10
    Hi Jason, I think you've misunderstood the puzzle. It isn't about how many points (which is even, as you say); it's about who gets the most, and who has the greater likelihood of winning. If Alice wins, she tends to have more points; Bob's wins have a smaller margin, and they happen more often. If played regularly, Bob gets more wins. In your example with 4 flips, Bob wins 6 games to 4. – SusanW Mar 21 '24 at 20:07
  • 2
    Cool, the riddle is more interesting than I thought. – Jason Mar 21 '24 at 20:16