
Consider the following process:

There are $n$ bins arranged from top to bottom. Initially, each bin contains one ball. In every step, we

  1. pick a ball $b$ uniformly at random and
  2. move all the balls from the bin containing $b$ to the bin below it. If that bin was already the lowest, we remove those balls from the process.

How many steps does it take in expectation until the process terminates, i.e., until all $n$ balls have been removed from the process? Has this been studied before? Does the answer follow easily from known techniques?

In the best case, the process can finish after $n$ steps; in the worst case, it can take $\Theta(n^2)$ steps. Both extremes should be very unlikely, though. My conjecture is that it takes $\Theta(n\log n)$ steps, and I did some experiments which seem to confirm this (see the simulation sketch below).

(Note that picking a bin uniformly at random is a very different process that will obviously take $\Theta(n^2)$ steps to finish.)
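For concreteness, here is a minimal Python sketch of one way to run such an experiment (one possible implementation, not necessarily the code behind the experiments mentioned above):

```python
import random

def simulate(n):
    """Run the process once; return the number of steps until all
    n balls have left the system."""
    bins = [1] * n          # bins[0] is the lowest bin; one ball each
    remaining = n           # balls still in the system
    steps = 0
    while remaining > 0:
        steps += 1
        # Pick a ball uniformly at random among the remaining balls
        # and locate its bin by scanning the counts.
        r = random.randrange(remaining)
        i = 0
        while r >= bins[i]:
            r -= bins[i]
            i += 1
        # Move the whole contents of bin i one bin down; from the
        # lowest bin, the balls drop out of the process.
        if i == 0:
            remaining -= bins[0]
        else:
            bins[i - 1] += bins[i]
        bins[i] = 0
    return steps
```

Averaging `simulate(n)` over many runs for a range of $n$ gives estimates of the kind referred to above and in the answers below.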

Matthias
  • The question looks interesting (although I do not know the answer). It seems difficult because of non-monotonicity; if all the $n$ balls are in the top bin, the process clearly terminates in exactly $n$ steps. – Tsuyoshi Ito Nov 08 '10 at 03:02

2 Answers


Not really an answer, but an extended comment on András's answer.

András's answer contains a nice intuition, though I do not believe it is a rigorous calculation of the expected number of steps. I think it is perhaps a good approximation to an answer, but it does not seem to properly deal with cases where the bin below the highest occupied bin becomes empty before the upper bin is emptied into it. Still, this might be a reasonable approximation to make (I'm not sure).

His calculation contains an error that affects the scaling. I'm going to take exactly the same starting point and redo and expand the calculation.

It misses a factor of $p$ inside the summation, since the probability of randomly choosing a ball in the correct bin is $\frac{p}{n}$ rather than $\frac{1}{n}$ when that bin holds $p$ balls. As a result we have

$\begin{eqnarray*} n + \sum_{p=1}^n \sum_{k=0}^{\infty} (k+1) \frac{p}{n} \left(\frac{n-p}{n}\right)^k & = & n + \sum_{p=1}^{n} \frac{p}{n} \sum_{k=0}^{\infty} (k+1) \left(\frac{n-p}{n}\right)^k \\ & = & n + \sum_{p=1}^{n} \frac{p}{n} \cdot \frac{n^2}{p^2} \\ & = & n + n\sum_{p=1}^{n} 1/p \\ & = & n (1+H_n) \end{eqnarray*}$

where $H_n = \sum_{p=1}^{n} 1/p$ is the $n$th harmonic number; the second equality uses $\sum_{k=0}^{\infty} (k+1) x^k = (1-x)^{-2}$ with $x = \frac{n-p}{n}$. To approximate $H_n$ we can simply replace the summation with an integral: $H_n \approx \int_{1}^{n+1} \frac{1}{x}\, dx = \log(n+1)$. Thus the scaling is $n(1+\log(n+1))$, or approximately $n \log(n+1)$. While this does not match the scaling of the problem exactly (see the simulation below), it is off by almost exactly a factor of $\log 2$.
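As a quick numerical cross-check, here is a small standalone script (a sketch, not the original simulation code) tabulating the two closed forms that are plotted below:

```python
import math

def harmonic(n):
    # H_n = 1 + 1/2 + ... + 1/n
    return sum(1.0 / p for p in range(1, n + 1))

for n in (10, 100, 1000, 10000):
    theory = n * (1 + harmonic(n))       # n(1 + H_n) from the calculation above
    rescaled = n * math.log(n + 1, 2)    # n log_2(n+1): the curve matching the data
    print(n, round(theory), round(rescaled))
```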

[Figure: simulation vs. theory. Red circles: data points from a simulation of the process, averaged over 10k runs. Green: $n \log_2(n+1)$. Blue: $n \log(n+1)$.]

Joe Fitzsimons
  • @Joe: Nice work! It would be interesting to now show rigorously how the $\ln 2$ factor comes in from the creation of gaps. – András Salamon Nov 09 '10 at 08:39
  • @András: I don't really have a good feeling for whether this is a sound approximation to make or not. @Peter's idea of bunches forming and shifting down seems like it should give the correct expression, assuming that these are equally likely to form in any bin. – Joe Fitzsimons Nov 09 '10 at 08:49
  • @Joe: The topmost ball will remain isolated in almost 1/3 of the cases. Consider the top 3 balls. If the middle one is picked first (out of those 3), it will join the third one. These two will, from then on, move twice as fast as the top ball. The distance between them and the top ball is a heavily biased random walk, and the probability for the top ball to catch up is bounded by a small(ish) constant (rough estimate 15%). But the good news is that the top $\log n$ balls shouldn't really matter. If everything else is cleared in $n \log n$ steps, they will only add an additional $n \log n$ steps. – Matthias Nov 09 '10 at 23:58
  • Here are two plots. Both show the number of steps divided by $n$, until everything but $\log n$ balls are cleared. For the first one, balls that drop out of the system can still be picked (as András proposed): http://tinyurl.com/2wg7a9y . For the second one, balls that drop out of the system are not picked anymore: http://tinyurl.com/33b63pq . As you can see, the bounds the first process can give are probably too weak. Maybe it can be tuned by considering phases (like Peter wrote somewhere) in which we always halve the number of balls in the system? – Matthias Nov 10 '10 at 00:54
  • @Matthias: Analyzing the expected time assuming Peter's intuition is correct is not the road block (at least from my perspective). To me, proving that this intuition is in fact a fair reflection of what happens is necessary first, though I do suspect it is a good approximation. – Joe Fitzsimons Nov 10 '10 at 06:03
  • @Joe: That is not what I meant to say. I'm saying that I believe that the assumption that balls don't leave the system is not a good approximation - at least not without amending it with some other (significant) arguments. And I tried to give some theoretical and experimental evidence for this. – Matthias Nov 10 '10 at 12:23
  • Look at the first $n$ steps. In these steps, most balls are just going to catch up with a constant number of other balls, but a few are going to be picked $O(\log n / \log \log n)$ times, and these will probably accumulate at least $O(\log n / \log \log n)$ other balls in the same bin. (Possibly even $O(\log n)$ balls. I can't decide, but it doesn't matter much.) In the next $O(n)$ steps, each of these is moving so much faster than the balls immediately below it, that it roughly doubles in size. I think this keeps happening, so after $O(n \log n)$ steps, half the balls drop out of the system. – Peter Shor Nov 10 '10 at 23:27
  • @Matthias: My simulations (the red data points in the plot) did remove the balls from the system. From your plot it is quite hard to tell the scaling. Do you have a loglog version? – Joe Fitzsimons Nov 11 '10 at 05:52
  • @Joe: http://tinyurl.com/2u9vke3 shows the number of steps - until only $\log n$ balls are left - divided by $n$. Red points for removing the balls, green points without removing them. Apologies for the noise in the data (right now I was only able to do one try for each data point). By the way, I think @Peter's arguments are pretty good. – Matthias Nov 11 '10 at 14:57

Edit: I am leaving this answer as is (for now) to illustrate the messy process of proving theorems, something that is left out of published papers. The core intuition here is that it is enough to focus on the top ball, as it sweeps away all below it. Please see the comments (in particular @Michael pointing out that gaps can occur) and @Joe's later answer for how errors were identified and corrected. I especially like Joe's use of experiments to double-check that the formulas were sensible.

The lower bound is $n$, as you point out, but somewhat surprisingly there seems to be an upper bound of $(1 + \pi^2/6)n$ for the expected number of steps.

To derive this, note that a sequence of picks will clear all the bins precisely if the bins containing the picked balls form a sequence with a subsequence $b_1 b_2 \cdots b_n$ such that $b_1 = n$, $b_2 \ge n-1$, and in general $b_i \ge n-i+1$. Additional conditions on the sequence are necessary to avoid choosing balls that are no longer in the system, but for the purposes of an upper bound, suppose that there is an infinite decreasing sequence of bins (so the balls don't disappear when leaving bin 1, but are moved to bin 0, then bin $-1$, and so on). Then the expected number of steps for such a subsequence to be seen is the expected number of steps before $b_1$ is seen, plus the expected number of steps before $b_2$ is seen, and so on (down to 1, since $b_n$ can be any of the numbers $1,2,\ldots,n$). These can be treated as separate events, one after the other. The expected number of steps is then

$\begin{eqnarray*}n + \sum_{p=1}^n \sum_{k=0}^{\infty} \frac{k+1}{n} \left(\frac{n-p}{n}\right)^k & = & n + \sum_{p=1}^{n-1} \frac{1}{n-p} \sum_{k=1}^{\infty} k\left(\frac{n-p}{n}\right)^k \\& = & n + \sum_{p=1}^{n-1} \frac{1}{n-p} \cdot \frac{n(n-p)}{p^2} \\& = & n + n\sum_{p=1}^{n-1} 1/p^2 \\& \le & (1 + \pi^2/6)n. \end{eqnarray*}$
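One way to probe this bound numerically (a sketch under the same no-gaps assumption; note that, as Joe's answer above points out, the success probability at the stage where the top clump holds $p$ balls is $p/n$ rather than $1/n$) is to simulate the waiting time for such a subsequence directly:

```python
import math
import random

def subsequence_wait(n):
    """Steps until the subsequence b_1 = n, b_2 >= n-1, ... is seen,
    assuming the top clump never leaves a gap: at stage i it holds
    i balls in the highest occupied bin, so a uniformly random pick
    hits it with probability i/n."""
    steps = 0
    for size in range(1, n + 1):
        while True:  # geometric waiting time with success probability size/n
            steps += 1
            if random.random() * n < size:
                break
    return steps

for n in (10, 100, 1000):
    runs = 1000
    avg = sum(subsequence_wait(n) for _ in range(runs)) / runs
    h_n = sum(1.0 / p for p in range(1, n + 1))
    # The average tracks n * H_n (cf. Joe's corrected n(1 + H_n)),
    # not the linear bound (1 + pi^2/6) n claimed above.
    print(n, round(avg), round(n * h_n), round((1 + math.pi**2 / 6) * n))
```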

András Salamon
  • Hi Andras, I think there might be a bug somewhere. I ran a few simulations with 10k trials and got some slightly different results. See this graph which compares the data points (circles) I got versus the exact expression in the line before the inequality in your answer (line). – Joe Fitzsimons Nov 08 '10 at 15:28
  • 3
    @Andras @Joe: Holy schmoley. If all the people asking the questions on this site took their questions as seriously as you take answering them, this would be the badassest url on the internet. – Aaron Sterling Nov 08 '10 at 15:53
  • @Joe: thanks for the reality check! Let me see what I've done wrong. Is the x-axis plotting $n$, and the y-axis plotting the average number of steps to clear the system? – András Salamon Nov 08 '10 at 15:55
  • If somebody thinks it is helpful, I can try to provide experimental data up to something like 10^9. (I have it somewhere but I would need to find the program on my hard disk first.) – Matthias Nov 08 '10 at 16:04
  • @Andras: Yes. It is quite possible there is a bug in my code, but I calculated the probability analytically as a check, and it matched pretty well. – Joe Fitzsimons Nov 08 '10 at 16:08
  • 1
    @András: I'm trying to understand your statement "a sequence of balls will clear all the bins precisely if it contains a subsequence...". Maybe I've misunderstood something, but say we have four balls. If the sequence is 3,4,3,2,4 then it seems to satisfy your subsequence requirement, yet not all the bins have been cleared. – Michael Nov 08 '10 at 16:12
  • 1
    @András: If you want to show a reasonable upper bound, you have to use the fact that balls disappear from the process and are no longer picked. Otherwise, the top most ball is always only picked with probability 1/n and there is a good chance (maybe slightly less than 1/2) that this ball will stay isolated the whole time. For this ball, you will need n^2 steps. – Matthias Nov 08 '10 at 17:50
  • @Matthias: I thought so too, until I developed this reasoning. It does seem too good to be true... – András Salamon Nov 08 '10 at 18:10
  • 1
    @Michael: I think you have identified the mistake. I'm assuming falsely that the top ball will move down even if there is a gap. – András Salamon Nov 08 '10 at 18:13
  • 2
    Here's my intuition. After a few steps, some clump of balls is going to be larger than any other clump of balls. At this point, the clump moves faster than everything else, clears everything below it and falls out of the system. This whole process should take $O(n)$ or maybe $O(n \log n)$ steps. This first clump is uniformly distributed in the line, so on average it takes half the balls with it. Now, we're left with a system of around $n/2$ balls, and another clump forms. So after around $\log n$ clumps, we're done. – Peter Shor Nov 09 '10 at 02:00
  • @Peter: Yes indeed, this intuition appears to be close to what is happening. Note, however, that the first clump is not really uniformly distributed. It is more likely to form somewhere in the middle than at the far end. But this should have no effect on your argument. Another thing is that these phases aren't really disjoint. My intuition is that the behavior of the remaining balls while the first clump is cleared is similar to what András described ("there is an infinite decreasing sequence of bins for these balls"). – Matthias Nov 09 '10 at 03:08
  • @Peter: My intuition is slightly different. I suspect several reasonably large clumps form in parallel and for the most part don't interact. However, $n \log(n+1)$ does indeed seem to be the correct scaling. See http://download.jfitzsimons.org/cstheory2.jpg for a comparison between $n \log(n+1)$ (blue line) and data points from a simulation of $10^4$ runs (red circles). – Joe Fitzsimons Nov 09 '10 at 04:23
  • @András: I think you also miss out a factor of p in the initial summation. – Joe Fitzsimons Nov 09 '10 at 06:08
  • @Peter: It would be nice to capture your intuitive idea rigorously. – András Salamon Nov 09 '10 at 08:41
  • @Matthias: Unless I'm wrong, the uniformity of the process on the line means that for large $n$, discounting boundary effects, the first big clump is equally likely to form anywhere. (Of course, as Joe remarks, some other clumps will be forming simultaneously, and if they happen to be close to the bottom, they will beat the larger clump in the race to drop off.) My first inclination would be to look at what happens with this process on an infinite line. – Peter Shor Nov 09 '10 at 13:53
  • @Peter: I agree, one could (and should) probably ignore boundary effects for now. – Matthias Nov 09 '10 at 14:31