2

I ended up asking here.

My problem might be familiar with the coupon collector's problem and related to this post Probability of throwing n different numbers in m throws of a die but it does not solve my specific problem:

I want to know how many unique values there are after throwing a die with $k$ sides $n$ times. The coupon collector's problem asks for a constant number of unique values. It is something like the other way around and or I have a brain fart.

Let

  • $k$ be the maximum value in range of iid values $[1:k]$ to appear. i.e. $[1:100]$ (known) so there are 100 unique values possible.
  • $n$ number of throws (known)
  • $u$ number of unique values after $n$ throws (wanted)

In other words, I want to predict this result if $k = 100$ and $n = 80$:

length(unique(order(table(floor(runif(80, min=1, max=101)))))) # u will be app. 54
#how to predict?

(The above is R code)

Edit: Problem is not that trivial. Found answer here... How can I estimate unique occurrence counts from a random sampling of data?

Alexis
  • 29,850
  • 2
    The question you link gives the probability $p_m$ of seeing exactly $m$ different values after $n$ rolls of a $k$-sided die. So for the expected number of values, you can just calculate $\sum_{m=1}^kmp_m$. Does that answer your question? – Stephan Kolassa Nov 25 '20 at 09:26
  • Thank you for your comment. Unfortunately i don't see the connection. Could you define $p_m$ or please express it in a formula with $u = ...$? – augeaufdiewelt Nov 25 '20 at 09:42
  • 1
    In the answer to the linked question, Franck uses $T_{n,m}$ for the event "exactly $n$ different values in $m$ rolls". So in our nomenclature, we need his $P(T_{m,n})$. (His $D$ corresponds to our $k$.) – Stephan Kolassa Nov 25 '20 at 10:17
  • Using different nomenclature is confusing. To keep up with Franck's variables: How to compute $ P(T_{n,m}) $ when $ n $ is exactly what I'm looking for? Should the equation not be changed? Or do you mean using every possible $ n $ and taking the highest probability? – augeaufdiewelt Nov 25 '20 at 10:37
  • 1
    Well, if I am understanding you correctly, you are looking for the expected value. Which is just the sum over the (finitely many) possible outcomes $n$, times each outcome's probability $p_n$, which is the $P(T_{n,m})$ you can calculate by Franck's answer. (I agree that switching indices is hard. Easiest to either translate your question into the indices the other thread uses, or the other way around.) – Stephan Kolassa Nov 25 '20 at 11:04
  • I'm not even gonna contradict you, but I'm giving up on this. Even if Franck's equation is correct and I implemented his solution correctly, this would mean too much runtime complexity for a large number of throws (I need about 100k). Nevertheless thank you for your help. I'll estimate the result by the average of the repetitions of the R command. – augeaufdiewelt Nov 25 '20 at 13:04
  • The link you supply in your edit does not answer the question you have asked. Indeed, that thread only seems to supply some simulation code and an answer to a completely different question. – whuber Nov 25 '20 at 17:37
  • I think you are right. At the moment I have a lot on my mind and would like to review it again later, sorry. – augeaufdiewelt Nov 25 '20 at 18:08

2 Answers2

3

How many unique values can you expect after throwing a dice with k sides?

If you only need the expectation value then the answer is relatively simple.

We can compute the probability that a specific number is rolled which is

$$P = 1-\left(\frac{k-1}{k}\right)^n$$

For the other numbers, this is the same. So on average you will have a number in the sample a fraction $P$ of the time.

The number of unique values to expect is then $k$ times that fraction $$Pk = k \left[ 1-\left(\frac{k-1}{k}\right)^n \right]$$

0

This Python code gives 55.2 as the expected number of unique values given 100 possible outcomes and 80 trials:

def distribution_unique(k, n):
        if n == 0:
                # If 0 trials, probability of 0 unique values is 1
                return [1 if i == 0 else 0 for i in range(k + 1)]
        else:
                previous = distribution_unique(k, n - 1)
                current = [0 for i in range(k + 1)]
                for i in range(k + 1):
                        current[i] = previous[i] * (i / k)
                        if i > 0:
                                current[i] += previous[i-1] * ((k - i + 1) / k)
                return current

def expected_unique(k, n): distribution = distribution_unique(k, n) tot = 0 for i in range(len(distribution)): tot += i * distribution[i] return tot

print(expected_unique(100, 80))

fblundun
  • 3,959
  • Nice solution thank you. I used C++ to do that. For bigger k, n Python throws "RecursionError: maximum recursion depth exceeded in comparison". In the end looping the R command worked for me. Would probably be smarter to use distinct value estimators for big numbers. http://www.vldb.org/conf/1995/P311.PDF – augeaufdiewelt Nov 25 '20 at 15:21
  • NB each recursive call in my code actually corresponds to multiplication by a fixed matrix. The code could be rewritten to explicitly use matrix exponentiation, which would have lower computational complexity. – fblundun Nov 25 '20 at 16:11