How can I estimate unique occurrence counts from a random sampling of data?

Question

Let's say I have a large set of $S$ values which sometimes repeat. I wish to estimate the total number of unique values in the large set.

If I take a random sample of $T$ values, and determine that it contains $T_u$ unique values, can I use this to estimate the number of unique values in the large set?

Can you also keep count of the number of copies of each unique value in the sample? Strikes me that might help. — onestop, Nov 27 '11 at 19:40

score 14 · Answer 1 · edited Jul 14 '23 at 13:59

Here is a whole paper about the problem, with a summary of various approaches. It's called Distinct Value Estimation in the literature.

If I had to do this myself, without having read fancy papers, I'd do this. In building language models, one often has to estimate the probability of observing a previously unknown word, given a bunch of text. A pretty good approach at solving this problem for language models in particular is to use the number of words that occurred exactly once, divided by the total number of tokens. It's called the Good Turing Estimate.

Let u1 be the number of values that occurred exactly once in a sample of m items.

P[new item next] ~= u1 / m.

Let u be the number of unique items in your sample of size m.

If you mistakenly assume that the 'new item next' rate didn't decrease as you got more data, then using Good Turing, you'll have

us ~= u + u1 / m * (s - m)
s - entire set size
m - sample size
us - total number of unique elements in the entire set
u - number of unique elements in the sample
u1 - number of values that occurred exactly once in the sample

This has some nasty behavior as u1 becomes really small, but that might not be a problem for you in practice.

Indeed, s occurs twice in this, both on the left and right hand size? — PascalVKooten, Sep 16 '19 at 13:32
@user2739472 it confused me as well but it's defined above: u is the number of unique items in your sample of size m and s is the size of entire set — Danio, Jul 14 '23 at 12:52

score 2 · Answer 2 · answered Mar 30 '22 at 00:53

There is a python package estndv for this task. For example, your sample is [1,1,1,3,5,5,12] and the original large set has 100000 values:

from estndv import ndvEstimator
estimator = ndvEstimator()
ndv = estimator.sample_predict(S=[1,1,1,3,5,5,12], N=100000)

ndv is the estimated number of unique/distinct values for the large set. This method achieves the best results on sampled-based estimation for the number of unique values, see the paper: https://vldb.org/pvldb/vol15/p272-wu.pdf

Brash Equilibrium · Answer 3 · 2011-11-28T08:04:23.490

1

The simulation strategy

Collect m random samples of size n from the set S. For each of the m samples, compute the number u of unique values and divide by n to normalize. From the simulated distribution of normalized u, compute summary statistics of interest (e.g., mean, variance, interquartile range). Multiply the simulated mean of normalized u by the cardinality of S to estimate the number of unique values.

The greater are m and n, the more closely your simulated mean will match the true number of unique values.

edited Nov 28 '11 at 08:04

answered Nov 28 '11 at 07:56

Brash Equilibrium

3,845

1

Isn't this solution kind of lame? It doesn't take into account saturation effects at all. – rrenaud Nov 28 '11 at 21:36
@rrenaud Compared to your solution, I agree that mine appears inferior. – Brash Equilibrium Nov 28 '11 at 22:21
@rrenaud I do still advocate a simulation strategy whereby you calculate the probability of unique items using the GTFE on as many-as-feasible large-as-feasible samples to get some sense of sampling error for the probability of unique items. Or is there an explicit formula to calculate all of the moments? I wouldn't think it is the negative binomial since the binomial distribution, according to the Wikipedia reference, does not characterize the distribution of the number of unique items. But awesome! I'll file this away for later. – Brash Equilibrium Nov 28 '11 at 22:31

score 0 · Answer 4 · answered Sep 16 '19 at 15:43

Here's an implementation for pandas:

import math
import numpy as np
from collections import Counter

def estimate_uniqueness(df, col, r=10000, n=None):
    """ Draws a sample of size r from column col from dataframe df and 
        returns an estimate for the number of unique values given a
        population size of n """
    n = n or df.shape[0]
    sample = df[col][np.random.randint(0, n, r)]
    counts = sample.value_counts()
    fis = Counter(counts)
    estimate = math.sqrt(n / r) * fis[1] + sum([fis[x] for x in fis if x > 1])
    return estimate

Relies on Section 2 and 4 of this paper: http://ftp.cse.buffalo.edu/users/azhang/disc/disc01/cd1/out/papers/pods/towardsestimatimosur.pdf

How can I estimate unique occurrence counts from a random sampling of data?

4 Answers4

Linked