21

Let's say I have a large set of $S$ values which sometimes repeat. I wish to estimate the total number of unique values in the large set.

If I take a random sample of $T$ values, and determine that it contains $T_u$ unique values, can I use this to estimate the number of unique values in the large set?

cardinal
  • 26,862
sanity
  • 370

4 Answers4

14

Here is a whole paper about the problem, with a summary of various approaches. It's called Distinct Value Estimation in the literature.

If I had to do this myself, without having read fancy papers, I'd do this. In building language models, one often has to estimate the probability of observing a previously unknown word, given a bunch of text. A pretty good approach at solving this problem for language models in particular is to use the number of words that occurred exactly once, divided by the total number of tokens. It's called the Good Turing Estimate.

Let u1 be the number of values that occurred exactly once in a sample of m items.

P[new item next] ~= u1 / m.

Let u be the number of unique items in your sample of size m.

If you mistakenly assume that the 'new item next' rate didn't decrease as you got more data, then using Good Turing, you'll have

us ~= u + u1 / m * (s - m)

s - entire set size m - sample size us - total number of unique elements in the entire set u - number of unique elements in the sample u1 - number of values that occurred exactly once in the sample

This has some nasty behavior as u1 becomes really small, but that might not be a problem for you in practice.

Danio
  • 103
rrenaud
  • 1,088
  • 5
  • 11
2

There is a python package estndv for this task. For example, your sample is [1,1,1,3,5,5,12] and the original large set has 100000 values:

from estndv import ndvEstimator
estimator = ndvEstimator()
ndv = estimator.sample_predict(S=[1,1,1,3,5,5,12], N=100000)

ndv is the estimated number of unique/distinct values for the large set. This method achieves the best results on sampled-based estimation for the number of unique values, see the paper: https://vldb.org/pvldb/vol15/p272-wu.pdf

Bob
  • 63
1

The simulation strategy

Collect m random samples of size n from the set S. For each of the m samples, compute the number u of unique values and divide by n to normalize. From the simulated distribution of normalized u, compute summary statistics of interest (e.g., mean, variance, interquartile range). Multiply the simulated mean of normalized u by the cardinality of S to estimate the number of unique values.

The greater are m and n, the more closely your simulated mean will match the true number of unique values.

  • 1
    Isn't this solution kind of lame? It doesn't take into account saturation effects at all. – rrenaud Nov 28 '11 at 21:36
  • @rrenaud Compared to your solution, I agree that mine appears inferior. – Brash Equilibrium Nov 28 '11 at 22:21
  • @rrenaud I do still advocate a simulation strategy whereby you calculate the probability of unique items using the GTFE on as many-as-feasible large-as-feasible samples to get some sense of sampling error for the probability of unique items. Or is there an explicit formula to calculate all of the moments? I wouldn't think it is the negative binomial since the binomial distribution, according to the Wikipedia reference, does not characterize the distribution of the number of unique items. But awesome! I'll file this away for later. – Brash Equilibrium Nov 28 '11 at 22:31
0

Here's an implementation for pandas:

import math
import numpy as np
from collections import Counter

def estimate_uniqueness(df, col, r=10000, n=None):
    """ Draws a sample of size r from column col from dataframe df and 
        returns an estimate for the number of unique values given a
        population size of n """
    n = n or df.shape[0]
    sample = df[col][np.random.randint(0, n, r)]
    counts = sample.value_counts()
    fis = Counter(counts)
    estimate = math.sqrt(n / r) * fis[1] + sum([fis[x] for x in fis if x > 1])
    return estimate

Relies on Section 2 and 4 of this paper: http://ftp.cse.buffalo.edu/users/azhang/disc/disc01/cd1/out/papers/pods/towardsestimatimosur.pdf