My friend and I recently saw our old passions for Kinder Surprise toys reignited with a new animal toy line which resembled the old toys we were missing. To our dismay, however, this series did not include a "check-list" of toys to be collected. Hence arose the natural problem of estimating the number of distinct toys in the series, and when to stop buying. (We have an embarrassing amount of data.)

As neither of us has much experience with this type of problem, we do not know any standard approaches. That being said, here is what we cooked up: Suppose that there are $N$ distinct toys in the series. We can calculate the probability of observing $v_1$ toys that appear exactly once ("singles"), $v_2$ toys that appear exactly twice ("doubles"), etc. Treating this as our data, we assume the most likely thing happened and maximize the calculated probability as a function of $N$. I recognize that there are problems with this approach. Could anyone suggest alternatives?

The data is sourced only from purchases in the supermarket. For fun, here is some data, presented as a quintuple with the number of singles, doubles, etc.: $(5,2,3,2,0)$ was a week ago, $(5,2,3,1,1)$ and $(5,2,2,2,1)$ are more recent.

This question is similar, except here we assume that the value labelled $S$ there is infinite.

N.N.
  • This sounds like a mark and recapture problem: https://en.wikipedia.org/wiki/Mark_and_recapture – Jeremy Miles Jun 28 '22 at 20:37
  • How are you making observations, exactly? What does a "single," "double," etc. actually mean? How one would model and analyze this situation ought to depend substantially on, say, whether you are searching through catalogs, buying toys as you encounter them, sampling databases, or whatever else it is you might be doing. – whuber Jun 28 '22 at 21:11
  • @whuber Sorry for the lack of clarity. The only way I make observations is by buying toys, grouping them by type (i.e. "wolf", "orca", etc.), and counting repeats. Hence by "singles" I mean the number of toys which appear exactly once, by "doubles" exactly twice, etc. My only source of data is the supermarket! – N.N. Jun 28 '22 at 21:24
  • Please add this new information as an edit to the post. We want posts to be self-contained, as comments are easily missed, and can be deleted. – kjetil b halvorsen Jun 28 '22 at 22:56
  • As a practical matter, it would be untenable to suppose that what you observe or purchase in a supermarket behaves at all like a random sample. For theoretical amusement you could assume that and derive answers, but don't expect them to reflect reality. – whuber Jun 29 '22 at 14:03
  • @kjetilbhalvorsen Done! It should be better now. – N.N. Jun 29 '22 at 19:57
  • @COOLSerdash It's quite similar, but there $S$ is finite, whereas we assume it to be infinite. (Maybe this assumption is actually not so well-founded in retrospect.) – N.N. Jun 29 '22 at 19:58
  • @whuber Could you suggest why supermarket purchases do not behave like random samples? Regardless, the reason we made these purchases was precisely for the theoretical amusement, but it would be interesting if they did indeed reflect reality. – N.N. Jun 29 '22 at 20:00
  • First, they aren't random, so the onus is on you to establish that they behave randomly. Among the possibilities you need to rule out are (1) local distribution of batches can be highly non-random. (2) Availability on the shelves reflects purchases by those who shopped before you and therefore is biased towards items not locally popular. (3) The supermarket itself might purchase quantities that do not reflect the full range of product. (4) Availability might reflect regional marketing strategies rather than the full breadth of the product line. Etc., etc. – whuber Jun 29 '22 at 20:30
  • A related famous problem is known as the Coupon Collector's problem (https://en.wikipedia.org/wiki/Coupon_collector%27s_problem). Your version essentially has the total number of objects (ie coupons, $n$) unknown. – bdeonovic Jun 30 '22 at 18:16
  • @whuber While I think you are generally correct that these may not be a random sample, I think only your first point (1) applies here; the toys in question are hidden inside a chocolate egg, so local preferences and marketing strategies probably don't apply, but I suspect that batch distribution could still be non-random. – bdeonovic Jun 30 '22 at 19:36
  • @N.N. just so I understand correctly your notation for $(5,2,2,2,1)$ means that you have a total of $5 \cdot 1 + 2 \cdot 2 + 2 \cdot 3 + 2 \cdot 4 + 1 \cdot 5 = 28$ Kinder Surprise toys, with a total of $5+2+2+2+1=12$ unique ones, correct? – bdeonovic Jun 30 '22 at 19:44
  • @bdeonovic Exactly! – N.N. Jul 01 '22 at 20:24

1 Answer

You could potentially approach this as an inference from an occupancy problem, depending on the sampling method. For simplicity, let's assume that there are $N$ types of toys and your sampling method gives you an IID sample of $n$ toys that are equiprobable over the different types. Let $K_n$ denote the number of distinct toys in your sample. Given the values $N$ and $n$, this random variable follows the classical occupancy distribution (see e.g., O'Neill 2019), with probability mass function:

$$\mathbb{P}(K_n = k) = \text{Occ}(k|n,N) = \frac{(N)_k \cdot S(n,k)}{N^n} \quad \quad \quad \text{for all } 1 \leqslant k \leqslant \min(n,N).$$
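As a rough illustration, the mass function can be evaluated exactly with integer arithmetic. This is only a sketch: it reads $(N)_k$ as the falling factorial $N(N-1)\cdots(N-k+1)$ and $S(n,k)$ as the Stirling number of the second kind, and the function names (`stirling2`, `falling_factorial`, `occupancy_pmf`) are made up for this example rather than taken from any library.

```python
from fractions import Fraction
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling number of the second kind, S(n, k), via the standard recurrence."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


def falling_factorial(N, k):
    """(N)_k = N * (N - 1) * ... * (N - k + 1)."""
    result = 1
    for i in range(k):
        result *= N - i
    return result


def occupancy_pmf(k, n, N):
    """P(K_n = k) for n IID equiprobable draws from N types, as an exact fraction."""
    if not 1 <= k <= min(n, N):
        return Fraction(0)
    return Fraction(falling_factorial(N, k) * stirling2(n, k), N ** n)


# Example: probability of seeing 12 distinct toys in 28 draws if N were 20.
# (N = 20 is an arbitrary value chosen only for illustration.)
print(float(occupancy_pmf(12, 28, 20)))
```

Using `Fraction` avoids overflow and rounding in $N^n$; converting to `float` at the end is enough for plotting or comparison.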

Observing the occupancy value $K_n=k$ gives you the log-likelihood function:

$$\ell_k(N) = \sum_{i=1}^k \log (N-i+1) - n \log(N) \quad \quad \quad \quad \quad \text{for all } N \geqslant k.$$
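A minimal sketch of maximizing this log-likelihood over integer $N$ by brute force is given below. It assumes the most recent tally from the question, $(5,2,2,2,1)$, read as in the comments ($n = 28$ toys, $k = 12$ distinct). The function names and the search cap `n_max` are hypothetical choices for this example, and the whole calculation rests on the IID equiprobable sampling assumption stated above.

```python
import math


def occupancy_loglik(N, n, k):
    """Log-likelihood l_k(N) = sum_{i=1}^k log(N - i + 1) - n log(N), for N >= k."""
    if N < k:
        return -math.inf
    return sum(math.log(N - i + 1) for i in range(1, k + 1)) - n * math.log(N)


def occupancy_mle(n, k, n_max=1000):
    """Integer N in [k, n_max] maximizing the log-likelihood (grid search).

    n_max is an arbitrary cap for this sketch. When k == n the likelihood
    keeps increasing in N, so no finite maximizer exists.
    """
    if k >= n:
        raise ValueError("no finite MLE when every sampled toy is distinct")
    return max(range(k, n_max + 1), key=lambda N: occupancy_loglik(N, n, k))


# Frequency-of-frequencies data from the question: v_j toys seen exactly j times.
freq = (5, 2, 2, 2, 1)
n = sum(j * v for j, v in enumerate(freq, start=1))  # total toys bought
k = sum(freq)                                        # distinct toys seen
print(n, k, occupancy_mle(n, k))  # prints: 28 12 13
```

Under these assumptions the estimate is $\hat{N} = 13$, i.e. about one toy type not yet seen; how far to trust this depends on the sampling concerns raised in the comments.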

You can find the details for computing the MLE and MoM estimators for $N$ in this related question. If you can specify the number of toys you've collected and the number of distinct toys you got I can complete the estimation. (I'm impressed that you have an embarrassing amount of data on Kinder egg toys; if you'd be interested in sharing that data, it would make a nice example for inferences in occupancy problems.)

Ben
  • Thanks for the answer! The question has been updated to include data now. – N.N. Jun 29 '22 at 20:05
  • @N.N. Your data is unclear. Can you please aggregate it so that there is a single set of results for all time periods (it is unclear what the category overlap is across the different time periods). – Ben Jun 29 '22 at 22:31
  • My apologies. I think the most recent one is also the most relevant. Namely, (5,2,2,2,1) should be most accurate. – N.N. Jul 01 '22 at 20:25