
@bsidhom
Last active February 12, 2023 03:34
Compute the distribution of unique (equi-probable) coupons given a fixed number of draws

Suppose that you have a pool of n different types of coupons (one of each) and want to sample k coupons with replacement. What is the distribution of the random variable X, which counts the number of distinct coupons observed with each k-sized sample? I was unable to come up with a closed-form solution on my own, instead resorting to simulation and brute force. This captures a solution I was able to find by searching online and putting the pieces together. I've written the equations below in approximate Wolfram language syntax, since math typesetting is not available.

Related problems

This problem can be considered a variation on the classic coupon collector's problem. (This is the reason I chose the formulation above.) However, instead of asking for the expected number of trials needed to collect c coupons, we're asking for the distribution of X, the number of distinct coupons we observe. Finding the expectation has a well-known solution, which exploits the linearity of expectation over indicator variables for each coupon type i. That problem is actually more general than the question we consider here, since it allows each coupon type to have a distinct probability, p_i, of being pulled. In our problem, we assume that each coupon has probability 1/n. On the other hand, we want the full PMF as opposed to a simple expectation.

Starting our solution

Let's look at the probability of observing X=x distinct coupons out of k draws. First, we know that there are n^k different sequences we might draw. We can immediately see that the probability of observing just a single unique coupon (we only consider a positive support) is n/n^k = 1/n^(k-1). This is because there are exactly n different sequences that result in just a single unique coupon: one for each coupon type, where all draws are the same. Similarly, it's easy to reason about the probability of observing exactly k distinct coupons. There are (n)_k = n!/(n-k)! different sequences which result in k distinct coupons, where (n)_k is the falling factorial. Therefore, P(X=k) = (n)_k/n^k.
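These two boundary cases are easy to sanity-check numerically. A quick Python sketch (the values n=10, k=5 are just an illustration, matching the example used later):

```python
from math import perm

n, k = 10, 5

# P(X = 1): exactly n constant sequences out of n^k total draws.
p_single = n / n**k                      # = 1/n^(k-1)

# P(X = k): (n)_k = n!/(n-k)! injective sequences (falling factorial).
p_all_distinct = perm(n, k) / n**k

print(p_single, p_all_distinct)
```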

However, things get more difficult as soon as we consider some x between 1 and k. A simple starting point is X=2. There are Binomial[n, 2] ways to choose the 2 coupon types that we are considering. From there, we can consider all of the ways to create ordered partitions of k objects into 2 non-empty subsets. For each "break", we have to consider the number of permutations that correspond to the target partition. In other words, we need to count each unordered partition multiple times if multiple permutations give rise to that same partition (i.e., we cannot use stars and bars counting to arrive at a general solution). For this special case, we can easily sum over the different partition points. Each unordered partition where we have i observations of the first coupon type (and k-i observations of the second coupon type) can arise in Binomial[k, i] ways. In total, we have:

P(X=2) = Binomial[n, 2]*Sum[Binomial[k, i], {i, 1, k-1}]/n^k
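Note that the inner sum is just 2^k - 2 (every way to assign draw positions to the two types, minus the two assignments that leave one type empty). A small Python check with the illustrative values n=10, k=5:

```python
from math import comb

n, k = 10, 5

# Ways to split k draw positions between the two chosen types (both non-empty).
splits = sum(comb(k, i) for i in range(1, k))    # = 2^k - 2
p_two = comb(n, 2) * splits / n**k
print(p_two)
```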

Computing the full distribution

As mentioned above, we cannot directly extend the solution for X=2 by counting unordered partitions, because we need to consider the permutation factor for each set of partitioning points. However, this is almost exactly the quantity counted by Stirling numbers of the second kind. A Stirling number of the second kind, StirlingS2[k, x], counts the number of ways to partition k labeled items into x non-empty subsets, where the subsets themselves are exchangeable. This does not quite work for us because we want to count the number of individual permutations that give rise to a given count, X=x. To account for the non-exchangeability of the coupon types assigned to the different buckets, we have to multiply the Stirling counts by x!.

To count the total number of ways to get X=x, we multiply the following factors: 1) the number of ways to choose x unique coupon types out of the n types, 2) the number of ways to partition the k draws into x non-empty buckets where the buckets themselves are exchangeable, and 3) the number of ways to assign the x chosen types to the x buckets. Putting this all together, we have the full PMF of X:

P(X=x) = Binomial[n, x]*StirlingS2[k, x]*x!/n^k
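This PMF is easy to verify directly with the standard Stirling-number recurrence; a minimal Python sketch (function names are my own):

```python
from functools import lru_cache
from math import comb, factorial

@lru_cache(maxsize=None)
def stirling2(n, k):
    # S2(n, k): partitions of n labeled items into k non-empty unlabeled blocks.
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

def pmf(n, k, x):
    return comb(n, x) * stirling2(k, x) * factorial(x) / n**k

n, k = 10, 5
probs = {x: pmf(n, k, x) for x in range(1, k + 1)}
```

The probabilities sum to one, and for n=10, k=5 they reproduce the table in the next section.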

Verifying with simulation

The closed-form solution for n=10, k=5 gives the following PMF:

x P(X=x)
1 0.0001
2 0.0135
3 0.18
4 0.504
5 0.3024

And simulating in R:

library(dplyr)
library(purrr)
library(tibble)

simulate <- function(N, n, k) {
    x <- seq(1, N) %>% map_int(function(ignored) {
        sample.int(n, size = k, replace = TRUE) %>% unique() %>% length()
    })
    tibble(trial = seq_along(x), coupons = x)
}

simulate(1e6, 10, 5) %>% group_by(coupons) %>% summarize(p=n()/1e6)
  coupons        p
    <int>    <dbl>
1       1 0.000096
2       2 0.0137
3       3 0.180
4       4 0.504
5       5 0.303

Approximating for large inputs

Note that the analytic solution may require high- or even arbitrary-precision numbers for accurate results, depending on the range of n and k.

To get around this, we can use the asymptotic approximation from Wikipedia due to N. M. Temme, Asymptotic Estimates of Stirling Numbers, STUDIES IN APPLIED MATHEMATICS 89:233-243 (1993), Elsevier Science Publishing. To prevent over/underflow, we do most arithmetic in logspace.

log_stirling_approx <- function(n, k) {
    v <- n / k
    g <- -lamW::lambertW0(-v*exp(-v))
    # Split the equation into log-factors for readability.
    lf1 <- 1/2*(log(v-1) - log(v) - log(1-g))
    lf2 <- (n-k)*(log(v-1) - log(v-g))
    lf3 <- n*log(k) - k*log(n) + k*(1-g)
    lf4 <- lchoose(n, k)
    lf1 + lf2 + lf3 + lf4
}

log_pmf <- function(n, k, x) {
    lchoose(n, x) + log_stirling_approx(k, x) + lfactorial(x) - k*log(n)
}

pmf <- function(n, k, x) {
    exp(log_pmf(n, k, x))
}

The above depends on the third-party lamW package for the Lambert W function. To avoid this dependency, you can try to use a Lambert-W approximation or something like Newton's method to converge on a solution.
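Newton's method works well here because the arguments we need, -v*exp(-v) for v > 1, stay inside (-1/e, 0), where the principal branch is well-behaved. A dependency-free sketch of my own (not lamW's implementation):

```python
from math import exp

def lambert_w0(x, tol=1e-12, max_iter=100):
    # Principal branch of w*exp(w) = x via Newton's method.
    # Assumes x >= -1/e (where W0 is real-valued); convergence
    # degrades very close to the branch point at -1/e.
    w = 0.0
    for _ in range(max_iter):
        ew = exp(w)
        step = (w * ew - x) / (ew * (w + 1.0))
        w -= step
        if abs(step) < tol:
            break
    return w
```

With this, g = -lambert_w0(-v * exp(-v)) recovers the g used in log_stirling_approx above.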

Here's a similar implementation in calc, an arbitrary precision calculator:

## This depends on the lambert-W, digamma, and gamma functions. To run this,
## you'll first need to load specialfunctions.cal and factorial2.cal, e.g.:
##
## read specialfunctions
## read factorial2
##
## Make sure that the precision (epsilon()) is fine enough to make results
## meaningful.

define log_choose(n, k) {
    return lngamma(n+1) - lngamma(k+1) - lngamma(n-k+1);
}

define log_stirling_approx(n, k) {
    local v;
    local g;
    local lf1;
    local lf2;
    local lf3;
    local lf4;
    if (n == k || (n >= 1 && k == 1)) {
        ## stirling(n, k) = 1 => log(stirling(n, k)) = 0
        return 0;
    }
    if ((k == 0 && n >= 1) || (k > n)) {
        ## stirling(n, k) = 0 => log(stirling(n, k)) = -infinity
        return -(1/epsilon());
    }
    v = n / k;
    g = -lambertw(-v*exp(-v), 0);

    lf1 = 1/2*(ln(v-1) - ln(v) - ln(1-g));
    lf2 = (n-k)*(ln(v-1) - ln(v-g));
    lf3 = n*ln(k) - k*ln(n) + k*(1-g);
    lf4 = log_choose(n, k);
    return lf1 + lf2 + lf3 + lf4;
}

define log_pmf(n, k, x) {
    ## calc does not provide ln(comb(n, k)) directly, so we use the
    ## lngamma-based log_choose() defined above.
    return log_choose(n, x) + log_stirling_approx(k, x) + lngamma(x+1) - k*ln(n);
}

define pmf(n, k, x) {
    return exp(log_pmf(n, k, x));
}

## Approximation of the nth harmonic number.
define aharmonic(n) {
    ## NOTE: You could use harmonic from specialfunctions.cal, but this performs
    ## poorly for large numbers and large precisions. psi() is just the digamma
    ## function.
    return ln(n) - psi(1) + 1/2/n - 1/12/n^2;
}

## Probability that _all_ coupons have been collected after drawing the expected
## number of coupons required to collect all coupons. Seems to approach
## 0.570376..., which matches exp(-exp(-gamma)), the Gumbel-limit probability
## from the coupon collector's problem (gamma being the Euler-Mascheroni
## constant; the limit itself is definitely _not_ gamma).
define prob(n) {
    return pmf(n, n*aharmonic(n), n);
}
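The aharmonic expansion above is easy to check against the exact harmonic number; a quick Python sketch (recall psi(1) = -gamma, so the -psi(1) term is just the Euler-Mascheroni constant):

```python
from math import log

EULER_GAMMA = 0.5772156649015329

def aharmonic(n):
    # H_n ~ ln(n) + gamma + 1/(2n) - 1/(12 n^2); the next term is O(1/n^4).
    return log(n) + EULER_GAMMA + 1 / (2 * n) - 1 / (12 * n**2)

h_exact = sum(1 / i for i in range(1, 11))   # H_10, exactly
print(aharmonic(10), h_exact)
```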

Some very sketchy approximations and rules of thumb

I state the following without proof or justification. These are just trends that I've noticed based on sampling and manual pattern matching. They may not hold in all regimes.

In all of the following statements, let X be the random variable of how many coupons you collect after k draws, from a pool of n coupons (with replacement).

Expectation of fraction of unique coupons collected

E[X/n] = 1 - exp(-k/n)

For the actual number of coupons, rather than the fraction:

E[X] = n * (1 - exp(-k/n))
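For what it's worth, the exact expectation follows from linearity over the per-coupon indicator variables mentioned earlier: E[X] = n*(1 - (1 - 1/n)^k). The exponential form above is its large-n approximation. A quick comparison at n=10, k=5:

```python
from math import exp

n, k = 10, 5
exact = n * (1 - (1 - 1 / n)**k)   # linearity of expectation over indicators
approx = n * (1 - exp(-k / n))     # the rule of thumb above
print(exact, approx)
```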

Required number of pulls, k, to reach target E[X]

Solving the above for k, we have:

k = -n*log(1 - E[X]/n)

Note that X <= n in all cases.
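This inversion round-trips with the expectation rule of thumb above; a one-line Python check:

```python
from math import exp, log

n, k = 10, 5
ex = n * (1 - exp(-k / n))       # approximate E[X] after k draws
k_back = -n * log(1 - ex / n)    # solve back for k
print(k_back)
```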

Find k such that P(X < n) ~= p

This is useful when trying to minimize the probability that you have not collected all coupons, and when trying to place bounds on that probability. For rigor, you'll need to verify this using the full (or approximate) PMF, since I pulled this out of nowhere. This estimate appears to work well for small p only.

k = n*log(n/p)
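A check against the exact PMF (using inclusion-exclusion for the surjection count) suggests the estimate lands in the right ballpark and errs on the safe side, at least in the small case below; the names and values are my own illustration:

```python
from math import comb, log

def p_all_collected(n, k):
    # P(X = n) = (# surjections from k draw positions onto n types) / n^k.
    surj = sum((-1)**j * comb(n, j) * (n - j)**k for j in range(n + 1))
    return surj / n**k

n, p = 5, 0.05
k = round(n * log(n / p))            # heuristic k; here 23
p_not_all = 1 - p_all_collected(n, k)
print(k, p_not_all)
```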

Find k such that Mode(X) = m

When m is near the middle of the range [1, n] and far from the extremes, the PMF looks normal-ish (though the tail weights are often way off), with a clear mode in the center. You can adjust the mode of X using:

k = n*log(n/(n - m))

The mode trick appears to work even near the extremes (as m approaches 1 or n), even though the PMF is heavily skewed.

Note that k -> infinity as m -> n. However, this formulation does not require m to be an integer, so you can set m arbitrarily close to n (within numerical stability). As long as m > n - 1, n should be the mode. Driving m closer to n just increases P(X = n).
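A spot check of the mode formula against the exact PMF, with the illustrative values n=10, m=7 (the constant C(n,x)*surjections factor below is the exact sequence count, so the 1/n^k normalization can be dropped when comparing):

```python
from math import comb, log

def count_sequences(n, k, x):
    # Sequences of k draws using exactly x of the n types:
    # C(n, x) times the number of surjections onto x labels.
    surj = sum((-1)**j * comb(x, j) * (x - j)**k for j in range(x + 1))
    return comb(n, x) * surj

n, m = 10, 7
k = round(n * log(n / (n - m)))      # predicted draws for mode m
mode = max(range(1, n + 1), key=lambda x: count_sequences(n, k, x))
print(k, mode)
```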

Generalizations

This solution to a generalization of the problem described here shows how to compute the exact distribution where each coupon type i has a different probability, p_i, of being drawn. It involves manually building up partial solutions based on ordered partitions (dynamic programming). I believe there is no closed-form solution.
