Suppose that you have a pool of n different types of coupons (one of each) and want to sample k coupons with replacement. What is the distribution of the random variable X, which counts the number of distinct coupons observed in each k-sized sample? I was unable to come up with a closed-form solution on my own, instead resorting to simulation and brute force. This captures a solution I was able to find by searching online and putting the pieces together. I've written the equations below in approximate Wolfram language syntax, since math typesetting is not available.
This problem can be considered a variation on the classic coupon collector's problem. (This is the reason I chose the formulation above.) However, instead of asking for the expected number of trials to collect c coupons, we're asking for the distribution of X, the number of distinct coupons we observe. Finding the expectation has a well-known solution, which exploits the linearity of expectation over indicator variables for each coupon type, i. That problem is actually more general than the question we consider here, since it allows each coupon to have a distinct probability, p_i, of being pulled. In our problem, we assume that each coupon has probability 1/n. On the other hand, we want the full PMF as opposed to a simple expectation.
Let's look at the probability of observing X=x distinct coupons out of k draws. First, we know that there are n^k different sequences we might draw. We can immediately see that the probability of observing just a single unique coupon (we only consider a positive support) is n/n^k = 1/n^(k-1). This is because there are exactly n different sequences that result in just a single unique coupon: one for each coupon type, where all k draws are that same type. Similarly, it's easy to reason about the probability of observing exactly k distinct coupons. There are (n)_k = n!/(n-k)! different sequences which result in k distinct coupons, where (n)_k is the falling factorial. Therefore,
P(X=k) = (n)_k/n^k
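Both boundary cases are easy to spot-check numerically; for example, in R, with the n=10, k=5 values used in the worked example further below:

n <- 10; k <- 5
1 / n^(k - 1)              # P(X=1) = 1e-04
prod(n:(n - k + 1)) / n^k  # P(X=k) = (n)_k/n^k = 0.3024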
However, things get more difficult as soon as we consider some x between 1 and k. A simple starting point is X=2. There are Binomial[n, 2] ways to choose the 2 coupon types that we are considering. From there, we can consider all of the ways to create ordered partitions of k objects into 2 non-empty subsets. For each "break", we have to consider the number of permutations that correspond to the target partition. In other words, we need to count each unordered partition multiple times if multiple permutations give rise to that same partition (i.e., we cannot use stars and bars counting to arrive at a general solution). For this special case, we can easily sum over the different partition points. Each unordered partition with i observations of the first coupon type (and k-i observations of the second coupon type) can arise in Binomial[k, i] ways. In total, we have:
P(X=2) = Binomial[n, 2]*Sum[Binomial[k, i], {i, 1, k-1}]/n^k
Note that the sum simplifies to 2^k - 2: out of the 2^k ways to assign one of the two types to each draw, we exclude only the two assignments that use a single type.
As mentioned above, we cannot directly extend the solution for X=2 by counting unordered partitions, because we need to consider the permutation factor for each set of partitioning points. However, this is almost exactly the quantity counted by Stirling numbers of the second kind. StirlingS2[k, x] counts the number of ways to partition k objects into x non-empty subsets, where the subsets themselves are exchangeable. This does not quite work for us because we want to count the number of individual permutations that give rise to a given count, X=x. To account for the non-exchangeability of the different partition buckets (each bucket corresponds to a specific coupon type), we have to multiply the Stirling counts by x!. For example, StirlingS2[5, 2] = 15, and 2!*15 = 30 = 2^5 - 2 recovers the sum from the X=2 case above.
To count the total number of ways to get X=x, we multiply the following factors: 1) the number of ways to choose the x unique coupon types out of the n types, 2) the number of ways to partition the k draws into x exchangeable non-empty subsets, and 3) the number of permutations over the x buckets. Putting this all together, we have the full PMF of X:
P(X=x) = Binomial[n, x]*StirlingS2[k, x]*x!/n^k
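Before tabulating values, here's a small R sketch of this exact PMF. Base R doesn't provide Stirling numbers, so it uses the inclusion-exclusion identity x!*StirlingS2[k, x] = Sum[(-1)^j*Binomial[x, j]*(x - j)^k, {j, 0, x}], the number of surjections from the k draws onto x labeled buckets; the helper name exact_pmf is my own:

exact_pmf <- function(n, k, x) {
  sapply(x, function(xi) {
    j <- 0:xi
    # x!*StirlingS2[k, xi] via inclusion-exclusion (surjection count).
    surjections <- sum((-1)^j * choose(xi, j) * (xi - j)^k)
    choose(n, xi) * surjections / n^k
  })
}
exact_pmf(10, 5, 1:5)  # 0.0001 0.0135 0.1800 0.5040 0.3024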
The closed-form solution for n=10, k=5 gives the following PMF:

x | P(X=x) |
---|---|
1 | 0.0001 |
2 | 0.0135 |
3 | 0.18 |
4 | 0.504 |
5 | 0.3024 |
And simulating in R:
library(purrr)
library(dplyr)
library(tibble)

simulate <- function(N, n, k) {
  # For each of N trials, count the distinct coupon types in k draws.
  x <- seq(1, N) %>% map_int(function(ignored) {
    sample.int(n, size=k, replace=TRUE) %>% unique() %>% length()
  })
  tibble(trial=seq_along(x), coupons=x)
}

simulate(1e6, 10, 5) %>% group_by(coupons) %>% summarize(p=n()/1e6)
  coupons        p
    <int>    <dbl>
1       1 0.000096
2       2 0.0137
3       3 0.180
4       4 0.504
5       5 0.303
Note that the analytic solution may require high- or even arbitrary-precision numbers for accurate results, depending on the range of n and k. To get around this, we can use the asymptotic approximation given on Wikipedia, due to N. M. Temme, "Asymptotic Estimates of Stirling Numbers", Studies in Applied Mathematics 89:233-243 (1993), Elsevier Science Publishing. To prevent over/underflow, we do most arithmetic in logspace.
log_stirling_approx <- function(n, k) {
  # Temme's asymptotic estimate of log(StirlingS2[n, k]). Note that it is
  # undefined at the boundary n == k (v = 1 makes log(v-1) diverge); the
  # calc version below special-cases the boundaries.
  v <- n / k
  g <- -lamW::lambertW0(-v*exp(-v))
  # Split the equation into log-factors for readability.
  lf1 <- 1/2*(log(v-1) - log(v) - log(1-g))
  lf2 <- (n-k)*(log(v-1) - log(v-g))
  lf3 <- n*log(k) - k*log(n) + k*(1-g)
  lf4 <- lchoose(n, k)
  lf1 + lf2 + lf3 + lf4
}
log_pmf <- function(n, k, x) {
lchoose(n, x) + log_stirling_approx(k, x) + lfactorial(x) - k*log(n)
}
pmf <- function(n, k, x) {
exp(log_pmf(n, k, x))
}
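As a rough usage sketch for interior values of x, where the approximation is defined (reusing the hypothetical exact_pmf helper from earlier):

pmf(10, 5, 2:4)        # roughly: 0.0137 0.182 0.507
exact_pmf(10, 5, 2:4)  # exactly: 0.0135 0.180 0.504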
The above depends on the third-party lamW package for the Lambert W function. To avoid this dependency, you can try to use a Lambert-W approximation or something like Newton's method to converge on a solution.
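For example, here's a minimal Newton's-method sketch for the principal branch W0 (a rough stand-in, not a match for lamW's accuracy or edge-case handling):

lambert_w0 <- function(z, tol = 1e-12, max_iter = 50) {
  # Solve w*exp(w) = z by Newton iteration. For the inputs used above,
  # z = -v*exp(-v) lies in [-1/e, 0), where w = 0 is a safe starting
  # point; convergence slows near the branch point z = -1/e.
  w <- 0
  for (i in seq_len(max_iter)) {
    f <- w * exp(w) - z
    w_next <- w - f / ((w + 1) * exp(w))
    if (abs(w_next - w) < tol) return(w_next)
    w <- w_next
  }
  w
}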
Here's a similar implementation in calc, an arbitrary precision calculator:
## This depends on the lambert-W, digamma, and gamma functions. To run this,
## you'll first need to load specialfunctions.cal and factorial2.cal, e.g.:
##
## read specialfunctions
## read factorial2
##
## Make sure that the precision (epsilon()) is fine enough to make results
## meaningful.
define log_choose(n, k) {
    ## calc does not provide ln(choose(n, k)) directly, so we compute it
    ## via lngamma, using lngamma(n+1) = ln(n!).
    return lngamma(n+1) - lngamma(k+1) - lngamma(n-k+1);
}
define log_stirling_approx(n, k) {
local v;
local g;
local lf1;
local lf2;
local lf3;
local lf4;
if (n == k || (n >= 1 && k == 1)) {
## stirling(n, k) = 1 => log(stirling(n, k)) = 0
return 0;
}
if ((k == 0 && n >= 1) || (k > n)) {
## stirling(n, k) = 0 => log(stirling(n, k)) = -infinity
return -(1/epsilon());
}
v = n / k;
g = -lambertw(-v*exp(-v), 0);
lf1 = 1/2*(ln(v-1) - ln(v) - ln(1-g));
lf2 = (n-k)*(ln(v-1) - ln(v-g));
lf3 = n*ln(k) - k*ln(n) + k*(1-g);
lf4 = log_choose(n, k);
return lf1 + lf2 + lf3 + lf4;
}
define log_pmf(n, k, x) {
    ## Binomial[n, x]*StirlingS2[k, x]*x!/n^k in logspace, where
    ## lngamma(x+1) is ln(x!).
    return log_choose(n, x) + log_stirling_approx(k, x) + lngamma(x+1) - k*ln(n);
}
define pmf(n, k, x) {
return exp(log_pmf(n, k, x));
}
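## Example usage (assuming epsilon() is small enough): pmf(10, 5, 4)
## should evaluate to roughly 0.507; the exact value is 0.504.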
## Approximation of the nth harmonic number.
define aharmonic(n) {
## NOTE: You could use harmonic from specialfunctions.cal, but this performs
## poorly for large numbers and large precisions. psi() is just the digamma
## function.
return ln(n) - psi(1) + 1/2/n - 1/12/n^2;
}
## Probability that _all_ coupons have been collected after drawing the
## expected number of coupons required to collect all coupons. Seems to
## approach 0.570376..., which is definitely _not_ the Euler-Mascheroni
## constant itself, but it does match exp(-exp(-gamma)) = 0.570376...,
## consistent with the Erdos-Renyi limit law
## P(T <= n*ln(n) + c*n) -> exp(-exp(-c)) evaluated at c = gamma.
define prob(n) {
    return pmf(n, n*aharmonic(n), n);
}
I state the following without proof or justification. These are just trends that I've noticed based on sampling and manual pattern matching. They may not hold in all regimes.
In all of the following statements, let X be the random variable of how many coupons you collect after k draws, from a pool of n coupons (with replacement).
E[X/n] = 1 - exp(-k/n)
For the actual number of coupons, rather than the fraction:
E[X] = n * (1 - exp(-k/n))
Solving the above for k, we have:
k = -n*log(1 - E[X]/n)
Note that X <= n in all cases.
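One way to sanity-check this trend against the exact distribution (reusing the hypothetical exact_pmf helper from earlier; for fixed k/n the two values converge as n grows):

n <- 10; k <- 5
sum((1:n) * exact_pmf(n, k, 1:n))  # exact E[X]: 4.0951
n * (1 - exp(-k / n))              # trend:      3.9347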
This is useful when trying to bound the probability, p, that you have not collected all n coupons after k draws. For rigor, you'll need to verify this using the full (or approximate) PMF, since I pulled it out of nowhere. This estimate appears to work well for small p only:
k = n*log(n/p)
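For instance, one way to verify it for a single setting (again with the exact_pmf sketch):

n <- 10; p <- 0.01
k <- ceiling(n * log(n / p))  # 70 draws
1 - exact_pmf(n, k, n)        # P(not all collected): ~0.006, below p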
Let m be a target mode for X. When m is near the middle of the range [1, n] and far from the extremes, the PMF looks normal-ish (though the tail weights are often way off), with a clear mode in the center. You can adjust the mode of X using:
k = n*log(n/(n - m))
The mode trick appears to work even near the extremes (as m approaches 1 or n), even though the PMF is heavily skewed there.
Note that k -> infinity as m -> n. However, this formulation does not require m to be an integer, so you can set m arbitrarily close to n (within numerical stability). As long as m > n - 1, n should be the mode. Driving m closer to n just increases P(X = n).
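As a sketch of how to spot-check the mode trick with the exact_pmf helper:

n <- 10; m <- 7
k <- round(n * log(n / (n - m)))  # 12 draws
which.max(exact_pmf(n, k, 1:n))   # mode lands at 7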
This solution to a generalization of the problem described here shows how to compute the exact distribution where each coupon type i has a different probability, p_i, of being drawn. It involves manually building up partial solutions based on ordered partitions (dynamic programming). I believe there is no closed-form solution in that setting.