Skip to content

Instantly share code, notes, and snippets.

@robcowie
Created August 27, 2012 11:29
Show Gist options
  • Save robcowie/3487702 to your computer and use it in GitHub Desktop.
Save robcowie/3487702 to your computer and use it in GitHub Desktop.
Maintain a random sample from a sequence of unknown length (reservoir sampling)
from random import randint
from itertools import islice
def random_sample(sample_size, items):
"""Maintain an evenly distributed random sample of a population of
unknown size
The Reservoir Sampling Algorithm
(http://gregable.com/2007/10/reservoir-sampling.html)
"""
items = iter(items)
sample = list(islice(items, sample_size))
yield sample
for num, item in enumerate(items, start=sample_size):
try:
sample[randint(0, num)] = item
except IndexError:
pass
yield sample
for i in random_sample(10, xrange(10000)):
print i
@paxan
Copy link

paxan commented Sep 23, 2013

Nit: might be faster to avoid exception handling.

...
n = len(sample)
...
for ...
    k = randint(0, num)
    if k < n:
        sample[k] = item
    yield sample

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment