Last active
June 30, 2019 06:17
-
-
Save m00nlight/bfe54d1b2db362755a3a to your computer and use it in GitHub Desktop.
Python reservoir sampling algorithm
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
from random import randrange | |
def reservoir_sampling(items, k): | |
""" | |
Reservoir sampling algorithm for large sample space or unknow end list | |
See <http://en.wikipedia.org/wiki/Reservoir_sampling> for detail> | |
Type: ([a] * Int) -> [a] | |
Prev constrain: k is positive and items at least of k items | |
Post constrain: the length of return array is k | |
""" | |
sample = items[0:k] | |
for i in range(k, len(items)): | |
j = randrange(1, i + 1) | |
if j <= k: | |
sample[j - 1] = items[i] | |
return sample |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
@tommyod. Thanks for pointing out the problem, fix the bug. The idea of reservoir_sampling is that we don't know the length of the items or the items are too large to hold in memory. Yes, if the len(items) is know and can be hold in memory, the algorithm is useless.