Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Save krackers/2fe4a30b8bfe739fed427189224a7834 to your computer and use it in GitHub Desktop.
Save krackers/2fe4a30b8bfe739fed427189224a7834 to your computer and use it in GitHub Desktop.
How to prove reservoir sampling?
Say we want to generate a set of s elements and that we have already seen n>s elements.
Let's assume that our current s elements have already each been chosen with probability s/n.
By the definition of the algorithm, we choose element n+1 with probability s/(n+1).
Each element already part of our result set has a probability 1/s of being replaced.
The probability that an element from the n-seen result set is replaced in the n+1-seen result set is therefore (1/s)*s/(n+1)=1/(n+1). Conversely, the probability that an element is not replaced is 1-1/(n+1)=n/(n+1).
Thus, the n+1-seen result set contains an element either if it was part of the n-seen result set and was not replaced---this probability is (s/n)*n/(n+1)=s/(n+1)---or if the element was chosen---with probability s/(n+1).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment