Last active
November 16, 2025 11:41
how much unique data is there in a bootstrap?
---
title: Bootstraps and unique data
author: David L Miller
date: 2025/11/16
execute:
  keep-md: true
  cache: true
knitr:
  opts_chunk:
    fig.path: ./
---
"The bootstrap draws $N$ times uniformly with replacement from a dataset with $N$ items. The probability an item is picked at least once is $1 - (1 - 1/N)^N$, which for large $N$ becomes $1 - e^{-1} \approx 0.632$. Hence, the number of unique data points in a bootstrap sample is $0.632 N$ on average."

From [this paper](http://arxiv.org/abs/1612.01474).
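
The quoted result can be sketched as follows. Here $U$ is a symbol introduced only for this sketch (it is not in the paper's quote) and denotes the number of distinct items in one bootstrap sample. Each of the $N$ draws misses a given item with probability $1 - 1/N$, and the expected count of distinct items follows by linearity of expectation:

$$
\begin{aligned}
P(\text{item } i \text{ never picked}) &= (1 - 1/N)^N \\
E[U] &= \sum_{i=1}^{N} P(\text{item } i \text{ picked at least once}) = N\left(1 - (1 - 1/N)^N\right) \\
\lim_{N \to \infty} (1 - 1/N)^N &= e^{-1} \quad\Rightarrow\quad E[U]/N \to 1 - e^{-1} \approx 0.632
\end{aligned}
$$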

"...for large $N$..." How big does $N$ need to be?

```{r, how-big-is-N}
library(ggplot2)

# probability that a given item appears at least once in a bootstrap sample of size N
pickp <- function(N) 1 - (1 - 1/N)^N

N <- 100:10000
pickp_dat <- data.frame(N = N, p = pickp(N))

ggplot(pickp_dat) +
  geom_line(aes(x = N, y = p)) +
  # dashed red line marks the large-N limit, 1 - exp(-1)
  geom_hline(yintercept = 1 - exp(-1), linetype = 2, colour = "red") +
  labs(x = "Sample size", y = "Probability of inclusion at least once") +
  theme_minimal()
```
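
To put numbers on the curve, here is a minimal sketch that tabulates the gap between the exact probability and the limit at a few sample sizes. The chunk label, the sample sizes in `Ns` and the `gap_dat` data frame are illustrative choices, not part of the original analysis. The `approx_gap` column uses the first-order expansion $(1 - 1/N)^N \approx e^{-1}\left(1 - \frac{1}{2N}\right)$, so the gap should shrink roughly like $e^{-1}/(2N)$.

```{r gap-to-limit}
# same inclusion probability as above, redefined so this chunk runs on its own
pickp <- function(N) 1 - (1 - 1/N)^N
limit <- 1 - exp(-1)

# illustrative sample sizes (an assumption, chosen to span several orders of magnitude)
Ns <- c(10, 100, 1000, 10000)

gap_dat <- data.frame(N = Ns,
                      p = pickp(Ns),
                      gap = pickp(Ns) - limit,          # exact distance from the limit
                      approx_gap = exp(-1) / (2 * Ns))  # first-order approximation
gap_dat
```

The gap is positive at every $N$, so the limit is approached from above, and it is already below 0.002 at $N = 100$.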

Empirically, what does that look like?

```{r vis-inclusion}
n <- 10000
# one row per bootstrap sample, one column per item;
# entries count how often each item was drawn
# (a 10000 x 10000 numeric matrix needs roughly 800 MB of memory)
vismat <- matrix(0, n, n)
for(i in seq_len(n)){
  this_sample <- sample.int(n, n, replace=TRUE)
  # tabulate() counts the draws of each item 1..n, with zeros for items never drawn
  vismat[i, ] <- tabulate(this_sample, nbins=n)
}
# sort the counts within each bootstrap sample (each sample becomes a column);
# left in the original item order, the image would just look like static
vismat <- apply(vismat, 1, sort)
# items that were never drawn are left blank in the plot
vismat[vismat==0] <- NA
# rotate 90 degrees clockwise
vismat <- t(apply(vismat, 2, rev))
```
```{r plot-inc, fig.width=4, fig.height=7}
library(viridis)
# colour encodes how many times an item was drawn; NA (never drawn) stays blank
image(z=vismat, col=viridis_pal()(11), xlab="", ylab="", axes=FALSE)
axis(2)
```
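
As a numerical companion to the image, a smaller simulation can check that the average fraction of unique items in a bootstrap sample matches $1 - (1 - 1/N)^N$. This is a sketch only: the chunk label, the seed, `N = 1000` and `n_boot = 2000` are arbitrary choices made here, not values from the original.

```{r unique-fraction}
set.seed(1)     # arbitrary seed, for reproducibility only
N <- 1000       # illustrative sample size (assumption)
n_boot <- 2000  # illustrative number of bootstrap replicates (assumption)

# fraction of distinct items in each bootstrap sample
frac_unique <- replicate(
  n_boot,
  length(unique(sample.int(N, N, replace = TRUE))) / N
)

# simulated average next to the exact value
c(simulated = mean(frac_unique), exact = 1 - (1 - 1/N)^N)
```

The two numbers should agree to about three decimal places, and both sit near 0.632.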
Plots!

