stratified <- function(df, group, size, select = NULL,
                       replace = FALSE, bothSets = FALSE) {
  # Optionally keep only the rows that match every condition in 'select'
  if (!is.null(select)) {
    if (is.null(names(select))) stop("'select' must be a named list")
    if (!all(names(select) %in% names(df)))
      stop("Please verify your 'select' argument")
    temp <- sapply(names(select),
                   function(x) df[[x]] %in% select[[x]])
    df <- df[rowSums(temp) == length(select), ]
  }
  # One stratum per observed combination of the grouping variables
  df.interaction <- interaction(df[group], drop = TRUE)
  df.table <- table(df.interaction)
  df.split <- split(df, df.interaction)
  if (length(size) > 1) {
    # A vector of sizes means one sample size per stratum
    if (length(size) != length(df.split))
      stop("Number of groups is ", length(df.split),
           " but number of sizes supplied is ", length(size))
    if (is.null(names(size))) {
      n <- setNames(size, names(df.split))
      message(sQuote("size"), " vector entered as:\n\nsize = structure(c(",
              paste(n, collapse = ", "), "),\n.Names = c(",
              paste(shQuote(names(n)), collapse = ", "), ")) \n\n")
    } else if (all(names(size) %in% names(df.split))) {
      n <- size[names(df.split)]
    } else {
      stop("Named vector supplied with names ",
           paste(names(size), collapse = ", "),
           "\n but the names for the group levels are ",
           paste(names(df.split), collapse = ", "))
    }
  } else if (size < 1) {
    # A single value below 1 is a sampling fraction applied to each stratum
    n <- round(df.table * size, digits = 0)
  } else if (size >= 1) {
    # A single value of 1 or more is a fixed number of rows per stratum
    if (all(df.table >= size) || isTRUE(replace)) {
      n <- setNames(rep(size, length.out = length(df.split)),
                    names(df.split))
    } else {
      message(
        "Some groups\n---",
        paste(names(df.table[df.table < size]), collapse = ", "),
        "---\ncontain fewer observations",
        " than desired number of samples.\n",
        "All observations have been returned from those groups.")
      # Cap each stratum at its own row count
      n <- setNames(pmin(as.vector(df.table), size), names(df.table))
    }
  }
  # Draw the requested number of rows from each stratum
  temp <- lapply(
    names(df.split),
    function(x) df.split[[x]][sample(df.table[x],
                                     n[x], replace = replace), ])
  set1 <- do.call("rbind", temp)
  if (isTRUE(bothSets)) {
    # SET2 holds the rows that were not sampled into SET1
    set2 <- df[!rownames(df) %in% rownames(set1), ]
    list(SET1 = set1, SET2 = set2)
  } else {
    set1
  }
}
This is genius. Thanks!
Thank you for posting this! It's just what I needed! I am using this function to prepare some data for my research. Do you have a preferred citation so I can cite your function in my paper?
Hi there! Great function! I am using it to randomly select a subset of species observations so that we can verify the species identification of a randomly selected subset. So, I can easily stratify by species. However, if I want the observations to also be stratified by transect so that if possible, the species checked are from different transects, it becomes more complicated. I tried: size=c("TransectName"=1,"SpeciesName"=5)
to choose 5 observations of each species, each from a different transect, but this didn't work.
Error message: (Error in stratified(df, group = c("SpeciesName", "TransectName"), size = c(TransectName = 1, : Number of groups is 508 but number of sizes supplied is 2)
It gets more complicated, because if it isn't possible to get 5 observations from different transects, they can come from the same transect.
Any ideas on how I would accomplish this?
If not, I'll drop the stratification by transect.
Thanks!!
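A note on the error above: when `size` has more than one element, the function reads it as one sample size per stratum (here, the 508 species-by-transect combinations), not one size per grouping variable. The "prefer different transects, fall back to the same transect" idea could be sketched in base R roughly as follows. `sample_spread` and the toy `obs` data are hypothetical names made up for illustration; they are not part of `stratified()`, and this is only one way to approach it.

```r
# Hypothetical helper (not part of stratified()): pick up to k rows per
# species, preferring rows from distinct transects before reusing a transect.
sample_spread <- function(df, species = "SpeciesName",
                          transect = "TransectName", k = 5) {
  picked <- lapply(split(df, df[[species]], drop = TRUE), function(d) {
    d <- d[sample(nrow(d)), , drop = FALSE]  # shuffle rows within the species
    # 1 = first row seen for its transect, 2 = second row, and so on
    rank_in_transect <- ave(seq_len(nrow(d)), d[[transect]], FUN = seq_along)
    # order() is stable, so the shuffled order is kept within each rank
    d <- d[order(rank_in_transect), , drop = FALSE]
    head(d, k)
  })
  do.call(rbind, picked)
}

# Toy data, assumed for illustration only
set.seed(1)
obs <- data.frame(
  SpeciesName  = c(rep("A", 12), rep("B", 8), rep("C", 3)),
  TransectName = c(rep(paste0("T", 1:6), each = 2),
                   rep(c("T1", "T2"), each = 4),
                   rep("T9", 3)),
  stringsAsFactors = FALSE
)
res <- sample_spread(obs, k = 5)
```

Species with at least five transects get five different transects; species with fewer transects reuse them, and species with fewer than five observations return everything they have.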
Great function! I have a case in which I am not sure it could be used.
Let's say I have a dataset x and a dataset y. Dataset x contains N observations and dataset y contains M observations, where N > M. Both datasets contain the same k variables. I want to draw from dataset x a sample that is representative of dataset y.
Is it possible to use the function and specify the proportions found in dataset y for the categorical variables 1) Size and 2) Sector?
stratified(Dataset x, c("Size", "Sector"), ...)
Thanks very much!
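Since the function accepts a named `size` vector with one entry per stratum, one possible approach is to compute those per-stratum counts from y's proportions first. The sketch below is an assumption about how that might look; `sizes_like`, the `total` argument, and the toy data are hypothetical and not part of the gist.

```r
# Hypothetical helper: per-stratum sample sizes for x that mirror y's mix.
sizes_like <- function(x, y, group, total) {
  x_cells <- interaction(x[group], drop = TRUE)
  y_cells <- interaction(y[group], drop = TRUE)
  y_share <- prop.table(table(y_cells))        # share of each stratum in y
  # Start every stratum of x at 0, then fill in strata that also occur in y
  n <- setNames(numeric(nlevels(x_cells)), levels(x_cells))
  shared <- intersect(names(n), names(y_share))
  n[shared] <- round(total * as.vector(y_share[shared]))
  # Cap at what x actually contains (sampling without replacement)
  pmin(n, as.vector(table(x_cells)[names(n)]))
}

# Toy data, assumed for illustration only
x <- data.frame(Size   = rep(c("S", "L"), each = 20),
                Sector = rep(c("A", "B"), times = 20))
y <- data.frame(Size   = c(rep("S", 7), rep("L", 3)),
                Sector = c(rep("A", 5), rep("B", 2), rep("A", 3)))
n <- sizes_like(x, y, c("Size", "Sector"), total = 10)

# With stratified() loaded, the call would then be along the lines of:
# stratified(x, c("Size", "Sector"), size = n)
```

Because of rounding, the counts may not add up exactly to `total`, and strata that occur in y but not in x cannot be represented at all.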
Great function! @mrdwab Could you please provide an official citation guide for citing your function/package? Thank you!
thank you!
Great function! Thanks a lot!
Great. Thank you very much for that.
awesome!
Hi, thank you for the amazing code. I have a question about using multiple columns to create strata.
Your example "stratified(dat1, c("E", "D"), size = 0.15)" uses "E" and "D", which are both categorical columns. I was wondering whether multiple numerical columns can be used as well. Could you please advise?
In other words, should stratified(dat1, c("B", "C"), size = 0.15) return some output?
Thanks in advance.
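Judging from the code, the grouping columns are passed to `interaction()`, so every distinct numeric value would become its own stratum. With continuous data that usually means strata of one row each. A common workaround is to bin the numeric columns first and stratify on the bins. The sketch below assumes quartile bins; `bin`, `B_bin`, `C_bin`, and the toy `dat` are hypothetical names for illustration.

```r
# Toy data with two numeric columns, assumed for illustration only
set.seed(42)
dat <- data.frame(B = rnorm(200), C = runif(200))

# Hypothetical helper: cut a numeric vector into k quantile-based bins.
# Note: cut() stops with an error if heavy ties make the breaks non-unique.
bin <- function(v, k = 4) {
  cut(v,
      breaks = quantile(v, probs = seq(0, 1, length.out = k + 1)),
      include.lowest = TRUE,
      labels = paste0("Q", seq_len(k)))
}

dat$B_bin <- bin(dat$B)
dat$C_bin <- bin(dat$C)

# With stratified() loaded, the binned columns can then act as strata:
# stratified(dat, c("B_bin", "C_bin"), size = 0.15)
```

The number of bins is a judgment call: more bins track the numeric distribution more closely but leave fewer rows in each stratum.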
Hi, I tried to load the function using the following commands:
library(devtools)
source_gist("https://gist.github.com/mrdwab/6424112")
But, I got the following error:
Error in r_files[[which]] : invalid subscript type 'closure'
Really appreciate your help to fix this. This is exactly the function that I have been looking for and desperately need to use it.
Wow this is exactly what I need! Thank you so much!
By the way, is there a way to apply population weights for the sampling?
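The function as posted has no weight argument. If "population weights" means that rows with a larger weight should be more likely to be drawn within their stratum, base R's `sample.int()` accepts a `prob` vector, so a separate helper could look roughly like this. `weighted_by_group`, the column names, and the toy data are hypothetical, and this is a sketch of one interpretation, not a change to `stratified()`.

```r
# Hypothetical helper: draw up to n rows per group, with each row's chance
# of selection proportional to its value in the 'weight' column.
weighted_by_group <- function(df, group, weight, n) {
  parts <- lapply(split(df, df[[group]], drop = TRUE), function(d) {
    k <- min(n, nrow(d))  # small groups return all of their rows
    # Without replacement, this needs at least k rows with positive weight
    idx <- sample.int(nrow(d), size = k, prob = d[[weight]])
    d[idx, , drop = FALSE]
  })
  do.call(rbind, parts)
}

# Toy data, assumed for illustration only
set.seed(123)
pop <- data.frame(
  region = rep(c("north", "south"), times = c(10, 3)),
  w      = c(0, 1:9, 2, 2, 2)  # hypothetical weights; first row has weight 0
)
picked <- weighted_by_group(pop, "region", "w", n = 4)
```

Rows with a weight of zero are never selected, and the weights only affect selection within a group, not how many rows each group contributes.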
Thanks so much for this code, it works perfectly.
Hi there Ananda,
How do I make attribution to your article?
Such as citing the material. This is top stuff, indeed.
Dear mrdwab,
I hope this finds you well.
This is really great code.
I would like to ask you a question about
stacked regression, as described in Breiman's article:
http://statistics.berkeley.edu/sites/default/files/tech-reports/367.pdf
Do you have any code for that, please?