@mikelove
Last active November 25, 2015 15:44
dplyr's summarize_each much faster than by()
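# Test data: 50,000 rows in 25,000 groups of 2 rows each, n columns
n <- 50
m <- matrix(1:(50000*n),ncol=n)
f <- factor(rep(1:25000,each=2))
# base R by(): one colSums call per group, results row-bound together
system.time({ z <- do.call(rbind, by(m, f, colSums)) })
# 16.3 seconds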
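# dplyr: group by f, then sum every remaining column
# (summarize_each() and funs() are deprecated in later dplyr releases;
#  summarize(across(everything(), sum)) is the newer spelling)
library(dplyr)
d <- as.data.frame(cbind(f,m))
system.time({ d %>% group_by(f) %>% summarize_each(funs(sum)) })
# 0.137 seconds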
# Mark's code: split the row indices by group, then colSums each subset
inds <- split(1:nrow(m), f)
system.time( zz <- t(sapply(inds, function(u) colSums(m[u,,drop=FALSE]))) )
# 1.01 seconds
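For comparison, base R also ships `rowsum()`, which computes per-group column sums of a matrix directly. The sketch below is not part of the original benchmark and claims no timing. It only checks, on deliberately smaller data (the sizes `n <- 5` and `groups <- 1000` are assumptions chosen so it runs quickly), that `rowsum()` gives the same values as the split/sapply approach above.

```r
# Smaller sizes than the benchmark above (assumed values, for a quick check)
n <- 5
groups <- 1000
m <- matrix(1:(2 * groups * n), ncol = n)
f <- factor(rep(1:groups, each = 2))

# rowsum() sums the rows of m within each level of f, column by column
rs <- rowsum(m, f)

# Reference result using the split/sapply approach
inds <- split(1:nrow(m), f)
zz <- t(sapply(inds, function(u) colSums(m[u, , drop = FALSE])))

# Dimnames and storage mode (integer vs double) may differ, so compare values only
same <- isTRUE(all.equal(unname(rs), unname(zz), check.attributes = FALSE))
print(same)
```

To time it on the full-size data, wrap the call as in the other examples: `system.time( rs <- rowsum(m, f) )`.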