Here's another comparison between dplyr and data.table on relatively large data, with varying numbers of unique groups.
require(dplyr)
require(data.table)
N = 20e6L # 20 million rows, UPDATE: also ran for 50 million rows (see table below)
K = 10L # other values tested are 25L, 50L, 100L
DT <- data.table(Insertion = sample(K, N, TRUE),
                 Unit      = sample(paste("V", 1:K, sep=""), N, TRUE),
                 Channel   = sample(K, N, TRUE),
                 Value     = runif(N))
DF <- as.data.frame(DT)
cols = c("MeanValue", "Residuals")
system.time(ans1 <- DT[, (cols) := { m = mean(Value); list(m, Value-m) },
                       by=list(Insertion, Unit, Channel)])
system.time(ans2 <- DF %>% group_by(Insertion, Unit, Channel) %>%
                      mutate(MeanValue = mean(Value), Residuals = Value-MeanValue))
# all.equal(ans1, ans2, check.attributes=FALSE) # [1] TRUE
# timings are in seconds:
#   N    K    ~groups  data.table  dplyr
# 20m   10      1,000        4.31   6.60
# 20m   25     15,625        5.14   8.76
# 20m   50    125,000        6.82  20.11
# 20m  100  1,000,000       12.56  42.45
# 50m   10      1,000       12.01  17.54
# 50m   25     15,625       17.61  29.12
# 50m   50    125,000       19.41  56.00
# 50m  100  1,000,000       26.37  84.05
# ...see TODO...
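For reference, the ~groups column is roughly K^3, since the three grouping columns are sampled independently from K values each. A quick sanity check (a sketch, not part of the timed runs):

# count the actual number of groups in the current DT; should be close to K^3
nrow(DT[, .N, by=list(Insertion, Unit, Channel)])
# or, equivalently (uniqueN is available in more recent data.table versions):
uniqueN(DT, by=c("Insertion", "Unit", "Channel"))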
TODO: Investigate timings on the same set of unique groups, but with increasing data sizes as well: 100m, 500m, 1b rows (m=million, b=billion).
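A rough sketch of how such a run could look (the sizes, loop and printing are illustrative assumptions, not results; note that 1b rows of this layout needs tens of GB of RAM, so memory is the practical limit):

# sketch for the TODO above: fix K (hence the group count) and grow N
require(data.table)
require(dplyr)
K = 100L
for (N in c(100e6L, 500e6L, 1000e6L)) {   # 100m, 500m, 1b rows
    DT <- data.table(Insertion = sample(K, N, TRUE),
                     Unit      = sample(paste("V", 1:K, sep=""), N, TRUE),
                     Channel   = sample(K, N, TRUE),
                     Value     = runif(N))
    DF <- as.data.frame(DT)
    cols = c("MeanValue", "Residuals")
    t1 = system.time(DT[, (cols) := { m = mean(Value); list(m, Value-m) },
                        by=list(Insertion, Unit, Channel)])["elapsed"]
    t2 = system.time(DF %>% group_by(Insertion, Unit, Channel) %>%
                       mutate(MeanValue = mean(Value), Residuals = Value-MeanValue))["elapsed"]
    cat(sprintf("N = %.0fm: data.table %.2fs, dplyr %.2fs\n", N/1e6, t1, t2))
}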
PS: Note that the mean here is not yet optimised in data.table to run on GForce. When that's implemented, this benchmark should be updated.
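For the curious, one way to check whether GForce kicks in for a given call is data.table's verbose output (a sketch; the exact message wording can vary across versions):

options(datatable.verbose=TRUE)
# a simple aggregation in j like this is the form GForce targets; once mean is
# GForce-optimised, the verbose output should report the optimised j
DT[, list(MeanValue = mean(Value)), by=list(Insertion, Unit, Channel)]
# the benchmarked call uses := with a braced, multi-statement j, which is not GForce-optimised
options(datatable.verbose=FALSE)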