Generating some data first:
# R version 3.3.0
require(data.table) ## 1.9.7, commit 2433, github
require(dplyr) ## devel, commit 3189, github
set.seed(45L)
DT = data.table(x = sample(1e4, 5e7, TRUE),   # 5e7 = 50 million rows
                y = sample(letters, 5e7, TRUE))
DF = as.data.frame(DT)
When we do:
options(datatable.auto.index = FALSE)
DF[DF$x %in% 4:10, ] ## (a) ---- 1.398 seconds
# or
DT[x %in% 4:10, ] ## (b) ---- 1.202 seconds
# or
DT[x %between% c(4L, 10L), ] ## (c) ---- 0.572 seconds
# or
filter(DF, x %in% 4:10) ## (d) ---- 1.215 seconds
# or
filter(DF, dplyr::between(x, 4L, 10L)) ## (e) ---- 0.556 seconds
we perform a vector scan. That is, the column x is searched for all the values that occur in the closed interval [4, 10] by comparing each and every row. Doing so also requires allocating a logical vector of length nrow(DF).
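To make that concrete, here is roughly what the vector scan amounts to when spelled out in base R (a sketch; idx is just an illustrative name):
idx <- DF$x >= 4L & DF$x <= 10L  # one full pass over all 5e7 rows
str(idx)                         # logi [1:50000000] -- allocated on every subset
ans <- DF[idx, ]                 # then gather the matching rows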
The syntax shown in (b) is also a vector scan, but using data.table. It is identical to the data.frame syntax, minus the DF$. dplyr's between() implements the logic x >= . & x <= . in C++, IIUC. Note that I show %in% here because, for integer ranges, it is straightforward (and efficient) to use %in% rather than x >= . & x <= .
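For this integer range, all three predicates select the same rows; a quick sanity check (with auto indexing still disabled, so all of these are plain vector scans):
identical(DT[x %in% 4:10], DT[x %between% c(4L, 10L)])
# [1] TRUE
identical(DT[x %in% 4:10], DT[x >= 4L & x <= 10L])
# [1] TRUE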
Absolutely. We can do better with data.table, using a binary search based subset. It is substantially faster, as its computational complexity is O(log n) as opposed to O(n), where n = nrow(DF).
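To see why sorted order helps, here is a toy binary search in plain R (an illustration only; data.table's actual binary search, bmerge, is implemented in C, and bsearch_first is just a hypothetical helper):
bsearch_first <- function(v, target) {
  # v must be sorted; halve the candidate range until one position remains
  lo <- 1L; hi <- length(v)
  while (lo < hi) {
    mid <- (lo + hi) %/% 2L
    if (v[mid] < target) lo <- mid + 1L else hi <- mid
  }
  if (length(v) && v[lo] == target) lo else NA_integer_
}
v <- sort(sample(1e4, 100L))
bsearch_first(v, v[42L])  # found in ~log2(100) comparisons instead of up to 100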
In versions <= 1.9.2, the only requirement for using this feature of data.table was to setkey() on the column prior to the fast subset. For example (illustrated on a copy here):
DT2 = copy(DT)
setkey(DT2, x) ## 4.9 seconds
DT2[.(4:10)] ## 0.002 seconds
This is especially useful in the case of repeated subsets. Why? Because we have to setkey() just once, whereas a vector scan goes through all the rows of the data.frame or data.table each time we subset.
Note that most of the 4.9 seconds is spent not in finding the order, but in reordering the data by reference, in a memory-efficient manner.
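Once the key is set, every further subset is a binary search; for instance:
DT2[.(4:10)]     # fast
DT2[.(100:200)]  # also fast; no re-scan of all the rows, no re-sort
key(DT2)         # "x" -- the key persists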
In versions 1.9.3+, Matt has taken this one step further and implemented automatic indexing, as a result of implementing the long-standing, very useful feature set2key. Now you don't even need to setkey(). The first time you subset, an index is automatically generated while performing a vector scan, and therefore subsequent subsets are very fast. Things to note here are:
- It all happens under the hood, seamlessly. You can use the normal base syntax you have always used; no change to existing code.
- And since we don't rearrange the data (no need to setkey()), even the first subset is much faster. The only extra time required is to create the index, and that too, just once. Index creation is usually incredibly fast, as shown here:

  options(datatable.auto.index = TRUE) # enable auto indexing again to illustrate the benefits

  ## First run: constructs the index + binary search based subset, implicitly and automatically
  DT[x %in% 4:10, verbose = TRUE]
  # Creating new index 'x'
  # Starting bmerge ...done in 0 secs
  #    user  system elapsed
  #   1.045   0.132   1.182

  ## Second run: reuses the existing index
  DT[x %in% 4:10, verbose = TRUE]
  # Using existing index 'x'
  # Starting bmerge ...done in 0 secs
  #    user  system elapsed
  #   0.007   0.001   0.004
- The data set is not rearranged; keeping the original row order might in itself be a requirement in some cases.
At the moment, only basic subsets are supported (this'll be expanded):
DT[ x == . ] # and
DT[ x %in% . ]
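As an aside, you can check that the automatic index lives alongside the data without keying it. To my knowledge it is stored in the "index" attribute, though that is an internal detail worth verifying against your data.table version:
key(DT)                # NULL -- auto indexing neither reorders nor keys the data
names(attributes(DT))  # includes "index", where the automatic index is kept
DT[x > 100L]           # forms other than == and %in% still fall back to a vector scan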
## Wrappers for benchmarking the two approaches:
dplyr_between <- function(df, left, right) {
  df %>% filter(dplyr::between(x, left, right))
}
dt_autoidx <- function(dt, left, right) {
  dt[x %in% left:right]  # binary search on the automatic index after the first run
}
require(rbenchmark)
benchmark(dplyr_between(DF, 4L, 10L), dt_autoidx(DT, 4L, 10L))
                      test replications elapsed relative user.self sys.self user.child sys.child
1 dplyr_between(DF, 4, 10)          100  62.026   95.132    41.018   20.156          0         0
2   dt_autoidx(DT, 4, 10)          100   0.652    1.000     1.924    0.231          0         0
So if we subset this data set 100 times, we get a ~95x speedup (0.65s vs 62s)!