A basic/naive test
- Found gem about
join + update-by-referencefrom Left join using data.table - Saw a tweet reply by Michael chirico about using
setkeyduring join - Combine them together
"I want to do fast & memory efficient joins but I don't want to read the performance benchmark...". You can skip to Keyfindings & Examples sections below.
# NOTE: Performance and memory efficiency also depend on data/hardware/systems/etc
library(data.table) ; set.seed(1)
x <- data.table(a = 1:1e8, b = sample(12:19, 1e7, TRUE))
y <- data.table(a = sample(x$a, 2e5), c = sample(2:8, 2e5, TRUE))
x_2 <- copy(x) ; y_2 <- copy(y)
x_3 <- copy(x) ; y_3 <- copy(y)
x_4 <- copy(x) ; y_4 <- copy(y)
benchmark <- bench::mark( # TODO validate with `microbenchmark`
time_unit = 'ms', check = FALSE, # Turn off `check` for `setkey`
normal_join = x <- y[x, on = 'a'],
update_by_reference = x_2[y_2, on = 'a', c := c],
setkey_n_update = setkey(x_3, a) [ setkey(y_3, a), on = 'a', c := c ],
# Wrapper for complex cases (ie many cols). See codes below.
wrap_setkey_n_update = setjoin(x_4, y_4, on = 'a')
)
benchmark[, c('expression', 'median', 'mem_alloc')]
# # A tibble: 4 x 3
# expression median mem_alloc
# <bch:expr> <dbl> <bch:byt>
# 1 normal_join 5593. 4.1GB
# 2 update_by_reference 724. 767.6MB
# 3 setkey_n_update 48.5 767.7MB
# 4 wrap_setkey_n_update 56.9 772.2MB
# NOTE: `dplyr::left_join` was also tested and it's the slowest with ~9,000 ms,
# use more memory than both `update_by_reference` and `setkey_n_update`
# but use less memory than normal_join. It consumed about ~2.0GB of memory.
# I didn't include it as I want to focus solely on {data.table}
# Wrapper
setjoin <- function(dt, dt_join, on, cols, drop) {
# TODO (WORK IN PROGRESS ONLY) Get cols
# Focus on the main codes about `join + update-by-reference + setkey` below, not these 'aesthetic' elements
cols <- {
if ( ! deparse(substitute(on)) %like% '\\.\\(' ) {
if (is.character(on)) {
if ( missing(cols) & missing(drop) )
setdiff( names(dt_join), on )
else if ( ! missing(drop) )
setdiff( names(dt_join), c(on, drop) )
else cols
}
}
# TODO Case: complex NSE. Currently, if `on` is complex, `cols` must be specified.
else cols
}
# NSE
expr <- quote({
setkey(dt, on) ; setkey(dt_join, on)
dt[dt_join, on = on, (cols) := mget( paste0('i.', cols) ) ]
# dt[dt_join, on = on, (cols) := dt_join[, ..cols] ]
})
eval( do.call(
substitute,
list(expr, list(on = substitute(on),
cols = substitute(cols)))
))
}I thought the performance gains were unrealistic and decided to test again using system time. Two rounds were tested, where the first round was the same as above and the second round was tested for two rounds of similar joins to investigate the performance gains of setkey (ie setkey is only set for the first time and used for subsequent joins). Wrapper of setjoin was not tested here, for brevity. It has similar performance as setkey + update.
# `bench::system_time` ---------------------------
library(data.table) ; set.seed(1)
x <- data.table(a = 1:1e8, b = sample(12:19, 1e7, TRUE))
y <- data.table(a = sample(x$a, 2e5), c = sample(2:8, 2e5, TRUE))
z <- data.table(a = sample(x$a, 2e5), c = sample(2:8, 2e5, TRUE))
x_2 <- copy(x) ; y_2 <- copy(y) ; z_2 <- copy(z)
x_3 <- copy(x) ; y_3 <- copy(y) ; z_3 <- copy(z)
# _ Join once ------------------------------------
normal_join <- bench::system_time({
x <- y[x, on = 'a']
})
update_by_reference <- bench::system_time({
x_2[y_2, on = 'a', c := c]
})
update_n_setkey <- bench::system_time({
setkey(x_3, a) [ setkey(y_3, a), on = 'a', c := c ]
})
join_once <- rbind(normal_join, update_by_reference, update_n_setkey)[, 'real']
# _ Join twice -----------------------------------
# Same chunk as above
library(data.table) ; set.seed(1)
x <- data.table(a = 1:1e8, b = sample(12:19, 1e7, TRUE))
y <- data.table(a = sample(x$a, 2e5), c = sample(2:8, 2e5, TRUE))
z <- data.table(a = sample(x$a, 2e5), c = sample(2:8, 2e5, TRUE))
x_2 <- copy(x) ; y_2 <- copy(y) ; z_2 <- copy(z)
x_3 <- copy(x) ; y_3 <- copy(y) ; z_3 <- copy(z)
# Different chunk
normal_join <- bench::system_time({
x <- y[x, on = 'a']
y <- z[x, on = 'a']
})
update_by_reference <- bench::system_time({
x_2[y_2, on = 'a', c := c]
x_2[z_2, on = 'a', c := c]
})
update_n_setkey <- bench::system_time({
setkey(x_3, a) [ setkey(y_3, a), on = 'a', c := c ]
x_3[ setkey(z_3, a), on = 'a', c := c ] # NOTE -------- No `setkey` for x_3 here --------
})
join_twice <- rbind(normal_join, update_by_reference, update_n_setkey)[, 'real']
# _ Compare --------------------------------------
data <- data.table(n_join = c( rep('join_once', 3), rep('join_twice', 3) ),
type = names(join_once),
seconds = c(join_once, join_twice)) [,
seconds := round(seconds, 1)]
data
# n_join type seconds
# <char> <char> <num>
# 1: join_once normal_join 5.6
# 2: join_once update_by_reference 0.8
# 3: join_once update_n_setkey 0.7
# 4: join_twice normal_join 10.2
# 5: join_twice update_by_reference 1.5
# 6: join_twice update_n_setkey 0.9
library(ggplot2)
ggplot(data, aes(n_join, seconds, fill = type)) +
geom_col(position = 'dodge', width = .5) +
geom_text(aes(label = seconds), position = position_dodge(width = 0.9), vjust = - .25, color = 'grey')
# Personal package was used below for themingSomehow bench::mark didn't capture the overhead of setkey during the initial test, hence the unrealistic of ~100x gains were shown.
setkey + updateandupdateare ~11 and ~6.5 times faster thannormal join, respectively- on first join, performance of
setkey + updateis similar toupdateas overhead ofsetkeylargely offsets its own performance gains - on second and subsequent joins, as
setkeyis not required,setkey + updateis faster thanupdateby ~1.8 times (or faster thannormal joinby ~11 times)
For performant & memory efficient joins, use either update or setkey + update, where the latter is faster at the cost of more codes.
Let's see some pseudo codes, for brevity. The logics are the same.
For one or a few columns
a <- data.table(x = ..., y = ..., z = ..., ...)
b <- data.table(x = ..., y = ..., z = ..., ...)
# `update`
a[b, on = .(x), y := y]
a[b, on = .(x), `:=` (y = y, z = z, ...)]
# `setkey + update`
setkey(a, x) [ setkey(b, x), on = .(x), y := y ]
setkey(a, x) [ setkey(b, x), on = .(x), `:=` (y = y, z = z, ...) ]For many columns
cols <- c('x', 'y', ...)
# `update`
a[b, on = .(x), (cols) := mget( paste0('i.', cols) )]
# `setkey + update`
setkey(a, x) [ setkey(b, x), on = .(x), (cols) := mget( paste0('i.', cols) ) ]Wrapper for fast & memory efficient joins...many of them...with similar join-pattern, wrap them like setjoin above
- with
update - with or without
setkey
setjoin(a, b, on = ...) # join all columns
setjoin(a, b, on = ..., select = c('columns_to_be_included', ...))
setjoin(a, b, on = ..., drop = c('columns_to_be_excluded', ...))
# With that, you can even use it with `magrittr` pipe
a %>%
setjoin(...) %>%
setjoin(...)With setkey, argument on can be omitted. It can also be included for readability, especially for collaborating with others.
Have a nice day ☺☺☺♥♣♠
The above are not my ideas at all. Needed some fast join for some projects, did some benchmark and just sharing with #RStats community ☺. Please give all credits to the people below.
- The R Foundation
- CRAN
- 75 contributors of data.table team, more than 4,000 commits, since 2008
- Left join using data.table
- Tweet reply by Michael chirico about using
setkeyduring join - #RStats
- and more! (too tired typing...you know ☺)

using setkey and repeated runs seems pointless because it may escape sorting evaluation when it detects object is already sortedsorry, just realized it is not repeated evaluation, but you just use extra package for measuring time.