library(tidyverse)
library(dtplyr)
library(data.table)
library(bench)
library(DBI)tidy_finance <- dbConnect(
RSQLite::SQLite(),
"data/tidy_finance_r.sqlite",
extended_types = TRUE
)
crsp_monthly_dplyr <- tbl(tidy_finance, "crsp_monthly") |>
select(permno, month, ret_excess, mktcap_lag) |>
collect()
crsp_monthly_dt <- copy(as.data.table(crsp_monthly_dplyr))
crsp_monthly_dtplyr <- lazy_dt(copy(as.data.table(crsp_monthly_dplyr)))The following version of assign_portfolio() operates on a single
vector. Perhaps this is a better set-up for data.table … it certainly
seems more “tidy” than the “whole data frame” approach in the code
here.1
assign_portfolio <- function(x, n_portfolios) {
probs <- seq(0, 1, length.out = n_portfolios + 1)
breakpoints <- quantile(x, probs, na.rm = TRUE, names = FALSE)
findInterval(x, breakpoints, all.inside = TRUE)
}Now, I apply assign_portfolio() in three different ways. First, using
a regular dplyr data frame.
sort_dplyr <- function() {
crsp_monthly_dplyr |>
group_by(month) |>
mutate(portfolio = assign_portfolio(mktcap_lag, n_portfolios = 10)) |>
group_by(portfolio) |>
summarize(ret = mean(ret_excess)) |>
as.data.frame()
}Second, using data.table and dtplyr.
sort_dt <- function() {
crsp_monthly_dt |>
group_by(month) |>
mutate(portfolio = assign_portfolio(mktcap_lag, n_portfolios = 10)) |>
group_by(portfolio) |>
summarize(ret = mean(ret_excess)) |>
as.data.frame()
}Third, using data.table and dtplyr, but applied to the version
returned by lazy_dt().
sort_dtplyr <- function() {
crsp_monthly_dtplyr |>
group_by(month) |>
mutate(portfolio = assign_portfolio(mktcap_lag, n_portfolios = 10)) |>
group_by(portfolio) |>
summarize(ret = mean(ret_excess)) |>
as.data.frame()
}Now, run these three.
iterations <- 100
results <- bench::mark(
sort_dplyr(), sort_dt(), sort_dtplyr(),
iterations = iterations
)autoplot(results, type = "violin") +
labs(y = NULL, x = NULL,
title = "Execution time of porfolio sorts using dplyr and dtplyr",
subtitle = "'sort_dt' uses `dtplyr` and 'sort_dtplyr' simply adds `lazy_dt()`")Footnotes
-
Of course, the pipe (
|>) operates on a whole data frame, but I think this makes it seem even odder to have a function that is applied using the pipe that takes and returns a data frame. ↩

