library(tidyverse)
library(dtplyr)
library(data.table)
library(bench)
library(DBI)
tidy_finance <- dbConnect(
RSQLite::SQLite(),
"data/tidy_finance_r.sqlite",
extended_types = TRUE
)
crsp_monthly_dplyr <- tbl(tidy_finance, "crsp_monthly") |>
select(permno, month, ret_excess, mktcap_lag) |>
collect()
crsp_monthly_dt <- copy(as.data.table(crsp_monthly_dplyr))
crsp_monthly_dtplyr <- lazy_dt(copy(as.data.table(crsp_monthly_dplyr)))
The following version of assign_portfolio()
operates on a single
vector. Perhaps this is a better set-up for data.table
… it certainly
seems more “tidy” than the “whole data frame” approach in the code
here.1
assign_portfolio <- function(x, n_portfolios) {
probs <- seq(0, 1, length.out = n_portfolios + 1)
breakpoints <- quantile(x, probs, na.rm = TRUE, names = FALSE)
findInterval(x, breakpoints, all.inside = TRUE)
}
Now, I apply assign_portfolio()
in three different ways. First, using
a regular dplyr
data frame.
sort_dplyr <- function() {
crsp_monthly_dplyr |>
group_by(month) |>
mutate(portfolio = assign_portfolio(mktcap_lag, n_portfolios = 10)) |>
group_by(portfolio) |>
summarize(ret = mean(ret_excess)) |>
as.data.frame()
}
Second, using data.table
and dtplyr
.
sort_dt <- function() {
crsp_monthly_dt |>
group_by(month) |>
mutate(portfolio = assign_portfolio(mktcap_lag, n_portfolios = 10)) |>
group_by(portfolio) |>
summarize(ret = mean(ret_excess)) |>
as.data.frame()
}
Third, using data.table
and dtplyr
, but applied to the version
returned by lazy_dt()
.
sort_dtplyr <- function() {
crsp_monthly_dtplyr |>
group_by(month) |>
mutate(portfolio = assign_portfolio(mktcap_lag, n_portfolios = 10)) |>
group_by(portfolio) |>
summarize(ret = mean(ret_excess)) |>
as.data.frame()
}
Now, run these three.
iterations <- 100
results <- bench::mark(
sort_dplyr(), sort_dt(), sort_dtplyr(),
iterations = iterations
)
autoplot(results, type = "violin") +
labs(y = NULL, x = NULL,
title = "Execution time of porfolio sorts using dplyr and dtplyr",
subtitle = "'sort_dt' uses `dtplyr` and 'sort_dtplyr' simply adds `lazy_dt()`")
Footnotes
-
Of course, the pipe (
|>
) operates on a whole data frame, but I think this makes it seem even odder to have a function that is applied using the pipe that takes and returns a data frame. ↩