The following are some benchmarks that emerged naturally from some work I was doing to get Stata data from WRDS. In essence, there are two steps. First, retrieve data from WRDS's PostgreSQL database. Second, do some analysis of the retrieved data (here a simple summary query). Below I have tabulated the elapsed times in seconds for each of these two tasks for a number of different approaches.
| Method | Data retrieval (s) | Summary query (s) |
|---|---|---|
| Stata with ODBC | 52.50 | 7.42 |
| Stata with JDBC | 45.25 | 7.01 |
| R with ODBC | 36.40 | 0.093 |
| R with RPostgres | 25.30 | 0.097 |
| R with DuckDB | 12.09 | 0.060 |
Note that if I use `wrds_update_pq()` to put `ibes.statsum_epsint` in a file (`statsum_epsint.parquet`; 348 MB) that I then put in Dropbox, I can download the data in about 7 seconds and then load it into a computed table in DuckDB in about 1 second.
This illustrates that PostgreSQL doesn't shine as brightly when a query needs to retrieve a substantial portion of the data in a table, as in this case.
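A sketch of that parquet-based workflow, assuming `statsum_epsint.parquet` has already been downloaded to the working directory (the table name and connection here are illustrative, not the exact code I used):

```r
library(DBI)

# Connect to an in-memory DuckDB instance
con <- dbConnect(duckdb::duckdb())

# DuckDB reads parquet natively, so loading even a ~348 MB file
# into a table is fast (about a second in my case).
dbExecute(con, "CREATE TABLE statsum_epsint AS
                SELECT * FROM 'statsum_epsint.parquet'")

dbDisconnect(con, shutdown = TRUE)
```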
- I think even the slowest option for "Data retrieval" is fine. This is the kind of task you would do (say) once for a project, saving the retrieved data for later use.
- The tabulation of `fpi` that takes seven seconds with Stata takes milliseconds with any of the R options.
- I would say seven seconds is intolerable if one is doing many such manipulations.
- I think Stata and R (as I use it here ... with PostgreSQL, DuckDB, etc) are optimized for different use-cases.
- Stata is best when you have data and just need to run hundreds or thousands of regressions to find results. (At least, that's one hypothesis.)
- R (as I use it here) is better set up for the use case where you spend most of your time organizing data to run the regression you have in mind. Once you run the regression (along with a few variations), you've spent so much time prepping the data that a few seconds here or there are not noticeable. (Of course, it's not clear that R is necessarily much slower than Stata, but I'm assuming Stata has its raison d'être ... other than "all the cool kids [i.e., economists] use it!")
- I assume that the difference between `7.42` and `7.01` for the two summary queries in Stata is just random noise. The data are in RAM in Stata in each case.
- The above tests were conducted on a 14" M1 Pro MacBook Pro with 16GB of RAM and a 1Gb ethernet connection to a 1Gb optic fibre internet connection in Massachusetts.
- This laptop is a university-issued laptop that I don't use much ... but it's the only computer on which I have Stata.
- I ran the tests the day before over WiFi, and the wired results above are faster. This surprises me; I would've thought that bandwidth constraints on my internet connection would not be a factor for any of these options.
- The R with DuckDB option does not currently work on Windows. This is because it relies on the `postgres_scanner` extension, and DuckDB extensions don't seem to get installed with the Windows R version of DuckDB.
- Let me know if you have questions about setting up ODBC with Stata on MacOS. It took me a while to figure it out.
- Using `collect()` instead of `compute()` with the DuckDB query does not materially change the results. The `collect()` operates on local data and, for DuckDB, `collect()` seems very fast (I suspect it is operating at a low level for high performance). While DuckDB does reliably summarize the data faster than R does, I don't think the difference between 0.09 seconds and 0.06 seconds is material.
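To illustrate the distinction: with dbplyr, `compute()` materializes the result as a (temporary) table inside DuckDB, while `collect()` pulls the result into R as a tibble. A minimal sketch using a toy in-memory table (not the WRDS data):

```r
library(dplyr, warn.conflicts = FALSE)
library(DBI)

con <- dbConnect(duckdb::duckdb())
dbWriteTable(con, "toy", data.frame(fpi = c("1", "1", "2")))

# A lazy query; nothing is executed yet
q <- tbl(con, "toy") |> count(fpi)

res_remote <- q |> compute()  # materializes a temporary table in DuckDB
res_local <- q |> collect()   # pulls the result into R as a tibble
```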
The file `odbc_test.do` contains the following code:
set rmsg on
global query ///
select ticker,cusip,statpers,fpi,curcode,numest,meanest,fpedats,measure,fiscalp ///
from ibes.statsum_epsint ///
where statpers>'20090101' ///
and measure='EPS' ///
and fpi<'6'
odbc load, exec("$query") noquote dsn("wrds-pgdata-64") clear
tabstat statpers, by(fpi) stats(N)
set rmsg off
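The `wrds-pgdata-64` DSN referenced above is assumed to be defined in `~/.odbc.ini`. An entry along these lines should work (the driver path and settings here are illustrative; adjust for your psqlODBC install):

```ini
[wrds-pgdata-64]
Driver = /opt/homebrew/lib/psqlodbcw.so
Servername = wrds-pgdata.wharton.upenn.edu
Port = 9737
Database = wrds
Username = YOUR_WRDS_ID
SSLmode = require
```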
. odbc list
Data Source Name    Driver
---------------------------------------------------------------
wrds-pgdata-64      /opt/homebrew/Cellar/psqlodbc/16.00.0000/li
---------------------------------------------------------------
. do "odbc_test.do"
. set rmsg on
r; t=0.00 12:01:12
.
. global query ///
> select ticker,cusip,statpers,fpi,curcode,numest,meanest,
> fpedats,measure,fiscalp ///
> from ibes.statsum_epsint ///
> where statpers>'20090101' ///
> and measure='EPS' ///
> and fpi<'6'
r; t=0.00 12:01:12
.
. odbc load, exec("$query") noquote dsn("wrds-pgdata-64") clear
r; t=52.50 12:02:04
. tabstat statpers, by(fpi) stats(N)
Summary for variables: statpers
Group variable: fpi
fpi | N
-------+----------
0 | 615279
1 | 2210319
2 | 2142663
3 | 1796838
4 | 651510
5 | 371919
-------+----------
Total | 7788528
------------------
r; t=7.42 12:02:12
.
. set rmsg off
The file `jdbc_test.do` contains the code below. I adapted this from code found here.
set rmsg on
local jar "postgresql-42.7.0.jar"
local classname "org.postgresql.Driver"
local url "jdbc:postgresql://wrds-pgdata.wharton.upenn.edu:9737/wrds?ssl=require&sslfactory=org.postgresql.ssl.NonValidatingFactory"
local user "iangow"
local password "xxxx"
jdbc connect, jar("`jar'") driverclass("`classname'") url("`url'") ///
user("`user'") password("`password'")
global query ///
select ticker,cusip,statpers,fpi,curcode,numest,meanest,fpedats,measure,fiscalp ///
from ibes.statsum_epsint ///
where statpers>'20090101' ///
and measure='EPS' ///
and fpi<'6'
jdbc load, exec("$query") clear
tabstat statpers, by(fpi) stats(N)
set rmsg off
Partial output from the code above is given below.
. jdbc load, exec("$query") clear
(7788528 observations loaded)
r; t=45.25 12:25:10
. tabstat statpers, by(fpi) stats(N)
Summary for variables: statpers
Group variable: fpi (fpi)
fpi | N
-------+----------
0 | 615279
1 | 2210319
2 | 2142663
3 | 1796838
4 | 651510
5 | 371919
-------+----------
Total | 7788528
------------------
r; t=7.01 12:25:17
Now using ODBC in R.
library(dplyr, warn.conflicts = FALSE)
library(DBI)
pg <- dbConnect(odbc::odbc(), DSN = "wrds-pgdata-64")
statsum_epsint <- tbl(pg, Id(schema = "ibes", table = "statsum_epsint"))
system.time({
ibes_statsum_epsint <-
statsum_epsint |>
filter(statpers > '2009-01-01', measure == 'EPS', fpi < '6') |>
select(ticker, cusip, statpers, fpi, curcode,
numest, meanest, fpedats, measure, fiscalp) |>
collect()
})
ibes_statsum_epsint |>
count(fpi) |>
collect() |>
print() |>
system.time()
Partial output from the code above is given below.
user system elapsed
23.647 2.999 36.399
>
> ibes_statsum_epsint |>
+ count(fpi) |>
+ collect() |>
+ print() |>
+ system.time()
# A tibble: 6 × 2
fpi n
<chr> <int>
1 0 615279
2 1 2210319
3 2 2142663
4 3 1796838
5 4 651510
6 5 371919
user system elapsed
0.084 0.008 0.093
Now using RPostgres in R. First, set the environment variables; these environment variables are used for the DuckDB option below too.
Sys.setenv(PGHOST = "wrds-pgdata.wharton.upenn.edu",
PGPORT = 9737L,
PGDATABASE = "wrds",
PGUSER = "YOUR_WRDS_ID")
Now run the code.
library(dplyr, warn.conflicts = FALSE)
library(DBI)
pg <- dbConnect(RPostgres::Postgres())
statsum_epsint <- tbl(pg, Id(schema = "ibes", table = "statsum_epsint"))
system.time({
ibes_statsum_epsint <-
statsum_epsint |>
filter(statpers > '2009-01-01', measure == 'EPS', fpi < '6') |>
select(ticker, cusip, statpers, fpi, curcode,
numest, meanest, fpedats, measure, fiscalp) |>
collect()
})
ibes_statsum_epsint |>
count(fpi) |>
collect() |>
print() |>
system.time()
Partial output from the code above is given below.
user system elapsed
9.070 0.851 25.300
> ibes_statsum_epsint |>
+ count(fpi) |>
+ collect() |>
+ print() |>
+ system.time()
# A tibble: 6 × 2
fpi n
<chr> <int>
1 0 615279
2 1 2210319
3 2 2142663
4 3 1796838
5 4 651510
6 5 371919
user system elapsed
0.088 0.005 0.097
Now using DuckDB in R (via the `postgres_scanner` extension).
library(dplyr, warn.conflicts = FALSE)
library(DBI)
pg <- dbConnect(duckdb::duckdb())
# rs <- dbExecute(pg, "INSTALL postgres_scanner")
rs <- dbExecute(pg, "LOAD postgres_scanner")
rs <- dbExecute(pg, "SET threads TO 3")
statsum_epsint <-
tbl(pg, "postgres_scan_pushdown('', 'tr_ibes', 'statsum_epsint')")
system.time({
ibes_statsum_epsint <-
statsum_epsint |>
filter(statpers > '2009-01-01', measure == 'EPS', fpi < '6') |>
select(ticker, cusip, statpers, fpi, curcode,
numest, meanest, fpedats, measure, fiscalp) |>
compute()
})
ibes_statsum_epsint |>
count(fpi) |>
collect() |>
print() |>
system.time()
dbDisconnect(pg, shutdown = TRUE)
Partial output from the code above is given below.
user system elapsed
4.393 1.239 12.094
Warning message:
Connection is garbage-collected, use dbDisconnect() to avoid this.
>
> ibes_statsum_epsint |>
+ count(fpi) |>
+ collect() |>
+ print() |>
+ system.time()
# A tibble: 6 × 2
fpi n
<chr> <dbl>
1 0 615279
2 1 2210319
3 2 2142663
4 3 1796838
5 4 651510
6 5 371919
user system elapsed
0.093 0.003 0.060