@iangow
Last active August 29, 2024 09:29

Benchmarking WRDS data retrieval

Summary

The following benchmarks emerged naturally from some work I was doing to get Stata data from WRDS. In essence, there are two steps: first, retrieve data from WRDS's PostgreSQL database; second, do some analysis of the retrieved data (here a simple summary query). Below I tabulate the elapsed times in seconds for each of these two tasks for a number of different approaches.

Method              Data retrieval   Summary query
Stata with ODBC              52.50            7.42
Stata with JDBC              45.25            7.01
R with ODBC                  36.40            0.093
R with RPostgres             25.30            0.097
R with DuckDB                12.09            0.060

Note that if I use wrds_update_pq() to save ibes.statsum_epsint as a parquet file (statsum_epsint.parquet; 348 MB) stored in Dropbox, I can download the data in about 7 seconds and then load it into a computed table in DuckDB in about 1 second. This illustrates that PostgreSQL doesn't shine as brightly when a query needs to retrieve a substantial portion of the data in a table, as in this case.
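The parquet-based step can be sketched as follows (this is a sketch, not the wrds_update_pq() implementation: the tiny parquet file written here is a stand-in for the real 348 MB statsum_epsint.parquet downloaded from Dropbox):

```r
library(DBI)
library(dplyr, warn.conflicts = FALSE)

db <- dbConnect(duckdb::duckdb())

# Stand-in for the downloaded statsum_epsint.parquet (hypothetical contents)
dbExecute(db, "COPY (SELECT '0001' AS ticker, 3 AS numest)
               TO 'statsum_epsint.parquet' (FORMAT PARQUET)")

# Load the parquet file into a computed (temporary) table inside DuckDB
statsum_epsint <-
  tbl(db, "read_parquet('statsum_epsint.parquet')") |>
  compute(name = "statsum_epsint")

statsum_epsint |> count() |> collect()

dbDisconnect(db, shutdown = TRUE)
```

For the real file, only the read_parquet() and compute() steps are needed; DuckDB reads parquet directly without any import step.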

Notes

  • I think even the slowest option for "Data retrieval" is fine. This is the kind of task you would be doing (say) once for a project and saving the data.
  • The tabulation of fpi that takes seven seconds with Stata takes milliseconds with any of the R options.
  • I would say seven seconds is intolerable if one is doing many such manipulations.
  • I think Stata and R (as I use it here ... with PostgreSQL, DuckDB, etc) are optimized for different use-cases.
  • Stata is best when you have data and just need to run hundreds or thousands of regressions to find results. (At least, that's one hypothesis.)
  • R (as I use it here) is better set up for the use case where you spend most of your time organizing data to run the regression you have in mind. Once you run the regression (along with a few variations), you've spent so much time prepping the data that a few seconds here or there are not noticeable. (Of course, it's not clear that R is necessarily much slower than Stata, but I'm assuming Stata has its raison d'être ... other than "all the cool kids [i.e., economists] use it!")
  • I assume that the difference between 7.42 and 7.01 for the two summary queries in Stata is just random noise. The data are in RAM in Stata in each case.
  • The tests above were run on a 14" M1 Pro MacBook Pro with 16 GB of RAM, connected by 1 Gb ethernet to a 1 Gb optic-fibre internet connection in Massachusetts.
  • This laptop is a university-issued laptop that I don't use much ... but it's the only computer on which I have Stata.
  • I ran the tests the day before over WiFi, and the results above (over ethernet) are faster. This surprises me; I would have thought that bandwidth constraints on my internet connection would not be a factor for any of these options.
  • The R with DuckDB option does not currently work on Windows, because it relies on the postgres_scanner extension and DuckDB extensions don't seem to be installed with the Windows R build of DuckDB.
  • Let me know if you have questions about setting up ODBC with Stata on macOS. It took me a while to figure it out.
  • Using collect() instead of compute() with the DuckDB query does not materially change the results. collect() brings the data into R, and for DuckDB it seems very fast (I suspect it operates at a low level for performance). While DuckDB does reliably summarize the data faster than R does, I don't think the difference between 0.09 seconds and 0.06 seconds is material.
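The collect()/compute() distinction can be seen with a small local example (the table name and data here are made up):

```r
library(DBI)
library(dplyr, warn.conflicts = FALSE)

db <- dbConnect(duckdb::duckdb())
dbWriteTable(db, "demo",
             data.frame(fpi = c("1", "1", "2"), numest = c(3, 5, 7)))

demo <- tbl(db, "demo")

# compute() materializes the result as a (temporary) table inside DuckDB;
# the data stay in the database and can feed further lazy queries
counts_db <- demo |> count(fpi) |> compute()

# collect() pulls the result into R as a tibble
counts_r <- demo |> count(fpi) |> collect()
counts_r

dbDisconnect(db, shutdown = TRUE)
```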

Stata with ODBC

Code

The file odbc_test.do contains the following code:

set rmsg on

global query ///
	select ticker,cusip,statpers,fpi,curcode,numest,meanest,fpedats,measure,fiscalp ///
	from ibes.statsum_epsint ///
	where statpers>'20090101' ///
	and measure='EPS' ///
	and fpi<'6'
	
odbc load, exec("$query") noquote dsn("wrds-pgdata-64") clear 
tabstat statpers, by(fpi) stats(N)

set rmsg off
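For reference, the "wrds-pgdata-64" DSN used above corresponds to an entry along these lines in ~/.odbc.ini (a sketch: the driver path is hypothetical and depends on where your psqlodbc install puts psqlodbcw.so; host and port match the JDBC URL used later):

```ini
[wrds-pgdata-64]
; Hypothetical driver path; point this at your psqlodbc install
Driver     = /opt/homebrew/lib/psqlodbcw.so
Servername = wrds-pgdata.wharton.upenn.edu
Port       = 9737
Database   = wrds
Username   = YOUR_WRDS_ID
SSLmode    = require
```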

Output

. odbc list

Data Source Name                   Driver
--------------------------------------------------
> -----------------------------
wrds-pgdata-64                     /opt/homebrew/C
> ellar/psqlodbc/16.00.0000/li
--------------------------------------------------
> -----------------------------

. do "odbc_test.do"

. set rmsg on
r; t=0.00 12:01:12

. 
. global query ///
>         select ticker,cusip,statpers,fpi,curcode,numest,meanest,
> fpedats,measure,fiscalp ///
>         from ibes.statsum_epsint ///
>         where statpers>'20090101' ///
>         and measure='EPS' ///
>         and fpi<'6'
r; t=0.00 12:01:12

.         
. odbc load, exec("$query") noquote dsn("wrds-pgdata-64") clear 
r; t=52.50 12:02:04

. tabstat statpers, by(fpi) stats(N)

Summary for variables: statpers
Group variable: fpi 

   fpi |         N
-------+----------
     0 |    615279
     1 |   2210319
     2 |   2142663
     3 |   1796838
     4 |    651510
     5 |    371919
-------+----------
 Total |   7788528
------------------
r; t=7.42 12:02:12

. 
. set rmsg off

Stata with JDBC

Code

The file jdbc_test.do contains the code below. I adapted this from code found here.

set rmsg on

local jar "postgresql-42.7.0.jar"
local classname "org.postgresql.Driver"
local url "jdbc:postgresql://wrds-pgdata.wharton.upenn.edu:9737/wrds?ssl=require&sslfactory=org.postgresql.ssl.NonValidatingFactory"
local user "iangow"
local password "xxxx"
jdbc connect, jar("`jar'") driverclass("`classname'") url("`url'")   ///
    user("`user'") password("`password'")

global query ///
   select ticker,cusip,statpers,fpi,curcode,numest,meanest,fpedats,measure,fiscalp ///
   from ibes.statsum_epsint ///
   where statpers>'20090101' ///
   and measure='EPS' ///
   and fpi<'6'
   
jdbc load, exec("$query") clear 

tabstat statpers, by(fpi) stats(N)

set rmsg off

Output

Partial output from the code above is given below.

. jdbc load, exec("$query") clear 
(7788528 observations loaded)
r; t=45.25 12:25:10

. tabstat statpers, by(fpi) stats(N)

Summary for variables: statpers
Group variable: fpi (fpi)

   fpi |         N
-------+----------
     0 |    615279
     1 |   2210319
     2 |   2142663
     3 |   1796838
     4 |    651510
     5 |    371919
-------+----------
 Total |   7788528
------------------
r; t=7.01 12:25:17

R with ODBC

Now using ODBC in R.

Code

library(dplyr, warn.conflicts = FALSE)
library(DBI)

pg <- dbConnect(odbc::odbc(), DSN = "wrds-pgdata-64")
statsum_epsint <- tbl(pg, Id(schema = "ibes", table = "statsum_epsint"))

system.time({
  ibes_statsum_epsint <-
    statsum_epsint |>
    filter(statpers > '2009-01-01', measure == 'EPS', fpi < '6') |>
    select(ticker, cusip, statpers, fpi, curcode,
           numest, meanest, fpedats, measure, fiscalp) |>
    collect()
})

ibes_statsum_epsint |>
  count(fpi) |>
  collect() |>
  print() |>
  system.time()

Output

Partial output from the code above is given below.

   user  system elapsed 
 23.647   2.999  36.399 
> 
> ibes_statsum_epsint |>
+   count(fpi) |>
+   collect() |>
+   print() |>
+   system.time()
# A tibble: 6 × 2
  fpi         n
  <chr>   <int>
1 0      615279
2 1     2210319
3 2     2142663
4 3     1796838
5 4      651510
6 5      371919
   user  system elapsed 
  0.084   0.008   0.093 

R with RPostgres

Code

First, set the environment variables. These environment variables are used for the DuckDB option below too.

Sys.setenv(PGHOST = "wrds-pgdata.wharton.upenn.edu",
           PGPORT = 9737L,
           PGDATABASE = "wrds",
           PGUSER = "YOUR_WRDS_ID")

Now run the code.

library(dplyr, warn.conflicts = FALSE)
library(DBI)

pg <- dbConnect(RPostgres::Postgres())
statsum_epsint <- tbl(pg, Id(schema = "ibes", table = "statsum_epsint"))

system.time({
  ibes_statsum_epsint <-
    statsum_epsint |>
    filter(statpers > '2009-01-01', measure == 'EPS', fpi < '6') |>
    select(ticker, cusip, statpers, fpi, curcode,
           numest, meanest, fpedats, measure, fiscalp) |>
    collect()
})

ibes_statsum_epsint |>
  count(fpi) |>
  collect() |>
  print() |>
  system.time()

Output

Partial output from the code above is given below.

   user  system elapsed 
  9.070   0.851  25.300 
> ibes_statsum_epsint |>
+   count(fpi) |>
+   collect() |>
+   print() |>
+   system.time()
# A tibble: 6 × 2
  fpi         n
  <chr>   <int>
1 0      615279
2 1     2210319
3 2     2142663
4 3     1796838
5 4      651510
6 5      371919
   user  system elapsed 
  0.088   0.005   0.097 

R with DuckDB (postgres_scanner extension)

Code

library(dplyr, warn.conflicts = FALSE)
library(DBI)

pg <- dbConnect(duckdb::duckdb())
# rs <- dbExecute(pg, "INSTALL postgres_scanner")
rs <- dbExecute(pg, "LOAD postgres_scanner")
rs <- dbExecute(pg, "SET threads TO 3")

statsum_epsint <- 
  tbl(pg, "postgres_scan_pushdown('', 'tr_ibes', 'statsum_epsint')")

system.time({
  ibes_statsum_epsint <-
    statsum_epsint |>
    filter(statpers > '2009-01-01', measure == 'EPS', fpi < '6') |>
    select(ticker, cusip, statpers, fpi, curcode,
           numest, meanest, fpedats, measure, fiscalp) |>
    compute()
})

ibes_statsum_epsint |>
  count(fpi) |>
  collect() |>
  print() |>
  system.time()

dbDisconnect(pg, shutdown = TRUE)

Output

Partial output from the code above is given below.

  user  system elapsed 
 4.393   1.239  12.094 
Warning message:
Connection is garbage-collected, use dbDisconnect() to avoid this. 
> 
> ibes_statsum_epsint |>
+   count(fpi) |>
+   collect() |>
+   print() |>
+   system.time()
# A tibble: 6 × 2
 fpi         n
 <chr>   <dbl>
1 0      615279
2 1     2210319
3 2     2142663
4 3     1796838
5 4      651510
6 5      371919
  user  system elapsed 
 0.093   0.003   0.060  