Last active: May 2, 2016 22:39
SQLite vs R data.table
TL;DR: SQLite (:memory:) 250 sec, data.table 7 sec

data: 100 million rows, 1 million groups
generated by: https://github.com/szilard/benchm-databases/blob/master/0-gendata.txt
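For reference, the dataset is just two columns: x, an integer group id in 1..1,000,000, and y, a random double. A minimal sketch of generating data with that shape (Python, scaled down for illustration; the sizes here are not those of the linked script):

```python
import csv
import random

n_rows = 100_000   # scaled down from 100 million for illustration
n_groups = 1_000   # scaled down from 1 million

random.seed(42)
with open("d-noh.csv", "w", newline="") as f:
    w = csv.writer(f)
    for _ in range(n_rows):
        # x: group id, y: value to aggregate
        w.writerow([random.randint(1, n_groups), random.random()])
```

The real benchmark uses the generator linked above; this only illustrates the schema of `d-noh.csv` (a headerless CSV, hence the name).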
sqlite3 :memory:

## SQLite version 3.8.4.1
create table d(x int, y double);
.mode csv
.import d-noh.csv d
.timer on
select x, avg(y) as ym
from d
group by x
order by ym desc
limit 5;
## Run Time: real 249.674 user 227.471406 sys 13.416719
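The CLI session above can also be reproduced programmatically; a sketch using Python's stdlib `sqlite3` module against an in-memory database (with a small synthetic sample inlined in place of the 100-million-row CSV import):

```python
import random
import sqlite3
import time

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE d (x INT, y DOUBLE)")

# Small synthetic sample standing in for the real d-noh.csv import
random.seed(1)
rows = [(random.randint(1, 1000), random.random()) for _ in range(100_000)]
con.executemany("INSERT INTO d VALUES (?, ?)", rows)

# Same query as the .timer'd CLI session above
t0 = time.time()
top5 = con.execute(
    "SELECT x, AVG(y) AS ym FROM d GROUP BY x ORDER BY ym DESC LIMIT 5"
).fetchall()
print(f"{time.time() - t0:.3f}s", top5)
```

At this toy scale the query returns in milliseconds; the 250-second figure above is for the full 100M-row / 1M-group dataset.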
R

## R version 3.2.2
library(data.table)
## data.table 1.9.6
d <- fread("d-noh.csv")
setnames(d, c("x","y"))
system.time(
  print(head(d[, list(ym=mean(y)), by=x][order(-ym)], 5))
)
##    user  system elapsed
##   6.811   0.176   7.005
Hi Szilard,
Just sent you a PR to also check MonetDBLite (R package, https://www.monetdb.org/blog/monetdblite-r), which does these queries in under 10 seconds on my box.
Running on the m3.2xlarge instance and following your test protocol, I get the following timings: group query 7.0 s, join query 1.5 s.
Awesome, thanks :) I added it to the benchmark: https://github.com/szilard/benchm-databases
Ah so :) Perhaps you could consider adding the obvious indexes to the DB benchmarks; I think that represents a more realistic use of databases. If not, the benchmarks would benefit from an explanation of why DB indexes are left out, imho. Anyway, interesting work!