sqlite vs R's data.table
TL;DR: sqlite (:memory:) 250 sec, data.table 7 sec
data: 100 million rows, 1 million groups
generated by: https://github.com/szilard/benchm-databases/blob/master/0-gendata.txt
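
For reference, a minimal R sketch of data in this shape; the linked 0-gendata.txt is the authoritative generator, so the uniform y values and sampled group ids below are assumptions:

library(data.table)
n <- 1e8   # 100 million rows
m <- 1e6   # 1 million groups
d <- data.table(x = sample(m, n, replace = TRUE),  # group id in 1..m
                y = runif(n))                      # value to be averaged
fwrite(d, "d-noh.csv", col.names = FALSE)  # headerless CSV for .import below
                                           # (fwrite needs data.table >= 1.9.8)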
sqlite3 :memory:
## SQLite version 3.8.4.1
create table d(x int, y double);
.mode csv
.import d-noh.csv d
.timer on
select x, avg(y) as ym
from d
group by x
order by ym desc
limit 5;
## Run Time: real 249.674 user 227.471406 sys 13.416719
R | |
## R version 3.2.2
library(data.table)
## data.table 1.9.6
d <- fread("d-noh.csv")
setnames(d, c("x","y"))
system.time(
  print(head(d[, list(ym=mean(y)), by=x][order(-ym)], 5))
)
##    user  system elapsed
##   6.811   0.176   7.005
No (auto) key for DT above. But you can set a key, in which case the DT query will run in 1.5 sec :)
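Presumably the keyed variant looks something like this (setkey() physically sorts d by x, so the by=x grouping can then use the key):

setkey(d, x)
system.time(
  print(head(d[, list(ym=mean(y)), by=x][order(-ym)], 5))
)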
Ah so :) Perhaps you could consider adding the obvious indexes to the db benchmarks - I think that represents a more realistic use of databases. If not, the benchmarks would benefit from an explanation of why db indexes are left out, imho. Anyway, interesting work!
Hi Szilard,
Just sent you a PR to also check MonetDBLite (R package, https://www.monetdb.org/blog/monetdblite-r), which does these queries in < 10 seconds on my box.
When running on the m3.2xlarge instance and following your test protocol, I get the following timings: group query 7.0 s, join query 1.5 s.
Awesome, thanks :) I added it to the benchmark: https://github.com/szilard/benchm-databases
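
For completeness, a sketch of what the MonetDBLite run might look like through DBI; the dbdir path is hypothetical and the MonetDBLite() driver constructor is assumed from the package docs:

library(DBI)
con <- dbConnect(MonetDBLite::MonetDBLite(), "/tmp/benchdb")  # hypothetical dbdir
dbWriteTable(con, "d", d)  # d from the fread() above
system.time(
  print(dbGetQuery(con,
    "select x, avg(y) as ym from d group by x order by ym desc limit 5"))
)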
Interesting. Does DT do the automatic indexing here? If so, it would be a more accurate comparison to add an index to the x column in sqlite. On my machine that reduces the query time to 150 seconds. Not quite 7, but at least 100 seconds faster.
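
A sketch of that indexed run from R using DBI + RSQLite (the gist itself used the sqlite3 shell, so the packages and index name here are assumptions):

library(DBI)
sq <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(sq, "d", d)                   # d from the fread() above
dbExecute(sq, "create index d_x on d(x)")  # index on the grouping column
system.time(
  print(dbGetQuery(sq,
    "select x, avg(y) as ym from d group by x order by ym desc limit 5"))
)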