sqlite vs R's data.table

TLDR; sqlite (:memory:) 250 sec, data.table 7 sec

data: 100 million rows, 1 million groups
generated by: https://github.com/szilard/benchm-databases/blob/master/0-gendata.txt

sqlite3 :memory:
## SQLite version 3.8.4.1

create table d(x int, y double);
.mode csv
.import d-noh.csv d
.timer on

select x, avg(y) as ym
from d
group by x
order by ym desc
limit 5;
## Run Time: real 249.674 user 227.471406 sys 13.416719

R
## R version 3.2.2

library(data.table)
## data.table 1.9.6

d <- fread("d-noh.csv")
setnames(d, c("x","y"))

system.time(
  print(head(d[, list(ym=mean(y)), by=x][order(-ym)], 5))
)
##    user  system elapsed
##   6.811   0.176   7.005
No (auto) key for DT above. But you can set a key, in which case the DT query will run in 1.5 sec :)
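For reference, keying looks like this in data.table (a minimal sketch, assuming the same d as loaded above; the 1.5 sec figure is from the comment, not re-measured here):

setkey(d, x)   # physically sorts d by x and marks x as the key
print(head(d[, list(ym=mean(y)), by=x][order(-ym)], 5))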
Ah so :) Perhaps you could consider adding the obvious indexes to the db benchmarks - I think that represents a more realistic use of databases. If not, the benchmarks would benefit from an explanation of why db indexes are left out, imho. Anyway, interesting work!
Hi Szilard,
just sent you a PR to also check MonetDBLite (R package, https://www.monetdb.org/blog/monetdblite-r), which does these queries in < 10 seconds on my box.
When running on the m3.2xlarge instance and following your test protocol, I get the following timings: group query 7.0 s, join query 1.5 s.
Awesome, thanks :) I added it to the benchmark: https://github.com/szilard/benchm-databases
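For readers who want to try the MonetDBLite run, a hedged sketch using its standard DBI interface (the database directory and the handoff of d from the fread above are illustrative, not the benchmark's exact protocol):

library(DBI)
con <- dbConnect(MonetDBLite::MonetDBLite(), "/tmp/mdblite")  # embedded db dir (illustrative path)
dbWriteTable(con, "d", d)                                     # d as loaded with fread above
dbGetQuery(con, "select x, avg(y) as ym from d group by x order by ym desc limit 5")
dbDisconnect(con, shutdown = TRUE)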

Interesting. Does DT do the automatic indexing here? If so, it would be a more accurate comparison to add an index to the x column in sqlite. On my machine that reduces the query time to 150 seconds. Not quite 7, but at least 100 seconds faster.
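For concreteness, the indexed variant in the sqlite3 shell would look something like this (index name is illustrative; timings will vary by machine):

create index d_x_idx on d(x);
select x, avg(y) as ym
from d
group by x
order by ym desc
limit 5;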