Last active
June 8, 2016 17:32
-
-
Save nwstephens/79d7cca4da977dcc8d50c45c4f32ffc3 to your computer and use it in GitHub Desktop.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| ### Compare SQLite to Spark | |
| ### June 5, 2016 | |
| ### EC2 4 Core 16 Gb RAM | |
| ### Install ### | |
| # Install | |
| devtools::install_github("hadley/devtools") | |
| devtools::install_github("hadley/dplyr") | |
| devtools::install_github("rstudio/rspark", auth_token = "XXXXXXXXXX") | |
| rspark::spark_install(version = "1.6.0", hadoop_version = "2.6", reset = TRUE) | |
| ### SQLite ### | |
| # Setup | |
| require(dplyr) | |
| require(readr) | |
| require(DBI) | |
| airontimesql <- src_sqlite('airontime.sqlite3', create = TRUE) | |
| # Load | |
| for(i in 1987:2008){ | |
| ff <- paste0('/tmp/flights/', i, '.csv') | |
| cat('reading', ff) | |
| x <- read_csv(ff) | |
| cat('loading flights', i) | |
| dbWriteTable(airontimesql$con, 'flights', x, append = TRUE) | |
| } | |
| # Query performance (27 seconds) | |
| flightsql <- tbl(airontimesql, 'flights') | |
| Sys.time() | |
| flightsql %>% group_by(year) %>% summarize(year = n()) | |
| Sys.time() | |
| ### Spark ### | |
| # Setup | |
| library(rspark) | |
| sc <- spark_connect(memory = '14G', cores = '4') | |
| db <- src_spark(sc) | |
| # Load | |
| load_csv(db, 'airontime', '/home/nathan/flights.csv') | |
| src_tbls(db) | |
| # Query Performance (7:35 min) | |
| flightspark <- tbl(db, 'airontime') | |
| Sys.time() | |
| flightspark %>% group_by(year) %>% summarize(year=n()) | |
| Sys.time() | |
| # Investigate | |
| spark_web(sc) | |
Author
Oh, that was sloppy; I had it correct before I pasted it to gist. It now reads correctly, thanks.
@nwstephens, posting this to a public gist caused the auth_token I am using to be revoked. Let's keep the install code out of public gists.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Nathan, maybe I'm missing something, but in the SQLite load step, shouldn't the command be:
ff <- paste0('tmp/flights/', i, '.csv')
It looks like you just dumped the same year into SQLite which might impact the group_by performance.