Skip to content

Instantly share code, notes, and snippets.

@nwstephens
Last active June 8, 2016 17:32
Show Gist options
  • Select an option

  • Save nwstephens/79d7cca4da977dcc8d50c45c4f32ffc3 to your computer and use it in GitHub Desktop.

Select an option

Save nwstephens/79d7cca4da977dcc8d50c45c4f32ffc3 to your computer and use it in GitHub Desktop.
### Compare SQLite to Spark
### June 5, 2016
### EC2 4 Core 16 Gb RAM
### Install ###
# Install
devtools::install_github("hadley/devtools")
devtools::install_github("hadley/dplyr")
devtools::install_github("rstudio/rspark", auth_token = "XXXXXXXXXX")
rspark::spark_install(version = "1.6.0", hadoop_version = "2.6", reset = TRUE)
### SQLite ###
# Setup
require(dplyr)
require(readr)
require(DBI)
airontimesql <- src_sqlite('airontime.sqlite3', create = TRUE)
# Load
for(i in 1987:2008){
ff <- paste0('/tmp/flights/', i, '.csv')
cat('reading', ff)
x <- read_csv(ff)
cat('loading flights', i)
dbWriteTable(airontimesql$con, 'flights', x, append = TRUE)
}
# Query performance (27 seconds)
flightsql <- tbl(airontimesql, 'flights')
Sys.time()
flightsql %>% group_by(year) %>% summarize(year = n())
Sys.time()
### Spark ###
# Setup
library(rspark)
sc <- spark_connect(memory = '14G', cores = '4')
db <- src_spark(sc)
# Load
load_csv(db, 'airontime', '/home/nathan/flights.csv')
src_tbls(db)
# Query Performance (7:35 min)
flightspark <- tbl(db, 'airontime')
Sys.time()
flightspark %>% group_by(year) %>% summarize(year=n())
Sys.time()
# Investigate
spark_web(sc)
@slopp

slopp commented Jun 7, 2016

Copy link
Copy Markdown

Nathan, maybe I'm missing something, but in the SQLite load step, shouldn't the command be:
ff <- paste0('tmp/flights/', i, '.csv')

It looks like you just dumped the same year into SQLite which might impact the group_by performance.

@nwstephens

Copy link
Copy Markdown
Author

Oh, that was sloppy; I had it correct before I pasted it to gist. It now reads correctly, thanks.

@jjallaire

Copy link
Copy Markdown

@nwstephens, posting this to a public gist caused the auth_token I am using to be revoked. Let's keep the install code out of public gists.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment