# Connecting to Spark on a local cluster and other basic Spark functions
# Load the sparklyr library into the R environment
library(sparklyr)
# Connect to a local Spark cluster
sc <- spark_connect(master = "local", version = "2.1.0")
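# A minimal configuration sketch (an assumption, not part of the original gist):
# spark_config() lets you tune the local driver before connecting, e.g. its memory.
# conf <- spark_config()
# conf$`sparklyr.shell.driver-memory` <- "2G"
# sc <- spark_connect(master = "local", version = "2.1.0", config = conf)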
# print the spark version
spark_version(sc)
# Check the data tables in the local Spark cluster
src_tbls(sc) # If no table has been copied to the cluster yet, character(0) is returned
# Copy data to spark local instance
flights_tbl <- copy_to(sc, nycflights13::flights, "flights", overwrite = TRUE)
# check data tables in spark local cluster
src_tbls(sc) # flights
# Note: flights_tbl is only a reference to the table held in Spark, so
# object.size() reports the size of the small R-side handle, not the data itself
object.size(flights_tbl)
# Check the column names of the data table
colnames(flights_tbl)
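# Because sparklyr provides a dplyr backend, the usual verbs are translated to
# Spark SQL and executed inside the cluster. A minimal sketch (assumes dplyr is
# installed; the column names come from nycflights13::flights):
library(dplyr)
delay_by_dest <- flights_tbl %>%
  group_by(dest) %>%
  summarise(mean_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  collect() # collect() pulls the aggregated result into an R data frame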
# USING SQL
# It's also possible to execute SQL queries directly against tables within a
# Spark cluster. The spark_connection object implements a DBI interface for
# Spark, so you can use dbGetQuery to execute SQL and return the result as an
# R data frame
library(DBI)
flights2013 <- dbGetQuery(sc, "SELECT flight, tailnum, origin, dest FROM flights WHERE year = 2013")
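# Alternative sketch: dplyr's tbl() combined with sql() returns a lazy remote
# table rather than an R data frame; the query runs in Spark and nothing is
# pulled into R until collect() is called (dplyr is attached above).
flights2013_lazy <- tbl(sc, sql("SELECT flight, tailnum, origin, dest FROM flights WHERE year = 2013"))
collect(flights2013_lazy)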
# Write the query result to a local csv file (the file name is illustrative)
write.csv(flights2013, file = "flights2013.csv", row.names = FALSE)
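# Hedged alternative: for data that still lives in Spark, sparklyr's
# spark_write_csv() writes csv files from the cluster without collecting the
# data into R first; the output path below is illustrative.
# spark_write_csv(flights_tbl, path = "flights-csv-out")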
# Disconnect from (and stop) the local Spark cluster
spark_disconnect(sc)