This is the preferred method if you're collecting small objects from the Spark cluster: it's easy to set up, doesn't appear to cause many "out of memory" errors, and requires fewer dependencies.
- Install and configure Java.
  - Install Java 8:

    ```bash
    brew tap AdoptOpenJDK/openjdk
    brew install --cask adoptopenjdk8
    ```

  - Export the correct `JAVA_HOME` and run `R CMD javareconf`:

    ```bash
    export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)
    R CMD javareconf
    ```

- Install `rJava` and `RJDBC` (a quick sanity check follows this list):

  ```r
  # rJava can hit some issues:
  # - Error: `clang: error: unsupported option '-fopenmp'`
  #   - https://stackoverflow.com/a/43943631
  # - Error: `ld: framework not found JavaVM`
  #   - https://stackoverflow.com/a/65297107
  install.packages('rJava', configure.args = "--disable-jri")
  install.packages('RJDBC')
  ```
- Make sure you have a proper `~/.databricks-connect` file.
  - This is just a simple JSON object. If you don't have it, you can create it in `vim`:

    ```bash
    vim ~/.databricks-connect
    ```

  - Then paste:

    ```json
    {
      "host": "https://abc-d1e2345f-a6b2.cloud.databricks.com",
      "token": "[API_TOKEN_OMITTED]",
      "cluster_id": "0123-2345678-hello987",
      "org_id": "0"
    }
    ```

  - The only thing you'll have to add here yourself is your `token`, replacing `[API_TOKEN_OMITTED]` with your Databricks API token.
  - See the Databricks documentation for instructions on generating a new token.
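Before moving on, here's a quick sanity check that the setup worked. This is a minimal sketch: `.jinit()` and `.jcall()` are standard `rJava` calls, and the field names checked are the ones from the example config above.

```r
library(rJava)

.jinit()  # start the JVM; an error here usually means JAVA_HOME is misconfigured
.jcall("java/lang/System", "S", "getProperty", "java.version")
# should print a 1.8.x version string, e.g. "1.8.0_292"

# confirm the connect file parses and has the fields we'll need later
conf <- jsonlite::read_json("~/.databricks-connect")
stopifnot(all(c("host", "token", "cluster_id", "org_id") %in% names(conf)))
```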
The following script has a JDBC "header" which does everything for you. If you include it at the start of your script, you get two functions, `get_simba_driver()` and `jdbc_url()`:

- `get_simba_driver()` downloads the Simba JDBC driver if you don't have it. Otherwise, it returns the driver path on your machine.
- `jdbc_url()` generates the correct JDBC URL from your `~/.databricks-connect` credentials.

These two functions are then used to create the JDBC driver with `RJDBC::JDBC()` and then the connection with `DBI::dbConnect()`.
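For reference, with the example config above, `jdbc_url()` produces a string of this shape (the host, `org_id`, `cluster_id`, and token are slotted into the template in that order):

```
jdbc:spark://abc-d1e2345f-a6b2.cloud.databricks.com:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/0/0123-2345678-hello987;AuthMech=3;UID=token;PWD=[API_TOKEN_OMITTED]
```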
!! IMPORTANT !!

- `dbGetQuery()` works with small queries, but if you try to pull large objects you'll likely get a `java.lang.OutOfMemoryError: Java heap space` error.
- Use `dbSendQuery()`, `dbFetch()`, and `dbClearResult()` to fetch the query in small chunks at a time, as the example below does.
```r
# <<<< JDBC Header <<<< --------------

# Function to get the Simba JDBC driver.
# SOURCE: https://databricks.com/spark/jdbc-drivers-download
get_simba_driver <- function(
  simba_url = paste0(
    "https://databricks-bi-artifacts.s3.us-east-2.amazonaws.com/",
    "simbaspark-drivers/jdbc/2.6.17/SimbaSparkJDBC42-2.6.17.1021.zip")
) {
  SIMBA_DRIVER_PATH <- "~/Library/simba/spark"
  DEST_FILE <- file.path(SIMBA_DRIVER_PATH, basename(simba_url))
  dir.create(SIMBA_DRIVER_PATH, recursive = TRUE, showWarnings = FALSE)
  if (!file.exists(DEST_FILE)) {
    download.file(simba_url, destfile = DEST_FILE)
  }
  unzip(DEST_FILE, exdir = SIMBA_DRIVER_PATH)
  # return the path(s) to the driver .jar
  list.files(SIMBA_DRIVER_PATH,
             pattern = "\\.jar$",
             full.names = TRUE)
}

# MOTIVATING SOURCE:
# https://docs.databricks.com/integrations/bi/jdbc-odbc-bi.html
jdbc_url <- function() {
  DATABRICKS_CONF <- '~/.databricks-connect'
  stopifnot(file.exists(DATABRICKS_CONF))
  conf <- jsonlite::read_json(DATABRICKS_CONF)
  url <- gsub('http[s]?://', '', conf[['host']])
  sprintf(
    paste0(
      "jdbc:spark://%s:443/default;transportMode=http;",
      "ssl=1;httpPath=sql/protocolv1/o/%s/%s;AuthMech=3;",
      "UID=token;PWD=%s"),
    url, conf[['org_id']],
    conf[['cluster_id']], conf[['token']]
  )
}

# >>>> END JDBC Header >>>> -------------------
```
```r
# JDBC example -------------

# create the driver and connect to the jdbc_url
drv <- RJDBC::JDBC(
  "com.simba.spark.jdbc.Driver",
  get_simba_driver(),
  identifier.quote = "`"
)
conn <- DBI::dbConnect(drv, jdbc_url())

# we fetch the table in chunks to avoid memory issues
# - note that `DBI::dbGetQuery` will try to pull the whole object at once
# - it's better suited for small queries
qry <- "
SELECT id, date, value
FROM my_schema.some_table
LIMIT 5000"

n <- 500  # rows per chunk
rs <- DBI::dbSendQuery(conn, qry)
chunks <- list()
while (!DBI::dbHasCompleted(rs)) {
  chunks <- append(
    chunks,
    list(data.table::setDT(DBI::dbFetch(rs, n)))
  )
}
DBI::dbClearResult(rs)

df <- do.call(rbind, chunks)
head(df)
nrow(df)

DBI::dbDisconnect(conn)
```
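If you still hit heap errors even with chunked fetches, one option is to raise the JVM's maximum heap size. This isn't part of the header above, just a sketch of a common workaround; note that `options(java.parameters = ...)` only takes effect if it runs *before* `rJava` starts the JVM (i.e., before any `library(RJDBC)` call in the session).

```r
# must run BEFORE rJava/RJDBC initialize the JVM in this R session
options(java.parameters = "-Xmx8g")  # e.g. allow up to 8 GB of heap
library(RJDBC)
```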
- https://docs.databricks.com/integrations/bi/jdbc-odbc-bi.html (how to determine the JDBC URL configuration)
- https://databricks.com/spark/jdbc-drivers-download (where I found the JDBC driver for Databricks)