This is the preferred method if you're collecting small objects from the Spark cluster: it's easy to set up, doesn't appear to cause many "out of memory" errors, and requires fewer dependencies.
- Install and configure Java.
  - Install Java 8:

    ```bash
    brew tap AdoptOpenJDK/openjdk
    brew install --cask adoptopenjdk8
    ```

  - Export the correct `JAVA_HOME` and run `R CMD javareconf`:

    ```bash
    export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)
    R CMD javareconf
    ```

- Install `rJava` and `RJDBC` (a quick sanity check follows this list):

  ```r
  # rJava can hit some issues:
  # - Error: `clang: error: unsupported option '-fopenmp'`
  #   - https://stackoverflow.com/a/43943631
  # - Error: `ld: framework not found JavaVM`
  #   - https://stackoverflow.com/a/65297107
  install.packages('rJava', configure.args = "--disable-jri")
  install.packages('RJDBC')
  ```
- Make sure you have a proper `~/.databricks-connect` file.
  - This is just a simple JSON object. If you don't have it, you can create it in `vim`:

    ```bash
    vim ~/.databricks-connect
    ```

  - Then paste:

    ```json
    {
      "host": "https://abc-d1e2345f-a6b2.cloud.databricks.com",
      "token": "[API_TOKEN_OMITTED]",
      "cluster_id": "0123-2345678-hello987",
      "org_id": "0"
    }
    ```

  - The only thing you'll have to add here yourself is your `token`, replacing `[API_TOKEN_OMITTED]` with your Databricks API token.
  - See the Databricks documentation for instructions on generating a new token.
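Before moving on, here's a quick sanity check that the setup worked. This is a minimal sketch: `.jinit()` and `.jcall()` are standard `rJava` calls, and the field names checked are the ones from the example config above.

```r
library(rJava)

.jinit()  # start the JVM; an error here usually means JAVA_HOME is misconfigured
.jcall("java/lang/System", "S", "getProperty", "java.version")
# should print a 1.8.x version string, e.g. "1.8.0_292"

# confirm the connect file parses and has the fields we'll need later
conf <- jsonlite::read_json("~/.databricks-connect")
stopifnot(all(c("host", "token", "cluster_id", "org_id") %in% names(conf)))
```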
The following script has a JDBC "header" which does everything for you. If you include it at the start of your script, you get two functions, `get_simba_driver()` and `jdbc_url()`:

- `get_simba_driver()` downloads the Simba JDBC driver if you don't have it. Otherwise, it returns the driver path on your machine.
- `jdbc_url()` generates the correct JDBC URL from your `~/.databricks-connect` credentials.

These two functions are then used to create the JDBC driver with `RJDBC::JDBC()` and then the connection with `DBI::dbConnect()`.
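For reference, with the example config above, `jdbc_url()` produces a string of this shape (the host, `org_id`, `cluster_id`, and token are slotted into the template in that order):

```
jdbc:spark://abc-d1e2345f-a6b2.cloud.databricks.com:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/0/0123-2345678-hello987;AuthMech=3;UID=token;PWD=[API_TOKEN_OMITTED]
```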
!! IMPORTANT !!

- `dbGetQuery()` works with small queries, but if you try to pull large objects you'll likely get a `java.lang.OutOfMemoryError: Java heap space` error.
- Use `dbSendQuery()`, `dbFetch()`, and `dbClearResult()` to fetch the query in small chunks at a time, as the example below does.
```r
# <<<< JDBC Header <<<< --------------

# Function to get the Simba JDBC driver.
# SOURCE: https://databricks.com/spark/jdbc-drivers-download
get_simba_driver <- function(
  simba_url = paste0(
    "https://databricks-bi-artifacts.s3.us-east-2.amazonaws.com/",
    "simbaspark-drivers/jdbc/2.6.17/SimbaSparkJDBC42-2.6.17.1021.zip")
) {
  SIMBA_DRIVER_PATH <- "~/Library/simba/spark"
  DEST_FILE <- file.path(SIMBA_DRIVER_PATH, basename(simba_url))
  dir.create(SIMBA_DRIVER_PATH, recursive = TRUE, showWarnings = FALSE)
  if (!file.exists(DEST_FILE)) {
    download.file(simba_url, destfile = DEST_FILE)
  }
  unzip(DEST_FILE, exdir = SIMBA_DRIVER_PATH)
  # return the path(s) to the driver .jar
  list.files(SIMBA_DRIVER_PATH,
             pattern = "\\.jar$",
             full.names = TRUE)
}

# MOTIVATING SOURCE:
# https://docs.databricks.com/integrations/bi/jdbc-odbc-bi.html
jdbc_url <- function() {
  DATABRICKS_CONF <- '~/.databricks-connect'
  stopifnot(file.exists(DATABRICKS_CONF))
  conf <- jsonlite::read_json(DATABRICKS_CONF)
  url <- gsub('http[s]?://', '', conf[['host']])
  sprintf(
    paste0(
      "jdbc:spark://%s:443/default;transportMode=http;",
      "ssl=1;httpPath=sql/protocolv1/o/%s/%s;AuthMech=3;",
      "UID=token;PWD=%s"),
    url, conf[['org_id']],
    conf[['cluster_id']], conf[['token']]
  )
}

# >>>> END JDBC Header >>>> -------------------
```
```r
# JDBC example -------------

# create the driver and connect to the jdbc_url
drv <- RJDBC::JDBC(
  "com.simba.spark.jdbc.Driver",
  get_simba_driver(),
  identifier.quote = "`"
)
conn <- DBI::dbConnect(drv, jdbc_url())

# we fetch the table in chunks to avoid memory issues
# - note that `DBI::dbGetQuery` will try to pull the whole object at once
# - it's better suited for small queries
qry <- "
SELECT id, date, value
FROM my_schema.some_table
LIMIT 5000"

n <- 500  # rows per chunk
rs <- DBI::dbSendQuery(conn, qry)
chunks <- list()
while (!DBI::dbHasCompleted(rs)) {
  chunks <- append(
    chunks,
    list(data.table::setDT(DBI::dbFetch(rs, n)))
  )
}
DBI::dbClearResult(rs)

df <- do.call(rbind, chunks)
head(df)
nrow(df)

DBI::dbDisconnect(conn)
```
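If you still hit heap errors even with chunked fetches, one option is to raise the JVM's maximum heap size. This isn't part of the header above, just a sketch of a common workaround; note that `options(java.parameters = ...)` only takes effect if it runs *before* `rJava` starts the JVM (i.e., before any `library(RJDBC)` call in the session).

```r
# must run BEFORE rJava/RJDBC initialize the JVM in this R session
options(java.parameters = "-Xmx8g")  # e.g. allow up to 8 GB of heap
library(RJDBC)
```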
- https://docs.databricks.com/integrations/bi/jdbc-odbc-bi.html (how to determine the JDBC URL configuration)
- https://databricks.com/spark/jdbc-drivers-download (where I found the JDBC driver for Databricks)