Create an EMR cluster with support for R.
Fix Development Tools

To fix the gower package installation, remove the conflicting GCC 7.2 packages and reinstall the Development Tools group:

sudo yum remove gcc72-c++.x86_64 libgcc72.x86_64
sudo yum groupinstall 'Development Tools'

Also create the file ~/.R/Makevars with the following contents:
CC = /usr/bin/gcc64
CXX = /usr/bin/g++
SHLIB_OPENMP_CFLAGS = -fopenmp
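A minimal sketch of creating that file from the shell (the compiler paths are the ones above; adjust them if your AMI uses different paths):

```shell
# Create the per-user R compiler configuration directory and file
mkdir -p ~/.R
cat > ~/.R/Makevars <<'EOF'
CC = /usr/bin/gcc64
CXX = /usr/bin/g++
SHLIB_OPENMP_CFLAGS = -fopenmp
EOF
```

R reads ~/.R/Makevars whenever it compiles a package from source, so these flags take effect on the next install.packages() call.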
Also install the R development headers:

sudo yum install R-devel
Then install the required packages:
install.packages("tidymodels")
install.packages("tune")
install.packages("mlbench")
install.packages("magrittr")
install.packages("dplyr")
install.packages("parsnip")
install.packages("kernlab")
library(dplyr)
library(magrittr)
library(parsnip)
library(recipes)
library(rsample)
library(yardstick)
library(tune)
Follow the tidymodels Grid Search Tutorial to build the svm_mod model, iono_rs resamples, roc_vals metrics, and ctrl control objects used below.
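For reference, a sketch of roughly what that tutorial builds (the object names svm_mod, iono_rs, roc_vals, and ctrl follow the tutorial; the exact preprocessing and seed there may differ):

```r
library(tidymodels)
library(mlbench)

# Ionosphere data; V2 is a constant factor column, so drop it
data(Ionosphere)
iono <- Ionosphere %>% dplyr::select(-V2)

set.seed(4943)
iono_rs <- bootstraps(iono, times = 30)

roc_vals <- metric_set(roc_auc)
ctrl <- control_grid(verbose = FALSE)

# RBF-kernel SVM with two tunable parameters, fit via kernlab
svm_mod <-
  svm_rbf(cost = tune(), rbf_sigma = tune()) %>%
  set_mode("classification") %>%
  set_engine("kernlab")
```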
Now let's try with sparklyr; in this case, using a 3-node cluster:
install.packages("remotes")
remotes::install_github("sparklyr/sparklyr")
library(sparklyr)
# Connect to Spark on YARN; 3 nodes with 8 CPUs each yields 24 executors
sc <- spark_connect(
  master = "yarn",
  spark_home = "/usr/lib/spark/",
  config = list(
    "spark.executor.instances" = 24
  )
)
# Validate spark_apply() is working properly; repartition across the 3 nodes with 8 CPUs each
sdf_len(sc, 3 * 8, repartition = 3 * 8) %>% spark_apply(~ 42)
First, let's capture the execution time without using Spark:
system.time({
  tune_grid(
    Class ~ .,
    model = svm_mod,
    resamples = iono_rs,
    metrics = roc_vals,
    control = ctrl
  )
})
user system elapsed
133.386 0.503 133.883
You can then register Spark as a foreach backend; note that this is a new feature to be released in sparklyr 1.2:
# Register Spark as the foreach backend
registerDoSpark(sc)
# Check number of parallel workers
foreach::getDoParWorkers()
[1] 24
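As a quick sanity check that the backend is wired up (a minimal sketch; the computation itself is arbitrary), each %dopar% iteration should now be dispatched to a Spark worker:

```r
library(foreach)
# These three iterations run through Spark rather than sequentially in the local session
foreach(i = 1:3) %dopar% sqrt(i)
```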
Then rerun the grid search, this time using Spark:
system.time({
  tune_grid(
    Class ~ .,
    model = svm_mod,
    resamples = iono_rs,
    metrics = roc_vals,
    control = ctrl
  )
})
user system elapsed
3.735 0.310 85.088