Create an EMR cluster with support for R.
Fix Development Tools

To fix the gower package installation, remove the conflicting GCC 7.2 packages and reinstall the Development Tools group:

sudo yum remove gcc72-c++.x86_64 libgcc72.x86_64
sudo yum groupinstall 'Development Tools'

Also create the file ~/.R/Makevars with the following contents:
CC = /usr/bin/gcc64
CXX = /usr/bin/g++
SHLIB_OPENMP_CFLAGS = -fopenmp
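A minimal sketch of creating that file from the shell (the compiler paths are the ones above; adjust them if your AMI uses different paths):

```shell
# Create the per-user R compiler configuration directory and file
mkdir -p ~/.R
cat > ~/.R/Makevars <<'EOF'
CC = /usr/bin/gcc64
CXX = /usr/bin/g++
SHLIB_OPENMP_CFLAGS = -fopenmp
EOF
```

R reads ~/.R/Makevars whenever it compiles a package from source, so these flags take effect on the next install.packages() call.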
Also install the R development headers:

sudo yum install R-devel
Then install the required packages:
install.packages("tidymodels")
install.packages("tune")
install.packages("mlbench")
install.packages("magrittr")
install.packages("dplyr")
install.packages("parsnip")
install.packages("kernlab")
library(dplyr)
library(magrittr)
library(parsnip)
library(recipes)
library(rsample)
library(yardstick)
library(tune)
Follow the tidymodels Grid Search Tutorial to build the svm_mod model, iono_rs resamples, roc_vals metrics, and ctrl control objects used below.
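For reference, a sketch of roughly what that tutorial builds (the object names svm_mod, iono_rs, roc_vals, and ctrl follow the tutorial; the exact preprocessing and seed there may differ):

```r
library(tidymodels)
library(mlbench)

# Ionosphere data; V2 is a constant factor column, so drop it
data(Ionosphere)
iono <- Ionosphere %>% dplyr::select(-V2)

set.seed(4943)
iono_rs <- bootstraps(iono, times = 30)

roc_vals <- metric_set(roc_auc)
ctrl <- control_grid(verbose = FALSE)

# RBF-kernel SVM with two tunable parameters, fit via kernlab
svm_mod <-
  svm_rbf(cost = tune(), rbf_sigma = tune()) %>%
  set_mode("classification") %>%
  set_engine("kernlab")
```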
Now let's try with sparklyr; in this case, using a 3-node cluster:
install.packages("remotes")
remotes::install_github("sparklyr/sparklyr")
library(sparklyr)
# Connect to Spark on YARN; 3 nodes with 8 CPUs each yields 24 executors
sc <- spark_connect(
  master = "yarn",
  spark_home = "/usr/lib/spark/",
  config = list(
    "spark.executor.instances" = 24
  )
)
# Validate spark_apply() is working properly; repartition across the 3 nodes with 8 CPUs each
sdf_len(sc, 3 * 8, repartition = 3 * 8) %>% spark_apply(~ 42)
First, let's capture the execution time without using Spark:
system.time({
  tune_grid(
    Class ~ .,
    model = svm_mod,
    resamples = iono_rs,
    metrics = roc_vals,
    control = ctrl
  )
})
user system elapsed
133.386 0.503 133.883
You can then register Spark as a foreach backend; note that this is a new feature to be released in sparklyr 1.2:
# Register Spark as the foreach backend
registerDoSpark(sc)
# Check number of parallel workers
foreach::getDoParWorkers()
[1] 24
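As a quick sanity check that the backend is wired up (a minimal sketch; the computation itself is arbitrary), each %dopar% iteration should now be dispatched to a Spark worker:

```r
library(foreach)
# These three iterations run through Spark rather than sequentially in the local session
foreach(i = 1:3) %dopar% sqrt(i)
```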
Then rerun the grid search, this time using Spark:
system.time({
  tune_grid(
    Class ~ .,
    model = svm_mod,
    resamples = iono_rs,
    metrics = roc_vals,
    control = ctrl
  )
})
user system elapsed
3.735 0.310 85.088