@eddjberry
Last active August 22, 2019 21:24
An example of creating a Spark pipeline with sparklyr
# Load packages
library(dplyr)
library(sparklyr)
# Set up a local Spark connection
sc <- spark_connect(master = "local")
# Create a Spark DataFrame of mtcars
mtcars_sdf <- copy_to(sc, mtcars)
# The feature cols
feature_cols <-
  c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "gear", "carb")
# Vector assembler: combines the feature columns into a single "features" column
vector_assembler <-
  ft_vector_assembler(
    sc,
    input_cols = feature_cols,
    output_col = "features")
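# Sketch (not in the original gist): the assembler is itself a transformer, so it
# can be applied directly to the data with ml_transform() to inspect the
# assembled "features" column before building the full pipeline;
# the name assembled_check is illustrative
assembled_check <- ml_transform(vector_assembler, mtcars_sdf)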
# Estimator: a random forest classifier with "am" as the label
estimator <-
  ml_random_forest_classifier(
    sc,
    label_col = "am")
# Evaluator: binary classification evaluator on the "am" label
evaluator <-
  ml_binary_classification_evaluator(
    sc,
    label_col = "am")
# A parameter grid for the random forest stage
param_grid <- list(
  random_forest = list(
    num_trees = list(20, 30, 40),
    max_depth = list(5, 6),
    impurity = list("entropy")))
# Create the pipeline: the vector assembler followed by a cross-validated random forest
pipeline <- ml_pipeline(vector_assembler) %>%
  ml_cross_validator(
    estimator = estimator,
    estimator_param_maps = param_grid,
    evaluator = evaluator,
    num_folds = 5)
# Fit the pipeline
pipeline_model <- pipeline %>%
  ml_fit(mtcars_sdf)
# Pull out the cross-validation stage (the second stage of the fitted pipeline)
pipeline_model_cv <- ml_stage(pipeline_model, 2)
# Print the average metrics across folds for each parameter combination
pipeline_model_cv$avg_metrics_df
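# Sketch (beyond the original gist): the fitted pipeline model can score data
# with ml_predict(), and the cross validator's best model is available via
# best_model; the names predictions and best_rf are illustrative
predictions <- ml_predict(pipeline_model, mtcars_sdf)
best_rf <- pipeline_model_cv$best_model
# Disconnect when finished
spark_disconnect(sc)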