mc$defaultLibrary <- "sparklyr"
library(sparklyr)
library(tidyverse)
speeches <- magpie::sql(mc, "SELECT * FROM presidential_speeches WHERE president")
partitions <- speeches %>%
  ft_tokenizer(input_col = 'speech_text', output_col = 'words') %>%
  ft_stop_words_remover(input_col = 'words', output_col = 'clean_words') %>%
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
import matplotlib.pyplot as plt
import numpy as np
# Pull in the data
df = mc.sql("SELECT * FROM kings_county_housing")
from pyspark.ml.feature import VectorAssembler
# Use every column except the label as a model feature
feature_list = [col for col in df.columns if col != 'label']
assembler = VectorAssembler(inputCols=feature_list, outputCol="features")
cvModel = crossval.fit(trainingData)
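`crossval` is not defined in these snippets. Assuming it is a `CrossValidator` built from a `ParamGridBuilder` grid, the `fit` call trains one model per parameter combination per fold, then refits the best combination on the full training set. The pure-Python sketch below only illustrates that counting. The parameter values and `num_folds` are hypothetical and not taken from the original; `expand_grid` is an illustrative helper, not a Spark API.

```python
from itertools import product


def expand_grid(grid):
    """Return every combination of the parameter values in `grid`."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


# Hypothetical values; numTrees and maxDepth are RandomForestRegressor parameters.
grid = {"numTrees": [20, 50], "maxDepth": [5, 10, 15]}
num_folds = 3  # hypothetical fold count

param_maps = expand_grid(grid)
# One fit per parameter combination per fold during the search.
fits_during_search = len(param_maps) * num_folds
print(len(param_maps), fits_during_search)
```

The search cost grows multiplicatively with each added parameter list, so a small grid is usually a sensible starting point.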
predictions = cvModel.transform(testData)
(trainingData, testData) = df.randomSplit([0.8, 0.2])
bestPipeline = cvModel.bestModel
bestModel = bestPipeline.stages[1]  # the random forest stage, after the assembler
# Convert the importance vector to a NumPy array so matplotlib can plot it
importances = bestModel.featureImportances.toArray()
x_values = list(range(len(importances)))
plt.bar(x_values, importances, orientation='vertical')
plt.xticks(x_values, feature_list, rotation=40)
plt.ylabel('Importance')
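The plot above relies on the importance values being in the same order as `feature_list`, which should hold when each assembled input column is a single numeric value. Under that assumption, a ranked text listing can be easier to read than tilted tick labels. The helper below is a minimal pure-Python sketch; `rank_features` and the example names and numbers are hypothetical, not output from the original model.

```python
def rank_features(names, importances):
    """Pair each feature name with its importance, highest first."""
    if len(names) != len(importances):
        raise ValueError("names and importances must have the same length")
    return sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)


# Hypothetical names and values, for illustration only.
example_names = ["sqft_living", "bedrooms", "grade"]
example_importances = [0.5, 0.125, 0.375]

ranked = rank_features(example_names, example_importances)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

With the real model, the same helper could be called with `feature_list` and the list of importance values.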
df = mc.sql("SELECT * FROM kings_county_housing")
import matplotlib.pyplot as plt
evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
rfPred = model.transform(df)
rfResult = rfPred.toPandas()
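For reference, the `rmse` metric requested above is the square root of the mean squared difference between label and prediction. The sketch below computes it by hand in plain Python, which can serve as a sanity check on a small sample. The function name `root_mean_squared_error` and the sample numbers are hypothetical.

```python
import math


def root_mean_squared_error(labels, predictions):
    """Square root of the mean squared difference between two sequences."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical sample values, for illustration only.
sample_labels = [3.0, 5.0, 7.0]
sample_predictions = [2.0, 5.0, 9.0]
sample_rmse = root_mean_squared_error(sample_labels, sample_predictions)
print(sample_rmse)
```

Because the errors are squared before averaging, a few large misses raise this metric more than many small ones.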
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[assembler, rf])