bfraiche

Washington, DC

bfraiche / bayes_w_r_and_sparklyr.R

Last active April 22, 2020 01:32

This gist contains the complete code for my blogpost: 'Bayesian Machine Learning and NLP with R and sparklyr'

	mc$defaultLibrary <- "sparklyr"

	library(sparklyr)
	library(tidyverse)

	speeches <- magpie::sql(mc, "SELECT * FROM presidential_speeches WHERE president")

	partitions <- speeches %>%
	ft_tokenizer(input_col = 'speech_text', output_col = 'words') %>%
	ft_stop_words_remover(input_col = 'words', output_col = 'clean_words') %>%

bfraiche / random_forest_with_python_and_spark_ml.py

Created April 2, 2019 22:30

This gist contains the complete code for my blogpost: 'Random Forest with Python and Spark ML'

	from pyspark.ml import Pipeline
	from pyspark.ml.feature import VectorAssembler
	from pyspark.ml.regression import RandomForestRegressor
	from pyspark.ml.evaluation import RegressionEvaluator
	from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
	import matplotlib.pyplot as plt
	import numpy as np

	# Pull in the data
	df = mc.sql("SELECT * FROM kings_county_housing")

bfraiche / vec_asmbl.py

Created April 2, 2019 17:43

This gist contains code snippets for my blogpost: 'Random Forest with Python and Spark ML'

	from pyspark.ml.feature import VectorAssembler

	feature_list = []
	for col in df.columns:
	    if col == 'label':
	        continue
	    else:
	        feature_list.append(col)

	assembler = VectorAssembler(inputCols=feature_list, outputCol="features")
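The loop in this snippet collects every column except the label. A minimal, Spark-free sketch of the same selection follows, assuming a hypothetical column list in place of `df.columns`; the column names and the `select_features` helper are illustrative and do not come from the original gist.

```python
# Hypothetical column names standing in for df.columns (not from the original data).
columns = ["label", "bedrooms", "bathrooms", "sqft_living"]


def select_features(columns, label_col="label"):
    """Return every column name except the label column, preserving order."""
    return [col for col in columns if col != label_col]


feature_list = select_features(columns)
print(feature_list)
```

The resulting list is what would be passed as `inputCols` to `VectorAssembler`; keeping the order stable matters because the feature importances later line up with it by position.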

bfraiche / train_model.py

Created April 2, 2019 17:43

This gist contains code snippets for my blogpost: 'Random Forest with Python and Spark ML'

cvModel = crossval.fit(trainingData)

bfraiche / test_pred.py

Created April 2, 2019 17:42

This gist contains code snippets for my blogpost: 'Random Forest with Python and Spark ML'

predictions = cvModel.transform(testData)

bfraiche / split_data.py

Created April 2, 2019 17:42

This gist contains code snippets for my blogpost: 'Random Forest with Python and Spark ML'

(trainingData, testData) = df.randomSplit([0.8, 0.2])

bfraiche / importance.py

Last active April 2, 2019 22:17

This gist contains code snippets for my blogpost: 'Random Forest with Python and Spark ML'

	bestPipeline = cvModel.bestModel
	bestModel = bestPipeline.stages[1]

	importances = bestModel.featureImportances

	x_values = list(range(len(importances)))

	plt.bar(x_values, importances, orientation = 'vertical')
	plt.xticks(x_values, feature_list, rotation=40)
	plt.ylabel('Importance')

bfraiche / get_df.py

Created April 2, 2019 17:42

This gist contains code snippets for my blogpost: 'Random Forest with Python and Spark ML'

df = mc.sql("SELECT * FROM kings_county_housing")

bfraiche / evaluate.py

Created April 2, 2019 17:42

This gist contains code snippets for my blogpost: 'Random Forest with Python and Spark ML'

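	import matplotlib.pyplot as plt
	from pyspark.ml.evaluation import RegressionEvaluator

	# Score the held-out predictions by root mean squared error
	evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")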

	rmse = evaluator.evaluate(predictions)

	rfPred = cvModel.transform(df)

	rfResult = rfPred.toPandas()
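For readers unfamiliar with the `rmse` metric requested from the evaluator, here is a small plain-Python sketch of the quantity it reports: the square root of the mean squared difference between labels and predictions. The `rmse` function and the sample values are hypothetical and only illustrate the formula; this is not the Spark implementation.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between two equal-length, non-empty sequences."""
    if len(labels) != len(predictions) or len(labels) == 0:
        raise ValueError("labels and predictions must be non-empty and the same length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values for illustration only.
labels = [3.0, 5.0, 2.0]
predictions = [2.0, 5.0, 4.0]
print(rmse(labels, predictions))
```

Because the errors are squared before averaging, a few large misses raise the score more than many small ones, which is worth keeping in mind when comparing models on skewed targets such as house prices.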

bfraiche / build_pl.py

Created April 2, 2019 17:41

This gist contains code snippets for my blogpost: 'Random Forest with Python and Spark ML'

	from pyspark.ml import Pipeline

	pipeline = Pipeline(stages=[assembler, rf])
