@shengch02
shengch02 / Classifying sentiment of review with logistic regression
Last active December 23, 2016 20:17
(Python) Use SFrames to do some feature engineering. Train a logistic regression model to predict the sentiment of product reviews. Inspect the weights (coefficients) of a trained logistic regression model. Make a prediction (both class and probability) of sentiment for a new product review. Given the logistic regression weights, predictors and g…
#the dataset consists of baby product reviews on Amazon.com
#link for data: https://d18ky98rnyall9.cloudfront.net/_35bdebdff61378878ea2247780005e52_amazon_baby.gl.zip?Expires=1482278400&Signature=blPJv6YQNFgcZh~dULuDECzZlA6eGL1x9lzQKzHknqVHSdudmfjq0XPaokFjv-~Qy8nGADiBBdx4ar0BWgeboW1eTkYHOZzoUIMBfSPQGqA4Q9H8X8vwFyr9R-TC0LE4h4CsTRFH56BtbqpKtjKeJKxVv5E5LfZZiyhZEr6We5M_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A
import sframe
products = sframe.SFrame('amazon_baby.gl/')
#clean the original data: remove punctuation, fill in N/A, remove neutral sentiment,
# perform a train/test split, produce word count matrix
def remove_punctuation(text):
    import string
    return text.translate(None, string.punctuation)
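The snippet above targets Python 2 with GraphLab's SFrame. A minimal Python 3 sketch of the same cleaning and word-count steps, using only the standard library and a hypothetical toy review list in place of `amazon_baby.gl`:

```python
import string
from collections import Counter

def remove_punctuation(text):
    # Python 3 equivalent of the Python 2 text.translate(None, string.punctuation)
    return text.translate(str.maketrans('', '', string.punctuation))

# toy stand-in for the Amazon baby-product reviews
reviews = ["Great product, love it!", "Awful quality... do not buy."]
cleaned = [remove_punctuation(r).lower() for r in reviews]
word_counts = [Counter(r.split()) for r in cleaned]  # one bag of words per review
```

The bag-of-words dictionaries play the role of the word-count matrix the gist builds before training.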
@shengch02
shengch02 / Implementing logistic regression from scratch
Last active December 23, 2016 20:15
(Python) Extract features from Amazon product reviews. Convert an SFrame into a NumPy array. Implement the link function for logistic regression. Write a function to compute the derivative of the log likelihood function with respect to a single coefficient. Implement gradient ascent. Given a set of coefficients, predict sentiments. Compute class…
#implement logistic regression from scratch
import math
import pandas as pd
import numpy as np
#the dataset consists of a subset of baby product reviews on Amazon.com
import sframe
products = sframe.SFrame('amazon_baby_subset.gl/')
products = sframe.SFrame.to_dataframe(products)
print sum(products['sentiment']==1) #num of positive sentiment 26579
print sum(products['sentiment']==-1) #num of negative sentiment 26493
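The from-scratch pieces the description names (link function, per-coefficient derivative, gradient ascent) can be sketched in Python 3 with NumPy on hypothetical toy features; the function names mirror the description, not an exact copy of the gist:

```python
import numpy as np

def predict_probability(features, coefficients):
    # link function: P(y = +1 | x, w) = 1 / (1 + exp(-w.T x))
    return 1.0 / (1.0 + np.exp(-features.dot(coefficients)))

def feature_derivative(errors, feature):
    # derivative of the log likelihood with respect to a single coefficient
    return errors.dot(feature)

def logistic_regression(features, sentiment, step_size=0.1, max_iter=200):
    coefficients = np.zeros(features.shape[1])
    indicator = (sentiment == +1)
    for _ in range(max_iter):
        predictions = predict_probability(features, coefficients)
        errors = indicator - predictions
        for j in range(len(coefficients)):        # gradient ascent, one coefficient at a time
            coefficients[j] += step_size * feature_derivative(errors, features[:, j])
    return coefficients

# toy data: column 0 is the intercept, column 1 separates the classes
X = np.array([[1., 2.], [1., -1.], [1., 3.], [1., -2.]])
y = np.array([+1, -1, +1, -1])
w = logistic_regression(X, y)
predictions = np.where(predict_probability(X, w) > 0.5, +1, -1)
```

On this separable toy set the learned coefficients recover the sentiment of every example.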
@shengch02
shengch02 / Logistic Regression with L2 regularization
Created December 24, 2016 23:13
(Python) Extract features from Amazon product reviews. Convert a dataframe into a NumPy array. Write a function to compute the derivative of the log likelihood function with an L2 penalty with respect to a single coefficient. Implement gradient ascent with an L2 penalty. Empirically explore how the L2 penalty can ameliorate overfitting.
#Logistic Regression with L2 regularization
import math
import pandas as pd
import numpy as np
#the dataset consists of a subset of baby product reviews on Amazon.com
import sframe
products = sframe.SFrame('amazon_baby_subset.gl/')
print sum(products['sentiment']==1) #num of positive sentiment 26579
print sum(products['sentiment']==-1) #num of negative sentiment 26493
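The L2-penalized derivative the description mentions differs from the plain one only by a shrinkage term on non-intercept coefficients. A Python 3 sketch on hypothetical toy data, showing the shrinkage effect empirically:

```python
import numpy as np

def sigmoid(scores):
    return 1.0 / (1.0 + np.exp(-scores))

def feature_derivative_with_L2(errors, feature, coefficient, l2_penalty, is_intercept):
    # d/dw_j of (log likelihood - l2_penalty * ||w||^2)
    derivative = errors.dot(feature)
    if not is_intercept:
        derivative -= 2.0 * l2_penalty * coefficient  # the intercept is conventionally not penalized
    return derivative

def fit(features, sentiment, l2_penalty, step_size=0.1, max_iter=200):
    coefficients = np.zeros(features.shape[1])
    indicator = (sentiment == +1)
    for _ in range(max_iter):
        errors = indicator - sigmoid(features.dot(coefficients))
        for j in range(len(coefficients)):
            coefficients[j] += step_size * feature_derivative_with_L2(
                errors, features[:, j], coefficients[j], l2_penalty, is_intercept=(j == 0))
    return coefficients

X = np.array([[1., 2.], [1., -1.], [1., 3.], [1., -2.]])
y = np.array([+1, -1, +1, -1])
w_no_penalty = fit(X, y, l2_penalty=0.0)
w_penalized = fit(X, y, l2_penalty=5.0)
# the L2 penalty shrinks the non-intercept weight toward zero
```

Shrinking large weights is exactly how the penalty curbs overfitting: on separable data the unpenalized weight keeps growing, while the penalized one settles at a finite value.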
@shengch02
shengch02 / Identifying safe loans with decision trees
Created December 25, 2016 23:20
(Python) Use SFrames to do some feature engineering. Train a decision-tree on the LendingClub dataset. Visualize the tree. Predict whether a loan will default along with prediction probabilities (on a validation set). Train a complex tree model and compare it to simple tree model.
#Identifying safe loans with decision trees
import math
import pandas as pd
import numpy as np
#the dataset consists of data from LendingClub, used to predict whether a loan will be paid off
#in full or will be charged off and possibly go into default
import sframe
loans = sframe.SFrame('lending-club-data.gl/')
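The gist trains trees with GraphLab Create; a rough scikit-learn equivalent on a hypothetical toy stand-in for the LendingClub features (the column meanings and values here are invented for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# toy features: [grade (encoded), debt_to_income]; +1 = safe loan, -1 = risky loan
X = np.array([[1, 10.], [2, 35.], [1, 12.], [3, 40.], [2, 8.], [3, 38.]])
y = np.array([+1, -1, +1, -1, +1, -1])

small_model = DecisionTreeClassifier(max_depth=2).fit(X, y)   # simple tree
big_model = DecisionTreeClassifier(max_depth=10).fit(X, y)    # more complex tree

pred = small_model.predict(X)               # class predictions
proba = small_model.predict_proba(X)        # columns ordered by small_model.classes_
```

Comparing `small_model` and `big_model` on a held-out validation set is how the gist contrasts simple and complex trees; on training data alone the complex tree will always look at least as good.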
@shengch02
shengch02 / Implementing binary decision trees from scratch
Created December 27, 2016 04:19
(Python) Use SFrames to do some feature engineering. Transform categorical variables into binary variables. Write a function to compute the number of misclassified examples in an intermediate node. Write a function to find the best feature to split on. Build a binary decision tree from scratch. Make predictions using the decision tree. Evaluate …
#build decision trees where the data contain only binary features
import math
import pandas as pd
import numpy as np
#the dataset consists of data from LendingClub, used to predict whether a loan will be paid off
#in full or will be charged off and possibly go into default
import sframe
loans = sframe.SFrame('lending-club-data.gl/')
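Two of the building blocks the description names, node error and best-split selection, can be sketched in plain Python 3; the binary feature names and rows below are hypothetical:

```python
def intermediate_node_num_mistakes(labels):
    # a node predicts the majority class, so its mistakes are the minority-class count
    if len(labels) == 0:
        return 0
    positives = sum(1 for y in labels if y == +1)
    negatives = len(labels) - positives
    return min(positives, negatives)

def best_splitting_feature(data, features, target):
    # pick the binary feature whose split yields the lowest classification error
    best_feature, best_error = None, float('inf')
    for feature in features:
        left = [row for row in data if row[feature] == 0]
        right = [row for row in data if row[feature] == 1]
        mistakes = (intermediate_node_num_mistakes([r[target] for r in left]) +
                    intermediate_node_num_mistakes([r[target] for r in right]))
        error = mistakes / float(len(data))
        if error < best_error:
            best_feature, best_error = feature, error
    return best_feature

data = [{'grade_A': 1, 'short_term': 1, 'safe_loans': +1},
        {'grade_A': 0, 'short_term': 1, 'safe_loans': -1},
        {'grade_A': 1, 'short_term': 0, 'safe_loans': +1},
        {'grade_A': 0, 'short_term': 0, 'safe_loans': -1}]
best = best_splitting_feature(data, ['grade_A', 'short_term'], 'safe_loans')
```

Here `grade_A` splits the toy rows perfectly, so it is chosen; recursing on each side of the best split is what builds the full binary tree.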
@shengch02
shengch02 / explore various techniques for preventing overfitting in decision trees
Created December 28, 2016 22:00
(Python) Implement binary decision trees with different early stopping methods. Compare models with different stopping parameters.
#explore various techniques for preventing overfitting in decision trees
import math
import pandas as pd
import numpy as np
#the dataset consists of data from LendingClub, used to predict whether a loan will be paid off
#in full or will be charged off and possibly go into default
import sframe
loans = sframe.SFrame('lending-club-data.gl/')
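The three early-stopping conditions usually paired with this exercise (maximum depth, minimum node size, minimum error reduction) can be bundled into one guard; the signature below is a sketch, not the gist's exact API:

```python
def should_stop(data, depth, max_depth, min_node_size,
                error_before, error_after, min_error_reduction):
    if depth >= max_depth:                                   # condition 1: maximum depth reached
        return True
    if len(data) <= min_node_size:                           # condition 2: node has too few points
        return True
    if error_before - error_after <= min_error_reduction:    # condition 3: split barely helps
        return True
    return False

# stop: the candidate split only reduces the error by 0.01, below the 0.05 threshold
print(should_stop([0] * 100, 3, 10, 10, 0.30, 0.29, 0.05))
```

Tightening any of the three parameters produces a smaller tree, which is the overfitting-prevention trade-off the gist explores.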
@shengch02
shengch02 / Gradient boosted trees
Created January 2, 2017 01:29
(Python) Train a boosted ensemble of decision-trees (gradient boosted trees) on the lending club dataset. Predict whether a loan will default along with prediction probabilities (on a validation set). Find the most positive and negative loans using the learned model. Explore how the number of trees influences classification performance.
#use the pre-implemented gradient boosted trees
import pandas as pd
import numpy as np
#the dataset consists of data from LendingClub, used to predict whether a loan will be paid off
#in full or will be charged off and possibly go into default
import sframe
loans = sframe.SFrame('lending-club-data.gl/')
#target column 'safe_loans': +1 indicates a safe loan, -1 a risky loan
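The gist uses GraphLab's pre-implemented boosted trees; a scikit-learn sketch of the same experiment on hypothetical toy loans, varying the number of trees:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# toy stand-in for LendingClub features; +1 = safe loan, -1 = risky loan
X = np.array([[1, 10.], [2, 35.], [1, 12.], [3, 40.], [2, 8.], [3, 38.]])
y = np.array([+1, -1, +1, -1, +1, -1])

for n_trees in (5, 50):
    model = GradientBoostingClassifier(n_estimators=n_trees, max_depth=2).fit(X, y)
    # P(safe loan); column order follows model.classes_
    proba = model.predict_proba(X)[:, list(model.classes_).index(1)]
```

Sweeping `n_estimators` on a validation set is how the gist studies the effect of ensemble size; training accuracy alone will keep improving with more trees.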
@shengch02
shengch02 / Implementing gradient boosted trees from scratch
Created January 2, 2017 17:08
(Python) Train a boosted ensemble of decision-trees (gradient boosted trees) on the lending club dataset. Predict whether a loan will default along with prediction probabilities. Evaluate the trained model and compare it with a baseline.
#Boosting a decision stump from scratch
import pandas as pd
import numpy as np
#the dataset consists of data from LendingClub, used to predict whether a loan will be paid off
#in full or will be charged off and possibly go into default
import sframe
loans = sframe.SFrame('lending-club-data.gl/')
#target column 'safe_loans': +1 indicates a safe loan, -1 a risky loan
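Boosting decision stumps from scratch comes down to two formulas: the stump coefficient from its weighted error, and the multiplicative data-weight update. A Python 3 sketch with hypothetical labels and predictions:

```python
import math

def stump_alpha(weighted_error):
    # coefficient of a weak learner: 0.5 * ln((1 - err) / err)
    return 0.5 * math.log((1.0 - weighted_error) / weighted_error)

def adaboost_weight_update(alpha, weights, labels, predictions):
    # down-weight correctly classified points, up-weight mistakes, then normalize
    new_weights = [w * math.exp(-alpha if y == p else alpha)
                   for w, y, p in zip(weights, labels, predictions)]
    total = sum(new_weights)
    return [w / total for w in new_weights]

weights = [0.25, 0.25, 0.25, 0.25]
labels = [+1, +1, -1, -1]
predictions = [+1, +1, -1, +1]          # the stump misclassifies the last point
alpha = stump_alpha(0.25)               # weighted error of this stump is 0.25
new_weights = adaboost_weight_update(alpha, weights, labels, predictions)
```

After the update the one misclassified point carries half of the total weight, which forces the next stump to focus on it.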
@shengch02
shengch02 / Explore precision and recall
Created January 3, 2017 17:59
(Python) Explore various evaluation metrics: accuracy, confusion matrix, precision, recall. Explore how various metrics can be combined to produce a cost of making an error. Explore precision and recall curves.
#explore precision and recall
import pandas as pd
import numpy as np
#the dataset consists of baby product reviews on Amazon.com
import sframe
products = sframe.SFrame('amazon_baby.gl/')
#clean the original data: remove punctuation, fill in N/A, remove neutral sentiment,
# perform a train/test split, produce word count matrix
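The metrics this gist explores reduce to counts from the confusion matrix, and the precision-recall curve comes from sweeping a probability threshold. A small Python 3 sketch with hypothetical labels and predicted probabilities:

```python
def precision_recall(true_labels, predictions):
    tp = sum(1 for t, p in zip(true_labels, predictions) if t == +1 and p == +1)
    fp = sum(1 for t, p in zip(true_labels, predictions) if t == -1 and p == +1)
    fn = sum(1 for t, p in zip(true_labels, predictions) if t == +1 and p == -1)
    precision = tp / float(tp + fp) if tp + fp else 0.0   # of predicted positives, how many are right
    recall = tp / float(tp + fn) if tp + fn else 0.0      # of true positives, how many are found
    return precision, recall

def apply_threshold(probabilities, threshold):
    # predict +1 only when the model is at least `threshold` confident
    return [+1 if p >= threshold else -1 for p in probabilities]

true_labels = [+1, +1, +1, -1, -1]
probabilities = [0.9, 0.6, 0.4, 0.7, 0.2]
p_low, r_low = precision_recall(true_labels, apply_threshold(probabilities, 0.5))
p_high, r_high = precision_recall(true_labels, apply_threshold(probabilities, 0.8))
```

Raising the threshold trades recall for precision; plotting the pair over many thresholds yields the precision-recall curve.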
@shengch02
shengch02 / Stochastic gradient ascent
Created January 5, 2017 02:06
(Python) Implement stochastic gradient ascent with L2 penalty. Compare convergence of stochastic gradient ascent with that of batch gradient ascent.
#Training logistic regression via stochastic gradient ascent
import math
import pandas as pd
import numpy as np
#the dataset consists of a subset of baby product reviews on Amazon.com
import sframe
products = sframe.SFrame('amazon_baby_subset.gl/')
products = sframe.SFrame.to_dataframe(products)
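Stochastic gradient ascent replaces the full-data gradient with the gradient of one shuffled mini-batch per update. A Python 3 sketch on hypothetical toy features (batch size 1, i.e. fully stochastic; the reshuffling scheme is one reasonable choice, not necessarily the gist's):

```python
import numpy as np

def sigmoid(scores):
    return 1.0 / (1.0 + np.exp(-scores))

def stochastic_gradient_ascent(features, sentiment, step_size=0.5,
                               batch_size=1, max_iter=200, seed=0):
    coefficients = np.zeros(features.shape[1])
    indicator = (sentiment == +1).astype(float)
    n = len(sentiment)
    rng = np.random.default_rng(seed)
    order, i = rng.permutation(n), 0
    for _ in range(max_iter):
        batch = order[i:i + batch_size]
        errors = indicator[batch] - sigmoid(features[batch].dot(coefficients))
        coefficients += step_size * features[batch].T.dot(errors)  # gradient of one mini-batch
        i += batch_size
        if i + batch_size > n:       # reshuffle once a pass over the data completes
            order, i = rng.permutation(n), 0
    return coefficients

X = np.array([[1., 2.], [1., -1.], [1., 3.], [1., -2.]])
y = np.array([+1, -1, +1, -1])
w = stochastic_gradient_ascent(X, y)
predictions = np.where(sigmoid(X.dot(w)) > 0.5, +1, -1)
```

Each update is cheap but noisy, so the log likelihood fluctuates from step to step; batch gradient ascent climbs smoothly but touches every data point per update, which is the convergence comparison this gist makes.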