Dan Ofer ddofer

Useful Pandas Snippets

A personal diary of DataFrame munging over the years.

Data Types and Conversion

Convert Series datatype to numeric (will error if column has non-numeric values)
(h/t @makmanalp)

Beating the Forest Cover Type Prediction benchmark

Day 4 of the Beat 5 Kaggle Benchmarks in 5 Days challenge

For the Forest Cover Type Prediction competition on Kaggle, the goal is to predict the predominant type of trees in a given section of forest. The score is based on average classification accuracy for the 7 different tree cover classes.

To beat the all fir/spruce benchmark I obviously tried a random forest. Using the default settings of scikit-learn's RandomForestClassifier, I was able to beat the benchmark with an accuracy score of 0.72718 on the competition leaderboard. By using 100 estimators (versus the default of 10), I was able to raise that accuracy score up to 0.75455.

Random Forest Cover Types

Using pandas I loaded the train and test data sets into Python.

	from sklearn.datasets import fetch_20newsgroups, load_digits
	from sklearn.feature_extraction.text import TfidfVectorizer
	from sklearn.cross_validation import train_test_split
	import numpy as np
	from sklearn.naive_bayes import MultinomialNB, BernoulliNB
	from sklearn.linear_model import LogisticRegression, SGDClassifier
	from sklearn import metrics

	newsgroups_train = fetch_20newsgroups(subset='train')
	vectorizer = TfidfVectorizer(encoding='latin-1', max_features=10000)

	"""
	A deep neural network with or w/o dropout in one file.
	"""

	import numpy
	import theano
	import sys
	import math
	from theano import tensor as T
	from theano import shared

	"""
	A deep neural network with or w/o dropout in one file.

	License: Do What The Fuck You Want to Public License http://www.wtfpl.net/
	"""

	import numpy, theano, sys, math
	from theano import tensor as T
	from theano import shared
	from theano.tensor.shared_randomstreams import RandomStreams

	import numpy as np
	#from scipy.special import chdtrc
	from scipy.sparse import spdiags

	from sklearn.base import BaseEstimator, TransformerMixin
	from sklearn.preprocessing import LabelBinarizer


	def _chisquare(f_obs, f_exp, reduce):
	"""Replacement for scipy.stats.chisquare with custom reduction.

	import seaborn as sns
	from scipy.optimize import curve_fit

	# Function for linear fit
	def func(x, a, b):
	return a + b * x

	# Seaborn conveniently provides the data for
	# Anscombe's quartet.
	df = sns.load_dataset("anscombe")

	# Alec Radford, Indico, Kyle Kastner
	# License: MIT
	"""
	Convolutional VAE in a single file.
	Bringing in code from IndicoDataSolutions and Alec Radford (NewMu)
	Additionally converted to use default conv2d interface instead of explicit cuDNN
	"""
	import theano
	import theano.tensor as T
	from theano.compat.python2x import OrderedDict

	"""
	preprocess-twitter.py

	python preprocess-twitter.py "Some random text with #hashtags, @mentions and http://t.co/kdjfkdjf (links). :)"

	Script for preprocessing tweets by Romain Paulus
	with small modifications by Jeffrey Pennington
	with translation to Python by Motoki Wu

	Translation of Ruby script to create features for GloVe vectors for Twitter data.

	from lasagne.layers import Layer

	class HighwayLayer(Layer):
	def __init__(self, incoming, layer_class, gate_nonlinearity=None,
	**kwargs):
	super(HighwayLayer, self).__init__(incoming)

	self.H_layer = layer_class(incoming, **kwargs)

	if gate_nonlinearity: