@hopped
Last active March 4, 2016 14:18
Identifying risky bank loans using a decision tree classifier (Pandas, Scikit Learn)
# Download data set via:
# http://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn import preprocessing
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn_pandas import DataFrameMapper
cols = ["chk_acct", "duration", "history", "purpose", "amount", "sav_acct", "employment", "install_rate", "pstatus", "other_debtor", "time_resid", "property", "age", "other_install", "housing", "other_credits", "job", "num_depend", "telephone", "foreign", "default"]
# german.data has no header row; header=None keeps the first record as data
df = pd.read_csv("/Users/hoppe/Downloads/german.data", sep=" ", header=None, names=cols)
# separate the target column (1 = good credit, 2 = bad credit) from the features
classes = df.pop('default')
# convert categorical features using LabelBinarizer and DataFrameMapper
cols_to_modify = ['chk_acct', 'history', 'purpose', 'sav_acct', 'employment', 'pstatus', 'other_debtor', 'property', 'other_install', 'housing', 'job', 'telephone', 'foreign']
# keep the numeric columns in their original order (a set difference has no stable order)
cols_to_keep = [col for col in df.columns if col not in cols_to_modify]
modified_cols = [(col, preprocessing.LabelBinarizer()) for col in cols_to_modify]
mapper = DataFrameMapper(modified_cols)
data = pd.DataFrame(np.column_stack([mapper.fit_transform(df), df[cols_to_keep].values]))
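# shuffle features and labels with the same permutation so rows stay aligned
perm = np.random.permutation(len(data))
data = data.iloc[perm]
classes = classes.iloc[perm]
# build training and test set (first 900 shuffled rows train, the rest test)
X_train = data.iloc[:900]
X_test = data.iloc[900:]
y_train = classes.iloc[:900]
y_test = classes.iloc[900:]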
# classify
clf = tree.DecisionTreeClassifier(criterion='entropy', min_samples_leaf=5)
clf = clf.fit(X_train, y_train)
# predict
pred = clf.predict(X_test)
# metrics
print(confusion_matrix(y_test, pred))
print(accuracy_score(y_test, pred))
@economy

economy commented Jan 27, 2015

This is great, it really helped me get off the ground training a simple decision tree. My only issue: how can I interpret the tree when the LabelBinarizer has wiped out my column meanings?

Is there a way to identify the names of the features created with this method?
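
One possible way to answer the question above, offered as a sketch rather than the gist author's method: a fitted LabelBinarizer stores the category values it saw in its classes_ attribute, in the same order as its output columns, so names can be rebuilt from them. The helper name binarize_with_names and the toy data below are hypothetical and only imitate the German credit codes. Printing the tree with export_text assumes scikit-learn 0.21 or newer.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelBinarizer
from sklearn.tree import DecisionTreeClassifier, export_text


def binarize_with_names(frame, categorical_cols):
    """Binarize each categorical column; return (matrix, column names)."""
    blocks, names = [], []
    for col in categorical_cols:
        lb = LabelBinarizer()
        block = lb.fit_transform(frame[col])
        if len(lb.classes_) == 2:
            # Two-class columns collapse to one 0/1 column that is 1 for classes_[1]
            names.append("{}={}".format(col, lb.classes_[1]))
        else:
            # Otherwise there is one output column per entry of classes_
            names.extend("{}={}".format(col, c) for c in lb.classes_)
        blocks.append(block)
    return np.column_stack(blocks), names


# Hypothetical toy data shaped like a few German credit columns
toy = pd.DataFrame({
    "housing": ["A151", "A152", "A153", "A152"],
    "telephone": ["A191", "A192", "A191", "A191"],
    "age": [22, 45, 33, 51],
})
labels = [1, 2, 1, 2]

X_cat, cat_names = binarize_with_names(toy, ["housing", "telephone"])
# Append the numeric column and its name in the same order
X = np.column_stack([X_cat, toy[["age"]].values])
feature_names = cat_names + ["age"]

clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, labels)
print(feature_names)
print(export_text(clf, feature_names=feature_names))
```

The same idea should carry over to the gist's script as long as the names are collected in the order the columns are stacked: binarized columns first, then the untouched numeric ones.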
