Last active
March 4, 2016 14:18
Identifying risky bank loans using a decision tree classifier (Pandas, Scikit Learn)
# Download the data set via:
# http://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn import preprocessing
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn_pandas import DataFrameMapper
cols = ["chk_acct", "duration", "history", "purpose", "amount", "sav_acct", "employment", "install_rate", "pstatus", "other_debtor", "time_resid", "property", "age", "other_install", "housing", "other_credits", "job", "num_depend", "telephone", "foreign", "default"]
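# german.data has no header row, so use header=None; header=0 would discard the first record
df = pd.read_csv("/Users/hoppe/Downloads/german.data", sep=" ", header=None, names=cols)
# separate the target column (1 = good credit, 2 = bad credit) from the features
classes = df["default"]
df = df.drop(columns="default")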
# convert categorical features using LabelBinarizer and DataFrameMapper
cols_to_modify = ['chk_acct', 'history', 'purpose', 'sav_acct', 'employment', 'pstatus', 'other_debtor', 'property', 'other_install', 'housing', 'job', 'telephone', 'foreign']
# keep the numeric columns in their original order (a set difference has no stable order)
cols_to_keep = [col for col in df.columns if col not in cols_to_modify]
modified_cols = [(col, preprocessing.LabelBinarizer()) for col in cols_to_modify]
mapper = DataFrameMapper(modified_cols)
# binarized categorical columns come first, followed by the untouched numeric columns
data = pd.DataFrame(np.column_stack([mapper.fit_transform(df), df[cols_to_keep].values]))
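# shuffle features and labels with the same permutation
# (reindex returns a new frame, so the original call had no effect)
perm = np.random.permutation(len(data))
data = data.iloc[perm].reset_index(drop=True)
classes = classes.iloc[perm].reset_index(drop=True)
# build training and test set
# (.ix has been removed from pandas and its label slices included row 900 in both sets)
X_train = data.iloc[:900]
X_test = data.iloc[900:]
y_train = classes.iloc[:900]
y_test = classes.iloc[900:]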
# classify
clf = tree.DecisionTreeClassifier(criterion='entropy', min_samples_leaf=5)
clf = clf.fit(X_train, y_train)
# predict
pred = clf.predict(X_test)
# metrics
# print the results so they are visible when this runs as a script
print(confusion_matrix(y_test, pred))
print(accuracy_score(y_test, pred))
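The shuffle-then-slice step above can also be handed to scikit-learn's `train_test_split`, which shuffles features and labels together. This is a swapped-in helper, not what the gist itself uses. The sketch below assumes made-up random data in place of the credit data, and the `stratify` and `random_state` arguments are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded credit data: 1000 rows, 5 features,
# labels 1 (good) and 2 (bad). These values are made up for illustration.
rng = np.random.RandomState(0)
X = rng.rand(1000, 5)
y = rng.choice([1, 2], size=1000, p=[0.7, 0.3])

# test_size=100 holds out exactly 100 rows; stratify=y keeps the class
# ratio roughly equal in both parts; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=100, stratify=y, random_state=0)

print(X_train.shape, X_test.shape)
```

Because the rows are shuffled before splitting, the training and test sets do not overlap, and the labels stay aligned with their feature rows.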
This is great, really helped me get off the ground training a simple decision tree. My only issue is, how can I interpret the tree when the LabelBinarizer has wiped my column meanings?
Is there a way to identify the names of the features created with this method?
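One possible way to keep the column meanings is to encode with `pandas.get_dummies` instead of `LabelBinarizer`, since it names each new column after the original column and category. This is a minimal sketch under assumptions, not the gist's method: the small frame and its values are made up, and only `housing` and `duration` are borrowed from the gist's column names. If you stay with `LabelBinarizer`, each fitted binarizer also exposes its categories through its `classes_` attribute.

```python
import pandas as pd
from sklearn import tree

# Tiny made-up frame standing in for the credit data (illustrative values only).
df = pd.DataFrame({
    "housing": ["own", "rent", "free", "own", "rent", "own"],
    "duration": [6, 48, 12, 24, 36, 9],
    "default": [1, 2, 1, 1, 2, 1],
})
y = df.pop("default")

# get_dummies one-hot encodes the categorical column and names each new
# column "<original>_<category>", so the column meanings survive.
X = pd.get_dummies(df, columns=["housing"], dtype=int)
feature_names = list(X.columns)

clf = tree.DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# Pair each importance with its readable name, then print the tree as text.
importances = dict(zip(feature_names, clf.feature_importances_))
print(importances)
print(tree.export_text(clf, feature_names=feature_names))
```

The printed tree then refers to splits by name (for example `housing_rent` or `duration`) rather than by column position.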