Last active
August 29, 2015 14:04
# Building the classifier
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# I find it easiest to define all the features you'll be using in a list. In
# our case it's just the fuzzy scores that we generated using fuzzywuzzy.
features = [
    'name_ratio',
    'name_token_sort_ratio',
    'name_partial_ratio',
    'street_ratio',
    'street_token_sort_ratio',
    'street_partial_ratio'
]

# We're going to fit a RandomForest model to the data. Our target variable is `match`,
# a 0/1 column indicating whether a given set of names/addresses actually refers to
# the same record. `df` is the DataFrame from the earlier steps; it must already
# contain the feature columns above and the `match` column.
clf = RandomForestClassifier()
clf.fit(df[features], df['match'])

# Just taking a look at the results. Note that this isn't cross-validated; the model is
# scored on the same rows it was trained on, so this is only a quick, optimistic look.
pd.crosstab(clf.predict(df[features]), df['match'], rownames=['predicted'], colnames=['actual'])
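
The snippet above evaluates the model on its own training rows. As a rough sketch of what a cross-validated version of the same table could look like, the example below uses scikit-learn's `cross_val_predict`, so every row is predicted by a model that was not fitted on it. The data here is a hypothetical stand-in (random scores and a synthetic `match` rule), not the original dataset, and `n_estimators`, `random_state`, and `cv=5` are illustrative choices rather than settings from the gist.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

features = [
    'name_ratio',
    'name_token_sort_ratio',
    'name_partial_ratio',
    'street_ratio',
    'street_token_sort_ratio',
    'street_partial_ratio'
]

# Hypothetical stand-in data: random fuzzy scores between 0 and 100.
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.randint(0, 101, size=(200, len(features))), columns=features)

# Synthetic label so the example has something to learn (an assumption,
# not how the original `match` column was produced).
df['match'] = ((df['name_ratio'] + df['street_ratio']) > 100).astype(int)

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# Out-of-fold predictions: each row is predicted by a model trained on the other folds.
cv_pred = cross_val_predict(clf, df[features], df['match'], cv=5)

cv_table = pd.crosstab(cv_pred, df['match'], rownames=['predicted'], colnames=['actual'])
print(cv_table)
```

The off-diagonal cells of this table are a fairer estimate of how often the model would mislabel unseen pairs than the in-sample table.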
When I run this part of the code I get the following error:
KeyError: 'match'
Do I need to define match before using it?
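
A `KeyError: 'match'` on `df['match']` means the DataFrame has no column with that name. The gist describes `match` as a 0/1 column marking whether a pair is the same record, so it has to exist in `df` before `clf.fit` is called. A minimal sketch of checking for the column and adding it is below; the rows and labels are hypothetical, and the assumption is that the labels come from manually reviewing each candidate pair.

```python
import pandas as pd

# Hypothetical candidate pairs with two of the fuzzy-score columns.
df = pd.DataFrame({
    'name_ratio': [95, 40, 88, 30],
    'street_ratio': [90, 35, 92, 20],
})

# If this is False, df['match'] raises KeyError: 'match'.
has_label = 'match' in df.columns
print(has_label)

# Assumption: labels were assigned by hand, 1 = same record, 0 = different records.
df['match'] = [1, 0, 1, 0]

print(df['match'].tolist())
```

Once the column is present, the `clf.fit(df[features], df['match'])` call from the gist has a target to train on.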