@glamp
Last active August 29, 2015 14:04
# Building the classifier
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Define every feature the model will use in one list. Here the features are
# just the fuzzy-match scores generated with fuzzywuzzy.
features = [
'name_ratio',
'name_token_sort_ratio',
'name_partial_ratio',
'street_ratio',
'street_token_sort_ratio',
'street_partial_ratio'
]
# Fit a random forest to the data. The target is `match`, a 0/1 column of df
# indicating whether a given pair of names/addresses is actually the same record.
# df must already contain this column, otherwise df['match'] raises a KeyError.
clf = RandomForestClassifier()
clf.fit(df[features], df['match'])
# Quick look at the results. Note that this is not cross-validated: the model is
# scored on the same rows it was trained on, so the table will look optimistic.
pd.crosstab(clf.predict(df[features]), df['match'])
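
Because the crosstab above scores the model on the rows it was trained on, a cross-validated version gives a more honest picture. What follows is only a rough sketch, not part of the original gist: it assumes scikit-learn's `cross_val_predict` from `sklearn.model_selection`, uses a shortened feature list, and builds a synthetic stand-in for `df` whose scores and labels are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

features = ["name_ratio", "street_ratio"]  # shortened list for this sketch

# Synthetic stand-in for df (assumption): matches score high, non-matches low.
rng = np.random.RandomState(0)
match = np.array([0, 1] * 50)
df = pd.DataFrame({
    "name_ratio": np.where(match == 1, rng.randint(80, 101, 100), rng.randint(0, 60, 100)),
    "street_ratio": np.where(match == 1, rng.randint(80, 101, 100), rng.randint(0, 60, 100)),
    "match": match,
})

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# Each row is predicted by a model that did not see it during fitting.
predicted = cross_val_predict(clf, df[features], df["match"], cv=5)
confusion = pd.crosstab(predicted, df["match"], rownames=["predicted"], colnames=["actual"])
print(confusion)
```

On real data the off-diagonal cells of this table are the false matches and missed matches to inspect.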
@rdempsey

When I run this part of the code I get the following error:
KeyError: 'match'

Do I need to define match before using it?
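
A `KeyError: 'match'` at this step generally means `df` has no column named `match`. The snippet appears to assume the labels were added to `df` in an earlier step, so the column does need to exist before `clf.fit` is called. A minimal sketch of the check, using a made-up two-row frame (the values and the hand-labelled `match` column are assumptions for illustration, not data from the gist):

```python
import pandas as pd

# Hypothetical stand-in for the gist's df: fuzzy scores only, no labels yet.
df = pd.DataFrame({
    "name_ratio": [95, 40],
    "street_ratio": [90, 35],
})

# Without the column, df["match"] would raise KeyError: 'match'.
has_match_before = "match" in df.columns

# Assumed labelling step: 1 if the pair is the same record, 0 otherwise.
df["match"] = [1, 0]

# With the column in place, df["match"] can be used as the target.
has_match_after = "match" in df.columns
print(has_match_before, has_match_after)
```

Running `print(df.columns)` on the real frame shows whether the label column is missing or simply named differently.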
