@nassimhaddad
Created January 26, 2013 09:55
String matching, distance between two strings. Works particularly well to detect retweets or tweet variations.
### string matching
### metric to find the similarity between two strings
### some context in:
### http://en.wikipedia.org/wiki/String_metric
### testing levenshtein metric
library(RecordLinkage)
str1 <- "scientific american"
str2 <- "american scientific"
levenshteinDist(str1, str2) # raw distance: number of single-character insertions, deletions or substitutions needed to turn str1 into str2
levenshteinSim(str1, str2) # similarity in [0, 1] derived from the distance: 1 - distance / length of the longer string
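### cross-check without RecordLinkage (a sketch, not part of the original gist)
### base R's adist() (package utils) also computes the Levenshtein distance, and the
### RecordLinkage documentation describes levenshteinSim() as
### 1 - distance / length of the longer string, so the two should agree.
### lev_sim_base and strip_rt are hypothetical helper names introduced for illustration.
lev_sim_base <- function(a, b) {
  d <- drop(adist(a, b))            # edit distance as a plain number (adist returns a matrix)
  1 - d / max(nchar(a), nchar(b))   # normalise by the longer string's length
}
drop(adist("kitten", "sitting"))    # 3 (k->s, e->i, append g)
lev_sim_base("kitten", "sitting")   # 1 - 3/7
lev_sim_base(str1, str2)            # expected to match levenshteinSim(str1, str2)
### for retweets, one possible normalisation (an assumption, not something the gist does)
### is to strip a leading "RT @user:" prefix before comparing; "@sciam" is a made-up handle
strip_rt <- function(x) sub("^RT @\\w+:?\\s*", "", x)
lev_sim_base(strip_rt("RT @sciam: scientific american"), str1) # 1, identical after stripping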