@ashaw
Created July 26, 2012 15:37
def tokenize(text)
  word = /([a-z][a-z'\-]+[a-z]+)/
  stopwords = ["the","and","to","of","a","i","in","was","he","that","it","his","her","you","as","had","with","for","she","not","at","but","be","my","on","have","him","is","said","me","which","by","so","this","all","from","they","no","were","if","would","or","when","what","there","been","one","could","very","an"]
  text.tr('’', "'").                    # Normalize curly apostrophes to straight quotes
    downcase.                           # Lowercase so the regex and stopword list match
    scan(word).                         # Extract word tokens (each match is a one-element array)
    flatten.                            # Flatten the capture-group arrays into a flat list
    reject { |w| stopwords.include? w } # Drop common stopwords
end
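
A minimal usage sketch (the sample sentence below is illustrative and not part of the original gist):

tokenize("The cat’s whiskers twitched and the dog barked")
# => ["cat's", "whiskers", "twitched", "dog", "barked"]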