Last active
December 12, 2015 03:38
Prep docs for indexing: strip punctuation, downcase, remove stop words, and stem the remaining words (with the fast_stemmer gem, which provides Porter's stemming algorithm for Ruby).
#!/usr/bin/env ruby
require 'rubygems'
require 'fast_stemmer'
# Common English stop words to drop before stemming.
stop_words = "a,able,about,across,after,all,almost,also,am,among,an,and,any,are,as,at,be,because,been,but,by,can,cannot,could,dear,did,do,does,either,else,ever,every,for,from,get,got,had,has,have,he,her,hers,him,his,how,however,i,if,in,into,is,it,its,just,least,let,like,likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,on,only,or,other,our,own,rather,said,say,says,she,should,since,so,some,than,that,the,their,them,then,there,these,they,this,tis,to,too,twas,us,wants,was,we,were,what,when,where,which,while,who,whom,why,will,with,would,yet,you,your,use,used".split(',')
# The text to process is the first command-line argument; bail out if it is missing.
abort "usage: #{$0} TEXT" if ARGV.empty?
# Strip punctuation, downcase, split on whitespace, and drop stop words.
unstemmed_words = ARGV[0].gsub(/[[:punct:]]/, '').downcase.split.reject { |x| stop_words.include?(x) }
# Stem each remaining word (String#stem is added by fast_stemmer).
stemmed_words = unstemmed_words.map(&:stem)
puts stemmed_words.join(' ')
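The same pipeline can be sketched as a reusable method, which makes it easier to try from irb or a test. This is only a sketch under a few assumptions: `prep_tokens` and `STOP_SET` are hypothetical names, the stop list is shortened for illustration, and the stemmer is passed in as a callable (defaulting to a no-op) so the snippet runs without the gem installed. The stop words are held in a `Set` so each lookup does not scan the whole list.

```ruby
require 'set'

# Hypothetical, shortened stop list for illustration only;
# the script above uses a much longer one.
STOP_SET = Set.new(%w[a an and are as at be by for from in is it of on or the to was with])

# Strip punctuation, downcase, split on whitespace, drop stop words,
# then pass each remaining word through the given stemmer.
# The default stemmer returns the word unchanged (an assumption made so
# the sketch has no gem dependency).
def prep_tokens(text, stemmer: ->(w) { w })
  text.gsub(/[[:punct:]]/, '')
      .downcase
      .split
      .reject { |w| STOP_SET.include?(w) }
      .map { |w| stemmer.call(w) }
end

puts prep_tokens("The cat, and the dog!").join(' ')
# prints "cat dog"
```

With fast_stemmer loaded, passing `stemmer: :stem.to_proc` should apply the gem's `String#stem` to each word in place of the no-op default.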