Given a string and a set of punctuation characters, tokenize the string on punctuation and whitespace. Done using both split and scan.
# Tokenize by splitting the query on runs of punctuation or whitespace.
def tokenize_query_split(query, punctuation)
  # Regexp.escape guards against characters such as ']' or '\' breaking the character class.
  punc_regex = /[#{Regexp.escape(punctuation)}\s]+/
  tokens = query.split(punc_regex)
  tokens.each do |token|
    p token
  end
end

# Tokenize by scanning for runs of characters that are neither punctuation nor whitespace.
def tokenize_query_scan(query, punctuation)
  token_regex = /[^#{Regexp.escape(punctuation)}\s]+/
  tokens = query.scan(token_regex)
  tokens.each do |token|
    p token
  end
end
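
A quick usage sketch (the sample query and punctuation set below are illustrative, not part of the original gist). For typical input both approaches print the same tokens; the main difference is that split can yield a leading empty string when the query begins with a separator, while scan never does.

# Illustrative example, not from the original gist.
query       = "Hello, world! How's it going?"
punctuation = ",.!?'"

tokenize_query_split(query, punctuation)
# prints "Hello", "world", "How", "s", "it", "going" (one per line)

tokenize_query_scan(query, punctuation)
# prints the same tokens; scan skips the leading empty string
# that split can produce when the query starts with a separator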