@veer66
Created December 19, 2023 07:49
def tokenize_en(text)
  # modified https://github.com/luismsgomes/mosestokenizer/blob/master/src/mosestokenizer/tokenizer-v1.1.perl
  text = " #{text.chomp} "

  # collapse whitespace and remove ASCII control characters
  text.gsub!(/\s+/, ' ')
  text.gsub!(/[\000-\037]/, '')

  # separate out all other special characters
  text.gsub!(/([^\p{Alnum}\s\.\'\`\,\-])/, ' \1 ')

  # protect runs of dots with placeholders so they stay together
  text.gsub!(/\.([\.]+)/, 'DOTMULTI\1')
  while text =~ /DOTMULTI\./
    text.gsub!(/DOTMULTI\.([^\.])/, 'DOTDOTMULTI \1')
    text.gsub!(/DOTMULTI\./, 'DOTDOTMULTI')
  end

  # separate out "," unless it has a digit on both sides (5,300)
  text.gsub!(/([^\d])[,]/, '\1 , ')
  text.gsub!(/[,]([^\d])/, ' , \1')

  # split contractions on the right: "don't" becomes "don" and "'t"
  text.gsub!(/([^\p{alpha}])[']([^\p{alpha}])/, '\1 \' \2')
  text.gsub!(/([^\p{alpha}\d])[']([\p{alpha}])/, '\1 \' \2')
  text.gsub!(/([\p{alpha}])[']([^\p{alpha}])/, '\1 \' \2')
  text.gsub!(/([\p{alpha}])[']([\p{alpha}])/, '\1 \'\2')
  # special case for "1990's"
  text.gsub!(/([\d])[']([s])/, '\1 \'\2')

  # restore the protected runs of dots
  while text =~ /DOTDOTMULTI/
    text.gsub!(/DOTDOTMULTI/, 'DOTMULTI.')
  end
  text.gsub!(/DOTMULTI/, '.')

  # escape special characters; the replacements carry no backslash because
  # "\&" in a Ruby replacement string expands to the whole match
  text.gsub!('&', '&amp;') # escape escape
  text.gsub!('|', '&#124;') # factor separator
  text.gsub!('<', '&lt;') # xml
  text.gsub!('>', '&gt;') # xml
  text.gsub!("'", '&apos;') # xml
  text.gsub!('"', '&quot;') # xml
  text.gsub!('[', '&#91;') # syntax non-terminal
  text.gsub!(']', '&#93;') # syntax non-terminal

  text.split(/ /).reject(&:empty?)
end
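
A note on the escaping step when porting Perl substitutions like these to Ruby: in a Ruby replacement string, `\&` expands to the whole match, so a Perl-style replacement such as `'\&lt;'` does not emit an ampersand. The sketch below shows the difference, along with one possible alternative that performs all the escapes in a single pass with a hash replacement. The `escape_moses` helper and its table are names made up for this illustration, not part of the gist or of the Moses scripts.

```ruby
# In a gsub replacement string, "\&" stands for the whole match, so the
# matched "<" is re-inserted in front of "lt;" and no "&" is produced.
perl_style = '<b>'.gsub(/</, '\&lt;')

# A replacement string without a backslash is taken literally.
literal = '<b>'.gsub('<', '&lt;')

# Hypothetical helper: a hash replacement looks each match up as a key and
# does no backslash processing. Doing everything in one pass also means an
# "&" produced by one escape is never escaped a second time.
def escape_moses(text)
  escapes = {
    '&' => '&amp;',  # escape character itself
    '|' => '&#124;', # factor separator
    '<' => '&lt;',
    '>' => '&gt;',
    "'" => '&apos;',
    '"' => '&quot;',
    '[' => '&#91;',  # syntax non-terminal
    ']' => '&#93;'   # syntax non-terminal
  }
  text.gsub(/[&|<>'"\[\]]/, escapes)
end

puts perl_style                          # <lt;b>
puts literal                             # &lt;b>
puts escape_moses(%q{AT&T's [a|b] <"x">})
```

The last line prints `AT&amp;T&apos;s &#91;a&#124;b&#93; &lt;&quot;x&quot;&gt;`.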