Created
February 22, 2012 15:21
rjb - Ruby Java Bridge - Loading multiple .jar files, for example, Boilerpipe and Stanford parts of speech tagger
## rjb - Ruby Java Bridge - Loading multiple .jar files, for example Boilerpipe and the Stanford part-of-speech tagger
## Works in Ruby 1.9.3 and Nginx with Unicorn. Doesn't work in Phusion Passenger. In Unicorn, preload_app has to be false (the default).
## The 'java' folder is in the root Rails folder.
## http://rjb.rubyforge.org/
## For the Stanford PoS tagger, the ../java/ folder contains: stanford-postagger.jar, left3words-wsj-0-18.tagger
## http://nlp.stanford.edu/software/tagger.shtml
## For Boilerpipe, the ../java/ folder contains: boilerpipe-1.2.0.jar, nekohtml-1.9.13.jar, xerces-2.9.1.jar
## http://code.google.com/p/boilerpipe/
## http://nekohtml.sourceforge.net/
## http://xerces.apache.org/xerces2-j/
## -- Initialize --
## I have this as a ../config/initializers/java.rb file in Rails (not sure if that's the best way).
## The second argument to Rjb::load is an array of JVM options; '-Xmx1024m' sets the maximum Java heap size to 1024 MB.
## All jars go into a single classpath string; entries are separated by ':' on Unix-like systems.
Rjb::load("java/stanford-postagger.jar:java/xerces-2.9.1.jar:java/nekohtml-1.9.13.jar:java/boilerpipe-1.2.0.jar", ['-Xmx1024m'])
## Import the Java classes used below as Ruby constants.
MaxentTagger = Rjb::import('edu.stanford.nlp.tagger.maxent.MaxentTagger')
StringReader = Rjb::import('java.io.StringReader')
## Load the tagger model once, at initialization.
PosTagger = MaxentTagger.new('java/left3words-wsj-0-18.tagger')
ArticleSentencesExtractor = Rjb::import('de.l3s.boilerpipe.extractors.ArticleSentencesExtractor')
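## A minimal sketch of building the classpath string from a list of jar names instead of typing it by hand.
## build_classpath is a hypothetical helper, not part of rjb. It joins entries with File::PATH_SEPARATOR,
## so the same code should also produce ';' on Windows. The Rjb::load call is left commented out so the
## sketch runs without a JVM.

```ruby
# Hypothetical helper (not part of rjb): prefix each jar with its folder and
# join the results with the platform's classpath separator.
def build_classpath(jar_dir, jar_names)
  jar_names.map { |name| File.join(jar_dir, name) }.join(File::PATH_SEPARATOR)
end

# The jar names are the ones listed in the comments above.
jar_names = %w[
  stanford-postagger.jar
  xerces-2.9.1.jar
  nekohtml-1.9.13.jar
  boilerpipe-1.2.0.jar
]

classpath = build_classpath('java', jar_names)
# Rjb::load(classpath, ['-Xmx1024m'])
```

## Adding another library then means adding one more name to jar_names.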
## -- Example Usage --
## Example using Boilerpipe in Ruby (raw_html is a Ruby string containing the page's HTML)
extracted_text = ArticleSentencesExtractor.INSTANCE.getText(raw_html)
## Example using the Stanford part-of-speech tagger:
## Note: reader, tokenizedText, taggedSentences and sentences are rjb Java objects; text_to_be_parsed is a Ruby string
reader = StringReader.new(text_to_be_parsed)
tokenizedText = MaxentTagger.tokenizeText(reader)
taggedSentences = PosTagger.process(tokenizedText)
sentences = taggedSentences.toArray()
sentences.each do |sentence|
  words = sentence.toArray()
  words.each do |word|
    p word.word().to_s
    p word.tag().to_s
  end
end
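## A minimal sketch of collecting the word/tag pairs into plain Ruby data instead of printing them.
## collect_tagged_words is a hypothetical helper; it assumes only what the loop above already uses:
## each sentence responds to toArray() and each word responds to word() and tag().

```ruby
# Hypothetical helper: turn tagged sentences into nested Ruby arrays of hashes,
# so the rest of the app does not have to touch rjb Java objects.
def collect_tagged_words(sentences)
  sentences.map do |sentence|
    sentence.toArray().map do |word|
      { word: word.word().to_s, tag: word.tag().to_s }
    end
  end
end
```

## With the objects from the example above, this would be called as collect_tagged_words(sentences).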
## -- Rjb setup required; check the rjb docs. My settings are below (these are shell lines, not Ruby): --
## OS X .profile file; make sure the java bin directory is in $PATH
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Home
## Ubuntu .bashrc file
## after installing OpenJDK 6
export PATH=/usr/lib/jvm/java-6-openjdk/bin:$PATH
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
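## Since the setup above depends on JAVA_HOME, a small check before loading rjb can give a clearer error
## than a failed load. This is a sketch only: java_home_problem is a hypothetical helper, and it checks
## nothing beyond the variable being set and pointing at an existing directory.

```ruby
# Hypothetical helper: return a description of the problem with JAVA_HOME,
# or nil when it is set and points at an existing directory.
def java_home_problem(env = ENV)
  home = env['JAVA_HOME']
  return 'JAVA_HOME is not set' if home.nil? || home.empty?
  return "JAVA_HOME does not exist: #{home}" unless File.directory?(home)
  nil
end

problem = java_home_problem
warn problem if problem
```

## The env argument exists only so the check can be tried with a plain Hash instead of the real environment.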
I really recommend not trying this in Ruby, though. Try writing a service in Scala instead; it works great on the Play framework.
http://scala.playframework.org/