@buddye
Created February 22, 2012 15:21
rjb - Ruby Java Bridge - Loading multiple .jar files, for example, Boilerpipe and Stanford parts of speech tagger
## rjb - Ruby Java Bridge - Loading multiple .jar files, for example, Boilerpipe and Stanford parts of speech tagger
## Works in Ruby 1.9.3 behind Nginx with Unicorn. Doesn't work in Phusion Passenger. In Unicorn, preload_app has to be false (the default).
## 'java' folder is in the root rails folder.
## http://rjb.rubyforge.org/
## For Stanford PoS tagger: ../java/ folder contains: stanford-postagger.jar, left3words-wsj-0-18.tagger
## http://nlp.stanford.edu/software/tagger.shtml
## For Boilerpipe, ../java/ folder contains: boilerpipe-1.2.0.jar, nekohtml-1.9.13.jar, xerces-2.9.1.jar
## http://code.google.com/p/boilerpipe/
## http://nekohtml.sourceforge.net/
## http://xerces.apache.org/xerces2-j/
## -- Initialize --
## I have this as a ../config/initializers/java.rb file in rails (not sure if that's the best way).
## The '-Xmx1024m' param at the end is a JVM option: it sets the maximum heap size Java may use (here 1024 MB).
Rjb::load("java/stanford-postagger.jar:java/xerces-2.9.1.jar:java/nekohtml-1.9.13.jar:java/boilerpipe-1.2.0.jar", ['-Xmx1024m'])
MaxentTagger = Rjb::import('edu.stanford.nlp.tagger.maxent.MaxentTagger')
StringReader = Rjb::import('java.io.StringReader')
PosTagger = MaxentTagger.new('java/left3words-wsj-0-18.tagger')
ArticleSentencesExtractor = Rjb::import('de.l3s.boilerpipe.extractors.ArticleSentencesExtractor')
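## A minimal sketch, not part of the original setup: the classpath string passed to Rjb::load above uses ':' as the separator, which is the Unix convention.
## Assuming the same jar names, a hypothetical helper (build_classpath) can join them with File::PATH_SEPARATOR, so the string also suits platforms that use ';'.

```ruby
# Hypothetical helper: join jar file names under jar_dir into one classpath string.
# File::PATH_SEPARATOR is ':' on Unix-like systems and ';' on Windows.
def build_classpath(jar_dir, jar_names)
  jar_names.map { |name| File.join(jar_dir, name) }.join(File::PATH_SEPARATOR)
end

# Jar names taken from the initializer above.
jars = %w[stanford-postagger.jar xerces-2.9.1.jar nekohtml-1.9.13.jar boilerpipe-1.2.0.jar]
classpath = build_classpath('java', jars)

# The result would then be passed to the same call as in the initializer:
# Rjb::load(classpath, ['-Xmx1024m'])
puts classpath
```

## Listing the jars in an array also makes it easier to add or remove one without editing a long string.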
## -- Example Usage --
## Example using Boilerpipe in Ruby
extracted_text = ArticleSentencesExtractor.INSTANCE.getText(raw_html)
## Example using stanford parts of speech tagger:
## Note: reader, tokenizedText, tagedSentences, sentences are rjb Java objects, text_to_be_parsed is a Ruby string
reader = StringReader.new(text_to_be_parsed)
tokenizedText = MaxentTagger.tokenizeText(reader)
tagedSentences = PosTagger.process(tokenizedText)
sentences = tagedSentences.toArray()
sentences.each do |sentence|
  words = sentence.toArray()
  words.each do |word|
    p word.word().to_s
    p word.tag().to_s
  end
end
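## A possible variation on the loop above, as a sketch: collect [word, tag] pairs into a plain Ruby array instead of printing them.
## tagged_pairs is a hypothetical helper name; it assumes only that each object responds to toArray, word and tag, as in the example above.

```ruby
# Hypothetical helper: flatten tagged sentences into [[word, tag], ...].
# tagged_sentences is expected to respond to toArray, returning sentences;
# each sentence responds to toArray, returning words with word() and tag().
def tagged_pairs(tagged_sentences)
  pairs = []
  tagged_sentences.toArray.each do |sentence|
    sentence.toArray.each do |word|
      pairs << [word.word.to_s, word.tag.to_s]
    end
  end
  pairs
end

# With the objects from the example above (requires rjb and the tagger model):
# pairs = tagged_pairs(tagedSentences)
# nouns = pairs.select { |_, tag| tag.start_with?('NN') }.map { |w, _| w }
```

## Converting to Ruby strings once keeps the rest of the application free of rjb Java objects.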
## -- Rjb setup required, check rjb docs, my info below: --
## OSX .profile file, make sure java bin is in $PATH
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Home
## Ubuntu .bashrc file
## after installing openjdk 6
export PATH=/usr/lib/jvm/java-6-openjdk/bin:$PATH
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
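## A small sketch of a sanity check that could run before Rjb::load, assuming rjb relies on JAVA_HOME as described in its docs.
## java_home_problem is a hypothetical helper, not part of rjb; it only inspects the environment and the filesystem.

```ruby
# Hypothetical helper: return nil when JAVA_HOME looks usable,
# otherwise a short message describing the problem.
def java_home_problem(env = ENV)
  home = env['JAVA_HOME']
  return 'JAVA_HOME is not set' if home.nil? || home.empty?
  return "JAVA_HOME does not point to a directory: #{home}" unless File.directory?(home)
  nil
end

# Warn early instead of failing later inside the JVM loader.
problem = java_home_problem
warn(problem) if problem
```

## Passing the environment as an argument keeps the check easy to try out with a plain Hash.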

buddye commented Feb 27, 2012

I really recommend not trying this in Ruby, though. Try writing a service in Scala instead; it works great on the Play framework.
http://scala.playframework.org/
