@yyyyyyuanfei
Created December 9, 2012 08:04
nlp

Directory Tree:

  NLP/
    pipeline.rb
    stanford-segmenter-2012-11-11/
      test.simp.utf8
      segment.sh
    stanford-postagger-full-2012-11-11/
      models/
      stanford-postagger.sh

Usage

  $ cd NLP
  $ bundle --without development
  $ ruby pipeline.rb /path/to/input/file

Note

I made a small change to stanford-postagger.sh; please use the modified version. (The main change is a fix to -classpath.)
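
Judging from the regular expression in `normalize` (in pipeline.rb), the input file may contain pre-annotated tokens written as `{text/tag}`. The sketch below shows that stripping step on an in-memory string. `strip_tags`, the sample sentence, and the `ORG` tag are made up for illustration and are not part of the gist. Keyword arguments assume Ruby 2.0 or later.

```ruby
# encoding: utf-8

# Same pattern as in normalize: a brace pair holding "text/tag".
TAGGED_TOKEN = /\{(.*?)\/(.*?)\}/

# Hypothetical string-based variant of normalize (the original reads a file).
def strip_tags(doc, keep_bracket: false)
  doc.gsub(TAGGED_TOKEN) do
    # $1 is the text before the slash; the tag after it is discarded.
    keep_bracket ? "{#{$1}}" : $1
  end
end

sample = '我在{北京大学/ORG}读书'       # made-up input
puts strip_tags(sample, keep_bracket: true)
puts strip_tags(sample)
```

The pipeline calls `normalize` with `keep_bracket: true`, so the braces survive this step and presumably mark spans that should stay together after segmentation.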

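# encoding: utf-8
require 'cocaine'
require 'tempfile'
require 'benchmark'
require 'log4r'

# String helpers for cleaning up segmenter and tagger output.
module NLP
  # Re-joins text inside {...}: drops the braces and removes any spaces
  # that were inserted between them.
  def chomp_bracket!
    gsub!('}{', '} {')
    gsub!(/\{(.*?)\}/) { $1.delete(' ') }
    self
  end

  # Puts one token per line and blanks out lines that start with '#'.
  # Each step mutates the receiver; gsub! returns nil when nothing
  # matches, so the calls are not chained.
  def linerize!
    gsub!(' ', "\n")
    gsub!(/^#.*$/, '')
    self
  end
end

class String
  include NLP
end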

# Absolute paths to the Stanford tools; adjust these for your machine.
SEGMENTER_DIR = '/Users/stranbird/Documents/NLP/stanford-segmenter-2012-11-11'
POSTAGGER_DIR = '/Users/stranbird/Documents/NLP/stanford-postagger-full-2012-11-11'
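
# Reads input_file and strips the tag from every {text/tag} annotation.
# With keep_bracket: true the braces are kept ({text}); otherwise only
# the text remains.
def normalize(input_file, options = {})
  doc = File.read(input_file)
  capture_tags_regexp = /\{(.*?)\/(.*?)\}/
  doc.gsub(capture_tags_regexp) do
    if options[:keep_bracket]
      "{#{$1}}"
    else
      $1
    end
  end
end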

# Runs segment.sh on input_file.
# Returns [segmented_text, elapsed_seconds].
def segment(input_file)
  segmenter = File.join(SEGMENTER_DIR, 'segment.sh')
  segment_command = Cocaine::CommandLine.new(
    segmenter, ':model :filename :encoding :size', swallow_stderr: true
  )
  params = {
    model: 'ctb', # alternative: 'pku'
    filename: input_file,
    encoding: 'UTF-8',
    size: '0'
  }
  res = nil
  segment_time = Benchmark.realtime do
    res = segment_command.run(params)
  end
  res.chomp_bracket!
  [res, segment_time]
end

# Runs stanford-postagger.sh on input_file with the Chinese distsim model.
# Returns [tagged_text, elapsed_seconds].
def postag(input_file)
  postagger = File.join(POSTAGGER_DIR, 'stanford-postagger.sh')
  postag_command = Cocaine::CommandLine.new(
    postagger, ':model :filename', swallow_stderr: true
  )
  params = {
    model: File.join(POSTAGGER_DIR, 'models', 'chinese-distsim.tagger'),
    filename: input_file
  }
  res = nil
  postag_time = Benchmark.realtime do
    res = postag_command.run(params)
  end
  [res, postag_time]
end
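
# Writes res to a temporary file and returns its path.
# The Tempfile object itself is not kept, so Ruby may delete the file
# once that object is garbage-collected; use the path promptly.
def store_result(res)
  tmpfile = Tempfile.new('result')
  tmpfile.write(res)
  tmpfile.close
  tmpfile.path
end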
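
# Pipeline: normalize -> segment -> POS-tag -> print one token per line.
abort 'usage: ruby pipeline.rb /path/to/input/file' if ARGV.empty?

res = normalize(ARGV[0], keep_bracket: true)
path = store_result(res)
res, segment_time = segment(path)
path = store_result(res)
res, postag_time = postag(path)
res.linerize!

puts res
puts "segment: #{segment_time}s"
puts "postag: #{postag_time}s"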
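
The two String helpers in pipeline.rb can be tried without the Stanford tools installed. The sketch below uses non-mutating stand-ins, so it does not patch `String`. The names `join_bracketed` and `one_token_per_line` and both sample strings are hypothetical; in particular, the samples are not real segmenter or tagger output.

```ruby
# encoding: utf-8

# Stand-in for String#chomp_bracket!: separate adjacent groups, then drop
# the braces and every space inside them.
def join_bracketed(text)
  text.gsub('}{', '} {').gsub(/\{(.*?)\}/) { $1.delete(' ') }
end

# Stand-in for String#linerize!: one token per line, with lines that
# start with '#' blanked out.
def one_token_per_line(text)
  text.gsub(' ', "\n").gsub(/^#.*$/, '')
end

segmented = '我 在 {北京 大学} 读书'   # made-up sample with a protected span
puts join_bracketed(segmented)

tokens = 'w1 w2 w3'                    # made-up sample
puts one_token_per_line(tokens)
```

Returning new strings instead of mutating the receiver sidesteps the `nil` that `gsub!` returns when nothing matches, which is why the calls can be chained safely here.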
#!/bin/sh
#
# usage: ./stanford-postagger.sh model textFile
# e.g., ./stanford-postagger.sh models/left3words-wsj-0-18.tagger sample-input.txt
# Quote $0, $1 and $2 so that paths containing spaces are passed through intact.
BASEDIR=$(dirname "$0")
java -mx300m -cp "$BASEDIR/stanford-postagger.jar:" edu.stanford.nlp.tagger.maxent.MaxentTagger -model "$1" -textFile "$2"
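
The wrapper script takes the model path and the text file as its two positional arguments. As a rough sketch, it could also be called from Ruby with the standard library's `Open3` instead of the Cocaine gem that pipeline.rb uses. The helper names `postag_argv` and `run_postagger` and the `/opt/nlp` path are assumptions made for illustration.

```ruby
require 'open3'

# Hypothetical helper: builds the argument vector for stanford-postagger.sh
# (script path, model path, text file).
def postag_argv(postagger_dir, input_file, model_name = 'chinese-distsim.tagger')
  [
    File.join(postagger_dir, 'stanford-postagger.sh'),
    File.join(postagger_dir, 'models', model_name),
    input_file
  ]
end

# Hypothetical helper: runs the wrapper and returns its stdout.
# Passing the arguments separately avoids shell interpolation of the paths.
def run_postagger(postagger_dir, input_file)
  stdout, _stderr, status = Open3.capture3(*postag_argv(postagger_dir, input_file))
  raise "postagger exited with status #{status.exitstatus}" unless status.success?
  stdout
end

# Only builds and prints the command; nothing is executed here.
p postag_argv('/opt/nlp/stanford-postagger-full-2012-11-11', 'input.txt')
```

Unlike the `swallow_stderr: true` option used in pipeline.rb, this version captures stderr in a variable, so it could be logged when the script fails.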
