# Gem Depends: ve, docopt
# System Depends: mecab, mecab-ipadic-utf-8
require 'csv'
require 've'
require 'docopt'

doc = <<DOCOPT
Lemma Frequency Report.

Usage:
  #{__FILE__} [options] FILE ...
  #{__FILE__} -h | --help
  #{__FILE__} --version

Options:
  -h --help      Show this screen.
  -m --morpheme  Target morphemes, instead of lexemes.
  --version      Show version.
DOCOPT

def main(opt)
  # Input from args, UTF-8 required; '-' means standard input
  contents = ''
  opt['FILE'].each do |f|
    f = '/dev/stdin' if f == '-'
    contents << File.read(f)
  end

  # Pre-processing. We need to give mecab bite-sized pieces, because pipes
  # can't handle big sizes and ve uses pipes.
  lines = remove_rubies(contents).split

  # Process the text and count lemmas; this might take a while
  freq = calculate_frequency(lines, opt['--morpheme'])

  # Show count
  show_count(freq)
end

# Creates a hash with the lemma frequencies over all the lines
def calculate_frequency(lines, morpheme)
  lines.reduce(Hash.new(0)) do |freq, line|
    ve_line = filter_blacklisted(Ve.in(:ja).words(line))
    get_frequency_hash(ve_line, morpheme, freq)
  end
end

# For Aozora Bunko text as input, rubies need to be removed. The non-greedy
# match stops at the first 》, so text between two annotations survives.
def remove_rubies(text)
  text.gsub(/《.*?》/, '')
end

# For morpheme operations, it would be much faster to use mecab directly
def get_frequency_hash(words, morpheme, freq = Hash.new(0))
  words.each do |word|
    next if word.lemma == '*' # if the lemma could not be found, don't count
    if morpheme
      word.tokens.each do |token|
        freq[[token[:lemma], token[:pos]]] += 1
      end
    else
      freq[[word.lemma, word.part_of_speech.name]] += 1
    end
  end
  freq
end

def filter_blacklisted(words)
  pos_blacklist = [Ve::PartOfSpeech::Symbol, Ve::PartOfSpeech::ProperNoun]
  words.reject { |word| pos_blacklist.include? word.part_of_speech }
end

def show_count(counts)
  counts.sort_by { |_, count| count }.reverse.each do |ind, count|
    print [count, ind.first, ind.last].to_csv
  end
end

if __FILE__ == $0
  begin
    main Docopt::docopt(doc, version: '0.0.1')
  rescue Docopt::Exit => e
    puts e.message
  end
end
So, I did some profiling, and ve is the bottleneck for speed. However, the issue you encountered was that the file was bigger than a pipe's buffer can hold (ve uses a pipe to feed text to mecab). For the moment, I've worked around this by feeding mecab line by line, which doesn't seem to add much overhead. I've also added an option (`-m`/`--morpheme`) to use the mecab lemmas and parts of speech instead.
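A couple of example invocations (assuming the file is saved as `freq.rb`; the name is up to you):

```
ruby freq.rb book.txt > lexemes.csv
ruby freq.rb --morpheme book.txt > morphemes.csv
cat book.txt | ruby freq.rb - > from_stdin.csv
```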
Also, it no longer combines identical lemmas that have different parts of speech. This did change some of the counts; for example, with your file, the の lemma lost a few hundred from its count.
The output is now CSV, both for easier processing and because, once part of speech was added, I realized it would be impractical to line up double-width Unicode in columns on a terminal screen. Maybe we can add some sort of pretty printing later.
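For reference, each row is count, lemma, part of speech, so the top of a report looks something like this (the counts here are invented for illustration, and the third column is whatever `word.part_of_speech.name` returns):

```
1892,の,Ve::PartOfSpeech::Postposition
903,する,Ve::PartOfSpeech::Verb
411,猫,Ve::PartOfSpeech::Noun
```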
So, you should be able to run the program on that book in a few seconds (ten at most, I'd say). Any longer, and something has gone wrong.
Also, as the comment in the file notes, if we're only counting morphemes it would be better to access mecab directly and skip ve entirely. If that's going to be a commonly used feature, I'll make sure that happens. First, though, I'd like to ask the author of ve whether he can fix the piping issue.
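For the curious, here's a rough sketch of what the direct route could look like. None of this is in the gist yet; it assumes mecab is on PATH with a UTF-8 ipadic dictionary, whose feature string is POS,POS1,POS2,POS3,conjugation-type,conjugation-form,lemma,reading,pronunciation:

```ruby
# Sketch only: count morpheme lemmas straight from mecab, skipping ve
require 'csv'

def mecab_frequency(text)
  # One mecab process for the whole text; closing stdin flushes everything,
  # so we never read back interactively and never hit the pipe problem
  output = IO.popen('mecab', 'r+') do |io|
    io.write(text)
    io.close_write
    io.read
  end

  output.each_line.reduce(Hash.new(0)) do |freq, line|
    line = line.chomp
    next freq if line == 'EOS'
    surface, feature = line.split("\t")
    next freq unless feature
    fields = feature.split(',')
    lemma = fields[6]
    next freq if lemma.nil? || lemma == '*' # no lemma found, don't count
    freq[[lemma, fields[0]]] += 1
    freq
  end
end

print mecab_frequency(ARGF.read)
        .sort_by { |_, count| -count }
        .map { |(lemma, pos), count| [count, lemma, pos].to_csv }
        .join
```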
To store the data, redirect the output and save it as a CSV file. Just about any programming language has easy facilities for reading that (untyped) data back. For now this should be fine; at the moment I don't see any significant benefit to storing the results in a database. If you have a good reason, I'd be glad to implement it.
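For example, loading the report back in Ruby takes a couple of lines (`freq.csv` here is whatever file you redirected to):

```ruby
require 'csv'

# CSV gives back strings, so coerce the count column ourselves
rows = CSV.read('freq.csv').map { |count, lemma, pos| [Integer(count), lemma, pos] }
rows.first(5).each { |count, lemma, pos| puts "#{lemma} (#{pos}): #{count}" }
```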
Concerning overall speed: if your data is going to be huge, we might need to optimize (and fix) ve, or rewrite it in a more efficient language. If the current speed is acceptable, though, I'd suggest against redoing work that has already been done :).
I should have noted that Ve includes the MeCab results for each Ve word in a `tokens` member, so we do have easy access to prettified MeCab output.
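Judging from how the `--morpheme` path above uses them, each token is a hash with fields like `:lemma` and `:pos`, so dumping the raw analysis is as simple as:

```ruby
require 've'

Ve.in(:ja).words('すもももももももものうち').each do |word|
  word.tokens.each do |token|
    puts "#{token[:lemma]} (#{token[:pos]})" # same fields the -m option counts
  end
end
```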