#!/usr/bin/env ruby
raise "Set input file. $ ruby #{__FILE__} input.txt" if ARGV.empty?
PAGE_SEPARATOR = %r{--- Page (\S+) ---}
ANNOTATION_SEPARATOR = %r{(?<type>Highlight|Note) \((?<color>[^)]+)\), (?<time>.*)(?:, (?<owner>.*))?:}
lines = File.readlines(ARGV[0])
annotations = []
lines.map(&:chomp)
  .reject{|l| l == "" }
  .slice_before(PAGE_SEPARATOR)
  .select{|page| page.any?{|line| line =~ PAGE_SEPARATOR} }
  .each do |page|
    page_no = page.shift.match(PAGE_SEPARATOR)[1]
    page.slice_before(ANNOTATION_SEPARATOR).each do |annotation|
      info = annotation.shift.match(ANNOTATION_SEPARATOR)
      # text = annotation.map{|h| h.gsub(/[[:space:]]/, '') }.join(' ') # For Japanese
      text = annotation.join(' ') # For English
      annotations << {
        type: info[:type],
        page_no: page_no,
        color: info[:color],
        time: info[:time],
        owner: info[:owner],
        text: text,
      }
    end
  end
### Output
# Change the following part as you like
## --- output as hash
# annotations.each do |a|
#   puts a.inspect
# end
## --- group by type
# annotations.group_by{|a| a[:type] }.each do |type, group|
#   puts "[#{type}]"
#   group.each do |a|
#     puts "#{a[:text]} (p#{a[:page_no]})"
#   end
#   puts
# end
## --- group by color
annotations.group_by{|a| a[:type] }.each do |type, group|
  puts "[#{type}]"
  group.group_by{|a| a[:color] }.each do |color, subgroup|
    puts "--- #{color} ---"
    subgroup.each do |a|
      puts "#{a[:text]} (p#{a[:page_no]})"
    end
    puts
  end
  puts
end
## --- using rainbow
# require 'rainbow'
# annotations.each do |a|
#   puts Rainbow(a[:text]).background(a[:color].to_sym) + "(p#{a[:page_no]})"
# end
## sample input.txt for testing
=begin
File: refactoring-ja-special-edition_p1_0.pdf
Annotation summary:
--- Page xi ---
Highlight (yellow), 2015/03/05 9:10:
2000 年に発行された『リファクタリング プログラミングの体質改善テクニック』
--- Page xix ---
Highlight (yellow), 2015/03/05 9:10:
リファクタリングの父は 2 人います。Ward Cunningham と Kent Beck です。
Highlight (yellow), 2015/03/05 9:10:
John Brant と Don Roberts は単に論文を書くのに止まらず、ツールの作成まで行いました。それが 「Refactoring Browser」 、すなわちリファクタリングを行うための Smalltalk のブラウザです。
--- Page 12 ---
Highlight (yellow), 2015/03/05 9:10:
変更がほんの少しであれば、それによって生じるエラーを見つけるのは簡単 なことです。
--- Page 66 ---
Highlight (red), 2015/03/06 14:56:
決してリファクタリングをしてはいけない場合もあります。第 1 の例は、変更するよりも最 初からの書き直した方が早いという場合です。
Highlight (yellow), 2015/03/06 14:56:
リファクタリングを避けるべき第 2 の例として、 期間が迫っている場合があります。 こうした状況では、リファクタリングをしても生産性の向上が見られるのは締め切り後であり、 時すでに遅しということになってしまいます。
Highlight (blue), 2015/03/06 14:56:
時間が足りなくなるというのは、たいて いの場合、リファクタリングが必要であることを示唆しているのです。
Note (yellow), 2015/03/06 14:56:
あいうえお
=end
Generally, you can investigate an error like this by adding debug prints.
For instance, add `puts annotation.inspect` or `puts info.inspect` around line 18 (print `info` only after it has been assigned), and run the program again.
The error says `info` is nil around lines 18-22.
This probably means the regular expression `ANNOTATION_SEPARATOR` doesn't match the text in your input.txt.
I don't know why it happens, but I suspect something behaves differently with Ruby on Windows. (I use a Mac.)
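A minimal sketch of that kind of debug print, run outside the script. The helper name `debug_match` and the second header line are made up for illustration and are not taken from any real input.txt. `String#match` returns `nil` when the pattern does not match, and that `nil` is what later makes `info[:type]` fail:

```ruby
# Strict pattern discussed in this thread (time must look like 2015/03/05 9:10).
ANNOTATION_SEPARATOR = %r{(?<type>Highlight|Note) \((?<color>[^)]+)\), (?<time>\d{4}/\d{2}/\d{2} \d{1,2}:\d{1,2})(?:, (?<owner>.*))?:}

# Hypothetical helper: prints the raw header and the match result, then
# returns the result (a MatchData, or nil when the pattern does not match).
def debug_match(header)
  info = header.match(ANNOTATION_SEPARATOR)
  puts "#{header.inspect} => #{info.inspect}"
  info
end

# Header in the format of the sample input: prints a MatchData.
debug_match("Highlight (yellow), 2015/03/05 9:10:")

# Made-up header with a different date format: prints nil.
debug_match("Highlight (yellow), 5 March 2015 9:10:")
```

If a line from your own input.txt prints `nil` here, that line is the one the script chokes on.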
One idea for fixing the error is to change the regular expression.
`ANNOTATION_SEPARATOR = %r{(?<type>Highlight|Note) \((?<color>[^)]+)\), (?<time>\d{4}/\d{2}/\d{2} \d{1,2}:\d{1,2})(?:, (?<owner>.*))?:}` is very strict.
You can loosen the expression by deleting some parts from the end.
For example, change it to `ANNOTATION_SEPARATOR = %r{(?<type>Highlight|Note) \((?<color>[^)]+)\), (?<time>\d{4}/\d{2}/\d{2} \d{1,2}:\d{1,2})}` and try it. (This removes the last part. If you still get the error, you could remove the next part from the end as well.)
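The "remove parts from the end" approach can be sketched as a small loop that tries progressively shorter patterns against one header line. This is only an illustration: the constant `CANDIDATE_PATTERNS`, its labels, and the helper `first_matching_label` are hypothetical names, and the header used is the format quoted elsewhere in this thread.

```ruby
# Candidate patterns from strictest to loosest. Each one drops a part from
# the end of the previous one. The labels are made up for this sketch.
CANDIDATE_PATTERNS = {
  "type + color + time + owner" =>
    %r{(?<type>Highlight|Note) \((?<color>[^)]+)\), (?<time>\d{4}/\d{2}/\d{2} \d{1,2}:\d{1,2})(?:, (?<owner>.*))?:},
  "type + color + time" =>
    %r{(?<type>Highlight|Note) \((?<color>[^)]+)\), (?<time>\d{4}/\d{2}/\d{2} \d{1,2}:\d{1,2})},
  "type + color" =>
    %r{(?<type>Highlight|Note) \((?<color>[^)]+)\)},
}.freeze

# Hypothetical helper: reports each attempt and returns the label of the
# first (strictest) pattern that matches, or nil if none does.
def first_matching_label(header)
  CANDIDATE_PATTERNS.each do |label, pattern|
    matched = pattern.match?(header)
    puts "#{label}: #{matched ? 'match' : 'no match'}"
    return label if matched
  end
  nil
end

first_matching_label("Highlight (yellow), 14 mrt. 2015 14:17:")
```

The part that had to be dropped before a pattern matched is the part that disagrees with the input, which narrows down what needs loosening.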
I can take a look if you send me the input.txt by email.
Try `ANNOTATION_SEPARATOR = %r{(?<type>Highlight|Note) \((?<color>[^)]+)\), (?<time>.*)(?:, (?<owner>.*))?:}`.
The difference is `(?<time>\d{4}/\d{2}/\d{2} \d{1,2}:\d{1,2})` -> `(?<time>.*)`.
The format of the time is different. (Your format is "Highlight (yellow), 14 mrt. 2015 14:17:".)
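To see the effect of that change, here is a small comparison of the two patterns on both header formats. The constant names `STRICT_TIME_PATTERN` and `LOOSE_TIME_PATTERN` and the helper `captured_time` are hypothetical; the patterns and header formats are the ones quoted in this thread.

```ruby
# Original pattern: time must look like 2015/03/05 9:10.
STRICT_TIME_PATTERN = %r{(?<type>Highlight|Note) \((?<color>[^)]+)\), (?<time>\d{4}/\d{2}/\d{2} \d{1,2}:\d{1,2})(?:, (?<owner>.*))?:}
# Loosened pattern: time may be any text.
LOOSE_TIME_PATTERN = %r{(?<type>Highlight|Note) \((?<color>[^)]+)\), (?<time>.*)(?:, (?<owner>.*))?:}

# Hypothetical helper: returns the captured time, or nil if there is no match.
def captured_time(pattern, header)
  info = header.match(pattern)
  info && info[:time]
end

slash_header = "Highlight (yellow), 2015/03/05 9:10:"    # format of the sample input
dutch_header = "Highlight (yellow), 14 mrt. 2015 14:17:" # format quoted above

[slash_header, dutch_header].each do |header|
  puts header
  puts "  strict: #{captured_time(STRICT_TIME_PATTERN, header).inspect}"
  puts "  loose:  #{captured_time(LOOSE_TIME_PATTERN, header).inspect}"
end
```

One trade-off to be aware of: because `.*` is greedy, a header that ends in `, <owner>:` has the owner captured as part of `time`, and `owner` stays nil.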
This worked like a charm! Thank you!
Forgive me for these total Ruby newbie questions. But what you have here is the solution to my annotation extraction problem, which I have been struggling with for the last few weeks!
FYI, I first had to learn how to get the Ruby file running and where to put the input.txt file. When I finally managed to run the .rb file in a Windows command prompt, I ran the file exactly as you provided it.
It came back with the error:
C:\Users\Jochem\Desktop\Ruby test>ruby "C:\Users\Jochem\Desktop\Ruby test\Goodreader test.rb" input.txt
C:/Users/Jochem/Desktop/Ruby test/Goodreader test.rb:21:in `block (2 levels) in <main>': undefined method `[]' for nil:NilClass (NoMethodError)
    from C:/Users/Jochem/Desktop/Ruby test/Goodreader test.rb:17:in `<<'
    from C:/Users/Jochem/Desktop/Ruby test/Goodreader test.rb:17:in `each'
    from C:/Users/Jochem/Desktop/Ruby test/Goodreader test.rb:17:in `each'
    from C:/Users/Jochem/Desktop/Ruby test/Goodreader test.rb:17:in `block in <main>'
    from C:/Users/Jochem/Desktop/Ruby test/Goodreader test.rb:14:in `each'
    from C:/Users/Jochem/Desktop/Ruby test/Goodreader test.rb:14:in `<main>'
Can you see or guess what I am doing wrong? I study a lot and have loads of files on Goodreader of which I would like to extract the annotations.
Any help would be greatly appreciated. Thanks for posting this in the first place!
Kind regards,
Traveller