Last active November 20, 2015 01:53
Extract highlights and notes exported from the Good Reader app
#!/usr/bin/env ruby
raise "Set input file. $ ruby #{__FILE__} input.txt" if ARGV.empty?

PAGE_SEPARATOR = %r{--- Page (\S+) ---}
ANNOTATION_SEPARATOR = %r{(?<type>Highlight|Note) \((?<color>[^)]+)\), (?<time>.*)(?:, (?<owner>.*))?:}

lines = File.readlines(ARGV[0])

annotations = []
lines.map(&:chomp)
  .reject{|l| l == "" }
  .slice_before(PAGE_SEPARATOR)
  .select{|page| page.any?{|line| line =~ PAGE_SEPARATOR} }
  .each do |page|
    page_no = page.shift.match(PAGE_SEPARATOR)[1]
    page.slice_before(ANNOTATION_SEPARATOR).each do |annotation|
      info = annotation.shift.match(ANNOTATION_SEPARATOR)
      # text = annotation.map{|h| h.gsub(/[[:space:]]/, '') }.join(' ') # For Japanese
      text = annotation.join(' ') # For English
      annotations << {
        type: info[:type],
        page_no: page_no,
        color: info[:color],
        time: info[:time],
        owner: info[:owner],
        text: text,
      }
    end
  end

### Output
# Change the following part as you like

## --- output as hash
# annotations.each do |a|
#   puts a.inspect
# end

## --- group by type
# annotations.group_by{|a| a[:type] }.each do |type, group|
#   puts "[#{type}]"
#   group.each do |a|
#     puts "#{a[:text]} (p#{a[:page_no]})"
#   end
#   puts
# end

## --- group by color
annotations.group_by{|a| a[:type] }.each do |type, group|
  puts "[#{type}]"
  group.group_by{|a| a[:color] }.each do |color, subgroup|
    puts "--- #{color} ---"
    subgroup.each do |a|
      puts "#{a[:text]} (p#{a[:page_no]})"
    end
    puts
  end
  puts
end

## --- using rainbow
# require 'rainbow'
# annotations.each do |a|
#   puts Rainbow(a[:text]).background(a[:color].to_sym) + "(p#{a[:page_no]})"
# end

## sample input.txt for testing
=begin
File: refactoring-ja-special-edition_p1_0.pdf
Annotation summary:
--- Page xi ---
Highlight (yellow), 2015/03/05 9:10:
2000 年に発行された『リファクタリング プログラミングの体質改善テクニック』
--- Page xix ---
Highlight (yellow), 2015/03/05 9:10:
リファクタリングの父は 2 人います。Ward Cunningham と Kent Beck です。
Highlight (yellow), 2015/03/05 9:10:
John Brant と Don Roberts は単に論文を書くのに止まらず、ツールの作成まで行いました。それが 「Refactoring Browser」 、すなわちリファクタリングを行うための Smalltalk のブラウザです。
--- Page 12 ---
Highlight (yellow), 2015/03/05 9:10:
変更がほんの少しであれば、それによって生じるエラーを見つけるのは簡単 なことです。
--- Page 66 ---
Highlight (red), 2015/03/06 14:56:
決してリファクタリングをしてはいけない場合もあります。第 1 の例は、変更するよりも最 初からの書き直した方が早いという場合です。
Highlight (yellow), 2015/03/06 14:56:
リファクタリングを避けるべき第 2 の例として、 期間が迫っている場合があります。 こうした状況では、リファクタリングをしても生産性の向上が見られるのは締め切り後であり、 時すでに遅しということになってしまいます。
Highlight (blue), 2015/03/06 14:56:
時間が足りなくなるというのは、たいて いの場合、リファクタリングが必要であることを示唆しているのです。
Note (yellow), 2015/03/06 14:56:
あいうえお
=end
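For readers unfamiliar with Enumerable#slice_before, here is a minimal sketch of how the script splits the export into pages. The input lines are a hypothetical English stand-in for the sample above (the file name example.pdf is made up), and local variables are used instead of the script's constants.

```ruby
page_separator = %r{--- Page (\S+) ---}

# Hypothetical export lines (an English stand-in for the sample input).
lines = [
  "File: example.pdf",
  "Annotation summary:",
  "--- Page xi ---",
  "Highlight (yellow), 2015/03/05 9:10:",
  "first highlight",
  "--- Page 12 ---",
  "Note (red), 2015/03/06 14:56:",
  "a note",
]

# A new chunk starts at every line that matches the page separator.
chunks = lines.slice_before(page_separator).to_a

# The first chunk holds the file header lines and contains no page separator,
# which is why the script filters chunks with select before parsing them.
pages = chunks.select { |chunk| chunk.any? { |line| line =~ page_separator } }

# The first line of each remaining chunk carries the page number.
page_numbers = pages.map { |page| page.first.match(page_separator)[1] }
p page_numbers
```

The script then applies the same slice_before idea a second time inside each page, using ANNOTATION_SEPARATOR to split the page into annotations.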
Try
ANNOTATION_SEPARATOR = %r{(?<type>Highlight|Note) \((?<color>[^)]+)\), (?<time>.*)(?:, (?<owner>.*))?:}
The only difference is the time group: (?<time>\d{4}/\d{2}/\d{2} \d{1,2}:\d{1,2}) -> (?<time>.*)
Your time format is different from the one the original pattern expects. (Your format is "Highlight (yellow), 14 mrt. 2015 14:17:".)
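A quick way to see the difference is to run both patterns against both header styles. This is only a sketch: the variable names (strict, loose, numeric_header, localized_header) are made up for illustration, and the two header strings are taken from this thread.

```ruby
# Original, strict pattern: the time must look like "2015/03/05 9:10".
strict = %r{(?<type>Highlight|Note) \((?<color>[^)]+)\), (?<time>\d{4}/\d{2}/\d{2} \d{1,2}:\d{1,2})(?:, (?<owner>.*))?:}
# Loosened pattern: any text is accepted as the time.
loose = %r{(?<type>Highlight|Note) \((?<color>[^)]+)\), (?<time>.*)(?:, (?<owner>.*))?:}

numeric_header   = "Highlight (yellow), 2015/03/05 9:10:"
localized_header = "Highlight (yellow), 14 mrt. 2015 14:17:"

strict_numeric   = strict.match(numeric_header)    # MatchData
strict_localized = strict.match(localized_header)  # nil, the time part does not fit
loose_localized  = loose.match(localized_header)   # MatchData

puts strict_numeric[:time]
puts strict_localized.inspect
puts loose_localized[:time]
```

One side effect worth knowing: because (?<time>.*) is greedy, it also swallows any ", owner" suffix, so with the loosened pattern the owner group will most likely stay nil.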
This worked like a charm! Thank you!
In general, you can investigate errors by adding debug prints.
For instance, add
puts info.inspect
or
puts annotation.inspect
before line 18, and run the program again.
The error says info is nil around lines 18-22. This probably means the regular expression ANNOTATION_SEPARATOR doesn't match the text in your input.txt. I don't know why that happens, but I suspect something is different in Ruby on Windows. (I use a Mac.)
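To make the nil visible, something like the sketch below may help. It assumes a page whose header uses a time format the strict pattern does not expect; the page contents and the unmatched list are made up for illustration. Note that slice_before always starts its first chunk at the first element, whether or not that element matches, which is how a non-matching header ends up being passed to match.

```ruby
annotation_separator = %r{(?<type>Highlight|Note) \((?<color>[^)]+)\), (?<time>\d{4}/\d{2}/\d{2} \d{1,2}:\d{1,2})(?:, (?<owner>.*))?:}

# Hypothetical page content: the header's time format differs from the pattern.
page = [
  "Highlight (yellow), 14 mrt. 2015 14:17:",
  "some highlighted text",
]

unmatched = []
page.slice_before(annotation_separator).each do |annotation|
  header = annotation.first
  info = header.match(annotation_separator)
  if info.nil?
    # Report the offending line instead of failing later on info[:type].
    warn "no match for header: #{header.inspect}"
    unmatched << header
    next
  end
  puts info.inspect
end
```

The line printed by warn shows exactly which header the regular expression fails on.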
One way to fix the error is to change the regular expression.
ANNOTATION_SEPARATOR = %r{(?<type>Highlight|Note) \((?<color>[^)]+)\), (?<time>\d{4}/\d{2}/\d{2} \d{1,2}:\d{1,2})(?:, (?<owner>.*))?:}
is very strict. You can loosen the expression by deleting parts from the end.
For example, change it to
ANNOTATION_SEPARATOR = %r{(?<type>Highlight|Note) \((?<color>[^)]+)\), (?<time>\d{4}/\d{2}/\d{2} \d{1,2}:\d{1,2})}
and try it. (This removes the last part. If you still get an error, you could remove the next part from the end as well.) I could take a look if you email me your input.txt.
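The loosening steps can also be tried in one go, from strictest to loosest, to see where matching starts to work. This is a rough sketch: the labels and the third, shortest pattern (type and color only) are my own assumption about what "removing the next part" would look like, and the header is the one quoted in this thread.

```ruby
# Candidate patterns, ordered from strictest to loosest.
candidates = [
  ["full pattern",
   %r{(?<type>Highlight|Note) \((?<color>[^)]+)\), (?<time>\d{4}/\d{2}/\d{2} \d{1,2}:\d{1,2})(?:, (?<owner>.*))?:}],
  ["without owner and trailing colon",
   %r{(?<type>Highlight|Note) \((?<color>[^)]+)\), (?<time>\d{4}/\d{2}/\d{2} \d{1,2}:\d{1,2})}],
  ["type and color only (assumed next step)",
   %r{(?<type>Highlight|Note) \((?<color>[^)]+)\)}],
]

header = "Highlight (yellow), 14 mrt. 2015 14:17:"

# Pick the first (strictest) candidate that matches the header.
label, _pattern = candidates.find { |_label, pattern| pattern =~ header }
puts "first matching pattern: #{label.inspect}"
```

For this header only the shortest pattern matches, which points at the time part as the piece that needs changing.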