@sck
Created May 31, 2012 20:15
Convert all pdfs to text
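#!/usr/bin/env ruby
# Finds PDFs under ~/Documents that have no text layer and sends them to the
# 'Scan to Searchable PDF' app for OCR. Requires pdftotext on the PATH.
require 'fileutils'
require 'open3'
require 'set'

$home = ENV['HOME']
$documents = "#{$home}/Documents"
$already_tried_index = "#{$documents}/already_tried_files.txt"

# True if pdftotext extracts at least one word character from the file.
def does_pdf_file_contain_text?(fn)
  # Pass arguments as a list so no shell parses the file name; '-' means stdout.
  text, _status = Open3.capture2('pdftotext', fn, '-')
  !!(text.scrub =~ /\w/)
end

# Create the index on first run so File.read does not raise.
FileUtils.touch($already_tried_index)

$all_pdfs = Dir["#{$documents}/**/*.pdf"]
$already_tried = Set.new(File.read($already_tried_index).split("\n"))
$pdfs_without_text = $all_pdfs.select { |fn|
  !$already_tried.include?(fn) && !does_pdf_file_contain_text?(fn)
}

if $pdfs_without_text.size > 0
  # Multi-argument system bypasses the shell, so names need no quoting.
  system 'open', '-a', 'Scan to Searchable PDF', *$pdfs_without_text
  File.open($already_tried_index, 'a') { |idx|
    $pdfs_without_text.each { |fn|
      puts "Indexing: #{fn}"
      idx.puts fn
    }
  }
else
  puts 'No pdfs to index.'
end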
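
The filtering and command-building steps can be tried without pdftotext or the OCR app installed. The sketch below is illustrative only: `untried_pdfs` and `open_command` are hypothetical helper names that do not appear in the script, and the file paths are made up. It assumes the index holds one path per line, and it uses `Shellwords.join` from the standard library to show what a shell-safe command line looks like for names containing spaces or quotes.

```ruby
require 'set'
require 'shellwords'

# Hypothetical helper: returns the PDFs that are not yet listed in the index.
def untried_pdfs(all_pdfs, tried_lines)
  tried = Set.new(tried_lines.map(&:chomp))
  all_pdfs.reject { |fn| tried.include?(fn) }
end

# Hypothetical helper: builds a shell-escaped `open -a` command line.
def open_command(app, files)
  Shellwords.join(['open', '-a', app, *files])
end

# Made-up paths, including names that would break naive quoting.
pdfs = ['/docs/a.pdf', '/docs/my scan.pdf', "/docs/o'brien.pdf"]
todo = untried_pdfs(pdfs, ["/docs/a.pdf\n"])
puts open_command('Scan to Searchable PDF', todo)
```

Building a string this way is mainly useful for logging or a dry run; when actually launching the app, passing the arguments as a list to `system` avoids the shell altogether.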