@walterdavis
Created October 19, 2010 22:45
This pair of code snippets converts a PDF to plain text using the ancient pdftotext command-line tool.
#models/document.rb
class Document < ActiveRecord::Base
  has_attached_file :pdf, :styles => { :text => { :fake => 'variable' } }, :processors => [:text]
  #more class stuff here
end
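#lib/paperclip_processors/text.rb
module Paperclip
  # Handles extracting plain text from PDF file attachments
  class Text < Processor
    attr_accessor :whiny

    # The base Processor initializer only stores the file, options and
    # attachment, so @whiny and @basename are set here (as Thumbnail does).
    def initialize(file, options = {}, attachment = nil)
      super
      @whiny    = options[:whiny].nil? ? true : options[:whiny]
      @basename = File.basename(file.path, File.extname(file.path))
    end

    # Creates a Text extract from PDF
    def make
      src = @file
      dst = Tempfile.new([@basename, 'txt'].compact.join("."))
      # Quote both paths; the gsub below collapses the heredoc onto one line.
      command = <<-end_command
        "#{ File.expand_path(src.path) }"
        "#{ File.expand_path(dst.path) }"
      end_command
      begin
        success = Paperclip.run("/usr/bin/pdftotext -nopgbrk", command.gsub(/\s+/, " "))
        Rails.logger.info "Processing #{src.path} to #{dst.path} in the text processor."
      rescue PaperclipCommandLineError
        # Only re-raise when whiny; otherwise the (empty) tempfile is returned.
        raise PaperclipError, "There was an error processing the text for #{@basename}" if @whiny
      end
      dst
    end
  end
end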
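
To check that pdftotext works on a given machine before wiring it into Paperclip, the same call can be made from plain Ruby. This is only a rough sketch and not part of the original gist: the helper names `pdftotext_command` and `extract_text` are made up for illustration, and it assumes a `pdftotext` binary is on the PATH (the processor above hard-codes /usr/bin/pdftotext).

```ruby
require "open3"
require "tempfile"

# Hypothetical helper: builds the argument list for pdftotext.
# -nopgbrk is the same flag the processor above passes.
def pdftotext_command(src_path, dst_path, binary: "pdftotext")
  [binary, "-nopgbrk", File.expand_path(src_path), File.expand_path(dst_path)]
end

# Hypothetical helper: runs pdftotext and returns the extracted text.
# Assumes the binary is installed; raises if it exits with a non-zero status.
def extract_text(src_path, binary: "pdftotext")
  dst = Tempfile.new(["extract", ".txt"])
  dst.close
  # Passing the arguments as a list avoids shell quoting problems with paths.
  _out, err, status = Open3.capture3(*pdftotext_command(src_path, dst.path, binary: binary))
  raise "pdftotext failed: #{err}" unless status.success?
  File.read(dst.path)
ensure
  dst.unlink if dst
end
```

Calling `extract_text("some.pdf")` should return the text as a string. Passing the arguments as an array, rather than one interpolated command string, sidesteps the manual quoting the heredoc in the processor has to do.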