Created September 14, 2020 13:09
Extract plain text from a PDF in Ruby by wrapping the pdftotext shell command.
# frozen_string_literal: true

##
# Primary responsibility is extracting text from a PDF or confirming whether
# text is available in the PDF.
#
# Security note: This simple wrapper assumes that the PDF filename you give it has been
# chosen by an internal method, such as a tempfile name. Do not pass unsafe user-supplied
# file names into this class.
#
# Copyright 2017 Rietta Inc. BSD Licensed.
#
class PdfTextExtractor
  attr_accessor :pdf_file

  def initialize(pdf_file:)
    unless command?('pdftotext')
      raise 'pdftotext is not installed, but is required.'
    end

    @pdf_file = pdf_file
  end

  # Determine if a command is available on the current Unix system.
  def command?(command)
    # Arguments are passed as a list, so no shell interpolates the command name.
    system('which', command, out: File::NULL, err: File::NULL)
  end

  # Extracted text, memoized after the first call. The trailing '-' tells
  # pdftotext to write to standard output instead of a file.
  def text
    # The argument list bypasses the shell, so a quote in the path cannot break the command.
    @text ||= IO.popen(['pdftotext', @pdf_file.to_s, '-'], &:read).strip
  end

  # True when the PDF yielded any non-whitespace text.
  def text?
    text != ''
  end

  def as_json(_opts = {})
    {
      filename: @pdf_file,
      text: text
    }
  end
end
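
The security note in the class comment asks for file names chosen internally, such as a tempfile name. One possible way to meet that requirement is sketched below. `with_internal_pdf_path` is a hypothetical helper, not part of the original file: it copies untrusted PDF bytes into a `Tempfile`, so the path handed to the extractor is generated by Ruby and not by the user.

```ruby
require 'tempfile'

# Hypothetical helper (an assumption, not part of the original class).
# Writes the given bytes to a temporary file with an internally generated
# name, yields that path, and removes the file when the block returns.
def with_internal_pdf_path(pdf_bytes)
  Tempfile.create(['upload', '.pdf']) do |file|
    file.binmode
    file.write(pdf_bytes)
    file.flush # make sure the bytes are on disk before another process reads them
    yield file.path
  end
end

# Usage, assuming PdfTextExtractor is loaded and pdftotext is installed
# (uploaded_io stands in for any IO object holding the upload):
#
#   with_internal_pdf_path(uploaded_io.read) do |path|
#     extractor = PdfTextExtractor.new(pdf_file: path)
#     puts extractor.text if extractor.text?
#   end
```

Because the file is deleted when the block ends, any text you need should be read inside the block.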