@ichramm
Created May 13, 2020 19:09
extract pdf text
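(import org.apache.pdfbox.pdmodel.PDDocument)
(import org.apache.pdfbox.text.PDFTextStripper)
(import java.net.URL)

(defn extract-pdf-text
  "Downloads the PDF at `url` and returns its text content as a string."
  [url]
  ;; Bind the stream in with-open too, so it is closed along with the document.
  (with-open [in (.openStream (URL. url))
              pd (PDDocument/load in)]
    (let [stripper (PDFTextStripper.)]
      (.getText stripper pd))))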

ichramm commented May 14, 2020

Extract HTML instead of plain text:

; [org.apache.pdfbox/pdfbox "2.0.19"]
; [org.apache.pdfbox/pdfbox-tools "2.0.19"]

(import org.apache.pdfbox.pdmodel.PDDocument)
(import org.apache.pdfbox.tools.PDFText2HTML)
(import java.net.URL)

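(defn extract-pdf-text
  "Downloads the PDF at `url` and returns its content rendered as an HTML string."
  [url]
  ;; Bind the stream in with-open too, so it is closed along with the document.
  (with-open [in (.openStream (URL. url))
              pd (PDDocument/load in)]
    (let [stripper (PDFText2HTML.)]
      (.getText stripper pd))))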
