@dwbapst
Created November 15, 2017 10:26
Ever need to get word counts for a directory of PDFs, using nothing but R? Have I got a solution for you!
library(pdftools)

# find PDF files in the current directory
# (anchor the pattern so only files ending in ".pdf" match)
pdfs <- list.files(pattern = "\\.pdf$", full.names = TRUE)

# use pdftools to convert a PDF to plain text, replace line breaks with
# spaces, and then count the words, ignoring non-word symbols
readCleanCount <- function(pdf) {
    txt <- pdf_text(pdf)
    # match both "\r\n" and bare "\n" line endings
    txt <- paste(gsub("[\r\n]+", " ", txt), collapse = " ")
    # regex stolen from StackOverflow, like 99% of all regex
    # https://stackoverflow.com/questions/8920145/count-the-number-of-words-in-a-string-in-r
    sapply(gregexpr("[[:alpha:]]+", txt), function(x) sum(x > 0))
}

cbind(pdfs, sapply(pdfs, readCleanCount))
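If you want to sanity-check the counting logic without any PDFs on hand, the same regex trick works on any plain string. This is just a minimal sketch of the counting step pulled out on its own (the function name countWords is mine, not part of the gist):

```r
# count "words" as runs of alphabetic characters, the same way
# readCleanCount does internally
countWords <- function(txt) {
  # gregexpr returns the start positions of matches, or -1 when there
  # are none, so summing the positive positions counts the matches
  sapply(gregexpr("[[:alpha:]]+", txt), function(x) sum(x > 0))
}

countWords("Hello, world! 123 foo-bar")  # 4: Hello, world, foo, bar
```

Note that digits and punctuation are ignored entirely, and hyphenated words count as two, which is usually close enough for a rough manuscript word count.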
@jpardofurness
You can use DocuCount by Tradutema (https://tradutema.com/docucount-by-tradutema/) to get word counts straight away, in just a few seconds, from a list of files (images, PDFs, locked PDFs, protected PDFs, DOCX…), even handwritten text. You get a log with the number of words per file, number of pages, totals, etc., that you can copy-paste into Excel. It's free, and incredibly fast because it uses artificial intelligence instead of OCR conversion.
