Created
November 15, 2017 10:26
Ever need to get word counts for a directory of PDFs, using nothing but R? Have I got a solution for you!
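library(pdftools)
# find PDF files in the current directory (match the .pdf extension only)
pdfs <- list.files(pattern = "\\.pdf$", ignore.case = TRUE, full.names = TRUE)
# uses pdftools to convert a PDF to plain text, replaces line breaks with spaces,
# and then counts the words, ignoring non-word symbols
readCleanCount <- function(pdf) {
  # pdf_text() returns one string per page
  txt <- pdf_text(pdf)
  # replace any line break (\r\n or \n) with a space and join the pages into one string
  txt <- paste(gsub(pattern = "[\r\n]+", replacement = " ", x = txt), collapse = " ")
  # regex stolen from StackOverflow, like 99% of all regex
  # https://stackoverflow.com/questions/8920145/count-the-number-of-words-in-a-string-in-r
  # gregexpr() returns -1 when nothing matches, so sum(x > 0) gives 0 for empty text
  count <- sapply(gregexpr("[[:alpha:]]+", txt), function(x) sum(x > 0))
  return(count)
}
# one row per PDF, keeping the counts numeric
data.frame(pdf = pdfs, words = vapply(pdfs, readCleanCount, integer(1), USE.NAMES = FALSE))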
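To see what the counting regex actually treats as a word, it may help to try it on plain strings, with no PDF involved. The sketch below is only an illustration: `count_words` is a hypothetical helper name that does not appear in the gist, and it applies the same `[[:alpha:]]+` pattern to a single string.

```r
# Hypothetical helper (not part of the gist): count runs of alphabetic
# characters in a single string, using the same regex as readCleanCount().
count_words <- function(txt) {
  # gregexpr() returns the start position of every match, or -1 if there is none
  matches <- gregexpr("[[:alpha:]]+", txt)[[1]]
  # only real match positions are > 0, so a string with no matches counts as 0
  sum(matches > 0)
}

count_words("The quick brown fox")  # 4
count_words("1234 -- 5678")         # 0: digits and punctuation are ignored
count_words("don't stop")           # 3: the apostrophe splits "don" and "t"
```

Because the pattern only matches letters, numbers are never counted and contractions or hyphenated words are split into pieces, so the totals are best read as approximate.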
You can use DocuCount by Tradutema (https://tradutema.com/docucount-by-tradutema/) to get the number of words in just 3 seconds from a list of files (images, pdf, locked pdf, protected pdf, Docx…), even handwritten text. You will get a log including the number of words per file, number of pages, totals, etc. that you can copy and paste into Excel. It’s free. It’s incredibly fast because it uses Artificial Intelligence instead of an OCR conversion function.