@nniiicc
Forked from erichare/pdftools_tables.R
Created May 14, 2019 15:59
Extract Tables from PDF with PDFTools 2.0
library(pdftools)
library(tidyverse)
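# Parse every page of a PDF into a data frame; returns one list element per page.
# Assumes each page holds a single table whose header is the topmost text row.
parse_tables <- function(url, remove_last = TRUE) {
  # pdf_data() returns one data frame per page, one row per word, with
  # columns that include x, y, space and text
  pages <- pdf_data(url)
  lapply(pages, function(page) {
    # The words with the smallest y value form the header row
    header_row <- page %>%
      filter(y == min(y))
    subdata <- page %>%
      filter(y != min(y))
    first_table_temp <- subdata %>%
      group_by(y) %>%
      # Tag each word with its `space` flag: TRUE means the next word belongs
      # to the same cell, FALSE marks the end of a cell. Cell text that itself
      # contains the literal TRUE or FALSE would be split incorrectly.
      summarise(text = paste(text, space, collapse = " ")) %>%
      mutate(text = gsub("TRUE ", "", text),
             text = gsub("FALSE", ",,", text)) %>%
      # lapply() always yields a list column; sapply() would simplify to a
      # matrix whenever every row has the same number of cells
      mutate(text = lapply(strsplit(text, ",,"), str_trim)) %>%
      rowwise() %>%
      # Right-align the header names; a leading row-label column is named "V1"
      mutate(variable = list(tail(c("V1", header_row$text), length(text)))) %>%
      ungroup() %>%
      mutate(id = row_number())
    if (remove_last) {
      # Drop the last text row on the page (presumably a footer or page number)
      first_table_temp <- first_table_temp %>%
        slice(-n())
    }
    first_table_temp %>%
      # Name the list columns explicitly, as tidyr >= 1.0 expects
      unnest(cols = c(text, variable)) %>%
      spread(key = variable, value = text) %>%
      select(-y, -id) %>%
      select(any_of(c("V1", header_row$text))) %>%
      mutate_all(parse_guess)
  })
}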
parse_tables("https://github.com/ropensci/tabulizer/raw/master/inst/examples/data.pdf")
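
The cell-splitting step inside parse_tables can be hard to follow, so here is a minimal sketch of it in isolation. It runs on a small hand-built tibble standing in for one page of pdf_data() output; the values are made up for illustration and only the y, text and space columns are mimicked.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for one page of pdf_data() output: one row per word.
# `space` is TRUE when the word is followed by a space inside the same cell.
words <- tibble(
  y     = c(10L, 10L, 10L, 20L, 20L),
  text  = c("Mazda", "RX4", "21.0", "Valiant", "18.1"),
  space = c(TRUE, FALSE, FALSE, FALSE, FALSE)
)

cells <- words %>%
  group_by(y) %>%
  # "Mazda TRUE RX4 FALSE 21.0 FALSE" for the first line
  summarise(text = paste(text, space, collapse = " ")) %>%
  mutate(text = gsub("TRUE ", "", text),    # rejoin words of the same cell
         text = gsub("FALSE", ",,", text),  # mark the end of each cell
         text = lapply(strsplit(text, ",,"), str_trim))

cells$text[[1]]  # "Mazda RX4" "21.0"
cells$text[[2]]  # "Valiant"   "18.1"
```

Each element of cells$text is the vector of cell values for one line of the page, which is what the later unnest() and spread() steps reshape into columns.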