@muschellij2
Created November 30, 2018 23:17
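library(pdftools)
library(purrr)
# Extract the text of the PDF: one character string per page
res = pdftools::pdf_text("YG-Archive-DatingSocialMediaInternal-090818.pdf")
# Split each page into its individual lines
ss = strsplit(res, "\n")
# Name the list elements by page number so map_df() can record them via .id
names(ss) = seq_along(ss)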
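# Build one data frame with a row per line of text; `start` flags the
# header lines that contain "unweighted base" (case-insensitive)
df = map_df(ss, function(x) {
  data.frame(txt = x,
             row = seq_along(x),
             start = grepl("unweighted base", tolower(trimws(x))),
             stringsAsFactors = FALSE)
}, .id = "page")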
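# Keep only the header lines and drop everything up to and including "base"
start = df$txt[df$start]
start = sub(".*base", "", start)
# Split what remains on spaces and discard the empty strings left by
# runs of consecutive spaces
start = strsplit(trimws(start), " ")
start = map(start, function(x) {
  x[x != ""]
})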
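# Number of values on each header line, i.e. the number of table columns
ncols = lengths(start)
df$ncol = NA_integer_
df$ncol[df$start] = ncols
# Carry each header's column count forward to the lines below it;
# lines before the first header stay NA
df$ncol = zoo::na.locf(df$ncol, na.rm = FALSE)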
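
# A minimal sketch of the forward-fill step above, run on made-up data.
# The toy_* vectors are hypothetical and do not come from the PDF; they only
# illustrate how zoo::na.locf() copies each header's column count down to the
# rows that follow it, assuming the zoo package is installed.
toy_start = c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)
toy_ncol = rep(NA_integer_, length(toy_start))
# Pretend the two header rows have 3 and 5 columns
toy_ncol[toy_start] = c(3L, 5L)
toy_ncol = zoo::na.locf(toy_ncol, na.rm = FALSE)
# With na.rm = FALSE the row before the first header is kept and stays NA
print(toy_ncol)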