@iangow
Created January 3, 2024 15:59
Code to scrape data from a PDF
library(dplyr, warn.conflicts = FALSE) # For tibble(), count()
library(tidyr) # For extract()
library(readr) # For read_lines(), read_fwf(), etc.
library(stringr) # For str_c(), str_detect()
library(pdftools) # For pdf_text()
library(lubridate) # For ymd()
library(ggplot2) # For ggplot(), geom_bar()
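
# Location of the 2023 Rookie Camp presentation schedule (PDF)
url <- paste0("https://aaahq.org/portals/0/documents/meetings/2023/RC/",
              "2023%20Rookie%20Camp%20Alphabetical%20Presentation%20Schedule.pdf")

# Number of lines to skip at the start of the extracted text
skip_rows <- 6

col_names <- c("name", "school", "room", "start",
               "end", "method", "interest", "area")

# Extract the PDF text and store it as one line of text per row
camp_data_raw <-
  pdf_text(url) |>
  read_lines(skip = skip_rows) |>
  tibble(temp = _)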
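
# One capture group per entry in col_names; most columns in the PDF text
# are separated by runs of two or more spaces
regex <- str_c("^",
               "(.*?)\\s{2,}",                     # name
               "(.*?)\\s{2,}",                     # school
               "(Cityview [0-9])\\s+",             # room
               "([0-9]{1,2}:[0-9]{2} [AP]M)\\s+",  # start
               "([0-9]{1,2}:[0-9]{2} [AP]M)\\s+",  # end
               "(.*?)\\s{2,}",                     # method
               "(.*?)\\s{2,}",                     # interest
               "(.*)$")                            # area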

# Split each line into columns; lines that do not match the regex
# yield NA in every column
camp_data <-
  camp_data_raw |>
  extract(temp, col_names, regex)

camp_data
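
# A minimal sketch (not part of the original gist) showing how the regex above
# splits a single line. The sample line is made up for illustration and is not
# taken from the PDF; sample_line and sample_parsed are hypothetical names.
# It relies on col_names and regex as defined above.
sample_line <- str_c("Doe, Jane", "Example University",
                     "Cityview 3  9:00 AM  9:30 AM",
                     "Archival", "Financial", "Capital markets",
                     sep = "      ")
sample_parsed <-
  tibble(temp = sample_line) |>
  extract(temp, col_names, regex)
sample_parsed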

# Total number of rows
camp_data |> count()

# Number of rows per school, largest first
camp_data |> count(school, sort = TRUE)

# Bar charts of counts by method, area, and interest
camp_data |>
  ggplot(aes(x = method, fill = method)) +
  geom_bar()

camp_data |>
  ggplot(aes(x = area, fill = area)) +
  geom_bar()

camp_data |>
  ggplot(aes(x = interest, fill = interest)) +
  geom_bar()