@gadenbuie
Last active May 8, 2020 13:54
Check a page or blog post for valid URLs
library(dplyr)
library(purrr)
library(stringr)
library(rvest)
library(crul)
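# Base URL of the site to check, without a trailing slash
# (file.path() adds the separator when joining paths)
base_url <- "https://deploy-preview-NN--USERNAME.netlify.app"
page_url <- file.path(base_url, "blog/post")
page <- read_html(page_url)

res <-
  page %>%
  # Elements that may point to other resources
  html_nodes("a, script, img, figure") %>%
  html_attrs() %>%
  map_dfr(as.list) %>%
  # any_of() avoids an error when the page has no src or no href attributes
  select(any_of(c("src", "href"))) %>%
  tidyr::pivot_longer(
    everything(),
    names_to = "drop",
    values_to = "href",
    values_drop_na = TRUE
  ) %>%
  # One row per unique link
  count(href) %>%
  filter(!str_detect(href, "^mailto:")) %>%
  mutate(
    url = case_when(
      # Protocol-relative links (//host/path) only need a scheme
      str_detect(href, "^//") ~ paste0("https:", href),
      # Root-relative links resolve against the site root
      str_detect(href, "^/") ~ file.path(base_url, str_remove(href, "^/")),
      # Other non-http links are treated as relative to the page
      !str_detect(href, "^http") ~ file.path(page_url, href),
      TRUE ~ href
    ),
    # Request each URL and record whether crul::ok() reports success
    ok = map_lgl(url, crul::ok)
  )

# Links that crul::ok() reported as not OK
res %>% filter(!ok)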