Last active
October 3, 2021 06:29
Extract link texts and urls from a web page into an R data frame
#' Extract link texts and urls from a web page
#' @param url character, the URL of the page to scrape
#' @return a data frame with one row per link and the columns `link` (text) and `url` (href)
#' @examples
#' \dontrun{
#' scraplinks("http://localhost/")
#' glinks <- scraplinks("http://google.com/")
#' }
#' @export
scraplinks <- function(url){
    # Create an html document from the url
    webpage <- xml2::read_html(url)
    # Select all <a> elements once, so texts and urls come from the same nodes
    anchors <- rvest::html_nodes(webpage, "a")
    # Extract the URLs (NA for anchors without an href attribute)
    url_ <- rvest::html_attr(anchors, "href")
    # Extract the link text
    link_ <- rvest::html_text(anchors)
    # tibble is namespaced so the function works without library(tibble)
    return(tibble::tibble(link = link_, url = url_))
}
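The returned data frame holds the raw `href` values, so it can contain missing values (anchors without an `href`), in-page fragments such as `#top`, and repeated links. A minimal sketch of one way to tidy it with base R follows. The `links` data frame is hypothetical and only mimics the shape of the function's output.

```r
# Hypothetical result, shaped like the output of scraplinks()
links <- data.frame(
    link = c("Database", "Top", "No target", "Database"),
    url  = c("database.htm", "#top", NA, "database.htm"),
    stringsAsFactors = FALSE
)

# Keep rows that have an href and do not merely point inside the same page.
# grepl() returns FALSE for NA, so the missing href is handled by is.na().
keep <- !is.na(links$url) & !grepl("^#", links$url)

# Drop the rejected rows, then remove exact duplicates
clean_links <- unique(links[keep, ])
print(clean_links)
```

Whether fragments and duplicates should be dropped depends on the page being scraped, so treat the filter as a starting point.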
@goodyonsen you will have better luck asking on Stack Overflow. To get you started, prepare answers to the following questions. What have you tried so far? What is the error message? Make a reproducible example with a public URL.
@paulrougieux;
Hi Paul,
I want to use this code, but I first need to follow a link
href=database.htm
on the main page of the website, and then extract information from many more links on that page. How do I modify this code accordingly?
Thanks.
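One possible way to approach this, sketched under assumptions: scrape the main page, resolve the relative `href` to an absolute URL with `xml2::url_absolute()`, then scrape the page it points to. The helpers `page_links()` and `follow_link()` are hypothetical names that are not part of the gist, and `http://example.com/` stands in for the real site.

```r
# Hypothetical helper: like scraplinks(), but it drops anchors without an
# href and resolves relative hrefs against `base`.
page_links <- function(url, base = url) {
    webpage <- xml2::read_html(url)
    anchors <- rvest::html_nodes(webpage, "a")
    href <- rvest::html_attr(anchors, "href")
    has_href <- !is.na(href)
    data.frame(
        link = rvest::html_text(anchors)[has_href],
        url = xml2::url_absolute(href[has_href], base),
        stringsAsFactors = FALSE
    )
}

# Hypothetical helper: find the first link on the main page whose url
# matches `pattern`, then return the links found on that target page.
follow_link <- function(main_url, pattern = "database\\.htm$") {
    main_links <- page_links(main_url)
    target <- main_links$url[grepl(pattern, main_links$url)]
    if (length(target) == 0) {
        stop("No link matching '", pattern, "' found on ", main_url)
    }
    page_links(target[1])
}

# Not run: replace the placeholder address with the real main page.
# db_links <- follow_link("http://example.com/")
# Scrape every page linked from database.htm in turn:
# pages <- lapply(db_links$url, page_links)
```

The `lapply()` call returns one data frame of links per page. Extracting anything other than links from those pages would need different selectors, which depend on the site's markup.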