Last active October 3, 2021 06:29
Extract link texts and urls from a web page into an R data frame
#' Extract link texts and urls from a web page
#' @param url character an url
#' @return a data frame of link text and urls
#' @examples
#' \dontrun{
#' scraplinks("http://localhost/")
#' glinks <- scraplinks("http://google.com/")
#' }
#' @export
scraplinks <- function(url){
    # Create an html document from the url
    webpage <- xml2::read_html(url)
    # Select all anchor nodes once, so no pipe operator needs to be attached
    anchors <- rvest::html_nodes(webpage, "a")
    # Extract the URLs
    url_ <- rvest::html_attr(anchors, "href")
    # Extract the link text
    link_ <- rvest::html_text(anchors)
    # Namespace the call so the function works without library(tibble)
    return(tibble::tibble(link = link_, url = url_))
}
@paulrougieux;
Hi Paul,
I want to use this code, but I first need to follow a link (href=database.htm) on the main page of the website, and then extract information from many more links on that page.
How do I modify this code accordingly?
Thanks.
@goodyonsen you will have better luck asking on Stack Overflow. To get you started, prepare answers to the following questions: What have you tried so far? What is the error message? Also make a reproducible example with a public URL.
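As a starting point for the question above, here is a minimal sketch of following one link before scraping further. It assumes the target href is exactly database.htm and is relative to the main page. The helper name links_from_doc, the example.com URL, and the inline HTML are placeholders, not part of the gist. The helper takes an already parsed document so the sketch runs without network access.

```r
library(xml2)
library(rvest)

# Hypothetical helper: same idea as scraplinks(), but it takes a parsed
# document instead of a URL so the example can run on inline HTML.
links_from_doc <- function(doc) {
    anchors <- html_nodes(doc, "a")
    data.frame(link = html_text(anchors),
               url = html_attr(anchors, "href"),
               stringsAsFactors = FALSE)
}

# Placeholder main page; with a real site this would be read_html(main_url)
main_url <- "http://example.com/"
main_doc <- read_html(paste0(
    "<html><body>",
    "<a href='about.htm'>About</a>",
    "<a href='database.htm'>Database</a>",
    "</body></html>"))
main_links <- links_from_doc(main_doc)

# Pick the link of interest; %in% treats missing hrefs (NA) as non-matches
db_href <- main_links$url[main_links$url %in% "database.htm"][1]

# Resolve the relative href against the URL of the page it came from
db_url <- url_absolute(db_href, main_url)

# With network access, the second page could then be scraped the same way:
# db_links <- scraplinks(db_url)
```

The same resolve-then-scrape step could be repeated over the links found on the second page, for example with lapply(), though the exact selection logic depends on the site.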
@paulrougieux missed the dplyr items for the url and link items, meaning tibble() is already in the namespace. However, the description says the function outputs a data frame, which could be misleading because the returned object is a tibble. That could be problematic for some users, as a tibble doesn't always play well with other functions that expect data.frame inputs, e.g. write.csv().

As for stringsAsFactors = FALSE in data.frame: if you wanted to set that, you'd likely want to do it in the function's initial arguments, so the user is stuck to the pre-defined arguments in the function. Either way, do what you want; each use case is different. The original function gave me a starting point when I needed it and got me where I needed to go.
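For readers who want a base data.frame rather than a tibble, the following is one possible sketch of the idea discussed above. The function name scraplinks_df and its stringsAsFactors argument are hypothetical additions, not part of the original gist. The example passes literal HTML, which xml2::read_html() accepts in place of a URL, so it runs offline.

```r
# Hypothetical variant of scraplinks() that returns a base data.frame and
# exposes stringsAsFactors as a function argument.
scraplinks_df <- function(url, stringsAsFactors = FALSE) {
    webpage <- xml2::read_html(url)
    anchors <- rvest::html_nodes(webpage, "a")
    data.frame(link = rvest::html_text(anchors),
               url = rvest::html_attr(anchors, "href"),
               stringsAsFactors = stringsAsFactors)
}

# Placeholder page given as literal HTML instead of a URL
page <- "<html><body><a href='a.htm'>A</a><a href='b.htm'>B</a></body></html>"
links <- scraplinks_df(page)

# The result is a plain data.frame with character columns
class(links)
str(links)
```

A tibble returned by the original function can also be converted afterwards with as.data.frame(), which avoids changing the function at all.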