#' Extract link texts and urls from a web page
#' @param url character, a url
#' @return a data frame of link text and urls
#' @examples
#' \dontrun{
#' scraplinks("http://localhost/")
#' glinks <- scraplinks("http://google.com/")
#' }
#' @importFrom magrittr %>%
#' @export
scraplinks <- function(url){
    # Create an html document from the url
    webpage <- xml2::read_html(url)
    # Extract the URLs
    url_ <- webpage %>%
        rvest::html_nodes("a") %>%
        rvest::html_attr("href")
    # Extract the link text
    link_ <- webpage %>%
        rvest::html_nodes("a") %>%
        rvest::html_text()
    # Namespace tibble() explicitly, like the xml2 and rvest calls above
    return(tibble::tibble(link = link_, url = url_))
}
ln20 should read:
return(data.frame(link = link_, url = url_))
don't you mean line 20?
@ajatoledo dplyr::data_frame() has been replaced with dplyr::tibble(). You have probably seen the message: "data_frame() is deprecated, use tibble()." I replaced this in the function above.
@paulrougieux data.frame() != data_frame(); my suggestion returns a data frame, tibble returns a tibble. Approach is yours, but data.frame() is base R while tibble() is tidyverse.
@ajatoledo data.frame() creates factor variables by default in R versions before 4.0.0, so you should add the argument stringsAsFactors=FALSE to get character variables: data.frame(link = link_, url = url_, stringsAsFactors=FALSE). tibble(link = link_, url = url_) is preferable in this simple example, because it always creates the link and url columns as character variables. In addition, rvest is also part of the tidyverse suite of packages, and of course tibble() is re-exported by the very well known dplyr package.
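To make the difference concrete, here is a minimal sketch comparing the two constructors. The two input vectors are made-up illustrative values, not output from a real page.

```r
# Illustrative stand-ins for the vectors built inside scraplinks()
link_ <- c("Home", "About")
url_  <- c("/", "/about")

# Base R: pass stringsAsFactors explicitly so the result does not
# depend on the R version in use
df <- data.frame(link = link_, url = url_, stringsAsFactors = FALSE)

# tibble: character vectors are kept as character columns
tb <- tibble::tibble(link = link_, url = url_)

class(df$link)  # "character"
class(tb$link)  # "character"
```

Both objects end up with character columns; the only difference that remains is the class of the container itself.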
@paulrougieux I missed the dplyr items for url and link, meaning tibble() is already in the namespace. However, the description says the function outputs a data frame, which could be misleading because the returned object is a tibble. That could be problematic for some users, as a tibble doesn't always play well with other functions that expect data.frame inputs, e.g. write.csv().
As for stringsAsFactors=FALSE in data.frame(): if you wanted to set that, you'd likely want to expose it in the function's own arguments so the user isn't stuck with values hard-coded inside the function.
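A rough sketch of that idea, assuming a hypothetical variant named scraplinks_df() (neither the name nor the extra argument is part of the gist). It returns a base data frame and lets the caller choose stringsAsFactors.

```r
# Hypothetical variant of scraplinks(); the name and the extra argument
# are illustrative only
scraplinks_df <- function(url, stringsAsFactors = FALSE) {
  webpage <- xml2::read_html(url)
  anchors <- rvest::html_nodes(webpage, "a")
  data.frame(
    link = rvest::html_text(anchors),
    url = rvest::html_attr(anchors, "href"),
    stringsAsFactors = stringsAsFactors
  )
}

# read_html() also accepts literal HTML, which allows a quick offline check
page <- '<html><body><a href="/a">First</a><a href="/b">Second</a></body></html>'
links <- scraplinks_df(page)
```

An existing tibble can also be turned into a plain data frame afterwards with as.data.frame(), so this is only one of several ways to get there.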
Though either way, do what you want, each use case is different. Either way, the original function provided me a starting point when I needed it and got me where I needed.
@paulrougieux
Hi Paul,
I want to use this code, but I first need to follow a link (href=database.htm) on the main page of the website, and then extract information from many more links there.
How do I modify this code accordingly?
Thanks.
@goodyonsen you will have better luck asking on Stackoverflow. To get you started, prepare answers to the following questions. What have you tried so far? What is the error message? Make a reproducible example with a public URL.
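As a possible starting point, the sketch below resolves a relative href such as database.htm against the page it came from, then lists the links on that second page. The helper name scrape_linked_page() and the example.com address are placeholders, since the real site was not given.

```r
# Resolve a relative href against the page it was found on (no network needed)
target <- xml2::url_absolute("database.htm", "http://example.com/index.html")
# target is "http://example.com/database.htm"

# Hypothetical helper: read the linked page and list its links,
# with every href converted to an absolute URL
scrape_linked_page <- function(main_url, href) {
  page_url <- xml2::url_absolute(href, main_url)
  page <- xml2::read_html(page_url)
  # Only keep anchors that actually carry an href attribute
  anchors <- rvest::html_nodes(page, "a[href]")
  hrefs <- rvest::html_attr(anchors, "href")
  tibble::tibble(
    link = rvest::html_text(anchors),
    url = xml2::url_absolute(hrefs, page_url)
  )
}

# Usage with a placeholder address (not run here):
# db_links <- scrape_linked_page("http://example.com/index.html", "database.htm")
```

The resulting url column could then be looped over, for example with lapply(), to visit each of the further links in turn.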
Hi Paul,
thanks for your code. I tried to use it to extract only the link text from a standalone hyperlink string, <a href="…">something</a>, but it does not work for me...
Please, could you help me with how to extract the word "something" from that hyperlink?
Thanks a lot !
Viktor
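One way this might be approached: xml2::read_html() accepts literal HTML as well as a URL, so the hyperlink string can be parsed directly. This is a minimal sketch; the href value below is a placeholder, because the original one was not shown.

```r
# Illustrative input; the href value is a placeholder
hyperlink <- '<a href="http://example.com/">something</a>'

# Parse the string itself rather than fetching a page
fragment <- xml2::read_html(hyperlink)
anchor <- rvest::html_nodes(fragment, "a")

link_text <- rvest::html_text(anchor)         # "something"
link_url <- rvest::html_attr(anchor, "href")  # "http://example.com/"
```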