Created December 3, 2014 21:09
This demonstrates web scraping to generate a list of links on a site so that they can all be downloaded. The example used is the Municipality of Burnaby's list of demolition permits.
library("rvest") | |
library("XML") | |
links_list <- html("http://www.burnaby.ca/City-Services/Building/Permits-Issued.html") %>% | |
html_nodes("#ctl15_nestedList a") | |
## Where do links lead? This is encoded in the attribute "href". First get all the attributes:
links_attr <- sapply(links_list, xmlAttrs)

## We have the attributes! Each link just became a single named character
## vector. Each element in the vector gives the attribute's value; the name of
## each element is the attribute name. We must pull out the attribute named "href":
link_address <- sapply(links_attr, function(x) x[["href"]])

## or, more succinctly, if you are into that sort of thing:
link_address <- sapply(links_attr, `[[`, i = "href")
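
## (Aside, not in the original gist: rvest's own html_attr() should do the same
## extraction in one call, without dipping into the XML package; in current
## rvest the scraping step itself would use read_html() and html_elements()
## rather than html() and html_nodes(). Hedged, since behaviour shifts a little
## between rvest versions.)
link_address_alt <- html_attr(links_list, "href")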
## Final step of cleaning: the "header" links (the items you click to make data
## for a whole month appear) have a value of "#". We need to drop 'em:
demolition_permits <- Filter(function(x) x != "#", link_address)
## Note that there are many ways to do this, e.g. `which`, `subset` and `stringr::str_detect`
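
## (Aside, not in the original gist: the stringr route mentioned above, sketched
## out for comparison. The regex "^#$" matches links that are exactly "#", so
## negating it keeps only the real permit links; the result should match the
## Filter() call above.)
library("stringr")
demolition_permits_alt <- link_address[!str_detect(link_address, "^#$")]
identical(demolition_permits, demolition_permits_alt)  # should be TRUE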

# All these are relative links, all with the same base url:
baseurl <- "http://www.burnaby.ca"
demolition_pdfs <- sapply(demolition_permits, function(x) paste0(baseurl, x))
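
## (Aside, not in the original gist: paste0() is vectorised, so the sapply()
## wrapper isn't strictly needed; the one-liner below gives the same addresses,
## just without the names that sapply() attaches. The xml2 package also has
## url_absolute() for resolving relative links against a base URL, if you want
## something less hand-rolled. Both are assumptions about equivalents, not part
## of the original.)
demolition_pdfs_flat <- paste0(baseurl, demolition_permits)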

## now you download em all!
get_pdf <- function(url){
  download.file(url, destfile = basename(url))
}
sapply(demolition_pdfs, get_pdf)
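
## (Aside, not in the original gist: two hedged caveats about this step. On
## Windows, download.file() defaults to a text-mode transfer, which can corrupt
## binary files such as PDFs, so mode = "wb" is safer. And for the odd link
## below that has no filename in it, basename() yields
## "AssetFactory.aspx?did=13300", which may not be a legal filename everywhere.)
get_pdf_binary <- function(url){
  download.file(url, destfile = basename(url), mode = "wb")
}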

## This has the advantage of catching weirdo link addresses, see for example
demolition_pdfs[[115]]
demolition_pdfs[202]
## these don't match the usual pattern
I was surprised too! One of them has a random dollar sign and exclamation point in it:
http://www.burnaby.ca/Assets/city+services/building/Permits+Issued/07+-+June+2014/June+13$!2c+2014.pdf
Another doesn't even have "pdf" in the URL! But it definitely leads to a PDF:
http://www.burnaby.ca/AssetFactory.aspx?did=13300
That. Is. Just. Not. Right.
And yet nothing surprises me anymore. Nothing.
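If you want to flag these oddballs before downloading, something like this ought to work (a hedged sketch, assuming the "usual pattern" just means an address ending in ".pdf" with no strange characters):

weird <- demolition_pdfs[!grepl("\\.pdf$", demolition_pdfs, ignore.case = TRUE) |
                           grepl("[$!]", demolition_pdfs)]
weird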
Oh cool! I'm not going to run this now. Are you saying their own PDF links don't follow the pattern I teased out and assumed?