@christophergandrud
Last active September 27, 2015 14:28
Simple Web Crawler for Text
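# Load the RCurl package
library(RCurl)
# Read the links to crawl from a .csv file: one URL per row in the first column, below a header row
addresses <- read.csv("~/links.csv", stringsAsFactors = FALSE)
urls <- addresses[[1]]
# Download the raw HTML of every page; getURL() accepts a vector of URLs
full.text <- getURL(urls)
# Remove HTML tags
text.sub <- gsub("<.+?>", "", full.text)
# Write each page's text to its own numbered .txt file
outpath <- "~/text.indv" # Note: the ~ paths were written for Mac OS
dir.create(outpath, showWarnings = FALSE) # Create the output folder if it does not exist yet
for (i in seq_along(text.sub)) {
  write(text.sub[i], file = file.path(outpath, paste0(i, ".txt")))
}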