@infinex
Last active March 13, 2016 13:11
Extracting a word cloud in R using tm/rvest/wordcloud
library(rvest)
library(tm)
library(wordcloud)
setwd(paste0(getwd(),'/hwz'))
qoo10<-"http://forums.hardwarezone.com.sg/hardware-clinic-2/qoo10-deals-strictly-no-referral-link-part-2-a-5288313.html"
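# Find the last page number: each pagination href ends in "-<page>.html",
# which is the 13th piece when this thread's URL is split on "-"
last_page <- qoo10 %>% read_html() %>% html_nodes(".pagination a") %>% html_attr("href") %>%
  na.omit() %>% strsplit("-") %>% vapply(function(p) p[13], character(1)) %>%
  gsub(".html", "", ., fixed = TRUE) %>% as.numeric() %>% max(na.rm = TRUE)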
# Build one URL per page of the thread
urls <- paste0("http://forums.hardwarezone.com.sg/hardware-clinic-2/qoo10-deals-strictly-no-referral-link-part-2-a-5288313-", seq_len(last_page), ".html")
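# Return the text of every post body on one page of the thread
parse_posts <- function(url) {
  url %>% read_html() %>% html_nodes(xpath = "//*[starts-with(@id,'post_message')]/text()") %>% html_text()
}
# Scrape every page of the thread
text <- sapply(urls, parse_posts)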
# Cache the scraped text so the steps below can be re-run without scraping again
save(text,file="hwz_qoo10.Rda")
load("hwz_qoo10.Rda")
result<-unlist(text)
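# Build a corpus with one document per scraped text fragment, then normalise it
myCorpus = Corpus(VectorSource(result))
myCorpus = tm_map(myCorpus, content_transformer(tolower))
myCorpus = tm_map(myCorpus, removePunctuation)
#myCorpus = tm_map(myCorpus, removeNumbers)
# removePunctuation turns "i'm" and "don't" into "im" and "dont", so drop those as well
myCorpus = tm_map(myCorpus, removeWords, c(stopwords("SMART"),'im','dont'))
# keep terms of any length (the default minimum is 3 characters)
tdm = TermDocumentMatrix(myCorpus, control = list(wordLengths=c(1,Inf)))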
m = as.matrix(tdm)
# get word counts in decreasing order
word_freqs = sort(rowSums(m), decreasing = TRUE)
# create a data frame with words and their frequencies
dm = data.frame(word = names(word_freqs), freq = word_freqs)
png("word_hwz_qoo10.png", width=1280,height=800,res=200)
# plot only words that occur at least 20 times, most frequent in the centre
wordcloud(dm$word, dm$freq, min.freq = 20, random.order = FALSE, colors = brewer.pal(8, "Dark2"))
dev.off()